## Central Limit Theorem

### Law of large numbers (LLN)
* As the sample size grows, the sample mean approaches the population mean. Collecting more data therefore gives a more representative sample.
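The statement above can be sketched with a running mean of die rolls. This is a minimal illustration, not part of the original notebook; the seed, the number of rolls, and the names `rng`, `rolls` and `running_mean` are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Roll a fair six-sided die 100,000 times (upper bound 7 is exclusive).
rolls = rng.integers(1, 7, size=100_000)

# Running mean after each roll: cumulative sum divided by the roll count so far.
running_mean = np.cumsum(rolls) / np.arange(1, rolls.size + 1)

# The running mean should drift toward the population mean of 3.5.
for k in (10, 1_000, 100_000):
    print(k, "rolls -> mean", running_mean[k - 1])
```

The early values can be far from 3.5, while the value after all rolls should sit very close to it.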

### Central Limit Theorem (CLT)
* It states that the distribution of sample means (computed from repeated sampling) tends toward a normal distribution as the sample size increases, regardless of the shape of the population distribution, provided the population has finite variance.
* So, if you randomly draw a sample of your customers, say 1000 customers, the sample itself might not be normally distributed. But if you repeat the experiment, say 100 times, the 100 means of those 100 samples will be approximately normally distributed.
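One way to see this is to draw from a clearly non-normal population and compare the skewness of the raw data with the skewness of the sample means. The sketch below assumes an exponential population, 2000 repetitions, and a hand-written `skewness` helper; none of these come from the original notebook.

```python
import numpy as np

def skewness(x):
    """Sample skewness: third standardized moment (0 for a symmetric distribution)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return np.mean(centered ** 3) / np.std(x) ** 3

rng = np.random.default_rng(1)  # arbitrary seed, for reproducibility only

# 2000 repeated samples, each of 1000 draws from a right-skewed exponential population.
samples = rng.exponential(scale=1.0, size=(2000, 1000))
sample_means = samples.mean(axis=1)  # one mean per sample

print("skewness of raw data:    ", skewness(samples.ravel()))
print("skewness of sample means:", skewness(sample_means))
```

The raw data should stay strongly skewed, while the sample means should be close to symmetric, as a normal distribution would be.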

### How are these two related?
* The CLT states that as the sample size tends to infinity, the distribution of the sample means approaches a bell shape (normal distribution). The center of this distribution gets very close to the population mean, which is essentially the law of large numbers.

### Expectations

* Data should be randomly selected from the same population, and samples should be independent of each other. More formally, these two expectations are referred to as "independent and identically distributed", or iid.
* The Gaussian approximation becomes more accurate as the size of the samples drawn from the population increases. This means that if we use our knowledge of the Gaussian distribution to make inferences about the means of samples drawn from a population, those inferences become more reliable as we increase the sample size.

### Conclusions
* The sampling distribution of means will be normal or close to normal.
* The mean of the sampling distribution will be equal to the population mean.
* The standard error of the sampling distribution is directly linked to the standard deviation of the original population: the standard deviation of the sample means equals the population standard deviation divided by the square root of the sample size. So the standard deviation of the sample means changes with the sample size; if we increase the size of each sample drawn from the population, the standard deviation of the sample means decreases.
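The last conclusion can be checked numerically. The sketch below compares the simulated standard deviation of die-roll sample means with the population standard deviation divided by the square root of the sample size; the sample sizes, the 20,000 repetitions, and the `results` dictionary are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed, for reproducibility only

faces = np.arange(1, 7)
sigma = faces.std()  # population standard deviation of a fair die, sqrt(35/12)

results = {}
for n in (10, 40, 160):
    # 20,000 samples of n rolls each; one mean per sample.
    means = rng.integers(1, 7, size=(20_000, n)).mean(axis=1)
    results[n] = (means.std(), sigma / np.sqrt(n))
    print(f"n={n}: simulated={results[n][0]:.4f}  sigma/sqrt(n)={results[n][1]:.4f}")
```

Quadrupling the sample size should roughly halve the standard deviation of the sample means.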

## CLT Simulation

### 0. Import necessary libraries

In [13]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from wand.image import Image
from wand.display import display
%matplotlib notebook

### 1. Expected value

In [2]:
E=1/6*(1+2+3+4+5+6)
print("Expected value=",E)

Expected value= 3.5
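The same expected value, together with the variance of a single roll, can be computed from the probability table. This is a small sketch; `faces`, `probs`, `expected` and `variance` are names introduced here for illustration.

```python
import numpy as np

faces = np.arange(1, 7)      # outcomes 1..6
probs = np.full(6, 1 / 6)    # a fair die: each face has probability 1/6

# E[X] = sum of x * P(x); Var[X] = sum of P(x) * (x - E[X])^2
expected = np.sum(faces * probs)
variance = np.sum(probs * (faces - expected) ** 2)

print("Expected value =", expected)
print("Variance =", variance, "Standard deviation =", np.sqrt(variance))
```

The variance works out to 35/12, so the standard deviation of a single roll is about 1.71.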


### 2. Create 1000 simulations of a (1) fixed and a (2) growing sample of die rolls

In [3]:
# 1000 simulations of die roll
n = 1000

* The distribution of sample means, calculated from repeated sampling, tends toward normality as the sample size gets larger.
* In the second type of simulation (growing sample size), each simulation has one more roll than the previous one.

#### 2.1 Fixed sample size

In [4]:
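# Each simulation averages 10 rolls of a fair die (randint's upper bound 7 is exclusive).
# range(2, n) mirrors section 2.2, so this yields n - 2 = 998 sample means.
avg = []
for i in range(2,n):
    avg.append(np.average(np.random.randint(1,7,10)))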

In [5]:
# Mean of the sample means (the slice skips the first two entries)
y=np.average(avg[2:n])
print(y,"tends to",E)

3.505120481927711 tends to 3.5


#### 2.2 Growing sample size

* The CLT states that as the sample size tends to infinity, the distribution of the sample means approaches a bell shape (normal distribution). The center of this distribution gets very close to the population mean, which is essentially the law of large numbers.

In [6]:
# Simulation i averages i rolls, so the sample size grows from 2 to n - 1
avg1 = []
for i in range(2,n):
    avg1.append(np.average(np.random.randint(1,7,i)))

In [7]:
# Mean of the sample means (the slice skips the first two entries)
y1=np.average(avg1[2:n])
print(y1,"tends to",E)

3.5055575394124565 tends to 3.5
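Because the sample size differs from one simulation to the next in the growing scheme, the sample means do not all share the same spread: early means (few rolls) vary much more than late ones. The sketch below re-creates the growing-sample loop above with a seeded generator and compares the two ends; the seed and the names `sizes`, `means`, `early` and `late` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed, for reproducibility only
n = 1000

# Simulation i averages i rolls, for i = 2 .. n - 1.
sizes = np.arange(2, n)
means = np.array([rng.integers(1, 7, size=i).mean() for i in sizes])

early = means[:100]   # sample sizes 2..101
late = means[-100:]   # sample sizes 900..999

print("std of early means:", early.std())
print("std of late means: ", late.std())
```

A histogram of all these means therefore mixes distributions with different widths, which is worth keeping in mind when comparing it with the fixed-sample-size histogram.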


### 3. Function to plot histogram and animation

#### 3.1 Fixed sample size

In [15]:
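def clt(current):
    # clear the previous frame before redrawing
    plt.cla()
    # if animation is at the last frame, stop it
    if current == 1000:
        a.event_source.stop()

    # histogram of the first `current` sample means (fixed sample size)
    plt.hist(avg[0:current])

    plt.gca().set_title('Expected value of die rolls')
    plt.gca().set_xlabel('Average from die roll')
    plt.gca().set_ylabel('Frequency')

    # `current` is the number of sample means plotted so far
    plt.annotate('Simulations = {}'.format(current), [3,27], color="red")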

In [16]:
fig = plt.figure()
a = animation.FuncAnimation(fig, clt, interval=10)


#### 3.2 Growing sample size

In [17]:
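def clt1(current):
    # clear the previous frame before redrawing
    plt.cla()
    # if animation is at the last frame, stop it
    # (a1 is this figure's animation; the original stopped `a` by mistake)
    if current == 1000:
        a1.event_source.stop()

    # histogram of the first `current` sample means (growing sample size)
    plt.hist(avg1[0:current])

    plt.gca().set_title('Expected value of die rolls')
    plt.gca().set_xlabel('Average from die roll')
    plt.gca().set_ylabel('Frequency')

    # `current` is the number of sample means plotted so far
    plt.annotate('Simulations = {}'.format(current), [3,27], color="red")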

In [18]:
fig1 = plt.figure()
a1 = animation.FuncAnimation(fig1, clt1, interval=10)
