# The Central Limit Theorem
In this exercise we will explore and demonstrate the Central Limit Theorem, a very important idea that arises in much of science and data analysis. The theorem states that the average of a large number of independent random draws tends toward a Gaussian, or "normal," probability distribution, regardless of the probability distribution that the draws came from, provided that distribution has a finite mean and variance.

A Gaussian probability distribution (or density) function (PDF) is given by 
$$ p(x) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{(x-x_0)^2}{2\sigma^2}\right] $$
where $x_0$ is the mean of the distribution, and $\sigma$ is the standard deviation of the distribution.

Let's first take a look at a Gaussian function.

In [None]:
# Load in useful packages and make plots display inline
import numpy as np
import matplotlib.pyplot as plt
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

To make life easy, let's first make a function that creates a Gaussian curve

In [None]:
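def gauss(sig=1,x0=0):
    # x values spanning +/- 10 standard deviations around the mean x0
    x = np.linspace(x0-10*sig,x0+10*sig,1000)
    # Normalized Gaussian PDF evaluated at each x (area under the curve is 1)
    y = 1.0/(np.sqrt(2*np.pi)*sig)*np.exp(-(x-x0)**2/(2*sig**2))
    return x,y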

In [None]:
x,y = gauss(sig=5,x0=12)
plt.plot(x,y)
plt.title('Gaussian Curve')

Go ahead and change the mean and standard deviation and see how the curve changes...
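
Whatever mean and standard deviation you pick, two things should hold: the area under the curve stays at 1 (it is a probability density), and the peak height is $1/(\sqrt{2\pi}\sigma)$. The cell below is a minimal sketch of that check. It redefines `gauss` so that it runs on its own, and it approximates the integral with a simple Riemann sum, which is a choice made here for illustration.

```python
import numpy as np

def gauss(sig=1, x0=0):
    # Same curve as above: 1000 points spanning +/- 10 sigma around x0
    x = np.linspace(x0 - 10*sig, x0 + 10*sig, 1000)
    y = 1.0/(np.sqrt(2*np.pi)*sig)*np.exp(-(x - x0)**2/(2*sig**2))
    return x, y

x, y = gauss(sig=5, x0=12)

# Riemann-sum approximation of the area under the curve
dx = x[1] - x[0]
area = np.sum(y)*dx

# Peak of the sampled curve vs. the analytic peak height
peak = y.max()
expected_peak = 1.0/(np.sqrt(2*np.pi)*5)

print(area, peak, expected_peak)
```

The sampled peak falls very slightly below the analytic value because the grid has an even number of points, so none of them sits exactly on the mean.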

Now, what does it mean to say that we "draw" numbers from a normal distribution? It means that numbers are generated randomly such that the probability of a value landing in a small interval around $x$ is proportional to the value of the Gaussian PDF at $x$, for a given mean and standard deviation. Luckily, NumPy has a very nice function that will do this for us.

In [None]:
# This draws 100 samples from a normal distribution with a mean of "0" and a standard deviation of "1"
rand = np.random.normal(0,1,100)
print(rand)

Hmm. Ok, it's just a bunch of values. But how often does each value occur?

In [None]:
hist = plt.hist(np.random.normal(0,1,100),bins=20,density=True)

Interesting! It (sort of) looks like a normal distribution. Why don't you try upping the number of samples drawn to 1000, or more! You can also increase the number of bins in your histogram. 

Next, let's overplot a Gaussian curve with the same mean and standard deviation.

In [None]:
hist = plt.hist(np.random.normal(0,1,10000),bins=50,density=True,edgecolor='none')
x,y = gauss(sig=1,x0=0)
plt.plot(x,y,'r--')

Wow! It follows the curve very well.
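
One way to put a number on "follows the curve" is to count how many draws fall within one and two standard deviations of the mean. For a Gaussian these fractions are about 68.3% and 95.4%. The sketch below uses a seeded `np.random.default_rng` generator rather than `np.random.normal`, purely so the result is reproducible; that choice is an assumption of this example.

```python
import numpy as np

# Seeded generator so the numbers are reproducible (an assumption of this sketch)
rng = np.random.default_rng(0)
samples = rng.normal(0, 1, 100000)

# Fraction of draws within 1 and 2 standard deviations of the mean
frac_1sig = np.mean(np.abs(samples) < 1)
frac_2sig = np.mean(np.abs(samples) < 2)

print(frac_1sig, frac_2sig)
```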

Ok. Now let's look at what it looks like to draw a bunch of samples from a different distribution. We'll use a lognormal form. 

In [None]:
# This line "draws" 100000 samples
dist = np.random.lognormal(0,1,100000)
plt.hist(dist[dist<10],bins=100,density=True)
# This is the analytic mean of a lognormal(0,1) distribution: exp(0 + 1**2/2) = sqrt(e)
plt.axvline(x=np.sqrt(np.e),linestyle='-',color='red')
# This is the sample mean of our draws (all of them, not just those below 10)
plt.axvline(x=np.mean(dist),linestyle='--',color='green')

print(np.sqrt(np.e),np.mean(dist))

Ok. So, there is a well-defined mean. But this sure looks different from a normal distribution.

Now on to the Central Limit Theorem (the point of this whole exercise!). The CLT states that if we average many independent sets of draws from a lognormal distribution, element by element, the resulting averages should be approximately Gaussian!
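
Stated a little more precisely (the symbols $\mu$, $\sigma_X$, and $n$ are introduced here for illustration): if $X_1, \dots, X_n$ are independent draws from a distribution with mean $\mu$ and finite standard deviation $\sigma_X$, then their average $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ satisfies
$$ \sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma_X} \;\xrightarrow{d}\; \mathcal{N}(0,1) \quad \text{as } n \to \infty $$
In other words, for large $n$ the average is approximately Gaussian with $x_0 = \mu$ and $\sigma = \sigma_X/\sqrt{n}$ in the notation of the PDF above. For the lognormal distribution used here, whose underlying normal has mean 0 and standard deviation 1,
$$ \mu = e^{1/2} \approx 1.649, \qquad \sigma_X^2 = (e-1)\,e \approx 4.671 $$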

Let's start by averaging 3 lognormal distributions and see if it looks Gaussian...

In [None]:
size = 10000
dist1 = np.random.lognormal(0,1,size)
dist2 = np.random.lognormal(0,1,size)
dist3 = np.random.lognormal(0,1,size)
dist = (dist1+dist2+dist3)/3.0
plt.figure(1)
hist = plt.hist(dist1,bins=300,density=True,edgecolor='none',alpha=0.5,color='red')
plt.xlim(0,10)
plt.title('Lognormal Distribution #1')
plt.figure(2)
hist = plt.hist(dist2,bins=300,density=True,edgecolor='none',alpha=0.5,color='blue')
plt.xlim(0,10)
plt.title('Lognormal Distribution #2')

plt.figure(3)
hist = plt.hist(dist3,bins=300,density=True,edgecolor='none',alpha=0.5,color='green')
plt.xlim(0,10)
plt.title('Lognormal Distribution #3')

# Plot the element-wise average of the three sets of draws in its own figure
plt.figure(4)
hist = plt.hist(dist,bins=300,density=True,edgecolor='none',alpha=0.5,color='black')
plt.xlim(0,10)
plt.title('Averaged Distribution')



Hmm. Not really. Well, maybe we need to average a LOT of distributions.

In [None]:
# The size of each distribution
size = 10000

# The number of distributions that we will average
ndist = 1000

# Create an array of zeros and then accumulate the values from each draw.
dist =  np.zeros(size)
for i in range(ndist):
    dist += np.random.lognormal(0,1,size)

# Now divide by the number of distributions to find the average values
# (np.float was removed in NumPy 1.24; plain division already gives floats)
dist /= ndist

# Plot the resultant distribution
hist = plt.hist(dist,bins=100,density=True,edgecolor='none')

Holy moly! It sure looks Gaussian. But is it really?

In [None]:
hist = plt.hist(dist,bins=100,density=True,edgecolor='none')
x,y = gauss(x0=np.mean(dist),sig=np.std(dist))
plt.plot(x,y,'r--')
#xlim = plt.xlim(1.56,1.74)
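
The overplotted curve above uses the mean and standard deviation measured from the averaged values themselves. A stricter check is to compare those measured values with what the CLT predicts from the parent distribution alone: a mean of $\sqrt{e}$ and a standard deviation of $\sqrt{(e-1)e}/\sqrt{n}$, where $n$ is the number of distributions averaged. The cell below is a rough sketch of that comparison. It regenerates the averaged values with a seeded `np.random.default_rng` generator (an assumption made for reproducibility) so that it runs on its own, and the variable names are illustrative.

```python
import numpy as np

# Seeded generator so the comparison is reproducible (an assumption of this sketch)
rng = np.random.default_rng(1)

size = 10000    # number of averaged values
ndist = 1000    # number of lognormal draws that go into each average

# Accumulate the draws, then divide to get the element-wise average
avg = np.zeros(size)
for _ in range(ndist):
    avg += rng.lognormal(0, 1, size)
avg /= ndist

# Analytic mean and standard deviation of a single lognormal(0,1) draw
mu = np.exp(0.5)
sigma_single = np.sqrt((np.e - 1)*np.e)

# CLT prediction for the spread of the averages
predicted_std = sigma_single/np.sqrt(ndist)

print('mean:', np.mean(avg), 'predicted:', mu)
print('std: ', np.std(avg), 'predicted:', predicted_std)
```

If the two pairs of numbers agree to within a few percent, the Gaussian that the histogram resembles is the one the theorem predicts, not just any bell-shaped curve.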

## Homework
Demonstrate the CLT with a different distribution and save it in a Python script or module. You may copy and edit code from this notebook.

How might you go about "proving" that the distribution is consistent with a Gaussian?