<h1></h1>

<h1>Exploration of the Central Limit Theorem</h1>
<p>In this example we will first generate two large populations and from them randomly select a group of measurements to be our samples.</p>

In [48]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import ttest_ind
%matplotlib inline

#Generate populations
pop1 = np.random.binomial(10, 0.2, 10000)
pop2 = np.random.binomial(10, 0.5, 10000)
print('Population 1 mean: ', pop1.mean(), '\nPopulation 1 STD: ', pop2.std(),
     '\nPopulation 2 mean: ', pop2.mean(), '\nPopulation 2 STD: ', pop2.std())

print('\n')

#Generate samples
samp1 = np.random.choice(pop1, 100, replace = True)
samp2 = np.random.choice(pop2, 100, replace = True)

def find_mean_std(samp1, samp2):
    mean1 = samp1.mean()
    mean2 = samp2.mean()
    std1 = samp1.std()
    std2 = samp2.std()
    return [mean1, std1, mean2, std2]

stats1 = find_mean_std(samp1, samp2)

def print_stats(stats):
    print('Mean of Sample 1: ', stats[0], '\nSTD of Sample 1: ', stats[1], 
          '\nMean of Sample 2: ', stats[2], '\nSTD of Sample 2: ', stats[3])

print_stats(stats1)   

Population 1 mean:  1.9876 
Population 1 STD:  1.58018778631 
Population 2 mean:  5.0284 
Population 2 STD:  1.58018778631


Mean of Sample 1:  2.07 
STD of Sample 1:  1.38747972958 
Mean of Sample 2:  4.93 
STD of Sample 2:  1.55727325797


<p>As you can see, we have the mean and standard deviation of both of our samples. We are interested in seeing how the mean and standard deviation change when we increase our sample size.</p>

In [49]:
samp3 = np.random.choice(pop1, 1000, replace = True)
samp4 = np.random.choice(pop2, 1000, replace = True)

stats2 = find_mean_std(samp3, samp4)
print_stats(stats2)

Mean of Sample 1:  1.986 
STD of Sample 1:  1.22711205682 
Mean of Sample 2:  5.112 
STD of Sample 2:  1.6073132862


<p>With a larger sample size, we would expect the mean and standard deviation to more accurately reflect the mean and standard deviation of our population, and as we can see by the demonstration, this appears to be the case.</p>

<p>We shall now see what happens when we change the probability values for our binomail distributions. In particular, how they change the t-values and p-values of our samples.</p>

In [50]:
pop1 = np.random.binomial(10, 0.3, 10000)

samp1 = np.random.choice(pop1, 1000, replace = True)
samp2 = np.random.choice(pop2, 1000, replace = True)

print(ttest_ind(samp2, samp1, equal_var = False))

Ttest_indResult(statistic=30.740276217957586, pvalue=4.4200431600929489e-170)


In [51]:
pop2 = np.random.binomial(10, 0.4, 10000)

samp1 = np.random.choice(pop1, 1000, replace = True)
samp2 = np.random.choice(pop2, 1000, replace = True)

print(ttest_ind(samp2, samp1, equal_var = False))

Ttest_indResult(statistic=14.587869231730021, pvalue=7.0339364818910911e-46)


<p>As we can see, when we change the properties of the population distributions to be marginally more similar, the change is reflected exponentially in the p-values of their samples. Note how the p-value for the second set of samples is dozens of orders of magnitudes larger than the p-value for the first set of samples.</p>

<p>We shall now make observations on a different distribution.</p>

In [52]:
#Gumbel distribution is built differently (loc, scale, size)
pop1 = np.random.gumbel(10, 100, 10000)
pop2 = np.random.gumbel(10, 100, 10000)

print('Population stats: ')
print('Population 1 mean: ', pop1.mean(), '\nPopulation 1 STD: ', pop2.std(),
     '\nPopulation 2 mean: ', pop2.mean(), '\nPopulation 2 STD: ', pop2.std())
print('\n')

#Small Sample Size
samp1 = np.random.choice(pop1, 100, replace = True)
samp2 = np.random.choice(pop2, 100, replace = True)

print('Small Sample Size: ')
stats1 = find_mean_std(samp1, samp2)
print_stats(stats1)
print('\n')

#Large Sample Size
samp3 = np.random.choice(pop1, 1000, replace = True)
samp4 = np.random.choice(pop2, 1000, replace = True)

print('Large Sample Size: ')
stats2 = find_mean_std(samp3, samp4)
print_stats(stats2)




Population stats: 
Population 1 mean:  66.3667995385 
Population 1 STD:  126.441365497 
Population 2 mean:  66.2364186906 
Population 2 STD:  126.441365497


Small Sample Size: 
Mean of Sample 1:  55.6352760041 
STD of Sample 1:  122.52811921 
Mean of Sample 2:  65.1470438348 
STD of Sample 2:  119.944841151


Large Sample Size: 
Mean of Sample 1:  69.6410158674 
STD of Sample 1:  128.038406723 
Mean of Sample 2:  65.3023133865 
STD of Sample 2:  123.53205235


<p>As we can see, the sample mean and std accurately reflect the population mean and standard deviation regardless of the distribution used. And the accuracy of these representations are proportional to the sample sizes.</p>