# DRILL - Exploring the Central Limit Theorem

Using your own Jupyter notebook, or a copy of the notebook from the previous assignment, reproduce the pop1 and pop2 populations and samples. Specifically, create two binomially distributed populations with n equal to 10 and size equal to 10000. The p-value of pop1 should be 0.2 and the p-value of pop2 should be 0.5. Using a sample size of 100, calculate the means and standard deviations of your samples.

For each of the following tasks, first write what you expect will happen, then code the changes and observe what does happen. Discuss the results with your mentor.

1) Increase the size of your samples from 100 to 1000, then calculate the means and standard deviations for your new samples and create histograms for each. Repeat this again, decreasing the size of your samples to 20. What values change, and what remain the same?

2) Change the probability value (p in the NumPy documentation) for pop1 to 0.3, then take new samples and compute the t-statistic and p-value. Then change the probability value p for group 1 to 0.4, and do it again. What changes, and why?

3) Change the distribution of your populations from binomial to a distribution of your choice. Do the sample mean values still accurately represent the population values?

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
%matplotlib inline

In [2]:
pop1 = np.random.binomial(10, 0.2, 10000)
pop2 = np.random.binomial(10, 0.5, 10000)

In [14]:
initial = [pop1.mean(), pop1.std(), pop2.mean(), pop2.std()]
initial

[1.9957, 1.2602704114593821, 4.9897, 1.5757518554645589]

In [3]:
sample1 = np.random.choice(pop1, 100, replace=True)
sample2 = np.random.choice(pop2, 100, replace=True)

In [5]:
summary1 = (sample1.mean(), sample1.std())
summary2 = (sample2.mean(), sample2.std())
summary1, summary2

((1.78, 1.1625833303466895), (5.0199999999999996, 1.8707217858356169))

# Question 1

Increase the size of your samples from 100 to 1000, then calculate the means and standard deviations for your new samples and create histograms for each. Repeat this again, decreasing the size of your samples to 20. What values change, and what remain the same?

In [7]:
sample1_1000 = np.random.choice(pop1, 1000, replace=True)
sample2_1000 = np.random.choice(pop2, 1000, replace=True)

In [9]:
summary1_1000 = (sample1_1000.mean(), sample1_1000.std())
summary2_1000 = (sample2_1000.mean(), sample2_1000.std())
(summary1_1000, summary2_1000)

((2.0649999999999999, 1.3010668699186834),
 (5.0039999999999996, 1.5830300060327345))

Increasing the sample size has decreased the standard deviation of the sample as the bigger sample means that outliers have less impact on the distribution. The means have had some small changes - the higher sample size has made the means more accurate (i.e. closer to the mean of the populations).

In [17]:
sample1_20 = np.random.choice(pop1, 20, replace=True)
sample2_20 = np.random.choice(pop2, 20, replace=True)

In [19]:
summary1_20 = (sample1_20.mean(), sample1_20.std())
summary2_20 = (sample2_20.mean(), sample2_20.std())
(summary1_20, summary2_20)

((2.7000000000000002, 1.3820274961085255),
 (5.3499999999999996, 1.7684739183827394))

The (substantially) smaller sample sizes has led to the new means being  less accruate and the standard deviations are now wider. The smaller sample size therefore leads to less accurate results.

## Question 2

Change the probability value (p in the NumPy documentation) for pop1 to 0.3, then take new samples and compute the t-statistic and p-value. Then change the probability value p for group 1 to 0.4, and do it again. What changes, and why?

In [22]:
pop1_1 = np.random.binomial(10, 0.3, 10000)
sample1_1 = np.random.choice(pop1_1, 100, replace=True)

In [23]:
# calculating t stat between sample1 and sample1_1 (both have sample size 100)
diffmean = sample1.mean() - sample1_1.mean()
se = (sample1.std()**2/len(sample1) + sample1_1.std()**2/len(sample1_1))**0.5
tstat = diffmean/se
tstat

-6.474894434077382

The t-stat is large and therefore we can reject the null hypothesis that the samples are drawn from the same distribution.

In [26]:
pop1_2 = np.random.binomial(10, 0.4, 10000)
sample1_2 = np.random.choice(pop1_1, 100, replace=True)

In [27]:
# calculating t stat between sample1 and sample1_2 (both have sample size 100)
diffmean2 = sample1.mean() - sample1_2.mean()
se2 = (sample1.std()**2/len(sample1) + sample1_2.std()**2/len(sample1_2))**0.5
tstat2 = diffmean2/se2
tstat2

-7.0779593193008079

The t-stat has become larger which is to be expected since the populations which are being compared (pop1 and pop1_2 have greater differences. Note that a bigger t-stat compared to the earlier calculation will not always be the case, it still depends on the sample drawn. Both, however, should be consistently large.

## Question 3

Change the distribution of your populations from binomial to a distribution of your choice. Do the sample mean values still accurately represent the population values?

In [29]:
fpop = np.random.f(1, 5, 10000)
fpopmean = fpop.mean()
fpopstd = fpop.std()
fsample = np.random.choice(fpop, 100, replace=True)
fsamplemean = fsample.mean()
fsamplestd = fsample.std()

In [31]:
(fpopmean, fsamplemean)

(1.6390056835814248, 1.2422824442857381)

In [32]:
(fpopstd, fsamplestd)

(3.7960583374278425, 1.6897080819569739)

If the sample size is large enough then according to the CLT the mean of the samples will converge to the mean of the population values.