### Sampling Distributions Introduction

To get more comfortable with the idea of sampling distributions, let's do some practice in Python.

The array below represents the students we saw in the previous videos: 1 marks a student who drinks coffee, and 0 marks a student who does not.

In [12]:
import numpy as np
import matplotlib.pyplot as plot

np.random.seed(42)
students = np.array([1,0,1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,1,1,0])

`1.` Find the proportion of students who drink coffee in the above array. Store this value in a variable **p**.

In [13]:
tot_stu = students.shape[0]
tot_stu

21

In [14]:
coffee_yes = students[students==1]
coffee_yes

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [15]:
p = coffee_yes.shape[0] / tot_stu # or students.mean()
p

0.7142857142857143
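The `students.mean()` shortcut mentioned in the comment works because the data are coded as 0 and 1, so the sum of the array is the number of coffee drinkers. A minimal sketch of that equivalence follows; the names `p_count` and `p_mean` are introduced here for illustration only.

```python
import numpy as np

students = np.array([1,0,1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,1,1,0])

# Count the ones explicitly and divide by the number of students.
p_count = np.count_nonzero(students) / students.size

# For 0/1 data, the mean is the same quantity: sum of ones / number of values.
p_mean = students.mean()

print(p_count, p_mean)
```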

`2.` Use numpy's **random.choice** to simulate 5 draws from the `students` array. What proportion of your sample drinks coffee?

In [16]:
sample_students = np.random.choice(students, size=5)

sample_students_prop = sample_students.mean()
sample_students_prop

0.59999999999999998

`3.` Repeat the above to obtain 10,000 additional proportions, where each sample is of size 5. Store these in a variable called `sample_props`.

In [17]:
# Draw 10,000 samples of size 5 (with replacement) and record each sample's proportion
sample_props = []
for _ in range(10000):
    sample_props.append(np.random.choice(students, size=5).mean())
sample_props = np.array(sample_props)
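The loop above can also be written without an explicit Python loop. The sketch below is one possible alternative, assuming NumPy's `Generator` API (`np.random.default_rng`); because it uses a different random stream than `np.random.seed(42)`, its numbers will not match the outputs shown in this notebook. The names `rng`, `draws`, and `sample_props_vec` are illustrative.

```python
import numpy as np

students = np.array([1,0,1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,1,1,0])

# Assumption: a separate Generator with its own seed, so the draws differ
# from those produced by the loop-based cell.
rng = np.random.default_rng(42)

# One row per sample: 10,000 samples of 5 students, drawn with replacement.
draws = rng.choice(students, size=(10_000, 5))

# The mean of each row is that sample's proportion of coffee drinkers.
sample_props_vec = draws.mean(axis=1)

print(sample_props_vec.shape)
print(sample_props_vec.mean())
```

Drawing all samples in one call is usually faster than appending in a loop, at the cost of holding the full 10,000 x 5 array in memory.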

`4.` What is the mean of all 10,000 of these proportions? This is often called **the mean of the sampling distribution**.

In [18]:
sample_props_mean = sample_props.mean()
sample_props_mean

0.71399999999999997

`5.` What are the variance and standard deviation for the original 21 data values?

In [19]:
# np.var and np.std use ddof=0 by default, i.e. the population formulas
stu_var = np.var(students)
stu_sd = np.std(students)

print(f'Var of 21 students: {stu_var}')
print(f'Std of 21 students: {stu_sd}')

Var of 21 students: 0.20408163265306126
Std of 21 students: 0.45175395145262565


`6.` What are the variance and standard deviation for the 10,000 proportions you created?

In [20]:
sample_props_var = np.var(sample_props)
sample_props_sd = np.std(sample_props)

print(f'Var of 5 students props: {sample_props_var}')
print(f'Std of 5 students props: {sample_props_sd}')

Var of 5 students props: 0.041763999999999996
Std of 5 students props: 0.2043624231604235


`7.` Compute p(1-p). Which of your answers does this most closely match?

In [21]:
Pa = p*(1-p)
Pa

0.20408163265306123

In [22]:
f'it matches var of 21 students - {(stu_var/Pa)*100}%'

'it matches var of 21 students - 100.00000000000003%'
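The match with the variance of the 21 values is not a coincidence. For 0/1 data, squaring a value leaves it unchanged, so the mean of the squares is also p, and the population variance (mean of squares minus squared mean) reduces to p - p² = p(1-p). The sketch below checks this numerically; `mean_of_squares` and `var_identity` are names introduced here for illustration.

```python
import numpy as np

students = np.array([1,0,1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,1,1,0])
p = students.mean()

# For 0/1 data x**2 == x, so the mean of the squares equals p itself.
mean_of_squares = (students ** 2).mean()

# Population variance written as E[x^2] - (E[x])^2.
var_identity = mean_of_squares - p ** 2

print(np.isclose(mean_of_squares, p))
print(np.isclose(var_identity, p * (1 - p)))
print(np.isclose(np.var(students), p * (1 - p)))
```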

`8.` Compute p(1-p)/n. Which of your answers does this most closely match?

In [23]:
Pb = p*(1-p)/5
Pb

0.04081632653061225

In [25]:
f'it matches var of 10,000 props - {(sample_props_var/Pb)*100}%'

'it matches var of 10,000 props - 102.32179999999997%'
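The small gap between the simulated variance and p(1-p)/n comes from simulation noise. Because `np.random.choice` samples with replacement by default, the number of coffee drinkers in a sample of size n follows a binomial distribution, and the variance of the sample proportion can be computed exactly. The following is a sketch under that assumption; `binom_pmf`, `mean_exact`, and `var_exact` are illustrative names, not part of the original notebook.

```python
from math import comb

import numpy as np

students = np.array([1,0,1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,1,1,0])
p = students.mean()
n = 5


def binom_pmf(k, n, p):
    """Probability of exactly k coffee drinkers in n draws with replacement."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Exact mean and variance of the sample proportion k/n over all possible k.
mean_exact = sum(binom_pmf(k, n, p) * (k / n) for k in range(n + 1))
var_exact = sum(binom_pmf(k, n, p) * (k / n - mean_exact) ** 2 for k in range(n + 1))

print(mean_exact, p)
print(var_exact, p * (1 - p) / n)
```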

`9.` Notice that your answer to `8.` is commonly called the **variance of the sampling distribution**. If you were to change the sample size from 5 to 20, what would this do to the variance of the sampling distribution? Simulate and calculate the new answers to `6.` and `8.` to check that the consistency you found before still holds.

In [26]:
## Simulate 10,000 samples of size 20
sample_props2 = []
for _ in range(10000):
    sample_props2.append(np.random.choice(students, size=20).mean())

In [27]:
sample_props2 = np.array(sample_props2)
sample_props2_mean = sample_props2.mean()
sample_props2_mean

0.71492500000000003

In [28]:
Pc =  p*(1-p)/20
Pc

0.010204081632653062

In [29]:
## Compare the variance and standard deviation as computed in 6 and 8,
## but now with samples of size 20

sample_props2_var = np.var(sample_props2)
sample_props2_std = np.std(sample_props2)


print(f'Var of 20 students prop: {sample_props2_var}')
print(f'Std of 20 students prop: {sample_props2_std}')

Var of 20 students prop: 0.010300994374999999
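The same comparison can be repeated for several sample sizes to see how the simulated variance tracks p(1-p)/n as n grows. This is a rough sketch, not the notebook's own method: it assumes a separate `np.random.default_rng` generator (so the numbers differ from the cells above), and the helper `simulated_variance` and the extra sample size 80 are introduced here purely for illustration.

```python
import numpy as np

students = np.array([1,0,1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,1,1,0])
p = students.mean()

# Assumption: an independent Generator; results will not match earlier cells.
rng = np.random.default_rng(0)


def simulated_variance(n, reps=10_000):
    """Variance of `reps` sample proportions, each from n draws with replacement."""
    props = rng.choice(students, size=(reps, n)).mean(axis=1)
    return props.var()


results = {}
for n in (5, 20, 80):
    sim = simulated_variance(n)
    theory = p * (1 - p) / n
    results[n] = (sim, theory)
    print(f'n={n:>2}: simulated={sim:.5f}  theory={theory:.5f}  ratio={sim / theory:.3f}')
```

Quadrupling the sample size should cut the variance to roughly a quarter, which is what the p(1-p)/n formula predicts.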
Std of 20 students prop: 0.10149381446669545


`10.` Finally, plot a histogram of the 10,000 proportions for sample size 5 and of the 10,000 proportions for sample size 20. Each of these distributions is a sampling distribution: one for proportions with sample size 5, the other for proportions with sample size 20.

In [None]:
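# Overlay both sampling distributions; the labels tell them apart in the legend
plot.hist(sample_props, alpha=.5, label='n = 5')
plot.hist(sample_props2, alpha=.5, label='n = 20')
plot.legend();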