### Lecture Notes: Sampling, Models, and Statistics

**Helpful Resource:**
- [Python Reference](http://data8.org/sp22/python-reference.html)

**Recommended Readings:**
- [Sampling](https://www.inferentialthinking.com/chapters/10/Sampling_and_Empirical_Distributions.html)
- [Distribution of a Statistic](https://www.inferentialthinking.com/chapters/10/3/Empirical_Distribution_of_a_Statistic.html)
- [Random Sampling in Python](https://www.inferentialthinking.com/chapters/10/4/Random_Sampling_in_Python.html)

In [25]:
# import modules to be used in this notebook

from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')


In [24]:
prize_list = make_array(0, 2, 4, 8, 10, 18, 30, 88, 100, 888)
np.random.choice(prize_list, 6), np.random.choice(prize_list, 6, replace=False)

(array([888, 888, 100,  18,  10,  30]), array([ 30, 100,   8,   0,  10,   2]))

Assume that you are sampling from 10 values with replacement, as in the first array above. What is the chance that all 6 values in the sample are distinct?
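The chance can be worked out by counting. The sketch below assumes a sample of 6 draws, matching the `np.random.choice(prize_list, 6)` call above; treat the sample size as an assumption and change `sample_size` if a different one is intended.

```python
import math

population_size = 10   # number of distinct values in the population
sample_size = 6        # assumed: same size as the sample drawn above

# Ordered samples with no repeated value, divided by all ordered samples.
distinct_samples = math.perm(population_size, sample_size)
all_samples = population_size ** sample_size
chance_distinct = distinct_samples / all_samples

print(chance_distinct)   # 0.1512
```

So roughly 15% of samples of 6 drawn with replacement from 10 values contain no repeats.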

Similarly, the `sample` method of a table draws rows with replacement by default, or without replacement when `with_replacement=False` is passed.

In [None]:
students = Table.read_table('student_data.csv')
students

In [None]:
students.sample(10)

In [None]:
students.sample(10, with_replacement=False)

### Sampling Distribution of Means

Consider a population made up of 3 values: 1, 2, and 5. How many samples of size 2 are possible if they are drawn with replacement? List these samples.

What are all the possible sample means based on these samples? 

If we take the mean of all these sample means, what will it be? Compare your result with the simulation. 

In [None]:
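pop = make_array(1, 2, 5)
n = 2                   # sample size
repetitions = 1000      # number of simulated samples
results = make_array()
for i in np.arange(repetitions):
    sample = np.random.choice(pop, n)    # draw n values with replacement
    statistic = np.average(sample)       # sample mean
    results = np.append(results, statistic)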

results
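For a population this small, the simulation can be checked against an exact enumeration. The following is a minimal sketch using only the standard library; it treats each ordered pair as a separate, equally likely sample.

```python
from fractions import Fraction
from itertools import product

pop = (1, 2, 5)
n = 2

# Every ordered sample of size n drawn with replacement: 3**2 = 9 samples.
samples = list(product(pop, repeat=n))
sample_means = [Fraction(sum(s), n) for s in samples]

mean_of_means = sum(sample_means) / len(sample_means)
population_mean = Fraction(sum(pop), len(pop))

print(len(samples))                      # 9
print(mean_of_means, population_mean)    # 8/3 8/3
```

The mean of all the sample means equals the population mean, 8/3, so the average of the simulated `results` should come out close to 2.67.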

### Sampling from Tables 

By treating the student data as the population, we draw many samples of 2 and compute the sample mean from each sample. 

In [None]:
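n = 2                   # sample size
repetitions = 10000     # number of simulated samples
results = make_array()
for i in np.arange(repetitions):
    # mean height of n students drawn with replacement
    statistic = np.average(students.sample(n).column('HEIGHT'))
    results = np.append(results, statistic)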

results

To be able to vary the sample size and the variable, we use the following function to sample from the student data. 

In [None]:
def sampling_distribution(n, col):
    """Return 10,000 simulated sample means of column `col` for samples of size n."""
    repetitions = 10000
    results = make_array()
    for i in np.arange(repetitions):
        statistic = np.average(students.sample(n).column(col))
        results = np.append(results, statistic)
    return results

In [None]:
sample_5 = sampling_distribution(5, 'HEIGHT')

In [None]:
sample_10 = sampling_distribution(10, 'HEIGHT')

In [None]:
sample_50 = sampling_distribution(50, 'HEIGHT')

In [None]:
sampling_distributions = Table().with_columns('n=5', sample_5, 'n=10', sample_10, 'n=50', sample_50)

In [None]:
sampling_distributions.hist(bins = np.arange(62, 72, 0.5))
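The histograms should narrow as the sample size grows, because the standard deviation of a sample mean is the population standard deviation divided by the square root of n. The sketch below illustrates this with a hypothetical, randomly generated population of heights; it is not the student data, and the mean of 67 and standard deviation of 3 are made-up values for illustration only.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population of 1,000 heights in inches; NOT the student data.
population = [random.gauss(67, 3) for _ in range(1000)]

def sample_means(n, repetitions=2000):
    """Means of `repetitions` samples of size n drawn with replacement."""
    return [statistics.fmean(random.choices(population, k=n))
            for _ in range(repetitions)]

# Spread (standard deviation) of the simulated sample means for each sample size.
spread = {n: statistics.pstdev(sample_means(n)) for n in (5, 10, 50)}
for n, sd in spread.items():
    print(n, round(sd, 2))
```

The printed spread decreases as n increases, which is the same pattern the three histograms above are meant to show.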

### Sampling Distribution of Proportions 

Consider the same population of 3 values: 1, 2, and 5. For each sample of 2, calculate the proportion of 1's in the sample. The following code simulates the sampling distribution of that proportion.

In [None]:
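pop = make_array(1, 2, 5)
n = 2                   # sample size
repetitions = 1000      # number of simulated samples
results = make_array()
for i in np.arange(repetitions):
    sample = np.random.choice(pop, n)               # draw n values with replacement
    statistic = np.count_nonzero(sample == 1) / n   # proportion of 1's in the sample
    results = np.append(results, statistic)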
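As with the sample means, the exact sampling distribution of the proportion can be enumerated for comparison with the simulation. This is a minimal standard-library sketch that again treats the 9 ordered samples as equally likely.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

pop = (1, 2, 5)
n = 2

samples = list(product(pop, repeat=n))
proportions = [Fraction(s.count(1), n) for s in samples]

# Probability of each possible sample proportion (every sample is equally likely).
counts = Counter(proportions)
distribution = {p: Fraction(c, len(samples)) for p, c in sorted(counts.items())}
for p, prob in distribution.items():
    print(p, prob)      # 0 4/9, then 1/2 4/9, then 1 1/9

mean_proportion = sum(p * prob for p, prob in distribution.items())
print(mean_proportion)  # 1/3
```

The mean of the sample proportions is 1/3, the proportion of 1's in the population.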

### Swain v. Alabama

The lecture cites the Supreme Court case [Swain v. Alabama (1965)](https://supreme.justia.com/cases/federal/us/380/202/). Instead of `np.random.choice`, we can use the `sample_proportions` function from the `datascience` package, which draws a sample of a given size from a categorical distribution and returns the proportion of the sample in each category. For example, assuming that 26% of the eligible jurors are black, the following code simulates one sample of 100 jurors and returns an array containing `(proportion_of_black_jurors, proportion_of_non-black_jurors)`.

In [None]:
sample_proportions(100, make_array(0.26, 0.74))

Using `sample_proportions`, we can simulate the sampling distribution of the number of black jurors in panels of 100. Each simulated proportion is multiplied by the panel size to convert it to a count.

In [None]:
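def sampling_distribution_black(n, prop):
    """Simulate the number of black jurors in 100,000 panels of size n."""
    results = make_array()
    proportions = make_array(prop, 1 - prop)
    for i in np.arange(100000):
        # proportion of black jurors in one panel, converted to a count
        count = sample_proportions(n, proportions).item(0) * n
        results = np.append(results, count)
    return results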

In [None]:
results = sampling_distribution_black(100, 0.26)
Table().with_column('Number of black jurors', results).hist()
np.count_nonzero(results < 10)   # simulated panels with fewer than 10 black jurors
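The count in the last line can be cross-checked against an exact calculation. The sketch below assumes that each of the 100 jurors is drawn independently with probability 0.26 of being black, so that the number of black jurors follows a binomial distribution; this modeling assumption belongs to the sketch and is meant to mirror sampling from fixed proportions.

```python
from math import comb

n = 100      # panel size
p = 0.26     # assumed proportion of black jurors among eligible jurors

# P(fewer than 10 black jurors) = sum of binomial probabilities for k = 0..9
chance_fewer_than_10 = sum(
    comb(n, k) * p**k * (1 - p)**(n - k) for k in range(10)
)

# Expected number of such panels among 100,000 simulated panels
expected_count = 100000 * chance_fewer_than_10
print(chance_fewer_than_10, expected_count)
```

If the simulated count is in the same range as `expected_count`, the simulation and the model agree; a very small value indicates that panels with so few black jurors are rare under random selection.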