# Inference: Statistics, Models, Hypotheses


In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

custom_palette = sns.color_palette('viridis', 2)
sns.set_palette(custom_palette)

### Flight delays data
Data on delays of __all__ United flights out of SFO in 2015

In [None]:
# read data
united_df = pd.read_csv('united.csv')
display(united_df.head())
united_df.shape[0]

We have data on the entire population, so we can look at a histogram for the probability distribution:

In [None]:
# plot probability distribution of delays based on the entire population
delay_bins = np.arange(-20, 301, 10)
ax = sns.displot(united_df, x='Delay', bins=delay_bins)
ax.set(ylabel='number of flights')

Let's take a simple random sample from the population and look at the resulting empirical distribution:

In [None]:
# sample flights and examine the empirical distribution of the delays
def empirical_hist_delay(n):
    sampled_flights = united_df.sample(n)
    ax = sns.displot(sampled_flights, x='Delay', bins=delay_bins)
    
empirical_hist_delay(100)    

## Statistics
__A statistic__ is a quantity that is a function of the sample (not the population). We can calculate it from the sample and use it as an estimate of a parameter of the population.

For example, say our parameter is the median delay in United flights out of SFO in 2015. Since we have data on the entire population, we know the value of this parameter:

In [None]:
# median delay: PARAMETER OF THE POPULATION
united_df['Delay'].median()

But, let's say we only have a sample. We can look at the median delay in the sample:

In [None]:
# statistic based on one sample of median delay
sample_flights = united_df.sample(100)
print(sample_flights['Delay'].median())

With different samples, we get a different value for the statistic (remember the value of the parameter in the population remains fixed). 

### Simulating a statistic

Because each sample we take gives a different value for the statistic, the statistic itself has a distribution! We can therefore simulate taking a sample many times, calculate the statistic each time, and plot the **empirical** distribution of the statistic.

We already saw how to simulate a random trial and summarize the findings.

Let's simulate the median delay in a random sample of 100 flights:

In [None]:
# value from one sample
def random_sample_median(n):
    sample_flight = united_df.sample(n)
    return sample_flight['Delay'].median()


# run many simulations
num_sims = 3000 # the number of simulations we run
median_values = []
sample_size = 100 # the number of flights drawn in each sample: MUST be the same in every simulation!
for i in range(num_sims):
    median_values.append(random_sample_median(sample_size))
    
ax = sns.displot(median_values, bins=np.arange(-2.125,11.1,0.25))
ax.set(xlabel='median delay in a sample of '+str(sample_size)+' flights', xticks=np.arange(-2, 11, 2))

## Jury selection example: First inference problem

In the early 1960’s, in Talladega County in Alabama, a black man called Robert Swain was convicted of raping a white woman and was sentenced to death. He appealed his sentence, citing among other factors the all-white jury. At the time, only men aged 21 or older were allowed to serve on juries in Talladega County. In the county, 26% of the eligible jurors were black, but there were only 8 black men among the 100 selected for the jury panel in Swain’s trial. No black man was selected for the trial jury.

In 1965, the Supreme Court of the United States denied Swain’s appeal. In its ruling, the Court wrote “… the overall percentage disparity has been small and reflects no studied attempt to include or exclude a specified number of Negroes.”

Jury panels are supposed to be selected at random from the eligible population. Because 26% of the eligible population was black, 8 black men on a panel of 100 might seem low. **But is it too low?**

### A Model
One view of the way the real world generated the data – a model, in other words – is that the panel was selected at random and ended up with a small number of black men just due to chance. This model is consistent with what the Supreme Court wrote in its ruling.

The model specifies the details of a chance process. It says the data are like a random sample from a population in which 26% of the people are black. We are in a good position to assess this model, because:

- We can simulate data based on the model. That is, we can simulate drawing at random from a population of whom 26% are black.
- Our simulation will show what a panel *would* look like *if* it were selected at random.
- We can then compare the results of the simulation with the composition of Robert Swain’s panel.
- If the results of our simulation are not consistent with the composition of Swain’s panel, that will be evidence against the model of random selection.

Steps:
1. What is the statistic that we want to simulate?
2. Code generating one value of that statistic
3. Run the code many times and store the output in a collection array

In [None]:
population = ['black', 'other']
prob_population = [0.26, 0.74] # probabilities of selection GIVEN THE MODEL IS TRUE
panel_size = 100 

# sample one value
def sample_jury_panel():
    sample_panel = np.random.choice(population, p=prob_population, size=panel_size)
    num_blacks = np.count_nonzero(sample_panel == 'black')
    return num_blacks

# run multiple simulations
num_repetitions = 10000
samples = np.empty(num_repetitions) # collection array
for i in range(num_repetitions):
    samples[i] = sample_jury_panel()
    
samples
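
The loop above can also be written as a single vectorized call. Under the model, the number of black men in a randomly selected panel of 100 follows a binomial distribution with 100 draws and success probability 0.26, so NumPy can generate all the counts at once. This is a sketch of an alternative, not the notebook's method; the names `rng`, `prob_black`, and `samples_vectorized` are introduced here for illustration only.

```python
import numpy as np

# Assumed names for this sketch (not part of the original notebook)
rng = np.random.default_rng(0)   # seeded only so the sketch is reproducible
panel_size = 100
prob_black = 0.26                # probability of drawing a black panelist GIVEN THE MODEL IS TRUE
num_repetitions = 10000

# Each entry is the number of black panelists in one simulated panel of 100
samples_vectorized = rng.binomial(n=panel_size, p=prob_black, size=num_repetitions)

print(samples_vectorized[:10])    # first few simulated counts
print(samples_vectorized.mean())  # should be close to 100 * 0.26 = 26
```

Both versions simulate the same chance process; the explicit loop makes each step visible, while the vectorized call is shorter and faster.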

Now, let's see the empirical distribution of the statistic, under the assumptions of the model:

In [None]:
# plot the empirical distribution of the statistic
ax = sns.displot(samples, stat="density", bins=np.arange(0,50,1))
ax.set(xlabel='number of blacks out of '+ str(panel_size)+' jurors', ylabel='proportion in samples', xticks=np.arange(0, 51, 10));

This is the output of the simulation of the statistic: what we would expect the empirical distribution of the statistic to look like if the model is true.

We also have data. Let's plot them together.

In [None]:
# plot the empirical distribution of the statistic
ax = sns.displot(samples, stat="density", bins=np.arange(0,50,1))
ax.set(xlabel='number of blacks out of '+ str(panel_size)+' jurors', ylabel='proportion in samples', xticks=np.arange(0, 51, 10));

# Add a red point on the plot marking our data
plt.scatter(8, 0, marker='.', s=200, color='red', clip_on=False)  # draw observed value on the x-axis (at (8,0))
plt.show()

What do you think? Is the model __consistent__ with the data? <br>
Does it look like the Supreme Court was right? 

### *P*-values

The *P*-value is the probability, under the null hypothesis (here, the model of random selection), that the test statistic is equal to the value that was observed in the data or is even further in the direction of the alternative.
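
When the distribution of the statistic is simulated, this probability can be estimated as the fraction of simulated values that are at least as extreme as the observed one. Below is a minimal sketch for the case where "further in the direction of the alternative" means *smaller*; the helper name `empirical_p_value_lower` and the toy values are hypothetical and only illustrate the counting.

```python
import numpy as np

def empirical_p_value_lower(simulated_stats, observed):
    """Fraction of simulated statistics that are <= the observed value."""
    simulated_stats = np.asarray(simulated_stats)
    return np.count_nonzero(simulated_stats <= observed) / len(simulated_stats)

# Toy example: 2 of the 5 simulated values (3 and 5) are <= 5
toy_stats = [3, 5, 7, 9, 11]
print(empirical_p_value_lower(toy_stats, 5))
```

If the alternative pointed toward larger values, the comparison would be `>=` instead.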

Let's compute the *P*-value for the jury case:

In [None]:
count_eight_or_fewer = np.count_nonzero(samples <= 8)
print('The p-value is', count_eight_or_fewer / len(samples))
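
The simulated value is only an estimate, and with 10,000 repetitions it will often print exactly 0 simply because no simulated panel had 8 or fewer black men. As a cross-check, the same tail probability can be computed exactly from the binomial distribution implied by the model. This is a sketch using only the standard library; `binomial_pmf`, `prob_black`, and `exact_p_value` are names introduced here, not part of the original notebook.

```python
from math import comb

panel_size = 100
prob_black = 0.26   # probability of selection GIVEN THE MODEL IS TRUE
observed = 8        # black men on Swain's panel

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent draws."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# P(count <= 8) under the model: sum the probabilities of 0, 1, ..., 8
exact_p_value = sum(binomial_pmf(k, panel_size, prob_black) for k in range(observed + 1))
print('Exact p-value under the model:', exact_p_value)
```

The exact value is positive but far below 1 in 10,000, which explains why a simulation of this size rarely produces a panel that extreme.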

Let's say that the jury panel in Swain's trial had had 20 black men in it, rather than only 8 as it really did. This is still less than 26% of the panel, even though 26% of the eligible population was black. But is it unlikely enough to provide evidence against a model stating that the jury panel was selected at random from the population?
Compute the *P*-value.