# Statistical Hypothesis Testing

# Reading

[Chapter 11: 11.1 - 11.4](https://inferentialthinking.com/chapters/11/Testing_Hypotheses.html)<br><br>

In this notebook we will apply our understanding of random samples and empirical distributions to test a _hypothesis_ about a dataset. A hypothesis is an assumption that we make about a parameter of the data.

Examples of hypotheses that we might test: a new website increases the chance that customers purchase a product, or a new drug lowers patients' blood pressure more than an existing drug does.

---

## Assessing a Model

Before we can test a hypothesis, we need to learn how to create a _model_ that can be used to test the hypothesis. In data science a _model_ is a set of assumptions about the data, including how the data was obtained. Our work is to determine whether the model is good or not.

We will evaluate a model by using the jury selection example discussed in the [textbook](https://inferentialthinking.com/chapters/11/1/Assessing_a_Model.html#jury-selection).

A jury pool (or jury panel) is the group of people from which the jurors for a civil or criminal trial are chosen. A jury pool is supposed to be racially diverse and representative of the communities where the trial takes place.

The textbook discusses a real-life case in 1962, when a Black man named Robert Swain appealed his conviction on the grounds that Black people were often excluded from the jury pool, which led to an unfair trial for him. For Robert Swain's trial, 8 Black men were selected for the jury pool of 100 men, while Black men made up 26% of the population from which jurors were chosen. In the end, the court denied Robert Swain's appeal because it determined that the selection of jurors was fair.

We now check whether the data support the court's conclusion:
- We assume a model of fair jury selection, in which jurors are randomly selected from the population.
- With this model in mind, we simulate the selection of jurors and compare the simulated jury pools with the actual jury pool.
- If the actual jury pool does not look like the simulated ones, that is evidence that the jury selection was not fair.

The statistic that we will measure is the number of Black men on the jury panel.

First we use the following function to simulate this statistic with a sample size of 100, the same size as the actual jury pool.

In [None]:
import numpy as np

# this function accepts a sample size and the proportions or ratio
# of each category of the population

def sample_proportions(sample_size, proportions):
    # draw random sample from a population with the appropriate proportions
    categories = np.arange(len(proportions))
    sample = np.random.choice(categories, size=sample_size, p=proportions)

    # count occurrences of each category
    counts = np.bincount(sample, minlength=len(proportions))

    # convert counts to proportions
    return counts / sample_size

We test the function by running it a few times with a sample of 100 and the [0.26, 0.74] ratio for Blacks and non-Blacks:

In [None]:
proportions = sample_proportions(100, [0.26, 0.74])
print("Proportions for Blacks and Non-Blacks:", proportions)
# round before converting: int() alone truncates, so 28.999999 would print as 28
print("Number of Blacks:", int(round(100 * proportions[0])))

By running the Code cells above a few times, we can see that the number of Blacks selected is around 26.
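
To put a number on "around 26": under this model the count of Black men in a panel of 100 follows a binomial distribution, so its mean and standard deviation can be computed directly. The sketch below is a side calculation rather than part of the textbook's method, and the names `panel_size` and `p_black` are chosen here for illustration.

```python
import math

panel_size = 100   # number of men in the jury panel
p_black = 0.26     # proportion of Black men among the eligible population

# mean and standard deviation of a binomial count
expected_count = panel_size * p_black
std_dev = math.sqrt(panel_size * p_black * (1 - p_black))

print("Expected count:", expected_count)
print("Standard deviation:", round(std_dev, 2))
```

Counts within about two standard deviations of 26, roughly 17 to 35, would be unsurprising under this model.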

Next we simulate the jury selection by running `sample_proportions` 10,000 times and plotting the empirical distribution.

In [None]:
L = []
for i in range(10000):
    proportions = sample_proportions(100, [0.26, 0.74])  # 26% Black men, as in the population
    L.append(round(100 * proportions[0]))                # round to a whole count
array = np.array(L)
print(len(array))

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(4,3))
plt.hist(array, density=True, bins=np.arange(5.5, 46.6), edgecolor='black')
plt.title("Empirical Distribution of Statistic")
plt.xlabel("Count of Black Jurors")
plt.grid()
plt.show()

We can see that if Black men make up 26% of the population and the jury selection is a fair random selection, most panels of 100 include roughly 20 to 30 Black men. Essentially none of the simulated panels include as few as 8, the number in Robert Swain's trial.
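
One way to quantify how unusual 8 is: count how often a simulated panel has 8 or fewer Black men. The sketch below is self-contained and draws the counts directly with NumPy's binomial sampler instead of `sample_proportions`; that swap and the seed value are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# 10,000 simulated panels of 100 men, each man Black with probability 0.26
simulated_counts = rng.binomial(n=100, p=0.26, size=10_000)

observed_count = 8
share_at_or_below = np.mean(simulated_counts <= observed_count)

print("Smallest simulated count:", simulated_counts.min())
print("Share of panels with 8 or fewer:", share_at_or_below)
```

A share at or near 0 means that a panel like the actual one almost never arises under fair random selection.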

---

## Models with multiple categories

In the previous example, each man fell into one of only two categories (Black or not Black), so the count in one category was a natural statistic to measure. Now we look at a model with more than two categories and see how to choose an appropriate statistic.

Continuing with the jury selection process, we will look at selected jurors across multiple ethnicities.



We follow the [textbook](https://inferentialthinking.com/chapters/11/2/Multiple_Categories.html#composition-of-panels-in-alameda-county)'s example of jury selection in Alameda County, California. Between 2009 and 2010, there were 1453 people who reported for jury duty for 11 felony cases. Here are the data from the jury pools for these cases:

In [None]:
import pandas as pd
jury = pd.DataFrame ({'Ethnicity': ['Asian/PI', 'Black/AA', 'Caucasian', 'Hispanic', 'Other'],
                      'Eligible':[0.15, 0.18, 0.54, 0.12, 0.01],
                      'Selected':[0.26, 0.08, 0.54, 0.08, 0.04]})
jury

The `Eligible` column gives each ethnicity's share of the population eligible to be jurors. The `Selected` column gives each ethnicity's share of the people actually selected for the jury pool.

Using data visualization, we plot the `Eligible` and the `Selected` data of each ethnicity with a bar chart.

In the plot we put the bars for `Eligible` and `Selected` columns next to each other for easy visual comparison. To do this we need to adjust the placements of the bars and the yticks as shown in the comments in the Code cell.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# You don't need to write the code below.
# The code shows a visual comparison between the Eligible and Selected columns

plt.figure(figsize=(4,3))
bar_width = 0.25
# get the index of each row
index = jury.index.values

# plot the row index vs the data for 'Eligible'
plt.barh(index, jury['Eligible'], bar_width, label='Eligible')

# plot the row index vs the data for 'Selected'
# and adjust the location of the bar by bar_width
plt.barh(index + bar_width, jury['Selected'], bar_width, label='Selected')

# convert the index numbers into the corresponding data in 'Ethnicity'
plt.yticks(index + bar_width / 2, jury['Ethnicity'])

plt.xlabel('Proportion')
plt.ylabel('Ethnicity')
plt.title('Comparison of Eligible and Selected Jurors')
plt.legend()
plt.grid()
plt.show()

We can see that fewer Black/AA and Hispanic people were selected compared to the eligible people in these two ethnic groups.

We now want to simulate the random selection of 1453 people, which would be a fair selection, and compare the result with the actual selection above.

First we need to determine an appropriate statistic that incorporates the multiple ethnicities in the jury selection.

- Step 1:<br>
We calculate the difference between the `Eligible` and `Selected` proportions for each ethnicity, along with the absolute value of that difference. Then we store the difference and its absolute value as 2 new columns of `jury`.

In [None]:
diff = jury['Eligible'] - jury['Selected']
jury['Difference'] = round(diff, 2)
jury['Abs Diff'] = np.abs(diff)
jury

In [None]:
print("Sum of Eligible column:", round(jury.Eligible.sum(),2))
print("Sum of Selected column:", round(jury.Selected.sum(),2))
print("Sum of Difference column:", round(jury.Difference.sum(),2))
print("Sum of Abs Diff column:", round(jury['Abs Diff'].sum(),2))

- Step 2:<br>
Observing the columns of `jury` above, we note that:<br>
> - The proportions in `Eligible` add up to 1, or 100% of the eligible population.
> - Likewise, the proportions in `Selected` add up to 1, or 100% of the selected jurors.
> - The values in `Difference` add up to 0, since the positive differences add up to 0.14 and the negative differences add up to -0.14. This makes sense: if one ethnicity has a larger share among the selected jurors than among the eligible population, another ethnicity must have a smaller share.
> - The values in `Abs Diff` add up to 0.28, which is twice 0.14. The first 0.14 is the sum of the positive differences and the second 0.14 is the sum of the negative differences, after the absolute value removes the negative sign.

The value 0.14 is important: it is called the _total variation distance (TVD)_ between the two distributions. The TVD measures how far apart the distributions of the eligible jurors and the selected jurors are. The smaller the TVD, the closer the two distributions.

We will use the TVD as the statistic for our simulation of the jury selection process.

Since the TVD is half the sum of the `Abs Diff` column, we can write a function for the TVD between 2 distributions:


In [None]:
def get_TVD(distribution1, distribution2):
    return np.sum(np.abs(distribution1 - distribution2)) / 2

# test the function with the data from jury:
print(get_TVD(jury['Eligible'], jury['Selected']))
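
Two boundary cases help build intuition for the scale of the TVD: identical distributions give 0, and distributions with no category in common give 1. Below is a minimal check, using a stand-alone copy of the function (named `tvd` here so it does not clash with `get_TVD`) and made-up distributions.

```python
import numpy as np

def tvd(distribution1, distribution2):
    # half the sum of the absolute differences between the proportions
    diffs = np.asarray(distribution1) - np.asarray(distribution2)
    return np.sum(np.abs(diffs)) / 2

same = tvd([0.5, 0.3, 0.2], [0.5, 0.3, 0.2])      # identical distributions
disjoint = tvd([1.0, 0.0, 0.0], [0.0, 0.5, 0.5])  # no category in common

print("Identical distributions:", same)
print("Disjoint distributions:", disjoint)
```

On this scale, the jury panels' TVD of 0.14 means that 14% of the selected jurors would have to be moved to a different ethnic group for the two distributions to match.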

- Step 3:<br>
We now use the same `sample_proportions` function from earlier to create a sample of 1453 jurors with the proportions of ethnicity from the `Eligible` column. Then we apply `get_TVD` to the sample to get the simulated TVD value and compare it with 0.14.

In [None]:
proportions = sample_proportions(1453, jury['Eligible'])
print(round(get_TVD(proportions, jury['Eligible']),2))

The sample TVD value is much smaller than 0.14.

- Step 4:<br>
Now that we can create a sample of jurors and find the TVD, we can simulate the jury selection process by repeating the sampling 5000 times and recording the TVD values. Then we plot the resulting TVD values to get our empirical distribution.

In [None]:
L = []
for i in range(5000):
    proportions = sample_proportions(1453, jury['Eligible'])
    L.append(get_TVD(proportions, jury['Eligible']))
array = np.array(L)
print(len(array))

In [None]:
plt.figure(figsize=(4,3))
plt.hist(array, density=True, bins=np.arange(0, 0.2, 0.005), edgecolor='black')
plt.title("Empirical Distribution of Statistic")
plt.xlabel("TVD")
plt.scatter(0.14, 0, color='red', s=30) # plot the actual 0.14 TVD
plt.grid()
plt.show()

The plot shows that under fair random selection the TVD stays close to 0, almost always below 0.05. The actual jury panels' TVD of 0.14 (the red dot) lies far outside this range.
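
The same comparison can be summarized as a single number: the share of simulated TVDs that are at least as large as 0.14. The sketch below is self-contained; it uses `rng.multinomial` in place of `sample_proportions`, and the seed and variable names are choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

eligible = np.array([0.15, 0.18, 0.54, 0.12, 0.01])
panel_size = 1453
observed_tvd = 0.14

# each row holds the category counts of one simulated panel
counts = rng.multinomial(panel_size, eligible, size=5000)
sample_props = counts / panel_size

# TVD between each simulated panel and the eligible population
simulated_tvds = np.abs(sample_props - eligible).sum(axis=1) / 2

share_at_or_above = np.mean(simulated_tvds >= observed_tvd)
print("Largest simulated TVD:", round(simulated_tvds.max(), 3))
print("Share of simulated TVDs >= 0.14:", share_at_or_above)
```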

Our analysis of the jury selection in both Robert Swain's case and the Alameda County cases shows that the selected panels are very unlikely to have come from fair random selection. The selection process was biased and should be improved. There are historical, social, and economic reasons for the bias in jury selection. The [textbook](https://inferentialthinking.com/chapters/11/2/Multiple_Categories.html#reasons-for-the-bias) covers these reasons as well as the impact of an unfair jury selection.

---

## Testing Hypotheses



In the previous sections we learned how to create a model about the data and then run simulations to check whether the model is consistent with the data. In this section we will use the textbook's model of Gregor Mendel's genetic experiments with pea plants to learn the concepts and terminology of hypothesis testing.

Gregor Mendel observed that for the pea plant variety that he worked with, about 75% of the plants have purple flowers and 25% of the plants have white flowers. To test whether his observation was valid, Mendel grew 929 pea plants, among which 705 (or 76%) had purple flowers.

The model from Mendel's experiment is:  any pea plant has a 75% chance of having purple flowers.

- <u>The Hypotheses</u>

Statistical hypothesis testing is a way to make decisions or inferences from data. A hypothesis test involves 2 hypotheses:
1. The _null hypothesis_ says that the data were generated as the model describes. Any difference between the observed result and the model's expected result is due to random chance rather than a real phenomenon.<br>
The term _null_ means that any difference between the observed result and the model's expected result is due to _nothing_ but chance.
2. The _alternative hypothesis_ says that the difference between the observed result and the model's expected result is due to _some reason other than chance_.





For Mendel's model, in which a pea plant has a 75% chance of having purple flowers:

> The null hypothesis is: each pea plant has a 75% chance of having purple flowers. Any difference between the observed percentage and 75% is due to chance.

> The alternative hypothesis is: the model is not correct. The observed percentage differs from 75% for some reason other than chance.

- <u>The Test Statistic</u>

Since this is a model with 2 categories (purple and white flowers), the statistic will be the TVD.

Recall from the Alameda jury selection example that the TVD is half the sum of the absolute differences between two distributions. With only 2 categories, the two absolute differences are equal, so the TVD is simply the absolute difference in one category. Since the model's proportions are [0.75, 0.25], we compute the TVD as the absolute value of the difference between the purple flower percentage in a sample and 75, the expected purple flower percentage. Note that this TVD is measured in percentage points rather than as a proportion.

In [None]:
def find_TVD(percent_purple):
  return np.abs(percent_purple - 75)

Testing the `find_TVD` function by running it with the actual result from Mendel's experiment:

In [None]:
actual_percent_purple = 705 / 929 * 100   # = 76% as Mendel found
print(round(find_TVD(actual_percent_purple), 2))

As expected, the TVD is small: less than 1 percentage point.
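
The two-category shortcut can be checked against the general definition: half the sum of the absolute differences over both categories equals the absolute difference in either single category. The sketch works in proportions rather than percentages, and the function name `tvd` is a stand-in for the `get_TVD` function defined earlier.

```python
import numpy as np

def tvd(distribution1, distribution2):
    # general definition: half the sum of absolute differences
    diffs = np.asarray(distribution1) - np.asarray(distribution2)
    return np.sum(np.abs(diffs)) / 2

model = [0.75, 0.25]                 # purple, white under Mendel's model
purple = 705 / 929                   # observed proportion of purple flowers
observed = [purple, 1 - purple]

full_tvd = tvd(observed, model)      # uses both categories
shortcut = abs(purple - 0.75)        # uses the purple category only

# multiply by 100 to express both values in percentage points
print(round(full_tvd * 100, 2), round(shortcut * 100, 2))
```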

- <u>The Simulation and Empirical Distribution</u>

As with the jury selection models earlier, we now create one sample of 929 values with the proportions [0.75, 0.25] and find its TVD. Then we simulate with 10,000 samples and plot the resulting TVDs.

In [None]:
# create one sample and observe the TVD
def create_one_sample():
  proportions = sample_proportions(929, [0.75, 0.25])
  # round before converting: int() alone would truncate 696.9999 to 696
  number_of_purple = int(round(proportions[0] * 929))
  percent_purple = number_of_purple / 929 * 100
  return round(find_TVD(percent_purple),2)

print("TVD for one sample:", create_one_sample())

In [None]:
# simulate with 10,000 samples
L = []
for i in range(10000):
    L.append(create_one_sample())
TVD_array = np.array(L)

# plot the TVD distribution
plt.figure(figsize=(4,3))
plt.hist(TVD_array, density=True, edgecolor='black')
plt.title("Empirical Distribution of Statistic")
plt.xlabel("TVD")
plt.grid()
plt.show()

Mendel's experiment has an actual TVD of 0.89, as we found in a previous step. This value agrees with the distribution: it lies in the range where simulated TVDs are most common. The data are therefore consistent with Mendel's model, and we do not reject the null hypothesis.

## The p-Value

For the hypothesis testing example with Mendel's model above, what if another experiment has a TVD value of 3.5? Looking up where 3.5 is in the histogram above, we see that it is at the right tail end of the distribution, where fewer samples are.

Using the `TVD_array` or the list of all TVD values from the 10,000 samples, we can calculate the chance that an experiment will produce an observed statistic of 3.5 or more.

In [None]:
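# count the simulated TVDs at or above 3.5 and express them as a percentage
samples_at_or_above = np.count_nonzero(TVD_array >= 3.5)
percent_at_or_above = samples_at_or_above / len(TVD_array) * 100
percent_at_or_above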

There's only about a 1% chance that a sample has a TVD at or above 3.5. A TVD of 3.5 would be unlikely if the model were correct, so something unusual happened in that experiment.

The low chance of getting a TVD value at or above 3.5 brings us to the _p-value_.

The p-value of a test is the chance, assuming the null hypothesis is true, that the test statistic is equal to the observed value or even further in the direction of the alternative hypothesis. In our simulations, it is the proportion of simulated statistics at or beyond the observed statistic.
- A large p-value (5% or greater) means the observed statistic is in the expected range of the simulated results. The data do not reject the null hypothesis.
- A small p-value (less than 5%) means the observed statistic is at the tail end of the simulated results. Such a result is called _statistically significant_, and we reject the null hypothesis in favor of the alternative hypothesis.
- A very small p-value (less than 1%) is called _highly statistically significant_ and also rejects the null hypothesis.

Note that the 5% cut-off that we use to decide whether to reject the null hypothesis is a convention, not an absolute rule. Depending on the data and hypotheses, a lower or higher cut-off may be used.

In Mendel's experiment, the TVD was 0.89. We calculate the p-value for this observed data.

In [None]:
samples_above = np.count_nonzero(TVD_array >= 0.89)
percent_above = samples_above / 10000 * 100
percent_above

As expected, this is a large p-value, so the data do not reject the null hypothesis.
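
The two p-value calculations above follow the same pattern, so they can be wrapped in a small helper. The function name `empirical_p_value` and the use of `rng.binomial` to regenerate the simulated TVDs are assumptions made to keep this sketch self-contained; it is not code from the textbook.

```python
import numpy as np

def empirical_p_value(simulated_stats, observed_stat):
    # share of simulated statistics at least as large as the observed one
    simulated_stats = np.asarray(simulated_stats)
    return np.count_nonzero(simulated_stats >= observed_stat) / len(simulated_stats)

rng = np.random.default_rng(2)

# simulate 10,000 experiments of 929 plants under Mendel's model
purple_counts = rng.binomial(n=929, p=0.75, size=10_000)
simulated_tvds = np.abs(purple_counts / 929 * 100 - 75)

observed_tvd = np.abs(705 / 929 * 100 - 75)   # Mendel's result
p_mendel = empirical_p_value(simulated_tvds, observed_tvd)
p_unusual = empirical_p_value(simulated_tvds, 3.5)

print("p-value for Mendel's data:", p_mendel)
print("p-value for a TVD of 3.5:", p_unusual)
```

The helper returns a proportion between 0 and 1; multiply by 100 to compare it with the percentages used in this notebook.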

---

## Error Probability

Suppose we create a model to test that in a fair coin toss, the chance that the coin lands heads is 50%.

In the Code cell below are the 3 steps to test the model:
1. We create a sample of 2000 coin tosses with proportions [0.5, 0.5].
2. The statistic is the absolute difference between the number of heads in the sample and the expected number of heads, which is 1000.
3. We simulate with 10,000 samples to plot the distribution of the statistic.



In [None]:
# steps 1 and 2
def create_one_sample():
  proportions = sample_proportions(2000, [0.5, 0.5])
  # round before converting: int() alone would truncate 999.9999 to 999
  number_of_heads = int(round(proportions[0] * 2000))
  return np.abs(number_of_heads - 1000)

# step 3:
L = []
for i in range(10000):
    L.append(create_one_sample())
statistic_array = np.array(L)

# plot the distribution of the statistic
plt.figure(figsize=(4,3))
plt.hist(statistic_array, density=True, edgecolor='black')
plt.title("Distribution of |Heads - 1000| in 2000 Tosses")
plt.xlabel("|number of heads - 1000|")
plt.axvline(x=45, color='red', linestyle='--')
plt.grid()
plt.show()

The histogram shows what the null hypothesis predicts: for the majority of samples, the number of heads is close to 1000 and the difference `|number of heads - 1000|` is close to 0.

About 5% of the simulated samples have a difference of 45 or more, so 45 marks the 5% p-value cut-off. It is shown as the dashed red line in the histogram.

In [None]:
np.count_nonzero(statistic_array >= 45) / 10000 * 100
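
The cut-off near 45 can also be estimated from the simulated statistics directly, as their 95th percentile. The sketch below regenerates the statistics with `rng.binomial` so that it runs on its own; the seed and variable names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# number of heads in each of 10,000 simulated runs of 2000 fair tosses
heads = rng.binomial(n=2000, p=0.5, size=10_000)
statistics = np.abs(heads - 1000)

# value below which about 95% of the simulated statistics fall
cutoff = np.percentile(statistics, 95)
share_beyond = np.mean(statistics >= 45)

print("95th percentile of the statistic:", cutoff)
print("Share of samples with statistic >= 45:", share_beyond)
```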

Using our 5% p-value cut-off, this means that if a coin is tossed 2000 times and the number of heads is 1045 or more, or 955 or fewer, we will conclude that the coin is not fair.

But conversely, suppose we have a coin that we know is fair. The histogram above tells us that if we toss the coin 2000 times, there's about a 5% chance that the number of heads is 1045 or more, or 955 or fewer. This would cause us to decide _incorrectly_ that the coin is not fair.

This leads us to the following fact: if we use a
_p_% cutoff for the p-value, and the null hypothesis happens to be true, then there is about a
_p_% chance that we will conclude _incorrectly_ that the alternative hypothesis is correct.

The p-value cut-off is therefore the _error probability_: the probability of rejecting the null hypothesis when it is actually true.
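
The error probability can be checked by simulation: test a coin that is known to be fair many times and count how often the test wrongly rejects it. This sketch assumes the rejection rule stated above (a difference of 45 or more) and uses `rng.binomial` with an arbitrary seed.

```python
import numpy as np

rng = np.random.default_rng(4)

num_experiments = 10_000
cutoff = 45   # reject "fair" when |number of heads - 1000| >= 45

# every experiment uses a truly fair coin, so every rejection is an error
heads = rng.binomial(n=2000, p=0.5, size=num_experiments)
wrongly_rejected = np.abs(heads - 1000) >= cutoff

false_rejection_rate = np.mean(wrongly_rejected)
print("Share of fair coins declared unfair:", false_rejection_rate)
```

The printed share should land near 0.05, in line with the 5% cut-off.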

Being aware of the error probability is important. For example, it means that a conclusion about a breakthrough scientific result should not rely on a single experiment. The experiment needs to be replicated, in case random chance led to the wrong conclusion the first time.

---

In this notebook we applied sampling and simulation to models that are used to test hypotheses. Comparing the observed data with the simulated data lets us reject or fail to reject the null hypothesis, where the hypothesis is our assumption about a parameter of the data. In drawing conclusions about the hypothesis, we also keep in mind the error probability that comes from working with random samples.