# Inference for categorical data

The `race_justice` dataset contains the results of a Yahoo! News poll conducted by YouGov on May 29-31, 2020. In total, 1060 U.S. adults were asked a series of questions regarding race and justice in the wake of the killing of George Floyd by a police officer. Results in this data set are percentages for the question, "Do you think Blacks and Whites receive equal treatment from the police?" For this particular question there were 1059 respondents.

In this lab you will use the poll data to perform hypothesis tests on categorical data, using both simulations and mathematical models. This lets you compare each simulated distribution with the model's distribution, and compare the results the two approaches produce.

# Getting started

## Load packages

For this lab we will need the following packages.

```python
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as st
import pandas as pd
import seaborn as sns
```

There are two new packages being used here, **scipy** and **numpy**. In this lab you will be constructing both simulations and models. These two packages will assist in the construction of the mathematical models.

## Creating a reproducible lab report

You will be using Jupyter notebook to create reproducible lab reports. Download the lab report template and load the template into Jupyter notebook. These templates can be used for each of the labs.

## The data

The data we are working with is in the `race_justice.csv` file. Download the file and load it into a data frame in **Python**. The `race_justice` data frame has two variables, `race_eth` and `response`. The `race_eth` variable is the self-reported race/ethnicity of the respondent, with levels `White`, `Black`, `Hispanic`, and `Other`. The `response` variable is the respondent's answer to the question "Do you think Black and White people receive equal treatment from the police?", with levels `Yes`, `No`, and `Not sure`.
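A minimal sketch of the loading step, assuming `race_justice.csv` has been downloaded into the same folder as the notebook (that location is an assumption; adjust the path to wherever you saved the file). The helper name `load_race_justice` is hypothetical and is only used here to check that the two expected columns are present.

```python
from pathlib import Path

import pandas as pd


def load_race_justice(path="race_justice.csv"):
    """Read the poll data and check that the two expected columns are present."""
    df = pd.read_csv(path)
    missing = {"race_eth", "response"} - set(df.columns)
    if missing:
        raise ValueError(f"missing expected columns: {sorted(missing)}")
    return df


# Only attempt the load if the file is actually in the working directory.
if Path("race_justice.csv").exists():
    race_justice = load_race_justice()
    print(race_justice.head())
    print(race_justice["response"].value_counts())
```

Printing the first few rows and the counts of each response is a quick way to confirm that the levels match the description above.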

<div class="alert alert-block alert-info">
<b>Exercise 1:</b> What do you expect to see in this data? Make some predictions about the statistics.</div>

# Inference with a single proportion

What percent of U.S. adults think Blacks and Whites receive equal treatment from police? You can attempt to answer this question using a confidence interval. In lab 5, you learned how to construct a bootstrapped distribution in order to create a confidence interval.

<div class="alert alert-block alert-info">
<b>Exercise 2:</b> Construct a bootstrap distribution for the proportion of people who responded <code>Yes</code> to the survey, store it in a variable named <code>bootstrap_dist_yes</code>, and display it in a histogram. Use this to find a 95% confidence interval.</div>

To compare the bootstrapped distribution to the mathematical model you will need to create a graph with both objects. The first piece of the puzzle is the bootstrap distribution.

```python
ax = sns.histplot(data=bootstrap_dist_yes, bins=30, label='bootstrap distribution', stat='density')
```

As in lab 5, where you created the multi-colored histogram with a vertical line at the observed value, the histogram is assigned to a variable, `ax`, so that other graphics can be superimposed on the same plot. The `histplot` function takes a few additional parameters here. The `label` parameter names the histogram for the legend. The `stat='density'` setting rescales the histogram so that its total area is 1, which puts it on the same scale as the mathematical model.

Now it is time to add in the mathematical model. Assuming the conditions are met, the sampling distribution of the sample proportion $\hat{p}$ is nearly normal with mean $p$ and a standard error estimated by $\text{SE} = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$. Let's get these values.

```python
p_yes = race_justice['response'].value_counts(normalize=True)['Yes']
SE = ((p_yes * (1 - p_yes))/len(race_justice))**0.5
```

The only new operator here is `**`, which performs exponentiation. Writing `**0.5` raises the value to the `0.5` power, which is the same as taking the square root.
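To see the formula in action, here is a small worked example with made-up numbers (a hypothetical sample proportion of 0.25 from 400 respondents, not values from the poll). It also checks that `**0.5` agrees with `np.sqrt`.

```python
import numpy as np

# Hypothetical values for illustration only; these are not from the poll.
p_hat = 0.25
n = 400

se_power = ((p_hat * (1 - p_hat)) / n) ** 0.5   # square root via exponentiation
se_sqrt = np.sqrt(p_hat * (1 - p_hat) / n)      # square root via numpy

print(se_power, se_sqrt)
```

Either form works; the lab uses `**0.5` because it needs no extra function.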

Now, to create the graph of the normal distribution.

```python
x0, x1 = ax.get_xlim()
x_pdf = np.linspace(x0, x1, 100)
y_pdf = st.norm.pdf(x=x_pdf, loc=p_yes, scale=SE)

ax.plot(x_pdf, y_pdf, color='red', label='model')
ax.legend()
```

Most of this code is beyond the scope of this lab. The important pieces to understand are the parameters of the `st.norm.pdf` and `plot` functions. The `st.norm.pdf` function creates the y-values for the normal distribution using the mean `loc=p_yes` and the standard deviation `scale=SE`. The `ax.plot` function then draws the curve in the `color` red, to distinguish it more easily from the histogram, and `label`s it `model`.
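The reason the histogram is drawn with `stat='density'` can be checked numerically: a normal density curve encloses a total area of 1, so the histogram has to be scaled to area 1 as well before the two can share an axis. A rough sketch, using hypothetical values for the mean and standard error:

```python
import numpy as np
import scipy.stats as st

# Hypothetical centre and spread, chosen only for illustration.
mean, se = 0.25, 0.02

# Evaluate the density on a grid wide enough to cover nearly all of the curve.
x = np.linspace(mean - 5 * se, mean + 5 * se, 1001)
y = st.norm.pdf(x=x, loc=mean, scale=se)

# Approximate the area under the curve with a Riemann sum.
area = np.sum(y) * (x[1] - x[0])

print("approximate area under the curve:", area)
print("x-value where the curve peaks:", x[np.argmax(y)])
```

The area should come out very close to 1, and the peak should sit at the value passed as `loc`.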

<div class="alert alert-block alert-info">
<b>Exercise 3:</b> How do the bootstrap distribution and the mathematical model compare? If they align, why do you think they do so? If not, why are they different? <b>Hint:</b> You may need to adjust the number of bins in the histogram.</div>

The `st.norm` object can also be used to find a confidence interval.

```python
st.norm.interval(confidence=0.95, loc=p_yes, scale=SE)
```

The parameters of the `.interval` function are similar to those of the `.pdf` function. The `confidence` parameter sets the confidence level.
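For intuition about what `.interval` returns, the sketch below rebuilds the interval by hand as the point estimate plus or minus a critical value times the standard error. The numbers are hypothetical, and the confidence level is passed positionally rather than by keyword.

```python
import scipy.stats as st

# Hypothetical point estimate and standard error, for illustration only.
p_hat, se = 0.25, 0.02

# Critical value that leaves 2.5% in each tail of the standard normal.
z_star = st.norm.ppf(0.975)
by_hand = (p_hat - z_star * se, p_hat + z_star * se)

# The confidence level is the first positional argument.
from_scipy = st.norm.interval(0.95, loc=p_hat, scale=se)

print(by_hand)
print(from_scipy)
```

The two printed intervals should agree, which shows that `.interval` is a shortcut for the familiar "estimate plus or minus margin of error" calculation.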

<div class="alert alert-block alert-info">
<b>Exercise 4:</b> Compare the confidence intervals constructed from the bootstrap distribution and the mathematical model.</div>

# Inference comparing two proportions

Is there a difference in how Black and White respondents answer the question? Let's explore this question using the data set. First we will filter the data set to only include responses from Black and White people, then compute the difference in proportions.

```python
race_justice_bw = race_justice[(race_justice.race_eth == "Black") | (race_justice.race_eth == "White")]

p_black_yes = race_justice_bw.groupby('race_eth')['response'].value_counts(normalize=True)[('Black', 'Yes')]
p_white_yes = race_justice_bw.groupby('race_eth')['response'].value_counts(normalize=True)[('White', 'Yes')]
p_diff = p_black_yes - p_white_yes
print("The difference in observed proportions of Black people vs White people who believe that Blacks and Whites receive equal treatment from the police is", p_diff)
```

In the poll, the proportion of White people who think Blacks and Whites receive equal treatment from the police is 20 percentage points higher than the proportion of Black people.

<div class="alert alert-block alert-info">
<b>Exercise 5:</b> What are the hypotheses for a hypothesis test to see if there is a difference in opinions between Black and White people? Should this be a one-tailed or two-tailed test? Why?</div>

You have the tools from lab 5 to perform the hypothesis test using simulation. Most of the code from the single proportion test above can be reused to create the model. Since you are performing a hypothesis test for a difference in proportions, you will need to use $\hat{p}_{pool}$ when computing the standard error. The equations are $$\hat{p}_{pool} = \frac{\hat{p}_1 n_1 + \hat{p}_2 n_2}{n_1 + n_2}$$ and $$\text{SE} = \sqrt{\hat{p}_{pool}(1 - \hat{p}_{pool})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$ Computing these values individually will be easier.

```python
n_black = race_justice_bw['race_eth'].value_counts()['Black']
n_white = race_justice_bw['race_eth'].value_counts()['White']
p_pool = (p_black_yes * n_black + p_white_yes * n_white)/(n_black + n_white)
SE = (p_pool * (1 - p_pool) * (1/n_black + 1/n_white))**0.5
```

You can now use the same code as before to put all of the plots together.

```python
ax = sns.histplot(data=sim_difference, bins=20, label="simulated distribution", stat='density')
ax.axvline(x = p_diff, ymin = 0, ymax = 1, color = "black", linestyle = "dashed")

x0, x1 = ax.get_xlim()
x_pdf = np.linspace(x0, x1, 100)
y_pdf = st.norm.pdf(x=x_pdf, loc=0, scale=SE)

ax.plot(x_pdf, y_pdf, color='red', label='model')
ax.legend()
```
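The normal model can also turn the observed difference into a $p$-value. The sketch below uses hypothetical values for the difference and the standard error; substitute your own `p_diff` and `SE`. It reports both the one-tailed and the two-tailed area, and which one applies depends on the hypotheses you chose.

```python
import scipy.stats as st

# Hypothetical values for illustration only; substitute your own p_diff and SE.
p_diff_demo = -0.05
se_demo = 0.03

# Standardize the observed difference under the null hypothesis of no difference.
z = (p_diff_demo - 0) / se_demo

# Area in the tail beyond the observed difference, in the direction it was observed.
one_tail = st.norm.sf(abs(z))
# Area in both tails, at least as far from 0 as the observed difference.
two_tail = 2 * one_tail

print("z =", z)
print("one-tailed p-value:", one_tail)
print("two-tailed p-value:", two_tail)
```

A value computed this way can be compared with the proportion of simulated differences that fall at least as far from 0 as the observed one.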

<div class="alert alert-block alert-info">
<b>Exercise 6:</b> How do the randomized distribution and the mathematical model compare? If they align, why do you think they do so? If not, why are they different? <b>Hint:</b> You may need to adjust the number of bins in the histogram.</div>

# Inference for two-way tables

A two-way table provides an excellent way to visualize this data. The data includes four values for `race_eth` and three values for `response`. Summarizing this in a table is a clean and efficient way to consider all of the data.

```python
crosstab = pd.crosstab(race_justice['race_eth'], race_justice['response'])
crosstab
```

This table provides a way to glimpse the data as a whole. Above you saw a difference in `Yes` responses between Black and White people. Expanding on that idea, are there differences in responses based on race and ethnicity? A $X^2$ test can be performed to determine whether there is evidence of a difference. First, you will need to compute the $X^2$ statistic, whose formula is $$X^2 = \sum \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$$ Computing $X^2$ by hand is exceedingly tedious. Luckily, the **scipy** package will do this work for us.

```python
obs_chi2 = st.chi2_contingency(crosstab).statistic
```
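To make the formula concrete, the sketch below computes $X^2$ step by step for a small hypothetical table of counts (not the poll data) and compares the result with what **scipy** reports. The expected count for each cell is the row total times the column total divided by the grand total.

```python
import numpy as np
import scipy.stats as st

# Hypothetical 2x3 table of counts, for illustration only (not the poll data).
observed = np.array([[20, 30, 10],
                     [30, 20, 40]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 3)
grand_total = observed.sum()

# Expected count for each cell if rows and columns were independent.
expected = row_totals * col_totals / grand_total

# Sum of (observed - expected)^2 / expected over every cell.
chi2_by_hand = ((observed - expected) ** 2 / expected).sum()

# Tuple unpacking of the scipy result: statistic, p-value, degrees of freedom, expected counts.
chi2_scipy, _, _, expected_scipy = st.chi2_contingency(observed)

print(chi2_by_hand, chi2_scipy)
```

The two values should match, which confirms that `chi2_contingency` is applying the same formula to every cell of the table.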

It is difficult to tell from the number alone whether the observed $X^2$ value is large. There are two ways to judge it: a randomization technique that shuffles the responses, and the $X^2$ distribution. Each describes how much the statistic varies from sample to sample when there is no association, so the observed statistic can be placed on that scale to see how likely a sample like this one is. You will do both.

The randomization technique for a two-way table is the same as it is for comparing two proportions. In this case, the `race_eth` of each entry will stay the same and the `response` will be randomized.

```python
sim_survey = pd.DataFrame().assign(race_eth=race_justice['race_eth'], response=race_justice['response'].sample(frac=1, ignore_index=True))
```

This time, however, the information to record from each simulation is the $X^2$ statistic from the simulation.
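One property of this shuffle is worth checking: it changes which response is paired with which respondent, but it leaves the row totals and column totals of the two-way table untouched. The sketch below demonstrates this on a tiny hypothetical data frame (not the poll data). It uses `demo.assign(...)` to overwrite the `response` column in a copy, and a fixed `random_state` so the shuffle is repeatable.

```python
import pandas as pd

# Tiny hypothetical data frame, for illustration only (not the poll data).
demo = pd.DataFrame({
    "race_eth": ["White", "White", "White", "Black", "Black", "Hispanic"],
    "response": ["Yes", "No", "No", "No", "Not sure", "Yes"],
})

# Keep race_eth fixed and shuffle response; ignore_index resets the labels
# so the shuffled values line up with the rows by position.
shuffled = demo.assign(
    response=demo["response"].sample(frac=1, ignore_index=True, random_state=1)
)

original_table = pd.crosstab(demo["race_eth"], demo["response"])
shuffled_table = pd.crosstab(shuffled["race_eth"], shuffled["response"])

# Row totals and column totals are unchanged; only the cell counts can move.
print(original_table.sum(axis=1).equals(shuffled_table.sum(axis=1)))
print(original_table.sum(axis=0).equals(shuffled_table.sum(axis=0)))
```

Because the totals are fixed, every simulated table has the same expected counts, and only the observed cell counts vary from one shuffle to the next.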

<div class="alert alert-block alert-info">
<b>Exercise 7:</b> Create a histogram using 1,000 simulations. Describe the shape of the histogram. Does the histogram appear to follow the normal model? Why or why not?</div>

The code for graphing the $X^2$ model is essentially the same as the code for graphing the normal model. The normal model requires two parameters, a mean and a standard deviation. The only parameter for a $X^2$ distribution is the number of degrees of freedom. The degrees of freedom for a two-way table are $$df = (\text{number of rows} - 1) \times (\text{number of columns} - 1)$$ The **scipy** package can also get this information for us.

```python
deg_free = st.chi2_contingency(crosstab).dof
```
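As a quick check of the degrees of freedom formula, the sketch below applies it to a hypothetical table with four rows and three columns (the same shape as the poll's two-way table, but with made-up counts) and compares the result with the value **scipy** returns.

```python
import numpy as np
import scipy.stats as st

# Hypothetical 4x3 table of counts, for illustration only (not the poll data).
table = np.array([[12, 30, 8],
                  [25, 20, 5],
                  [10, 15, 5],
                  [9, 11, 10]])

n_rows, n_cols = table.shape
df_by_hand = (n_rows - 1) * (n_cols - 1)

# Tuple unpacking of the scipy result: statistic, p-value, degrees of freedom, expected counts.
_, _, df_scipy, _ = st.chi2_contingency(table)

print(df_by_hand, df_scipy)
```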

Combining the two distributions works exactly as before.

```python
ax = sns.histplot(data=sim_crosstab, bins=20, label="simulated distribution", stat='density')
ax.axvline(x = obs_chi2, ymin = 0, ymax = 1, color = "black", linestyle = "dashed")

x0, x1 = ax.get_xlim()
x_pdf = np.linspace(x0, x1, 100)
y_pdf = st.chi2.pdf(x=x_pdf, df=deg_free)

ax.plot(x_pdf, y_pdf, color='red', label='model')
ax.legend()
```

<div class="alert alert-block alert-info">
<b>Exercise 8:</b> How do the randomized distribution and the mathematical model compare? If they align, why do you think they do so? If not, why are they different? <b>Hint:</b> You may need to adjust the number of bins in the histogram.</div>

You can likely estimate the $p$-value visually. A numerical estimate can be computed from the simulation using the same method as before.

```python
# Proportion of simulated statistics at least as large as the observed one
p_value = len([i for i in sim_crosstab if i >= obs_chi2]) / len(sim_crosstab)
```
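The mathematical model gives a $p$-value as well: it is the area under the $X^2$ curve to the right of the observed statistic. A minimal sketch on a hypothetical table of counts (not the poll data), using `st.chi2.sf` for the upper-tail area and comparing it with the $p$-value that `chi2_contingency` reports:

```python
import numpy as np
import scipy.stats as st

# Hypothetical 2x3 table of counts, for illustration only (not the poll data).
observed = np.array([[20, 30, 10],
                     [30, 20, 40]])

# Tuple unpacking of the scipy result: statistic, p-value, degrees of freedom, expected counts.
chi2_stat, p_scipy, dof, _ = st.chi2_contingency(observed)

# Upper-tail area of the chi-square distribution beyond the observed statistic.
p_model = st.chi2.sf(chi2_stat, df=dof)

print(p_model, p_scipy)
```

With your own data, the same call with `obs_chi2` and `deg_free` gives a model-based value to set beside the simulated one.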

<div class="alert alert-block alert-info">
<b>Exercise 9:</b> Interpret these results in the context given. What does this say about how different people respond to the question "Do you think Blacks and Whites receive equal treatment from the police?"</div>

---

# Additional questions

<div class="alert alert-block alert-info">
<b>Exercise 10:</b> Reflect on what you expected to see from exercise 1 and what the statistics have now illuminated. How do these statistics help your understanding of issues of race and police brutality? What additional data would you be interested in to expand your understanding further?</div>