# Inference for numerical data

The `census` dataset is a random sample of 500 observations from the 2000 U.S. Census data. The U.S. Census is a decennial (once every 10 years) census that collects data on every resident.

In this lab you'll use the data from the sample to perform hypothesis tests on numerical data. The lab makes use of both simulations and mathematical models, so you can compare each simulated distribution, and the results drawn from it, to the model's distribution.

# Getting started

## Load packages

For this lab we will need the following packages.

```python
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as st
import pandas as pd
import seaborn as sns
```

## Creating a reproducible lab report

You will be using Jupyter notebook to create reproducible lab reports. Download the lab report template and load the template into Jupyter notebook. These templates can be used for each of the labs.

## The data

The data we are working with is in the `census.csv` file. Download the file and load it into **python** as a data frame named `census`. The `census` data frame has eight variables: `census_year`, `state_fips_code` (the name of the state), `total_family_income`, `age`, `sex`, `race_general`, `marital_status`, and `total_personal_income`. Both income variables are measured in U.S. dollars.
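As a minimal sketch of the loading step, the snippet below reads a CSV with `pd.read_csv`. So that it runs on its own, it reads from an in-memory stand-in (`csv_source`, a hypothetical name) holding two made-up rows rather than the real file. In the lab you would pass the path to your downloaded `census.csv` instead.

```python
import io
import pandas as pd

# Stand-in for census.csv: two MADE-UP rows with the eight column names listed above.
# In the lab, replace `csv_source` with the path to your file, e.g. pd.read_csv("census.csv").
csv_source = io.StringIO(
    "census_year,state_fips_code,total_family_income,age,sex,"
    "race_general,marital_status,total_personal_income\n"
    "2000,State A,14550,44,Male,Group 1,Status 1,0\n"
    "2000,State B,22800,20,Female,Group 2,Status 2,3000\n"
)

census = pd.read_csv(csv_source)

print(census.shape)    # (number of rows, number of columns)
print(census.dtypes)   # which columns were read as numbers and which as text
```

Checking `shape` and `dtypes` right after loading is a quick way to confirm that all eight variables arrived and that the income columns were read as numbers.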

<div class="alert alert-block alert-info">
<b>Exercise 1:</b> From the sample taken, what are the levels for each of the <code>sex</code>, <code>race_general</code>, and <code>marital_status</code> variables?</div>

<div class="alert alert-block alert-info">
<b>Exercise 2:</b> Some observations in the data frame have <code>0</code> listed under <code>total_personal_income</code>. Others have <code>NaN</code> listed meaning that data is not recorded for some reason. Explore the data frame and determine why some observations have <code>NaN</code> listed for the variable <code>total_personal_income</code>.</div>

# Inference with a single mean

What was the average personal income of people in the U.S. in 2000? You can compute the average personal income of the sample.

```python
ave_income = census['total_personal_income'].mean()
print("The observed average income is", ave_income)
```

This is the average for the sample, though, not the population. If you had access to the entire census, the average would be the population parameter itself. Because you only have a sample, you will need to construct a confidence interval to estimate the population parameter. You have seen all the tools necessary to bootstrap a confidence interval. This time, however, you will be collecting means instead of proportions from the simulations.

The same code as before can also be used to create the mathematical model with a few modifications. The mathematical model for distributions of sample proportions is the Normal model. For distributions of sample means, though, we use the t-distribution. The parameters necessary to create the appropriate t-distribution are the degrees of freedom $df = n - 1$, the average $\overline{x}$, and the standard error $SE = s / \sqrt{n}$, where $s$ is the sample standard deviation. Thus, the distribution can be plotted by

```python
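# `ax` is assumed to be the Matplotlib axes that already holds the plot of
# bootstrapped means, so the model is drawn over the same x-range.
x0, x1 = ax.get_xlim()
x_pdf = np.linspace(x0, x1, 100)

n = len(census.index)                                # number of rows in the data frame
se = census['total_personal_income'].std() / n**0.5  # standard error s / sqrt(n)
y_pdf = st.t.pdf(x_pdf, df=n - 1, loc=ave_income, scale=se)

ax.plot(x_pdf, y_pdf, 'r', lw=2, label='model')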
```

Once again the **scipy** library is used, though this time calling for the t-distribution instead of the normal distribution. The `st.t.pdf` function evaluates the density of the t-distribution using the parameters `df` for the degrees of freedom, `loc` for the average, and `scale` for the standard error. The attribute `census.index` holds the labels of each row in the `census` data frame. Thus, applying the `len` (length) function to it yields the number of rows, which is $n$, the sample size.
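The following sketch puts the two approaches side by side on synthetic data. The series `incomes` is a hypothetical stand-in for the income column (it contains no `NaN` values), and the percentile interval is only one common way to read a confidence interval off a bootstrap distribution. Use whichever method you used in earlier labs.

```python
import numpy as np
import pandas as pd
import scipy.stats as st

rng = np.random.default_rng(0)

# SYNTHETIC stand-in for an income column; not the census data.
incomes = pd.Series(rng.gamma(shape=2.0, scale=15000.0, size=500))

n = len(incomes)
x_bar = incomes.mean()
se = incomes.std() / n**0.5   # s / sqrt(n); pandas .std() divides by n - 1

# Bootstrap: resample n values with replacement and keep each resample's mean.
values = incomes.to_numpy()
boot_means = np.array(
    [rng.choice(values, size=n, replace=True).mean() for _ in range(1000)]
)

# 95% percentile interval from the bootstrap distribution.
boot_ci = np.percentile(boot_means, [2.5, 97.5])

# 95% interval from the t model with df = n - 1, centered at the sample mean.
model_ci = st.t.interval(0.95, df=n - 1, loc=x_bar, scale=se)

print("bootstrap interval:", boot_ci)
print("t-model interval:  ", model_ci)
```

With complete data like this, the spread of `boot_means` should be close to `se`, so the two intervals should roughly agree.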

<div class="alert alert-block alert-info">
<b>Exercise 3:</b> Construct and display both bootstrapped and model distributions for the sample mean. Compute a confidence interval using the bootstrapped distributions. Compare the bootstrapped distribution to the model distribution. Why do you think the two distributions differ as much as they do?</div>

## Cleaning data

As you saw in Exercise 2, some data for the `total_personal_income` variable is `NaN`. The code for constructing the bootstrap distribution and the code for constructing the model handle `NaN` values differently. In order to resolve this issue, you will need to "clean the data." The theory behind statistically proper data cleaning is beyond the scope of this book and these labs. The aim of this section is for you to recognize that cleaning is an important part of the process of doing statistics.

The `NaN` data are affecting your results. There are a variety of ways to address this issue. Right now, you are going to use the simplest. You will remove each data entry with `NaN`, then repeat the analysis. Luckily, there is a simple function that will do this for you.

```python
census_pi = pd.DataFrame().assign(total_personal_income=census['total_personal_income'])
census_pi_clean = census_pi.dropna(ignore_index=True)
```

Here you are creating a new data frame, `census_pi`, that only contains the variable `total_personal_income` from the `census` data frame. You are going to be editing the data frame and you don't want to change the original raw data, so creating a new data frame will preserve the original `census` data frame. It is sufficient to only include the `total_personal_income` variable because that is the only variable you are running statistics on right now. The call `census_pi.dropna(ignore_index=True)` returns a copy of `census_pi` with every row that has `NaN` as an entry removed and the remaining rows renumbered from 0. You now have the data frame `census_pi_clean` that contains all of the data from `census['total_personal_income']` with the `NaN` data removed.
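To see one way that `NaN` values can make two computations disagree, the sketch below uses a small made-up column. It is an illustration under assumed values, not the census data. It uses `dropna().reset_index(drop=True)`, which should give the same result as `dropna(ignore_index=True)` and also works on pandas versions where `dropna` lacks the `ignore_index` parameter.

```python
import numpy as np
import pandas as pd

# MADE-UP incomes with two missing values, standing in for the real column.
census_pi = pd.DataFrame(
    {"total_personal_income": [0.0, 12000.0, np.nan, 30000.0, np.nan, 18000.0]}
)
col = census_pi["total_personal_income"]

print(len(census_pi.index))    # 6: counts every row, NaN included
print(col.count())             # 4: counts only the recorded values
print(col.mean())              # 15000.0: pandas skips NaN by default
print(col.to_numpy().mean())   # nan: NumPy's mean propagates NaN

# Drop the NaN rows and renumber the remaining rows from 0.
census_pi_clean = census_pi.dropna().reset_index(drop=True)
print(len(census_pi_clean.index))   # 4
```

Note that a `0` is a recorded value and is kept, while `NaN` is dropped. After cleaning, the row count and the number of values used by `.mean()` and `.std()` agree.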

<div class="alert alert-block alert-info">
<b>Exercise 4:</b> Construct and display both bootstrapped and model distributions for the sample mean of the cleaned data set. Compute a confidence interval using the bootstrapped distributions. Compare the bootstrapped distribution to the model distribution. Do they align better than before?</div>

# Inference comparing independent means

Did men and women have different average incomes in 2000? This question can be answered using inference for comparing two independent means. The structure of the code is the same as before. When constructing the model, the t-distribution will still be used. The parameters for the t-distribution are different since you are modeling a sampling distribution for a difference of means.

<div class="alert alert-block alert-info">
<b>Using the t-distribution for a difference in means.</b>
    The t-distribution can be used for inference when working with the standardized difference of two means if
    <ul>
 <li><i>Independence</i> (extended). The data are independent within and between the two groups, e.g., the data come from independent random samples or from a randomized experiment.</li>
</ul>
    The parameters for the model are
    <ul>
    <li> The <code>loc</code> or the center of the model is $0$ for a hypothesis test and $\overline{x}_1 - \overline{x}_2$ for a confidence interval.</li>
    <li> The <code>df</code> or degrees of freedom of the model is the smaller of $n_1 - 1$ and $n_2 - 1$.</li>
    <li> The <code>scale</code> or standard error of the model is $\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$</li>
    </ul>
</div>

With this information you are able to create a hypothesis test and confidence interval.
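As a sketch of how the three parameters in the box translate into code, the example below uses two synthetic groups. The names `group_1` and `group_2` and all of their values are assumptions for illustration. For the census data you would first split the cleaned income column by group.

```python
import numpy as np
import pandas as pd
import scipy.stats as st

rng = np.random.default_rng(1)

# SYNTHETIC stand-ins for two independent groups; not the census data.
group_1 = pd.Series(rng.normal(loc=30000, scale=9000, size=60))
group_2 = pd.Series(rng.normal(loc=26000, scale=7000, size=45))

n1, n2 = len(group_1), len(group_2)
diff = group_1.mean() - group_2.mean()                  # x-bar_1 - x-bar_2
df = min(n1 - 1, n2 - 1)                                # smaller of n1 - 1 and n2 - 1
se = (group_1.var() / n1 + group_2.var() / n2) ** 0.5   # sqrt(s1^2/n1 + s2^2/n2)

# Hypothesis test: the model is centered at 0, so standardize the observed difference.
t_stat = diff / se
p_value = 2 * st.t.sf(abs(t_stat), df=df)               # two-sided p-value

# Confidence interval: the model is centered at the observed difference.
ci_low, ci_high = st.t.interval(0.95, df=df, loc=diff, scale=se)

print("t statistic:", t_stat, "p-value:", p_value)
print("95% confidence interval:", (ci_low, ci_high))
```

The only change between the test and the interval is the `loc` of the model; `df` and `scale` are shared.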

<div class="alert alert-block alert-info">
<b>Exercise 5:</b> Did men and women have different average incomes in 2000? Conduct a hypothesis test to find out. Be sure to include a confidence interval and interpret your results.</div>

<b>Note:</b> If your simulated distribution and model do not align, did you remember to clean the data?

# Inference comparing many means

Do the states have the same average family income? For this question you will need ANOVA and the F statistic. The first step is computing an F statistic for a data set (or simulated data set). Once again, **scipy** handles this nicely. Using a cleaned data frame with the name `census_f` you can compute the observed F statistic as

```python
obs_f = st.f_oneway(*[census_f[census_f.state_fips_code == i]['total_family_income'] for i in census_f['state_fips_code'].unique()]).statistic
print("The observed F value for the test is ", obs_f)
```

The function `st.f_oneway` takes each state's `'total_family_income'` values as a separate argument, and the `.statistic` attribute of its result is the F statistic. The code `*[census_f[census_f.state_fips_code == i]['total_family_income'] for i in census_f['state_fips_code'].unique()]` builds a list with one Series of `'total_family_income'` values per unique state. The leading `*` unpacks that list so that each Series is passed individually to the `st.f_oneway` function.
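The unpacking step can be checked on a toy data frame. The values below are made up, and the column names simply mirror `census_f`. The sketch shows that unpacking the list with `*` is the same as passing each group by hand, and that `groupby` is an alternative way to build the groups.

```python
import pandas as pd
import scipy.stats as st

# Toy data frame with MADE-UP values; column names mirror census_f.
toy = pd.DataFrame({
    "state_fips_code": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "total_family_income": [10.0, 12.0, 14.0, 20.0, 22.0, 24.0, 30.0, 32.0, 34.0],
})

# One Series of incomes per state, built as in the lab code.
groups = [
    toy[toy["state_fips_code"] == state]["total_family_income"]
    for state in toy["state_fips_code"].unique()
]

f_unpacked = st.f_oneway(*groups).statistic
f_explicit = st.f_oneway(groups[0], groups[1], groups[2]).statistic

# Alternative: let groupby split the income column by state.
by_state = [g for _, g in toy.groupby("state_fips_code")["total_family_income"]]
f_groupby = st.f_oneway(*by_state).statistic

print(f_unpacked, f_explicit, f_groupby)   # all three are 75 (up to rounding)
```

Here the between-group mean square is 300 and the within-group mean square is 4, which gives $F = 300 / 4 = 75$.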

Now you have the observed F statistic for the data set, $F = 1.41$. Is that high? Is that low? You can simulate to get a sense of how unusual such an F statistic is in this situation.

```python
simulated_f = []
for x in range(1000):
    sim_one_mix = pd.DataFrame().assign(total_family_income=census_f['total_family_income'], state_fips_code=census_f['state_fips_code'].sample(frac=1, ignore_index=True))
    simulated_f.append(st.f_oneway(*[sim_one_mix[sim_one_mix.state_fips_code == i]['total_family_income'] for i in sim_one_mix['state_fips_code'].unique()]).statistic)
```

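Similar to previous randomization processes, this one randomizes which state has which income, then computes the F statistic for that simulation, storing the statistic in a list. The call `sample(frac=1, ignore_index=True)` shuffles the state labels and renumbers them from 0, so they are matched to the incomes by position. This relies on `census_f` having the default row labels 0 through $n - 1$, as produced by `dropna(ignore_index=True)`. The result is a simulated distribution of F statistics that you can use to conduct a hypothesis test.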
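One way to turn such a simulated distribution into a p-value is sketched below on synthetic data. The data frame `sim_df`, the helper `f_stat`, and the use of `rng.permutation` for the shuffle are all assumptions made for this illustration. They are a variation on the lab's code, not a replacement for it.

```python
import numpy as np
import pandas as pd
import scipy.stats as st

rng = np.random.default_rng(2)

# SYNTHETIC stand-in for census_f: 4 made-up groups that share the same true mean.
sim_df = pd.DataFrame({
    "state_fips_code": np.repeat(["A", "B", "C", "D"], 25),
    "total_family_income": rng.normal(loc=50000, scale=12000, size=100),
})

def f_stat(frame):
    """Return the F statistic across the groups in `state_fips_code`."""
    groups = [g for _, g in frame.groupby("state_fips_code")["total_family_income"]]
    return st.f_oneway(*groups).statistic

obs_f = f_stat(sim_df)

simulated_f = []
for _ in range(500):
    # Shuffle the labels; a NumPy array is assigned by position, not by index.
    labels = rng.permutation(sim_df["state_fips_code"].to_numpy())
    simulated_f.append(f_stat(sim_df.assign(state_fips_code=labels)))
simulated_f = np.array(simulated_f)

# Proportion of simulated F statistics at least as large as the observed one.
p_sim = (simulated_f >= obs_f).mean()
print("observed F:", obs_f, "simulated p-value:", p_sim)
```

Only the upper tail is counted because large values of F are the ones that indicate differences among the group means.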

The construction of the model will make use of a new function in the **scipy** library, the `st.f.pdf` function. This function behaves the same way as the `st.t.pdf` and `st.norm.pdf` functions you have already seen. The important parameters for `st.f.pdf` are `dfn`, or the degrees of freedom for the numerator, and `dfd`, or the degrees of freedom for the denominator. The degrees of freedom for the numerator is $k - 1$ and the degrees of freedom for the denominator is $n - k$, where $n$ is the number of observations and $k$ is the number of individual groups (number of states present). The following code puts this together.

```python
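# `x_pdf` is assumed to be built from the plot's x-limits, as in the t-distribution code.
k = len(census_f['state_fips_code'].unique())  # number of groups (states present)
n = len(census_f.index)                        # number of observations
y_pdf = st.f.pdf(x_pdf, dfn=k - 1, dfd=n - k)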
```

With this information you are able to conduct an ANOVA hypothesis test.
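For the model side of the test, the p-value is the area under the F model to the right of the observed statistic. The sketch below uses the observed value $F = 1.41$ quoted above, but the group count `k` and sample size `n` are hypothetical placeholders. Replace them with the values from your own cleaned data frame before drawing any conclusion.

```python
import scipy.stats as st

obs_f = 1.41   # the observed F statistic quoted in the text
k = 40         # HYPOTHETICAL number of states present in the cleaned sample
n = 450        # HYPOTHETICAL number of observations in the cleaned sample

dfn, dfd = k - 1, n - k

# Upper-tail area of the F model beyond the observed statistic.
p_model = st.f.sf(obs_f, dfn=dfn, dfd=dfd)

# The same area from the cumulative distribution, as a cross-check.
p_check = 1 - st.f.cdf(obs_f, dfn=dfn, dfd=dfd)

print("model p-value:", p_model)
```

`st.f.sf` is the survival function, that is, one minus the cumulative distribution, so it returns the upper-tail area directly.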

<div class="alert alert-block alert-info">
<b>Exercise 6:</b> Do the states have the same average family income? Conduct an ANOVA test to find out.</div>

<div class="alert alert-block alert-info">
<b>Exercise 7:</b> The simulated distribution and model are fairly different. Why do you suspect that is?</div>

---

# Additional questions

<div class="alert alert-block alert-info">
<b>Exercise 8:</b> Create a new question to ask that can be answered through inference using the <code>census</code> data set. Conduct the appropriate tests to answer your question.</div>