# STAT 207 Homework 7 [25 points]

## Deeper Dive into Hypothesis Testing, Confidence Intervals, and Descriptive Analytics

Due: Friday, March 22, end of day (11:59 pm CT)

<hr>

## Imports 

Run the following code cell to import the necessary packages into the notebook.  You may import additional packages as needed for this assignment.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np

## Case Study: County Housing Information

The provided **county.csv** file contains various statistics about the population of every county in the United States.  You can read more about this dataset and its variables here:

https://www.openintro.org/data/index.php?data=county

Observational Unit: Each row contains data recorded for a county

Variables: We will focus on the `multi_unit` variable in the data, which reports the percent of housing units in each county that are in multi-unit structures (e.g. apartments).

Below, we read in the data as `df` for later analysis.

In [None]:
df = pd.read_csv('county.csv')
df.head()

In [None]:
df.shape

In [None]:
df['state'].nunique()

## 1. Deeper Dive Into p-Values and Significance Levels [3 points] 

**a)** We will start by creating a sampling distribution of the mean multi-unit housing rate for 5 randomly selected counties.  Record 5000 repetitions in your simulated sampling distribution.

In [None]:
df['multi_unit'].mean()

In [None]:
df['multi_unit'].describe()

In [None]:
df.sample(5, replace = True)['multi_unit'].mean()

In [None]:
df_samp_dist1 = []
for i in range(5000):
    mean_mult_unit = df.sample(5, replace = True)['multi_unit'].mean()
    d = {'mean_mult_unit' : mean_mult_unit}
    df_samp_dist1.append(d)

In [None]:
df_samp_dist1 = pd.DataFrame(df_samp_dist1)
df_samp_dist1.describe()

In [None]:
df_samp_dist1.hist()
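One way to sanity-check a simulated sampling distribution is to compare its spread with the theoretical standard error, $\sigma / \sqrt{n}$. The sketch below is illustrative only: it uses a made-up right-skewed population (the `population` array is an assumption, not the county data) and draws samples with replacement, as the cells above do.

```python
import numpy as np

rng = np.random.default_rng(207)

# Hypothetical right-skewed "population" standing in for df['multi_unit'].
population = rng.gamma(shape=2.0, scale=6.0, size=3000)

n, reps = 5, 5000

# Draw `reps` samples of size n with replacement and record each sample mean.
sample_means = rng.choice(population, size=(reps, n), replace=True).mean(axis=1)

# Spread of the simulated means vs. the theoretical standard error.
simulated_se = sample_means.std()
theoretical_se = population.std() / np.sqrt(n)

print(f"Simulated SE:   {simulated_se:.3f}")
print(f"Theoretical SE: {theoretical_se:.3f}")
```

With 5000 repetitions the two values should be close, and the mean of `sample_means` should sit near the population mean.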

**b)** Now, we will use the following hypotheses as the framework for our next parts:

$H_0: \mu = 12.32$

$H_a: \mu > 12.32$

We will take 500 repeated random samples of size 5 from the population of counties.  For each of these samples, calculate the corresponding p-value for the above hypotheses using the sampling distribution from **part a**.  Save each p-value for future analyses.

In [None]:
df_samp_dist1.describe()

In [None]:
sim_pval = []
for i in range(500):
    sample_mean = df['multi_unit'].sample(5, replace = True).mean()
    # Right-tailed p-value: share of simulated means at least as large as the observed mean
    p_val = (df_samp_dist1 >= sample_mean).mean().iloc[0]
    d = {"p_val" : p_val}
    sim_pval.append(d)

In [None]:
sim_pval = pd.DataFrame(sim_pval)
sim_pval

In [None]:
sim_pval.hist()

In [None]:
sim_pval.describe()

In [None]:
(sim_pval < 0.05).mean()

In [None]:
(sim_pval < 0.2).mean()

In [None]:
(sim_pval < 0.5).mean()
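When the null hypothesis is true, p-values are approximately uniform on [0, 1], so the share of p-values below a significance level $\alpha$ should be close to $\alpha$ itself. The following is a minimal sketch of that idea under stated assumptions: the `population` array is synthetic (not the county data), and its own mean plays the role of the null value.

```python
import numpy as np

rng = np.random.default_rng(207)

# Hypothetical population; the null is true because every sample comes from it.
population = rng.gamma(shape=2.0, scale=6.0, size=3000)
n = 5

# Null sampling distribution of the sample mean (5000 repetitions).
null_means = rng.choice(population, size=(5000, n), replace=True).mean(axis=1)

# 500 new sample means drawn from the same population.
new_means = rng.choice(population, size=(500, n), replace=True).mean(axis=1)

# Right-tailed p-value: share of null means at or above each observed mean.
p_vals = (null_means[None, :] >= new_means[:, None]).mean(axis=1)

for alpha in (0.05, 0.2, 0.5):
    print(f"alpha = {alpha}: rejection rate = {(p_vals < alpha).mean():.3f}")
```

Each printed rejection rate estimates the Type I error rate at that significance level.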

**c)** Finally, suppose that we consider Champaign and its four neighboring counties (Vermilion, Ford, Piatt, and Douglas) as a sample of counties from the US.  First, calculate the sample mean multi-unit rate of these five counties from Illinois.  Then, using your simulated sampling distribution, calculate the p-value based on these five counties.

**Tip**: You may want to review Homework 5.  We can use the **`&`** ("and") operator to indicate that we want **both** conditions on either side of the operator to be met.  We can use the **`|`** ("or") operator to indicate that we want **at least one** of the conditions to be met.  We can also chain these operators together if we need to represent more complex operations.
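As a small illustration of chaining these operators, the sketch below filters a made-up table (the `toy` DataFrame and its values are hypothetical). It also shows `Series.isin`, which expresses a long chain of "or" comparisons on one column more compactly.

```python
import pandas as pd

# Small made-up table; the values are illustrative only.
toy = pd.DataFrame({
    "state": ["Illinois", "Illinois", "Indiana", "Illinois"],
    "name": ["Champaign County", "Cook County", "Douglas County", "Douglas County"],
    "multi_unit": [10.0, 20.0, 30.0, 40.0],
})

# Each comparison needs its own parentheses because & and | bind more tightly than ==.
mask_chained = (toy["state"] == "Illinois") & (
    (toy["name"] == "Champaign County") | (toy["name"] == "Douglas County")
)

# Series.isin expresses the same "or" over a list of names.
mask_isin = (toy["state"] == "Illinois") & toy["name"].isin(
    ["Champaign County", "Douglas County"]
)

print(toy[mask_chained])
print(mask_chained.equals(mask_isin))
```

Note that the Indiana row named "Douglas County" is excluded, which is why the state condition matters when county names repeat across states.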

In [None]:
neighbors = ["Champaign County", "Vermilion County", "Ford County", "Piatt County", "Douglas County"]
df_counties = df[(df['state'] == "Illinois") & df['name'].isin(neighbors)]
df_counties

In [None]:
df_counties['multi_unit'].mean()

In [None]:
# Use the computed sample mean rather than a hard-coded, rounded value
(df_samp_dist1 > df_counties['multi_unit'].mean()).mean().iloc[0]

## Case Study: Kitchen Prep Time

The `food_prep.csv` file contains information about how much time a sample of American adults spent preparing food and drink (in minutes) in the last 24 hours.  The data has already been cleaned, so you don't need to worry about cleaning the data before analyzing it.

## 2. A Confidence Interval [1 point]

Read in the `food_prep.csv` file.  Then, generate a sampling distribution for the **median** time spent preparing food and drink by all American adults.  Finally, find the 90% confidence interval using your sampling distribution.

In [None]:
df_food = pd.read_csv("food_prep.csv")
df_food.head()

In [None]:
df_food.median().iloc[0]

In [None]:
df_food.shape

In [None]:
food_samp_dist = []
for i in range(5000):
    # .median() on a DataFrame returns a Series; .iloc[0] extracts the scalar median
    median_time = df_food.sample(400, replace = True).median().iloc[0]
    d = {'median_time' : median_time}
    food_samp_dist.append(d)

In [None]:
food_samp_dist = pd.DataFrame(food_samp_dist)
food_samp_dist.describe()

In [None]:
food_samp_dist.head()

In [None]:
food_samp_dist.hist()

In [None]:
print(f"Lower Bound: {np.quantile(food_samp_dist, .05)}")
print(f"Upper Bound: {np.quantile(food_samp_dist, .95)}")
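The percentile approach used above can be sketched on its own. The example below is a minimal illustration under stated assumptions: the `sample` array is synthetic (not the `food_prep.csv` data), and each resample keeps the original sample size, which is the usual bootstrap convention.

```python
import numpy as np

rng = np.random.default_rng(207)

# Hypothetical sample of prep times in minutes (not the real data).
sample = rng.exponential(scale=30.0, size=400)

# Resample the observed data with replacement, keeping the original sample size.
boot_medians = np.array([
    np.median(rng.choice(sample, size=sample.size, replace=True))
    for _ in range(5000)
])

# A 90% interval leaves 5% in each tail.
lower, upper = np.quantile(boot_medians, [0.05, 0.95])
print(f"90% bootstrap CI for the median: ({lower:.2f}, {upper:.2f})")
```

For a different confidence level, only the two quantiles change: a 95% interval would use 0.025 and 0.975.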

In [None]:
df_food.median().iloc[0]

In [None]:
sns.violinplot(data = df_food)

## Case Study: Who Dislikes Superbowl Ads?

We'll return to the dataset of Superbowl ads that we previously explored in Homework 4.  Below, the data is read in and prepared for you.  Note: we'll use the log of the dislike count for this question (both GitHub and Gradescope).

In [None]:
df_superbowl = pd.read_csv('superbowl_ads.csv')
df_superbowl['log_dislike'] = np.log(df_superbowl['dislike_count'] + 1)
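Adding 1 before taking the log keeps a count of zero defined, since $\log(0 + 1) = 0$ while $\log(0)$ is undefined. A brief sketch with made-up counts (the `counts` array is hypothetical) shows this, along with NumPy's equivalent `np.log1p`:

```python
import numpy as np

# Made-up dislike counts, including a zero.
counts = np.array([0, 1, 9, 99])

# Shift by 1 so that a count of 0 maps to log(1) = 0.
shifted_log = np.log(counts + 1)
print(shifted_log)

# np.log1p(x) computes log(1 + x) directly.
print(np.allclose(shifted_log, np.log1p(counts)))
```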

In [None]:
df_superbowl.head()

## 3. Descriptive Analytics [1 point]

Generate one set of numerical summaries for the log of the dislike count of Superbowl ads based on whether the ad includes an animal and whether the ad is funny.  Do so in one line of code for full credit.

In [None]:
df_superbowl.columns

In [None]:
df_superbowl[['log_dislike', 'animals', 'funny']].groupby(['animals', 'funny']).describe().reset_index()
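To see what a two-key `groupby(...).describe()` produces, here is a small sketch on made-up data (the `toy_ads` DataFrame and its values are hypothetical, and the real columns may be coded differently). Each row of the result corresponds to one combination of the two grouping variables.

```python
import pandas as pd

# Made-up ads; the values are illustrative only.
toy_ads = pd.DataFrame({
    "animals": [True, True, False, False, True, False],
    "funny":   [True, False, True, False, True, True],
    "log_dislike": [1.0, 2.0, 3.0, 4.0, 3.0, 5.0],
})

# One row per (animals, funny) combination; columns are count, mean, std, and quantiles.
summary = toy_ads.groupby(["animals", "funny"])["log_dislike"].describe()
print(summary)
```

Selecting the `log_dislike` column before calling `describe()` yields flat column names rather than a two-level column header.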

In [None]:
sns.boxplot(x = 'animals', y = 'log_dislike', hue = 'use_sex', data = df_superbowl)

In [None]:
sns.boxplot(x = 'animals', y = 'log_dislike', data = df_superbowl)

In [None]:
sns.boxplot(x = 'funny', y = 'log_dislike', hue = 'animals', data = df_superbowl)

Remember to keep all your cells and hit the save icon above periodically to checkpoint (save) your results on your local computer. Once you are satisfied with your results, restart the kernel and run all cells (Kernel -> Restart & Run All). **Make sure nothing has changed**. Checkpoint and exit (File -> Save and Checkpoint + File -> Close and Halt). Follow the instructions on the Homework 7 Canvas Assignment to submit your notebook to GitHub.