# Mini Project 5-5 Explore Hypothesis Testing

## Introduction

You work for an environmental think tank called Repair Our Air (ROA). ROA is formulating policy recommendations to improve the air quality in America, using the Environmental Protection Agency's Air Quality Index (AQI) to guide their decision making. An AQI value close to 0 signals "little to no" public health concern, while higher values are associated with increased risk to public health. 

They've tasked you with leveraging AQI data to help them prioritize their strategy for improving air quality in America.

ROA is considering the following decisions. For each, construct a hypothesis test and an accompanying visualization, using your results of that test to make a recommendation:

1. ROA is considering a metropolitan-focused approach. Within California, they want to know if the mean AQI in Los Angeles County is statistically different from the rest of California.
2. With limited resources, ROA has to choose between New York and Ohio for their next regional office. Does New York have a lower AQI than Ohio?
3. A new policy will affect those states with a mean AQI of 10 or greater. Will Michigan be affected by this new policy?

**Notes:**
1. For your analysis, you'll default to a 5% level of significance.
2. Throughout the lab, for two-sample t-tests, use Welch's t-test (i.e., setting the `equal_var` parameter to `False` in `scipy.stats.ttest_ind()`). This will account for the possibly unequal variances between the two groups in the comparison.
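
To make the `equal_var=False` setting concrete, here is a minimal sketch of the two quantities Welch's t-test is built on, written with the standard library only. The sample values and the helper name `welch_t` are hypothetical and chosen purely for illustration; in the lab itself, `scipy.stats.ttest_ind()` does this work and also returns the p-value.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t-statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Sample variances (n - 1 in the denominator), kept separate for each group
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    # Each group's variance divided by its own sample size (no pooling)
    vn_a, vn_b = var_a / n_a, var_b / n_b
    mean_diff = statistics.mean(sample_a) - statistics.mean(sample_b)
    t_stat = mean_diff / math.sqrt(vn_a + vn_b)
    df = (vn_a + vn_b) ** 2 / (vn_a ** 2 / (n_a - 1) + vn_b ** 2 / (n_b - 1))
    return t_stat, df


# Hypothetical data: two groups with different sizes and spreads
group_a = [2, 4, 6, 8, 10]
group_b = [1, 2, 3]
t_stat, df = welch_t(group_a, group_b)
print(round(t_stat, 4), round(df, 4))
```

Because the variances are not pooled, the degrees of freedom are generally not a whole number and are smaller than `n_a + n_b - 2` when the groups differ in spread.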

## Step 1: Imports

To proceed with your analysis, import `pandas` and `numpy`. To conduct your hypothesis testing, import `stats` from `scipy`.

#### Import Packages

In [1]:
# Import relevant packages
import pandas as pd
import numpy as np
from scipy import stats

You are also provided with a dataset with national Air Quality Index (AQI) measurements by state over time for this analysis. `Pandas` was used to import the file `c4_epa_air_quality.csv` as a dataframe named `aqi`. As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

**Note:** For purposes of your analysis, you can assume this data is randomly sampled from a larger population.

#### Load Dataset

In [2]:
# IMPORT YOUR DATA
aqi=pd.read_csv("c4_epa_air_quality.csv")

## Step 2: Data Exploration

### Before proceeding to your deliverables, explore your datasets.

Use the following space to surface descriptive statistics about your data. In particular, explore whether you believe the research questions you were given are readily answerable with this data.

In [3]:
# Use head() to show a sample of data
aqi.head(10)

| | Unnamed: 0 | date_local | state_name | county_name | city_name | local_site_name | parameter_name | units_of_measure | arithmetic_mean | aqi |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2018-01-01 | Arizona | Maricopa | Buckeye | BUCKEYE | Carbon monoxide | Parts per million | 0.473684 | 7 |
| 1 | 1 | 2018-01-01 | Ohio | Belmont | Shadyside | Shadyside | Carbon monoxide | Parts per million | 0.263158 | 5 |
| 2 | 2 | 2018-01-01 | Wyoming | Teton | Not in a city | Yellowstone National Park - Old Faithful Snow ... | Carbon monoxide | Parts per million | 0.111111 | 2 |
| 3 | 3 | 2018-01-01 | Pennsylvania | Philadelphia | Philadelphia | North East Waste (NEW) | Carbon monoxide | Parts per million | 0.3 | 3 |
| 4 | 4 | 2018-01-01 | Iowa | Polk | Des Moines | CARPENTER | Carbon monoxide | Parts per million | 0.215789 | 3 |
| 5 | 5 | 2018-01-01 | Hawaii | Honolulu | Not in a city | Kapolei | Carbon monoxide | Parts per million | 0.994737 | 14 |
| 6 | 6 | 2018-01-01 | Hawaii | Honolulu | Not in a city | Kapolei | Carbon monoxide | Parts per million | 0.2 | 2 |
| 7 | 7 | 2018-01-01 | Pennsylvania | Erie | Erie | | Carbon monoxide | Parts per million | 0.2 | 2 |
| 8 | 8 | 2018-01-01 | Hawaii | Honolulu | Honolulu | Honolulu | Carbon monoxide | Parts per million | 0.4 | 5 |
| 9 | 9 | 2018-01-01 | Colorado | Larimer | Fort Collins | Fort Collins - CSU - S. Mason | Carbon monoxide | Parts per million | 0.3 | 6 |


In [4]:
# Check variables and data types
aqi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 260 entries, 0 to 259
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0        260 non-null    int64  
 1   date_local        260 non-null    object 
 2   state_name        260 non-null    object 
 3   county_name       260 non-null    object 
 4   city_name         260 non-null    object 
 5   local_site_name   257 non-null    object 
 6   parameter_name    260 non-null    object 
 7   units_of_measure  260 non-null    object 
 8   arithmetic_mean   260 non-null    float64
 9   aqi               260 non-null    int64  
dtypes: float64(1), int64(2), object(7)
memory usage: 20.4+ KB


In [5]:
# Use describe() to summarize AQI
aqi.describe()

| | Unnamed: 0 | arithmetic_mean | aqi |
|---|---|---|---|
| count | 260.0 | 260.0 | 260.0 |
| mean | 129.5 | 0.403169 | 6.757692 |
| std | 75.199734 | 0.317902 | 7.061707 |
| min | 0.0 | 0.0 | 0.0 |
| 25% | 64.75 | 0.2 | 2.0 |
| 50% | 129.5 | 0.276315 | 5.0 |
| 75% | 194.25 | 0.516009 | 9.0 |
| max | 259.0 | 1.921053 | 50.0 |


In [6]:
# For a more thorough examination of observations by state, use value_counts()
aqi['state_name'].value_counts()

state_name
California              66
Arizona                 14
Ohio                    12
Florida                 12
Texas                   10
New York                10
Pennsylvania            10
Michigan                 9
Colorado                 9
Minnesota                7
New Jersey               6
Indiana                  5
North Carolina           4
Massachusetts            4
Maryland                 4
Oklahoma                 4
Virginia                 4
Nevada                   4
Connecticut              4
Kentucky                 3
Missouri                 3
Wyoming                  3
Iowa                     3
Hawaii                   3
Utah                     3
Vermont                  3
Illinois                 3
New Hampshire            2
District Of Columbia     2
New Mexico               2
Montana                  2
Oregon                   2
Alaska                   2
Georgia                  2
Washington               2
Idaho                    2
Nebraska         

#### **Question 1: From the preceding data exploration, what do you recognize?**

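A: California has disproportionately more observations (66) than any other state; the next largest, Arizona, has only 14. This may be because California has more monitoring sites. Many states have fewer than five observations, so any comparison involving those states would rest on a very small sample.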

## Step 3. Statistical Tests

Before you proceed, recall the following steps for conducting hypothesis testing:

1. Formulate the null hypothesis and the alternative hypothesis.<br>
2. Set the significance level.<br>
3. Determine the appropriate test procedure.<br>
4. Compute the p-value.<br>
5. Draw your conclusion.
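
The last two steps reduce to comparing the p-value with the significance level. The following is a small sketch of that decision rule; the function name `decide` and the example p-values are hypothetical and are not part of the lab.

```python
def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 only when p_value < alpha."""
    if not 0 <= p_value <= 1:
        raise ValueError("p_value must be between 0 and 1")
    if p_value < alpha:
        return "reject H0"
    # A large p-value is not evidence that H0 is true,
    # only a lack of evidence against it
    return "fail to reject H0"


# Hypothetical p-values for illustration
print(decide(0.03))
print(decide(0.20))
```

Note the wording of the second outcome: the test never "accepts" the null hypothesis, it only fails to reject it.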

### Hypothesis 1: ROA is considering a metropolitan-focused approach. Within California, they want to know if the mean AQI in Los Angeles County is statistically different from the rest of California.

Before proceeding with your analysis, it will be helpful to subset the data for your comparison.

In [15]:
# Create dataframes for each sample being compared in your test
california = aqi[aqi["state_name"] == "California"]
# Subset from the California rows so both samples come from the same state
losangeles = california[california["county_name"] == "Los Angeles"]
rest = california[california["county_name"] != "Los Angeles"]

In [16]:
# Check head
print(rest.head(10))
print(losangeles.head(10))

    Unnamed: 0  date_local  state_name     county_name      city_name  \
16          16  2018-01-01  California  San Bernardino        Ontario   
18          18  2018-01-01  California      Sacramento   Arden-Arcade   
26          26  2018-01-01  California          Orange       La Habra   
27          27  2018-01-01  California         Alameda  Not in a city   
34          34  2018-01-01  California          Fresno         Fresno   
40          40  2018-01-01  California       San Mateo   Redwood City   
43          43  2018-01-01  California    Contra Costa        Concord   
45          45  2018-01-01  California           Butte          Chico   
46          46  2018-01-01  California       Riverside      Mira Loma   
58          58  2018-01-01  California            Kern          Arvin   

                 local_site_name   parameter_name   units_of_measure  \
16  Ontario Near Road (Etiwanda)  Carbon monoxide  Parts per million   
18     Sacramento-Del Paso Manor  Carbon monoxide  P

#### Formulate your hypothesis:

**Formulate your null and alternative hypotheses:**

*   $H_0$: There is no difference in the mean AQI between Los Angeles County and the rest of California.
*   $H_A$: There is a difference in the mean AQI between Los Angeles County and the rest of California.


#### Set the significance level:

In [14]:
# For this analysis, the significance level is 5%
significance_level = 0.05

#### Determine the appropriate test procedure:

Here, you are comparing the sample means between two independent samples. Therefore, you will utilize a **two-sample  𝑡-test**.

#### Compute the P-value

In [20]:
# Compare the sample means before computing the p-value
print(losangeles["aqi"].mean())
print(rest["aqi"].mean())

16.285714285714285
11.0


#### **Question 2. What is your P-value for hypothesis 1, and what does this indicate for your null hypothesis?**

In [19]:
# Run Welch's t-test and extract the t-statistic and p-value
t_stat, p_value = stats.ttest_ind(losangeles['aqi'], rest['aqi'], equal_var=False)
print(t_stat)
print(p_value)

2.1107010796372014
0.049839056842410995


A: The p-value for hypothesis 1 is about 0.0498, which is just below the 5% significance level, so the null hypothesis is rejected: there is a statistically significant difference in mean AQI between Los Angeles County and the rest of California. The sample means (16.29 versus 11.0) show that Los Angeles County's mean AQI is the higher of the two. Because the p-value is so close to 0.05, this result should be treated as borderline.

### Hypothesis 2: With limited resources, ROA has to choose between New York and Ohio for their next regional office. Does New York have a lower AQI than Ohio?

Before proceeding with your analysis, it will be helpful to subset the data for your comparison.

In [21]:
# Create dataframes for each sample being compared in your test
newyork=aqi[aqi['state_name']=='New York']
ohio=aqi[aqi['state_name']=='Ohio']

In [22]:
# Check head
print(newyork.head(10))
print(ohio.head(10))

     Unnamed: 0  date_local state_name county_name      city_name  \
90           90  2018-01-01   New York        Erie    Cheektowaga   
113         113  2018-01-01   New York       Bronx       New York   
124         124  2018-01-01   New York      Monroe      Rochester   
167         167  2018-01-01   New York    New York       New York   
173         173  2018-01-01   New York      Queens       New York   
182         182  2018-01-01   New York      Queens       New York   
184         184  2018-01-01   New York     Steuben  Not in a city   
195         195  2018-01-01   New York        Erie        Buffalo   
196         196  2018-01-01   New York      Monroe      Rochester   
234         234  2018-01-01   New York      Albany         Albany   

              local_site_name   parameter_name   units_of_measure  \
90          Buffalo Near-Road  Carbon monoxide  Parts per million   
113           PFIZER LAB SITE  Carbon monoxide  Parts per million   
124               ROCHESTER 2  Ca

**Formulate your null and alternative hypotheses:**

*   $H_0$: The mean AQI of New York is greater than or equal to that of Ohio.
*   $H_A$: The mean AQI of New York is **below** that of Ohio.


#### Significance Level (remains at 5%)

#### Determine the appropriate test procedure:

Here, you are comparing the sample means between two independent samples in one direction. Therefore, you will utilize a **two-sample  𝑡-test**.
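
As a rough illustration of what testing "in one direction" changes, the sketch below runs the same Welch's t-test with a two-sided and with one-sided alternatives. The two samples are made up for illustration and are not drawn from the AQI data; the `alternative` parameter assumes SciPy 1.6 or later.

```python
from scipy import stats

# Hypothetical samples: sample_x tends to be lower than sample_y
sample_x = [2, 3, 1, 4, 2, 3]
sample_y = [4, 5, 3, 6, 4, 5]

# Same Welch's t-test, three different alternative hypotheses
two_sided = stats.ttest_ind(sample_x, sample_y, equal_var=False)
less = stats.ttest_ind(sample_x, sample_y, equal_var=False, alternative='less')
greater = stats.ttest_ind(sample_x, sample_y, equal_var=False, alternative='greater')

# The t-statistic is identical; only the tail used for the p-value differs
print(two_sided.statistic, less.statistic)
print(two_sided.pvalue, less.pvalue, greater.pvalue)
```

When the t-statistic is negative, the `'less'` p-value is half the two-sided p-value, and the `'less'` and `'greater'` p-values sum to 1. The order of the samples matters: `alternative='less'` tests whether the mean of the first sample is less than the mean of the second.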

#### Compute the P-value

In [26]:
# Compare the sample means before computing the p-value
print(newyork['aqi'].mean())
print(ohio['aqi'].mean())

2.5
3.3333333333333335


#### **Question 3. What is your P-value for hypothesis 2, and what does this indicate for your null hypothesis?**

In [25]:
# One-sided Welch's t-test: is New York's mean AQI less than Ohio's?
t_stat, p_value = stats.ttest_ind(newyork['aqi'], ohio['aqi'], equal_var=False, alternative='less')
print(t_stat)
print(p_value)

-2.025951038880333
0.03044650269193468


A: The p-value of the test is about 0.030, which is smaller than 0.05, so the null hypothesis is rejected at the 5% level. The mean AQI of New York is statistically significantly lower than that of Ohio.

###  Hypothesis 3: A new policy will affect those states with a mean AQI of 10 or greater. Will Michigan be affected by this new policy?

Before proceeding with your analysis, it will be helpful to subset the data for your comparison.

In [29]:
# Create dataframes for each sample being compared in your test
michigan=aqi[aqi['state_name']=='Michigan']
print(michigan['aqi'].mean())

8.11111111111111


**Formulate your null and alternative hypotheses here:**

*   $H_0$: The mean AQI of Michigan is less than or equal to 10.
*   $H_A$: The mean AQI of Michigan is greater than 10.


#### Significance Level (remains at 5%)

#### Determine the appropriate test procedure:

Here, you are comparing one sample mean relative to a particular value in one direction. Therefore, you will utilize a **one-sample  𝑡-test**. 
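
For reference, the one-sample t-statistic is the distance between the sample mean and the hypothesized value, measured in standard errors. A minimal standard-library sketch follows; the data and the name `one_sample_t` are hypothetical, and `scipy.stats.ttest_1samp()` is what supplies the p-value in the lab.

```python
import math
import statistics


def one_sample_t(sample, popmean):
    """Return the one-sample t-statistic and its degrees of freedom."""
    n = len(sample)
    # Standard error of the mean: sample standard deviation over sqrt(n)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    t_stat = (statistics.mean(sample) - popmean) / std_err
    return t_stat, n - 1


# Hypothetical AQI-like readings tested against a threshold of 10
readings = [6, 8, 10, 12, 9]
t_stat, df = one_sample_t(readings, 10)
print(t_stat, df)
```

A negative t-statistic means the sample mean is below the hypothesized value, so a test with `alternative='greater'` will return a p-value above 0.5.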

#### Compute the P-value

In [28]:
# Compute your p-value here
t_stat, p_value = stats.ttest_1samp(michigan['aqi'], 10, alternative='greater')
print(t_stat)
print(p_value)

-1.7395913343286131
0.9399405193140109


#### **Question 4. What is your P-value for hypothesis 3, and what does this indicate for your null hypothesis?**

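A: The p-value of the test is about 0.94, which is larger than 0.05, so the null hypothesis cannot be rejected at the 5% level. There is no evidence that the mean AQI of Michigan is greater than 10, so the data do not indicate that this policy would affect Michigan. Failing to reject the null hypothesis does not prove that the mean is at or below 10; it only means the sample provides no evidence against it.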

## Step 4. Results and Evaluation

Now that you've completed your statistical tests, you can consider your hypotheses and the results you gathered.

#### **Question 5. Did your results show that the AQI in Los Angeles County was statistically different from the rest of California?**

A: Yes. The test shows a statistically significant difference, with the mean AQI in Los Angeles County higher than in the rest of California.

#### **Question 6. Did New York or Ohio have a lower AQI?**

A: New York. Its mean AQI is statistically significantly lower than Ohio's.

#### **Question 7: Will Michigan be affected by the new policy impacting states with a mean AQI of 10 or greater?**



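A: Probably not. Michigan's sample mean AQI is 8.11, and the test found no evidence that its mean AQI is greater than 10 (p-value of about 0.94), so the data do not indicate that Michigan would be affected by the new policy.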

# Conclusion

**What are key takeaways from this project?**

A: This project taught me how to conduct hypothesis tests step by step, from setting the null and alternative hypotheses to interpreting p-values and t-statistics.

**What would you consider presenting to your manager as part of your findings?**

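A: Los Angeles County has a statistically significantly higher mean AQI than the rest of California, although the p-value (about 0.0498) is only marginally below 0.05. New York has a statistically significantly lower mean AQI than Ohio. There is no evidence that Michigan's mean AQI is greater than 10. The test procedures, t-statistics, and p-values behind these findings could be included as supporting detail.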


**What would you convey to external readers?**

A:

*   Los Angeles County has a statistically significantly higher mean AQI than the rest of California, so a metropolitan-focused approach could be adopted.
*   New York has a statistically significantly lower mean AQI than Ohio, so ROA might choose New York for its next regional office.
*   There is no evidence that Michigan's mean AQI is greater than 10, so the data do not indicate that the new policy would affect Michigan.
