# Hypothesis testing

### Introduction

In this notebook, we are working in collaboration with an environmental think tank called Repair Our Air (ROA). ROA is formulating policy recommendations to improve the air quality in America, using the Environmental Protection Agency's Air Quality Index (AQI) to guide their decision making. An AQI value close to 0 signals "little to no" public health concern, while higher values are associated with increased risk to public health.

They've tasked us with leveraging AQI data to help them prioritize their strategy for improving air quality in America.

ROA is considering the following decisions. For each, we will construct a hypothesis test and an accompanying visualization, using the results of that test to make a recommendation:

1. ROA is considering a metropolitan-focused approach. Within California, they want to know if the mean AQI in Los Angeles County is statistically different from the rest of California.
2. With limited resources, ROA has to choose between New York and Ohio for their next regional office. Does New York have a lower AQI than Ohio?
3. A new policy will affect those states with a mean AQI of 10 or greater. Can we rule out Michigan from being affected by this new policy?

**Notes:**
1. For the analysis, we'll default to a 5% level of significance.
2. Throughout the lab, for two-sample t-tests, we'll use Welch's t-test (i.e., setting the `equal_var` parameter to `False` in `scipy.stats.ttest_ind()`). This will account for the possibly unequal variances between the two groups in the comparison.
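
To see what `equal_var=False` changes, the sketch below runs both the pooled (Student's) and Welch's versions of the test on two small made-up samples with clearly different spreads, and recomputes Welch's statistic by hand. The arrays `group_a` and `group_b` are hypothetical and are not taken from the AQI data.

```python
import numpy as np
from scipy import stats

# Hypothetical samples (not from the AQI dataset): similar centers, very different spreads.
group_a = np.array([10.0, 11.0, 9.0, 10.5, 9.5, 10.0])
group_b = np.array([4.0, 18.0, 7.0, 15.0, 2.0, 20.0, 11.0, 9.0])

# Student's t-test pools the two sample variances into a single estimate.
pooled = stats.ttest_ind(group_a, group_b, equal_var=True)

# Welch's t-test keeps separate variance estimates and adjusts the degrees of freedom.
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

# Welch's statistic by hand: difference in means over the unpooled standard error.
se = np.sqrt(group_a.var(ddof=1) / group_a.size + group_b.var(ddof=1) / group_b.size)
t_manual = (group_a.mean() - group_b.mean()) / se

print(f"pooled: t={pooled.statistic:.3f}, p={pooled.pvalue:.3f}")
print(f"welch:  t={welch.statistic:.3f}, p={welch.pvalue:.3f}")
print(f"manual Welch t={t_manual:.3f}")
```

The hand-computed statistic should agree with the `equal_var=False` result, while the pooled version generally gives a different statistic and p-value when the group sizes and variances differ.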

In [1]:
# Import relevant libraries, packages, and modules.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

In [139]:
# Import the dataset.
aqi = pd.read_csv(r'C:\Users\user\Desktop\Course 4\c4_epa_air_quality.csv', index_col=0)

## Statistical Tests
1. Formulate the null hypothesis and the alternative hypothesis.
2. Set the significance level.
3. Determine the appropriate test procedure.
4. Compute the p-value.
5. Draw your conclusion.
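
Step 5 comes down to comparing the p-value from step 4 with the significance level from step 2. A minimal sketch of that decision rule follows; the helper name `decide` is hypothetical and is not part of the original notebook.

```python
def decide(p_value: float, significance_level: float = 0.05) -> str:
    """Return the test decision for a given p-value and significance level."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if p_value < significance_level:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


# Illustrative p-values, one below and one above the 5% level.
print(decide(0.03))
print(decide(0.94))
```

Note that the rule only ever rejects or fails to reject the null hypothesis; it never "accepts" it.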

### Hypothesis 1:
`ROA is considering a metropolitan-focused approach. Within California, they want to know if the mean AQI in Los Angeles County is statistically different from the rest of California.`

In [149]:
# Create dataframes for each sample being compared in the test.
ca_la = aqi[aqi['county_name'] == 'Los Angeles']
ca_others = aqi[(aqi['state_name'] == 'California') & (aqi['county_name'] != 'Los Angeles')]

In [150]:
ca_la.shape

(14, 9)

In [151]:
ca_others.shape

(52, 9)

In [155]:
ca_la

Unnamed: 0,date_local,state_name,county_name,city_name,local_site_name,parameter_name,units_of_measure,arithmetic_mean,aqi
33,2018-01-01,California,Los Angeles,Lancaster,Lancaster-Division Street,Carbon monoxide,Parts per million,0.394737,7
42,2018-01-01,California,Los Angeles,Santa Clarita,Santa Clarita,Carbon monoxide,Parts per million,0.394737,7
61,2018-01-01,California,Los Angeles,Pasadena,Pasadena,Carbon monoxide,Parts per million,0.789474,16
76,2018-01-01,California,Los Angeles,Los Angeles,LAX Hastings,Carbon monoxide,Parts per million,0.863158,17
109,2018-01-01,California,Los Angeles,Los Angeles,Los Angeles-North Main Street,Carbon monoxide,Parts per million,0.994737,17
110,2018-01-01,California,Los Angeles,Los Angeles,Los Angeles-North Main Street,Carbon monoxide,Parts per million,0.9,16
119,2018-01-01,California,Los Angeles,Reseda,Reseda,Carbon monoxide,Parts per million,1.015789,19
132,2018-01-01,California,Los Angeles,Compton,Compton,Carbon monoxide,Parts per million,1.742105,40
163,2018-01-01,California,Los Angeles,Azusa,Azusa,Carbon monoxide,Parts per million,0.673684,10
172,2018-01-01,California,Los Angeles,Pico Rivera,Pico Rivera #2,Carbon monoxide,Parts per million,1.047368,18


In [154]:
ca_others

Unnamed: 0,date_local,state_name,county_name,city_name,local_site_name,parameter_name,units_of_measure,arithmetic_mean,aqi
16,2018-01-01,California,San Bernardino,Ontario,Ontario Near Road (Etiwanda),Carbon monoxide,Parts per million,0.747368,11
18,2018-01-01,California,Sacramento,Arden-Arcade,Sacramento-Del Paso Manor,Carbon monoxide,Parts per million,0.752632,16
26,2018-01-01,California,Orange,La Habra,La Habra,Carbon monoxide,Parts per million,0.673684,13
27,2018-01-01,California,Alameda,Not in a city,Berkeley- Aquatic Park,Carbon monoxide,Parts per million,1.088889,15
34,2018-01-01,California,Fresno,Fresno,Fresno - Garland,Carbon monoxide,Parts per million,1.0,15
40,2018-01-01,California,San Mateo,Redwood City,Redwood City,Carbon monoxide,Parts per million,0.672222,10
43,2018-01-01,California,Contra Costa,Concord,Concord,Carbon monoxide,Parts per million,0.294444,5
45,2018-01-01,California,Butte,Chico,Chico-East Avenue,Carbon monoxide,Parts per million,0.466667,9
46,2018-01-01,California,Riverside,Mira Loma,Mira Loma (Van Buren),Carbon monoxide,Parts per million,1.231579,27
58,2018-01-01,California,Kern,Arvin,Arvin-Di Giorgio,Carbon monoxide,Parts per million,0.278947,3


#### Formulate the null and alternative hypotheses:

*   $H_0$: There is no difference in the mean AQI between Los Angeles County and the rest of California.
*   $H_A$: There is a difference in the mean AQI between Los Angeles County and the rest of California.


In [158]:
# Set the significance level.
significance_level = 0.05
print(significance_level)

0.05


#### Determine the appropriate test procedure:

Here, we are comparing the means of two independent samples. Therefore, we will use a **two-sample $t$-test**.

#### Compute the p-value

In [159]:
# Compute the p-value.
stats.ttest_ind(a=ca_la['aqi'], b=ca_others['aqi'], equal_var=False)

Ttest_indResult(statistic=2.1107010796372014, pvalue=0.049839056842410995)

#### Quick Insight

The p-value (0.0498) is just below 0.05 (the 5% significance level), so we reject the null hypothesis in favor of the alternative hypothesis. We conclude that there is a statistically significant difference in mean AQI between Los Angeles County and the rest of California.

The t-statistic is positive (2.11), which means the Los Angeles County sample has the higher mean AQI. A metropolitan-focused strategy may therefore make sense in this case, although the result is borderline.
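
Because the p-value sits so close to the threshold, a confidence interval for the difference in means can help show how large the difference plausibly is. The sketch below is one way to compute it under Welch's assumptions, using the Welch-Satterthwaite approximation for the degrees of freedom. The helper `welch_ci` and the two sample lists are hypothetical and are not part of the original notebook.

```python
import numpy as np
from scipy import stats


def welch_ci(a, b, confidence=0.95):
    """Confidence interval for mean(a) - mean(b) without assuming equal variances."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    va = a.var(ddof=1) / a.size  # squared standard error of mean(a)
    vb = b.var(ddof=1) / b.size  # squared standard error of mean(b)
    se = np.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    t_crit = stats.t.ppf(0.5 + confidence / 2, df)
    diff = a.mean() - b.mean()
    return diff - t_crit * se, diff + t_crit * se, df


# Hypothetical AQI-like values, not the notebook's data.
sample_1 = [12, 15, 18, 22, 30, 17, 16, 25]
sample_2 = [9, 11, 10, 14, 8, 12, 13, 10, 11, 12]

ci_low, ci_high, df = welch_ci(sample_1, sample_2)
print(f"95% CI for the difference in means: ({ci_low:.2f}, {ci_high:.2f}), df={df:.1f}")
```

In this notebook the equivalent call would presumably be `welch_ci(ca_la['aqi'], ca_others['aqi'])`. A 95% interval that excludes 0 corresponds to rejecting the null hypothesis in the two-sided Welch's test at the 5% level.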

### Hypothesis 2:
`With limited resources, ROA has to choose between New York and Ohio for their next regional office. Does New York have a lower AQI than Ohio, and is the difference statistically significant?`

In [160]:
# Create dataframes for each sample being compared in the test

ny = aqi[aqi['state_name']=='New York']
ohio = aqi[aqi['state_name']=='Ohio']

In [162]:
ny.shape

(10, 9)

In [163]:
ohio.shape

(12, 9)

In [164]:
ny

Unnamed: 0,date_local,state_name,county_name,city_name,local_site_name,parameter_name,units_of_measure,arithmetic_mean,aqi
90,2018-01-01,New York,Erie,Cheektowaga,Buffalo Near-Road,Carbon monoxide,Parts per million,0.252632,3
113,2018-01-01,New York,Bronx,New York,PFIZER LAB SITE,Carbon monoxide,Parts per million,0.289474,3
124,2018-01-01,New York,Monroe,Rochester,ROCHESTER 2,Carbon monoxide,Parts per million,0.2,2
167,2018-01-01,New York,New York,New York,CCNY,Carbon monoxide,Parts per million,0.2,2
173,2018-01-01,New York,Queens,New York,Queens College Near Road,Carbon monoxide,Parts per million,0.273684,3
182,2018-01-01,New York,Queens,New York,QUEENS COLLEGE 2,Carbon monoxide,Parts per million,0.2,2
184,2018-01-01,New York,Steuben,Not in a city,PINNACLE STATE PARK,Carbon monoxide,Parts per million,0.2,2
195,2018-01-01,New York,Erie,Buffalo,BUFFALO,Carbon monoxide,Parts per million,0.3,3
196,2018-01-01,New York,Monroe,Rochester,Rochester Near-Road,Carbon monoxide,Parts per million,0.2,2
234,2018-01-01,New York,Albany,Albany,LOUDONVILLE,Carbon monoxide,Parts per million,0.221053,3


In [165]:
ohio

Unnamed: 0,date_local,state_name,county_name,city_name,local_site_name,parameter_name,units_of_measure,arithmetic_mean,aqi
1,2018-01-01,Ohio,Belmont,Shadyside,Shadyside,Carbon monoxide,Parts per million,0.263158,5
12,2018-01-01,Ohio,Hamilton,Cincinnati,Taft NCore,Carbon monoxide,Parts per million,0.252632,3
22,2018-01-01,Ohio,Stark,Canton,Canton,Carbon monoxide,Parts per million,0.394737,6
51,2018-01-01,Ohio,Summit,Akron,NIHF STEM MS,Carbon monoxide,Parts per million,0.083333,3
59,2018-01-01,Ohio,Cuyahoga,Cleveland,GT Craig NCore,Carbon monoxide,Parts per million,0.25,3
120,2018-01-01,Ohio,Cuyahoga,Cleveland,Galleria,Carbon monoxide,Parts per million,0.273684,3
149,2018-01-01,Ohio,Franklin,Columbus,Morse Rd,Carbon monoxide,Parts per million,0.184211,3
191,2018-01-01,Ohio,Franklin,Columbus,Smoky Row Near Road,Carbon monoxide,Parts per million,0.115789,2
215,2018-01-01,Ohio,Cuyahoga,Warrensville Heights,Cleveland Near Road,Carbon monoxide,Parts per million,0.321053,5
231,2018-01-01,Ohio,Montgomery,Dayton,Reibold,Carbon monoxide,Parts per million,0.163158,2


#### Formulate the null and alternative hypotheses:


*   $H_0$: The mean AQI of New York is greater than or equal to that of Ohio.
*   $H_A$: The mean AQI of New York is **below** that of Ohio.


**Note:** The significance level remains `0.05`.

#### Determine the appropriate test procedure:

Here, we are comparing the means of two independent samples in one direction (is New York's mean lower than Ohio's?). Therefore, we will use a one-sided **two-sample $t$-test**.

In [168]:
# Compute the p-value.
tstat, p_value = stats.ttest_ind(a=ny['aqi'], b=ohio['aqi'], alternative='less', equal_var=False)

# Print the results.
print(tstat)
print(p_value)

-2.025951038880333
0.03044650269193468


#### Quick Insight

With a p-value (0.030) less than 0.05 (the 5% significance level) and a negative t-statistic (-2.026), we **reject the null hypothesis in favor of the alternative hypothesis**.

Therefore, we can conclude at the 5% significance level that New York has a lower mean AQI than Ohio.
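
The `alternative='less'` argument is what makes this test one-sided. The sketch below illustrates how the one-sided p-values relate to the default two-sided one. The arrays `low_group` and `high_group` are hypothetical and are not the New York or Ohio samples.

```python
import numpy as np
from scipy import stats

# Hypothetical AQI-like samples (not the notebook's data); the first has the lower mean.
low_group = np.array([2, 3, 2, 4, 3, 2, 3, 2], dtype=float)
high_group = np.array([5, 3, 6, 3, 4, 2, 7, 3, 5, 4], dtype=float)

two_sided = stats.ttest_ind(low_group, high_group, equal_var=False)
less = stats.ttest_ind(low_group, high_group, equal_var=False, alternative='less')
greater = stats.ttest_ind(low_group, high_group, equal_var=False, alternative='greater')

# The t-statistic is identical in all three cases; only the tail used for the p-value changes.
print(f"t = {two_sided.statistic:.3f}")
print(f"two-sided p = {two_sided.pvalue:.4f}")
print(f"'less' p    = {less.pvalue:.4f}")     # half the two-sided p when t < 0
print(f"'greater' p = {greater.pvalue:.4f}")  # equals 1 minus the 'less' p
```

When the t-statistic is negative, the `'less'` p-value is half the two-sided p-value, which is why the sign of the statistic is checked alongside the p-value above.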

### Hypothesis 3:
`A new policy will affect those states with a mean AQI of 10 or greater. Can we rule out Michigan from being affected by this new policy?`

In [169]:
# Create a dataframe for the sample being used in the test.
michigan = aqi[aqi['state_name'] == 'Michigan']

In [170]:
michigan.shape

(9, 9)

In [171]:
michigan

Unnamed: 0,date_local,state_name,county_name,city_name,local_site_name,parameter_name,units_of_measure,arithmetic_mean,aqi
65,2018-01-01,Michigan,Wayne,Livonia,LIVONIA-NR,Carbon monoxide,Parts per million,0.338889,5
122,2018-01-01,Michigan,Wayne,Detroit,West corner,Carbon monoxide,Parts per million,0.394737,8
123,2018-01-01,Michigan,Wayne,Detroit,MARK TWAIN MIDDLE SCHOOL,Carbon monoxide,Parts per million,0.515789,9
129,2018-01-01,Michigan,Wayne,Detroit,ELIZA-NR,Carbon monoxide,Parts per million,0.616667,11
192,2018-01-01,Michigan,Wayne,Allen Park,Allen Park,Carbon monoxide,Parts per million,0.811111,13
207,2018-01-01,Michigan,Wayne,Not in a city,Eliza Downwind,Carbon monoxide,Parts per million,0.516667,10
226,2018-01-01,Michigan,Kent,Grand Rapids,GR-MONROE,Carbon monoxide,Parts per million,0.2,2
242,2018-01-01,Michigan,Wayne,Detroit,(Northeast corner),Carbon monoxide,Parts per million,0.378947,7
248,2018-01-01,Michigan,Wayne,Detroit,NORTHWEST,Carbon monoxide,Parts per million,0.415789,8


#### Formulate the null and alternative hypotheses:

*   $H_0$: The mean AQI of Michigan is less than or equal to 10.
*   $H_A$: The mean AQI of Michigan is greater than 10.


**Note:** The significance level remains `0.05`.

#### Determine the appropriate test procedure:

Here, we are comparing one sample mean to a fixed value (10) in one direction. Therefore, we will use a one-sided **one-sample $t$-test**.

In [176]:
# Compute the p-value.
tstat, pvalue = stats.ttest_1samp(michigan['aqi'], 10, alternative='greater')

# Print the results.
print(tstat)
print(pvalue)

-1.7395913343286131
0.9399405193140109


#### Quick Insight

With a p-value (0.940) greater than 0.05 (the 5% significance level) and a negative t-statistic (-1.74), we **fail to reject the null hypothesis**.

Therefore, we cannot conclude at the 5% significance level that Michigan's mean AQI is greater than 10. This suggests that Michigan would most likely not be affected by the new policy, although failing to reject $H_0$ is not proof that the mean is below 10.
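
As a check on the result above, the one-sample t-statistic can be recomputed by hand. The sketch below copies the nine AQI values from the Michigan rows displayed earlier; the variable names are illustrative and are not from the original notebook. It also reports the lower-tail p-value, which is what a test of $H_A$: mean < 10 would use.

```python
import numpy as np
from scipy import stats

# AQI values copied from the nine Michigan rows displayed in the notebook output.
michigan_aqi = np.array([5, 8, 9, 11, 13, 10, 2, 7, 8], dtype=float)
threshold = 10

n = michigan_aqi.size
sample_mean = michigan_aqi.mean()
std_error = michigan_aqi.std(ddof=1) / np.sqrt(n)

# One-sample t-statistic: distance of the sample mean from the threshold, in standard errors.
t_manual = (sample_mean - threshold) / std_error

# Upper-tail p-value for H_A: mean > 10, with n - 1 degrees of freedom.
p_greater = stats.t.sf(t_manual, df=n - 1)
# Lower-tail p-value, i.e. the p-value if the alternative were instead "mean < 10".
p_less = stats.t.cdf(t_manual, df=n - 1)

print(f"mean = {sample_mean:.3f}, t = {t_manual:.4f}")
print(f"p (greater) = {p_greater:.4f}, p (less) = {p_less:.4f}")
```

The lower-tail p-value comes out at about 0.06, so even a test aimed directly at showing that the mean is below 10 would narrowly miss significance at the 5% level with this sample.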