# Mini Project 5-5 Explore Hypothesis Testing

## Introduction

You work for an environmental think tank called Repair Our Air (ROA). ROA is formulating policy recommendations to improve the air quality in America, using the Environmental Protection Agency's Air Quality Index (AQI) to guide their decision making. An AQI value close to 0 signals "little to no" public health concern, while higher values are associated with increased risk to public health. 

They've tasked you with leveraging AQI data to help them prioritize their strategy for improving air quality in America.

ROA is considering the following decisions. For each, construct a hypothesis test and an accompanying visualization, using your results of that test to make a recommendation:

1. ROA is considering a metropolitan-focused approach. Within California, they want to know if the mean AQI in Los Angeles County is statistically different from the rest of California.
2. With limited resources, ROA has to choose between New York and Ohio for their next regional office. Does New York have a lower AQI than Ohio?
3. A new policy will affect those states with a mean AQI of 10 or greater. Will Michigan be affected by this new policy?

**Notes:**
1. For your analysis, you'll default to a 5% level of significance.
2. Throughout the lab, for two-sample t-tests, use Welch's t-test (i.e., setting the `equal_var` parameter to `False` in `scipy.stats.ttest_ind()`). This will account for the possibly unequal variances between the two groups in the comparison.
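
To make the second note concrete, here is a minimal sketch of what Welch's t-test computes: the t statistic and the Welch-Satterthwaite degrees of freedom. The helper name `welch_t` and the sample values are made up for illustration and are not taken from the AQI dataset. In the lab itself, `scipy.stats.ttest_ind(..., equal_var=False)` does this work for you.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each group's mean (sample variance / n)
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    # Difference in means divided by the combined (unpooled) standard error
    t_stat = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Degrees of freedom are estimated from the data rather than fixed at n_a + n_b - 2
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, dof

# Made-up values for illustration only
group_a = [2.1, 2.8, 2.9, 2.4, 3.0]
group_b = [1.1, 1.4, 1.8, 1.4, 2.0, 1.6]
t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

Because the variances are not pooled, the degrees of freedom always fall between the smaller group's `n - 1` and `n_a + n_b - 2`.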

## Step 1: Imports

To proceed with your analysis, import `pandas` and `numpy`, plus `matplotlib.pyplot` for the accompanying visualizations. To conduct your hypothesis testing, import `stats` from `scipy`.

#### Import Packages

In [2]:
# Import relevant packages
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats

You are also provided with a dataset of national Air Quality Index (AQI) measurements by state over time. `pandas` is used to import the file `modified_c4_epa_air_quality.csv` as a DataFrame named `aqi`, and rows with missing values are dropped. As shown in the following cell, the dataset is loaded for you. You do not need to download the .csv file or write additional code to access it. Continue with this activity by completing the following instructions.

**Note:** For purposes of your analysis, you can assume this data is randomly sampled from a larger population.

#### Load Dataset

In [3]:
# IMPORT YOUR DATA
aqi = pd.read_csv('modified_c4_epa_air_quality.csv')
aqi = aqi.dropna()

## Step 2: Data Exploration

### Before proceeding to your deliverables, explore your datasets.

Use the following space to surface descriptive statistics about your data. In particular, explore whether you believe the research questions you were given are readily answerable with this data.
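
One way to judge whether the questions are answerable is to check how many observations each group has and how spread out its values are. The sketch below does this on a small made-up DataFrame (the name `toy` and its values are hypothetical). On the real data, the same `groupby` call could be applied to `aqi`.

```python
import pandas as pd

# Made-up rows for illustration only; column names mirror the lab's dataset
toy = pd.DataFrame({
    "state_name": ["California", "California", "Ohio", "New York", "Ohio", "California"],
    "aqi_log": [2.08, 2.83, 1.79, 1.39, 1.10, 2.89],
})

# Sample size, mean, and standard deviation per state, largest groups first
summary = (
    toy.groupby("state_name")["aqi_log"]
       .agg(["count", "mean", "std"])
       .sort_values("count", ascending=False)
)
print(summary)
```

Groups with only a handful of observations give t-tests little power, and a group with a single observation has no sample standard deviation at all.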

In [4]:
# Use head() to show a sample of data
aqi.head()

| | date_local | state_name | county_name | city_name | local_site_name | parameter_name | units_of_measure | aqi_log |
|---|---|---|---|---|---|---|---|---|
| 0 | 2018-01-01 | Arizona | Maricopa | Buckeye | BUCKEYE | Carbon monoxide | Parts per million | 2.079442 |
| 1 | 2018-01-01 | Ohio | Belmont | Shadyside | Shadyside | Carbon monoxide | Parts per million | 1.791759 |
| 2 | 2018-01-01 | Wyoming | Teton | Not in a city | Yellowstone National Park - Old Faithful Snow ... | Carbon monoxide | Parts per million | 1.098612 |
| 3 | 2018-01-01 | Pennsylvania | Philadelphia | Philadelphia | North East Waste (NEW) | Carbon monoxide | Parts per million | 1.386294 |
| 4 | 2018-01-01 | Iowa | Polk | Des Moines | CARPENTER | Carbon monoxide | Parts per million | 1.386294 |


In [5]:
# Check variables
aqi.columns

Index(['date_local', 'state_name', 'county_name', 'city_name',
       'local_site_name', 'parameter_name', 'units_of_measure', 'aqi_log'],
      dtype='object')

In [6]:
# Use describe() to summarize AQI
aqi.describe()

| | aqi_log |
|---|---|
| count | 257.0 |
| mean | 1.768918 |
| std | 0.716498 |
| min | 0.0 |
| 25% | 1.098612 |
| 50% | 1.791759 |
| 75% | 2.302585 |
| max | 3.931826 |


In [7]:
# For a more thorough examination of observations by state, use value_counts()
aqi['state_name'].value_counts()




state_name
California              66
Arizona                 14
Ohio                    12
Florida                 12
Texas                   10
New York                10
Pennsylvania             9
Colorado                 9
Michigan                 9
Minnesota                7
New Jersey               6
Indiana                  5
Massachusetts            4
Oklahoma                 4
North Carolina           4
Nevada                   4
Maryland                 4
Connecticut              4
Virginia                 4
Utah                     3
Vermont                  3
Illinois                 3
Missouri                 3
Hawaii                   3
Iowa                     3
Wyoming                  3
Kentucky                 3
Alaska                   2
Rhode Island             2
Georgia                  2
Tennessee                2
Washington               2
Montana                  2
Maine                    2
Idaho                    2
New Mexico               2
District Of Colum

#### **Question 1: From the preceding data exploration, what do you recognize?**

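A: The dataset has 257 observations, and the only numeric column is `aqi_log`, which appears to be a log-transformed AQI rather than the raw index. 75% of the observations have an `aqi_log` of about 2.30 or less, and the maximum is about 3.93. California has by far the most observations (66), while many states have fewer than five, so comparisons involving small states rest on very small samples.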

## Step 3. Statistical Tests

Before you proceed, recall the following steps for conducting hypothesis testing:

1. Formulate the null hypothesis and the alternative hypothesis.<br>
2. Set the significance level.<br>
3. Determine the appropriate test procedure.<br>
4. Compute the p-value.<br>
5. Draw your conclusion.
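
The five steps above can be sketched end to end as follows. This is a minimal illustration on synthetic samples, not the AQI data; the sample sizes, means, and the variable names are assumptions chosen for the example.

```python
import numpy as np
from scipy import stats

# Synthetic samples for illustration only
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=2.5, scale=0.5, size=30)
sample_b = rng.normal(loc=2.0, scale=0.7, size=40)

# 1. Hypotheses: H0 says the two population means are equal; HA says they differ.
# 2. Significance level
significance_level = 0.05

# 3. Test procedure: Welch's two-sample t-test (does not assume equal variances)
# 4. Compute the p-value
result = stats.ttest_ind(sample_a, sample_b, equal_var=False)

# 5. Conclusion
decision = "Reject H0" if result.pvalue < significance_level else "Fail to reject H0"
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f} -> {decision}")
```

Fixing the hypotheses and the significance level before computing the p-value keeps the conclusion from being tuned to the result.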

### Hypothesis 1: ROA is considering a metropolitan-focused approach. Within California, they want to know if the mean AQI in Los Angeles County is statistically different from the rest of California.

Before proceeding with your analysis, it will be helpful to subset the data for your comparison.

In [8]:
# Create dataframes for each sample being compared in your test
california_data = aqi[aqi['state_name'] == 'California']
# 1. Create dataframe for Los Angeles County
la_data = california_data[california_data['county_name'] == 'Los Angeles']

# 2. Create dataframe for the rest of California (excluding Los Angeles)
rest_of_california = california_data[california_data['county_name'] != 'Los Angeles']



In [9]:
# Check head
print("Los Angeles data:")
print(la_data.head())

print("\nRest of California data:")
print(rest_of_california.head())


Los Angeles data:
     date_local  state_name  county_name      city_name  \
33   2018-01-01  California  Los Angeles      Lancaster   
42   2018-01-01  California  Los Angeles  Santa Clarita   
61   2018-01-01  California  Los Angeles       Pasadena   
76   2018-01-01  California  Los Angeles    Los Angeles   
109  2018-01-01  California  Los Angeles    Los Angeles   

                   local_site_name   parameter_name   units_of_measure  \
33       Lancaster-Division Street  Carbon monoxide  Parts per million   
42                   Santa Clarita  Carbon monoxide  Parts per million   
61                        Pasadena  Carbon monoxide  Parts per million   
76                    LAX Hastings  Carbon monoxide  Parts per million   
109  Los Angeles-North Main Street  Carbon monoxide  Parts per million   

      aqi_log  
33   2.079442  
42   2.079442  
61   2.833213  
76   2.890372  
109  2.890372  

Rest of California data:
    date_local  state_name     county_name      city_name  \

#### Formulate your hypothesis:

**Formulate your null and alternative hypotheses:**

*   $H_0$: There is no difference in the mean AQI between Los Angeles County and the rest of California.
*   $H_A$: There is a difference in the mean AQI between Los Angeles County and the rest of California.


#### Set the significance level:

In [10]:
# For this analysis, the significance level is 5%
significance_level = 0.05

#### Determine the appropriate test procedure:

Here, you are comparing the sample means between two independent samples. Therefore, you will use a **two-sample 𝑡-test** (Welch's version, as specified in the notes).

#### Compute the P-value

In [11]:
# Compute your p-value here
from scipy import stats
statistic, pvalue = stats.ttest_ind(a=la_data['aqi_log'], b=rest_of_california['aqi_log'], equal_var=False)

print ("pvalue:",pvalue)
print()


pvalue: 0.01340695749474014



#### **Question 2. What is your P-value for hypothesis 1, and what does this indicate for your null hypothesis?**

In [12]:
# Compare the p-value with the 5% significance level
if pvalue < 0.05:
    print('pvalue < 0.05, Reject Ho.')          # Ha: There is a difference in the mean
else:
    print('pvalue > 0.05, Fail to reject Ho.')  # Ho: There is no difference in the mean

pvalue < 0.05, Reject Ho.


A: The p-value is about 0.013, which is below the 5% significance level. We therefore reject $H_0$: there is a statistically significant difference in mean `aqi_log` between Los Angeles County and the rest of California.
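
Each decision also calls for an accompanying visualization. A side-by-side box plot is one reasonable choice for a two-group comparison. The sketch below uses synthetic values with arbitrary group sizes; in the notebook, the two arrays would be replaced with `la_data['aqi_log']` and `rest_of_california['aqi_log']`.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic values for illustration only
rng = np.random.default_rng(2)
group_a = rng.normal(2.6, 0.4, size=14)
group_b = rng.normal(2.1, 0.6, size=52)

fig, ax = plt.subplots(figsize=(6, 4))
# One box per group; boxes are drawn at x positions 1 and 2 by default
box = ax.boxplot([group_a, group_b])
ax.set_xticks([1, 2])
ax.set_xticklabels(["Group A", "Group B"])
ax.set_ylabel("aqi_log")
ax.set_title("Distribution of aqi_log by group (synthetic data)")
fig.tight_layout()
```

A box plot shows the medians and spread of both groups at once, which helps a reader see whether a significant difference in means is also large in practical terms.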

### Hypothesis 2: With limited resources, ROA has to choose between New York and Ohio for their next regional office. Does New York have a lower AQI than Ohio?

Before proceeding with your analysis, it will be helpful to subset the data for your comparison.

In [13]:
# Create dataframes for each sample being compared in your test
# Filter the data for New York and Ohio
ny_data = aqi[aqi['state_name'] == 'New York']
ohio_data = aqi[aqi['state_name'] == 'Ohio']



In [14]:
# Check head
print("New York AQI Data:")
print(ny_data.head())

print("\nOhio AQI Data:")
print(ohio_data.head())


New York AQI Data:
     date_local state_name county_name    city_name           local_site_name  \
90   2018-01-01   New York        Erie  Cheektowaga         Buffalo Near-Road   
113  2018-01-01   New York       Bronx     New York           PFIZER LAB SITE   
124  2018-01-01   New York      Monroe    Rochester               ROCHESTER 2   
167  2018-01-01   New York    New York     New York                      CCNY   
173  2018-01-01   New York      Queens     New York  Queens College Near Road   

      parameter_name   units_of_measure   aqi_log  
90   Carbon monoxide  Parts per million  1.386294  
113  Carbon monoxide  Parts per million  1.386294  
124  Carbon monoxide  Parts per million  1.098612  
167  Carbon monoxide  Parts per million  1.098612  
173  Carbon monoxide  Parts per million  1.386294  

Ohio AQI Data:
    date_local state_name county_name   city_name local_site_name  \
1   2018-01-01       Ohio     Belmont   Shadyside       Shadyside   
12  2018-01-01       Ohio   

**Formulate your null and alternative hypotheses:**

*   $H_0$: The mean AQI of New York is greater than or equal to that of Ohio.
*   $H_A$: The mean AQI of New York is **below** that of Ohio.


#### Significance Level (remains at 5%)

#### Determine the appropriate test procedure:

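Here, you are comparing the sample means between two independent samples in one direction. Therefore, you will use a **two-sample 𝑡-test** with a one-sided alternative. Keep in mind that `scipy.stats.ttest_ind()` returns a two-sided p-value by default.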
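
A one-sided test can be requested with the `alternative` parameter of `scipy.stats.ttest_ind()` (available in SciPy 1.6 and later). The sketch below uses synthetic samples, not the New York and Ohio data, and shows how the one-sided p-value relates to the default two-sided one.

```python
import numpy as np
from scipy import stats

# Synthetic samples for illustration only
rng = np.random.default_rng(1)
sample_a = rng.normal(1.3, 0.3, size=10)
sample_b = rng.normal(1.7, 0.5, size=12)

# Default: two-sided alternative (the means differ in either direction)
two_sided = stats.ttest_ind(sample_a, sample_b, equal_var=False)

# One-sided alternative: the mean of sample_a is less than the mean of sample_b
one_sided = stats.ttest_ind(sample_a, sample_b, equal_var=False, alternative='less')

# For alternative='less': half the two-sided p-value when t is negative,
# otherwise one minus half of it
if two_sided.statistic < 0:
    expected = two_sided.pvalue / 2
else:
    expected = 1 - two_sided.pvalue / 2

print("two-sided p:", two_sided.pvalue)
print("one-sided p:", one_sided.pvalue)
```

The order of the arguments matters here: `alternative='less'` tests whether the first sample's mean is below the second's.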

#### Compute the P-value

In [15]:
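# Compute your p-value here
# Note: ttest_ind() is two-sided by default; the directional H_A above
# calls for a one-sided p-value (e.g., alternative='less' in SciPy 1.6+)
statistic, pvalue = stats.ttest_ind(a=ny_data['aqi_log'], b=ohio_data['aqi_log'], equal_var=False)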

print ("pvalue:",pvalue)
print()


pvalue: 0.06430898914287665



#### **Question 3. What is your P-value for hypothesis 2, and what does this indicate for your null hypothesis?**

In [16]:
# Compare the p-value with the 5% significance level
# Note: this p-value is two-sided, while H_A for this hypothesis is one-sided
if pvalue < 0.05:
    print('pvalue < 0.05, Reject Ho.')          # Ha: New York's mean AQI is below Ohio's
else:
    print('pvalue > 0.05, Fail to reject Ho.')  # Ho: New York's mean AQI is greater than or equal to Ohio's

pvalue > 0.05, Fail to reject Ho.


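A: The p-value is about 0.064, which is above 0.05, so with the test as run we fail to reject $H_0$. Note that this p-value comes from the default two-sided test, whereas $H_A$ is directional. If the t-statistic is negative (New York's sample mean is below Ohio's), the one-sided p-value is half of this value, about 0.032, which would fall below 0.05. The sign of the statistic should be checked before drawing a final conclusion.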

###  Hypothesis 3: A new policy will affect those states with a mean AQI of 10 or greater. Will Michigan be affected by this new policy?

Before proceeding with your analysis, it will be helpful to subset the data for your comparison.

In [17]:
# Create dataframes for each sample being compared in your test
# Filter the data for Michigan
michigan_data = aqi[aqi['state_name'] == 'Michigan']

# Check the first few rows to ensure the data is correct
print("Michigan AQI Data:")
print(michigan_data.head())

Michigan AQI Data:
     date_local state_name county_name   city_name           local_site_name  \
65   2018-01-01   Michigan       Wayne     Livonia                LIVONIA-NR   
122  2018-01-01   Michigan       Wayne     Detroit               West corner   
123  2018-01-01   Michigan       Wayne     Detroit  MARK TWAIN MIDDLE SCHOOL   
129  2018-01-01   Michigan       Wayne     Detroit                  ELIZA-NR   
192  2018-01-01   Michigan       Wayne  Allen Park                Allen Park   

      parameter_name   units_of_measure   aqi_log  
65   Carbon monoxide  Parts per million  1.791759  
122  Carbon monoxide  Parts per million  2.197225  
123  Carbon monoxide  Parts per million  2.302585  
129  Carbon monoxide  Parts per million  2.484907  
192  Carbon monoxide  Parts per million  2.639057  


**Formulate your null and alternative hypotheses here:**

*   $H_0$: The mean AQI of Michigan is less than or equal to 10.
*   $H_A$: The mean AQI of Michigan is greater than 10.


#### Significance Level (remains at 5%)

#### Determine the appropriate test procedure:

Here, you are comparing one sample mean relative to a particular value in one direction. Therefore, you will use a **one-sample 𝑡-test** with a one-sided alternative.
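
Because the dataset's column is `aqi_log`, the threshold needs to be on the same scale as the data before the test is meaningful. The sketch below is one possible approach on made-up values. It assumes the transform is `np.log1p`, which this notebook does not document, so treat that choice as hypothetical and confirm it against the data source.

```python
import numpy as np
from scipy import stats

# Made-up raw AQI values for illustration only
raw_aqi = np.array([4, 7, 9, 12, 6, 8, 11, 5, 10])

# ASSUMPTION: the log transform is log(1 + x); the notebook does not say
aqi_log_values = np.log1p(raw_aqi)
threshold_log = np.log1p(10)  # the policy threshold, moved to the same scale

# One-sided test: is the mean of the log values greater than the log threshold?
result = stats.ttest_1samp(aqi_log_values, popmean=threshold_log, alternative='greater')
print(f"t = {result.statistic:.3f}, one-sided p = {result.pvalue:.4f}")
```

A test on the log scale concerns the mean of the logged values, which is not identical to the mean of the raw AQI, so where raw values are available it is simpler to test them directly against 10.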

#### Compute the P-value

In [18]:
# Compute your p-value here
# Note: ttest_1samp() is two-sided by default, and aqi_log is on a log scale
# while the policy threshold of 10 is stated on the raw AQI scale
statistic, pvalue = stats.ttest_1samp(michigan_data['aqi_log'], 10)

# Print the p-value
print("p-value:", pvalue)

# Hypothesis test result (two-sided: does the mean differ from 10?)
if pvalue < 0.05:
    print('p-value < 0.05: the mean of aqi_log differs from 10 (two-sided test).')
else:
    print('p-value > 0.05: no evidence that the mean of aqi_log differs from 10.')

p-value: 2.2171478049783093e-11
p-value < 0.05: the mean of aqi_log differs from 10 (two-sided test).


#### **Question 4. What is your P-value for hypothesis 3, and what does this indicate for your null hypothesis?**

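A: The two-sided p-value is about 2.2e-11, but it does not support $H_A$. Every `aqi_log` value in the dataset is at most about 3.93 (see the `describe()` output), so Michigan's sample mean lies below 10 and the t-statistic is negative. The tiny p-value therefore indicates a mean significantly *below* 10. For the one-sided alternative (mean greater than 10), the p-value is close to 1, so we fail to reject $H_0$. One caveat: `aqi_log` is log-transformed while the threshold of 10 is a raw AQI value, so the two should be put on the same scale before the test is relied on.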

## Step 4. Results and Evaluation

Now that you've completed your statistical tests, you can consider your hypotheses and the results you gathered.

#### **Question 5. Did your results show that the AQI in Los Angeles County was statistically different from the rest of California?**

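A: Yes. The p-value (about 0.013) is below 0.05, so we reject $H_0$ and conclude that the mean in Los Angeles County is statistically different from the rest of California.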

#### **Question 6. Did New York or Ohio have a lower AQI?**

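A: Not conclusively. The two-sided p-value (about 0.064) is above 0.05, so the test as run does not show a statistically significant difference. Because the question is directional, a one-sided test should be run before recommending a location for the regional office.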

#### **Question 7: Will Michigan be affected by the new policy impacting states with a mean AQI of 10 or greater?**



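A: No, based on this data. Every `aqi_log` value in the dataset is at most about 3.93, so there is no evidence that Michigan's mean meets the threshold of 10. Because the threshold is stated on the raw AQI scale and the data is log-transformed, the comparison should be repeated with both on the same scale before making a final recommendation.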

# Conclusion

**What are key takeaways from this project?**

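A: I learned how to conduct hypothesis tests using Python, and that the form of the alternative hypothesis (one-sided or two-sided) must match the p-value that is reported.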

**What would you consider presenting to your manager as part of your findings?**

A: The result of each test and what it means for the corresponding decision. `stats.ttest_ind()` makes these tests convenient to run.


**What would you convey to external readers?**

A: 
