# Air Quality Analysis with Hypothesis Testing

## Introduction

I work for an environmental think tank called Repair Our Air (ROA). ROA is formulating policy recommendations to improve the air quality in America, using the Environmental Protection Agency's Air Quality Index (AQI) to guide their decision making. An AQI value close to 0 signals "little to no" public health concern, while higher values are associated with increased risk to public health.

I've been tasked with leveraging AQI data to help them prioritize their strategy for improving air quality in America.

ROA is considering the following decisions:

1. ROA is considering a metropolitan-focused approach. Within California, they want to know if the mean AQI in Los Angeles County is statistically different from the rest of California.
2. With limited resources, ROA has to choose between New York and Ohio for their next regional office. Does New York have a lower AQI than Ohio?
3. A new policy will affect those states with a mean AQI of 10 or greater. Will Michigan be affected by this new policy?

**Notes:**
1. For the analysis, I'll default to a 5% level of significance.
2. Throughout the project, for two-sample t-tests, I will use Welch's t-test (i.e., setting the `equal_var` parameter to `False` in `scipy.stats.ttest_ind()`). This will account for the possibly unequal variances between the two groups in the comparison.
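
As a rough illustration of what Welch's t-test computes, the sketch below derives the t-statistic and the Welch-Satterthwaite degrees of freedom by hand, using only the standard library. The function name `welch_t` and the sample values are hypothetical and are not taken from the EPA dataset.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each mean: sample variance (ddof=1) / n
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t_stat = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # The variances are not pooled, so the degrees of freedom are approximated
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, df


# Hypothetical AQI readings, made up for illustration only
group_a = [12, 15, 9, 20, 14, 11]
group_b = [8, 7, 10, 6, 9, 8, 7, 9]

t_stat, df = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```

Because the two variances are kept separate, the degrees of freedom are generally not a whole number, which is why the `scipy` output for a Welch test can report a fractional `df`.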

## Step 1: Imports


#### Import Packages

In [20]:
# Import relevant packages

import pandas as pd
import numpy as np
from scipy import stats

#### Load Dataset

In [22]:
# IMPORT THE DATA.

aqi = pd.read_csv('c4_epa_air_quality.csv')

## Step 2: Data Exploration

### Before proceeding to the deliverables, explore your datasets.

First, I explore whether the research questions I was given are readily answerable with this data.

In [25]:
# Explore The dataframe `aqi`:



print("describe() To summarize AQI")
print(aqi.describe(include='all'))

print("value_counts() For a more thorough examination of observations by state ")
print(aqi['state_name'].value_counts())

print("head() To show a sample of data")
aqi.head()

describe() To summarize AQI
        Unnamed: 0  date_local  state_name  county_name      city_name  \
count   260.000000         260         260          260            260   
unique         NaN           1          52          149            190   
top            NaN  2018-01-01  California  Los Angeles  Not in a city   
freq           NaN         260          66           14             21   
mean    129.500000         NaN         NaN          NaN            NaN   
std      75.199734         NaN         NaN          NaN            NaN   
min       0.000000         NaN         NaN          NaN            NaN   
25%      64.750000         NaN         NaN          NaN            NaN   
50%     129.500000         NaN         NaN          NaN            NaN   
75%     194.250000         NaN         NaN          NaN            NaN   
max     259.000000         NaN         NaN          NaN            NaN   

       local_site_name   parameter_name   units_of_measure  arithmetic_mean  \
coun

| | Unnamed: 0 | date_local | state_name | county_name | city_name | local_site_name | parameter_name | units_of_measure | arithmetic_mean | aqi |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2018-01-01 | Arizona | Maricopa | Buckeye | BUCKEYE | Carbon monoxide | Parts per million | 0.473684 | 7 |
| 1 | 1 | 2018-01-01 | Ohio | Belmont | Shadyside | Shadyside | Carbon monoxide | Parts per million | 0.263158 | 5 |
| 2 | 2 | 2018-01-01 | Wyoming | Teton | Not in a city | Yellowstone National Park - Old Faithful Snow ... | Carbon monoxide | Parts per million | 0.111111 | 2 |
| 3 | 3 | 2018-01-01 | Pennsylvania | Philadelphia | Philadelphia | North East Waste (NEW) | Carbon monoxide | Parts per million | 0.3 | 3 |
| 4 | 4 | 2018-01-01 | Iowa | Polk | Des Moines | CARPENTER | Carbon monoxide | Parts per million | 0.215789 | 3 |


#### **Observations**


- We have county-level data for the first hypothesis.
- Ohio and New York each have a comparatively large number of observations in this dataset, which gives the second hypothesis enough data to work with.
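
One way to check these observations is to count rows and compute the mean AQI per state of interest. The sketch below uses a tiny hypothetical stand-in for `aqi` (the rows are made up), so only the pattern carries over, not the numbers.

```python
import pandas as pd

# Hypothetical miniature stand-in for `aqi`; the values are made up
toy = pd.DataFrame({
    "state_name": ["Ohio", "Ohio", "New York", "New York", "New York", "Michigan"],
    "aqi": [5, 9, 3, 4, 2, 8],
})

states_of_interest = ["New York", "Ohio", "Michigan"]

# Number of observations and mean AQI for each state of interest
summary = (
    toy[toy["state_name"].isin(states_of_interest)]
    .groupby("state_name")["aqi"]
    .agg(["count", "mean"])
)
print(summary)
```

Small counts are worth noting before testing, since a t-test on a handful of rows has little power.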

## Step 3. Statistical Tests

Recall the following steps for conducting a hypothesis test:

1. Formulate the null hypothesis and the alternative hypothesis.
2. Set the significance level.
3. Determine the appropriate test procedure.
4. Compute the p-value.
5. Draw your conclusion.
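
The five steps above can be sketched end to end as follows. The two samples are hypothetical, drawn from a seeded random generator rather than from the AQI data, and the variable names are my own.

```python
import numpy as np
from scipy import stats

# Hypothetical samples from a seeded generator; not real AQI data
rng = np.random.default_rng(seed=0)
sample_a = rng.normal(loc=12, scale=4, size=15)
sample_b = rng.normal(loc=9, scale=2, size=40)

# 1. H0: the two population means are equal; HA: they differ.
# 2. Set the significance level.
significance_level = 0.05
# 3. Two independent samples, possibly unequal variances: Welch's t-test.
# 4. Compute the p-value.
result = stats.ttest_ind(a=sample_a, b=sample_b, equal_var=False)
# 5. Draw the conclusion by comparing the p-value with the significance level.
if result.pvalue < significance_level:
    conclusion = "reject the null hypothesis"
else:
    conclusion = "fail to reject the null hypothesis"

print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}: {conclusion}")
```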

### Hypothesis 1: ROA is considering a metropolitan-focused approach. Within California, they want to know if the mean AQI in Los Angeles County is statistically different from the rest of California.

Before proceeding with the analysis, it will be helpful to subset the data for the comparison.

In [31]:
# Create dataframes for each sample being compared in the test

california_los_angeles = aqi[aqi['county_name']=='Los Angeles']
california_other = aqi[(aqi['state_name']=='California') & (aqi['county_name']!='Los Angeles')]
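
Before testing, it can help to confirm that the two subsets split the California rows cleanly. The sketch below does this on a hypothetical four-row frame. The extra state filter on the Los Angeles subset is an optional precaution (an assumption on my part, in case another state has a county of the same name) and is not part of the cell above.

```python
import pandas as pd

# Hypothetical rows, made up for illustration
toy = pd.DataFrame({
    "state_name": ["California", "California", "California", "Ohio"],
    "county_name": ["Los Angeles", "Los Angeles", "Fresno", "Belmont"],
    "aqi": [14, 10, 6, 5],
})

is_california = toy["state_name"] == "California"
is_los_angeles = toy["county_name"] == "Los Angeles"

los_angeles = toy[is_california & is_los_angeles]
california_rest = toy[is_california & ~is_los_angeles]

# Together the two subsets should cover every California row exactly once
assert len(los_angeles) + len(california_rest) == is_california.sum()
print(len(los_angeles), len(california_rest))
```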

#### Formulate the hypothesis:

**Formulate the null and alternative hypotheses:**

*   $H_0$: There is no difference in the mean AQI between Los Angeles County and the rest of California.
*   $H_A$: There is a difference in the mean AQI between Los Angeles County and the rest of California.


#### Set the significance level:

In [35]:
# For this analysis, the significance level is 5%

significance_level=0.05

#### Determining the appropriate test procedure:

I am comparing the sample means of two independent samples, so I will use a **two-sample t-test**.

#### Compute the P-value

In [44]:
# Compute the p-value 

stats.ttest_ind(a=california_los_angeles['aqi'],b=california_other['aqi'],equal_var=False)

TtestResult(statistic=2.1107010796372014, pvalue=0.049839056842410995, df=17.08246830361151)

- The p-value for hypothesis 1 is approximately 0.0498, which is just below the 5% significance level, so I reject the null hypothesis.

- Therefore, a metropolitan strategy may make sense in this case.
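
As a cross-check, the reported p-value can be recovered from the t-statistic and degrees of freedom in the `TtestResult` output above. This is a minimal sketch using the Student's t survival function; the variable names are my own.

```python
from scipy import stats

# Values copied from the TtestResult output above
t_statistic = 2.1107010796372014
degrees_of_freedom = 17.08246830361151

# Two-sided p-value: probability mass in both tails beyond |t|
p_two_sided = 2 * stats.t.sf(abs(t_statistic), degrees_of_freedom)
print(f"p = {p_two_sided:.6f}")
```

Because this p-value sits so close to 0.05, the decision is sensitive to the chosen significance level.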

### Hypothesis 2: With limited resources, ROA has to choose between New York and Ohio for their next regional office. Does New York have a lower AQI than Ohio?

Before proceeding with the analysis, it will be helpful to subset the data for the comparison.

In [49]:
# Create dataframes for each sample being compared in the test

new_york=aqi[aqi['state_name']=='New York']
ohio = aqi[aqi['state_name']=='Ohio']

#### Formulate the hypothesis:

**Formulate the null and alternative hypotheses:**

*   $H_0$: The mean AQI of New York is greater than or equal to that of Ohio.
*   $H_A$: The mean AQI of New York is **below** that of Ohio.


#### Significance Level (remains at 5%)

#### Determining the appropriate test procedure:

I am comparing the sample means of two independent samples in one direction, so I will use a **one-sided two-sample t-test**.

#### Compute the P-value

In [21]:
# Compute the p-value

stats.ttest_ind(a=new_york['aqi'],b=ohio['aqi'],alternative='less',equal_var=False)

Ttest_indResult(statistic=-2.025951038880333, pvalue=0.030446502691934697)

- With a p-value (0.030) below 0.05 (the 5% significance level) and a negative t-statistic (-2.026), I **reject the null hypothesis in favor of the alternative hypothesis**.

- Therefore, I can conclude at the 5% significance level that New York has a lower mean AQI than Ohio.
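
To show how `alternative='less'` relates to the default two-sided test, the sketch below runs both on the same hypothetical samples (made-up values, not the New York or Ohio data). It assumes a SciPy version that supports the `alternative` argument, as the cell above already does.

```python
import math
from scipy import stats

# Hypothetical samples, made up for illustration
sample_a = [3, 4, 2, 5, 3, 4, 2, 3]
sample_b = [5, 6, 4, 7, 5, 8, 6, 5, 4, 6]

two_sided = stats.ttest_ind(a=sample_a, b=sample_b, equal_var=False)
one_sided = stats.ttest_ind(a=sample_a, b=sample_b, equal_var=False, alternative="less")

# The statistic is identical; only the tail used for the p-value changes.
# When t < 0, the 'less' p-value is half of the two-sided p-value.
print(two_sided.statistic, two_sided.pvalue, one_sided.pvalue)
print(math.isclose(one_sided.pvalue, two_sided.pvalue / 2))
```

If the t-statistic were positive instead, the `'less'` p-value would be above 0.5, so the sign of the statistic matters for a directional test.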

###  Hypothesis 3: A new policy will affect those states with a mean AQI of 10 or greater. Will Michigan be affected by this new policy?

Before proceeding with the analysis, it will be helpful to subset the data for the comparison.

In [22]:
# Create dataframes for each sample being compared in the test

michigan = aqi[aqi['state_name']=='Michigan']

#### Formulate the hypothesis:

**Formulate the null and alternative hypotheses:**

*   $H_0$: The mean AQI of Michigan is less than or equal to 10.
*   $H_A$: The mean AQI of Michigan is greater than 10.


#### Significance Level (remains at 5%)

#### Determining the appropriate test procedure:

I am comparing one sample mean against a fixed value in one direction, so I will use a **one-sided one-sample t-test**.

#### Compute the P-value

In [23]:
# Compute the p-value

stats.ttest_1samp(michigan['aqi'],10,alternative='greater')

Ttest_1sampResult(statistic=-1.7395913343286131, pvalue=0.9399405193140109)

- With a p-value (0.940) greater than 0.05 (the 5% significance level) and a negative t-statistic (-1.74), I **fail to reject the null hypothesis**.

- Therefore, I cannot conclude at the 5% significance level that Michigan's mean AQI is greater than 10. This implies that Michigan would most likely not be affected by the new policy.
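
The sketch below reproduces a one-sided one-sample t-test by hand and compares it with `stats.ttest_1samp`. The readings are hypothetical (not Michigan's actual values), chosen so that the sample mean falls below the threshold, as the negative t-statistic above suggests is the case for Michigan.

```python
import math
import statistics
from scipy import stats

# Hypothetical AQI readings, made up for illustration
sample = [7, 9, 12, 6, 8, 10, 5, 9]
threshold = 10

n = len(sample)
standard_error = statistics.stdev(sample) / math.sqrt(n)
t_manual = (statistics.mean(sample) - threshold) / standard_error
# Upper-tail probability, matching HA: mean > threshold
p_manual = stats.t.sf(t_manual, df=n - 1)

result = stats.ttest_1samp(sample, threshold, alternative="greater")
print(t_manual, p_manual, result.statistic, result.pvalue)
```

When the sample mean is below the threshold, the t-statistic is negative and the upper-tail p-value exceeds 0.5, so a "greater" alternative cannot be supported.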

## Step 4. Results and Evaluation

Now that I've completed the statistical tests, I can consider my hypotheses and the results gathered.

#### **Did the results show that the AQI in Los Angeles County was statistically different from the rest of California?**

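Yes. At the 5% significance level, the mean AQI in Los Angeles County was statistically different from the mean AQI in the rest of California, although the p-value (about 0.0498) is only marginally below the threshold.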

#### **Did New York or Ohio have a lower AQI?**

Using a 5% significance level, I can conclude that New York has a lower mean AQI than Ohio based on the results.

#### **Will Michigan be affected by the new policy impacting states with a mean AQI of 10 or greater?**



Based on the tests, I would fail to reject the null hypothesis, meaning I can't conclude that the mean AQI is greater than 10. Thus, it is unlikely that Michigan would be affected by the new policy.
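
All three decisions above follow the same rule: reject the null hypothesis when the p-value is below the significance level. A minimal sketch of that rule, applied to the p-values reported in the test outputs (the helper name `decide` is hypothetical):

```python
def decide(p_value, significance_level=0.05):
    """Return the test decision for a given p-value."""
    if p_value < significance_level:
        return "reject H0"
    return "fail to reject H0"


# p-values copied from the test outputs reported above
reported_p_values = {
    "Hypothesis 1 (Los Angeles vs. rest of California)": 0.049839056842410995,
    "Hypothesis 2 (New York < Ohio)": 0.030446502691934697,
    "Hypothesis 3 (Michigan > 10)": 0.9399405193140109,
}

decisions = {name: decide(p) for name, p in reported_p_values.items()}
for name, decision in decisions.items():
    print(f"{name}: {decision}")
```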

# Conclusion

**Key Insights:**

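- Statistical Significance of AQI Levels: The analysis found that the mean AQI for Los Angeles County is statistically different from the mean AQI across the rest of California at a 5% significance level. The positive t-statistic points to a higher sample mean in Los Angeles County, but the p-value (about 0.0498) is only marginally below the threshold, so this result should be treated with some caution.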

- Comparative Analysis Between States: The findings demonstrated that New York has a lower mean AQI compared to Ohio, suggesting better air quality in New York, which may inform public health strategies and resource allocation.

- Limitations in Conclusions: The analysis was unable to establish, at the 5% significance level, that Michigan's mean AQI exceeds a value of 10. This uncertainty highlights a need for further investigation and potentially larger sample sizes to draw more definitive conclusions.

- Importance of Statistical Context: Presenting the null and alternative hypotheses for each test, along with the p-values and types of tests conducted, provides stakeholders with essential context to understand the implications of the findings and their significance in air quality management.
