# Activity: Explore hypothesis testing

## Introduction

You work for an environmental think tank called Repair Our Air (ROA). ROA is formulating policy recommendations to improve the air quality in America, using the Environmental Protection Agency's Air Quality Index (AQI) to guide their decision making. An AQI value close to 0 signals "little to no" public health concern, while higher values are associated with increased risk to public health. 

They've tasked you with leveraging AQI data to help them prioritize their strategy for improving air quality in America.

ROA is considering the following decisions. For each, construct a hypothesis test and an accompanying visualization, using your results of that test to make a recommendation:

1. ROA is considering a metropolitan-focused approach. Within California, they want to know if the mean AQI in Los Angeles County is statistically different from the rest of California.
2. With limited resources, ROA has to choose between New York and Ohio for their next regional office. Does New York have a lower AQI than Ohio?
3. A new policy will affect those states with a mean AQI of 10 or greater. Will Michigan be affected by this new policy?

**Notes:**
1. For your analysis, you'll default to a 5% level of significance.
2. Throughout the lab, for two-sample t-tests, use Welch's t-test (i.e., setting the `equal_var` parameter to `False` in `scipy.stats.ttest_ind()`). This will account for the possibly unequal variances between the two groups being compared.

## Step 1: Imports

To proceed with your analysis, import `pandas` and `numpy`. To conduct your hypothesis testing, import `stats` from `scipy`.

#### Import Packages

In [50]:
# Import relevant packages

import pandas as pd
import numpy as np
import seaborn as sns
from scipy import stats

You are also provided with a dataset of national Air Quality Index (AQI) measurements by state over time. The file `c4_epa_air_quality.csv` is loaded with `pandas` as a dataframe named `aqi`.

**Note:** For purposes of your analysis, you can assume this data is randomly sampled from a larger population.

#### Load Dataset

In [51]:
# RUN THIS CELL TO IMPORT YOUR DATA.

aqi = pd.read_csv(r'C:\Users\saswa\Documents\GitHub\Python-For-Data-Analysis\Course-4\Data\shared_data\c4_epa_air_quality.csv')

## Step 2: Data Exploration

### Before proceeding to your deliverables, explore your datasets.

Use the following space to surface descriptive statistics about your data. In particular, explore whether you believe the research questions you were given are readily answerable with this data.

In [52]:
# Explore your dataframe `aqi` here:

print(aqi.head(10))

print(aqi.describe(include='all'))

print(aqi['state_name'].value_counts())

   Unnamed: 0  date_local    state_name   county_name      city_name  \
0           0  2018-01-01       Arizona      Maricopa        Buckeye   
1           1  2018-01-01          Ohio       Belmont      Shadyside   
2           2  2018-01-01       Wyoming         Teton  Not in a city   
3           3  2018-01-01  Pennsylvania  Philadelphia   Philadelphia   
4           4  2018-01-01          Iowa          Polk     Des Moines   
5           5  2018-01-01        Hawaii      Honolulu  Not in a city   
6           6  2018-01-01        Hawaii      Honolulu  Not in a city   
7           7  2018-01-01  Pennsylvania          Erie           Erie   
8           8  2018-01-01        Hawaii      Honolulu       Honolulu   
9           9  2018-01-01      Colorado       Larimer   Fort Collins   

                                     local_site_name   parameter_name  \
0                                            BUCKEYE  Carbon monoxide   
1                                          Shadyside  Carbon 

#### **Question 1: From the preceding data exploration, what do you recognize?**

Based on the preceding data exploration, the three research questions can be answered with the available data. Specifically:

- For the first hypothesis, we have county-level AQI data, which is sufficient to compare the AQI between Los Angeles County and the rest of California.
- For the second hypothesis, both Ohio and New York have a sufficient number of observations in the dataset, allowing for a robust comparison of their AQI levels.
- The third hypothesis about Michigan’s AQI and its potential impact from the new policy can also be assessed, as Michigan's data is available and contains enough information for the analysis.

Overall, the dataset is well-suited to answer the research questions, given the availability of county-level and state-level data for each hypothesis.
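Whether each group has enough observations can be checked directly with a per-group summary. The following is a minimal sketch, not part of the original lab: the `toy` dataframe and its values are made up for illustration, and `group_summary` is a hypothetical helper. Only the column names `state_name`, `county_name`, and `aqi` come from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for `aqi`; every value here is invented for illustration.
toy = pd.DataFrame({
    "state_name": ["California", "California", "California", "Ohio", "Ohio", "New York", "Michigan"],
    "county_name": ["Los Angeles", "Los Angeles", "Fresno", "Belmont", "Franklin", "Kings", "Wayne"],
    "aqi": [18, 14, 9, 12, 8, 5, 7],
})

def group_summary(df, mask, label):
    """Return the count, mean, and sample standard deviation of `aqi` for the rows in `mask`."""
    s = df.loc[mask, "aqi"]
    return {"group": label, "n": int(s.count()), "mean": float(s.mean()), "std": float(s.std())}

is_ca = toy["state_name"] == "California"
is_la = is_ca & (toy["county_name"] == "Los Angeles")

summary = pd.DataFrame([
    group_summary(toy, is_la, "Los Angeles County"),
    group_summary(toy, is_ca & ~is_la, "Rest of California"),
    group_summary(toy, toy["state_name"] == "New York", "New York"),
    group_summary(toy, toy["state_name"] == "Ohio", "Ohio"),
    group_summary(toy, toy["state_name"] == "Michigan", "Michigan"),
])
print(summary)
```

Replacing `toy` with `aqi` would give the real group sizes. A group with only a handful of rows is a reason to treat its test result with caution.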

## Step 3: Statistical Tests

Before you proceed, recall the following steps for conducting hypothesis testing:

1. Formulate the null hypothesis and the alternative hypothesis.<br>
2. Set the significance level.<br>
3. Determine the appropriate test procedure.<br>
4. Compute the p-value.<br>
5. Draw your conclusion.
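The final step is a simple comparison of the p-value against the significance level. As a minimal sketch, the hypothetical helper `decide` below (not part of the original lab) applies that rule to the three p-values printed by the test cells in this notebook.

```python
def decide(p_value, alpha=0.05):
    """Step 5: reject H0 only when the p-value is strictly below alpha."""
    if p_value < alpha:
        return "reject H0"
    return "fail to reject H0"

# p-values copied from the printed outputs of the test cells in this notebook
reported = {
    "Hypothesis 1 (LA County vs. rest of California)": 0.049839056842410995,
    "Hypothesis 2 (New York lower than Ohio)": 0.030446502691934683,
    "Hypothesis 3 (Michigan greater than 10)": 0.9399405193140109,
}

decisions = {name: decide(p) for name, p in reported.items()}
for name, decision in decisions.items():
    print(f"{name}: {decision}")
```

Keeping the rule in one place makes it harder for a written conclusion to drift away from the computed p-value.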

### Hypothesis 1: ROA is considering a metropolitan-focused approach. Within California, they want to know if the mean AQI in Los Angeles County is statistically different from the rest of California.

Before proceeding with your analysis, it will be helpful to subset the data for your comparison.

In [53]:
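# Create dataframes for each sample being compared in your test
# Filter on the state as well as the county, so that only California rows are compared
is_california = aqi['state_name'] == 'California'
ca_la = aqi[is_california & (aqi['county_name'] == 'Los Angeles')]
ca_other = aqi[is_california & (aqi['county_name'] != 'Los Angeles')]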


In [54]:
ca_la['aqi'].mean()

np.float64(16.285714285714285)

In [55]:
ca_other['aqi'].mean()

np.float64(11.0)

#### Formulate your hypothesis:

**Formulate your null and alternative hypotheses:**

*   $H_0$: There is no difference in the mean AQI between Los Angeles County and the rest of California.
*   $H_A$: There is a difference in the mean AQI between Los Angeles County and the rest of California.


#### Set the significance level:

In [56]:
# For this analysis, the significance level is 5%

significance_level=0.05
significance_level

0.05

#### Determine the appropriate test procedure:

Here, you are comparing the sample means between two independent samples. Therefore, you will utilize a **two-sample  𝑡-test**.
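To make the procedure concrete, the sketch below computes Welch's t-statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand, then compares them with `scipy.stats.ttest_ind(..., equal_var=False)`. The function `welch_t` and the synthetic samples `x` and `y` are assumptions made for illustration; they are not the lab's data.

```python
import numpy as np
from scipy import stats

def welch_t(a, b):
    """Return (t, df, two-sided p) for Welch's unequal-variance t-test."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared standard error of each sample mean
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    p = 2 * stats.t.sf(abs(t), df)
    return t, df, p

# Synthetic samples with different sizes and spreads (illustrative only)
rng = np.random.default_rng(0)
x = rng.normal(16, 8, size=14)
y = rng.normal(11, 5, size=60)

t_manual, df_manual, p_manual = welch_t(x, y)
res = stats.ttest_ind(a=x, b=y, equal_var=False)

print(t_manual, res.statistic)
print(p_manual, res.pvalue)
```

The degrees of freedom are generally not an integer, which is why the notebook's own result reports a fractional `df`.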

#### Compute the P-value

In [57]:
# Compute your p-value here

stats.ttest_ind(a=ca_la['aqi'],b=ca_other['aqi'],equal_var=False)


TtestResult(statistic=np.float64(2.1107010796372014), pvalue=np.float64(0.049839056842410995), df=np.float64(17.08246830361151))

#### **Question 2. What is your P-value for hypothesis 1, and what does this indicate for your null hypothesis?**

The p-value for hypothesis 1 is approximately 0.0498, which is just below the significance level of 0.05. We therefore reject the null hypothesis in favor of the alternative hypothesis: the observed difference is statistically significant, although only marginally so.

Therefore, based on the result, we conclude that there is sufficient evidence to suggest that Los Angeles County's AQI is statistically different from the rest of California. As a result, a metropolitan-focused strategy may be a sensible approach for improving air quality in California.
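A confidence interval for the difference in means shows how close to the threshold a result is. The sketch below builds a 95% Welch interval by hand on synthetic data; `welch_ci` is a hypothetical helper and the samples are invented, so the numbers do not reproduce the lab's result. For a two-sided test at the 5% level, the interval excludes 0 exactly when the p-value is below 0.05.

```python
import numpy as np
from scipy import stats

def welch_ci(a, b, confidence=0.95):
    """Confidence interval for mean(a) - mean(b) without assuming equal variances."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    se = np.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    # Two-sided critical value from the t distribution
    margin = stats.t.ppf((1 + confidence) / 2, df) * se
    diff = a.mean() - b.mean()
    return diff - margin, diff + margin

# Synthetic samples (illustrative only)
rng = np.random.default_rng(42)
x = rng.normal(16, 9, size=14)
y = rng.normal(11, 6, size=70)

lower, upper = welch_ci(x, y)
p_value = stats.ttest_ind(a=x, b=y, equal_var=False).pvalue

print(f"95% CI for the difference in means: ({lower:.2f}, {upper:.2f})")
print(f"Two-sided p-value: {p_value:.4f}")
```

An interval whose lower bound sits barely above 0 tells the same story as a p-value barely below 0.05: the difference is detectable, but its size is uncertain.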


### Hypothesis 2: With limited resources, ROA has to choose between New York and Ohio for their next regional office. Does New York have a lower AQI than Ohio?

Before proceeding with your analysis, it will be helpful to subset the data for your comparison.

In [58]:
# Create dataframes for each sample being compared in your test

ny=aqi[aqi['state_name']=='New York']
ohio=aqi[aqi['state_name']=='Ohio']

#### Formulate your hypothesis:

**Formulate your null and alternative hypotheses:**

*   $H_0$: The mean AQI of New York is greater than or equal to that of Ohio.
*   $H_A$: The mean AQI of New York is **below** that of Ohio.


#### Significance Level (remains at 5%)

#### Determine the appropriate test procedure:

Here, you are comparing the sample means between two independent samples in one direction. Therefore, you will utilize a **two-sample  𝑡-test**.
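The one-directional test uses the same t-statistic as the two-sided test; only the p-value changes. As a sketch on synthetic data (the samples `a` and `b` are invented for illustration), the `alternative='less'` p-value is half the two-sided p-value when the statistic is negative, and one minus that half otherwise.

```python
import numpy as np
from scipy import stats

# Synthetic samples (illustrative only)
rng = np.random.default_rng(1)
a = rng.normal(5, 3, size=20)
b = rng.normal(7, 3, size=25)

two_sided = stats.ttest_ind(a=a, b=b, equal_var=False)
one_sided = stats.ttest_ind(a=a, b=b, equal_var=False, alternative='less')

# Derive the one-sided p-value from the two-sided result
if two_sided.statistic < 0:
    expected = two_sided.pvalue / 2
else:
    expected = 1 - two_sided.pvalue / 2

print(one_sided.pvalue, expected)
```

This is why the sign of the t-statistic matters in a one-directional test: a large difference in the wrong direction produces a p-value close to 1, not close to 0.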

#### Compute the P-value

In [59]:
# Compute your p-value here

tstat, pvalue = stats.ttest_ind(a=ny['aqi'], b=ohio['aqi'], alternative='less', equal_var=False)
print(tstat)
print(pvalue)

-2.025951038880333
0.030446502691934683


#### **Question 3. What is your P-value for hypothesis 2, and what does this indicate for your null hypothesis?**

The p-value for hypothesis 2 is 0.0304, which is less than the significance level of 0.05. The t-statistic is negative (-2.026), so the sample mean for New York is below that of Ohio, which is the direction stated in the alternative hypothesis. We therefore reject the null hypothesis in favor of the alternative hypothesis.

This suggests that New York's mean AQI is lower than Ohio's mean AQI, and the observed difference is not likely due to random chance. Therefore, at the 5% significance level, we can confidently conclude that New York has a lower AQI compared to Ohio at the population level.


### Hypothesis 3: A new policy will affect those states with a mean AQI of 10 or greater. Will Michigan be affected by this new policy?

Before proceeding with your analysis, it will be helpful to subset the data for your comparison.

In [60]:
# Create dataframes for each sample being compared in your test

michigan = aqi[aqi['state_name']=='Michigan']

#### Formulate your hypothesis:

**Formulate your null and alternative hypotheses here:**

*   $H_0$: The mean AQI of Michigan is less than or equal to 10.
*   $H_A$: The mean AQI of Michigan is greater than 10.


#### Significance Level (remains at 5%)

#### Determine the appropriate test procedure:

Here, you are comparing one sample mean relative to a particular value in one direction. Therefore, you will utilize a **one-sample  𝑡-test**. 
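For reference, the one-sample statistic is the distance between the sample mean and the benchmark, measured in standard errors. The sketch below computes it by hand on a synthetic sample (invented for illustration, not Michigan's data) and compares it with `scipy.stats.ttest_1samp` using `alternative='greater'`.

```python
import numpy as np
from scipy import stats

# Synthetic sample (illustrative only) and the benchmark value from the policy
rng = np.random.default_rng(7)
sample = rng.normal(9, 3, size=30)
mu0 = 10

n = sample.size
standard_error = sample.std(ddof=1) / np.sqrt(n)
t_manual = (sample.mean() - mu0) / standard_error
# Upper-tail probability, matching H_A: mean > mu0
p_manual = stats.t.sf(t_manual, df=n - 1)

res = stats.ttest_1samp(sample, popmean=mu0, alternative='greater')

print(t_manual, res.statistic)
print(p_manual, res.pvalue)
```

When the sample mean falls below the benchmark, the statistic is negative and the upper-tail p-value exceeds 0.5, so the null hypothesis cannot be rejected in that direction.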

#### Compute the P-value

In [61]:
# Compute your p-value here

tstat1, pvalue1 = stats.ttest_1samp(michigan['aqi'], 10, alternative='greater')
print(tstat1)
print(pvalue1)

-1.7395913343286131
0.9399405193140109


#### **Question 4. What is your P-value for hypothesis 3, and what does this indicate for your null hypothesis?**

The p-value for hypothesis 3 is 0.940, which is greater than the significance level of 0.05. As a result, we fail to reject the null hypothesis. The t-statistic is negative (-1.74), which means the sample mean AQI for Michigan is below 10.

Based on the data, there is insufficient evidence to conclude that Michigan's mean AQI is greater than 10 at the 5% significance level, so Michigan is unlikely to be affected by the new policy. Failing to reject the null hypothesis does not prove that the population mean is at or below 10; it only means this sample provides no evidence that it exceeds the threshold.


#### **Question 5. Did your results show that the AQI in Los Angeles County was statistically different from the rest of California?**

Yes, the results showed that the mean AQI in Los Angeles County was statistically different from the rest of California.

The p-value (approximately 0.0498) was below the significance level (0.05), so we reject the null hypothesis of no difference. The sample mean AQI was 16.29 for Los Angeles County and 11.0 for the rest of California. Because the p-value is only marginally below the threshold, this evidence should be treated as moderate rather than strong.


#### **Question 6. Did New York or Ohio have a lower AQI?**

Using a 5% significance level, we can conclude that New York has a lower AQI than Ohio. 

The mean AQI for New York was found to be lower than that of Ohio, and the difference was statistically significant. Since the alternative hypothesis tested whether New York’s AQI is lower than Ohio's, the results support that New York has a statistically significantly lower AQI than Ohio.


#### **Question 7: Will Michigan be affected by the new policy impacting states with a mean AQI of 10 or greater?**



**Answer**:  
Based on the test results, we fail to reject the null hypothesis: there is no evidence that the mean AQI in Michigan is greater than 10.

The p-value (0.940) was well above the significance level (0.05), and the negative t-statistic shows that Michigan's sample mean AQI was below 10. Therefore, it is unlikely that Michigan would be affected by the new policy targeting states with a mean AQI of 10 or greater.


# Conclusion

**What are key takeaways from this lab?**

**Key Takeaways**:  

- Despite small sample sizes, two of the three tests produced statistically significant results at the 5% significance level.
- Los Angeles County's mean AQI was statistically different from the rest of California, while New York was found to have a lower mean AQI than Ohio.
- We were unable to conclude that Michigan's mean AQI was greater than 10; its sample mean was below 10.

**Additional Insights**:  
- Python proved to be a powerful tool for precise and efficient sampling and hypothesis testing, thanks to its wide variety of libraries and methods.  
- We successfully addressed three key questions from the company, analyzing the AQI data for different counties and states:
  - The mean AQI in Los Angeles County was statistically different from the rest of California, although the p-value (approximately 0.0498) was only marginally below the 5% significance level.
  - New York had a lower mean AQI than Ohio, with the difference being statistically significant.
  - Michigan is unlikely to be affected by the new policy: its sample mean AQI was below 10, and the test found no evidence that its mean AQI is greater than 10.



**What would you consider presenting to your manager as part of your findings?**

**Presentation to Manager:**

- **Introduction**: 
  I would present the findings in the form of a notebook that clearly documents the hypothesis tests conducted, along with the relevant p-values, conclusions, and insights.

- **Key Findings**:
  - **Los Angeles vs. Rest of California**:  
    The hypothesis test indicated that Los Angeles County's mean AQI was **statistically different** from the rest of California at the 5% significance level. The p-value (approximately 0.0498) was only marginally below the threshold, so this result should be presented with that caveat.
  
  - **New York vs. Ohio**:  
    Based on the test results, **New York has a statistically lower mean AQI than Ohio** at the **5% significance level**. The p-value (0.0304) was less than 0.05, indicating a statistically significant difference between the two.

  - **Michigan and the New Policy**:  
    The hypothesis test showed that Michigan's mean AQI was **not statistically greater than 10**, meaning that Michigan is **unlikely to be affected by the new policy**. The p-value (0.940) was greater than the significance level (0.05), and the sample mean was below 10.

- **Statistical Setup**:  
  For each test, I would explain the null and alternative hypotheses, specify whether the test was one-tailed or two-tailed, and describe the t-test configuration used from the `stats` module.  
  This would ensure clarity about the methodology and help contextualize the findings.


**What would you convey to external stakeholders?**

**Presentation to External Stakeholders:**

- **Overview of the Findings**:  
  I would communicate the conclusions based on a 5% significance level, highlighting the results for each of the research questions.

- **Key Insights**:
  - **Los Angeles AQI**:  
    The hypothesis test showed that Los Angeles County's mean AQI is **statistically different** from the rest of California, and its sample mean AQI was higher (16.29 compared with 11.0). This suggests that air quality concerns may be more concentrated in Los Angeles than in other areas of California, so **a metropolitan-focused approach may be a sensible use of resources**. The result was only marginally significant, so it should be confirmed with more data where possible.

  - **Ohio vs. New York AQI**:  
    Ohio’s AQI was found to be **statistically higher** than New York’s, with a significant difference between the two. Therefore, stakeholders should **consider New York for a regional office** if air quality is a factor in site selection.

  - **Michigan’s Likely Impact by the New Policy**:  
    The test results indicated that Michigan’s mean AQI is **not statistically greater than 10**, suggesting that Michigan is **unlikely to be impacted by the new policy**. The sample mean was below the threshold, so the data provide no basis for expecting the policy to apply.

- **Sample Statistics**:  
  Providing sample statistics, such as the AQI values for Los Angeles, New York, Ohio, and Michigan, would offer stakeholders a clearer understanding of the data compared and the significance of the results. This helps contextualize the findings and enables stakeholders to make informed decisions.

This approach helps in translating statistical results into actionable insights and recommendations for the stakeholders.
