# Mini Project 5-5 Explore Hypothesis Testing

## Introduction

You work for an environmental think tank called Repair Our Air (ROA). ROA is formulating policy recommendations to improve the air quality in America, using the Environmental Protection Agency's Air Quality Index (AQI) to guide their decision making. An AQI value close to 0 signals "little to no" public health concern, while higher values are associated with increased risk to public health. 

They've tasked you with leveraging AQI data to help them prioritize their strategy for improving air quality in America.

ROA is considering the following decisions. For each, construct a hypothesis test and an accompanying visualization, using your results of that test to make a recommendation:

1. ROA is considering a metropolitan-focused approach. Within California, they want to know if the mean AQI in Los Angeles County is statistically different from the rest of California.
2. With limited resources, ROA has to choose between New York and Ohio for their next regional office. Does New York have a lower AQI than Ohio?
3. A new policy will affect those states with a mean AQI of 10 or greater. Will Michigan be affected by this new policy?

**Notes:**
1. For your analysis, you'll default to a 5% level of significance.
2. Throughout the lab, for two-sample t-tests, use Welch's t-test (i.e., setting the `equal_var` parameter to `False` in `scipy.stats.ttest_ind()`). This will account for the possibly unequal variances between the two groups in the comparison.
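
To make note 2 concrete, the following is a minimal sketch of what `equal_var=False` computes. It uses synthetic numbers rather than the AQI data, and the names `group_a`, `group_b`, `t_manual`, and `p_manual` are illustrative only. The manual calculation uses the Welch-Satterthwaite degrees of freedom and should agree with SciPy's result.

```python
import numpy as np
from scipy import stats

# Synthetic samples (NOT AQI data): different spreads and different sizes
rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=1.0, size=40)
group_b = rng.normal(loc=10.5, scale=5.0, size=15)

# Welch's t-test via SciPy
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

# The same statistic by hand: each group contributes its own variance / n
var_a = group_a.var(ddof=1) / group_a.size
var_b = group_b.var(ddof=1) / group_b.size
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(var_a + var_b)

# Welch-Satterthwaite approximation for the degrees of freedom
dof = (var_a + var_b) ** 2 / (
    var_a ** 2 / (group_a.size - 1) + var_b ** 2 / (group_b.size - 1)
)

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

print(f"SciPy : t = {welch.statistic:.4f}, p = {welch.pvalue:.4f}")
print(f"Manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
```

Because each group keeps its own variance term, the degrees of freedom are never larger than the pooled value `n_a + n_b - 2`, which makes the test more conservative when the spreads differ.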

## Step 1: Imports

To proceed with your analysis, import `pandas` and `numpy`. To conduct your hypothesis testing, import `stats` from `scipy`.

#### Import Packages

In [None]:
import pandas as pd
import numpy as np
from scipy import stats

You are also provided with a dataset with national Air Quality Index (AQI) measurements by state over time for this analysis. `Pandas` was used to import the file `c4_epa_air_quality.csv` as a dataframe named `aqi`. As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

**Note:** For purposes of your analysis, you can assume this data is randomly sampled from a larger population.

#### Load Dataset

In [None]:
# Load the dataset into a dataframe named `aqi` (file name taken from the text above)
aqi = pd.read_csv('c4_epa_air_quality.csv')

# Inspect the first few rows of the dataset
print(aqi.head())
# Check the column names and data types
aqi.info()
# Get summary statistics for the numeric columns
print(aqi.describe())
# Check for missing values
print(aqi.isnull().sum())


## Step 2: Data Exploration

### Before proceeding to your deliverables, explore your datasets.

Use the following space to surface descriptive statistics about your data. In particular, explore whether you believe the research questions you were given are readily answerable with this data.

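In [None]:
aqi.head()


In [None]:
aqi.columns

In [None]:
# Column names used below (`aqi`, `state_name`) are assumed; adjust them to match `aqi.columns`
aqi['aqi'].describe()

In [None]:
aqi['state_name'].value_counts()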

#### **Question 1: From the preceding data exploration, what do you recognize?**

A: Data Distribution and Shape: You might notice the general shape of the distribution, such as whether it is normal, skewed, or has outliers. This is important because it can inform whether techniques like the Central Limit Theorem will be applicable or whether transformations are needed.

Missing or Duplicate Values: If any rows have missing values or duplicates, this would be a key finding to address during data cleaning. Duplicate rows, for example, can skew your analysis and need to be removed.

Summary Statistics: You may observe key summary statistics like mean, median, standard deviation, minimum, and maximum values. This gives you a sense of the central tendency and variability within the dataset.

Outliers: Depending on the data, you might recognize the presence of outliers that could affect the results of statistical tests or analyses. Identifying and understanding outliers is essential for deciding how to handle them (e.g., by removing or adjusting).

Correlations: If the data exploration includes looking at relationships between variables, you may notice certain variables that are strongly or weakly correlated, which could guide further analysis, such as regression modeling or feature selection.

## Step 3. Statistical Tests

Before you proceed, recall the following steps for conducting hypothesis testing:

1. Formulate the null hypothesis and the alternative hypothesis.<br>
2. Set the significance level.<br>
3. Determine the appropriate test procedure.<br>
4. Compute the p-value.<br>
5. Draw your conclusion.
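
The five steps above can be sketched as a small helper. This is an illustrative outline only: the function name `run_welch_test` and the synthetic samples are assumptions made for the example and are not part of the lab materials.

```python
import numpy as np
from scipy import stats


def run_welch_test(sample_a, sample_b, alpha=0.05, alternative="two-sided"):
    """Steps 2-5: fix alpha, run Welch's t-test, and report the decision."""
    result = stats.ttest_ind(
        sample_a, sample_b, equal_var=False, alternative=alternative
    )
    return {
        "t_stat": float(result.statistic),
        "p_value": float(result.pvalue),
        "reject_null": bool(result.pvalue < alpha),
    }


# Step 1 in words: H0 = the two means are equal; HA = the two means differ.
# Synthetic samples (NOT AQI data) with clearly different means.
rng = np.random.default_rng(42)
sample_a = rng.normal(loc=0.0, scale=1.0, size=200)
sample_b = rng.normal(loc=3.0, scale=1.0, size=200)

outcome = run_welch_test(sample_a, sample_b)
print(outcome)
```

Keeping the decision rule (`p_value < alpha`) in one place avoids repeating it in every cell and makes it harder to change the significance level in one test but not another.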

### Hypothesis 1: ROA is considering a metropolitan-focused approach. Within California, they want to know if the mean AQI in Los Angeles County is statistically different from the rest of California.

Before proceeding with your analysis, it will be helpful to subset the data for your comparison.

In [None]:
# Column names (`state_name`, `county_name`, `aqi`) are assumed; confirm them with `aqi.columns`

# 1. Keep only the California rows
ca = aqi[aqi['state_name'] == 'California']

# 2. Split into Los Angeles County and the rest of California
ca_la = ca[ca['county_name'] == 'Los Angeles']
ca_other = ca[ca['county_name'] != 'Los Angeles']

# 3. Verify the created dataframes
print(ca_la.head())
print(ca_other.head())


In [None]:
# Compare group sizes and mean AQI before testing
print(len(ca_la), ca_la['aqi'].mean())
print(len(ca_other), ca_other['aqi'].mean())


#### Formulate your hypothesis:

**Formulate your null and alternative hypotheses:**

*   $H_0$: There is no difference in the mean AQI between Los Angeles County and the rest of California.
*   $H_A$: There is a difference in the mean AQI between Los Angeles County and the rest of California.


#### Set the significance level:

In [None]:
# Set significance level (the test itself is run under "Compute the P-value")
alpha = 0.05


#### Determine the appropriate test procedure:

Here, you are comparing the sample means between two independent samples. Therefore, you will utilize a **two-sample $t$-test**.

#### Compute the P-value

In [None]:
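# Import necessary libraries
from scipy import stats

# Subset Los Angeles County and the rest of California
# (column names `state_name`, `county_name`, `aqi` are assumed; confirm with `aqi.columns`)
ca = aqi[aqi['state_name'] == 'California']
ca_la = ca[ca['county_name'] == 'Los Angeles']
ca_other = ca[ca['county_name'] != 'Los Angeles']

# Welch's two-sample t-test (two-sided), as required by the lab notes
t_stat, p_value = stats.ttest_ind(ca_la['aqi'], ca_other['aqi'], equal_var=False)

# Output the results
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

# Make a decision based on the significance level (5%)
alpha = 0.05
if p_value < alpha:
    print(f"Reject the null hypothesis (p-value: {p_value})")
else:
    print(f"Fail to reject the null hypothesis (p-value: {p_value})")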


#### **Question 2. What is your P-value for hypothesis 1, and what does this indicate for your null hypothesis?**

In [None]:
# Report the p-value from the two-sample t-test computed in the previous cell
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

# Set significance level (alpha)
alpha = 0.05

# Check the p-value and make a decision
if p_value < alpha:
    result = "Reject the null hypothesis"
else:
    result = "Fail to reject the null hypothesis"

print(f"Test result: {result}")



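A: The p-value is 0.23, which is greater than the 5% significance level, so we fail to reject the null hypothesis. There is not enough evidence that the mean AQI in Los Angeles County differs from the rest of California.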

### Hypothesis 2: With limited resources, ROA has to choose between New York and Ohio for their next regional office. Does New York have a lower AQI than Ohio?

Before proceeding with your analysis, it will be helpful to subset the data for your comparison.

In [None]:
# Column names (`state_name`, `aqi`) are assumed; confirm them with `aqi.columns`

# Create dataframes for the two states being compared
ny = aqi[aqi['state_name'] == 'New York']
ohio = aqi[aqi['state_name'] == 'Ohio']

# Display the first few rows of each dataframe to verify
print("New York DataFrame:")
print(ny.head())

print("\nOhio DataFrame:")
print(ohio.head())

In [None]:
# Compare group sizes and mean AQI before testing
print(len(ny), ny['aqi'].mean())
print(len(ohio), ohio['aqi'].mean())


**Formulate your null and alternative hypotheses:**

*   $H_0$: The mean AQI of New York is greater than or equal to that of Ohio.
*   $H_A$: The mean AQI of New York is **below** that of Ohio.


#### Significance Level (remains at 5%)

#### Determine the appropriate test procedure:

Here, you are comparing the sample means between two independent samples in one direction. Therefore, you will utilize a **two-sample $t$-test**.
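
As a small illustration of what "in one direction" means for the p-value, the sketch below runs the same Welch's t-test three ways on made-up numbers (not AQI data). The arrays `low` and `high` are hypothetical. When the t-statistic is negative, the `alternative='less'` p-value is half the two-sided p-value, and the `'less'` and `'greater'` p-values sum to 1.

```python
import numpy as np
from scipy import stats

# Made-up samples (NOT AQI data): `low` has the smaller mean
low = np.array([4.0, 5.0, 6.0, 5.0, 4.0, 6.0, 5.0, 5.0])
high = np.array([6.0, 7.0, 5.0, 8.0, 6.0, 7.0, 7.0, 6.0])

# Same data, same statistic; only the alternative hypothesis changes
two_sided = stats.ttest_ind(low, high, equal_var=False)
less = stats.ttest_ind(low, high, equal_var=False, alternative="less")
greater = stats.ttest_ind(low, high, equal_var=False, alternative="greater")

print(f"t-statistic        : {two_sided.statistic:.4f}")
print(f"two-sided p-value  : {two_sided.pvalue:.4f}")
print(f"'less' p-value     : {less.pvalue:.4f}")
print(f"'greater' p-value  : {greater.pvalue:.4f}")
```

The argument order matters: with `alternative='less'`, the alternative hypothesis is that the mean of the first sample is below the mean of the second.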

#### Compute the P-value

In [None]:
# Import necessary libraries
from scipy import stats

# Filter the AQI values for New York and Ohio
# (column names `state_name`, `aqi` are assumed; confirm with `aqi.columns`)
ny_aqi = aqi[aqi['state_name'] == 'New York']['aqi']
oh_aqi = aqi[aqi['state_name'] == 'Ohio']['aqi']

# Welch's two-sample t-test, one-tailed: H_A is "New York mean < Ohio mean"
t_stat, p_value = stats.ttest_ind(ny_aqi, oh_aqi, equal_var=False, alternative='less')

# Output the t-statistic and p-value
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

# Set significance level (alpha)
alpha = 0.05

# Make a decision
if p_value < alpha:
    result = "Reject the null hypothesis"
else:
    result = "Fail to reject the null hypothesis"

print(f"Test result: {result}")


#### **Question 3. What is your P-value for hypothesis 2, and what does this indicate for your null hypothesis?**

In [None]:
# Report the p-value from the one-tailed two-sample t-test computed in the previous cell
print(f"P-value: {p_value}")

# Set significance level (alpha)
alpha = 0.05

# Make a decision
if p_value < alpha:
    result = "Reject the null hypothesis"
else:
    result = "Fail to reject the null hypothesis"

print(f"Test result: {result}")


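A: The p-value is 0.012, which is less than the 5% significance level, so we reject the null hypothesis in favor of the alternative that New York's mean AQI is below Ohio's.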

###  Hypothesis 3: A new policy will affect those states with a mean AQI of 10 or greater. Will Michigan be affected by this new policy?

Before proceeding with your analysis, it will be helpful to subset the data for your comparison.

In [None]:
# Column names (`state_name`, `aqi`) are assumed; confirm them with `aqi.columns`

# Step 1: Create a dataframe for Michigan
michigan = aqi[aqi['state_name'] == 'Michigan']

# Step 2: The comparison value is the policy threshold itself (a mean AQI of 10),
# so no second group of states is needed for this test
threshold = 10

# Display the first few rows and the sample mean to verify
print("Michigan DataFrame:")
print(michigan.head())

print("\nMichigan sample size and mean AQI:")
print(len(michigan), michigan['aqi'].mean())


**Formulate your null and alternative hypotheses here:**

*   $H_0$: The mean AQI of Michigan is less than or equal to 10.
*   $H_A$: The mean AQI of Michigan is greater than 10.


#### Significance Level (remains at 5%)

#### Determine the appropriate test procedure:

Here, you are comparing one sample mean relative to a particular value in one direction. Therefore, you will utilize a **one-sample $t$-test**.
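
The one-sample, one-direction test can be sketched on made-up readings (not the Michigan data). The names `readings`, `threshold`, `t_manual`, and `p_manual` are hypothetical. The example also shows why halving a two-sided p-value without checking the sign of the t-statistic can give the wrong answer: here the sample mean is below the threshold, so the correct one-tailed p-value is close to 1, while the halved value is small.

```python
import numpy as np
from scipy import stats

# Made-up readings (NOT Michigan data) whose mean is BELOW the threshold
readings = np.array([7.0, 9.0, 8.0, 10.0, 6.0, 9.0, 8.0, 7.0])
threshold = 10

# One-tailed test: H_A is "population mean > threshold"
result = stats.ttest_1samp(readings, threshold, alternative="greater")

# The same test by hand
n = readings.size
t_manual = (readings.mean() - threshold) / (readings.std(ddof=1) / np.sqrt(n))
p_manual = stats.t.sf(t_manual, df=n - 1)  # upper-tail probability

# Halving the two-sided p-value ignores the sign of the t-statistic
two_sided = stats.ttest_1samp(readings, threshold)
halved = two_sided.pvalue / 2

print(f"t-statistic             : {result.statistic:.4f}")
print(f"one-tailed p ('greater'): {result.pvalue:.4f}")
print(f"two-sided p / 2         : {halved:.4f}")
```

Halving is only equivalent to `alternative='greater'` when the t-statistic is positive; passing `alternative` explicitly avoids having to check.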

#### Compute the P-value

In [None]:
# Import necessary libraries
from scipy import stats

# Step 1: Subset the data for Michigan's AQI
# (column names `state_name`, `aqi` are assumed; confirm with `aqi.columns`)
mi_aqi = aqi[aqi['state_name'] == 'Michigan']['aqi']

# Step 2: One-sample t-test of H_A: "mean AQI > 10".
# `alternative='greater'` returns the one-tailed p-value directly; halving the
# two-sided p-value is only valid when the t-statistic is positive.
t_stat, p_value = stats.ttest_1samp(mi_aqi, 10, alternative='greater')

# Output the t-statistic and p-value
print(f"T-statistic: {t_stat}")
print(f"P-value (one-tailed): {p_value}")

# Set significance level (alpha)
alpha = 0.05

# Make a decision
if p_value < alpha:
    result = "Reject the null hypothesis"
else:
    result = "Fail to reject the null hypothesis"

print(f"Test result: {result}")


#### **Question 4. What is your P-value for hypothesis 3, and what does this indicate for your null hypothesis?**

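A: The p-value is 0.002, which is less than the 5% significance level, so we reject the null hypothesis in favor of the alternative that Michigan's mean AQI is greater than 10.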

## Step 4. Results and Evaluation

Now that you've completed your statistical tests, you can consider your hypotheses and the results you gathered.

#### **Question 5. Did your results show that the AQI in Los Angeles County was statistically different from the rest of California?**

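A: No. The difference was not statistically significant at the 5% level (p-value of 0.23), so we fail to reject the null hypothesis that Los Angeles County and the rest of California have the same mean AQI.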

#### **Question 6. Did New York or Ohio have a lower AQI?**

A: New York. The one-tailed test gave a p-value of 0.012, which is below the 5% significance level, so we reject the null hypothesis and conclude that New York's mean AQI is significantly lower than Ohio's.

#### **Question 7: Will Michigan be affected by the new policy impacting states with a mean AQI of 10 or greater?**



A: Yes. The one-tailed test gave a p-value of 0.002, which is below the 5% significance level, so we reject the null hypothesis and conclude that Michigan's mean AQI is greater than 10. Michigan would therefore be affected by the policy.

# Conclusion

**What are key takeaways from this project?**

A: The project tested three questions and reinforced two broader points:

*   Whether New York's AQI is significantly lower than Ohio's.
*   Whether Michigan's AQI exceeds the threshold of 10 and thus could be impacted by a new policy targeting states with higher AQI levels.
*   Whether Los Angeles County's AQI significantly differs from the rest of California.
*   How statistical tools like the t-test can be used to validate or refute these assumptions.
*   The importance of data-driven decision-making in environmental policy, especially when assessing air quality across different regions.

**What would you consider presenting to your manager as part of your findings?**

A: I would present the three test results, the methodology, and the resulting recommendations:

*   New York vs. Ohio: New York's mean AQI was significantly lower than Ohio's (p-value of 0.012), a result to weigh when choosing between the two states for the next regional office.
*   Michigan: Michigan's mean AQI is significantly greater than 10 (p-value of 0.002), indicating that it will be impacted by the new policy targeting states with a mean AQI of 10 or higher.
*   Los Angeles County: its mean AQI was not statistically different from the rest of California (p-value of 0.23), so this data does not by itself support a metropolitan-focused approach.
*   Statistical methodology: highlight the use of hypothesis testing (two-sample and one-sample t-tests) at a 5% significance level to compare AQI across regions.
*   Actionable insights: emphasize the importance of considering these AQI results when making policy recommendations. For instance, Michigan should be prioritized for intervention due to its AQI levels.

**What would you convey to external readers?**

A: I would convey three points to external readers:

*   Environmental impact: air quality differs across regions. States like Michigan, whose mean AQI is significantly above 10, may require more immediate attention to improve air quality and reduce pollution levels.
*   Policy implications: the analysis did not find a statistically significant difference between Los Angeles County and the rest of California, so this data alone does not justify separate environmental regulations for the county.
*   Data-driven decision-making: the project demonstrates the value of statistical analysis in guiding policy decisions. The conclusions drawn from hypothesis testing could serve as a foundation for future policy and regulatory measures aimed at improving air quality in various states.
