# Mini Project 5-5 Explore Hypothesis Testing

## Introduction

You work for an environmental think tank called Repair Our Air (ROA). ROA is formulating policy recommendations to improve the air quality in America, using the Environmental Protection Agency's Air Quality Index (AQI) to guide their decision making. An AQI value close to 0 signals "little to no" public health concern, while higher values are associated with increased risk to public health. 

They've tasked you with leveraging AQI data to help them prioritize their strategy for improving air quality in America.

ROA is considering the following decisions. For each, construct a hypothesis test and an accompanying visualization, then use the results of that test to make a recommendation:

1. ROA is considering a metropolitan-focused approach. Within California, they want to know if the mean AQI in Los Angeles County is statistically different from the rest of California.
2. With limited resources, ROA has to choose between New York and Ohio for their next regional office. Does New York have a lower AQI than Ohio?
3. A new policy will affect those states with a mean AQI of 10 or greater. Will Michigan be affected by this new policy?

**Notes:**
1. For your analysis, you'll default to a 5% level of significance.
2. Throughout the lab, for two-sample t-tests, use Welch's t-test (i.e., setting the `equal_var` parameter to `False` in `scipy.stats.ttest_ind()`). This will account for the possibly unequal variances between the two groups in the comparison.
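
To make Note 2 concrete, here is a minimal sketch that runs the same comparison with and without `equal_var=False` and recomputes Welch's statistic by hand. The two samples are made-up readings, not values from the lab's dataset, and were chosen only so that their variances differ.

```python
import numpy as np
from scipy import stats

# Hypothetical AQI-like readings for two groups with visibly different spread
group_a = [5, 6, 7, 6, 5, 7, 6, 6]
group_b = [2, 14, 9, 1, 18, 4, 11]

# Pooled (Student's) t-test: assumes both groups share one variance
pooled = stats.ttest_ind(group_a, group_b, equal_var=True)

# Welch's t-test: estimates each group's variance separately
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

# Welch's statistic by hand: difference in means over the unpooled standard error
a = np.asarray(group_a, dtype=float)
b = np.asarray(group_b, dtype=float)
se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
t_manual = (a.mean() - b.mean()) / se

print(f"Pooled: t = {pooled.statistic:.3f}, p = {pooled.pvalue:.4f}")
print(f"Welch:  t = {welch.statistic:.3f}, p = {welch.pvalue:.4f}")
print(f"Manual Welch t = {t_manual:.3f}")
```

When the group sizes and variances differ, the two tests generally give different statistics and p-values, which is why the choice of `equal_var` matters.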

## Step 1: Imports

To proceed with your analysis, import `pandas` and `numpy`. To conduct your hypothesis testing, import `stats` from `scipy`.

#### Import Packages

In [1]:
# Import relevant packages
import numpy as np
import pandas as pd
from scipy import stats

You are also provided with a dataset of national Air Quality Index (AQI) measurements by state over time. `pandas` was used to import the file `c4_epa_air_quality.csv` as a dataframe named `aqi`. As shown in this cell, the dataset has been loaded automatically. You do not need to download the .csv file or write more code to access the dataset and proceed with this lab. Continue by completing the following instructions.

**Note:** For purposes of your analysis, you can assume this data is randomly sampled from a larger population.

#### Load Dataset

In [2]:
# IMPORT YOUR DATA
education_districtwise = pd.read_csv("education_districtwise.csv")
education_districtwise = education_districtwise.dropna()
education_districtwise.head()

Unnamed: 0,DISTNAME,STATNAME,BLOCKS,VILLAGES,CLUSTERS,TOTPOPULAT,OVERALL_LI
0,DISTRICT32,STATE1,13,391,104,875564.0,66.92
1,DISTRICT649,STATE1,18,678,144,1015503.0,66.93
2,DISTRICT229,STATE1,8,94,65,1269751.0,71.21
3,DISTRICT259,STATE1,13,523,104,735753.0,57.98
4,DISTRICT486,STATE1,8,359,64,570060.0,65.0


## Step 2: Data Exploration

### Before proceeding to your deliverables, explore your datasets.

Use the following space to surface descriptive statistics about your data. In particular, explore whether you believe the research questions you were given are readily answerable with this data.

In [3]:
# Use head() to show a sample of data
# (the packages and the cleaned dataframe are already available from the cells above)
education_districtwise.head()



Unnamed: 0,DISTNAME,STATNAME,BLOCKS,VILLAGES,CLUSTERS,TOTPOPULAT,OVERALL_LI
0,DISTRICT32,STATE1,13,391,104,875564.0,66.92
1,DISTRICT649,STATE1,18,678,144,1015503.0,66.93
2,DISTRICT229,STATE1,8,94,65,1269751.0,71.21
3,DISTRICT259,STATE1,13,523,104,735753.0,57.98
4,DISTRICT486,STATE1,8,359,64,570060.0,65.0


In [8]:
# Check the variables: data types and non-null counts for each column
education_districtwise.info()


<class 'pandas.core.frame.DataFrame'>
Index: 634 entries, 0 to 679
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   DISTNAME    634 non-null    object 
 1   STATNAME    634 non-null    object 
 2   BLOCKS      634 non-null    int64  
 3   VILLAGES    634 non-null    int64  
 4   CLUSTERS    634 non-null    int64  
 5   TOTPOPULAT  634 non-null    float64
 6   OVERALL_LI  634 non-null    float64
dtypes: float64(2), int64(3), object(2)
memory usage: 39.6+ KB
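
One way to judge whether the research questions are answerable is to summarize AQI per state and confirm that every state named in the questions appears in the data. The sketch below does this on a toy table. The column names `state_name`, `county_name`, and `aqi`, and all values, are assumptions made for illustration and may not match the real file.

```python
import pandas as pd

# Toy stand-in for the AQI table; column names and values are hypothetical
aqi = pd.DataFrame({
    "state_name": ["California", "California", "Ohio", "Ohio",
                   "New York", "New York", "Michigan", "Michigan"],
    "county_name": ["Los Angeles", "San Diego", "Franklin", "Cuyahoga",
                    "Kings", "Erie", "Wayne", "Kent"],
    "aqi": [14, 9, 7, 5, 3, 4, 8, 6],
})

# Count, mean, and spread of AQI per state
summary = aqi.groupby("state_name")["aqi"].agg(["count", "mean", "std"])
print(summary)

# Check that each state named in the research questions is present
needed = ["California", "New York", "Ohio", "Michigan"]
missing = [s for s in needed if s not in set(aqi["state_name"])]
print("Missing states:", missing)
```

A state with very few rows, or none at all, would be a sign that the corresponding question cannot be answered reliably from this data.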


## Step 3: Statistical Tests

Before you proceed, recall the following steps for conducting hypothesis testing:

1. Formulate the null hypothesis and the alternative hypothesis.
2. Set the significance level.
3. Determine the appropriate test procedure.
4. Compute the p-value.
5. Draw your conclusion.

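### Hypothesis 1: ROA is considering a metropolitan-focused approach. Within California, they want to know if the mean AQI in Los Angeles County is statistically different from the rest of California.

**Formulate your null and alternative hypotheses:**

*   $H_0$: There is no difference in the mean AQI between Los Angeles County and the rest of California.
*   $H_A$: There is a difference in the mean AQI between Los Angeles County and the rest of California.

#### Set the significance level: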

In [9]:
# For this analysis, the significance level is 5%
import numpy as np
import scipy.stats as stats

# Sample data (example)
sample_data = [23, 21, 18, 22, 24, 19, 20, 25, 23, 22]

# Population mean (the value you're testing against)
population_mean = 20

# Conduct the t-test
t_statistic, p_value = stats.ttest_1samp(sample_data, population_mean)

# Define the significance level (alpha = 5%)
alpha = 0.05

# Compare the p-value with the significance level
if p_value < alpha:
    print(f"Reject the null hypothesis. p-value: {p_value}")
else:
    print(f"Fail to reject the null hypothesis. p-value: {p_value}")


Reject the null hypothesis. p-value: 0.03807165500472797


#### Determine the appropriate test procedure:

Here, you are comparing the sample means between two independent samples. Therefore, you will utilize a **two-sample $t$-test**.

#### Compute the P-value

In [10]:
# Compute your p-value here
import numpy as np
import scipy.stats as stats

# Sample data (example)
sample_data = [23, 21, 18, 22, 24, 19, 20, 25, 23, 22]

# Population mean (the value you're testing against)
population_mean = 20

# Conduct the t-test
t_statistic, p_value = stats.ttest_1samp(sample_data, population_mean)

# Print the p-value
print(f"P-value: {p_value}")

# Define the significance level (alpha = 5%)
alpha = 0.05

# Compare the p-value with the significance level
if p_value < alpha:
    print(f"Reject the null hypothesis. p-value: {p_value}")
else:
    print(f"Fail to reject the null hypothesis. p-value: {p_value}")


P-value: 0.03807165500472797
Reject the null hypothesis. p-value: 0.03807165500472797
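
The cell above runs a one-sample test on example values. For the Los Angeles County question, a two-sample comparison might instead look like the following sketch, which splits California rows into two groups with boolean masks and applies a two-sided Welch's t-test. The column names `state_name`, `county_name`, and `aqi`, and all values, are hypothetical.

```python
import pandas as pd
from scipy import stats

# Toy stand-in for the AQI table; column names and values are hypothetical
aqi = pd.DataFrame({
    "state_name": ["California"] * 8 + ["Ohio"] * 2,
    "county_name": ["Los Angeles"] * 4
                   + ["San Diego", "Fresno", "Kern", "Alameda"]
                   + ["Franklin", "Cuyahoga"],
    "aqi": [15, 18, 13, 16, 6, 9, 11, 5, 4, 7],
})

# Boolean masks split California into Los Angeles County and everything else
in_ca = aqi["state_name"] == "California"
in_la = aqi["county_name"] == "Los Angeles"
la = aqi.loc[in_ca & in_la, "aqi"]
rest_of_ca = aqi.loc[in_ca & ~in_la, "aqi"]

# Two-sided Welch's t-test: H0 says the two group means are equal
t_stat, p_value = stats.ttest_ind(la, rest_of_ca, equal_var=False)

alpha = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

Filtering on the state first keeps rows from other states out of the "rest of California" group.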


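### Hypothesis 2: With limited resources, ROA has to choose between New York and Ohio for their next regional office. Does New York have a lower AQI than Ohio?

**Formulate your null and alternative hypotheses:**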

*   $H_0$: The mean AQI of New York is greater than or equal to that of Ohio.
*   $H_A$: The mean AQI of New York is **below** that of Ohio.


#### Significance Level (remains at 5%)

#### Determine the appropriate test procedure:

Here, you are comparing the sample means between two independent samples in one direction. Therefore, you will utilize a **two-sample $t$-test**.

#### Compute the P-value

In [13]:
# Compute your p-value here
#P-value: 0.03807165500472797

In [11]:
# Your code here.
import numpy as np
import scipy.stats as stats

# Sample data (example)
sample_data = [23, 21, 18, 22, 24, 19, 20, 25, 23, 22]

# Population mean (the value you're testing against)
population_mean = 20

# Conduct the t-test
t_statistic, p_value = stats.ttest_1samp(sample_data, population_mean)

# Print the p-value
print(f"P-value: {p_value}")

# Define the significance level (alpha = 5%)
alpha = 0.05

# Compare the p-value with the significance level
if p_value < alpha:
    print(f"Reject the null hypothesis. p-value: {p_value}")
else:
    print(f"Fail to reject the null hypothesis. p-value: {p_value}")


P-value: 0.03807165500472797
Reject the null hypothesis. p-value: 0.03807165500472797
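
Because the alternative hypothesis for New York versus Ohio is directional, the test itself should be one-sided. The sketch below assumes SciPy 1.6 or later, where `ttest_ind` accepts an `alternative` argument. The two samples are hypothetical stand-ins for the state subsets.

```python
import numpy as np
from scipy import stats

# Hypothetical AQI samples; real values would come from the state subsets
ny = [2, 3, 1, 4, 2, 3, 2]
ohio = [4, 3, 5, 2, 6, 4, 5, 3]

# One-sided Welch's t-test. H_A: mean(ny) < mean(ohio)
one_sided = stats.ttest_ind(ny, ohio, equal_var=False, alternative="less")

# Two-sided version of the same test, for comparison
two_sided = stats.ttest_ind(ny, ohio, equal_var=False)

print(f"t = {one_sided.statistic:.3f}")
print(f"one-sided p = {one_sided.pvalue:.4f}")
print(f"two-sided p = {two_sided.pvalue:.4f}")

# When t is negative, the 'less' p-value is half the two-sided p-value
print(np.isclose(one_sided.pvalue, two_sided.pvalue / 2))
```

The order of the arguments matters: with `alternative="less"`, the first sample is the one hypothesized to have the smaller mean.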


### Hypothesis 3: A new policy will affect those states with a mean AQI of 10 or greater. Will Michigan be affected by this new policy?

Before proceeding with your analysis, it will be helpful to subset the data for your comparison.

In [14]:

import pandas as pd
import numpy as np
import scipy.stats as stats

# Sample data for two districts (for example, AQI values)
district1_data = [23, 21, 18, 22, 24, 19, 20, 25, 23, 22]
district2_data = [30, 28, 32, 35, 29, 31, 33, 34, 29, 30]

# Create separate dataframes for each sample
df_district1 = pd.DataFrame(district1_data, columns=['AQI'])
df_district2 = pd.DataFrame(district2_data, columns=['AQI'])

# Print out the dataframes
print("District 1 DataFrame:")
print(df_district1)
print("\nDistrict 2 DataFrame:")
print(df_district2)

# Perform a two-sample t-test (assuming equal variances)
t_statistic, p_value = stats.ttest_ind(district1_data, district2_data)

# Print the p-value
print(f"\nP-value for Two-Sample t-test: {p_value}")

# Define the significance level (alpha = 5%)
alpha = 0.05

# Compare the p-value with the significance level
if p_value < alpha:
    print(f"Reject the null hypothesis. p-value: {p_value}")
else:
    print(f"Fail to reject the null hypothesis. p-value: {p_value}")


District 1 DataFrame:
   AQI
0   23
1   21
2   18
3   22
4   24
5   19
6   20
7   25
8   23
9   22

District 2 DataFrame:
   AQI
0   30
1   28
2   32
3   35
4   29
5   31
6   33
7   34
8   29
9   30

P-value for Two-Sample t-test: 2.9370658093914443e-08
Reject the null hypothesis. p-value: 2.9370658093914443e-08


**Formulate your null and alternative hypotheses here:**

*   $H_0$: The mean AQI of Michigan is less than or equal to 10.
*   $H_A$: The mean AQI of Michigan is greater than 10.


#### Significance Level (remains at 5%)

#### Determine the appropriate test procedure:

Here, you are comparing one sample mean relative to a particular value in one direction. Therefore, you will utilize a **one-sample $t$-test**.

#### Compute the P-value

In [15]:
# Compute your p-value here
import numpy as np
import scipy.stats as stats

# Sample data (e.g., AQI values for a district)
sample_data = [23, 21, 18, 22, 24, 19, 20, 25, 23, 22]

# Population mean (the value you're testing against, e.g., a known value)
population_mean = 20

# Conduct the one-sample t-test
t_statistic, p_value = stats.ttest_1samp(sample_data, population_mean)

# Print the p-value
print(f"P-value: {p_value}")

# Define the significance level (alpha = 5%)
alpha = 0.05

# Compare the p-value with the significance level
if p_value < alpha:
    print(f"Reject the null hypothesis. p-value: {p_value}")
else:
    print(f"Fail to reject the null hypothesis. p-value: {p_value}")



P-value: 0.03807165500472797
Reject the null hypothesis. p-value: 0.03807165500472797
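
The cell above tests example values against 20 with a two-sided alternative. For the hypotheses stated for Michigan, the comparison value is 10 and the alternative is "greater than". The sketch below shows that variant and checks the statistic by hand. It assumes SciPy 1.6 or later for the `alternative` argument, and the readings are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical AQI readings for Michigan; real values would come from the dataset
michigan = [12, 9, 11, 14, 8, 10, 13, 12]
threshold = 10  # policy cutoff from the research question

# One-sided, one-sample t-test. H_A: mean AQI > threshold
result = stats.ttest_1samp(michigan, popmean=threshold, alternative="greater")

# Same statistic by hand: (sample mean - threshold) / standard error of the mean
x = np.asarray(michigan, dtype=float)
t_manual = (x.mean() - threshold) / (x.std(ddof=1) / np.sqrt(x.size))

alpha = 0.05
print(f"t = {result.statistic:.3f} (manual: {t_manual:.3f})")
print(f"one-sided p = {result.pvalue:.4f}")
print("Reject H0" if result.pvalue < alpha else "Fail to reject H0")
```

Rejecting $H_0$ here would support the claim that the mean AQI exceeds 10. Failing to reject means only that the sample does not provide enough evidence for that claim.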
