
# Activity: Explore hypothesis testing

## Introduction

You work for an environmental think tank called Repair Our Air (ROA). ROA is formulating policy recommendations to improve the air quality in America, using the Environmental Protection Agency's Air Quality Index (AQI) to guide their decision making. An AQI value close to 0 signals "little to no" public health concern, while higher values are associated with increased risk to public health.

They've tasked you with leveraging AQI data to help them prioritize their strategy for improving air quality in America.

ROA is considering the following decisions. For each, construct a hypothesis test and an accompanying visualization, using your results of that test to make a recommendation:

1. ROA is considering a metropolitan-focused approach. Within California, they want to know if the mean AQI in Los Angeles County is statistically different from the rest of California.
2. With limited resources, ROA has to choose between New York and Ohio for their next regional office. Does New York have a lower AQI than Ohio?
3. A new policy will affect those states with a mean AQI of 10 or greater. Will Michigan be affected by this new policy?

**Notes:**
1. For your analysis, you'll default to a 5% level of significance.
2. Throughout the lab, for two-sample t-tests, use Welch's t-test (i.e., setting the `equal_var` parameter to `False` in `scipy.stats.ttest_ind()`). This will account for the possibly unequal variances between the two groups in the comparison.
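To make the Welch adjustment in note 2 concrete, the sketch below computes the Welch t statistic and the Welch-Satterthwaite degrees of freedom by hand, using only the standard library. The two samples are made-up numbers, not values from the AQI dataset, and the helper name `welch_t` is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, degrees of freedom) for Welch's unequal-variance t-test."""
    n_a, n_b = len(a), len(b)
    # Squared standard error of each group mean (sample variance divides by n - 1).
    se2_a = variance(a) / n_a
    se2_b = variance(b) / n_b
    t_stat = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, df


# Hypothetical samples for illustration only (NOT taken from the AQI data).
group_a = [12, 15, 9, 20, 14]
group_b = [8, 7, 10, 6, 9, 11]

t_stat, df = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```

The degrees of freedom are generally not an integer, which is why the `df` reported by `scipy.stats.ttest_ind()` with `equal_var=False` can be fractional.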

## Step 1: Imports

To proceed with your analysis, import `pandas` and `numpy`. To conduct your hypothesis testing, import `stats` from `scipy`.

#### Import Packages

In [2]:
# Import relevant packages
import pandas as pd
import numpy as np
from scipy import stats
### YOUR CODE HERE ###

You are also provided with a dataset with national Air Quality Index (AQI) measurements by state over time for this analysis. `Pandas` was used to import the file `c4_epa_air_quality.csv` as a dataframe named `aqi`. As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

**Note:** For purposes of your analysis, you can assume this data is randomly sampled from a larger population.

#### Load Dataset

In [3]:
# RUN THIS CELL TO IMPORT YOUR DATA.
%cd /content/drive/MyDrive/Coursera/Files_11
### YOUR CODE HERE ###
aqi = pd.read_csv('c4_epa_air_quality.csv')

/content/drive/MyDrive/Coursera/Files_11


In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Step 2: Data Exploration

### Before proceeding to your deliverables, explore your datasets.

Use the following space to surface descriptive statistics about your data. In particular, explore whether you believe the research questions you were given are readily answerable with this data.
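One way to judge whether the research questions are answerable is to check how many observations each group contributes, since very small groups give a t-test little power. The sketch below is an illustration only: it builds a tiny dataframe from a few rows that appear in the outputs of this notebook, and the names `sample` and `by_state` are hypothetical.

```python
import pandas as pd

# A few rows copied from outputs displayed in this notebook. The real `aqi`
# dataframe has 260 rows, so these counts are illustrative only.
sample = pd.DataFrame({
    "state_name": ["Ohio", "Ohio", "Ohio", "Michigan", "Michigan", "Arizona"],
    "county_name": ["Belmont", "Hamilton", "Stark", "Wayne", "Wayne", "Maricopa"],
    "aqi": [5, 3, 6, 5, 8, 7],
})

# Number of observations and mean AQI per state.
by_state = sample.groupby("state_name")["aqi"].agg(["count", "mean"])
print(by_state)
```

Running the same `groupby` on the full `aqi` dataframe would show the sample size behind each state in the three research questions.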

In [5]:
# Explore your dataframe `aqi` here:
aqi.head()

### YOUR CODE HERE ###

Unnamed: 0.1,Unnamed: 0,date_local,state_name,county_name,city_name,local_site_name,parameter_name,units_of_measure,arithmetic_mean,aqi
0,0,2018-01-01,Arizona,Maricopa,Buckeye,BUCKEYE,Carbon monoxide,Parts per million,0.473684,7
1,1,2018-01-01,Ohio,Belmont,Shadyside,Shadyside,Carbon monoxide,Parts per million,0.263158,5
2,2,2018-01-01,Wyoming,Teton,Not in a city,Yellowstone National Park - Old Faithful Snow ...,Carbon monoxide,Parts per million,0.111111,2
3,3,2018-01-01,Pennsylvania,Philadelphia,Philadelphia,North East Waste (NEW),Carbon monoxide,Parts per million,0.3,3
4,4,2018-01-01,Iowa,Polk,Des Moines,CARPENTER,Carbon monoxide,Parts per million,0.215789,3


In [None]:
aqi.describe()


Unnamed: 0.1,Unnamed: 0,arithmetic_mean,aqi
count,260.0,260.0,260.0
mean,129.5,0.403169,6.757692
std,75.199734,0.317902,7.061707
min,0.0,0.0,0.0
25%,64.75,0.2,2.0
50%,129.5,0.276315,5.0
75%,194.25,0.516009,9.0
max,259.0,1.921053,50.0


In [None]:
aqi.value_counts()


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,Unnamed: 7_level_0,Unnamed: 8_level_0,Unnamed: 9_level_0,count
Unnamed: 0,date_local,state_name,county_name,city_name,local_site_name,parameter_name,units_of_measure,arithmetic_mean,aqi,Unnamed: 10_level_1
0,2018-01-01,Arizona,Maricopa,Buckeye,BUCKEYE,Carbon monoxide,Parts per million,0.473684,7,1
131,2018-01-01,Arizona,Maricopa,Phoenix,CENTRAL PHOENIX,Carbon monoxide,Parts per million,1.110526,27,1
165,2018-01-01,Utah,Weber,Ogden,Ogden,Carbon monoxide,Parts per million,0.326316,7,1
166,2018-01-01,New Jersey,Hudson,Jersey City,Jersey City,Carbon monoxide,Parts per million,0.133333,3,1
167,2018-01-01,New York,New York,New York,CCNY,Carbon monoxide,Parts per million,0.200000,2,1
...,...,...,...,...,...,...,...,...,...,...
93,2018-01-01,Oklahoma,Oklahoma,Oklahoma City,Near Road,Carbon monoxide,Parts per million,0.284211,5,1
94,2018-01-01,Florida,Pinellas,Saint Petersburg,Sawgrass Lake Park (Near-Road),Carbon monoxide,Parts per million,0.315789,9,1
95,2018-01-01,Colorado,Mesa,Grand Junction,GRAND JUNCTION - PITKIN,Carbon monoxide,Parts per million,0.305263,6,1
96,2018-01-01,Minnesota,Dakota,Inver Grove Heights (RR name Inver Grove),Flint Hills Refinery 423,Carbon monoxide,Parts per million,0.200000,2,1


In [None]:
aqi['aqi'].value_counts()
aqi.shape

(260, 10)

In [None]:
# more descriptive statistics
aqi.mode().iloc[0]

Unnamed: 0,0
Unnamed: 0,0
date_local,2018-01-01
state_name,California
county_name,Los Angeles
city_name,Not in a city
local_site_name,Kapolei
parameter_name,Carbon monoxide
units_of_measure,Parts per million
arithmetic_mean,0.2
aqi,2.0


In [None]:
# Use a name other than `range` so the Python built-in is not shadowed
aqi_range = aqi['aqi'].max() - aqi['aqi'].min()
print(aqi_range)

50


In [None]:
iqr = aqi['aqi'].quantile(0.75) - aqi['aqi'].quantile(0.25)
print(iqr)

7.0


In [None]:
skewness = aqi['aqi'].skew()
skewness

2.686105789368923

In [None]:
# pandas reports excess kurtosis (a normal distribution scores 0)
kurtosis = aqi['aqi'].kurtosis()
kurtosis

9.755537864959088

<details>
  <summary><h4><strong>HINT 1</strong></h4></summary>

  Consider referring to the material on descriptive statistics.
</details>

<details>
  <summary><h4><strong>HINT 2</strong></h4></summary>

  Consider using `pandas` or `numpy` to explore the `aqi` dataframe.
</details>

<details>
  <summary><h4><strong>HINT 3</strong></h4></summary>

Any of the following functions may be useful:
- `pandas`: `describe()`, `value_counts()`, `shape` (an attribute, not a method), `head()`
- `numpy`: `unique()`, `mean()`
    
</details>

#### **Question 1: From the preceding data exploration, what do you recognize?**

- **Median (5.0):** The midpoint of the AQI values is 5, so about half of the observations are at or below 5 and half are at or above it. The mean (about 6.76) is higher than the median, which is consistent with a right-skewed distribution.
- **Mode (2):** The most frequently occurring AQI value is 2.
- **Range (50):** The maximum AQI is 50 and the minimum is 0, so the values span 50 points.
- **Interquartile range (7):** The middle 50% of AQI values lie between the 25th percentile (2) and the 75th percentile (9), indicating moderate variability in the central part of the data.
- **Skewness (2.69):** The positive skewness indicates a right-skewed distribution: most AQI values are low, with a long tail of higher values.
- **Kurtosis (9.76):** `pandas` reports excess kurtosis, so a value this far above 0 indicates much heavier tails than a normal distribution and suggests the presence of outliers or extreme AQI values.

## Step 3. Statistical Tests

Before you proceed, recall the following steps for conducting hypothesis testing:

1. Formulate the null hypothesis and the alternative hypothesis.<br>
2. Set the significance level.<br>
3. Determine the appropriate test procedure.<br>
4. Compute the p-value.<br>
5. Draw your conclusion.
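The final step above can be sketched as a small helper that compares a p-value with the significance level. The function name `decide` and the example p-values are hypothetical, and the sketch assumes the common convention of rejecting only when the p-value is strictly below the significance level.

```python
def decide(p_value, alpha=0.05):
    """Return the conclusion of a hypothesis test for a given p-value."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    # Assumed convention: reject only when p is strictly below alpha.
    return "reject H0" if p_value < alpha else "fail to reject H0"


# Hypothetical p-values for illustration.
print(decide(0.03))
print(decide(0.94))
```

"Fail to reject" is deliberate wording: a large p-value means the data do not contradict the null hypothesis, not that the null hypothesis has been proven.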

### Hypothesis 1: ROA is considering a metropolitan-focused approach. Within California, they want to know if the mean AQI in Los Angeles County is statistically different from the rest of California.

Before proceeding with your analysis, it will be helpful to subset the data for your comparison.

In [6]:
# Create dataframes for each sample being compared in your test
la_county = aqi[aqi['county_name'] == 'Los Angeles']
print(la_county)
### YOUR CODE HERE ###

     Unnamed: 0  date_local  state_name  county_name         city_name  \
33           33  2018-01-01  California  Los Angeles         Lancaster   
42           42  2018-01-01  California  Los Angeles     Santa Clarita   
61           61  2018-01-01  California  Los Angeles          Pasadena   
76           76  2018-01-01  California  Los Angeles       Los Angeles   
109         109  2018-01-01  California  Los Angeles       Los Angeles   
110         110  2018-01-01  California  Los Angeles       Los Angeles   
119         119  2018-01-01  California  Los Angeles            Reseda   
132         132  2018-01-01  California  Los Angeles           Compton   
163         163  2018-01-01  California  Los Angeles             Azusa   
172         172  2018-01-01  California  Los Angeles       Pico Rivera   
177         177  2018-01-01  California  Los Angeles        Long Beach   
189         189  2018-01-01  California  Los Angeles            Pomona   
233         233  2018-01-01  Californi

In [7]:
rest_of_ca = aqi[(aqi['state_name'] == 'California') & (aqi['county_name'] != 'Los Angeles')]

print(rest_of_ca)

     Unnamed: 0  date_local  state_name     county_name  \
16           16  2018-01-01  California  San Bernardino   
18           18  2018-01-01  California      Sacramento   
26           26  2018-01-01  California          Orange   
27           27  2018-01-01  California         Alameda   
34           34  2018-01-01  California          Fresno   
40           40  2018-01-01  California       San Mateo   
43           43  2018-01-01  California    Contra Costa   
45           45  2018-01-01  California           Butte   
46           46  2018-01-01  California       Riverside   
58           58  2018-01-01  California            Kern   
62           62  2018-01-01  California         Alameda   
63           63  2018-01-01  California      Sacramento   
75           75  2018-01-01  California     San Joaquin   
77           77  2018-01-01  California       Riverside   
81           81  2018-01-01  California   Santa Barbara   
86           86  2018-01-01  California     Santa Clara 

<details>
  <summary><h4><strong>HINT 1</strong></h4></summary>

  Consider referencing the material on subsetting dataframes.  
</details>

<details>
  <summary><h4><strong>HINT 2</strong></h4></summary>

  Consider creating two dataframes, one for Los Angeles, and one for all other California observations.
</details>

<details>
  <summary><h4><strong>HINT 3</strong></h4></summary>

For your first dataframe, filter to `county_name` of `Los Angeles`. For your second dataframe, filter to `state_name` of `California` and `county_name` not equal to `Los Angeles`.
    
</details>

#### Formulate your hypothesis:

**Formulate your null and alternative hypotheses:**

*   $H_0$: There is no difference in the mean AQI between Los Angeles County and the rest of California.
*   $H_A$: There is a difference in the mean AQI between Los Angeles County and the rest of California.


#### Set the significance level:

In [None]:
# For this analysis, the significance level is 5%
level_of_significance = 0.05
### YOUR CODE HERE

#### Determine the appropriate test procedure:

Here, you are comparing the sample means between two independent samples. Therefore, you will utilize a **two-sample  𝑡-test**.

#### Compute the P-value

In [None]:
# Compute your p-value here
### YOUR CODE HERE ###

In [9]:
t_test = stats.ttest_ind(a = la_county['aqi'], b= rest_of_ca['aqi'] , equal_var=False)
t_test

TtestResult(statistic=2.1107010796372014, pvalue=0.049839056842410995, df=17.08246830361151)

<details>
  <summary><h4><strong>HINT 1</strong></h4></summary>

  Consider referencing the material on how to perform a two-sample t-test.
</details>

<details>
  <summary><h4><strong>HINT 2</strong></h4></summary>

  In `ttest_ind()`, a is the aqi column from our "Los Angeles" dataframe, and b is the aqi column from the "Other California" dataframe.
</details>

<details>
  <summary><h4><strong>HINT 3</strong></h4></summary>

  Be sure to set `equal_var` = False.

</details>

#### **Question 2. What is your P-value for hypothesis 1, and what does this indicate for your null hypothesis?**

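The p-value is approximately 0.0498, which is just below the 5% significance level. We therefore reject the null hypothesis in favour of the alternative hypothesis: the mean AQI in Los Angeles County differs from that of the rest of California. The positive test statistic (about 2.11) indicates that the Los Angeles County sample mean is the higher of the two. Because the p-value is so close to 0.05, this conclusion should be treated with some caution.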
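The activity also asks for a visualization to accompany each test. A minimal sketch of one option, a side-by-side box plot, is shown here. The AQI values are hypothetical placeholders rather than the actual subsets; in the notebook, `la_county['aqi']` and `rest_of_ca['aqi']` would take their place.

```python
import matplotlib.pyplot as plt

# Hypothetical AQI values for illustration only (NOT the actual subsets).
groups = {
    "Los Angeles County": [10, 14, 12, 18, 9, 16],
    "Rest of California": [6, 9, 5, 11, 8, 7, 10],
}

fig, ax = plt.subplots(figsize=(6, 4))
# One box per group; showmeans=True adds a marker at each group mean.
ax.boxplot(list(groups.values()), showmeans=True)
ax.set_xticks([1, 2])
ax.set_xticklabels(list(groups.keys()))
ax.set_ylabel("AQI")
ax.set_title("AQI by group (hypothetical data)")
fig.tight_layout()
```

A box plot shows the spread and any outliers in each group alongside the means, which helps a reader judge how much the two distributions overlap.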

### Hypothesis 2: With limited resources, ROA has to choose between New York and Ohio for their next regional office. Does New York have a lower AQI than Ohio?

Before proceeding with your analysis, it will be helpful to subset the data for your comparison.

In [15]:
# Create dataframes for each sample being compared in your test
New_york_df = aqi[aqi['state_name'] == 'New York']
New_york_df.head()

Ohio_df = aqi[aqi['state_name'] == 'Ohio']
Ohio_df.head()
### YOUR CODE HERE ###

Unnamed: 0.1,Unnamed: 0,date_local,state_name,county_name,city_name,local_site_name,parameter_name,units_of_measure,arithmetic_mean,aqi
1,1,2018-01-01,Ohio,Belmont,Shadyside,Shadyside,Carbon monoxide,Parts per million,0.263158,5
12,12,2018-01-01,Ohio,Hamilton,Cincinnati,Taft NCore,Carbon monoxide,Parts per million,0.252632,3
22,22,2018-01-01,Ohio,Stark,Canton,Canton,Carbon monoxide,Parts per million,0.394737,6
51,51,2018-01-01,Ohio,Summit,Akron,NIHF STEM MS,Carbon monoxide,Parts per million,0.083333,3
59,59,2018-01-01,Ohio,Cuyahoga,Cleveland,GT Craig NCore,Carbon monoxide,Parts per million,0.25,3


<details>
  <summary><h4><strong>HINT 1</strong></h4></summary>

  Consider referencing the materials on subsetting dataframes.  
</details>

<details>
  <summary><h4><strong>HINT 2</strong></h4></summary>

  Consider creating two dataframes, one for New York, and one for Ohio observations.
</details>

<details>
  <summary><h4><strong>HINT 3</strong></h4></summary>

For your first dataframe, filter to `state_name` of `New York`. For your second dataframe, filter to `state_name` of `Ohio`.
    
</details>

#### Formulate your hypothesis:

**Formulate your null and alternative hypotheses:**

*   $H_0$: The mean AQI of New York is greater than or equal to that of Ohio.
*   $H_A$: The mean AQI of New York is **below** that of Ohio.


#### Significance Level (remains at 5%)

#### Determine the appropriate test procedure:

Here, you are comparing the sample means between two independent samples in one direction. Therefore, you will utilize a **two-sample  𝑡-test**.
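To illustrate what testing "in one direction" changes, the sketch below runs Welch's t-test twice on the same data, once two-sided and once with `alternative="less"`. The samples are hypothetical and are not the New York or Ohio AQI values.

```python
import numpy as np
from scipy import stats

# Hypothetical samples for illustration only (NOT the New York / Ohio data).
group_a = np.array([2, 3, 1, 4, 2, 3])
group_b = np.array([4, 5, 3, 6, 5, 4, 7])

two_sided = stats.ttest_ind(group_a, group_b, equal_var=False)
one_sided = stats.ttest_ind(group_a, group_b, equal_var=False, alternative="less")

# The t statistic is identical; only the p-value changes with the alternative.
print(two_sided.statistic, two_sided.pvalue)
print(one_sided.statistic, one_sided.pvalue)
```

When the t statistic is negative, as it is here, the `alternative="less"` p-value is half of the two-sided p-value. If the statistic were positive, the one-sided p-value would instead be greater than 0.5.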

#### Compute the P-value

In [20]:
# Compute your p-value here
tstat, pvalue = stats.ttest_ind(a = New_york_df['aqi'], b = Ohio_df['aqi'], alternative= 'less' , equal_var = False)
tstat, pvalue
### YOUR CODE HERE ###

(-2.025951038880333, 0.03044650269193468)

<details>
  <summary><h4><strong>HINT 1</strong></h4></summary>

  Consider referencing the material on how to perform a two-sample t-test.
</details>

<details>
  <summary><h4><strong>HINT 2</strong></h4></summary>

  In `ttest_ind()`, a is the aqi column from the "New York" dataframe, and b is the aqi column from the "Ohio" dataframe.
</details>

<details>
  <summary><h4><strong>HINT 3</strong></h4></summary>

  You can assign `tstat`, `pvalue` to the output of `ttest_ind`. Be sure to include `alternative = 'less'` as part of your code.  

</details>

#### **Question 3. What is your P-value for hypothesis 2, and what does this indicate for your null hypothesis?**

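The p-value is approximately 0.030, which is below the 5% significance level, so we reject the null hypothesis.
The data support the alternative hypothesis that the mean AQI of New York is below that of Ohio.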

###  Hypothesis 3: A new policy will affect those states with a mean AQI of 10 or greater. Will Michigan be affected by this new policy?

Before proceeding with your analysis, it will be helpful to subset the data for your comparison.

In [22]:
# Create dataframes for each sample being compared in your test
michigan_df = aqi[aqi['state_name'] == 'Michigan']
michigan_df.head()
### YOUR CODE HERE ###

Unnamed: 0.1,Unnamed: 0,date_local,state_name,county_name,city_name,local_site_name,parameter_name,units_of_measure,arithmetic_mean,aqi
65,65,2018-01-01,Michigan,Wayne,Livonia,LIVONIA-NR,Carbon monoxide,Parts per million,0.338889,5
122,122,2018-01-01,Michigan,Wayne,Detroit,West corner,Carbon monoxide,Parts per million,0.394737,8
123,123,2018-01-01,Michigan,Wayne,Detroit,MARK TWAIN MIDDLE SCHOOL,Carbon monoxide,Parts per million,0.515789,9
129,129,2018-01-01,Michigan,Wayne,Detroit,ELIZA-NR,Carbon monoxide,Parts per million,0.616667,11
192,192,2018-01-01,Michigan,Wayne,Allen Park,Allen Park,Carbon monoxide,Parts per million,0.811111,13


<details>
  <summary><h4><strong>HINT 1</strong></h4></summary>

  Consider referencing the material on subsetting dataframes.  
</details>

<details>
  <summary><h4><strong>HINT 2</strong></h4></summary>

  Consider creating one dataframe which only includes Michigan.
</details>

#### Formulate your hypothesis:

**Formulate your null and alternative hypotheses here:**

*   $H_0$: The mean AQI of Michigan is less than or equal to 10.
*   $H_A$: The mean AQI of Michigan is greater than 10.


#### Significance Level (remains at 5%)

#### Determine the appropriate test procedure:

Here, you are comparing one sample mean relative to a particular value in one direction. Therefore, you will utilize a **one-sample  𝑡-test**.
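The one-sample t statistic can be worked out by hand as the distance between the sample mean and the threshold, measured in standard errors. The sketch below uses only the five Michigan AQI values displayed by `head()` above. The full Michigan subset may contain more rows, so this will not necessarily reproduce the statistic from `ttest_1samp`. The variable names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# AQI values from the five Michigan rows that `head()` displays in this notebook.
# Illustrative only: the full Michigan subset may contain more rows.
michigan_head_aqi = [5, 8, 9, 11, 13]
threshold = 10  # policy threshold from the research question

n = len(michigan_head_aqi)
# Standard error of the mean: sample standard deviation divided by sqrt(n).
standard_error = stdev(michigan_head_aqi) / sqrt(n)
t_stat = (mean(michigan_head_aqi) - threshold) / standard_error
print(f"mean = {mean(michigan_head_aqi):.1f}, t = {t_stat:.3f}, df = {n - 1}")
```

A negative statistic means the sample mean lies below the threshold, which already points away from the alternative hypothesis that the mean is greater than 10.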

#### Compute the P-value

In [26]:
# Compute your p-value here
tstat, pvalue = stats.ttest_1samp(a = michigan_df['aqi'], popmean = 10, alternative = 'greater')
tstat, pvalue
### YOUR CODE HERE ###

(-1.7395913343286131, 0.9399405193140109)

<details>
  <summary><h4><strong>HINT 1</strong></h4></summary>

  Consider referencing the material on how to perform a one-sample t-test.
</details>

<details>
  <summary><h4><strong>HINT 2</strong></h4></summary>

  In `ttest_1samp()`, you are comparing the aqi column from your Michigan data relative to 10, the new policy threshold.
</details>

<details>
  <summary><h4><strong>HINT 3</strong></h4></summary>

  You can assign `tstat`, `pvalue` to the output of `ttest_1samp`. Be sure to include `alternative = 'greater'` as part of your code.  

</details>

#### **Question 4. What is your P-value for hypothesis 3, and what does this indicate for your null hypothesis?**

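The p-value is approximately 0.94, which is far above the 5% significance level, so we fail to reject the null hypothesis.
The sample provides no evidence that the mean AQI of Michigan is greater than 10.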

## Step 4. Results and Evaluation

Now that you've completed your statistical tests, you can consider your hypotheses and the results you gathered.

#### **Question 5. Did your results show that the AQI in Los Angeles County was statistically different from the rest of California?**

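Yes. At the 5% significance level we reject the null hypothesis (p ≈ 0.0498): the mean AQI in Los Angeles County is statistically different from the mean AQI in the rest of California.

The p-value is only marginally below 0.05, so the evidence is not strong.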

#### **Question 6. Did New York or Ohio have a lower AQI?**

New York. We reject the null hypothesis (p ≈ 0.030), so the data support the conclusion that the mean AQI of New York is below that of Ohio.

#### **Question 7: Will Michigan be affected by the new policy impacting states with a mean AQI of 10 or greater?**



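Based on this sample, no. We fail to reject the null hypothesis (p ≈ 0.94): there is no evidence that the mean AQI of Michigan is greater than 10, so Michigan is not expected to be affected by the policy.

Failing to reject the null hypothesis does not prove that the mean AQI is at most 10; it only means the data do not contradict it.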

# Conclusion

**What are key takeaways from this lab?**

**What would you consider presenting to your manager as part of your findings?**

**What would you convey to external stakeholders?**


**Congratulations!** You've completed this lab. However, you may not notice a green check mark next to this item on Coursera's platform. Please continue your progress regardless of the check mark. Just click on the "save" icon at the top of this notebook to ensure your work has been logged.