# Activity: Explore hypothesis testing

## Introduction

You work for an environmental think tank called Repair Our Air (ROA).
ROA is formulating policy recommendations to improve the air quality in
America, using the Environmental Protection Agency's Air Quality Index
(AQI) to guide their decision making. An AQI value close to 0 signals
"little to no" public health concern, while higher values are associated
with increased risk to public health.

They've tasked you with leveraging AQI data to help them prioritize
their strategy for improving air quality in America.

ROA is considering the following decisions. For each, construct a
hypothesis test and an accompanying visualization, using your results of
that test to make a recommendation:

1.  ROA is considering a metropolitan-focused approach. Within
    California, they want to know if the mean AQI in Los Angeles County
    is statistically different from the rest of California.
2.  With limited resources, ROA has to choose between New York and Ohio
    for their next regional office. Does New York have a lower AQI than
    Ohio?
3.  A new policy will affect those states with a mean AQI of 10 or
    greater. Will Michigan be affected by this new policy?

**Notes:**

1.  For your analysis, you'll default to a 5% level of significance.
2.  Throughout the lab, for two-sample t-tests, use Welch's t-test
    (i.e., setting the `equal_var` parameter to `False` in
    `scipy.stats.ttest_ind()`). This will account for the possibly
    unequal variances between the two groups in the comparison.
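To make the second note concrete, the following is a minimal sketch of what Welch's t-test computes, using only the Python standard library. The helper name `welch_t` and the two small samples are hypothetical and are not taken from the lab dataset; the sketch shows the t-statistic and the Welch-Satterthwaite degrees of freedom, not the p-value.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Return Welch's t-statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Sample variance (n - 1 denominator) of each group, divided by its own size
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    # The variances are NOT pooled, which is what equal_var=False means
    t_stat = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, dof

# Hypothetical AQI readings for two groups (illustrative values only)
group_a = [12, 15, 9, 20, 14, 11]
group_b = [5, 7, 6, 4, 8, 6, 5, 7]

t_stat, dof = welch_t(group_a, group_b)
print(t_stat, dof)
```

The degrees of freedom are generally not an integer and always fall between the smaller group's `n - 1` and the pooled value `n_a + n_b - 2`.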

## Step 1: Imports

To proceed with your analysis, import `pandas` and `numpy`. To conduct
your hypothesis testing, import `stats` from `scipy`.

#### Import Packages

In \[1\]:

    # Import relevant packages
    import pandas as pd
    import numpy as np
    from scipy import stats

You are also provided with a dataset of national Air Quality Index
(AQI) measurements by state over time for this analysis. The file
`c4_epa_air_quality.csv` is loaded with `pandas` as a dataframe named
`aqi` in the following cell, so you do not need to download the .csv
file or write additional code to access the data.

**Note:** For purposes of your analysis, you can assume this data is
randomly sampled from a larger population.

#### Load Dataset

In \[2\]:

    # RUN THIS CELL TO IMPORT YOUR DATA.
    aqi = pd.read_csv('c4_epa_air_quality.csv')

## Step 2: Data Exploration

### Before proceeding to your deliverables, explore your datasets.

Use the following space to surface descriptive statistics about your
data. In particular, explore whether you believe the research questions
you were given are readily answerable with this data.

In \[4\]:

    # Explore your dataframe `aqi` here:
    aqi.head(5)

Out\[4\]:

|     | Unnamed: 0 | date_local | state_name   | county_name  | city_name     | local_site_name                                   | parameter_name  | units_of_measure  | arithmetic_mean | aqi |
|-----|------------|------------|--------------|--------------|---------------|---------------------------------------------------|-----------------|-------------------|-----------------|-----|
| 0   | 0          | 2018-01-01 | Arizona      | Maricopa     | Buckeye       | BUCKEYE                                           | Carbon monoxide | Parts per million | 0.473684        | 7   |
| 1   | 1          | 2018-01-01 | Ohio         | Belmont      | Shadyside     | Shadyside                                         | Carbon monoxide | Parts per million | 0.263158        | 5   |
| 2   | 2          | 2018-01-01 | Wyoming      | Teton        | Not in a city | Yellowstone National Park - Old Faithful Snow ... | Carbon monoxide | Parts per million | 0.111111        | 2   |
| 3   | 3          | 2018-01-01 | Pennsylvania | Philadelphia | Philadelphia  | North East Waste (NEW)                            | Carbon monoxide | Parts per million | 0.300000        | 3   |
| 4   | 4          | 2018-01-01 | Iowa         | Polk         | Des Moines    | CARPENTER                                         | Carbon monoxide | Parts per million | 0.215789        | 3   |

In \[6\]:

    aqi.describe(include='all')

Out\[6\]:

|        | Unnamed: 0 | date_local | state_name | county_name | city_name     | local_site_name | parameter_name  | units_of_measure  | arithmetic_mean | aqi        |
|--------|------------|------------|------------|-------------|---------------|-----------------|-----------------|-------------------|-----------------|------------|
| count  | 260.000000 | 260        | 260        | 260         | 260           | 257             | 260             | 260               | 260.000000      | 260.000000 |
| unique | NaN        | 1          | 52         | 149         | 190           | 253             | 1               | 1                 | NaN             | NaN        |
| top    | NaN        | 2018-01-01 | California | Los Angeles | Not in a city | Kapolei         | Carbon monoxide | Parts per million | NaN             | NaN        |
| freq   | NaN        | 260        | 66         | 14          | 21            | 2               | 260             | 260               | NaN             | NaN        |
| mean   | 129.500000 | NaN        | NaN        | NaN         | NaN           | NaN             | NaN             | NaN               | 0.403169        | 6.757692   |
| std    | 75.199734  | NaN        | NaN        | NaN         | NaN           | NaN             | NaN             | NaN               | 0.317902        | 7.061707   |
| min    | 0.000000   | NaN        | NaN        | NaN         | NaN           | NaN             | NaN             | NaN               | 0.000000        | 0.000000   |
| 25%    | 64.750000  | NaN        | NaN        | NaN         | NaN           | NaN             | NaN             | NaN               | 0.200000        | 2.000000   |
| 50%    | 129.500000 | NaN        | NaN        | NaN         | NaN           | NaN             | NaN             | NaN               | 0.276315        | 5.000000   |
| 75%    | 194.250000 | NaN        | NaN        | NaN         | NaN           | NaN             | NaN             | NaN               | 0.516009        | 9.000000   |
| max    | 259.000000 | NaN        | NaN        | NaN         | NaN           | NaN             | NaN             | NaN               | 1.921053        | 50.000000  |

In \[8\]:

    aqi['state_name'].value_counts()

Out\[8\]:

    California              66
    Arizona                 14
    Ohio                    12
    Florida                 12
    Texas                   10
    New York                10
    Pennsylvania            10
    Michigan                 9
    Colorado                 9
    Minnesota                7
    New Jersey               6
    Indiana                  5
    North Carolina           4
    Massachusetts            4
    Maryland                 4
    Oklahoma                 4
    Virginia                 4
    Nevada                   4
    Connecticut              4
    Kentucky                 3
    Missouri                 3
    Wyoming                  3
    Iowa                     3
    Hawaii                   3
    Utah                     3
    Vermont                  3
    Illinois                 3
    New Hampshire            2
    District Of Columbia     2
    New Mexico               2
    Montana                  2
    Oregon                   2
    Alaska                   2
    Georgia                  2
    Washington               2
    Idaho                    2
    Nebraska                 2
    Rhode Island             2
    Tennessee                2
    Maine                    2
    South Carolina           1
    Puerto Rico              1
    Arkansas                 1
    Kansas                   1
    Mississippi              1
    Alabama                  1
    Louisiana                1
    Delaware                 1
    South Dakota             1
    West Virginia            1
    North Dakota             1
    Wisconsin                1
    Name: state_name, dtype: int64

#### **Question 1: From the preceding data exploration, what do you recognize?**

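-   You have county-level data for the first hypothesis (Los Angeles is
    the most frequent county, with 14 observations).
-   Ohio (12 observations) and New York (10 observations) both have a
    relatively high number of observations to work with in this dataset.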
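One way to check whether each question is answerable is to summarize the sample size, mean, and standard deviation for the groups involved. The sketch below assumes a small hypothetical dataframe named `sample` with the same `state_name` and `aqi` columns as `aqi`; the values are illustrative only.

```python
import pandas as pd

# Hypothetical miniature stand-in for `aqi` (values are illustrative only)
sample = pd.DataFrame({
    'state_name': ['Ohio', 'Ohio', 'New York', 'New York', 'Michigan', 'Michigan'],
    'aqi': [5, 9, 2, 4, 7, 11],
})

states_of_interest = ['New York', 'Ohio', 'Michigan']

# Count, mean, and standard deviation of AQI for each state of interest
summary = (
    sample[sample['state_name'].isin(states_of_interest)]
    .groupby('state_name')['aqi']
    .agg(['count', 'mean', 'std'])
)
print(summary)
```

Running the same aggregation on `aqi` itself would show how many observations back each comparison before any test is run.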

## Step 3. Statistical Tests

Before you proceed, recall the following steps for conducting hypothesis
testing:

1.  Formulate the null hypothesis and the alternative hypothesis.  
2.  Set the significance level.  
3.  Determine the appropriate test procedure.  
4.  Compute the p-value.  
5.  Draw your conclusion.
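The final step reduces to comparing the p-value with the significance level. As a minimal sketch, the hypothetical helper `decide` below (not part of the lab or of any library) encodes that rule:

```python
def decide(p_value, significance_level=0.05):
    """Return the hypothesis-test conclusion for a given p-value."""
    # Reject only when the p-value is strictly below the significance level
    if p_value < significance_level:
        return 'reject the null hypothesis'
    return 'fail to reject the null hypothesis'

print(decide(0.03))
print(decide(0.94))
```

Note that failing to reject the null hypothesis is not the same as showing the null hypothesis is true; it only means the sample does not provide enough evidence against it.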

### Hypothesis 1: ROA is considering a metropolitan-focused approach. Within California, they want to know if the mean AQI in Los Angeles County is statistically different from the rest of California.

Before proceeding with your analysis, it will be helpful to subset the
data for your comparison.

In \[9\]:

    # Create dataframes for each sample being compared in your test
    ca_la = aqi[aqi['county_name']== 'Los Angeles']
    ca_other = aqi[(aqi['state_name']=='California') & (aqi['county_name']!='Los Angeles')]

#### Formulate your hypothesis:

**Formulate your null and alternative hypotheses:**

-   \$H_0\$: There is no difference in the mean AQI between Los Angeles
    County and the rest of California.
-   \$H_A\$: There is a difference in the mean AQI between Los Angeles
    County and the rest of California.

#### Set the significance level:

In \[10\]:

    # For this analysis, the significance level is 5%
    significance_level = 0.05
    significance_level

Out\[10\]:

    0.05

#### Determine the appropriate test procedure:

Here, you are comparing the sample means between two independent
samples. Therefore, you will utilize a **two-sample 𝑡-test**.

#### Compute the P-value

In \[12\]:

    # Compute your p-value here
    stats.ttest_ind(a=ca_la['aqi'], b=ca_other['aqi'], equal_var=False)

Out\[12\]:

    Ttest_indResult(statistic=2.1107010796372014, pvalue=0.049839056842410995)

#### **Question 2. What is your P-value for hypothesis 1, and what does this indicate for your null hypothesis?**

With a p-value (0.0498) just under 0.05 (your 5% significance level),
reject the null hypothesis in favor of the alternative hypothesis.

Therefore, a metropolitan strategy may make sense in this case.
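The result object returned by `stats.ttest_ind()` exposes `statistic` and `pvalue` attributes, which makes the comparison with the significance level explicit. The sketch below uses hypothetical arrays `la_sample` and `other_sample` as stand-ins for `ca_la['aqi']` and `ca_other['aqi']`; the numbers are illustrative and do not reproduce the lab's output.

```python
import numpy as np
from scipy import stats

# Hypothetical AQI samples standing in for ca_la['aqi'] and ca_other['aqi']
la_sample = np.array([14, 9, 21, 17, 11, 25, 8, 16])
other_sample = np.array([6, 10, 4, 12, 7, 9, 5, 11, 8, 6])

# Welch's two-sample t-test (two-sided by default)
result = stats.ttest_ind(a=la_sample, b=other_sample, equal_var=False)
significance_level = 0.05

# Report the direction of the difference alongside the test result
print(f"mean difference: {la_sample.mean() - other_sample.mean():.2f}")
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
print("reject H0" if result.pvalue < significance_level else "fail to reject H0")
```

Reporting the difference in sample means next to the p-value shows which group is higher, which a two-sided p-value alone does not convey.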

### Hypothesis 2: With limited resources, ROA has to choose between New York and Ohio for their next regional office. Does New York have a lower AQI than Ohio?

Before proceeding with your analysis, it will be helpful to subset the
data for your comparison.

In \[13\]:

    # Create dataframes for each sample being compared in your test
    ny = aqi[aqi['state_name']=='New York']
    ohio = aqi[aqi['state_name']=='Ohio']

#### Formulate your hypothesis:

**Formulate your null and alternative hypotheses:**

-   \$H_0\$: The mean AQI of New York is greater than or equal to that
    of Ohio.
-   \$H_A\$: The mean AQI of New York is **below** that of Ohio.

#### Significance Level (remains at 5%)

#### Determine the appropriate test procedure:

Here, you are comparing the sample means between two independent samples
in one direction. Therefore, you will utilize a **two-sample 𝑡-test**.

#### Compute the P-value

In \[14\]:

    # Compute your p-value here (ttest_ind is two-sided unless `alternative` is set)
    tstat, pvalue = stats.ttest_ind(a=ny['aqi'], b=ohio['aqi'], equal_var=False)
    print(tstat)
    print(pvalue)

    -2.025951038880333
    0.060893005383869395

#### **Question 3. What is your P-value for hypothesis 2, and what does this indicate for your null hypothesis?**

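The test above is two-sided and returns a p-value of 0.061. Because the
alternative hypothesis is one-sided and the t-statistic is negative
(-2.026), which is the direction \$H_A\$ predicts, the one-sided p-value
is half of the two-sided value: 0.030. With a p-value (0.030) less than
0.05 (your 5% significance level), **reject the null hypothesis in
favor of the alternative hypothesis**.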

Therefore, you can conclude at the 5% significance level that New York
has a lower mean AQI than Ohio.
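The relationship between the two-sided and one-sided p-values can be checked directly. The sketch below assumes SciPy 1.6 or later, where `stats.ttest_ind()` accepts an `alternative` argument, and uses hypothetical arrays `ny_sample` and `ohio_sample` in place of `ny['aqi']` and `ohio['aqi']`.

```python
import numpy as np
from scipy import stats

# Hypothetical AQI samples standing in for ny['aqi'] and ohio['aqi']
ny_sample = np.array([2, 3, 1, 4, 2, 5, 3, 2])
ohio_sample = np.array([3, 6, 4, 8, 2, 7, 5, 9])

# Default: two-sided test (H_A: the two means differ)
two_sided = stats.ttest_ind(a=ny_sample, b=ohio_sample, equal_var=False)

# One-sided test (H_A: the mean of `a` is less than the mean of `b`)
one_sided = stats.ttest_ind(a=ny_sample, b=ohio_sample, equal_var=False,
                            alternative='less')

print(two_sided.pvalue, one_sided.pvalue)
# With a negative t-statistic, the 'less' p-value is half the two-sided one
print(np.isclose(one_sided.pvalue, two_sided.pvalue / 2))
```

Passing `alternative='less'` avoids halving by hand and also handles the case where the t-statistic has the opposite sign, in which the one-sided p-value would instead be one minus half the two-sided value.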

### Hypothesis 3: A new policy will affect those states with a mean AQI of 10 or greater. Will Michigan be affected by this new policy?

Before proceeding with your analysis, it will be helpful to subset the
data for your comparison.

In \[17\]:

    # Create dataframes for each sample being compared in your test
    michigan = aqi[aqi['state_name']=='Michigan']

#### Formulate your hypothesis:

**Formulate your null and alternative hypotheses here:**

-   \$H_0\$: The mean AQI of Michigan is less than or equal to 10.
-   \$H_A\$: The mean AQI of Michigan is greater than 10.

#### Significance Level (remains at 5%)

#### Determine the appropriate test procedure:

Here, you are comparing one sample mean relative to a particular value
in one direction. Therefore, you will utilize a **one-sample 𝑡-test**.

#### Compute the P-value

In \[18\]:

    # Compute your p-value here
    tstat, pvalue = stats.ttest_1samp(michigan['aqi'], 10, alternative='greater')
    print(tstat)
    print(pvalue)

    -1.7395913343286131
    0.9399405193140109

#### **Question 4. What is your P-value for hypothesis 3, and what does this indicate for your null hypothesis?**

With a p-value (0.940) greater than 0.05 (your 5% significance level)
and a t-statistic \< 0 (-1.74), **fail to reject the null
hypothesis**.

Therefore, you cannot conclude at the 5% significance level that
Michigan's mean AQI is greater than 10. This implies that Michigan would
most likely not be affected by the new policy.
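To see why a negative t-statistic produces such a large p-value here, the one-sample test can be reproduced by hand. This is a sketch on hypothetical `readings` (not the Michigan data), with `threshold` standing in for the policy cutoff of 10; `stats.t.sf` gives the upper-tail area of the t distribution.

```python
import math
import statistics
from scipy import stats

# Hypothetical AQI readings standing in for michigan['aqi']
readings = [8, 5, 12, 7, 9, 6, 10, 4, 11]
threshold = 10  # hypothesized mean under H_0

n = len(readings)
# Standard error of the mean, using the sample standard deviation
standard_error = statistics.stdev(readings) / math.sqrt(n)
t_manual = (statistics.mean(readings) - threshold) / standard_error

# One-sided p-value for H_A: mean > threshold (upper-tail area, n - 1 dof)
p_manual = stats.t.sf(t_manual, df=n - 1)

# The same test through SciPy's one-sample t-test
t_scipy, p_scipy = stats.ttest_1samp(readings, threshold, alternative='greater')
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```

When the sample mean is below the threshold, the t-statistic is negative and the upper-tail area exceeds 0.5, so a `'greater'` alternative can never be supported by such a sample.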

## Step 4. Results and Evaluation

Now that you've completed your statistical tests, you can consider your
hypotheses and the results you gathered.

#### **Question 5. Did your results show that the AQI in Los Angeles County was statistically different from the rest of California?**

Yes. At the 5% significance level, the results indicate that the mean
AQI in Los Angeles County differs from that of the rest of California,
although the p-value (0.0498) is only just under the threshold.

#### **Question 6. Did New York or Ohio have a lower AQI?**

Using a 5% significance level, you can conclude that New York has a
lower mean AQI than Ohio based on the results.

#### **Question 7: Will Michigan be affected by the new policy impacting states with a mean AQI of 10 or greater?**

Based on the test, you would fail to reject the null hypothesis,
meaning you can't conclude that Michigan's mean AQI is greater than 10.
Thus, it is unlikely that Michigan would be affected by the new policy.

## Conclusion

**What are key takeaways from this lab?**

Even with small sample sizes, the variation within the data is enough to
allow you to make statistically significant conclusions. You identified
at the 5% significance level that the Los Angeles County mean AQI was
statistically different from the rest of California, and that New York
does have a lower mean AQI than Ohio. However, you were unable to
conclude at the 5% significance level that Michigan's mean AQI was
greater than 10.

**What would you consider presenting to your manager as part of your
findings?**

For each test, you would present the null and alternative hypotheses,
then describe your conclusion and the resulting p-value that drove that
conclusion. Because the setup of a t-test has a few key configurations
that dictate how you interpret the result, you would specify the type
of test you chose, whether it was one-tailed or two-tailed, and how you
performed the t-test with `scipy.stats`.

**What would you convey to external stakeholders?**

In answer to the research questions posed, you would convey the level of
significance (5%) and your conclusion. Additionally, providing the
sample statistics being compared in each case will likely provide
important context for stakeholders to quickly understand the difference
between your results.
