# Mini Project 5-4: Explore confidence intervals

## Introduction

The Air Quality Index (AQI) is the Environmental Protection Agency's index for reporting air quality. A value close to 0 signals little to no public health concern, while higher values are associated with increased risk to public health. The United States is considering a new federal policy that would create a subsidy for renewable energy in states observing an average AQI of 10 or above.

You've just started your new role as a data analyst in the Strategy division of Ripple Renewable Energy (RRE). **RRE operates in the following U.S. states: `California`, `Florida`, `Michigan`, `Ohio`, `Pennsylvania`, `Texas`.** You've been tasked with building an analysis that identifies which of these states are most likely to be affected should the new federal policy be enacted.

Your manager has requested that you do the following for your analysis:
1. Provide a summary of the mean AQI for the states in which RRE operates.
2. Construct a boxplot visualization for AQI of these states using `seaborn`.
3. Evaluate which state(s) may be most affected by this policy, based on the data and your boxplot visualization.
4. Construct a confidence interval for the RRE state with the highest mean AQI.

## Step 1: Imports

### Import packages

Import `pandas` and `numpy`.

In [1]:
# Import relevant packages
import pandas as pd
import numpy as np

### Load the dataset

The dataset provides national Air Quality Index (AQI) measurements by state over time. `pandas` is used to import the file `c4_epa_air_quality.csv` as a DataFrame named `df`. As shown in this cell, the dataset has been loaded automatically for you. You do not need to download the .csv file or write additional code to access the dataset and proceed with this lab. Continue with the activity by completing the following instructions.

*Note: For the purposes of your analysis, you can assume this data is randomly sampled from a larger population.*

In [5]:
# Import data
df = pd.read_csv("c4_epa_air_quality.csv")
df.head()

| Unnamed: 0.1 | Unnamed: 0 | date_local | state_name | county_name | city_name | local_site_name | parameter_name | units_of_measure | arithmetic_mean | aqi |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 2018-01-01 | Arizona | Maricopa | Buckeye | BUCKEYE | Carbon monoxide | Parts per million | 0.473684 | 7 |
| 1 | 1 | 2018-01-01 | Ohio | Belmont | Shadyside | Shadyside | Carbon monoxide | Parts per million | 0.263158 | 5 |
| 2 | 2 | 2018-01-01 | Wyoming | Teton | Not in a city | Yellowstone National Park - Old Faithful Snow ... | Carbon monoxide | Parts per million | 0.111111 | 2 |
| 3 | 3 | 2018-01-01 | Pennsylvania | Philadelphia | Philadelphia | North East Waste (NEW) | Carbon monoxide | Parts per million | 0.3 | 3 |
| 4 | 4 | 2018-01-01 | Iowa | Polk | Des Moines | CARPENTER | Carbon monoxide | Parts per million | 0.215789 | 3 |


## Step 2: Data exploration

**Question:** What time range does this data cover?

In [7]:
# Code Here
date_str = "2018-01-01"
date_time = pd.to_datetime(date_str)
print(date_time)

2018-01-01 00:00:00
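
The cell above parses a single hard-coded date string, so it does not inspect the data itself. A minimal sketch of one way to read the range from the `date_local` column follows. It assumes the column parses cleanly with `pd.to_datetime`, and it uses only the five preview rows shown earlier (the `preview` frame is a stand-in for `df`), so its result says nothing about the full file.

```python
import pandas as pd

# Stand-in for df: the date_local values from the five preview rows shown earlier.
preview = pd.DataFrame({
    "date_local": ["2018-01-01"] * 5,
    "state_name": ["Arizona", "Ohio", "Wyoming", "Pennsylvania", "Iowa"],
})

# Parse the strings into datetimes, then take the earliest and latest dates.
dates = pd.to_datetime(preview["date_local"])
start_date, end_date = dates.min(), dates.max()

print(f"Earliest date: {start_date.date()}")
print(f"Latest date: {end_date.date()}")
```

Running the same two lines on `df["date_local"]` would report the range for the whole dataset.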


**Question:** What are the minimum and maximum AQI values observed in the dataset?

In [9]:
# Code Here
min_aqi = df['aqi'].min()
max_aqi = df['aqi'].max()

print(f"Minimum aqi value: {min_aqi}")
print(f"Maximum aqi value: {max_aqi}")

Minimum aqi value: 0
Maximum aqi value: 50


**Question:** Are all states equally represented in the dataset?

In [13]:
# Code Here
state_counts = df['state_name'].value_counts()

# Display the counts for each state
print(state_counts)

# Check if all states are equally represented
if state_counts.nunique() == 1:
    print("All states are equally represented.")
else:
    print("States are not equally represented.")

state_name
California              66
Arizona                 14
Ohio                    12
Florida                 12
Texas                   10
New York                10
Pennsylvania            10
Michigan                 9
Colorado                 9
Minnesota                7
New Jersey               6
Indiana                  5
North Carolina           4
Massachusetts            4
Maryland                 4
Oklahoma                 4
Virginia                 4
Nevada                   4
Connecticut              4
Kentucky                 3
Missouri                 3
Wyoming                  3
Iowa                     3
Hawaii                   3
Utah                     3
Vermont                  3
Illinois                 3
New Hampshire            2
District Of Columbia     2
New Mexico               2
Montana                  2
Oregon                   2
Alaska                   2
Georgia                  2
Washington               2
Idaho                    2
Nebraska         

## Step 3: Statistical tests

### Summarize the mean AQI for RRE states

Start with your first deliverable. Summarize the mean AQI for the states in which RRE operates (California, Florida, Michigan, Ohio, Pennsylvania, and Texas).
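
One possible way to produce this summary is to filter to the RRE states and group by `state_name`. The sketch below is illustrative only: `preview` holds just the five rows displayed by `df.head()`, and the names `rre_states` and `mean_aqi_by_state` are assumptions made for this example, so the printed means are not the answer for the full dataset.

```python
import pandas as pd

# States in which RRE operates, as listed in the introduction.
rre_states = ["California", "Florida", "Michigan", "Ohio", "Pennsylvania", "Texas"]

# Stand-in for df: state and AQI values from the five preview rows.
preview = pd.DataFrame({
    "state_name": ["Arizona", "Ohio", "Wyoming", "Pennsylvania", "Iowa"],
    "aqi": [7, 5, 2, 3, 3],
})

# Keep only RRE states, then average AQI per state, highest mean first.
rre_rows = preview[preview["state_name"].isin(rre_states)]
mean_aqi_by_state = (
    rre_rows.groupby("state_name")["aqi"].mean().sort_values(ascending=False)
)

print(mean_aqi_by_state)
```

Replacing `preview` with `df` applies the same steps to every row, and the first entry of the sorted result is then the RRE state with the highest mean AQI.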

### Find your margin of error (ME)

Recall **margin of error = z * standard error**, where z is the appropriate z-value for the given confidence level. To calculate your margin of error:

- Find your z-value, using the table of approximate z-values for common confidence levels.
- Calculate your **standard error** estimate.

| Confidence Level | Z Score |
| --- | --- |
| 90% | 1.65 |
| 95% | 1.96 |
| 99% | 2.58 |
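
The formula and table can be combined in a few lines. This is a sketch under stated assumptions: `data` is a small set of illustrative sample values rather than AQI measurements, and the `z_scores` dictionary is simply the table transcribed for lookup.

```python
import numpy as np

# Illustrative sample values (not AQI measurements).
data = [23, 21, 18, 22, 24, 19, 20, 25, 23, 22]

# Approximate z-values from the table, keyed by confidence level.
z_scores = {0.90: 1.65, 0.95: 1.96, 0.99: 2.58}

confidence_level = 0.95
z_value = z_scores[confidence_level]

# Standard error = sample standard deviation / sqrt(sample size).
standard_error = np.std(data, ddof=1) / np.sqrt(len(data))

# Margin of error = z * standard error.
margin_of_error = z_value * standard_error

print(f"Standard error: {standard_error:.3f}")
print(f"Margin of error: {margin_of_error:.3f}")
```

Adding and subtracting `margin_of_error` from the sample mean gives the upper and lower limits of the interval.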


### Calculate your interval

Calculate both a lower and upper limit surrounding your sample mean to create your interval.

In [9]:
# Calculate your confidence interval (upper and lower limits).
import numpy as np
import scipy.stats as stats

# Sample data (example)
data = [23, 21, 18, 22, 24, 19, 20, 25, 23, 22]

# Step 1: Calculate the sample mean
sample_mean = np.mean(data)

# Step 2: Calculate the standard deviation and standard error of the mean (SEM)
sample_std = np.std(data, ddof=1)  # ddof=1 for sample standard deviation
sample_size = len(data)
sem = sample_std / np.sqrt(sample_size)

# Step 3: Determine the critical value from the t-distribution (suited to small samples; larger than the z-value for the same confidence level)
confidence_level = 0.95  # 95% confidence
alpha = 1 - confidence_level
t_critical = stats.t.ppf(1 - alpha/2, df=sample_size - 1)

# Step 4: Calculate the margin of error
margin_of_error = t_critical * sem

# Step 5: Calculate the confidence interval
lower_limit = sample_mean - margin_of_error
upper_limit = sample_mean + margin_of_error

# Output the results
print(f"Sample Mean: {sample_mean}")
print(f"Lower Limit: {lower_limit}")
print(f"Upper Limit: {upper_limit}")



Sample Mean: 21.7
Lower Limit: 20.116489986081305
Upper Limit: 23.283510013918693


### Alternative: Construct the interval using `scipy.stats.norm.interval()`

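`scipy` offers a simpler way to construct a confidence interval. To use it, first import the `stats` module from `scipy`. Because `stats.norm.interval()` is based on the normal (z) distribution, its interval is slightly narrower than the t-based interval computed above.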

In [10]:
# Import stats from scipy.
import numpy as np
import scipy.stats as stats

# Sample data (example)
data = [23, 21, 18, 22, 24, 19, 20, 25, 23, 22]

# Calculate the sample mean and standard deviation
sample_mean = np.mean(data)
sample_std = np.std(data, ddof=1)  # ddof=1 for sample standard deviation
sample_size = len(data)

# Confidence level
confidence_level = 0.95  # 95% confidence

# Step 1: Calculate the standard error
sem = sample_std / np.sqrt(sample_size)

# Step 2: Use norm.interval to calculate the confidence interval
lower_limit, upper_limit = stats.norm.interval(
    confidence_level, 
    loc=sample_mean, 
    scale=sem
)

# Output the results
print(f"Sample Mean: {sample_mean}")
print(f"Confidence Interval: ({lower_limit}, {upper_limit})")



Sample Mean: 21.7
Confidence Interval: (20.328025210821963, 23.071974789178036)
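
The two intervals differ because they use different critical values. The sketch below compares them for a 95% confidence level and the ten-value example sample; the variable names are assumptions made for this illustration.

```python
import scipy.stats as stats

confidence_level = 0.95
sample_size = 10  # size of the example sample used in the cells above

# Cumulative probability at the upper limit of a two-sided interval (0.975 here).
upper_quantile = 1 - (1 - confidence_level) / 2

# Critical value from the standard normal distribution.
z_critical = stats.norm.ppf(upper_quantile)

# Critical value from the t-distribution with n - 1 degrees of freedom.
t_critical = stats.t.ppf(upper_quantile, df=sample_size - 1)

print(f"z critical value: {z_critical:.3f}")
print(f"t critical value: {t_critical:.3f}")
```

The t critical value is the larger of the two for a sample this small, which is why the t-based interval is wider. As the sample size grows, the two values converge.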


## Step 4: Results and evaluation

### Recalculate your confidence interval

Provide your chosen `confidence_level`, `sample_mean`, and standard error (named `sem` in the code) to `stats.norm.interval()` and recalculate your confidence interval.

In [12]:
# Code Here
import numpy as np
import scipy.stats as stats

# Sample data (example)
data = [23, 21, 18, 22, 24, 19, 20, 25, 23, 22]

# Calculate the sample mean
sample_mean = np.mean(data)

# Calculate the sample standard deviation and sample size
sample_std = np.std(data, ddof=1)  # ddof=1 for sample standard deviation
sample_size = len(data)

# Define the confidence level
confidence_level = 0.95  # 95% confidence interval

# Calculate the standard error of the mean (SEM)
sem = sample_std / np.sqrt(sample_size)

# Use scipy.stats.norm.interval() to calculate the confidence interval
lower_limit, upper_limit = stats.norm.interval(confidence_level, loc=sample_mean, scale=sem)

# Output the results
print(f"Confidence Level: {confidence_level}")
print(f"Sample Mean: {sample_mean}")
print(f"Standard Error: {sem}")
print(f"Lower Limit: {lower_limit}")
print(f"Upper Limit: {upper_limit}")


Confidence Level: 0.95
Sample Mean: 21.7
Standard Error: 0.7
Lower Limit: 20.328025210821963
Upper Limit: 23.071974789178036
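
To apply the same calculation to the AQI readings of a single state, the steps can be wrapped in a small helper. This is a sketch, not part of the lab: `mean_confidence_interval` is a hypothetical name, and it is demonstrated on the example sample rather than on real AQI values.

```python
import numpy as np
import scipy.stats as stats


def mean_confidence_interval(values, confidence_level=0.95):
    """Return (mean, lower, upper) for a z-based interval around the sample mean."""
    values = np.asarray(values, dtype=float)
    sample_mean = values.mean()
    # Standard error of the mean, using the sample standard deviation.
    sem = values.std(ddof=1) / np.sqrt(len(values))
    lower, upper = stats.norm.interval(confidence_level, loc=sample_mean, scale=sem)
    return sample_mean, lower, upper


# Demonstration on the example sample (not AQI measurements).
example = [23, 21, 18, 22, 24, 19, 20, 25, 23, 22]
mean, lower, upper = mean_confidence_interval(example)
print(f"Mean: {mean:.2f}, 95% CI: ({lower:.3f}, {upper:.3f})")

# On the real data, the call might look like this, where top_state is a
# placeholder for the RRE state with the highest mean AQI:
# mean_confidence_interval(df.loc[df["state_name"] == top_state, "aqi"])
```

Passing a state's `aqi` column instead of `example` would yield the interval requested in the fourth deliverable.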


# Considerations

**What are some key takeaways that you learned from this project?**

A:

**What findings would you share with others?**

A:

**What would you convey to external readers?**

A:

**References**

[seaborn.boxplot — seaborn 0.12.1 documentation](https://seaborn.pydata.org/generated/seaborn.boxplot.html). (n.d.). 