# Mini Project 5-4: Explore confidence intervals

## Introduction

The Air Quality Index (AQI) is the Environmental Protection Agency's index for reporting air quality. A value close to 0 signals little to no public health concern, while higher values are associated with increased risk to public health. The United States is considering a new federal policy that would create a subsidy for renewable energy in states observing an average AQI of 10 or above.

You've just started your new role as a data analyst in the Strategy division of Ripple Renewable Energy (RRE). **RRE operates in the following U.S. states: `California`, `Florida`, `Michigan`, `Ohio`, `Pennsylvania`, `Texas`.** You've been tasked with building an analysis that identifies which of these states are most likely to be affected, should the new federal policy be enacted.

Your manager has requested that you do the following for your analysis:
1. Provide a summary of the mean AQI for the states in which RRE operates.
2. Construct a boxplot visualization for AQI of these states using `seaborn`.
3. Evaluate which state(s) may be most affected by this policy, based on the data and your boxplot visualization.
4. Construct a confidence interval for the RRE state with the highest mean AQI.

## Step 1: Imports

### Import packages

Import `pandas` and `numpy`.

In [None]:
import pandas as pd
import numpy as np

### Load the dataset

The dataset provided gives national Air Quality Index (AQI) measurements by state over time. `pandas` is used to import the file `c4_epa_air_quality.csv` as a DataFrame named `aqi`. As shown in this cell, the dataset is loaded for you automatically; you do not need to download the .csv file or write any additional code to access it. Continue with this activity by completing the following instructions.

*Note: For the purposes of your analysis, you can assume this data is randomly sampled from a larger population.*

In [None]:
import pandas as pd

# Load the dataset
aqi = pd.read_csv("c4_epa_air_quality.csv")

# Display the first few rows of the dataset
print(aqi.head())


## Step 2: Data exploration

### Explore your dataset

Before proceeding to your deliverables, spend some time exploring the `aqi` DataFrame. 

In [None]:
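# Inspect column names, data types, and non-null counts
aqi.info()

In [None]:
# Summary statistics for every column
print(aqi.describe(include='all'))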

**Question:** What time range does this data cover?

In [None]:
# Code Here

A: over a few years
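
One way to answer this precisely is to parse the date column and compare its earliest and latest values. The sketch below uses a small made-up DataFrame rather than the real file, and the column name `date_local` is an assumption; substitute whichever date column the dataset actually contains.

```python
import pandas as pd

# Made-up rows standing in for the real file; `date_local` is a
# hypothetical column name, not one confirmed by this notebook.
toy = pd.DataFrame({
    "date_local": ["2020-03-01", "2020-03-03", "2020-03-02"],
    "aqi": [7, 12, 9],
})

# Parse the strings as dates so min/max compare chronologically
dates = pd.to_datetime(toy["date_local"])
first_day, last_day = dates.min(), dates.max()
span_days = (last_day - first_day).days

print(f"Data runs from {first_day.date()} to {last_day.date()} ({span_days} days)")
```

Reporting the actual first and last dates gives a more useful answer than an approximate span.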

**Question:** What are the minimum and maximum AQI values observed in the dataset?

In [None]:
# Find the minimum and maximum AQI values
min_aqi = aqi['AQI'].min()
max_aqi = aqi['AQI'].max()

print(f"Minimum AQI: {min_aqi}")
print(f"Maximum AQI: {max_aqi}")


**Question:** Are all states equally represented in the dataset?

In [None]:
# Code Here

In [None]:
# Code Here

A: No. States like California and Texas have more rows than others, so not all states are equally represented. Counts that were close to one another would instead suggest roughly equal representation.
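
A quick way to check representation is to count rows per state. The following sketch uses made-up rows, and the column name `state_name` is an assumption taken from the wording of a later heading in this notebook.

```python
import pandas as pd

# Made-up rows; `state_name` is an assumed column name
toy = pd.DataFrame({
    "state_name": ["California", "California", "California",
                   "Texas", "Texas", "Ohio"],
})

# Number of rows per state, largest first
counts = toy["state_name"].value_counts()

# Share of all rows contributed by each state
shares = toy["state_name"].value_counts(normalize=True)

# Equal representation would mean every state has the same count
equally_represented = counts.nunique() == 1

print(counts)
print(shares.round(2))
print(f"Equally represented: {equally_represented}")
```

Unequal counts matter later: a state with few rows has a larger standard error, so its mean is estimated less precisely.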

## Step 3: Statistical tests

### Summarize the mean AQI for RRE states

Start with your first deliverable. Summarize the mean AQI for the states in which RRE operates (California, Florida, Michigan, Ohio, Pennsylvania, and Texas).

In [None]:
# List of RRE states
rre_states = ['California', 'Florida', 'Michigan', 'Ohio', 'Pennsylvania', 'Texas']

# Filter the dataset for these states
rre_aqi = aqi[aqi['state'].isin(rre_states)]

# Calculate the mean AQI for each of these states
mean_aqi_by_state = rre_aqi.groupby('state')['AQI'].mean()

# Display the results
print(mean_aqi_by_state)


### Construct a boxplot visualization for the AQI of these states

Seaborn is a simple visualization library, commonly imported as `sns`. Import `seaborn`. Then utilize a boxplot visualization from this library to compare the distributions of AQI scores by state.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Set up the figure and size
plt.figure(figsize=(10, 6))

# Create a boxplot for AQI by state for the RRE states
sns.boxplot(data=rre_aqi, x='state', y='AQI')

# Add title and labels
plt.title('Boxplot of AQI by State (RRE States)', fontsize=16)
plt.xlabel('State', fontsize=12)
plt.ylabel('AQI', fontsize=12)

# Rotate x-axis labels for readability
plt.xticks(rotation=45)

# Show the plot
plt.show()


### Create an in-line visualization showing the distribution of `aqi` by `state_name`

Now, create an in-line visualization showing the distribution of `aqi` by `state_name`.

In [None]:
# Create an in-line visualization showing the distribution of AQI by state_name
plt.figure(figsize=(12, 6))

# Use seaborn's violinplot to visualize the distribution
sns.violinplot(data=aqi, x='state', y='AQI')

# Add title and labels
plt.title('Distribution of AQI by State', fontsize=16)
plt.xlabel('State', fontsize=12)
plt.ylabel('AQI', fontsize=12)

# Rotate x-axis labels for readability
plt.xticks(rotation=45)

# Show the plot
plt.show()


**Question:** Based on the data and your visualizations, which state(s) do you suspect will be most affected by this policy?

A: California, Texas, and Florida show higher AQI values and broader AQI distributions, so they would likely be most affected by the policy, which applies to states with an average AQI of 10 or above.
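
The visual impression can be checked directly against the policy threshold from the introduction (an average AQI of 10 or above). This is a minimal sketch on made-up numbers; the column names `state_name` and `aqi` and the constant `POLICY_THRESHOLD` are illustrative assumptions.

```python
import pandas as pd

POLICY_THRESHOLD = 10  # average AQI of 10 or above, per the introduction

# Made-up readings; column names are assumptions
toy = pd.DataFrame({
    "state_name": ["California", "California", "Ohio", "Ohio", "Texas", "Texas"],
    "aqi": [14, 12, 3, 5, 9, 11],
})

# Mean AQI per state
state_means = toy.groupby("state_name")["aqi"].mean()

# Keep only the states at or above the threshold, highest mean first
affected = state_means[state_means >= POLICY_THRESHOLD].sort_values(ascending=False)

print(affected)
```

A sample mean close to the threshold is exactly the case where a confidence interval is informative, because the interval may or may not extend below 10.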

### Construct a confidence interval for the RRE state with the highest mean AQI

Recall the 4-step process in constructing a confidence interval:

1.   Identify a sample statistic.
2.   Choose a confidence level.
3.   Find the margin of error. 
4.   Calculate the interval.
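
The four steps above can be sketched end to end using only the Python standard library. The readings are made up for illustration, and the z-based margin of error mirrors the formula used later in this notebook; for a sample this small, a t-value would give a slightly wider interval.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

sample = [12, 9, 15, 11, 8, 14, 10, 13]  # made-up AQI readings

# Step 1: identify a sample statistic (the sample mean)
sample_mean = mean(sample)

# Step 2: choose a confidence level
confidence_level = 0.95

# Step 3: find the margin of error = z * standard error
z_value = NormalDist().inv_cdf((1 + confidence_level) / 2)
standard_error = stdev(sample) / sqrt(len(sample))  # stdev uses n - 1
margin_of_error = z_value * standard_error

# Step 4: calculate the interval
lower_limit = sample_mean - margin_of_error
upper_limit = sample_mean + margin_of_error

print(f"Sample mean: {sample_mean}")
print(f"Confidence interval: ({lower_limit:.2f}, {upper_limit:.2f})")
```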

### Construct your sample statistic

To construct your sample statistic, find the mean AQI for California.

In [None]:
# Filter the dataset for California
ca_aqi = aqi[aqi['state'] == 'California']

# Calculate the mean AQI for California
mean_aqi_ca = ca_aqi['AQI'].mean()

print(f"Mean AQI for California: {mean_aqi_ca}")


### Choose your confidence level

Choose your confidence level for your analysis. The most typical confidence level chosen is 95%; however, you can choose 90% or 99% if you want to decrease or increase (respectively) your level of confidence about your result.

In [None]:
import scipy.stats as stats
import numpy as np

# Set the confidence level (e.g., 95% confidence level)
confidence_level = 0.95

# Gather the sample mean, standard deviation, and size for California AQI
sample_mean = mean_aqi_ca
sample_std = ca_aqi['AQI'].std()
sample_size = ca_aqi.shape[0]

# Margin of error from the t-distribution, used because the population
# standard deviation is unknown and is estimated from the sample
t_value = stats.t.ppf((1 + confidence_level) / 2, df=sample_size - 1)
margin_of_error = t_value * (sample_std / np.sqrt(sample_size))

# Calculate the confidence interval
confidence_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)

# Print the results
print(f"Confidence Level: {confidence_level * 100}%")
print(f"Mean AQI for California: {sample_mean}")
print(f"Confidence Interval: {confidence_interval}")


### Find your margin of error (ME)

Recall **margin of error = z * standard error**, where z is the appropriate z-value for the given confidence level. To calculate your margin of error:

- Find your z-value. 
- Find the approximate z for common confidence levels.
- Calculate your **standard error** estimate. 

| Confidence Level | Z Score |
| --- | --- |
| 90% | 1.65 |
| 95% | 1.96 |
| 99% | 2.58 |
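
The rounded values in the table can be reproduced from the standard normal distribution. This sketch uses the standard library's `statistics.NormalDist`; the helper name `z_for` is hypothetical and exists only for this illustration.

```python
from statistics import NormalDist


def z_for(confidence_level):
    """Two-sided critical value of the standard normal distribution."""
    # A two-sided interval leaves (1 - confidence_level) / 2 in each tail
    return NormalDist().inv_cdf((1 + confidence_level) / 2)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: z = {z_for(level):.3f}")
```

A higher confidence level produces a larger z-value and therefore a wider interval for the same data.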


In [None]:
# z-value for a 95% confidence level
z_value = 1.96

# Calculate the standard error
standard_error = sample_std / np.sqrt(sample_size)

# Calculate the margin of error (ME)
margin_of_error = z_value * standard_error

# Print the results
print(f"Standard Error: {standard_error}")
print(f"Margin of Error (ME): {margin_of_error}")




### Calculate your interval

Calculate both a lower and upper limit surrounding your sample mean to create your interval.

In [None]:
# Calculate the lower and upper limits of the confidence interval
lower_limit = sample_mean - margin_of_error
upper_limit = sample_mean + margin_of_error

# Print the results
print(f"Confidence Interval: ({lower_limit}, {upper_limit})")


### Alternative: Construct the interval using `scipy.stats.norm.interval()`

`scipy` offers a simpler way to construct a confidence interval. To use it, first import the `stats` module from `scipy`.
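
As a sanity check, the sketch below compares the manual `mean ± z * standard error` calculation with `stats.norm.interval()` on made-up readings; the two should agree. Note that NumPy's `std()` defaults to `ddof=0`, so `ddof=1` is passed here to match the sample standard deviation that pandas computes by default.

```python
import numpy as np
from scipy import stats

sample = np.array([12, 9, 15, 11, 8, 14, 10, 13])  # made-up AQI readings

confidence_level = 0.95
sample_mean = sample.mean()
# ddof=1 gives the sample standard deviation (pandas' default behavior)
standard_error = sample.std(ddof=1) / np.sqrt(sample.size)

# Manual calculation: mean plus or minus z * standard error
z_value = stats.norm.ppf((1 + confidence_level) / 2)
manual = (sample_mean - z_value * standard_error,
          sample_mean + z_value * standard_error)

# One-call equivalent
builtin = stats.norm.interval(confidence_level, loc=sample_mean, scale=standard_error)

print(f"Manual:  {manual}")
print(f"Builtin: {builtin}")
print(f"Match:   {np.allclose(manual, builtin)}")
```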

In [None]:
import scipy.stats as stats
import numpy as np

# Given data
confidence_level = 0.95  # 95% confidence level
sample_mean = mean_aqi_ca  # Already calculated sample mean for California
sample_std = ca_aqi['AQI'].std()  # Standard deviation of the AQI data for California
sample_size = ca_aqi.shape[0]  # Sample size (number of data points)

# Calculate the confidence interval using scipy's norm.interval()
confidence_interval = stats.norm.interval(confidence_level, loc=sample_mean, scale=sample_std/np.sqrt(sample_size))

# Print the confidence interval
print(f"Confidence Interval: {confidence_interval}")


## Step 4: Results and evaluation

### Recalculate your confidence interval

Provide your chosen `confidence_level`, `sample_mean`, and `standard_error` to `stats.norm.interval()` and recalculate your confidence interval.

In [None]:
import scipy.stats as stats
import numpy as np

# Given data
confidence_level = 0.95  # 95% confidence level
sample_mean = mean_aqi_ca  # Sample mean for California (previously calculated)
sample_std = ca_aqi['AQI'].std()  # Standard deviation for California's AQI data
sample_size = ca_aqi.shape[0]  # Sample size (number of data points)

# Calculate the standard error
standard_error = sample_std / np.sqrt(sample_size)

# Calculate the confidence interval using scipy's norm.interval()
confidence_interval = stats.norm.interval(confidence_level, loc=sample_mean, scale=standard_error)

# Print the confidence interval
print(f"Confidence Level: {confidence_level * 100}%")
print(f"Sample Mean: {sample_mean}")
print(f"Standard Error: {standard_error}")
print(f"Confidence Interval: {confidence_interval}")


# Considerations

**What are some key takeaways that you learned from this project?**

A: Understanding Confidence Intervals: One of the most important lessons was how to calculate and interpret confidence intervals. By using the sample mean, standard error, and a chosen confidence level, you can determine a range (confidence interval) where the true population mean is likely to fall. This is a critical concept in statistics, particularly for making inferences about a population based on a sample.
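
One way to see what a 95% confidence level means is to simulate it: draw many samples from a population whose mean is known, build an interval from each, and count how often the interval contains the true mean. The population parameters below are hypothetical and chosen only for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

random.seed(0)  # fixed seed so the run is repeatable

TRUE_MEAN, TRUE_SD = 12.0, 7.0  # hypothetical population parameters
SAMPLE_SIZE, TRIALS = 60, 2000
z_value = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value

hits = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    margin = z_value * stdev(sample) / sqrt(SAMPLE_SIZE)
    # The interval contains the true mean when the sample mean
    # lies within one margin of error of it
    if abs(mean(sample) - TRUE_MEAN) <= margin:
        hits += 1

coverage = hits / TRIALS
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The share should land near 0.95, which is the sense in which the procedure, rather than any single interval, carries the 95% guarantee.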

Visualization for Data Insights: The use of visualizations like boxplots and violin plots helped in understanding the distribution of AQI values across different states. It allowed for insights into which states have higher pollution levels and how that might affect policy decisions. Visual tools make it easier to see patterns and outliers in data.

Sample Statistic Calculation: Calculating the sample mean AQI for California showed how to summarize a large dataset in a single value, and how that mean can be used in further statistical analyses, such as calculating confidence intervals.

Application of Statistical Tests: The project demonstrated how to apply statistical tests to real-world data, providing insights into how states may be affected by policy changes based on AQI levels. By calculating the margin of error and confidence intervals, we were able to quantify the uncertainty around our estimates and make more informed decisions.

Practical Use of Python Libraries: Using libraries like pandas, seaborn, and scipy.stats, I learned how to efficiently manage data, visualize it, and perform statistical analyses. These libraries are powerful tools for handling large datasets and performing complex analyses.

State-Specific Policy Insights: By looking at the AQI values for specific states like California, Texas, and Florida, it became clear that some states are likely to be more impacted by air quality policies than others. States with higher AQI values and a wider range of pollution would likely see more significant benefits from policy changes aimed at improving air quality.


**What findings would you share with others?**

A: State-Specific AQI Distribution: Certain states, such as California, Texas, and Florida, show higher AQI values and a wider spread, indicating pollution levels that are both higher and more variable. These states may be more affected by policies tied to air quality. For example, California’s AQI showed a larger spread and more extreme values, suggesting that air quality issues there might need more immediate or robust interventions.

Confidence Interval Insights: The confidence interval we calculated for California’s AQI (for instance, 95% confidence) provides a range of values within which the true population mean AQI is likely to fall. This helps quantify the uncertainty in our estimates and gives us more confidence in making decisions about potential policy impacts.

Visualizing Data: Visualization tools, such as boxplots and violin plots, are crucial for understanding the distribution of data. They allow for a clearer view of where AQI values cluster, where outliers lie, and which states have more extreme values. This can be helpful in identifying areas where air quality interventions might be most needed.

State Representation in the Dataset: Some states were overrepresented or underrepresented in the dataset, which could affect the generalizability of any findings to the broader population. It's important to account for this when drawing conclusions and considering policy implications.

Policy Implications: Given that states with higher AQI values may see more significant improvements from air quality policies, focusing efforts on states with the highest pollution levels (e.g., California, Texas) could yield the most immediate benefits. Additionally, policies should consider regional differences in AQI distributions and target areas with the most severe air quality issues.

Confidence in Results: At the 95% confidence level, intervals constructed this way capture the true population mean in about 95% of repeated samples, so the calculated range is a credible estimate of California's mean AQI. This is an important consideration when making decisions based on sample data and highlights the value of using statistical methods to assess uncertainty.

**What would you convey to external readers?**

A: Data Insights on Air Quality: The analysis of AQI values across different states reveals critical patterns about air quality in the U.S. States like California, Texas, and Florida show higher levels of pollution, with wider distributions of AQI values. This indicates a significant need for targeted air quality management policies in these regions, as they may benefit most from air quality improvement initiatives.

Quantifying Uncertainty: Using statistical tools, such as confidence intervals, we were able to quantify the uncertainty surrounding our estimates. This is especially important when making data-driven decisions that affect public health and policy. By understanding where the true population mean AQI for California lies (e.g., within a specific range with 95% confidence), we can more accurately predict the impact of potential policies.

Visualizing Data for Clarity: Visualizations such as boxplots and violin plots not only make the data more accessible and understandable but also help identify key trends, outliers, and distribution patterns. These visuals show that certain states face more significant air quality challenges and could benefit from focused policy changes. Such tools make complex data easier to interpret for decision-makers, researchers, and the public.

Policy Recommendations: Based on the analysis, I would recommend focusing policy efforts on the states with the highest pollution levels (e.g., California, Texas) as they will likely experience the most improvement from targeted air quality policies. Additionally, the variability in AQI values across states suggests that a one-size-fits-all policy may not be effective, and more localized approaches might be needed.

Importance of Data Representation: The dataset's representation of states plays a key role in the reliability of the conclusions. Some states were overrepresented or underrepresented, which could influence the generalizability of the results. Understanding this is vital for interpreting the findings and ensuring that policies are based on representative data.

Confidence and Decision-Making: The confidence interval methodology offers a reliable way to assess the likely range of true AQI values, which is crucial for making informed decisions about public health and regulatory interventions. It highlights the importance of using robust statistical methods when evaluating environmental data to ensure that policies are based on sound evidence.

**References**

[seaborn.boxplot — seaborn 0.12.1 documentation](https://seaborn.pydata.org/generated/seaborn.boxplot.html). (n.d.). 