# EPA Carbon Monoxide AQI Analysis

## Introduction

In this notebook, I will take a deeper look at air quality data from the Environmental Protection Agency (EPA), building on my previous analysis of the Air Quality Index (AQI). You can explore my earlier work here:

 - [GitHub](https://github.com/Cyberoctane29/EPA-Air-Quality-AQI-Analysis): https://github.com/Cyberoctane29/EPA-Air-Quality-AQI-Analysis
 - [Kaggle](https://www.kaggle.com/code/saswatsethda/epa-air-quality-aqi-analysis): https://www.kaggle.com/code/saswatsethda/epa-air-quality-aqi-analysis
 
While my previous project focused on basic statistical analysis, exploratory data analysis (EDA), and data structures, this notebook expands on that foundation by incorporating statistical methods, probability analysis, outlier detection, sampling techniques, and hypothesis testing. The primary focus is on carbon monoxide levels and their impact on air pollution and public health.

As a member of an analytics team for the United States Environmental Protection Agency (EPA), I have been assigned to analyze air quality data with respect to carbon monoxide, a major air pollutant. The dataset includes information from more than 200 monitoring sites across various states, counties, and cities. By applying statistical techniques, I will explore patterns, detect outliers, and conduct hypothesis testing to generate meaningful insights. These findings will help inform environmental policy decisions, identify regions requiring intervention, and assess how air quality trends impact public health strategies.


## **Overview**  

To achieve this, I will:  

- **Perform descriptive statistics** to summarize air quality data across different regions.  
- **Determine probability distributions** that best fit the dataset and analyze the spread of AQI values.  
- **Detect outliers** using z-scores and other statistical techniques.  
- **Apply effective sampling methods** to optimize analysis on large datasets.  
- **Conduct hypothesis tests** to assess differences in AQI across locations, helping guide policy decisions.  
- **Visualize key trends** in air pollution data using graphs and charts to enhance interpretability.  

By carrying out these analyses, I aim to identify **which regions require intervention**, understand **how air quality trends impact public health**, and provide **data-driven insights to support environmental policies**.


## **Dataset Structure**  

### **Air Quality Datasets**  
These datasets contain air quality data collected by the Environmental Protection Agency (EPA), specifically focusing on **carbon monoxide** levels across multiple locations in the United States. The data comes from over 200 monitoring sites, each identified by state, county, city, and local site names. The datasets provide key information for analyzing air pollution trends and their potential public health impacts.

#### **Dataset 1-c4_epa_air_quality.csv: Air Quality Measurements**  
This dataset contains raw air quality data, including:  
- **date_local**: The date when the air quality measurement was recorded.  
- **state_name**: The U.S. state where the air quality was measured.  
- **county_name**: The county where the monitoring site is located.  
- **city_name**: The city (if applicable) where the air quality was recorded.  
- **local_site_name**: The name of the specific monitoring station.  
- **parameter_name**: The pollutant measured, which in this case is carbon monoxide.  
- **units_of_measure**: The unit used for measurement (Parts per million).  
- **arithmetic_mean**: The average concentration of carbon monoxide for the given date and location.  
- **aqi**: The Air Quality Index (AQI) value derived from the carbon monoxide concentration.  

#### **Dataset 2-c4_epa_air_quality.csv: Log-Transformed AQI Data**  
This dataset contains a **log-transformed** version of the AQI values, which helps in analyzing data distribution and handling skewness in air pollution measurements. It includes:  
- **date_local, state_name, county_name, city_name, local_site_name, parameter_name, units_of_measure** (same as Dataset 1).  
- **aqi_log**: The natural logarithm of the AQI value for improved statistical analysis.  

By using these datasets, this notebook will **analyze air pollution trends, detect outliers, apply hypothesis testing, and provide insights into environmental policies aimed at improving air quality.**  
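
To illustrate why a log transform helps with skewed values, here is a minimal sketch using made-up positive AQI readings (not the EPA file). It assumes every value is greater than zero; the natural log of 0 is undefined, and how the second dataset handles zero readings is not stated here.

```python
import numpy as np
import pandas as pd

# Hypothetical, strictly positive AQI readings with one large value.
aqi = pd.Series([1, 2, 3, 5, 9, 14, 50], name="aqi")

# Natural logarithm, analogous to the aqi_log column described above.
aqi_log = np.log(aqi).rename("aqi_log")

# Sample skewness before and after the transform.
skew_raw = aqi.skew()
skew_log = aqi_log.skew()
print(f"skewness of aqi:     {skew_raw:.2f}")
print(f"skewness of aqi_log: {skew_log:.2f}")
```

The transform compresses the large values more than the small ones, which pulls in the long right tail.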


## Importing Required Libraries
Before beginning the analysis, it is essential to import all necessary libraries. 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm

# Exploring Air Quality Through Descriptive Statistics

### **Introduction**  

In this section, I use descriptive statistics to summarize air quality data from the United States Environmental Protection Agency (EPA) and to surface patterns that can inform environmental policy and public health decisions. Using **pandas** and **numpy**, I compute key measures, including the **mean, median, standard deviation, and percentiles**, to describe the central tendency and variability of AQI values. These summaries also make the findings easier to interpret and communicate.

I will load the dataset and display a sample of the data.


In [2]:
epa_data = pd.read_csv(r"C:\Users\saswa\Documents\GitHub\EPA-Carbon-Monoxide-AQI-Analysis\Data\c4_epa_air_quality.csv", index_col = 0)
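
The `index_col = 0` argument tells pandas to use the first column of the file as the row index instead of as data. A minimal sketch of the difference, using a small in-memory CSV with made-up rows in place of the real file:

```python
import io
import pandas as pd

# Hypothetical CSV mimicking the file's layout: the first, unnamed column
# holds row labels that were saved together with the data.
csv_text = (
    ",date_local,state_name,aqi\n"
    "0,2018-01-01,Arizona,7\n"
    "1,2018-01-01,Ohio,5\n"
)

# Without index_col, the unnamed column is kept as a data column.
raw = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, that column becomes the DataFrame index.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(indexed.columns))
```

Without the argument, the saved row labels show up as an extra `Unnamed: 0` column that would otherwise appear in every summary.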

To understand how the dataset is structured, I display the first 10 rows of the data.

In [3]:
epa_data.head(10)

| | date_local | state_name | county_name | city_name | local_site_name | parameter_name | units_of_measure | arithmetic_mean | aqi |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-01-01 | Arizona | Maricopa | Buckeye | BUCKEYE | Carbon monoxide | Parts per million | 0.473684 | 7 |
| 1 | 2018-01-01 | Ohio | Belmont | Shadyside | Shadyside | Carbon monoxide | Parts per million | 0.263158 | 5 |
| 2 | 2018-01-01 | Wyoming | Teton | Not in a city | Yellowstone National Park - Old Faithful Snow ... | Carbon monoxide | Parts per million | 0.111111 | 2 |
| 3 | 2018-01-01 | Pennsylvania | Philadelphia | Philadelphia | North East Waste (NEW) | Carbon monoxide | Parts per million | 0.3 | 3 |
| 4 | 2018-01-01 | Iowa | Polk | Des Moines | CARPENTER | Carbon monoxide | Parts per million | 0.215789 | 3 |
| 5 | 2018-01-01 | Hawaii | Honolulu | Not in a city | Kapolei | Carbon monoxide | Parts per million | 0.994737 | 14 |
| 6 | 2018-01-01 | Hawaii | Honolulu | Not in a city | Kapolei | Carbon monoxide | Parts per million | 0.2 | 2 |
| 7 | 2018-01-01 | Pennsylvania | Erie | Erie | NaN | Carbon monoxide | Parts per million | 0.2 | 2 |
| 8 | 2018-01-01 | Hawaii | Honolulu | Honolulu | Honolulu | Carbon monoxide | Parts per million | 0.4 | 5 |
| 9 | 2018-01-01 | Colorado | Larimer | Fort Collins | Fort Collins - CSU - S. Mason | Carbon monoxide | Parts per million | 0.3 | 6 |


### Understanding the AQI Column

The `aqi` column in the dataset represents the **Air Quality Index (AQI)**, which measures air pollution levels. It helps assess potential health impacts, with higher values indicating poorer air quality.

To gain more insights, I use the `info()` function:

In [4]:
epa_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 260 entries, 0 to 259
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   date_local        260 non-null    object 
 1   state_name        260 non-null    object 
 2   county_name       260 non-null    object 
 3   city_name         260 non-null    object 
 4   local_site_name   257 non-null    object 
 5   parameter_name    260 non-null    object 
 6   units_of_measure  260 non-null    object 
 7   arithmetic_mean   260 non-null    float64
 8   aqi               260 non-null    int64  
dtypes: float64(1), int64(1), object(7)
memory usage: 20.3+ KB


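This summarizes the dataset’s structure: 260 rows and 9 columns, with each column's data type and non-null count. Only `local_site_name` has missing values (257 non-null, so 3 are missing), and `date_local` is stored as `object` (text) rather than as a datetime.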
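
`info()` reports non-null counts, so missing values have to be inferred by subtraction. A minimal sketch of counting them directly, using a small made-up frame (the name `sample` and its rows are hypothetical, not taken from the EPA file):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for epa_data with one missing site name.
sample = pd.DataFrame({
    "state_name": ["Arizona", "Ohio", "Pennsylvania"],
    "local_site_name": ["BUCKEYE", "Shadyside", np.nan],
    "aqi": [7, 5, 2],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# The rows in which the site name is missing.
rows_missing_site = sample[sample["local_site_name"].isna()]
print(rows_missing_site)
```

Applied to `epa_data`, the same two calls would list the rows behind the gap in `local_site_name`.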

Next, I generate a table of descriptive statistics using the `describe()` function:

In [5]:
epa_data.describe()

| | arithmetic_mean | aqi |
|---|---|---|
| count | 260.0 | 260.0 |
| mean | 0.403169 | 6.757692 |
| std | 0.317902 | 7.061707 |
| min | 0.0 | 0.0 |
| 25% | 0.2 | 2.0 |
| 50% | 0.276315 | 5.0 |
| 75% | 0.516009 | 9.0 |
| max | 1.921053 | 50.0 |


#### Observations:

- The aqi column has a count of 260, so all 260 rows have a recorded AQI value. The dataset therefore represents 260 air quality observations across various locations.
- The 25th percentile value for AQI is 2, meaning that about 25% of the AQI values are at or below this level, indicating good air quality.
- The 75th percentile for AQI is 9, meaning that about 75% of the values are at or below this threshold. Most recorded air quality levels are on the lower end of the scale, indicating generally satisfactory conditions.
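
The percentile readings above can be checked directly. A minimal sketch with made-up AQI values (chosen for illustration, not drawn from the EPA file) that computes the quartiles and the share of readings at or below the 75th percentile:

```python
import numpy as np

# Hypothetical AQI readings, not taken from the EPA file.
aqi = np.array([0, 1, 2, 2, 3, 5, 5, 6, 9, 9, 14, 50])

# Quartiles, using numpy's default linear interpolation.
q25, q50, q75 = np.percentile(aqi, [25, 50, 75])

# Interquartile range: the spread of the middle half of the data.
iqr = q75 - q25

# Share of readings at or below the 75th percentile.
share_at_or_below_q75 = np.mean(aqi <= q75)

print(q25, q50, q75, iqr, share_at_or_below_q75)
```

With repeated integer values, the share at or below a percentile can exceed the nominal figure, which is why "at or below" is the safer wording.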

### Analyzing AQI Across States
To see how the observations are distributed across states, I summarize the `state_name` column:

In [6]:
epa_data['state_name'].describe()


count            260
unique            52
top       California
freq              66
Name: state_name, dtype: object

#### Key Insights:
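- The `state_name` column contains **52 unique values**. Since there are only 50 U.S. states, the column presumably also includes non-state jurisdictions such as the District of Columbia or a territory.
- **California appears most frequently (66 of 260 rows, about 25%)**, indicating that a significant portion of the air quality data is collected from this state, possibly due to higher pollution levels or more monitoring stations.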
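
`describe()` on a text column reports only the most frequent value. To see the full breakdown, `value_counts()` is a natural next step. A minimal sketch on a made-up series (the values are hypothetical):

```python
import pandas as pd

# Hypothetical state labels, not taken from the EPA file.
states = pd.Series(
    ["California", "California", "Arizona", "Ohio", "California", "Arizona"],
    name="state_name",
)

# Number of rows per state, most frequent first.
counts = states.value_counts()

# The same breakdown as a share of all rows.
shares = states.value_counts(normalize=True)

# Equivalent of the "top" and "unique" fields reported by describe().
top_state = counts.idxmax()
n_unique = states.nunique()

print(counts)
print(shares.round(2))
```

On the real data, `epa_data['state_name'].value_counts()` would show which values make up the 52 categories.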

### Statistical Analysis of AQI
To understand central tendencies and dispersion, I calculate key statistical metrics for the AQI column.

#### Mean AQI:


In [7]:
np.mean(epa_data["aqi"])


np.float64(6.757692307692308)

- The mean **AQI is approximately 6.76**, indicating that, on average, the air quality index in this dataset remains within the safe range.

#### Median AQI:

In [8]:
np.median(epa_data["aqi"])


np.float64(5.0)

- The median **AQI is 5**, so half of the readings are at or below 5, which confirms that air quality readings are mostly low.
- Since the mean is slightly higher than the median, the AQI values may be slightly **right-skewed**.
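
Comparing the mean with the median is only a rough indicator of skew. A sample skewness statistic gives a more direct check. A minimal sketch with made-up values, using `scipy.stats.skew`:

```python
import numpy as np
from scipy import stats

# Hypothetical AQI readings with one large value in the right tail.
aqi = np.array([0, 1, 2, 2, 3, 5, 5, 6, 9, 9, 14, 50])

mean_aqi = aqi.mean()
median_aqi = np.median(aqi)

# Positive skewness means a longer right tail; zero means symmetric.
skewness = stats.skew(aqi)

print(f"mean={mean_aqi:.2f}, median={median_aqi:.1f}, skewness={skewness:.2f}")
```

A positive result supports the right-skew reading; on the real column the call would be `stats.skew(epa_data["aqi"])`.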

#### Minimum AQI (Best Recorded Air Quality):

In [9]:
np.min(epa_data["aqi"])

np.int64(0)

- The minimum **AQI value is 0**, representing the cleanest air observed in the dataset.


#### Maximum AQI (Worst Recorded Air Quality):

In [10]:
np.max(epa_data["aqi"])

np.int64(50)

- The maximum **AQI value is 50**, which is still well below the threshold of 100, indicating that even the worst recorded air quality remains within the safe range.

#### Standard Deviation of AQI:


In [11]:
np.std(epa_data["aqi"], ddof=1)


np.float64(7.061706678820724)

- The **standard deviation is approximately 7.06**, indicating moderate variability in AQI values across the dataset.
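
The `ddof=1` argument matters here: numpy divides by `n` by default (population standard deviation), while `ddof=1` divides by `n - 1` (sample standard deviation), which is also what pandas `describe()` reports. A minimal sketch of the difference on a small made-up series:

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen so the arithmetic is easy to follow:
# mean = 5, sum of squared deviations = 32.
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])
arr = values.to_numpy()

# numpy default (ddof=0): sqrt(32 / 8)
population_std = np.std(arr)

# Sample standard deviation (ddof=1): sqrt(32 / 7)
sample_std = np.std(arr, ddof=1)

# pandas uses ddof=1 by default, so it matches sample_std.
pandas_std = values.std()

print(population_std, sample_std, pandas_std)
```

With 260 observations the two versions differ only slightly, but passing `ddof=1` keeps the numpy result consistent with the `std` row in the `describe()` output.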

## Key Takeaways  

- The **pandas** and **numpy** libraries provide powerful functions for computing descriptive statistics.  
- The **describe()** function in pandas provides a comprehensive summary of numerical or categorical columns.  
- Functions like **mean()**, **median()**, **min()**, **max()**, and **std()** in numpy allow for precise calculation of individual statistics.  

## Summary of Findings  

- The **arithmetic_mean** and **aqi** columns are **slightly right-skewed**.  
- The **range** of **arithmetic_mean** is **1.9**, while the **range** of **aqi** is **50**.  
- The **median values** are **0.276** (arithmetic_mean) and **5** (aqi).  
- The **standard deviations** are **0.3** for arithmetic_mean and **7.06** for aqi.  
- The **mean values** are **0.4** (arithmetic_mean) and **6.76** (aqi).  
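- In absolute terms, **arithmetic_mean varies less** than **aqi**, which shows **moderate variability**. The two columns are measured on different scales (parts per million versus index points), so their standard deviations are not directly comparable.  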
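
Because the two columns use different units, one way to compare their spread is the coefficient of variation (standard deviation divided by mean). This is an additional measure I am introducing for illustration. A minimal sketch using the values reported by `describe()` above:

```python
# Mean and standard deviation as reported in the describe() output.
summary = {
    "arithmetic_mean": {"mean": 0.403169, "std": 0.317902},
    "aqi": {"mean": 6.757692, "std": 7.061707},
}

# Coefficient of variation: spread relative to the size of the mean.
cv = {column: s["std"] / s["mean"] for column, s in summary.items()}

for column, value in cv.items():
    print(f"{column}: CV = {value:.2f}")
```

On this relative scale, both columns vary by an amount close to their own mean, with aqi somewhat more dispersed than arithmetic_mean.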


### Presentation of Findings to Others  
To effectively present the findings to others, I would highlight key AQI statistics and relate them to air quality standards provided by AirNow.gov.  

- The **average AQI value** in this dataset is approximately **6.76**, which falls well within the **"safe" range**.  
- **75% of the AQI values** are at or below **9**, indicating that most of the data represents **satisfactory air quality**.  
- The **AQI values** in this dataset **do not exceed 50**, which is significantly below the threshold of **100**. This implies that **even for sensitive groups, the air quality is considered safe**.  
- For **carbon monoxide**, an **AQI of 50** corresponds to about **4.4 parts per million (ppm)**, the top of the "good" category, which is well below the **9.4 ppm** associated with an AQI of **100**.  
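
Within each category, the AQI is a linear interpolation between concentration breakpoints. A minimal sketch of that calculation for carbon monoxide, under several assumptions: the breakpoints are the ones I understand EPA to use for an 8-hour average (0.0 to 4.4 ppm for AQI 0 to 50, and 4.5 to 9.4 ppm for AQI 51 to 100), and the helper `co_aqi` is a hypothetical name of my own. Check the breakpoints against the current official table before relying on them. The `aqi` column in this dataset may not be reproducible from the daily `arithmetic_mean`, since the averaging periods differ.

```python
import math

# Assumed CO breakpoints (ppm, 8-hour average) for the first two AQI
# categories: (conc_low, conc_high, aqi_low, aqi_high).
CO_BREAKPOINTS = [
    (0.0, 4.4, 0, 50),     # Good
    (4.5, 9.4, 51, 100),   # Moderate
]

def co_aqi(concentration_ppm: float) -> int:
    """Linearly interpolate an AQI value from a CO concentration."""
    # Truncate to one decimal place so values fall inside a breakpoint range.
    c = math.floor(concentration_ppm * 10 + 1e-9) / 10
    for conc_low, conc_high, aqi_low, aqi_high in CO_BREAKPOINTS:
        if conc_low <= c <= conc_high:
            slope = (aqi_high - aqi_low) / (conc_high - conc_low)
            return round(slope * (c - conc_low) + aqi_low)
    raise ValueError("concentration outside the ranges covered by this sketch")

for ppm in (0.0, 2.2, 4.4, 4.5, 9.4):
    print(ppm, co_aqi(ppm))
```

Only the first two categories are included because every AQI value in this dataset is 50 or below.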


### Summary for Stakeholders  
To provide stakeholders with a clear summary, I would focus on key descriptive statistics and their implications:  

- **AQI Range and Distribution**: The AQI values range from **0 to 50**, with **75% of the values at or below 9**. This suggests that the air quality is predominantly within the **"good" range**, as defined by AirNow.gov.  
- **Central Tendency**: The **mean AQI is 6.76**, reflecting **satisfactory air quality**. Additionally, the **median AQI is 5**, showing that most values are clustered toward the lower end of the range.  
- **Variability**: The **standard deviation of AQI is 7.06**, indicating **moderate variability** in air quality measurements across the dataset.  
- **Interpretation for Stakeholders**: All AQI values in the dataset are **well below 100**, meaning the air quality is **safe even for sensitive groups**. For carbon monoxide, an **AQI of 50 corresponds to about 4.4 ppm**, which is **significantly below the 9.4 ppm** associated with an AQI of **100**.  

### Actionable Insight  
While this dataset indicates **safe air quality**, **funding could be allocated to monitor regions showing higher AQI values (closer to 50)**. This would ensure **proactive efforts** in maintaining and improving air quality, particularly for **sensitive groups**.  
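
To identify candidate regions for closer monitoring, the observations can be summarized per state and filtered by a cut-off. A minimal sketch on made-up rows; the frame `sample` and the threshold of 40 are hypothetical choices for illustration, not values derived from the EPA data or from any standard:

```python
import pandas as pd

# Hypothetical observations, not taken from the EPA file.
sample = pd.DataFrame({
    "state_name": ["Arizona", "Arizona", "Ohio", "Hawaii", "Hawaii"],
    "aqi": [7, 50, 5, 14, 2],
})

# Per-state mean, maximum, and number of readings, highest maximum first.
summary = (
    sample.groupby("state_name")["aqi"]
    .agg(["mean", "max", "count"])
    .sort_values("max", ascending=False)
)

# Illustrative cut-off for "closer to 50"; adjust to the policy question.
threshold = 40
flagged = summary[summary["max"] >= threshold]

print(summary)
print(flagged)
```

The same grouping applied to `epa_data` would produce a ranked list of states, which could be refined to counties or individual sites by changing the grouping column.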
