# EPA Carbon Monoxide AQI Analysis

## Introduction

In this notebook, I will take a deeper look at air quality data from the Environmental Protection Agency (EPA), building on my previous analysis of the Air Quality Index (AQI). You can explore my earlier work here:

 - [GitHub](https://github.com/Cyberoctane29/EPA-Air-Quality-AQI-Analysis)
 - [Kaggle](https://www.kaggle.com/code/saswatsethda/epa-air-quality-aqi-analysis)
 
While my previous project focused on basic statistical analysis, exploratory data analysis (EDA), and data structures, this notebook builds on that foundation with descriptive statistics, probability distributions, outlier detection, sampling techniques, and hypothesis testing. The primary focus is on carbon monoxide levels and what they indicate about air pollution and public health.

As a member of an analytics team for the United States Environmental Protection Agency (EPA), I have been assigned to analyze air quality data with respect to carbon monoxide, a major air pollutant. The dataset includes information from more than 200 monitoring sites across various states, counties, and cities. By applying statistical techniques, I will explore patterns, detect outliers, and conduct hypothesis testing to generate meaningful insights. These findings will help inform environmental policy decisions, identify regions requiring intervention, and assess how air quality trends impact public health strategies.


## **Overview**  

To achieve this, I will:  

- **Perform descriptive statistics** to summarize air quality data across different regions.  
- **Determine probability distributions** that best fit the dataset and analyze the spread of AQI values.  
- **Detect outliers** using z-scores and other statistical techniques.  
- **Apply effective sampling methods** to optimize analysis on large datasets.  
- **Conduct hypothesis tests** to assess differences in AQI across locations, helping guide policy decisions.  
- **Visualize key trends** in air pollution data using graphs and charts to enhance interpretability.  

By carrying out these analyses, I aim to identify **which regions require intervention**, understand **how air quality trends impact public health**, and provide **data-driven insights to support environmental policies**.
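
As a rough illustration of the z-score approach to outlier detection listed above, the sketch below flags readings that lie more than three standard deviations from the mean. The helper name `zscore_outliers`, the threshold of 3, and the sample readings are my own assumptions for demonstration; the readings are made up and are not taken from the EPA dataset.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return (z_scores, mask), where mask marks entries with |z| > threshold.

    Hypothetical helper for illustration; uses the population standard
    deviation (ddof=0).
    """
    arr = np.asarray(values, dtype=float)
    std = arr.std(ddof=0)
    if std == 0:
        # All values identical: no spread, so nothing can be an outlier
        z = np.zeros_like(arr)
    else:
        z = (arr - arr.mean()) / std
    return z, np.abs(z) > threshold

# Made-up AQI readings for illustration only (not from the EPA dataset)
sample_aqi = [5, 6, 4, 7, 5, 6, 5, 4, 6, 5, 7, 6, 5, 4, 50]
z_scores, mask = zscore_outliers(sample_aqi)
outliers = np.asarray(sample_aqi)[mask]
print(outliers)
```

With very small samples a single extreme value inflates the standard deviation it is measured against, so a threshold of 3 can miss it; the cutoff should be chosen with the sample size in mind.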


## **Dataset Structure**  

### **Air Quality Datasets**  
These datasets contain air quality data collected by the Environmental Protection Agency (EPA), specifically focusing on **carbon monoxide** levels across multiple locations in the United States. The data comes from over 200 monitoring sites, each identified by state, county, city, and local site names. The datasets provide key information for analyzing air pollution trends and their potential public health impacts.

#### **Dataset 1 - c4_epa_air_quality.csv: Air Quality Measurements**  
This dataset contains raw air quality data, including:  
- **date_local**: The date when the air quality measurement was recorded.  
- **state_name**: The U.S. state where the air quality was measured.  
- **county_name**: The county where the monitoring site is located.  
- **city_name**: The city (if applicable) where the air quality was recorded.  
- **local_site_name**: The name of the specific monitoring station.  
- **parameter_name**: The pollutant measured, which in this case is carbon monoxide.  
- **units_of_measure**: The unit used for measurement (Parts per million).  
- **arithmetic_mean**: The average concentration of carbon monoxide for the given date and location.  
- **aqi**: The Air Quality Index (AQI) value derived from the carbon monoxide concentration.  

#### **Dataset 2 - c4_epa_air_quality.csv: Log-Transformed AQI Data**  
This dataset contains a **log-transformed** version of the AQI values, which helps in analyzing data distribution and handling skewness in air pollution measurements. It includes:  
- **date_local, state_name, county_name, city_name, local_site_name, parameter_name, units_of_measure** (same as Dataset 1).  
- **aqi_log**: The natural logarithm of the AQI value for improved statistical analysis.  
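
To make the relationship between `aqi` and `aqi_log` concrete, here is a minimal sketch of a natural-log transform on a few illustrative AQI values. The values and the variable name `demo` are assumptions for demonstration. How the published dataset treats AQI values of 0 (where the logarithm is undefined) is not stated in the source, so that case is deliberately left out.

```python
import numpy as np
import pandas as pd

# Small illustrative AQI values (assumed for this sketch, all strictly positive)
demo = pd.DataFrame({"aqi": [7, 5, 2, 3, 14]})

# Natural logarithm of each AQI value
demo["aqi_log"] = np.log(demo["aqi"])

# Exponentiating the log column recovers the original values
demo["aqi_recovered"] = np.exp(demo["aqi_log"])
print(demo)
```

Because the logarithm compresses large values more than small ones, a right-skewed AQI distribution tends to look more symmetric after the transform.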

By using these datasets, this notebook will **analyze air pollution trends, detect outliers, apply hypothesis testing, and provide insights into environmental policies aimed at improving air quality.**  


## Importing Required Libraries
Before beginning the analysis, it is essential to import all necessary libraries. 

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm

# Exploring Air Quality Through Descriptive Statistics

### **Introduction**  

In this section, I use descriptive statistics to summarize air quality data from the United States Environmental Protection Agency (EPA) and to surface patterns that can inform environmental policy and public health initiatives. Using the Python libraries **pandas** and **numpy**, I compute key statistical measures, including the **mean, median, standard deviation, and percentiles**, to understand the central tendency and variability of AQI values. These summaries are the basis for interpreting the data and communicating the findings clearly.

I will load the dataset and display a sample of the data.


In [8]:
epa_data = pd.read_csv(r"C:\Users\saswa\Documents\GitHub\EPA-Carbon-Monoxide-AQI-Analysis\Data\c4_epa_air_quality.csv", index_col = 0)
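
The absolute Windows path above only resolves on my machine. A more portable variant is sketched below, assuming the notebook is run from the repository root so that the CSV sits in a `Data` folder relative to the working directory. The function name `load_epa_data` and the `DATA_PATH` constant are hypothetical names introduced for this sketch.

```python
from pathlib import Path

import pandas as pd

# Assumed layout: <working directory>/Data/c4_epa_air_quality.csv
DATA_PATH = Path("Data") / "c4_epa_air_quality.csv"

def load_epa_data(source=DATA_PATH):
    """Read the EPA CSV, using its first column as the DataFrame index.

    `source` may be a path or any file-like object accepted by pd.read_csv.
    """
    return pd.read_csv(source, index_col=0)

# Usage, once the file exists at the assumed location:
# epa_data = load_epa_data()
```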

To understand how the dataset is structured, I display the first 10 rows of the data.

In [6]:
epa_data.head(10)

| Unnamed: 0.1 | Unnamed: 0 | date_local | state_name | county_name | city_name | local_site_name | parameter_name | units_of_measure | arithmetic_mean | aqi |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2018-01-01 | Arizona | Maricopa | Buckeye | BUCKEYE | Carbon monoxide | Parts per million | 0.473684 | 7 |
| 1 | 1 | 2018-01-01 | Ohio | Belmont | Shadyside | Shadyside | Carbon monoxide | Parts per million | 0.263158 | 5 |
| 2 | 2 | 2018-01-01 | Wyoming | Teton | Not in a city | Yellowstone National Park - Old Faithful Snow ... | Carbon monoxide | Parts per million | 0.111111 | 2 |
| 3 | 3 | 2018-01-01 | Pennsylvania | Philadelphia | Philadelphia | North East Waste (NEW) | Carbon monoxide | Parts per million | 0.3 | 3 |
| 4 | 4 | 2018-01-01 | Iowa | Polk | Des Moines | CARPENTER | Carbon monoxide | Parts per million | 0.215789 | 3 |
| 5 | 5 | 2018-01-01 | Hawaii | Honolulu | Not in a city | Kapolei | Carbon monoxide | Parts per million | 0.994737 | 14 |
| 6 | 6 | 2018-01-01 | Hawaii | Honolulu | Not in a city | Kapolei | Carbon monoxide | Parts per million | 0.2 | 2 |
| 7 | 7 | 2018-01-01 | Pennsylvania | Erie | Erie |  | Carbon monoxide | Parts per million | 0.2 | 2 |
| 8 | 8 | 2018-01-01 | Hawaii | Honolulu | Honolulu | Honolulu | Carbon monoxide | Parts per million | 0.4 | 5 |
| 9 | 9 | 2018-01-01 | Colorado | Larimer | Fort Collins | Fort Collins - CSU - S. Mason | Carbon monoxide | Parts per million | 0.3 | 6 |
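
As a small preview of the descriptive measures discussed in this section, the sketch below computes them for the `aqi` values of the ten rows displayed above. It covers only those ten rows, so the numbers say nothing about the full dataset; the variable names `aqi_sample` and `summary` are my own choices for this illustration.

```python
import pandas as pd

# AQI values copied from the ten rows displayed above (not the full dataset)
aqi_sample = pd.Series([7, 5, 2, 3, 3, 14, 2, 2, 5, 6], name="aqi")

summary = {
    "mean": aqi_sample.mean(),
    "median": aqi_sample.median(),
    "std": aqi_sample.std(),            # sample standard deviation (ddof=1)
    "p25": aqi_sample.quantile(0.25),   # 25th percentile, linear interpolation
    "p75": aqi_sample.quantile(0.75),   # 75th percentile, linear interpolation
}
print(summary)
```

The mean (4.9) sits above the median (4.0) because the single reading of 14 pulls the average upward, which is the kind of right skew that motivates looking at both measures. On the full DataFrame, `epa_data["aqi"].describe()` returns the count, mean, standard deviation, minimum, quartiles, and maximum in one call.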
