# **EPA Air Quality Analysis**  

## **Introduction**  

Air pollution is a growing concern, and the Air Quality Index (AQI) is a crucial metric used to assess pollution levels. The AQI ranges from 0 to 500, where higher values indicate greater pollution and potential health risks. For example, an AQI of 50 or below represents good air quality, while an AQI above 300 is considered hazardous.  
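
As a rough illustration of how AQI values map to categories, the sketch below uses the standard EPA bands (Good, Moderate, Unhealthy for Sensitive Groups, Unhealthy, Very Unhealthy, Hazardous). The function name `aqi_category` is hypothetical and is not part of the analysis that follows.

```python
def aqi_category(aqi):
    """Return the EPA category name for a non-negative AQI value."""
    if aqi < 0:
        raise ValueError("AQI cannot be negative")
    # Upper bound of each band, paired with its category name
    bands = [
        (50, "Good"),
        (100, "Moderate"),
        (150, "Unhealthy for Sensitive Groups"),
        (200, "Unhealthy"),
        (300, "Very Unhealthy"),
    ]
    for upper, name in bands:
        if aqi <= upper:
            return name
    # Anything above 300 falls in the top band
    return "Hazardous"

print(aqi_category(42))   # Good
print(aqi_category(75))   # Moderate
print(aqi_category(301))  # Hazardous
```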

In this project, I analyze air quality data collected by the U.S. Environmental Protection Agency (EPA) to gain insights into pollution levels across different states and counties. Python data structures such as lists, dictionaries, and sets are used to store and analyze AQI data efficiently.  

### Dataset Structure  

#### **1. c2_epa_air_quality**  
This dataset contains air quality index (AQI) data along with state and county identifiers. It includes the following fields:  

- **state_code**: A two-digit string representing the state code.  
- **state_name**: The full name of the state.  
- **county_code**: A three-digit string representing the county code within the state.  
- **county_name**: The name of the county.  
- **aqi**: A numerical value representing the air quality index for the county.  
- **state_code_int**: An integer representation of the state code.  
- **county_code_int**: An integer representation of the county code.  
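
The relationship between the string codes and their integer counterparts can be sketched as follows. The code values here are made up for illustration, and the `state_code_restored` column is a hypothetical name. The point is that the integer form drops leading zeros, so converting back requires re-padding.

```python
import pandas as pd

# Made-up zero-padded codes in the same shape as the dataset's string columns
codes = pd.DataFrame({
    "state_code": ["04", "06", "42"],
    "county_code": ["013", "037", "003"],
})

# String -> integer: leading zeros are lost
codes["state_code_int"] = codes["state_code"].astype(int)
codes["county_code_int"] = codes["county_code"].astype(int)

# Integer -> string: pad back to the fixed width
codes["state_code_restored"] = codes["state_code_int"].astype(str).str.zfill(2)

print(codes["state_code_int"].tolist())       # [4, 6, 42]
print(codes["state_code_restored"].tolist())  # ['04', '06', '42']
```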

---

#### **2. epa_ca_tx_pa**  
This dataset focuses on air quality data for California, Texas, and Pennsylvania. It includes:  

- **state_code**: A numerical code identifying the state.  
- **state_name**: The full name of the state.  
- **county_code**: A numerical code identifying the county.  
- **county_name**: The full name of the county.  
- **aqi**: A numerical value representing the air quality index for that county.  

---

#### **3. epa_others**  
This dataset contains air quality data for states other than California, Texas, and Pennsylvania. It shares the same structure as *epa_ca_tx_pa*:

- **state_code**: A numerical identifier for the state.  
- **state_name**: The full name of the state.  
- **county_code**: A numerical identifier for the county.  
- **county_name**: The full name of the county.  
- **aqi**: A numerical value representing the air quality index for that county.  

All three datasets share the same core fields (state and county identifiers plus the AQI value), which makes them easy to combine and analyze.


#### **Importing Required Libraries**
First, I import the necessary libraries for analysis. 

In [None]:
import pandas as pd
import numpy as np
import statistics  


To begin, I load the dataset containing AQI records along with the state and county where each reading was recorded. My goal is to structure this data for easy retrieval and analysis.

I start by converting the raw data into a structured format. Specifically, I create a list of tuples where each tuple contains three key pieces of information:

- State Name
- County Name
- AQI Value
  
This format ensures that each AQI reading is associated with its respective location.

In [None]:

# Load dataset  
epa_data = pd.read_csv(r'C:\Users\saswa\Documents\GitHub\EPA-Air-Quality-AQI-Analysis\Data\c2_epa_air_quality.csv')  

# Convert columns to lists  
state_list = epa_data['state_name'].to_list()  
county_list = epa_data['county_name'].to_list()  
aqi_list = epa_data['aqi'].to_list()  

# Create list of tuples  
epa_tuples = list(zip(state_list, county_list, aqi_list))  

# Display first five records  
epa_tuples[:5]

### **Creating a Dictionary for AQI Data**
To allow for efficient data retrieval, the tuples are converted into a dictionary where states act as keys, and the values are lists of county-level AQI records. This enables quick lookups and better organization of air quality information.

In [None]:
aqi_dict = {}  

for state, county, aqi in epa_tuples:  
    if state in aqi_dict:  
        aqi_dict[state].append((county, aqi))  
    else:  
        aqi_dict[state] = [(county, aqi)]  

# Example: Retrieve data for Vermont  
aqi_dict['Vermont']
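
The same grouping can also be written with `collections.defaultdict`, which removes the explicit membership check. This is a minimal sketch on made-up records; `sample_tuples` and `sample_dict` are hypothetical names, not variables from the notebook.

```python
from collections import defaultdict

# Made-up records in the same (state, county, aqi) shape
sample_tuples = [
    ("Vermont", "Chittenden", 3),
    ("Vermont", "Rutland", 1),
    ("Ohio", "Franklin", 5),
]

sample_dict = defaultdict(list)
for state, county, aqi in sample_tuples:
    # A missing key starts as an empty list, so no if/else is needed
    sample_dict[state].append((county, aqi))

print(sample_dict["Vermont"])  # [('Chittenden', 3), ('Rutland', 1)]
```

One trade-off: looking up a state that is absent from a `defaultdict` silently creates an empty entry, whereas a plain dictionary raises a `KeyError`.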

### **Exploring the Data**
Now that I have a structured dataset, I can extract meaningful insights from the AQI records.

### **Number of AQI Readings for Arizona**
The total number of AQI readings recorded in Arizona is calculated using Python’s len() function.

In [None]:
recordings_cnt = len(aqi_dict['Arizona'])  
print(recordings_cnt)  


### **Mean AQI for California**
To determine the average air quality in California, I compute the mean of all AQI values recorded in the state.

In [None]:
# Using sum and len  
aqi_mean = sum(aqi for county, aqi in aqi_dict['California']) / len(aqi_dict['California'])  
print(aqi_mean)  

# Using statistics.mean()  
aqi_mean1 = statistics.mean(aqi for county, aqi in aqi_dict['California'])  
print(aqi_mean1)  

### **Analyzing County-Specific AQI Readings**
A function is defined to count how many times a county appears in a given state's AQI records. This allows me to assess data distribution across different regions.

In [None]:
def county_counter(state_name):  
    county_dict = {}  
    for county, aqi in aqi_dict[state_name]:  
        if county in county_dict:  
            county_dict[county] += 1  
        else:  
            county_dict[county] = 1  
    return county_dict  
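
For comparison, the standard library's `collections.Counter` performs the same tally in one line. The sketch below runs on made-up `(county, aqi)` pairs; `count_counties` and `sample_records` are hypothetical names.

```python
from collections import Counter

# Made-up (county, aqi) records for a single state
sample_records = [("Washington", 4), ("Allegheny", 9), ("Washington", 6)]

def count_counties(records):
    """Count how many readings each county has in a list of (county, aqi) pairs."""
    return Counter(county for county, _ in records)

counts = count_counties(sample_records)
print(counts["Washington"])  # 2
print(counts["Erie"])        # 0, because a Counter returns 0 for missing keys
```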


### **Checking AQI Readings for Washington County, PA**
Using county_counter(), I determine how many AQI readings are recorded for Washington County in Pennsylvania.

In [None]:
pa_dict = county_counter('Pennsylvania')  
pa_dict['Washington']  


### **Listing Counties in Indiana**
Here I retrieve the names of the counties represented in Indiana’s AQI records.

In [None]:
county_counter("Indiana").keys()


### **Finding Counties with Duplicate Names**
To check how many counties share the same name across different states, I compile a list of all counties and analyze name repetitions.

In [None]:
# Constructing a list of every county in the dataset  
all_counties = []  
for state_name in aqi_dict.keys():  
    all_counties += list(county_counter(state_name).keys())  

# Finding total number of counties  
print(len(all_counties))  


### **Calculating Counties with Duplicate Names**
Using a set and list methods, I count how many county entries share their name with a county in another state.

In [None]:
shared_count = 0  

for county in set(all_counties):  
    count = all_counties.count(county)  
    if count > 1:  
        shared_count += count  

print(shared_count)  


Note: This total counts every county entry whose name is shared. It does not say how many distinct county names are duplicated.

To count the distinct duplicated names instead, I add one per repeated name rather than adding its count:

In [None]:
county_count = 0  

for county in set(all_counties):  
    count1 = all_counties.count(county)  
    if count1 > 1:  
        county_count += 1  

print(county_count)  
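
Both figures can also be computed in a single pass with `collections.Counter`, which avoids calling `list.count()` once per name. This is a sketch on made-up county names; all variable names are hypothetical.

```python
from collections import Counter

# Made-up county names, one entry per (state, county) pair
sample_counties = ["Washington", "Jefferson", "Washington", "Kent", "Jefferson", "Washington"]

name_counts = Counter(sample_counties)

# Keep only the names that occur more than once
repeated = {name: n for name, n in name_counts.items() if n > 1}

distinct_repeated_names = len(repeated)               # how many names repeat
entries_with_repeated_name = sum(repeated.values())   # how many entries carry those names

print(distinct_repeated_names)     # 2
print(entries_with_repeated_name)  # 5
```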


## **Vectorized Operations: Using NumPy for Efficient Analysis**

In this section, I use NumPy, a library for numerical computing, to analyze the AQI data. NumPy provides fast operations on large arrays, enabling efficient calculations on the dataset.

### **Converting AQI Data into an Array**
I convert the AQI list into a NumPy ndarray for efficient computation and analysis.

In [None]:
aqi_list = epa_data['aqi'].to_list()
aqi_array = np.array(aqi_list) 
print(len(aqi_array)) 
print(aqi_array[:5])

### **Deriving Summary Statistics**
Next, I calculate some key summary statistics of the AQI data:

- Maximum AQI value
- Minimum AQI value
- Median AQI value
- Standard deviation of AQI

In [None]:
print(f"Max = {np.max(aqi_array)}")
print(f"Min = {np.min(aqi_array)}")
print(f"Median = {np.median(aqi_array)}")
print(f"Std = {np.std(aqi_array)}")
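
One detail worth knowing when reading the standard deviation: `np.std` computes the population standard deviation by default (`ddof=0`). The sketch below, on a made-up array, shows how `ddof=1` gives the sample standard deviation instead, and checks both against the `statistics` module.

```python
import statistics

import numpy as np

# Made-up values chosen so the population standard deviation is exactly 2
sample = np.array([2, 4, 4, 4, 5, 5, 7, 9])

population_std = np.std(sample)       # ddof=0: divides by n
sample_std = np.std(sample, ddof=1)   # ddof=1: divides by n - 1

print(population_std)  # 2.0
# The two conventions line up with statistics.pstdev and statistics.stdev
print(np.isclose(population_std, statistics.pstdev(sample.tolist())))  # True
print(np.isclose(sample_std, statistics.stdev(sample.tolist())))       # True
```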

### **Analyzing Cleanest AQI Readings**
The goal is to analyze how many readings represent the cleanest air (defined as AQI values of 5 or less). Using NumPy’s element-wise operations, I can quickly calculate the percentage of readings that fall into this category.

In [None]:
# Create a boolean mask where True represents AQI values ≤5
boolean_aqi = (aqi_array <= 5)  

# Calculate the percentage of AQI values that are 5 or less
percent_under_6 = boolean_aqi.sum() / len(boolean_aqi)  
print(f"Percentage of AQI readings ≤5: {percent_under_6:.2%}")  
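
Because `True` counts as 1 and `False` as 0, the share of values that satisfy a condition is simply the mean of the mask. A minimal sketch on made-up readings (the names `sample_aqi`, `mask`, and `share` are hypothetical):

```python
import numpy as np

# Made-up AQI readings
sample_aqi = np.array([3, 12, 5, 0, 47, 6, 2, 19])

mask = sample_aqi <= 5   # element-wise comparison gives an array of True/False
share = mask.mean()      # mean of a boolean array is the fraction of True values

print(mask.sum())        # 4
print(f"{share:.2%}")    # 50.00%
```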


## **Data Manipulation with pandas: Exploring Air Quality Data**

### **Loading the AQI Data into a DataFrame**
For this analysis, I start with a dataset containing air quality data from three states: California, Texas, and Pennsylvania. The first step is to load the data into a pandas DataFrame and inspect its structure.

In [None]:
# Load the dataset
top3 = pd.read_csv(r'C:\Users\saswa\Documents\GitHub\EPA-Air-Quality-AQI-Analysis\Data\epa_ca_tx_pa.csv')

# Display the first five rows of the dataset
top3.head()

### **Examining Metadata and Summary Statistics**
Once the dataset is loaded, the next step is to explore the basic information, such as the number of rows and columns, column names, and data types.

In [None]:
# Get metadata of the dataframe
top3.info()

# Generate summary statistics for the numeric columns
top3.describe()

### **Exploring the Data: Insights by State and AQI**
To better understand the distribution of the data, I'll start by examining the number of observations per state and sorting the data by AQI values.

In [None]:
# Count the number of rows per state
top3['state_name'].value_counts()

# Sort the data by AQI values in descending order
top3_sorted = top3.sort_values('aqi', ascending=False)

# Display the top 10 rows with the highest AQI values
top3_sorted.head(10)


### **Investigating the California Data**
Now, focusing on the California data, I'll use Boolean masking to filter and analyze the relevant records.

In [None]:
# Apply a Boolean mask to filter California data
mask = top3_sorted['state_name'] == 'California'
ca_df = top3_sorted[mask]

# Display the first five rows of California data
ca_df.head()

# Verify the number of rows in the California data
ca_df.shape

# Count occurrences of each county in California
ca_df['county_name'].value_counts()

# Calculate the mean AQI for Los Angeles county
mean_aqi = ca_df[ca_df['county_name'] == 'Los Angeles']['aqi'].mean()
print(mean_aqi)

### **Grouping Data by State**
To calculate the average AQI per state, I group the data by state_name and apply an aggregation function.

In [None]:
# Group the data by state and calculate the mean AQI
top3.groupby('state_name').agg({'aqi': 'mean'})
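
Grouping also makes it easy to compute several statistics at once. The sketch below uses a small made-up DataFrame with the same `state_name` and `aqi` columns; `sample_df` and `summary` are hypothetical names.

```python
import pandas as pd

# Made-up rows with the same column names as the dataset
sample_df = pd.DataFrame({
    "state_name": ["California", "California", "Texas", "Texas", "Pennsylvania"],
    "aqi": [10, 20, 4, 6, 8],
})

# Passing a list of function names returns one column per statistic
summary = sample_df.groupby("state_name")["aqi"].agg(["mean", "max", "count"])
print(summary)
```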


### **Adding Data from Other States**
To expand the dataset, the next step is to load data from another file containing information for the remaining states, and then combine it with the current dataset.

In [None]:
# Load the second dataset
other_states = pd.read_csv(r'C:\Users\saswa\Documents\GitHub\EPA-Air-Quality-AQI-Analysis\Data\epa_others.csv')

# Display the first five rows of the new dataset
other_states.head(5)

# Concatenate the data from top3 and other_states into a new dataframe
combined_df = pd.concat([top3, other_states])

# Verify the length of the combined dataframe
len(combined_df) == len(top3) + len(other_states)
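
A point to keep in mind with `pd.concat`: by default each input keeps its own row labels, so the combined index can contain duplicates. The sketch below, on two made-up frames, shows the difference that `ignore_index=True` makes. Whether a fresh index is needed depends on how the combined data is used afterwards.

```python
import pandas as pd

# Two made-up frames with the same columns
first = pd.DataFrame({"state_name": ["Texas", "Texas"], "aqi": [4, 6]})
second = pd.DataFrame({"state_name": ["Ohio", "Utah"], "aqi": [3, 9]})

kept_index = pd.concat([first, second])                      # original labels preserved
fresh_index = pd.concat([first, second], ignore_index=True)  # labels renumbered from 0

print(kept_index.index.tolist())   # [0, 1, 0, 1]
print(fresh_index.index.tolist())  # [0, 1, 2, 3]
```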


### **Filtering Washington Data with Moderate AQI**
In this section, I'll apply complex Boolean masking to filter data from Washington where AQI values are considered "Moderate" (between 51 and 100).

In [None]:
# Apply a Boolean mask to filter Washington data with Moderate AQI (51 to 100 inclusive)
wa_df = combined_df[
    (combined_df['state_name'] == 'Washington')
    & (combined_df['aqi'] >= 51)
    & (combined_df['aqi'] <= 100)
]

wa_df
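
A range condition such as "Moderate" needs both a lower and an upper bound. One compact way to express that is `Series.between`, which is inclusive on both ends by default. The sketch below runs on made-up rows; `sample_df` and the mask names are hypothetical.

```python
import pandas as pd

# Made-up rows with the same column names as the dataset
sample_df = pd.DataFrame({
    "state_name": ["Washington", "Washington", "Washington", "Oregon"],
    "aqi": [40, 75, 120, 60],
})

is_washington = sample_df["state_name"] == "Washington"
is_moderate = sample_df["aqi"].between(51, 100)  # 51 <= aqi <= 100

# Combine the two masks with & so that both conditions must hold
moderate_wa = sample_df[is_washington & is_moderate]
print(moderate_wa["aqi"].tolist())  # [75]
```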


# **Conclusion**
By leveraging the power of NumPy and pandas, this project successfully analyzed and summarized a large dataset of AQI readings across various U.S. states. Through careful data manipulation, statistical analysis, and efficient use of vectorized operations, I was able to extract meaningful insights that could inform environmental decision-making. Using dictionaries and sets allowed for quick organization and analysis of data by state and county, while NumPy provided the tools for performing complex statistical calculations efficiently.

This analysis demonstrated the effectiveness of structured data manipulation using Python. By combining dictionary-based storage with NumPy’s computational efficiency and pandas’ data handling capabilities, key insights into AQI trends across U.S. states were extracted. This process is vital for making timely, data-driven decisions to improve environmental policies and protect public health. Pandas, with its intuitive methods for data handling, played a crucial role in simplifying complex data analysis tasks and enhancing decision-making efficiency.


# **Top Three Recommendations**
- Focus on High AQI Areas: By identifying regions with consistently high AQI values, particularly in states like California, targeted interventions could be introduced, such as air quality improvement initiatives and health advisories.
- Enhance Data Reporting: Regularly update and expand the AQI dataset to include more regions and time periods, which would offer better insights into long-term trends and the effectiveness of air quality policies.
- Promote Public Awareness Campaigns: Using the insights from AQI data, create targeted awareness campaigns in high-pollution areas, focusing on the health risks associated with poor air quality and encouraging actions to reduce emissions.

# **Application of Insights**
The insights gained from the AQI data can be directly applied to public health and environmental policy. By identifying regions with high pollution levels, policymakers can prioritize those areas for interventions such as stricter emission controls, improved urban planning, or better pollution monitoring. Additionally, this analysis can inform public health campaigns aimed at educating communities on the risks of high AQI levels and encouraging proactive measures to reduce exposure.



# **Next Steps**
- Further Data Expansion: To deepen the analysis, consider incorporating AQI data from more regions and additional environmental factors such as temperature or traffic density, which could affect air quality.
- Collaborate with Environmental Agencies: Share the findings with local environmental protection agencies to help guide more effective air quality control policies and interventions.
- Develop Predictive Models: Explore the possibility of building predictive models that forecast AQI levels based on various factors (e.g., weather, traffic, industrial activity), helping authorities take preventative actions in advance.

These steps provide a clear path forward for further utilizing data to make impactful decisions regarding air quality management and public health.