You are a member of an analytics team for the United States Environmental Protection Agency (EPA). You are assigned to analyze data on air quality with respect to carbon monoxide, a major air pollutant. The data includes information from more than 200 sites, identified by state, county, city, and local site names. Your work is to gather statistics about air quality, then share insights with stakeholders.

In [4]:
# Import relevant Python libraries.

import pandas as pd
import numpy as np

Load the dataset into a DataFrame. The dataset provided is in the form of a .csv file named c4_epa_air_quality.csv. It contains a subset of data from the U.S. EPA.

In [6]:
# IMPORTING THE DATA.

epa_data = pd.read_csv("c4_epa_air_quality.csv", index_col = 0)
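
The `index_col = 0` argument tells `pandas` to use the first column of the file as the row index instead of loading it as a regular column. The sketch below illustrates the difference on a tiny in-memory CSV. The `csv_text` string is a hypothetical three-row excerpt, and it assumes the real file stores its row numbers in an unnamed first column.

```python
import io
import pandas as pd

# Hypothetical excerpt: assumes the file's first column is an unnamed row number.
csv_text = """,date_local,state_name,aqi
0,2018-01-01,Arizona,7
1,2018-01-01,Ohio,5
2,2018-01-01,Wyoming,2
"""

# With index_col=0, the first column becomes the row index.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without it, pandas keeps the column and labels it "Unnamed: 0".
without_index = pd.read_csv(io.StringIO(csv_text))

print(list(with_index.columns))
print(list(without_index.columns))
```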

# Data Exploration

In [10]:
# Display first 10 rows of the data.

epa_data.head(10)


Unnamed: 0,date_local,state_name,county_name,city_name,local_site_name,parameter_name,units_of_measure,arithmetic_mean,aqi
0,2018-01-01,Arizona,Maricopa,Buckeye,BUCKEYE,Carbon monoxide,Parts per million,0.473684,7
1,2018-01-01,Ohio,Belmont,Shadyside,Shadyside,Carbon monoxide,Parts per million,0.263158,5
2,2018-01-01,Wyoming,Teton,Not in a city,Yellowstone National Park - Old Faithful Snow ...,Carbon monoxide,Parts per million,0.111111,2
3,2018-01-01,Pennsylvania,Philadelphia,Philadelphia,North East Waste (NEW),Carbon monoxide,Parts per million,0.3,3
4,2018-01-01,Iowa,Polk,Des Moines,CARPENTER,Carbon monoxide,Parts per million,0.215789,3
5,2018-01-01,Hawaii,Honolulu,Not in a city,Kapolei,Carbon monoxide,Parts per million,0.994737,14
6,2018-01-01,Hawaii,Honolulu,Not in a city,Kapolei,Carbon monoxide,Parts per million,0.2,2
7,2018-01-01,Pennsylvania,Erie,Erie,,Carbon monoxide,Parts per million,0.2,2
8,2018-01-01,Hawaii,Honolulu,Honolulu,Honolulu,Carbon monoxide,Parts per million,0.4,5
9,2018-01-01,Colorado,Larimer,Fort Collins,Fort Collins - CSU - S. Mason,Carbon monoxide,Parts per million,0.3,6


Question: What does the aqi column represent?

The aqi column represents the EPA's Air Quality Index (AQI).

### Get a table that contains some descriptive statistics about the data.

In [18]:
# Get descriptive stats.

epa_data.describe()

Unnamed: 0,arithmetic_mean,aqi
count,260.0,260.0
mean,0.403169,6.757692
std,0.317902,7.061707
min,0.0,0.0
25%,0.2,2.0
50%,0.276315,5.0
75%,0.516009,9.0
max,1.921053,50.0


Question: Based on the table of descriptive statistics, what do you notice about the count value for the aqi column?

The count value for the aqi column is 260. This means there are 260 aqi measurements represented in this dataset.
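
Because `count()` only counts non-missing entries, comparing it with the number of rows is one way to check for missing data. The sketch below uses a hypothetical `sample` DataFrame built from two columns of the 10 rows displayed earlier, not the full dataset, and it assumes the blank `local_site_name` in row 7 is a missing value.

```python
import pandas as pd

# Hypothetical stand-in for epa_data: two columns from the 10 rows shown by head(10).
# Row 7 has no local_site_name, so it is stored here as a missing value (None).
sample = pd.DataFrame({
    "local_site_name": [
        "BUCKEYE", "Shadyside",
        "Yellowstone National Park - Old Faithful Snow ...",  # truncated in the display
        "North East Waste (NEW)", "CARPENTER", "Kapolei", "Kapolei",
        None, "Honolulu", "Fort Collins - CSU - S. Mason",
    ],
    "aqi": [7, 5, 2, 3, 3, 14, 2, 2, 5, 6],
})

n_rows = len(sample)            # total rows, including rows with missing values
non_null = sample.count()       # non-missing entries per column
missing = sample.isna().sum()   # missing entries per column

print(n_rows)
print(non_null)
print(missing)
```

If the `aqi` count in the full data equals the number of rows, no `aqi` values are missing.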

Question: What do you notice about the 25th percentile for the aqi column? This is an important measure for understanding where the aqi values lie.

The 25th percentile for the aqi column is 2. This means that about 25% of the aqi values in the data are at or below 2.

Question: What do you notice about the 75th percentile for the aqi column? This is another important measure for understanding where the aqi values lie.

The 75th percentile for the aqi column is 9. This means that about 75% of the aqi values in the data are at or below 9.
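
Percentiles can also be computed directly, and with integer-valued data that has many ties it matters whether a share is counted as "below" or "at or below" a value. The sketch below is an illustration on a hypothetical `aqi` Series holding only the 10 values displayed earlier, so its percentiles differ from those of the full dataset.

```python
import pandas as pd

# Hypothetical stand-in for epa_data["aqi"]: the 10 aqi values shown by head(10).
aqi = pd.Series([7, 5, 2, 3, 3, 14, 2, 2, 5, 6], name="aqi")

q25 = aqi.quantile(0.25)   # 25th percentile (linear interpolation by default)
q75 = aqi.quantile(0.75)   # 75th percentile

# With tied integer values, "below" and "at or below" can give different shares.
share_below_2 = (aqi < 2).mean()
share_at_or_below_2 = (aqi <= 2).mean()

print(q25, q75)
print(share_below_2, share_at_or_below_2)
```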

# Statistical Test

### Next, get some descriptive statistics about the states in the data.

In [25]:
# Get descriptive stats about the states in the data.

epa_data["state_name"].describe()

count            260
unique            52
top       California
freq              66
Name: state_name, dtype: object

Question: What do you notice while reviewing the descriptive statistics about the states in the data? 

There are 260 state values, and 52 of them are unique. California is the most commonly occurring state in the data, with a frequency of 66. (In other words, 66 entries in the data correspond to aqi measurements taken in California.)
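
The `unique`, `top`, and `freq` entries can be reproduced one at a time, which helps when the full frequency table is of interest. The sketch below uses a hypothetical `states` Series containing only the 10 state names displayed earlier, so its results differ from those of the full dataset.

```python
import pandas as pd

# Hypothetical stand-in for epa_data["state_name"]: the 10 values shown by head(10).
states = pd.Series(
    ["Arizona", "Ohio", "Wyoming", "Pennsylvania", "Iowa",
     "Hawaii", "Hawaii", "Pennsylvania", "Hawaii", "Colorado"],
    name="state_name",
)

n_unique = states.nunique()      # corresponds to "unique" in describe()
counts = states.value_counts()   # rows per state, most frequent first
top_state = counts.idxmax()      # corresponds to "top"
top_freq = counts.max()          # corresponds to "freq"

print(n_unique, top_state, top_freq)
print(counts)
```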

# Results and Evaluation

### Now, compute the mean value from the aqi column.

In [33]:
# Compute the mean value from the aqi column.

np.mean(epa_data["aqi"])

6.757692307692308

Question: What do you notice about the mean value from the aqi column?

The mean value for the aqi column is approximately 6.76 (rounding to 2 decimal places here). This means that the average aqi from the data is approximately 6.76.

### Next, compute the median value from the aqi column.

In [38]:
# Compute the median value from the aqi column.

np.median(epa_data["aqi"])

5.0


Question: What do you notice about the median value from the aqi column? 

The median value for the aqi column is 5.0. This means that about half of the aqi values in the data are at or below 5.
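
Comparing the median with the mean can hint at the shape of the data. A mean that sits above the median (here about 6.76 versus 5.0) often suggests that a few large values pull the average up. The sketch below shows the comparison on a hypothetical array holding only the 10 aqi values displayed earlier, not the full dataset.

```python
import numpy as np

# Hypothetical stand-in for epa_data["aqi"]: the 10 aqi values shown by head(10).
aqi = np.array([7, 5, 2, 3, 3, 14, 2, 2, 5, 6])

mean_aqi = np.mean(aqi)
median_aqi = np.median(aqi)

# A positive gap means the mean is larger than the median.
gap = mean_aqi - median_aqi

print(mean_aqi, median_aqi, gap)
```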

### Next, identify the minimum value from the aqi column.

In [43]:
# Identify the minimum value from the aqi column.
np.min(epa_data["aqi"])

0

Question: What do you notice about the minimum value from the aqi column?

The minimum value for the aqi column is 0. This means that the smallest aqi value in the data is 0.

### Now, identify the maximum value from the aqi column.

In [47]:
# Identify the maximum value from the aqi column.

np.max(epa_data["aqi"])

50

**Question:** What do you notice about the maximum value from the `aqi` column?
This is an important measure, as it tells you which value in the data corresponds to the worst air quality observed in the data.

The maximum value for the `aqi` column is 50. This means that the largest aqi value in the data is 50.
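
Knowing the maximum is more useful when it can be traced back to a place. The sketch below shows one way to look up the row that holds the largest value with `idxmax()`. The `sample` DataFrame is a hypothetical stand-in built from three columns of the 10 rows displayed earlier, so its maximum is not the dataset maximum of 50.

```python
import pandas as pd

# Hypothetical stand-in for epa_data: three columns from the 10 rows shown by head(10).
sample = pd.DataFrame({
    "state_name": ["Arizona", "Ohio", "Wyoming", "Pennsylvania", "Iowa",
                   "Hawaii", "Hawaii", "Pennsylvania", "Hawaii", "Colorado"],
    "county_name": ["Maricopa", "Belmont", "Teton", "Philadelphia", "Polk",
                    "Honolulu", "Honolulu", "Erie", "Honolulu", "Larimer"],
    "aqi": [7, 5, 2, 3, 3, 14, 2, 2, 5, 6],
})

# idxmax() returns the index label of the first row with the largest aqi.
worst_row = sample.loc[sample["aqi"].idxmax()]

print(worst_row)
```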

### Now, compute the standard deviation for the aqi column.

In [51]:
# Compute the standard deviation for the aqi column.

np.std(epa_data["aqi"], ddof=1)

7.0617066788207215

**Question:** What do you notice about the standard deviation for the `aqi` column? 
This is an important measure of how spread out the aqi values are.


The standard deviation for the aqi column is approximately 7.06 (rounding to 2 decimal places here). This is a measure of how spread out the aqi values are in the data.
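
The `ddof=1` argument matters here. By default `np.std()` divides by the number of values (the population standard deviation), while `ddof=1` divides by one less than that (the sample standard deviation), which is also what `describe()` reports. The sketch below compares the two on a hypothetical Series holding only the 10 aqi values displayed earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for epa_data["aqi"]: the 10 aqi values shown by head(10).
aqi = pd.Series([7, 5, 2, 3, 3, 14, 2, 2, 5, 6], name="aqi")
values = aqi.to_numpy()

sample_std = np.std(values, ddof=1)   # divides by n - 1
population_std = np.std(values)       # NumPy default is ddof=0, divides by n
pandas_std = aqi.std()                # pandas default is ddof=1

print(sample_std, population_std, pandas_std)
```

The sample version is always the larger of the two for a given set of values, and the gap shrinks as the number of values grows.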

## **Considerations**

**What are some key takeaways that I learned during this lab?**

Functions in the `pandas` and `numpy` libraries can be used to find statistics that describe a dataset. The `describe()` method in `pandas` generates a table of descriptive statistics for numerical or categorical columns. The `mean()`, `median()`, `min()`, `max()`, and `std()` functions from `numpy` are useful for finding individual statistics about numerical data.
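
When only a handful of statistics are needed, they can also be requested together rather than one call at a time. The sketch below shows one possible approach using the `agg()` method in `pandas`, applied to a hypothetical Series holding only the 10 aqi values displayed earlier, not the full dataset.

```python
import pandas as pd

# Hypothetical stand-in for epa_data["aqi"]: the 10 aqi values shown by head(10).
aqi = pd.Series([7, 5, 2, 3, 3, 14, 2, 2, 5, 6], name="aqi")

# agg() with a list of names returns a Series with one entry per statistic.
# Note that "std" here uses the pandas default of ddof=1.
summary = aqi.agg(["mean", "median", "min", "max", "std"])

print(summary)
```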

**How would I present my findings to others? Consider the following relevant points noted by AirNow.gov as I respond:**

- "AQI values at or below 100 are generally thought of as satisfactory. When AQI values are above 100, air quality is considered to be unhealthy—at first for certain sensitive groups of people, then for everyone as AQI values increase."
- "An AQI of 100 for carbon monoxide corresponds to a level of 9.4 parts per million."

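The average AQI value in the data is approximately 6.76, which is well within the range that AirNow.gov describes as satisfactory (at or below 100) with respect to carbon monoxide. Further, 75% of the AQI values are at or below 9, and the largest value in the data is 50.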
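
The AirNow.gov reference points can be turned into a simple check. The sketch below counts how many readings exceed an AQI of 100 and compares the largest concentration with 9.4 parts per million. The constant names are hypothetical, and the `sample` DataFrame holds only two columns from the 10 rows displayed earlier, not the full dataset.

```python
import pandas as pd

# Thresholds taken from the AirNow.gov statements quoted above (names are hypothetical).
AQI_SATISFACTORY_LIMIT = 100
CO_PPM_AT_AQI_100 = 9.4

# Hypothetical stand-in for epa_data: two columns from the 10 rows shown by head(10).
sample = pd.DataFrame({
    "arithmetic_mean": [0.473684, 0.263158, 0.111111, 0.3, 0.215789,
                        0.994737, 0.2, 0.2, 0.4, 0.3],
    "aqi": [7, 5, 2, 3, 3, 14, 2, 2, 5, 6],
})

n_above_limit = int((sample["aqi"] > AQI_SATISFACTORY_LIMIT).sum())
share_satisfactory = (sample["aqi"] <= AQI_SATISFACTORY_LIMIT).mean()
max_ppm = sample["arithmetic_mean"].max()

print(n_above_limit, share_satisfactory)
print(max_ppm, max_ppm < CO_PPM_AT_AQI_100)
```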

**What summary would you provide to stakeholders?**

- 75% of the AQI values in the data are at or below 9, which is considered good air quality.
- Funding should be allocated for further investigation of the less healthy regions in order to learn how to improve the conditions.
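
To decide which regions might deserve further investigation, one possible starting point is to rank states by their average AQI. The sketch below does this with `groupby()` on a hypothetical `sample` DataFrame built from two columns of the 10 rows displayed earlier. With so few rows per state the ranking is only illustrative and says nothing about the full dataset.

```python
import pandas as pd

# Hypothetical stand-in for epa_data: two columns from the 10 rows shown by head(10).
sample = pd.DataFrame({
    "state_name": ["Arizona", "Ohio", "Wyoming", "Pennsylvania", "Iowa",
                   "Hawaii", "Hawaii", "Pennsylvania", "Hawaii", "Colorado"],
    "aqi": [7, 5, 2, 3, 3, 14, 2, 2, 5, 6],
})

# Average aqi per state, with the highest averages listed first.
mean_aqi_by_state = (
    sample.groupby("state_name")["aqi"]
    .mean()
    .sort_values(ascending=False)
)

print(mean_aqi_by_state)
```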
