# Exploring descriptive statistics

## **Introduction**


This assignment analyzes air quality data for carbon monoxide, a major air pollutant. The data includes information from more than 200 sites, identified by state, county, city, and local site names. The goal is to gather statistics about air quality and then share insights with stakeholders.

In [1]:
# Import relevant Python libraries.

import pandas as pd
import numpy as np

Load the dataset into a DataFrame. The dataset is provided as a .csv file named `c4_epa_air_quality.csv`. It contains a subset of data from the U.S. EPA.

In [3]:
epa_data = pd.read_csv("/content/sample_data/c4_epa_air_quality.csv", index_col = 0)

## **Data exploration**

To understand how the dataset is structured, display the first 10 rows of the data.

In [4]:
# Display first 10 rows of the data.

epa_data.head(10)

Unnamed: 0,date_local,state_name,county_name,city_name,local_site_name,parameter_name,units_of_measure,arithmetic_mean,aqi
0,2018-01-01,Arizona,Maricopa,Buckeye,BUCKEYE,Carbon monoxide,Parts per million,0.473684,7
1,2018-01-01,Ohio,Belmont,Shadyside,Shadyside,Carbon monoxide,Parts per million,0.263158,5
2,2018-01-01,Wyoming,Teton,Not in a city,Yellowstone National Park - Old Faithful Snow ...,Carbon monoxide,Parts per million,0.111111,2
3,2018-01-01,Pennsylvania,Philadelphia,Philadelphia,North East Waste (NEW),Carbon monoxide,Parts per million,0.3,3
4,2018-01-01,Iowa,Polk,Des Moines,CARPENTER,Carbon monoxide,Parts per million,0.215789,3
5,2018-01-01,Hawaii,Honolulu,Not in a city,Kapolei,Carbon monoxide,Parts per million,0.994737,14
6,2018-01-01,Hawaii,Honolulu,Not in a city,Kapolei,Carbon monoxide,Parts per million,0.2,2
7,2018-01-01,Pennsylvania,Erie,Erie,,Carbon monoxide,Parts per million,0.2,2
8,2018-01-01,Hawaii,Honolulu,Honolulu,Honolulu,Carbon monoxide,Parts per million,0.4,5
9,2018-01-01,Colorado,Larimer,Fort Collins,Fort Collins - CSU - S. Mason,Carbon monoxide,Parts per million,0.3,6



The `aqi` column represents the EPA's Air Quality Index (AQI).

Construct a table that contains some descriptive statistics about the data.

In [5]:
# Get descriptive stats

epa_data.describe()

Unnamed: 0,arithmetic_mean,aqi
count,260.0,260.0
mean,0.403169,6.757692
std,0.317902,7.061707
min,0.0,0.0
25%,0.2,2.0
50%,0.276315,5.0
75%,0.516009,9.0
max,1.921053,50.0


The count value for the `aqi` column is 260. This means there are 260 aqi measurements represented in this dataset.

The 25th percentile for the `aqi` column is 2. This means that 25% of the aqi values in the data are at or below 2.

The 75th percentile for the `aqi` column is 9. This means that 75% of the aqi values in the data are at or below 9.
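As a quick check on how to read these percentiles, the sketch below uses a small made-up series (the name `sample_aqi` and its values are hypothetical, not taken from the EPA file) and compares `Series.quantile` with the share of values at or below each result.

```python
import pandas as pd

# Hypothetical AQI-like values for illustration only; not taken from the EPA file.
sample_aqi = pd.Series([0, 1, 2, 2, 3, 5, 5, 7, 9, 9, 12, 50])

# 25th and 75th percentiles (pandas interpolates linearly by default).
q25 = sample_aqi.quantile(0.25)
q75 = sample_aqi.quantile(0.75)

# Share of values at or below each percentile.
share_le_q25 = (sample_aqi <= q25).mean()
share_le_q75 = (sample_aqi <= q75).mean()

print(q25, q75, share_le_q25, share_le_q75)
```

Because of ties and interpolation, the share of values at or below a percentile can be somewhat larger than the nominal 25% or 75%, which is why "at or below" is the safer reading.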

## **Statistical tests**

Get some descriptive statistics about the states in the data.

In [6]:
# Get descriptive stats about the states in the data.

epa_data["state_name"].describe()

Unnamed: 0,state_name
count,260
unique,52
top,California
freq,66




There are 260 state values, and 52 of them are unique. California is the most commonly occurring state in the data, with a frequency of 66. (In other words, 66 entries in the data correspond to aqi measurements taken in California.)
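To see where `top` and `freq` come from, the following sketch pairs `describe()` with `value_counts()` on a short made-up series (the name `states` and its values are hypothetical, not the EPA data).

```python
import pandas as pd

# Hypothetical state labels for illustration only; not taken from the EPA file.
states = pd.Series(
    ["California", "Ohio", "California", "Hawaii", "Ohio", "California"]
)

# Summary for a categorical column: count, unique, top, freq.
summary = states.describe()

# Frequency of each distinct value, sorted from most to least common.
counts = states.value_counts()

top_state = counts.idxmax()   # most frequent label
top_freq = int(counts.max())  # how often it occurs

print(summary)
print(counts)
```

On the real data, `epa_data["state_name"].value_counts()` would presumably list all 52 unique values with their frequencies, not just the most common one.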

## **Results and evaluation**


Compute the mean value from the `aqi` column.

In [7]:
# Compute the mean value from the aqi column.
np.mean(epa_data["aqi"])

6.757692307692308


This is an important measure, as it indicates the average air quality in the data.

The mean value for the `aqi` column is approximately 6.76 (rounding to 2 decimal places here). This means that the average aqi from the data is approximately 6.76.

Compute the median value from the `aqi` column.

In [8]:
# Compute the median value from the aqi column.

np.median(epa_data["aqi"])

5.0

The median value is an important measure for understanding the central location of the data.

The median value for the `aqi` column is 5.0. This means that half of the aqi values in the data are at or below 5.

Identify the minimum value from the `aqi` column.

In [9]:
# Identify the minimum value from the aqi column.
np.min(epa_data["aqi"])

0

The minimum value from the `aqi` column is an important measure, as it indicates the best air quality observed in the data.

The minimum value for the `aqi` column is 0. This means that the smallest aqi value in the data is 0.

Identify the maximum value from the `aqi` column.

In [11]:
# Identify the maximum value from the aqi column

np.max(epa_data["aqi"])

50

The maximum value from the `aqi` column is an important measure, as it indicates the worst air quality observed in the data.

The maximum value for the `aqi` column is 50. This means that the largest aqi value in the data is 50.

Compute the standard deviation for the `aqi` column.

By default, the `numpy` library uses 0 as the Delta Degrees of Freedom (`ddof`), while the `pandas` library uses 1. To get the same standard deviation from either library, set the `ddof` parameter to 1 when calculating the standard deviation with `numpy`.

In [12]:
# Compute the standard deviation for the aqi column.

np.std(epa_data["aqi"], ddof=1)

7.061706678820724

The standard deviation for the `aqi` column is an important measure of how spread out the aqi values are.

The standard deviation for the `aqi` column is approximately 7.06 (rounding to 2 decimal places here).
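The effect of `ddof` can be illustrated with a small sketch. The series `values` below is made up for the example (it is not the EPA data), chosen so that the population standard deviation comes out to a round number.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration only; not taken from the EPA file.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
arr = values.to_numpy()

# NumPy default (ddof=0): divide the sum of squared deviations by n.
pop_std = np.std(arr)

# ddof=1: divide by n - 1 (the sample standard deviation).
sample_std_np = np.std(arr, ddof=1)

# pandas default is ddof=1, so this should match sample_std_np.
sample_std_pd = values.std()

print(pop_std, sample_std_np, sample_std_pd)
```

With `ddof=1` the divisor is smaller, so the result is slightly larger than the `ddof=0` value. The gap shrinks as the number of observations grows.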

## **Discussion**

**Some key takeaways that I learned:**<br>
Functions in the `pandas` and `numpy` libraries can be used to find statistics that describe a dataset. The `describe()` function from `pandas` generates a table of descriptive statistics about numerical or categorical columns. The `mean()`, `median()`, `min()`, `max()`, and `std()` functions from `numpy` are useful for finding individual statistics about numerical data.
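As one possible alternative to calling each function separately, the individual statistics can also be collected in a single call with `Series.agg`. This is a minimal sketch on made-up values (`sample_aqi` is hypothetical), not the approach used in the cells above.

```python
import pandas as pd

# Hypothetical AQI-like values for illustration only; not taken from the EPA file.
sample_aqi = pd.Series([0, 2, 5, 9, 50])

# Compute several named statistics at once; the result is a Series
# indexed by the statistic names. pandas' "std" uses ddof=1.
stats = sample_aqi.agg(["mean", "median", "min", "max", "std"])

print(stats)
```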

**My findings, with relevant points for consideration**
- "AQI values at or below 100 are generally thought of as satisfactory. When AQI values are above 100, air quality is considered to be unhealthy—at first for certain sensitive groups of people, then for everyone as AQI values increase."
- "An AQI of 100 for carbon monoxide corresponds to a level of 9.4 parts per million."

The average AQI value in the data is approximately 6.76, which is considered safe with respect to carbon monoxide. Further, 75% of the AQI values are at or below 9.
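A comparison against the quoted threshold of 100 could be sketched as follows. The series `sample_aqi` and the constant name `SATISFACTORY_LIMIT` are hypothetical, introduced only for this example.

```python
import pandas as pd

# Hypothetical AQI values for illustration only; not taken from the EPA file.
sample_aqi = pd.Series([2, 5, 7, 9, 14, 50])

# Threshold from the AQI guide quoted above: values at or below 100
# are generally considered satisfactory.
SATISFACTORY_LIMIT = 100

# Share of observations at or below the threshold, and count above it.
share_satisfactory = (sample_aqi <= SATISFACTORY_LIMIT).mean()
n_above = int((sample_aqi > SATISFACTORY_LIMIT).sum())

print(share_satisfactory, n_above)
```

Since the maximum `aqi` in the dataset is 50, the same check on `epa_data["aqi"]` should find no values above the threshold.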

**Summary of information provided previously**

- 75% of the AQI values in the data are at or below 9, which is considered good air quality.
- Funding should be allocated for further investigation of the less healthy regions in order to learn how to improve the conditions.


**References**

[Air Quality Index - A Guide to Air Quality and Your Health](https://www.airnow.gov/sites/default/files/2018-04/aqi_brochure_02_14_0.pdf). (2014, February).

[Numpy.Std — NumPy v1.23 Manual](https://numpy.org/doc/stable/reference/generated/numpy.std.html)

US EPA, OAR. (2014, 8 July). [*Air Data: Air Quality Data Collected at Outdoor Monitors Across the US*](https://www.epa.gov/outdoor-air-quality-data).