# Activity: Explore descriptive statistics

## **Introduction**

Data professionals often use descriptive statistics to understand the data they are working with and to provide collaborators with a summary of the relative location of values in the data, as well as information about its spread.

For this activity, you are a member of an analytics team for the United States Environmental Protection Agency (EPA). You are assigned to analyze data on air quality with respect to carbon monoxide, a major air pollutant. The data includes information from more than 200 sites, identified by state, county, city, and local site names. You will use Python functions to gather statistics about air quality, then share insights with stakeholders.

## **Step 1: Imports** 


Import the relevant Python libraries `pandas` and `numpy`.

In [12]:
# Import relevant Python libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


The dataset provided is in the form of a .csv file named `c4_epa_air_quality.csv`. It contains a subset of data from the U.S. EPA. As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [13]:
# RUN THIS CELL TO IMPORT YOUR DATA.

### YOUR CODE HERE
epa_data = pd.read_csv(r"C:\Users\saswa\Documents\GitHub\Python-For-Data-Analysis\Course-4\Module-1\Data\c4_epa_air_quality.csv", index_col = 0)

## **Step 2: Data exploration** 

To understand how the dataset is structured, display the first 10 rows of the data.

In [14]:
# Display first 10 rows of the data.

epa_data.head(10)

| | date_local | state_name | county_name | city_name | local_site_name | parameter_name | units_of_measure | arithmetic_mean | aqi |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-01-01 | Arizona | Maricopa | Buckeye | BUCKEYE | Carbon monoxide | Parts per million | 0.473684 | 7 |
| 1 | 2018-01-01 | Ohio | Belmont | Shadyside | Shadyside | Carbon monoxide | Parts per million | 0.263158 | 5 |
| 2 | 2018-01-01 | Wyoming | Teton | Not in a city | Yellowstone National Park - Old Faithful Snow ... | Carbon monoxide | Parts per million | 0.111111 | 2 |
| 3 | 2018-01-01 | Pennsylvania | Philadelphia | Philadelphia | North East Waste (NEW) | Carbon monoxide | Parts per million | 0.3 | 3 |
| 4 | 2018-01-01 | Iowa | Polk | Des Moines | CARPENTER | Carbon monoxide | Parts per million | 0.215789 | 3 |
| 5 | 2018-01-01 | Hawaii | Honolulu | Not in a city | Kapolei | Carbon monoxide | Parts per million | 0.994737 | 14 |
| 6 | 2018-01-01 | Hawaii | Honolulu | Not in a city | Kapolei | Carbon monoxide | Parts per million | 0.2 | 2 |
| 7 | 2018-01-01 | Pennsylvania | Erie | Erie | NaN | Carbon monoxide | Parts per million | 0.2 | 2 |
| 8 | 2018-01-01 | Hawaii | Honolulu | Honolulu | Honolulu | Carbon monoxide | Parts per million | 0.4 | 5 |
| 9 | 2018-01-01 | Colorado | Larimer | Fort Collins | Fort Collins - CSU - S. Mason | Carbon monoxide | Parts per million | 0.3 | 6 |


**Question:** What does the `aqi` column represent?

The aqi column represents the Air Quality Index (AQI), which is used to communicate how polluted the air is and what associated health effects may be a concern for the general population. The AQI is based on measurements of pollutants like particulate matter, ozone, carbon monoxide, sulfur dioxide, and nitrogen dioxide.

In [15]:
epa_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 260 entries, 0 to 259
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   date_local        260 non-null    object 
 1   state_name        260 non-null    object 
 2   county_name       260 non-null    object 
 3   city_name         260 non-null    object 
 4   local_site_name   257 non-null    object 
 5   parameter_name    260 non-null    object 
 6   units_of_measure  260 non-null    object 
 7   arithmetic_mean   260 non-null    float64
 8   aqi               260 non-null    int64  
dtypes: float64(1), int64(1), object(7)
memory usage: 20.3+ KB
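
The `info()` output reports 257 non-null values for `local_site_name` out of 260 rows, so three site names are missing. As a minimal sketch of how such gaps could be located, the example below builds a small illustrative frame from three rows of the preview (it is not the full dataset) and counts missing values per column. The variable names are hypothetical.

```python
import pandas as pd

# Illustrative frame: three rows copied from the 10-row preview,
# including the Erie, Pennsylvania row whose site name is missing.
sample = pd.DataFrame(
    {
        "state_name": ["Arizona", "Ohio", "Pennsylvania"],
        "local_site_name": ["BUCKEYE", "Shadyside", None],
        "aqi": [7, 5, 2],
    }
)

# Count missing values in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Select the rows where the site name is missing.
rows_missing_site = sample[sample["local_site_name"].isna()]
print(rows_missing_site)
```

Running the same two calls on `epa_data` would show which columns and rows account for the missing site names.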


Now, get a table that contains some descriptive statistics about the data.

In [16]:
# Get descriptive stats.

epa_data.describe()

| | arithmetic_mean | aqi |
|---|---|---|
| count | 260.0 | 260.0 |
| mean | 0.403169 | 6.757692 |
| std | 0.317902 | 7.061707 |
| min | 0.0 | 0.0 |
| 25% | 0.2 | 2.0 |
| 50% | 0.276315 | 5.0 |
| 75% | 0.516009 | 9.0 |
| max | 1.921053 | 50.0 |


**Question:** Based on the table of descriptive statistics, what do you notice about the count value for the `aqi` column?

The count value for the aqi column is 260, which matches the 260 entries reported by info(). Every row therefore has an AQI value, and no AQI measurements are missing. Each row represents one carbon monoxide reading from a monitoring site.

**Question:** What do you notice about the 25th percentile for the `aqi` column?

This is an important measure for understanding where the aqi values lie. 

The 25th percentile for the aqi column is 2, meaning that about 25% of the AQI values in the dataset are at or below 2. Because lower AQI values correspond to cleaner air, roughly a quarter of the readings reflect very good air quality.

**Question:** What do you notice about the 75th percentile for the `aqi` column?

This is another important measure for understanding where the aqi values lie. 

The 75th percentile for the aqi column is 9, meaning that about 75% of the AQI values in the dataset fall at or below 9. Most readings are therefore concentrated at the low end of the scale, far below the maximum of 50, which points to low pollution levels for most of the data.
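
The percentiles reported by `describe()` can also be computed directly. The sketch below uses only the `aqi` values from the 10-row preview as illustrative data, so its results differ from the full-dataset values of 2 and 9. The variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Illustrative sample: the aqi values from the 10-row preview.
aqi_head = pd.Series([7, 5, 2, 3, 3, 14, 2, 2, 5, 6], name="aqi")

# 25th and 75th percentiles (numpy uses linear interpolation by default).
q1, q3 = np.percentile(aqi_head, [25, 75])

# The interquartile range measures the spread of the middle 50% of values.
iqr = q3 - q1

# pandas exposes the same calculation through Series.quantile().
q1_pandas = aqi_head.quantile(0.25)

print(q1, q3, iqr)
```

On this small sample, `q1` is 2.25 and `q3` is 5.75, so the interquartile range is 3.5.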

## **Step 3: Statistical tests** 

Next, get some descriptive statistics about the states in the data.

In [17]:
# Get descriptive stats about the states in the data.

epa_data['state_name'].describe()

count            260
unique            52
top       California
freq              66
Name: state_name, dtype: object

**Question:** What do you notice while reviewing the descriptive statistics about the states in the data? 

Note: Sometimes you have to calculate statistics individually. To review that approach, use the `numpy` library to calculate each of the main statistics in the preceding table for the `aqi` column.

The state_name column has 260 non-missing entries and 52 unique values. Since there are only 50 U.S. states, the column presumably also includes non-state jurisdictions. California is the most frequent value, with 66 occurrences, or roughly 25% of all rows (66 of 260), so California is heavily represented in this dataset compared to other states.
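
`describe()` on a text column reports only the single most frequent value. As a minimal sketch of how to see the full breakdown, the example below applies `value_counts()` to the state names from the 10-row preview (illustrative data only, so Hawaii rather than California comes out on top). The variable names are hypothetical.

```python
import pandas as pd

# Illustrative sample: the state_name values from the 10-row preview.
states_head = pd.Series(
    [
        "Arizona", "Ohio", "Wyoming", "Pennsylvania", "Iowa",
        "Hawaii", "Hawaii", "Pennsylvania", "Hawaii", "Colorado",
    ],
    name="state_name",
)

# Number of rows per state, sorted from most to least frequent.
counts = states_head.value_counts()

# The same breakdown expressed as a share of all rows.
shares = states_head.value_counts(normalize=True)

# Number of distinct values, equivalent to "unique" in describe().
n_unique = states_head.nunique()

print(counts)
print(shares)
print(n_unique)
```

Applied to `epa_data["state_name"]`, the same calls would show how the remaining rows are distributed across the other 51 values.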

## **Step 4: Results and evaluation**

Now, compute the mean value from the `aqi` column.

In [18]:
# Compute the mean value from the aqi column.

np.mean(epa_data["aqi"])

np.float64(6.757692307692308)

**Question:** What do you notice about the mean value from the `aqi` column?

This is an important measure, as it tells you what the average air quality is based on the data.

The mean value for the aqi column is approximately 6.76. This gives a general idea of overall air quality in the dataset: because lower AQI values indicate cleaner air, readings below 6.76 reflect better-than-average air quality and readings above it reflect poorer-than-average air quality.

Next, compute the median value from the `aqi` column.

In [19]:
# Compute the median value from the aqi column.

np.median(epa_data["aqi"])

np.float64(5.0)

**Question:** What do you notice about the median value from the `aqi` column?

This is an important measure for understanding the central location of the data.

The median value for the aqi column is 5, meaning that half of the AQI values are at or below 5 and half are at or above it. The median is lower than the mean (approximately 6.76), which suggests that a few relatively high AQI values pull the mean upward.
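
Comparing the mean with the median is a quick, informal check for skew. The sketch below illustrates this on the `aqi` values from the 10-row preview (illustrative data only, so the numbers differ from the full-dataset mean and median). The variable names are hypothetical.

```python
import numpy as np

# Illustrative sample: the aqi values from the 10-row preview.
aqi_head = np.array([7, 5, 2, 3, 3, 14, 2, 2, 5, 6])

mean_aqi = np.mean(aqi_head)
median_aqi = np.median(aqi_head)

# A mean above the median hints that a few large values
# (here, the reading of 14) pull the average upward.
mean_exceeds_median = mean_aqi > median_aqi

print(mean_aqi, median_aqi, mean_exceeds_median)
```

Here the mean is 4.9 and the median is 4.0, the same pattern seen in the full dataset.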

Next, identify the minimum value from the `aqi` column.

In [20]:
# Identify the minimum value from the aqi column.

np.min(epa_data["aqi"])

np.int64(0)

**Question:** What do you notice about the minimum value from the `aqi` column?

This is an important measure, as it tells you the best air quality observed in the data.

The minimum value for the aqi column is 0. This indicates that the best air quality observed in the dataset corresponds to an Air Quality Index of 0, representing the lowest possible measurement.

Now, identify the maximum value from the `aqi` column.

In [21]:
# Identify the maximum value from the aqi column.

np.max(epa_data["aqi"])

np.int64(50)

**Question:** What do you notice about the maximum value from the `aqi` column?

This is an important measure, as it tells you the worst air quality observed in the data.

The maximum value for the aqi column is 50. This indicates that the worst air quality observed in the dataset corresponds to an Air Quality Index of 50, representing the highest recorded measurement in the data.
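
Beyond the extreme values themselves, it can help to know which rows produced them. The sketch below uses `idxmax()` and `idxmin()` on a small frame built from the 10-row preview (illustrative data only, so the maximum here is 14 rather than 50). The variable names are hypothetical.

```python
import pandas as pd

# Illustrative frame: three columns from the 10-row preview.
preview = pd.DataFrame(
    {
        "state_name": [
            "Arizona", "Ohio", "Wyoming", "Pennsylvania", "Iowa",
            "Hawaii", "Hawaii", "Pennsylvania", "Hawaii", "Colorado",
        ],
        "city_name": [
            "Buckeye", "Shadyside", "Not in a city", "Philadelphia",
            "Des Moines", "Not in a city", "Not in a city", "Erie",
            "Honolulu", "Fort Collins",
        ],
        "aqi": [7, 5, 2, 3, 3, 14, 2, 2, 5, 6],
    }
)

# Row with the highest AQI (worst air quality in this sample).
worst_row = preview.loc[preview["aqi"].idxmax()]

# Row with the lowest AQI; idxmin() returns the first match when tied.
best_row = preview.loc[preview["aqi"].idxmin()]

# The range is the distance between the two extremes.
aqi_range = preview["aqi"].max() - preview["aqi"].min()

print(worst_row)
print(best_row)
print(aqi_range)
```

The same pattern on `epa_data` would identify the site that recorded the AQI of 50.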

Now, compute the standard deviation for the `aqi` column.

By default, the `numpy` library uses 0 as the Delta Degrees of Freedom (`ddof`), while the `pandas` library uses 1. To get the same value for standard deviation using either library, set the `ddof` parameter to 1 when calculating standard deviation with `numpy`.

In [22]:
# Compute the standard deviation for the aqi column.

np.std(epa_data["aqi"], ddof=1)

np.float64(7.061706678820724)

**Question:** What do you notice about the standard deviation for the `aqi` column? 

This is an important measure of how spread out the aqi values are.

The standard deviation for the aqi column is approximately 7.06 (rounded to 2 decimal places). This indicates the extent to which the aqi values deviate from the mean, reflecting the variability or spread of air quality index values in the dataset.
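
The effect of `ddof` can be seen by computing the standard deviation both ways. The sketch below uses the `aqi` values from the 10-row preview as illustrative data, so the results are not the full-dataset value of 7.06. The variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Illustrative sample: the aqi values from the 10-row preview.
aqi_head = pd.Series([7, 5, 2, 3, 3, 14, 2, 2, 5, 6], name="aqi")
values = aqi_head.to_numpy()

# numpy default (ddof=0): divides the squared deviations by n.
population_std = np.std(values)

# ddof=1: divides by n - 1, the usual sample standard deviation.
sample_std = np.std(values, ddof=1)

# pandas uses ddof=1 by default, so it matches sample_std.
pandas_std = aqi_head.std()

print(population_std, sample_std, pandas_std)
```

The `ddof=0` result is always slightly smaller than the `ddof=1` result, and the gap shrinks as the number of observations grows.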

## **Considerations**


**What are some key takeaways that you learned during this lab?**

During this lab, I learned that the pandas and numpy libraries provide powerful functions to compute descriptive statistics for datasets. The describe() function in pandas gives a comprehensive summary of numerical or categorical columns, while functions like mean(), median(), min(), max(), and std() in numpy allow for precise calculation of individual statistics.

From the data, I observed that:

- The arithmetic_mean and aqi columns are right-skewed: in both, the mean is greater than the median.
- The range of arithmetic_mean is approximately 1.92, while the range of aqi is 50.
- The median values are approximately 0.276 (arithmetic_mean) and 5 (aqi).
- The standard deviations are approximately 0.32 (arithmetic_mean) and 7.06 (aqi).
- The mean values are approximately 0.40 (arithmetic_mean) and 6.76 (aqi).

These insights suggest that, relative to its mean, arithmetic_mean has less variability than aqi: its standard deviation is smaller than its mean, whereas the standard deviation of aqi slightly exceeds its mean.
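
One way to compare variability across columns measured on different scales is the coefficient of variation, the standard deviation divided by the mean. This measure is not part of the original lab. It is added here as a sketch, using the mean and standard deviation values from the `describe()` output. The variable names are hypothetical.

```python
# Mean and standard deviation copied from the describe() output.
mean_co, std_co = 0.403169, 0.317902      # arithmetic_mean column
mean_aqi, std_aqi = 6.757692, 7.061707    # aqi column

# Coefficient of variation: spread expressed as a fraction of the mean.
cv_co = std_co / mean_co
cv_aqi = std_aqi / mean_aqi

print(round(cv_co, 2), round(cv_aqi, 2))
```

The ratio is about 0.79 for arithmetic_mean and about 1.04 for aqi, so aqi varies more relative to its mean.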

**How would you present your findings from this lab to others? Consider the following relevant points noted by AirNow.gov as you respond:**
- "AQI values at or below 100 are generally thought of as satisfactory. When AQI values are above 100, air quality is considered to be unhealthy—at first for certain sensitive groups of people, then for everyone as AQI values increase."
- "An AQI of 100 for carbon monoxide corresponds to a level of 9.4 parts per million."

I would present my findings as follows:

The average AQI value in this dataset is approximately 6.76, which falls well within the range that AirNow.gov describes as satisfactory. Additionally, 75% of the AQI values are at or below 9, indicating that the large majority of the data represents satisfactory air quality.

It is important to note that the AQI values in this dataset range only up to 50, which is significantly below the threshold of 100. This implies that even for sensitive groups, the air quality represented in this dataset is considered safe. Furthermore, scaling the AirNow.gov figure linearly, an AQI of 50 would correspond to roughly 4.7 parts per million (ppm) of carbon monoxide. This is only an approximation, because the AQI is not necessarily linear in concentration, but it is well below the 9.4 ppm associated with an AQI of 100. The highest mean concentration recorded in the dataset is about 1.92 ppm.

These findings suggest that the air quality in the sampled data is generally satisfactory and poses minimal risk.
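
The comparison with the AirNow.gov figures can be expressed as a short check. The sketch below uses the `aqi` and `arithmetic_mean` values from the 10-row preview as illustrative data. The thresholds come from the AirNow.gov quotes above, and the constant and variable names are hypothetical.

```python
import pandas as pd

# Thresholds taken from the AirNow.gov quotes (names are hypothetical).
AQI_SATISFACTORY_LIMIT = 100   # AQI values at or below 100 are satisfactory
CO_PPM_AT_AQI_100 = 9.4        # CO level corresponding to an AQI of 100

# Illustrative sample: values from the 10-row preview.
aqi_head = pd.Series([7, 5, 2, 3, 3, 14, 2, 2, 5, 6])
co_head = pd.Series(
    [0.473684, 0.263158, 0.111111, 0.3, 0.215789,
     0.994737, 0.2, 0.2, 0.4, 0.3]
)

# Share of readings at or below the satisfactory limit.
share_satisfactory = (aqi_head <= AQI_SATISFACTORY_LIMIT).mean()

# Highest observed CO concentration as a fraction of the 9.4 ppm level.
max_co_fraction = co_head.max() / CO_PPM_AT_AQI_100

print(share_satisfactory, round(max_co_fraction, 3))
```

In this sample every reading is satisfactory, and the highest concentration is only about 11% of the 9.4 ppm level.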

**What summary would you provide to stakeholders? Use the same information provided previously from AirNow.gov as you respond.**

To provide stakeholders with a clear summary, I would focus on key descriptive statistics and their implications:

- AQI Range and Distribution: The AQI values in this dataset range from 0 to 50, with 75% of the values at or below 9. This indicates that the air quality in the data is predominantly within the "good" range, as defined by AirNow.gov.
- Central Tendency: The mean AQI is approximately 6.76, which reflects satisfactory air quality. Additionally, the median AQI is 5, showing that most values are clustered toward the lower end of the range.
- Variability: The standard deviation of the AQI is approximately 7.06, indicating moderate variability in air quality measurements across the dataset.
- Interpretation for Stakeholders: All AQI values in the data are well below 100, meaning the air quality is safe even for sensitive groups. For carbon monoxide, an AQI of 100 corresponds to 9.4 ppm. Scaling that figure linearly, an AQI of 50 would be roughly 4.7 ppm (an approximation only), and the highest mean concentration recorded in the dataset is about 1.92 ppm.
  
Actionable Insight: While this dataset indicates safe air quality, funding could be allocated to monitor regions showing higher AQI values (closer to 50). This would ensure proactive efforts in maintaining and improving air quality, particularly for sensitive groups.

**References**

[Air Quality Index - A Guide to Air Quality and Your Health](https://www.airnow.gov/sites/default/files/2018-04/aqi_brochure_02_14_0.pdf). (2014, February).

[Numpy.Std — NumPy v1.23 Manual](https://numpy.org/doc/stable/reference/generated/numpy.std.html)

US EPA, OAR. (2014, 8 July). [*Air Data: Air Quality Data Collected at Outdoor Monitors Across the US*](https://www.epa.gov/outdoor-air-quality-data).