# Mini Project 5-1 Explore Descriptive Statistics with Python

## **Introduction**

Data professionals often use descriptive statistics to understand the data they are working with and to provide collaborators with a summary of the relative location of values in the data, as well as information about its spread.

For this activity, you are a member of an analytics team for the United States Environmental Protection Agency (EPA). You are assigned to analyze data on air quality with respect to carbon monoxide, a major air pollutant. The data includes information from more than 200 sites, identified by state, county, city, and local site names. You will use Python functions to gather statistics about air quality, then share insights with stakeholders.

## **Step 1: Imports** 


Import the relevant Python libraries `pandas` and `numpy`.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

The dataset provided is in the form of a .csv file named `c4_epa_air_quality.csv`. It contains a subset of data from the U.S. EPA. As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this Project. Please continue with this activity by completing the following instructions.

In [None]:
import seaborn as sns

# Load the dataset provided in the environment. Setting index_col=0 reads the
# first column as the index, which avoids an extra "Unnamed: 0" column.
df = pd.read_csv("c4_epa_air_quality.csv", index_col=0)

# Display column types and non-null counts (info() prints its own output)
df.info()

# Display the first few rows
print(df.head())

# Count missing values per column, then summarize the numeric columns
print(df.isnull().sum())
print(df.describe())

# Histogram of the `aqi` column
sns.histplot(df['aqi'], bins=30, kde=True)
plt.title("Distribution of aqi")
plt.show()

# Optional regression template (not required for this activity). It is left
# commented out because the placeholder column names do not exist in the
# dataset; replace them with actual column names before running.
# import statsmodels.api as sm
# X = sm.add_constant(df[['independent_var1', 'independent_var2']])
# y = df['dependent_var']
# model = sm.OLS(y, X).fit()
# print(model.summary())

<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

  There is a function in the `pandas` library that allows you to read in data from a .csv file and load it into a DataFrame. 

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  Use the `read_csv` function from the `pandas` library. The `index_col` parameter can be set to `0` to read in the first column as an index (and to avoid `"Unnamed: 0"` appearing as a column in the resulting DataFrame).

</details>
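
The effect of `index_col=0` described in Hint 2 can be seen in a minimal sketch. The CSV text below is a made-up stand-in for a file that was saved together with its index; it is not the EPA dataset, and the column names are illustrative assumptions.

```python
import io
import pandas as pd

# Hypothetical CSV text saved with its index (note the leading comma in the header).
csv_text = ",state,aqi\n0,Arizona,18\n1,Ohio,9\n2,Texas,5\n"

# Without index_col, the saved index is read as an extra "Unnamed: 0" column.
default_df = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the index instead.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_df.columns.tolist())
print(indexed_df.columns.tolist())
```

Only the second DataFrame has exactly the two data columns, which keeps later calls such as `describe()` free of a meaningless index column.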

## **Step 2: Data exploration** 

To understand how the dataset is structured, display the first 10 rows of the data.

In [None]:
# Display the first 10 rows of the air quality data.
df.head(10)

<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

  There is a function in the `pandas` library that allows you to get a specific number of rows from the top of a DataFrame. 

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  Use the `head()` function from the `pandas` library.

</details>

**Question:** What does the `aqi` column represent?

A: The `aqi` column likely represents the Air Quality Index (AQI), a standardized measure used by the U.S. Environmental Protection Agency (EPA) to indicate air pollution levels and their potential health effects.

Understanding AQI:
- AQI values range from 0 to 500, with higher values indicating worse air quality.
- It is calculated based on pollutants such as PM2.5, PM10, CO, SO2, NO2, and O3.


**Question:** In what units are the aqi values expressed?

A: The AQI values are unitless because they represent an index rather than a direct concentration measurement.

Now, get a table that contains some descriptive statistics about the data.

In [None]:
# Generate descriptive statistics for numerical columns
stats_table = df.describe()

# Display the table
print(stats_table)
print(stats_table.T)


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

  There is a function in the `pandas` library that allows you to generate a table of basic descriptive statistics about the numeric columns in a DataFrame.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  Use the `describe()` function from the `pandas` library.

</details>

**Question:** Based on the table of descriptive statistics, what do you notice about the count value for the `aqi` column?

A: The count value represents the number of non-null (non-missing) observations in the `aqi` column.
- If the count matches the total number of rows in the dataset, there are no missing AQI values.
- If the count is lower than the total number of rows, some AQI values are missing (NaN) and may need handling (e.g., imputation or removal).
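
The comparison between the count and the number of rows can be sketched as follows. The sample values are hypothetical and are not taken from the EPA data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing AQI reading (not the EPA data).
sample = pd.DataFrame({"aqi": [5.0, 12.0, np.nan, 7.0]})

total_rows = len(sample)              # counts every row
non_null = sample["aqi"].count()      # counts only non-missing values
missing = sample["aqi"].isna().sum()  # counts missing values directly

print(f"rows={total_rows}, count={non_null}, missing={missing}")
```

When `count` equals the number of rows, the column has no missing values; any gap between the two equals the number of NaN entries.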


**Question:** What do you notice about the 25th percentile for the `aqi` column?

This is an important measure for understanding where the aqi values lie. 

A: Interpretation:
- The 25th percentile (Q1) is the value below which 25% of AQI observations fall.
- This helps us understand the lower quartile of air quality conditions in the dataset.
- A low Q1 value suggests that a significant portion of the data represents relatively good air quality.
- A high Q1 value indicates that even the lower range of AQI values may be concerning.

**Question:** What do you notice about the 75th percentile for the `aqi` column?

This is another important measure for understanding where the aqi values lie. 

A: Interpretation:
- The 75th percentile (Q3) is the value below which 75% of AQI observations fall.
- This represents the upper quartile, meaning the top 25% of AQI values are above this threshold.
- A high Q3 value suggests that a significant portion of the dataset includes poor air quality readings.
- Comparing Q1 (25th percentile) and Q3 (75th percentile) helps assess the spread of AQI values and possible outliers.
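
One common way to compare Q1 and Q3 is the interquartile range (IQR). The sketch below uses hypothetical readings, and the 1.5 × IQR cutoff is a widely used rule of thumb rather than something this activity prescribes.

```python
import pandas as pd

# Hypothetical AQI readings (not the EPA data); 50 is an unusually high value.
aqi = pd.Series([2, 3, 5, 6, 7, 9, 11, 50])

q1 = aqi.quantile(0.25)
q3 = aqi.quantile(0.75)
iqr = q3 - q1  # spread of the middle 50% of the values

# Rule of thumb: flag values more than 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
flagged = aqi[(aqi < lower_fence) | (aqi > upper_fence)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print("Flagged values:", flagged.tolist())
```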

## **Step 3: Statistical tests** 

Next, get some descriptive statistics about the states in the data.

In [None]:
import seaborn as sns

# Note: this cell assumes the state column is named 'state'. Adjust the name
# if your copy of the dataset labels it differently (for example, 'state_name').

# Count occurrences of each state
state_counts = df['state'].value_counts()
print(state_counts)

# Group by state and calculate descriptive statistics for AQI
state_stats = df.groupby('state')['aqi'].describe()
print(state_stats)

# Compute the mean AQI per state once and reuse it
state_means = df.groupby('state')['aqi'].mean()

# Find the states with the highest and lowest average AQI
highest_aqi_state = state_means.idxmax()
highest_aqi_value = state_means.max()
lowest_aqi_state = state_means.idxmin()
lowest_aqi_value = state_means.min()

print(f"State with highest average AQI: {highest_aqi_state} ({highest_aqi_value})")
print(f"State with lowest average AQI: {lowest_aqi_state} ({lowest_aqi_value})")

# Plot AQI distribution by state (top 10 states with most data points)
top_states = state_counts.index[:10]
plt.figure(figsize=(12,6))
sns.boxplot(x='state', y='aqi', data=df[df['state'].isin(top_states)])
plt.xticks(rotation=45)
plt.title("AQI Distribution in Top 10 States")
plt.show()

<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

  There is a function in the `pandas` library that allows you to generate basic descriptive statistics about a DataFrame or a column you are interested in.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  Use the `describe()` function from the `pandas` library. Note that this function can be used:
- on a DataFrame (to find descriptive statistics about the numeric columns)
- directly on a column containing categorical data (to find pertinent descriptive statistics)

</details>
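
The two uses of `describe()` mentioned in Hint 2 return different summaries. A minimal sketch follows, using a hypothetical sample with assumed column names rather than the EPA data.

```python
import pandas as pd

# Hypothetical sample (not the EPA data); the column names are illustrative.
sample = pd.DataFrame({
    "state": ["Ohio", "Texas", "Ohio", "Arizona", "Ohio"],
    "aqi": [9, 5, 12, 18, 7],
})

# On a DataFrame: count, mean, std, min, quartiles, and max of numeric columns.
numeric_summary = sample.describe()

# On a categorical column: count, unique, top (most frequent value), and freq.
categorical_summary = sample["state"].describe()

print(numeric_summary)
print(categorical_summary)
```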

**Question:** What do you notice while reviewing the descriptive statistics about the states in the data? 

Note: Sometimes you have to individually calculate statistics. To review that approach, use the `numpy` library to calculate each of the main statistics in the preceding table for the `aqi` column.

A: Observations while reviewing the statistics:

Variability between states:
- Some states have significantly higher AQI averages, indicating poor air quality.
- Other states may consistently show lower AQI values, representing cleaner air.

Outliers in AQI:
- If a state's maximum AQI is much higher than its 75th percentile (Q3), it suggests occasional extreme pollution events.
- A high standard deviation means AQI fluctuates widely in that state.

Comparing median and mean AQI:
- If the mean AQI is greater than the median AQI, the distribution is right-skewed (some extreme pollution values pull the average up).
- If they are close, the AQI distribution in that state is more balanced.

High and low AQI states:
- The state with the highest mean AQI likely experiences consistent pollution issues.
- The state with the lowest mean AQI has better overall air quality.
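
The per-state comparisons in this answer can be sketched with a grouped `describe()`. The data below is a hypothetical two-state sample, and `state` and `aqi` are assumed column names.

```python
import pandas as pd

# Hypothetical sample (not the EPA data); "state" and "aqi" are assumed names.
sample = pd.DataFrame({
    "state": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "aqi": [4, 5, 6, 45, 8, 9, 10, 11],
})

# One row of descriptive statistics per state.
by_state = sample.groupby("state")["aqi"].describe()

# A large gap between the max and Q3 hints at occasional extreme readings.
by_state["max_minus_q3"] = by_state["max"] - by_state["75%"]

# A mean well above the median (the 50% column) hints at right skew.
by_state["mean_minus_median"] = by_state["mean"] - by_state["50%"]

print(by_state[["mean", "50%", "std", "max_minus_q3", "mean_minus_median"]])
```

In this sample, state A stands out on both derived columns because of its single reading of 45, while state B shows almost no gap.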

## **Step 4. Results and evaluation**

Now, compute the mean value from the `aqi` column.

In [None]:
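# Compute the mean of the `aqi` column only. Calling mean() on the whole
# DataFrame would also try to average the non-numeric columns.
aqi_mean = np.mean(df['aqi'])
print("Mean AQI:", aqi_mean)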

<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

  There is a function in the `numpy` library that allows you to get the mean value from an array or a Series of values.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  Use the `mean()` function from the `numpy` library.

</details>

**Question:** What do you notice about the mean value from the `aqi` column?

This is an important measure, as it tells you what the average air quality is based on the data.

A: The mean value of the aqi column represents the average air quality across all the observations in the dataset. Here's how you might interpret it:

If the mean is relatively low, it suggests that, on average, the air quality in the dataset is good. This could mean the majority of AQI values fall within the "Good" or "Moderate" ranges.

If the mean is higher, it indicates poorer air quality overall, as the average value may be skewed by higher AQI readings. In this case, the data might contain more instances of Unhealthy or Very Unhealthy air quality.

Skewed distribution:
- If the mean is higher than the median, it indicates that the data distribution is right-skewed, meaning there are some extreme high AQI values pulling the average up.
- If the mean is close to the median, the distribution of AQI values is more balanced.

Next, compute the median value from the `aqi` column.

In [None]:
def calculate_median_aqi(df, column_name='aqi'):
    """
    Calculates the median value of the specified column in a pandas DataFrame.

    Args:
        df: pandas DataFrame containing the AQI data.
        column_name: Name of the column containing AQI values (default is 'aqi').

    Returns:
        The median AQI value, or None if the column is not found or empty.
    """
    if column_name not in df.columns:
        print(f"Error: Column '{column_name}' not found in DataFrame.")
        return None

    aqi_column = df[column_name]
    if aqi_column.empty:
        print(f"Error: Column '{column_name}' is empty.")
        return None

    # Series.median() skips missing values by default
    return aqi_column.median()

# Compute the median on the dataset loaded in Step 1. The DataFrame `df` is
# deliberately not replaced with sample data here, so later cells keep using
# the EPA data.
median_value = calculate_median_aqi(df)

if median_value is not None:
    print(f"The median AQI value is: {median_value}")

<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

  There is a function in the `numpy` library that allows you to get the median value from an array or a series of values.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  Use the `median()` function from the `numpy` library.

</details>

**Question:** What do you notice about the median value from the `aqi` column?

This is an important measure for understanding the central location of the data.

A: The median value of the aqi column represents the middle value of the dataset when sorted. It divides the data into two equal halves, so 50% of the AQI values are below the median, and 50% are above it. Here's how you can interpret the median:

If the median is relatively low, it indicates that most of the AQI values fall on the lower end of the scale (e.g., "Good" air quality).

If the median is high, it suggests that a majority of the air quality readings in the dataset are poor, with values falling into categories like "Unhealthy" or "Very Unhealthy."

Comparison with the mean:
- If the median is lower than the mean, this suggests the presence of some higher AQI values (possibly outliers) that are skewing the average up, making the data right-skewed.
- If the median and mean are similar, the AQI distribution is likely more balanced, and there are fewer extreme values.
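
The different sensitivity of the mean and the median to extreme values can be illustrated with a small sketch. The readings are hypothetical and are not taken from the EPA data.

```python
import numpy as np

# Hypothetical AQI readings (not the EPA data).
typical = np.array([3, 4, 5, 6, 7])
with_extreme = np.append(typical, 47)  # add one unusually high reading

mean_before, median_before = np.mean(typical), np.median(typical)
mean_after, median_after = np.mean(with_extreme), np.median(with_extreme)

print(f"Before: mean={mean_before}, median={median_before}")
print(f"After:  mean={mean_after}, median={median_after}")
```

The single extreme reading moves the mean from 5 to 12, while the median only shifts from 5 to 5.5, which is why a mean well above the median points to right skew.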

Next, identify the minimum value from the `aqi` column.

In [None]:
import pandas as pd

# Assuming your DataFrame is named 'df'
min_aqi = df['aqi'].min()

print(min_aqi)

<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

  There is a function in the `numpy` library that allows you to get the minimum value from an array or a Series of values.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  Use the `min()` function from the `numpy` library.

</details>

**Question:** What do you notice about the minimum value from the `aqi` column?

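This is an important measure, as it tells you the best air quality observed in the data.

A: The minimum value for the `aqi` column is 0. This means that the smallest AQI value in the data is 0, the lowest point on the scale. Because AQI is a unitless index, this value is not a concentration in parts per million.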


Now, identify the maximum value from the `aqi` column.

In [None]:
# Identify the maximum value of the `aqi` column in the dataset loaded in
# Step 1 (the DataFrame `df` is not replaced with sample data).
max_aqi = df['aqi'].max()

print(max_aqi)


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

  There is a function in the `numpy` library that allows you to get the maximum value from an array or a Series of values.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  Use the `max()` function from the `numpy` library.

</details>

**Question:** What do you notice about the maximum value from the `aqi` column?

This is an important measure, as it tells you which value in the data corresponds to the worst air quality observed in the data.

A: The maximum value from the aqi column represents the worst air quality observed in the dataset. Here's what you might notice about the maximum AQI value:

- Extreme pollution: A high maximum AQI indicates the presence of extreme pollution events that significantly degrade air quality. The higher the value, the more severe the air quality issue, potentially impacting the health of the general population.
- Comparison to health standards: If the maximum AQI is above 300, the air quality falls into the Hazardous category, where everyone may experience serious health effects, and emergency measures may be required. If the maximum value is between 101 and 150, it falls into the Unhealthy for Sensitive Groups category, where vulnerable individuals (like children, the elderly, or people with respiratory conditions) may be affected while the general public is less likely to be.
- Outliers: A large difference between the maximum and the 25th or 75th percentiles may indicate that there are a few outliers (extremely high AQI values) that determine the maximum but are not representative of the overall air quality.
- Health implications: An extremely high maximum AQI suggests that some locations in the dataset experience severe air pollution events, which might require urgent action (e.g., issuing health warnings, restricting outdoor activities).
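
To relate individual AQI values to the health categories mentioned above, one option is to bin them with `pd.cut`. This is a sketch under stated assumptions: the category boundaries follow the AQI scale described in the AirNow.gov brochure listed in the references, the readings are hypothetical, and the variable names are illustrative.

```python
import pandas as pd

# Upper edges of each AQI category; the lower edge of -1 lets 0 fall into "Good".
bins = [-1, 50, 100, 150, 200, 300, 500]
labels = [
    "Good",
    "Moderate",
    "Unhealthy for Sensitive Groups",
    "Unhealthy",
    "Very Unhealthy",
    "Hazardous",
]

# Hypothetical AQI readings (not the EPA data).
aqi = pd.Series([0, 50, 51, 100, 101, 175, 250, 320])

# Each interval is closed on the right, so 50 is "Good" and 51 is "Moderate".
categories = pd.cut(aqi, bins=bins, labels=labels)

print(pd.DataFrame({"aqi": aqi, "category": categories}))
```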


Now, compute the standard deviation for the `aqi` column.

By default, the `numpy` library uses 0 as the Delta Degrees of Freedom, while the `pandas` library uses 1. To get the same value for standard deviation using either library, set the `ddof` parameter to 1 when calculating standard deviation.

In [None]:
# Both calculations use the `aqi` column of the dataset loaded in Step 1
# (the DataFrame `df` is not replaced with sample data).

# Calculate standard deviation using pandas with ddof=1
std_dev_pandas = df['aqi'].std(ddof=1)
print(f"Standard deviation using pandas: {std_dev_pandas}")

# Calculate standard deviation using numpy with ddof=1
std_dev_numpy = np.std(df['aqi'], ddof=1)
print(f"Standard deviation using numpy: {std_dev_numpy}")

<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

  There is a function in the `numpy` library that allows you to get the standard deviation from an array or a series of values.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  Use the `std()` function from the `numpy` library. Make sure to specify the `ddof` parameter as 1. To read more about this function, refer to its documentation in the references section of this lab.

</details>
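
The effect of `ddof` can be seen directly: the sum of squared deviations from the mean is divided by `n - ddof` before the square root is taken. The sketch below uses hypothetical readings, not the EPA data, to compare the two library defaults.

```python
import numpy as np
import pandas as pd

# Hypothetical AQI readings (not the EPA data).
aqi = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
values = aqi.to_numpy()

population_std = np.std(values)      # NumPy default ddof=0: divides by n
sample_std = np.std(values, ddof=1)  # divides by n - 1
pandas_std = aqi.std()               # pandas default ddof=1

print(f"numpy (ddof=0): {population_std}")
print(f"numpy (ddof=1): {sample_std}")
print(f"pandas default: {pandas_std}")
```

With `ddof=1` the NumPy result matches the pandas default, and it is slightly larger than the `ddof=0` result because the divisor is smaller.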

**Question:** What do you notice about the standard deviation for the `aqi` column? 

This is an important measure of how spread out the aqi values are.

A: The standard deviation of the aqi column measures the spread or variability of the AQI values from the mean. Here's how you might interpret it:

A low standard deviation means that the AQI values are closely clustered around the mean, indicating that most of the air quality readings are similar and there is less variability in the data.

A high standard deviation means that the AQI values are spread out over a wide range, indicating significant variability in the air quality. This suggests that the dataset includes areas with both good and poor air quality, or possibly some extreme values (outliers) affecting the distribution.

Understanding the context:
- If the standard deviation is large, this indicates that the air quality fluctuates a lot across different locations or time periods.
- A low standard deviation indicates more consistent air quality across the dataset.

## **Considerations**


**What are some key takeaways that you learned during this Project?**

A: There is a lot I still need to learn about this topic.

**How would you present your findings from this Project to others? Consider the following relevant points noted by AirNow.gov as you respond:**
- "AQI values at or below 100 are generally thought of as satisfactory. When AQI values are above 100, air quality is considered to be unhealthy—at first for certain sensitive groups of people, then for everyone as AQI values increase."
- "An AQI of 100 for carbon monoxide corresponds to a level of 9 parts per million."

A:

**1. Overview of the AQI Scale and Its Importance**

Begin by introducing the Air Quality Index (AQI) and its significance in measuring air pollution levels. Explain that:
- The AQI ranges from 0 to 500, with values above 100 indicating potential health risks.
- AQI values at or below 100 are generally satisfactory, while values above 100 signify unhealthy air quality, initially affecting sensitive groups and then the general population as values rise.

**2. Key Findings from the Data Analysis**

General air quality trend: Based on the analysis of AQI values in the dataset, highlight the mean, median, and percentile values. For instance:
- If the mean AQI is above 100, it suggests that a significant portion of the data corresponds to unhealthy air quality levels.
- If the median AQI is close to or above 100, it indicates that the central tendency of the data is leaning toward unhealthy conditions.

State-level insights: Present findings on AQI distribution by state. You could emphasize:
- States with the highest average AQI (indicating poor air quality).
- States with the lowest average AQI, showing relatively better air quality.
- Variation within states, using percentiles or standard deviation to highlight states with fluctuating air quality.

**3. Health Implications of AQI Values**

Explain the health impacts of AQI levels based on AirNow.gov guidelines:
- AQI ≤ 100: Air quality is considered satisfactory, with minimal risk for the general population. However, sensitive groups (e.g., children, elderly, or people with respiratory conditions) may experience minor effects at the higher end of this range.
- AQI 101 to 150: Air quality is considered unhealthy for sensitive groups (e.g., individuals with asthma, heart disease, or children). At this point, it is important for these groups to limit exposure.
- AQI > 150: Air quality is unhealthy for the general population. Everyone may begin to experience health effects, and it is recommended to limit outdoor activities.

**4. Explanation of Carbon Monoxide and AQI**

Mention that, according to AirNow.gov, an AQI of 100 for carbon monoxide corresponds to 9 parts per million (ppm). This can be useful for interpreting AQI values in the context of specific pollutants, like carbon monoxide, that might be responsible for poor air quality in certain areas.

**5. Visual Aids to Enhance Understanding**

- Box plots and histograms: These can illustrate the distribution of AQI values across different states and highlight trends (e.g., more extreme pollution in certain areas).
- Bar graphs: To compare the mean AQI values between different states or regions.
- Health impact zones: A simple chart or diagram showing AQI ranges and their associated health impacts (e.g., color-coded zones for Good, Moderate, Unhealthy, etc.).

**6. Recommendations Based on Findings**

Finally, provide actionable recommendations based on the findings:
- Encourage efforts in states or regions with high AQI to improve air quality through stricter regulations on pollution.
- For areas with consistently unhealthy air quality, recommend public health measures such as air quality alerts and encouraging sensitive groups to limit outdoor exposure.
- Promote data-driven policies that focus on reducing air pollution from specific sources like industrial emissions or traffic.
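
When presenting findings against the AirNow.gov threshold of 100, a simple supporting figure is the share of readings above that value. The sketch below uses hypothetical readings rather than the EPA data.

```python
import pandas as pd

# Hypothetical AQI readings (not the EPA data).
aqi = pd.Series([3, 8, 12, 47, 95, 100, 104, 160])

above_100 = aqi > 100           # boolean mask: True where AQI exceeds 100
share_above = above_100.mean()  # the mean of a boolean mask is the fraction of True values

print(f"{above_100.sum()} of {len(aqi)} readings ({share_above:.0%}) exceed an AQI of 100")
```

A value of exactly 100 is not counted, consistent with the statement that values at or below 100 are generally thought of as satisfactory.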

**What summary would you provide to readers? Use the same information provided previously from AirNow.gov as you respond.**

A:

**References**

[Air Quality Index - A Guide to Air Quality and Your Health](https://www.airnow.gov/sites/default/files/2018-04/aqi_brochure_02_14_0.pdf). (2014, February).

[Numpy.Std — NumPy v1.23 Manual](https://numpy.org/doc/stable/reference/generated/numpy.std.html)

US EPA, OAR. (2014, 8 July). [*Air Data: Air Quality Data Collected at Outdoor Monitors Across the US*](https://www.epa.gov/outdoor-air-quality-data).