# Mini Project 5-2 Explore Probability Distributions

## **Introduction**

Determining which type of probability distribution best fits the data, calculating z-scores, and detecting outliers are essential skills in data work. These capabilities enable data professionals to understand how their data is distributed and identify data points that need further examination.

In this activity, you are a member of an analytics team for the United States Environmental Protection Agency (EPA). The data includes information about more than 200 sites, identified by state, county, city, and local site names. One of your main goals is to determine which regions need support to make air quality improvements. Given that carbon monoxide is a major air pollutant, you will investigate data from the Air Quality Index (AQI) with respect to carbon monoxide.

## **Step 1: Imports** 

Import relevant libraries, packages, and modules. For this Project, you will need `numpy`, `pandas`, `matplotlib.pyplot`, `statsmodels.api`, and `scipy`.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import scipy

A subset of data was taken from the air quality data collected by the EPA, then transformed to suit the purposes of this lab. This subset is a .csv file named `modified_c4_epa_air_quality.csv`. As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv('modified_c4_epa_air_quality.csv')

# Display basic information about the dataset
print("Dataset Info:")
df.info()

# Display the first few rows
print("\nFirst 5 Rows:")
print(df.head())

# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

# Summary statistics
print("\nSummary Statistics:")
print(df.describe())

# Count unique values in each column
print("\nUnique Values:")
print(df.nunique())


## **Step 2: Data exploration** 

Display the first 10 rows of the data to get a sense of how the data is structured.

In [None]:
df.head(10)

The `aqi_log` column contains AQI readings that were log-transformed to suit the objectives of this lab. The details of using a logarithm to obtain a bell-shaped distribution are outside the scope of this course, but the transformation makes the normal distribution easier to see.
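
As a rough illustration of why a log transform can help, the sketch below applies a natural log to synthetic right-skewed values and compares skewness before and after. The data is made up (it is not the EPA data), and the natural log is an assumption, since the lab does not state which transformation produced `aqi_log`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for raw AQI readings (not EPA data).
rng = np.random.default_rng(seed=0)
raw_aqi = pd.Series(rng.lognormal(mean=1.5, sigma=0.7, size=5_000))

# Natural log is assumed here; the lab does not say which base or offset was used.
log_aqi = np.log(raw_aqi)

# Skewness near 0 indicates a roughly symmetric distribution.
skew_before = raw_aqi.skew()
skew_after = log_aqi.skew()
print(f"Skewness before log transform: {skew_before:.2f}")
print(f"Skewness after log transform:  {skew_after:.2f}")
```

Real AQI readings can be 0, where a plain logarithm is undefined, so an actual pipeline would need to handle that case.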

To better understand the quantity of data you are working with, display the number of rows and the number of columns.

In [None]:
# Number of rows and columns, returned as a (rows, columns) tuple
df.shape


Now, you want to find out whether `aqi_log` fits a specific type of probability distribution. Create a histogram to visualize the distribution of `aqi_log`. Then, based on its shape, visually determine if it resembles a particular distribution.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Plot the log-transformed AQI from the dataset loaded in Step 1
plt.figure(figsize=(10, 6))
sns.histplot(df['aqi_log'], kde=True)
plt.title('Distribution of AQI (Log Transformed)')
plt.xlabel('AQI (Log)')
plt.ylabel('Frequency')
plt.show()

**Question:** What do you observe about the shape of the distribution from the histogram? 

A: The histogram is roughly bell-shaped, so `aqi_log` appears to be approximately normally distributed.

## **Step 3: Statistical tests**

Use the empirical rule to observe the data, then test and verify that it is normally distributed.


As you have learned, the empirical rule states that, for every normal distribution:
- About 68% of the data fall within 1 standard deviation of the mean
- About 95% of the data fall within 2 standard deviations of the mean
- About 99.7% of the data fall within 3 standard deviations of the mean
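
The three percentages follow from the normal distribution itself: the probability of landing within k standard deviations of the mean is erf(k / sqrt(2)). A minimal sketch using only the standard library (the helper name `normal_coverage` is made up for illustration):

```python
import math

def normal_coverage(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"Within {k} standard deviation(s): {normal_coverage(k) * 100:.2f}%")
```

These theoretical values are the benchmarks that the observed `aqi_log` percentages are compared against in the rest of this step.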


First, define two variables to store the mean and standard deviation, respectively, for `aqi_log`. Creating these variables will help you easily access these measures as you continue with the calculations involved in applying the empirical rule. 

In [None]:
# Mean of the log-transformed AQI column (uses the df loaded in Step 1)
mean_aqi_log = df['aqi_log'].mean()

# Print out the mean
print(f"The mean of the 'aqi_log' column is: {mean_aqi_log}")



In [None]:
# Standard deviation of the log-transformed AQI column (uses the df loaded in Step 1)
std_dev_aqi_log = df['aqi_log'].std()

# Print out the standard deviation
print(f"The standard deviation of the 'aqi_log' column is: {std_dev_aqi_log}")


Now, check the first part of the empirical rule: whether 68% of the `aqi_log` data falls within 1 standard deviation of the mean.

To compute the actual percentage of the data that satisfies this criterion, define the lower limit (1 standard deviation below the mean) and the upper limit (1 standard deviation above the mean). This creates a range against which you can check whether each value falls within it.

In [None]:
# Mean and standard deviation of the 'aqi_log' column (df was loaded in Step 1)
mean_aqi_log = df['aqi_log'].mean()
std_dev_aqi_log = df['aqi_log'].std()

# Lower and upper limits: 1 standard deviation below and above the mean
lower_limit_1_std = mean_aqi_log - std_dev_aqi_log
upper_limit_1_std = mean_aqi_log + std_dev_aqi_log

# Percentage of rows whose aqi_log value falls within that range
within_1_std = ((df['aqi_log'] >= lower_limit_1_std) & (df['aqi_log'] <= upper_limit_1_std)).mean() * 100

# Print the percentage of data within 1 standard deviation
print(f"Percentage of aqi_log values within 1 standard deviation: {within_1_std:.2f}%")






Now, consider the second part of the empirical rule: whether 95% of the `aqi_log` data falls within 2 standard deviations of the mean.

To compute the actual percentage of the data that satisfies this criterion, define the lower limit (2 standard deviations below the mean) and the upper limit (2 standard deviations above the mean). This creates a range against which you can check whether each value falls within it.

In [None]:
# Mean and standard deviation of the 'aqi_log' column (df was loaded in Step 1)
mean_aqi_log = df['aqi_log'].mean()
std_dev_aqi_log = df['aqi_log'].std()

# Lower and upper limits: 2 standard deviations below and above the mean
lower_limit_2_std = mean_aqi_log - 2 * std_dev_aqi_log
upper_limit_2_std = mean_aqi_log + 2 * std_dev_aqi_log

# Percentage of rows whose aqi_log value falls within that range
within_2_std = ((df['aqi_log'] >= lower_limit_2_std) & (df['aqi_log'] <= upper_limit_2_std)).mean() * 100

# Print the percentage of data within 2 standard deviations
print(f"Percentage of aqi_log values within 2 standard deviations: {within_2_std:.2f}%")





Now, consider the third part of the empirical rule: whether 99.7% of the `aqi_log` data falls within 3 standard deviations of the mean.

To compute the actual percentage of the data that satisfies this criterion, define the lower limit (3 standard deviations below the mean) and the upper limit (3 standard deviations above the mean). This creates a range against which you can check whether each value falls within it.

In [None]:
import pandas as pd 

def check_empirical_rule(data, column_name):
    """
    Calculates the percentage of data within 3 standard deviations of the mean for a given column in a DataFrame. 
    
    Args:
        data (pd.DataFrame): The DataFrame containing the data.
        column_name (str): The name of the column to analyze. 
    
    Returns:
        float: The percentage of data within 3 standard deviations of the mean. 
    """
    
    mean = data[column_name].mean()
    std_dev = data[column_name].std()
    
    lower_limit = mean - (3 * std_dev)
    upper_limit = mean + (3 * std_dev)
    
    within_range = ((data[column_name] >= lower_limit) & (data[column_name] <= upper_limit)).sum()
    
    return (within_range / len(data)) * 100 

# Apply the check to the 'aqi_log' column of the DataFrame loaded in Step 1
percentage_within_3_std = check_empirical_rule(df, "aqi_log")
print(f"Percentage of 'aqi_log' data within 3 standard deviations: {percentage_within_3_std:.2f}%")


In [None]:
import numpy as np

def within_three_std(data):
    """Print the percentage of values within 3 standard deviations of the mean."""
    mean = np.mean(data)
    std_dev = np.std(data, ddof=1)  # sample standard deviation, matching pandas' .std()
    lower_bound = mean - 3 * std_dev
    upper_bound = mean + 3 * std_dev

    within_range = np.logical_and(data >= lower_bound, data <= upper_bound)
    percentage = np.sum(within_range) / len(data) * 100

    print(f"Percentage of data within 3 standard deviations: {percentage:.2f}%")

# Cross-check using the actual 'aqi_log' values as a NumPy array
within_three_std(df['aqi_log'].to_numpy())
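
The three empirical-rule checks in this step share the same logic, so they can be folded into a single helper that takes the number of standard deviations as a parameter. The sketch below is one possible way to do that; `percent_within` is a hypothetical name, and the synthetic sample only stands in for `df['aqi_log']`.

```python
import numpy as np
import pandas as pd

def percent_within(values: pd.Series, k: float) -> float:
    """Percentage of values within k sample standard deviations of the mean."""
    mean = values.mean()
    std = values.std()  # pandas default: sample standard deviation (ddof=1)
    in_range = values.between(mean - k * std, mean + k * std)  # inclusive bounds
    return in_range.mean() * 100

# Synthetic normal data as a stand-in; with the lab data, pass df['aqi_log'] instead.
rng = np.random.default_rng(seed=42)
sample = pd.Series(rng.normal(loc=3.0, scale=1.0, size=10_000))

for k in (1, 2, 3):
    print(f"Within {k} standard deviation(s): {percent_within(sample, k):.2f}%")
```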


## **Step 4: Results and evaluation** 

**Question:** What results did you attain by applying the empirical rule? 

A: Applying the empirical rule to the `aqi_log` data involves three checks:

- **First standard deviation (68%):** The empirical rule suggests that about 68% of the data should fall within 1 standard deviation of the mean. For `aqi_log`, compute the range from 1 standard deviation below to 1 standard deviation above the mean, then calculate the percentage of values that fall within this range.
- **Second standard deviation (95%):** The rule further suggests that about 95% of the data should fall within 2 standard deviations of the mean. Compute the range for 2 standard deviations below and above the mean, then determine what share of the `aqi_log` values falls within it, expecting approximately 95%.
- **Third standard deviation (99.7%):** Finally, the rule suggests that about 99.7% of the data should fall within 3 standard deviations of the mean. This range spans 3 standard deviations below and above the mean, and nearly all of the data is expected to fall within it.

If the three computed percentages are close to 68%, 95%, and 99.7%, the data is consistent with a normal distribution.

**Question:** How would you use z-score to find outliers? 

A: Steps to identify outliers using z-scores:

- Calculate the mean and standard deviation of the `aqi_log` column (or whichever column you are interested in).
- Calculate the z-score for each data point: subtract the mean, then divide by the standard deviation.
- Identify outliers as the data points whose z-scores exceed a chosen threshold. A common choice is a z-score greater than +3 or less than -3, because about 99.7% of the values in a normal distribution lie within 3 standard deviations of the mean.
- Adjust the threshold depending on the context (for example, use 2 or 2.5 to flag less extreme values).
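
The steps above can be sketched on a small made-up series. The values and the names `readings` and `threshold` are illustrative only and do not come from the EPA data.

```python
import pandas as pd

# Made-up readings: eleven ordinary values and one extreme value at the end.
readings = pd.Series([2.1, 2.3, 1.9, 2.0, 2.2, 2.4, 1.8, 2.1, 2.0, 2.2, 2.3, 9.0])

# z-score: distance from the mean in units of the (population) standard deviation.
z_scores = (readings - readings.mean()) / readings.std(ddof=0)

# Flag values more than `threshold` standard deviations away from the mean.
threshold = 3
outliers = readings[z_scores.abs() > threshold]
print(outliers)
```

Only the last value is flagged here. Note that with `ddof=0`, no z-score in a sample of n points can exceed sqrt(n - 1) in absolute value, so a threshold of 3 can never flag anything in a sample of 10 or fewer points.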


Compute the z-score for every `aqi_log` value. Then, add a column named `z_score` in the data to store those results. 

In [None]:
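# z-score of each aqi_log value; ddof=0 uses the population standard deviation
df['z_score'] = (df['aqi_log'] - df['aqi_log'].mean()) / df['aqi_log'].std(ddof=0)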



Identify the parts of the data where `aqi_log` is above or below 3 standard deviations of the mean.

In [None]:
# Mean and standard deviation of the 'aqi_log' column (df was loaded in Step 1)
mean_aqi_log = df['aqi_log'].mean()
std_dev_aqi_log = df['aqi_log'].std()

# Lower and upper limits: 3 standard deviations below and above the mean
lower_limit_3_std = mean_aqi_log - 3 * std_dev_aqi_log
upper_limit_3_std = mean_aqi_log + 3 * std_dev_aqi_log

# Rows whose aqi_log value lies outside that range
outliers = df[(df['aqi_log'] < lower_limit_3_std) | (df['aqi_log'] > upper_limit_3_std)]

# Display the outliers
print(outliers)



**Question:** What do you observe about potential outliers based on the calculations?


A:

- **Values above the upper limit (more than 3 standard deviations above the mean):** These data points are extremely high AQI values, indicating potentially severe pollution or unusual events in those regions or times. They might be related to rare environmental conditions, such as forest fires, industrial accidents, or other significant air quality events.
- **Values below the lower limit (more than 3 standard deviations below the mean):** These data points represent unusually low AQI values, suggesting exceptionally good air quality. Such values could indicate periods of very clean air, but they may also be artifacts of measurement errors or data collection issues.
- **Frequency of outliers:** In a normal distribution, the empirical rule implies that very few data points lie beyond 3 standard deviations, so the number of outliers should be small. If the data instead shows many extreme values, it may be skewed or non-normal, which would require further analysis to understand the source of these outliers.

**Question:** Why is outlier detection an important part of this project? 

A: Outlier detection is an important part of this project for several reasons, especially in the context of analyzing air quality index (AQI) data:

- **Accuracy of analysis:** Outliers can skew results, especially in statistical analyses, which could lead to misleading conclusions. For instance, if there are extreme AQI values due to rare events (e.g., wildfires), they might affect the overall trends or averages in the data, such as the mean or standard deviation. Detecting outliers allows you to correct or adjust the data (e.g., removing or treating outliers) to ensure the analysis reflects the general air quality trend, not just the extreme events.
- **Understanding extreme events:** Identifying outliers in AQI values helps recognize unusually high pollution levels, such as those caused by industrial accidents, natural disasters, or other extreme environmental events. This can inform public health recommendations or emergency response plans, as AQI values above a certain threshold can indicate unhealthy air quality for the general population or specific vulnerable groups.
- **Improving predictive modeling:** If you're using the AQI data for predictive modeling (e.g., forecasting air quality), outliers might distort model predictions. Detecting and addressing outliers ensures that models are based on representative data, improving their predictive power and accuracy. Outlier detection can help decide whether to remove or transform extreme values, making the data more appropriate for certain algorithms (especially those that are sensitive to outliers).
- **Data integrity:** Outliers could be a result of data errors such as faulty sensors or incorrect data entry. Identifying these can help ensure the integrity of the dataset. In such cases, addressing these outliers is crucial for maintaining the quality of the analysis and drawing valid conclusions from the dataset.
- **Context-specific decisions:** In some cases, outliers are important. For example, when studying AQI data, you might want to focus on extremely high AQI values to understand the causes of pollution spikes or the effectiveness of air quality policies. In such cases, outliers represent key phenomena that should not be dismissed.
- **Policy and decision making:** By understanding outliers, you can identify areas with frequent pollution spikes or assess long-term trends in air quality. This can help policymakers and public health officials to prioritize areas that need intervention, such as introducing stricter regulations or improving monitoring in high-risk areas.
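
To make the point about skewed averages concrete, the sketch below compares summary statistics for a handful of made-up readings with and without one extreme value. The numbers are illustrative only and are not taken from the EPA data.

```python
import numpy as np

# Ten made-up readings, then the same readings with one extreme value appended.
typical = np.array([2.0, 2.1, 1.9, 2.2, 2.0, 1.8, 2.1, 2.0, 1.9, 2.0])
with_outlier = np.append(typical, 9.0)

for label, values in (("without outlier", typical), ("with outlier", with_outlier)):
    print(f"{label}: mean={values.mean():.2f}, "
          f"median={np.median(values):.2f}, std={values.std(ddof=1):.2f}")
```

The single extreme value pulls the mean and the standard deviation upward while leaving the median unchanged, which is why outliers deserve a closer look before summary statistics are reported.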

## **Considerations**

**What are some key takeaways that you learned during this lab?**

A: Statistics are very important.

**What summary would you provide to audiences? Consider the distribution of the data and which sites would benefit from additional research.**

A: The analysis provides a comprehensive overview of the air quality data, highlighting areas with both good and poor air quality. The identification of outliers and trends in AQI values emphasizes the need for targeted research in regions that experience sporadic pollution spikes, as well as ongoing monitoring in areas with exceptionally clean air. These insights can guide policy decisions, improve air quality monitoring efforts, and prioritize public health initiatives in vulnerable areas.

**Reference**

US EPA, OAR. 2014, July 8. [Air Data: Air Quality Data Collected at Outdoor Monitors Across the US](https://www.epa.gov/outdoor-air-quality-data). 