# Activity: Explore probability distributions

## **Introduction**

Determining which type of probability distribution best fits a dataset, calculating z-scores, and detecting outliers are essential skills in data work. These skills enable data professionals to understand how their data is distributed and to identify data points that need further examination.

In this activity, you are a member of an analytics team for the United States Environmental Protection Agency (EPA). The data includes information about more than 200 sites, identified by state, county, city, and local site names. One of your main goals is to determine which regions need support to make air quality improvements. Given that carbon monoxide is a major air pollutant, you will investigate data from the Air Quality Index (AQI) with respect to carbon monoxide.

## **Step 1: Imports** 

Import relevant libraries, packages, and modules. For this lab, you will need `numpy`, `pandas`, `matplotlib.pyplot`, `statsmodels.api`, and `scipy`.

In [1]:
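# Import relevant libraries, packages, and modules.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats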

A subset of data was taken from the air quality data collected by the EPA, then transformed to suit the purposes of this lab. This subset is a .csv file named `modified_c4_epa_air_quality.csv`. As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [2]:
# RUN THIS CELL TO IMPORT YOUR DATA.

### YOUR CODE HERE ###
data = pd.read_csv(r"C:\Users\saswa\Documents\GitHub\Python-For-Data-Analysis\Course-4\Module-2\Data\modified_c4_epa_air_quality.csv")


## **Step 2: Data exploration** 

Display the first 10 rows of the data to get a sense of how the data is structured.

In [3]:
# Display first 10 rows of the data.

### YOUR CODE HERE ###



The `aqi_log` column contains AQI readings that were log-transformed to suit the objectives of this lab. The reasoning behind using a logarithm to obtain a bell-shaped distribution is outside the scope of this course, but the transformation makes it easier to see how the data relates to the normal distribution.

To better understand the quantity of data you are working with, display the number of rows and the number of columns.

In [4]:
# Display number of rows, number of columns.

### YOUR CODE HERE ###
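
The two exploration steps above can be sketched as follows. This is a minimal illustration on a small synthetic DataFrame, not the lab's dataset: the `site_id` column and all values are hypothetical, and only the `aqi_log` column name comes from this activity.

```python
import pandas as pd

# Synthetic stand-in for the lab's dataset (assumption: the real file has
# different columns and values; only `aqi_log` is named in this activity).
data = pd.DataFrame({
    "site_id": range(1, 13),  # hypothetical identifier column
    "aqi_log": [1.8, 2.1, 1.6, 2.4, 2.0, 1.9,
                2.2, 1.7, 2.3, 2.0, 1.5, 2.6],
})

# First 10 rows, to see the column names and typical values.
first_rows = data.head(10)
print(first_rows)

# `shape` is a (number of rows, number of columns) tuple.
n_rows, n_cols = data.shape
print(n_rows, n_cols)
```

In a notebook, leaving `data.head(10)` or `data.shape` as the last expression in a cell displays the result without `print`.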



Now, you want to find out whether `aqi_log` fits a specific type of probability distribution. Create a histogram to visualize the distribution of `aqi_log`. Then, based on its shape, visually determine if it resembles a particular distribution.

In [5]:
# Create a histogram to visualize distribution of aqi_log.

### YOUR CODE HERE ###
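
One possible way to draw the histogram is sketched below. It assumes synthetic, normally distributed values in place of the real `aqi_log` column, and the bin count of 20 is an arbitrary choice rather than a requirement of the lab.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for aqi_log (assumption: values and size are made up).
rng = np.random.default_rng(seed=0)
data = pd.DataFrame({"aqi_log": rng.normal(loc=1.8, scale=0.7, size=260)})

fig, ax = plt.subplots()
# ax.hist returns the bin counts, the bin edges, and the drawn bars.
counts, bin_edges, _ = ax.hist(data["aqi_log"], bins=20, edgecolor="black")
ax.set_xlabel("aqi_log")
ax.set_ylabel("Count")
ax.set_title("Distribution of aqi_log (synthetic data)")
```

A roughly symmetric, single-peaked shape suggests that a normal distribution may be a reasonable fit, which the empirical rule can then help check.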



**Question:** What do you observe about the shape of the distribution from the histogram? 

[Write your response here. Double-click (or enter) to edit.]

## **Step 3: Statistical tests**

Use the empirical rule to observe the data, then test and verify that it is normally distributed.


As you have learned, the empirical rule states that, for every normal distribution:
- 68% of the data fall within 1 standard deviation of the mean
- 95% of the data fall within 2 standard deviations of the mean
- 99.7% of the data fall within 3 standard deviations of the mean
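
The three percentages in the rule are rounded values of the normal distribution's probabilities. As a quick check, the share of a normal distribution within k standard deviations of the mean equals erf(k / √2), which the standard library can evaluate. The helper name `share_within` is introduced here only for illustration.

```python
import math

def share_within(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

# Percentages for 1, 2, and 3 standard deviations, rounded to two decimals.
expected = {k: round(100 * share_within(k), 2) for k in (1, 2, 3)}
print(expected)  # {1: 68.27, 2: 95.45, 3: 99.73}
```

These theoretical values are the benchmarks to compare against the percentages computed from `aqi_log`.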


First, define two variables to store the mean and standard deviation, respectively, for `aqi_log`. Creating these variables will help you easily access these measures as you continue with the calculations involved in applying the empirical rule. 

In [6]:
# Define variable for aqi_log mean.

### YOUR CODE HERE ###


# Print out the mean.

### YOUR CODE HERE ###



In [7]:
# Define variable for aqi_log standard deviation.

### YOUR CODE HERE ###



# Print out the standard deviation.

### YOUR CODE HERE ###
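
A minimal sketch of these two cells, using a tiny synthetic column so the results are easy to verify by hand. The variable names are assumptions, not names the lab prescribes. Note that pandas computes the sample standard deviation (`ddof=1`) by default, while NumPy's `np.std` defaults to the population version (`ddof=0`), so the two can differ slightly.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for aqi_log (assumption: values are made up).
data = pd.DataFrame({"aqi_log": [1.0, 2.0, 3.0, 4.0, 5.0]})

mean_aqi_log = data["aqi_log"].mean()
std_aqi_log = data["aqi_log"].std()  # sample standard deviation (ddof=1)
print(mean_aqi_log)  # 3.0
print(std_aqi_log)   # square root of 2.5

# Population standard deviation (ddof=0), shown for comparison.
std_population = np.std(data["aqi_log"].to_numpy())
print(std_population)  # square root of 2.0
```

With a few hundred rows the difference between the two versions is small, but it is worth using the same one consistently throughout the analysis.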



Now, check the first part of the empirical rule: whether 68% of the `aqi_log` data falls within 1 standard deviation of the mean.

To compute the actual percentage of the data that satisfies this criterion, define the lower limit (1 standard deviation below the mean) and the upper limit (1 standard deviation above the mean). This creates a range that you can check each value against.

In [8]:
# Define variable for lower limit, 1 standard deviation below the mean.

### YOUR CODE HERE ###



# Define variable for upper limit, 1 standard deviation above the mean.

### YOUR CODE HERE ###




# Display lower_limit, upper_limit.

### YOUR CODE HERE ###



In [9]:
# Display the actual percentage of data that falls within 1 standard deviation of the mean.

### YOUR CODE HERE ### 
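
The check for 1 standard deviation could look like the sketch below. It runs on synthetic normal data rather than the lab's file, so the printed percentage is only illustrative. `Series.between` includes both endpoints by default, and the mean of a boolean Series is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for aqi_log (assumption: values and size are made up).
rng = np.random.default_rng(seed=42)
data = pd.DataFrame({"aqi_log": rng.normal(loc=1.8, scale=0.7, size=10_000)})

mean_aqi_log = data["aqi_log"].mean()
std_aqi_log = data["aqi_log"].std()

# Range covering 1 standard deviation on either side of the mean.
lower_limit = mean_aqi_log - 1 * std_aqi_log
upper_limit = mean_aqi_log + 1 * std_aqi_log

# Boolean mask of values inside the range, converted to a percentage.
within = data["aqi_log"].between(lower_limit, upper_limit)
pct_within_1 = within.mean() * 100
print(round(pct_within_1, 2))
```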



Now, consider the second part of the empirical rule: whether 95% of the `aqi_log` data falls within 2 standard deviations of the mean.

To compute the actual percentage of the data that satisfies this criterion, define the lower limit (2 standard deviations below the mean) and the upper limit (2 standard deviations above the mean). This creates a range that you can check each value against.

In [10]:
# Define variable for lower limit, 2 standard deviations below the mean.

### YOUR CODE HERE ###




# Define variable for upper limit, 2 standard deviations above the mean.

### YOUR CODE HERE ###




# Display lower_limit, upper_limit.

### YOUR CODE HERE ###



In [11]:
# Display the actual percentage of data that falls within 2 standard deviations of the mean.

### YOUR CODE HERE ### 



Now, consider the third part of the empirical rule: whether 99.7% of the `aqi_log` data falls within 3 standard deviations of the mean.

To compute the actual percentage of the data that satisfies this criterion, define the lower limit (3 standard deviations below the mean) and the upper limit (3 standard deviations above the mean). This creates a range that you can check each value against.

In [12]:
# Define variable for lower limit, 3 standard deviations below the mean.

### YOUR CODE HERE ###



# Define variable for upper limit, 3 standard deviations above the mean.

### YOUR CODE HERE ###




# Display lower_limit, upper_limit.

### YOUR CODE HERE ###



In [13]:
# Display the actual percentage of data that falls within 3 standard deviations of the mean.

### YOUR CODE HERE ### 
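
Because the three checks differ only in the number of standard deviations, they can be wrapped in a small helper and compared side by side with the empirical rule. This is a sketch on synthetic data. The function name `pct_within` is hypothetical, and the lab itself asks for the limits to be defined step by step.

```python
import numpy as np
import pandas as pd

def pct_within(values: pd.Series, k: float) -> float:
    """Percentage of values within k standard deviations of the mean."""
    mean, std = values.mean(), values.std()
    return values.between(mean - k * std, mean + k * std).mean() * 100

# Synthetic stand-in for aqi_log (assumption: values and size are made up).
rng = np.random.default_rng(seed=42)
data = pd.DataFrame({"aqi_log": rng.normal(loc=1.8, scale=0.7, size=10_000)})

# Percentages stated by the empirical rule for 1, 2, and 3 standard deviations.
rule = {1: 68.0, 2: 95.0, 3: 99.7}
results = {k: pct_within(data["aqi_log"], k) for k in rule}

for k, observed in results.items():
    print(f"{k} SD: observed {observed:.2f}%, empirical rule {rule[k]}%")
```

Observed percentages close to the rule's values support treating the column as approximately normal. Large gaps would suggest skew or heavy tails.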



## **Step 4: Results and evaluation** 

**Question:** What results did you attain by applying the empirical rule? 

[Write your response here. Double-click (or enter) to edit.]

**Question:** How would you use z-scores to find outliers? 

[Write your response here. Double-click (or enter) to edit.]

Compute the z-score for every `aqi_log` value. Then, add a column named `z_score` to the data to store those results. 

In [14]:
# Compute the z-score for every aqi_log value, and add a column named z_score in the data to store those results.

### YOUR CODE HERE ###




# Display the first 5 rows to ensure that the new column was added.

### YOUR CODE HERE ###
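
A z-score is a value's distance from the mean, measured in standard deviations: z = (x − mean) / standard deviation. The sketch below applies that formula directly to a synthetic column. It uses the population standard deviation (`ddof=0`), which matches the default of `scipy.stats.zscore`. Using the sample version instead is also defensible and changes the scores only slightly.

```python
import pandas as pd

# Synthetic stand-in for aqi_log: 20 ordinary values plus one extreme value
# (assumption: values are made up for illustration).
data = pd.DataFrame({"aqi_log": [1.8, 1.9, 2.0, 2.1, 2.2] * 4 + [6.0]})

mean_aqi_log = data["aqi_log"].mean()
std_aqi_log = data["aqi_log"].std(ddof=0)  # population standard deviation

# z-score: number of standard deviations each value lies from the mean.
data["z_score"] = (data["aqi_log"] - mean_aqi_log) / std_aqi_log

print(data.head())
```

By construction, the resulting column has a mean of 0 and a population standard deviation of 1.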



Identify the rows of the data where `aqi_log` is more than 3 standard deviations above or below the mean.

In [15]:
# Display data where `aqi_log` is more than 3 standard deviations above or below the mean.

### YOUR CODE HERE ###
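
Values more than 3 standard deviations from the mean have an absolute z-score greater than 3, so the filter reduces to one boolean mask. The sketch below uses the same kind of synthetic data as a stand-in for the lab's file. The threshold of 3 follows the instruction above, and the values are made up.

```python
import pandas as pd

# Synthetic stand-in for aqi_log: 20 ordinary values plus one extreme value.
data = pd.DataFrame({"aqi_log": [1.8, 1.9, 2.0, 2.1, 2.2] * 4 + [6.0]})

mean_aqi_log = data["aqi_log"].mean()
std_aqi_log = data["aqi_log"].std(ddof=0)
data["z_score"] = (data["aqi_log"] - mean_aqi_log) / std_aqi_log

# Keep rows whose z-score is above 3 or below -3.
outliers = data[data["z_score"].abs() > 3]
print(outliers)
```

Only the row with `aqi_log` equal to 6.0 is returned here. On the real data, the returned rows identify the sites worth a closer look.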



**Question:** What do you observe about potential outliers based on the calculations?


[Write your response here. Double-click (or enter) to edit.]

**Question:** Why is outlier detection an important part of this project? 

[Write your response here. Double-click (or enter) to edit.]

## **Considerations**

**What are some key takeaways that you learned during this lab?**

[Write your response here. Double-click (or enter) to edit.]

**What summary would you provide to stakeholders? Consider the distribution of the data and which sites would benefit from additional research.**

[Write your response here. Double-click (or enter) to edit.]

**Reference**

US EPA, OAR. 2014, July 8. [Air Data: Air Quality Data Collected at Outdoor Monitors Across the US](https://www.epa.gov/outdoor-air-quality-data). 