### Activity: Explore probability distributions

#### **Introduction**

Determining which type of probability distribution best fits a dataset, calculating z-scores, and detecting outliers are essential skills in data work. These skills enable data professionals to understand how their data is distributed and to identify data points that need further examination.

In this activity, you are a member of an analytics team for the United States Environmental Protection Agency (EPA). The data includes information about more than 200 sites, identified by state, county, city, and local site names. One of your main goals is to determine which regions need support to make air quality improvements. Given that carbon monoxide is a major air pollutant, you will investigate data from the Air Quality Index (AQI) with respect to carbon monoxide.

#### **Step 1: Imports** 

In [1]:
# Import relevant libraries (the next cell expects pandas imported as pd).


In [2]:
# RUN THIS CELL TO IMPORT YOUR DATA.

### YOUR CODE HERE ###
data = pd.read_csv("modified_c4_epa_air_quality.csv")


#### **Step 2: Data exploration** 

Display the first 4 rows of the data to get a sense of how the data is structured.

In [17]:
# Display first 4 rows of the data.

### YOUR CODE HERE ###
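
As a rough illustration of this step, the sketch below calls `head(4)` on a small stand-in DataFrame. The values and the `state_name` column are placeholders invented for the example, not the contents of the EPA file; with the real data you would call the same method on `data`.

```python
import pandas as pd

# Stand-in for the EPA data (assumption: placeholder columns and values).
data = pd.DataFrame({
    "state_name": ["A", "B", "C", "D", "E", "F"],
    "aqi_log": [1.1, 1.6, 2.0, 2.3, 2.7, 3.0],
})

# head(n) returns the first n rows; the default is n=5.
first_rows = data.head(4)
print(first_rows)
```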



The `aqi_log` column contains AQI readings that were log-transformed to suit the objectives of this lab. The reasoning behind taking the logarithm of the AQI to obtain a bell-shaped distribution is outside the scope of this course, but the transformed values are helpful for illustrating the normal distribution.

To better understand the quantity of data you are working with, display the number of rows and the number of columns.
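
One way to do this, sketched here on a stand-in DataFrame with invented values, is the `shape` attribute, which returns a `(rows, columns)` tuple.

```python
import pandas as pd

# Stand-in for the EPA data (assumption: placeholder columns and values).
data = pd.DataFrame({
    "state_name": ["A", "B", "C"],
    "aqi_log": [1.1, 1.6, 2.0],
})

# shape is an attribute, not a method, so it takes no parentheses.
n_rows, n_cols = data.shape
print(f"{n_rows} rows, {n_cols} columns")
```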

Now, you want to find out whether `aqi_log` fits a specific type of probability distribution. Create a histogram to visualize the distribution of `aqi_log`. Then, based on its shape, visually determine if it resembles a particular distribution.

In [1]:
# Create a histogram to visualize distribution of aqi_log.
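
A minimal sketch of such a histogram follows. It assumes matplotlib is available and uses synthetic, normally distributed values as a stand-in for `aqi_log`, so the shape it produces says nothing about the real data. The bin count of 20 is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for aqi_log (assumption: draws from a normal distribution).
rng = np.random.default_rng(0)
data = pd.DataFrame({"aqi_log": rng.normal(loc=2.0, scale=0.7, size=1000)})

fig, ax = plt.subplots()
# ax.hist returns the bin counts, the bin edges, and the drawn bars.
counts, bin_edges, _ = ax.hist(data["aqi_log"], bins=20, edgecolor="black")
ax.set_xlabel("aqi_log")
ax.set_ylabel("Number of readings")
ax.set_title("Distribution of aqi_log (synthetic stand-in)")
# In a notebook the figure renders when the cell finishes;
# in a script, call plt.show() to display it.
```

A roughly symmetric, single-peaked shape that tapers on both sides is what you would look for when judging whether the data resembles a normal distribution.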

**Question:** What do you observe about the shape of the distribution from the histogram? 

#### **Step 3: Statistics check**

Use the empirical rule to examine the data, then check whether it is approximately normally distributed.


As you have learned, the empirical rule states that, for every normal distribution, approximately:
- 68% of the data fall within 1 standard deviation of the mean
- 95% of the data fall within 2 standard deviations of the mean
- 99.7% of the data fall within 3 standard deviations of the mean


First, define two variables to store the mean and standard deviation, respectively, for `aqi_log`. Creating these variables will help you easily access these measures as you continue with the calculations involved in applying the empirical rule. 

In [None]:
# Define variable for aqi_log mean.

# Print out the mean.



In [None]:
# Define variable for aqi_log standard deviation.

# Print out the standard deviation.
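
The following sketch shows one way to store both measures, using five invented values in place of the real readings. The variable names `mean_aqi_log` and `std_aqi_log` are illustrative choices, not names required by the lab.

```python
import pandas as pd

# Stand-in values (assumption: not real AQI readings).
data = pd.DataFrame({"aqi_log": [1.0, 2.0, 3.0, 4.0, 5.0]})

mean_aqi_log = data["aqi_log"].mean()
# Series.std() uses ddof=1 (sample standard deviation) by default.
std_aqi_log = data["aqi_log"].std()

print(mean_aqi_log)
print(std_aqi_log)
```

Note that pandas divides by n - 1 by default, whereas `numpy.std` divides by n unless you pass `ddof=1`, so the two can give slightly different results on the same data.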

Now, check the first part of the empirical rule: whether 68% of the `aqi_log` data falls within 1 standard deviation of the mean.

To compute the actual percentage of the data that satisfies this criterion, define the lower limit (that is, 1 standard deviation below the mean) and the upper limit (that is, 1 standard deviation above the mean). This will enable you to create a range and confirm whether each value falls within it.
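
A sketch of this check, run on synthetic normal draws rather than the real `aqi_log` column, might look like the following. The names `lower_limit`, `upper_limit`, and `pct_within_1_sd` are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for aqi_log (assumption: draws from a normal distribution).
rng = np.random.default_rng(42)
data = pd.DataFrame({"aqi_log": rng.normal(loc=2.0, scale=0.7, size=1000)})

mean_aqi_log = data["aqi_log"].mean()
std_aqi_log = data["aqi_log"].std()

# Limits 1 standard deviation below and above the mean.
lower_limit = mean_aqi_log - 1 * std_aqi_log
upper_limit = mean_aqi_log + 1 * std_aqi_log

# Boolean mask: True where a value lies inside the range.
within_1_sd = (data["aqi_log"] >= lower_limit) & (data["aqi_log"] <= upper_limit)

# The mean of a boolean Series is the fraction of True values.
pct_within_1_sd = within_1_sd.mean() * 100
print(f"{pct_within_1_sd:.1f}% of values fall within 1 standard deviation")
```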

Now, consider the second part of the empirical rule: whether 95% of the `aqi_log` data falls within 2 standard deviations of the mean.

To compute the actual percentage of the data that satisfies this criterion, define the lower limit (that is, 2 standard deviations below the mean) and the upper limit (that is, 2 standard deviations above the mean). This will enable you to create a range and confirm whether each value falls within it.
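
The same check can also be written with `Series.between`, which is inclusive on both ends by default. The sketch below again uses synthetic values as a stand-in for `aqi_log`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for aqi_log (assumption: draws from a normal distribution).
rng = np.random.default_rng(1)
data = pd.DataFrame({"aqi_log": rng.normal(loc=2.0, scale=0.7, size=1000)})

mean_aqi_log = data["aqi_log"].mean()
std_aqi_log = data["aqi_log"].std()

# Limits 2 standard deviations below and above the mean.
lower_limit = mean_aqi_log - 2 * std_aqi_log
upper_limit = mean_aqi_log + 2 * std_aqi_log

# between() returns True where lower_limit <= value <= upper_limit.
pct_within_2_sd = data["aqi_log"].between(lower_limit, upper_limit).mean() * 100
print(f"{pct_within_2_sd:.1f}% of values fall within 2 standard deviations")
```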

Now, consider the third part of the empirical rule: whether 99.7% of the `aqi_log` data falls within 3 standard deviations of the mean.

To compute the actual percentage of the data that satisfies this criterion, define the lower limit (that is, 3 standard deviations below the mean) and the upper limit (that is, 3 standard deviations above the mean). This will enable you to create a range and confirm whether each value falls within it.
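
If you prefer to run all three parts of the rule in one pass, a small helper can do so. The sketch below is one possible approach on synthetic data; `percent_within` is a hypothetical helper written for this example. The theoretical percentages come from the identity P(|Z| <= k) = erf(k / sqrt(2)) for a standard normal variable Z.

```python
import math
import numpy as np
import pandas as pd

# Synthetic stand-in for aqi_log (assumption: draws from a normal distribution).
rng = np.random.default_rng(2)
data = pd.DataFrame({"aqi_log": rng.normal(loc=2.0, scale=0.7, size=1000)})

mean_aqi_log = data["aqi_log"].mean()
std_aqi_log = data["aqi_log"].std()


def percent_within(series, mean, std, k):
    """Return the percentage of values within k standard deviations of the mean."""
    within = (series - mean).abs() <= k * std
    return within.mean() * 100


results = {}
for k in (1, 2, 3):
    observed = percent_within(data["aqi_log"], mean_aqi_log, std_aqi_log, k)
    # Coverage of a normal distribution within k standard deviations.
    expected = math.erf(k / math.sqrt(2)) * 100
    results[k] = (observed, expected)
    print(f"{k} SD: observed {observed:.1f}%, normal theory {expected:.1f}%")
```

Observed percentages close to the theoretical ones are consistent with a normal distribution, although they do not prove it.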

#### **Step 4: Results and evaluation** 

**Question:** What results did you obtain by applying the empirical rule? 

**Question:** How would you use z-scores to find outliers? 

Compute the z-score for every `aqi_log` value. Then, add a column named `z_score` in the data to store those results. 

In [None]:
# Compute the z-score for every aqi_log value, and add a column named z_score in the data to store those results.

### YOUR CODE HERE ###
# Display the first 5 rows to ensure that the new column was added.
### YOUR CODE HERE ###
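
A z-score expresses how many standard deviations a value lies from the mean: z = (x - mean) / standard deviation. The sketch below computes it directly with pandas on five invented values, which is one of several reasonable approaches.

```python
import pandas as pd

# Stand-in values (assumption: not real AQI readings).
data = pd.DataFrame({"aqi_log": [1.0, 2.0, 3.0, 4.0, 5.0]})

mean_aqi_log = data["aqi_log"].mean()
std_aqi_log = data["aqi_log"].std()  # sample standard deviation (ddof=1)

# Vectorized: subtract the mean from every value, then divide by the SD.
data["z_score"] = (data["aqi_log"] - mean_aqi_log) / std_aqi_log

print(data.head())
```

`scipy.stats.zscore` is an alternative. It uses the population standard deviation (`ddof=0`) by default, so pass `ddof=1` if you want results that match the pandas calculation above.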

Identify the rows of the data where `aqi_log` is more than 3 standard deviations above or below the mean.

In [None]:
# Display data where `aqi_log` is more than 3 standard deviations above or below the mean.

### YOUR CODE HERE ###
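
Because a z-score already measures distance from the mean in standard deviations, this filter reduces to keeping rows whose absolute z-score exceeds 3. The sketch below uses invented values with one deliberately extreme reading (10.0) to show the idea.

```python
import numpy as np
import pandas as pd

# Stand-in values (assumption): 50 evenly spaced readings plus one extreme value.
values = np.append(np.linspace(1.0, 3.0, 50), 10.0)
data = pd.DataFrame({"aqi_log": values})

# z-score: distance from the mean in units of the sample standard deviation.
data["z_score"] = (data["aqi_log"] - data["aqi_log"].mean()) / data["aqi_log"].std()

# Keep rows more than 3 standard deviations above or below the mean.
outliers = data[data["z_score"].abs() > 3]
print(outliers)
```

Using `.abs()` catches extreme values on both sides of the mean with a single condition.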

**Question:** What do you observe about potential outliers based on the calculations?


**Question:** Why is outlier detection an important part of this project? 

#### **Considerations**

**What summary would you provide to stakeholders? Consider the distribution of the data and which sites would benefit from additional research.**

**Reference**

US EPA, OAR. 2014, July 8. [Air Data: Air Quality Data Collected at Outdoor Monitors Across the US](https://www.epa.gov/outdoor-air-quality-data). 