## Introduction

Data professionals often use descriptive statistics to understand the data they are working with and to provide collaborators with a summary of the relative location of values in the data, as well as information about its spread.
Throughout this notebook, we will practice computing descriptive statistics to explore and summarize a dataset. 

## Import packages and libraries

Before getting started, we will need to import all the required libraries and extensions. Throughout the course, we will be using pandas and numpy for operations and matplotlib for plotting.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# Change the path to the dataset as required
education_districtwise = pd.read_csv('education_districtwise.csv')

## Explore the data

Let's start with the `head()` function to get a quick overview of the dataset. `head()` returns as many rows as the number you pass in as its argument; with no argument, it returns the first 5 rows.

In [3]:
education_districtwise.head(10)

| | DISTNAME | STATNAME | BLOCKS | VILLAGES | CLUSTERS | TOTPOPULAT | OVERALL_LI |
|---|---|---|---|---|---|---|---|
| 0 | DISTRICT32 | STATE1 | 13 | 391 | 104 | 875564.0 | 66.92 |
| 1 | DISTRICT649 | STATE1 | 18 | 678 | 144 | 1015503.0 | 66.93 |
| 2 | DISTRICT229 | STATE1 | 8 | 94 | 65 | 1269751.0 | 71.21 |
| 3 | DISTRICT259 | STATE1 | 13 | 523 | 104 | 735753.0 | 57.98 |
| 4 | DISTRICT486 | STATE1 | 8 | 359 | 64 | 570060.0 | 65.0 |
| 5 | DISTRICT323 | STATE1 | 12 | 523 | 96 | 1070144.0 | 64.32 |
| 6 | DISTRICT114 | STATE1 | 6 | 110 | 49 | 147104.0 | 80.48 |
| 7 | DISTRICT438 | STATE1 | 7 | 134 | 54 | 143388.0 | 74.49 |
| 8 | DISTRICT610 | STATE1 | 10 | 388 | 80 | 409576.0 | 65.97 |
| 9 | DISTRICT476 | STATE1 | 11 | 361 | 86 | 555357.0 | 69.9 |
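
As a minimal sketch of how the argument to `head()` controls the number of rows returned, the following uses a small made-up DataFrame (the `toy` data and its column names are assumptions for illustration, not part of the education dataset).

```python
import pandas as pd

# Hypothetical toy data, not taken from education_districtwise.csv
toy = pd.DataFrame({
    "district": [f"D{i}" for i in range(8)],
    "literacy": [66.9, 71.2, 58.0, 65.0, 64.3, 80.5, 74.5, 66.0],
})

default_rows = toy.head()   # no argument: the first 5 rows
three_rows = toy.head(3)    # explicit argument: the first 3 rows

print(len(default_rows), len(three_rows))
```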


**Note**: To interpret this data correctly, it’s important to understand that each row, or observation, refers to a different *district* (and not, for example, to a state or a village). So, the `VILLAGES` column indicates how many villages are in each district, the `TOTPOPULAT` column indicates the population for each district, and the `OVERALL_LI` column indicates the literacy rate for each district. 

### Use describe() to compute descriptive stats

Now that we have a better understanding of the dataset, let's use Python to compute descriptive stats. 

When computing descriptive stats in Python, the most useful function to know is `describe()`. Data professionals use the `describe()` function as a convenient way to calculate many key stats all at once. For a numeric column, `describe()` gives you the following output: 

*   `count`: Number of non-NA/null observations
*   `mean`: The arithmetic average
*   `std`: The standard deviation
*   `min`: The smallest (minimum) value
*   `25%`: The first quartile (25th percentile)
*   `50%`: The median (50th percentile) 
*   `75%`: The third quartile (75th percentile)
*   `max`: The largest (maximum) value


**Reference**: [pandas.DataFrame.describe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html)
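
To make the output listed above concrete, here is a small sketch that calls `describe()` on a made-up numeric Series (the `scores` values are hypothetical) and reads individual entries by their labels.

```python
import pandas as pd

# Hypothetical values chosen so the quartiles are easy to check by hand
scores = pd.Series([60.0, 70.0, 80.0, 90.0, 100.0])

summary = scores.describe()
print(summary)

# describe() returns a Series, so each stat can be looked up by its label
print(summary["mean"])   # arithmetic average
print(summary["50%"])    # median
```

Because the result is itself a Series, you can reuse any of its entries in later computations instead of reading them off the printed output.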

Our main interest is the literacy rate. This data is contained in the `OVERALL_LI` column, which shows the literacy rate for each district in the nation. Use the `describe()` function to reveal key stats about literacy rate. 

In [4]:
education_districtwise['OVERALL_LI'].describe()

count    634.000000
mean      73.395189
std       10.098460
min       37.220000
25%       66.437500
50%       73.490000
75%       80.815000
max       98.760000
Name: OVERALL_LI, dtype: float64

The summary of stats gives us valuable information about the overall literacy rate. For example, the mean helps to clarify the center of your dataset; we now know the average literacy rate is about 73% for all districts. This information is useful in itself and also as a basis for comparison. Knowing the mean literacy rate for *all* districts helps us understand which individual districts are significantly above or below the mean. 

**Note**: `describe()` excludes missing values (`NaN`) from consideration. You may notice that the count, or the number of observations for `OVERALL_LI` (634), is lower than the number of rows in the dataset (680). Steps for dealing with missing data are presented in later Python notebooks.
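
The gap between the row count and the `count` stat can be checked directly. The sketch below uses a hypothetical Series with two missing values (the `rates` data is an assumption for illustration) to show how the number of rows, the non-missing count, and the number of `NaN` values relate.

```python
import numpy as np
import pandas as pd

# Hypothetical literacy rates with two missing entries
rates = pd.Series([66.9, np.nan, 71.2, 58.0, np.nan])

n_rows = len(rates)              # counts every row, including NaN
n_valid = rates.count()          # counts only non-missing values
n_missing = rates.isna().sum()   # counts the NaN values
mean_rate = rates.mean()         # NaN values are skipped by default

print(n_rows, n_valid, n_missing)
print(rates.describe()["count"])
```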

You can also use the `describe()` function for a column with categorical data, like the `STATNAME` column. 

For a categorical column, `describe()` gives you the following output: 

*   `count`: Number of non-NA/null observations
*   `unique`: Number of unique values
*   `top`: The most common value (the mode)
*   `freq`: The frequency of the most common value


In [5]:
education_districtwise['STATNAME'].describe()

count         680
unique         36
top       STATE21
freq           75
Name: STATNAME, dtype: object

The `unique` category indicates that there are 36 states in our dataset. The `top` category indicates that `STATE21` is the most commonly occurring value, or mode. The `freq` category tells you that `STATE21` appears in 75 rows, which means it includes 75 different districts. 

This information may be helpful in determining which states will need more educational resources based on their number of districts. 
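
If you want the full frequency table rather than only the top value, `value_counts()` and `nunique()` give the same information in more detail. This is a sketch on a made-up Series (the `states` labels are hypothetical), not on the education dataset.

```python
import pandas as pd

# Hypothetical state labels, one per district
states = pd.Series(
    ["STATE_A", "STATE_B", "STATE_A", "STATE_C", "STATE_A", "STATE_B"]
)

summary = states.describe()      # count, unique, top, freq
counts = states.value_counts()   # frequency of every value, most common first
n_unique = states.nunique()      # same number as the `unique` entry

print(summary)
print(counts)
```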

### Functions for stats

The `describe()` function is useful because it reveals a variety of key stats all at once. pandas also provides separate functions for the mean, median, standard deviation, minimum, and maximum: `mean()`, `median()`, `std()`, `min()`, and `max()`. These individual functions are useful if you want to do further computations based on descriptive stats. For example, you can use the `min()` and `max()` functions together to compute the range of your data.
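
As a rough illustration, the sketch below calls each individual function on a made-up Series (the `literacy` values are hypothetical) and confirms that the results agree with the corresponding entries from `describe()`.

```python
import pandas as pd

# Hypothetical literacy rates for five districts
literacy = pd.Series([57.98, 64.32, 65.0, 66.92, 71.21])

stats = {
    "mean": literacy.mean(),
    "median": literacy.median(),
    "std": literacy.std(),    # sample standard deviation (ddof=1)
    "min": literacy.min(),
    "max": literacy.max(),
}

summary = literacy.describe()
for name, value in stats.items():
    # describe() labels the median as "50%"
    label = "50%" if name == "median" else name
    print(name, value, summary[label])
```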


### Use max() and min() to compute range


The **range** is the difference between the largest and smallest values in a dataset. In other words, range = max - min. You can use `max()` and `min()` to compute the range for the literacy rate of all districts in your dataset. 

In [6]:
range_overall_li = education_districtwise['OVERALL_LI'].max() - education_districtwise['OVERALL_LI'].min()
range_overall_li

61.540000000000006

The range in literacy rates for all districts is about 61.5 percentage points. 

This large difference tells you that some districts have much higher literacy rates than others. Later on, you will continue to analyze this data, and you can discover which districts have the lowest literacy rates. This will help the government better understand literacy rates nationally and build on their successful educational programs. 

## Practicing on another dataset

Next, we will compute descriptive stats on a dataset from the United States Environmental Protection Agency (EPA). The task is to analyze air quality data with respect to carbon monoxide, a major air pollutant. The data includes information from more than 200 sites, identified by state, county, city, and local site names. You will use Python functions to gather statistics about air quality, then share insights in the conclusion.

## **Step 1: Imports** 

In [1]:
# Import relevant Python libraries.

import pandas as pd
import numpy as np

Load the dataset into a DataFrame. The dataset provided is in the form of a .csv file named `c4_epa_air_quality.csv`. It contains a subset of data from the U.S. EPA.

In [2]:
# Load data from the .csv file into a DataFrame and save in a variable.

epa_data = pd.read_csv("c4_epa_air_quality.csv", index_col = 0)

## **Step 2: Data exploration** 

To understand how the dataset is structured, display the first 10 rows of the data.

In [3]:
# Display first 10 rows of the data.

epa_data.head(10)

| | date_local | state_name | county_name | city_name | local_site_name | parameter_name | units_of_measure | arithmetic_mean | aqi |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-01-01 | Arizona | Maricopa | Buckeye | BUCKEYE | Carbon monoxide | Parts per million | 0.473684 | 7 |
| 1 | 2018-01-01 | Ohio | Belmont | Shadyside | Shadyside | Carbon monoxide | Parts per million | 0.263158 | 5 |
| 2 | 2018-01-01 | Wyoming | Teton | Not in a city | Yellowstone National Park - Old Faithful Snow ... | Carbon monoxide | Parts per million | 0.111111 | 2 |
| 3 | 2018-01-01 | Pennsylvania | Philadelphia | Philadelphia | North East Waste (NEW) | Carbon monoxide | Parts per million | 0.3 | 3 |
| 4 | 2018-01-01 | Iowa | Polk | Des Moines | CARPENTER | Carbon monoxide | Parts per million | 0.215789 | 3 |
| 5 | 2018-01-01 | Hawaii | Honolulu | Not in a city | Kapolei | Carbon monoxide | Parts per million | 0.994737 | 14 |
| 6 | 2018-01-01 | Hawaii | Honolulu | Not in a city | Kapolei | Carbon monoxide | Parts per million | 0.2 | 2 |
| 7 | 2018-01-01 | Pennsylvania | Erie | Erie | | Carbon monoxide | Parts per million | 0.2 | 2 |
| 8 | 2018-01-01 | Hawaii | Honolulu | Honolulu | Honolulu | Carbon monoxide | Parts per million | 0.4 | 5 |
| 9 | 2018-01-01 | Colorado | Larimer | Fort Collins | Fort Collins - CSU - S. Mason | Carbon monoxide | Parts per million | 0.3 | 6 |


The aqi column represents the EPA's Air Quality Index (AQI).

Now, get a table that contains some descriptive statistics about the data.

In [5]:
# Get descriptive stats.

epa_data.describe()

| | arithmetic_mean | aqi |
|---|---|---|
| count | 260.0 | 260.0 |
| mean | 0.403169 | 6.757692 |
| std | 0.317902 | 7.061707 |
| min | 0.0 | 0.0 |
| 25% | 0.2 | 2.0 |
| 50% | 0.276315 | 5.0 |
| 75% | 0.516009 | 9.0 |
| max | 1.921053 | 50.0 |


* Based on the table of descriptive statistics, the count value for the aqi column is 260. This means there are 260 aqi measurements represented in this dataset.

* The 25th percentile for the aqi column is 2. This means that about 25% of the aqi values in the data are at or below 2.

* The 75th percentile for the aqi column is 9. This means that about 75% of the aqi values in the data are at or below 9.
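
The percentile readings above can be verified by computing the share of values at or below each quartile. The sketch below does this on a small made-up Series (the `aqi` values are hypothetical, not the EPA data). Note that when values are tied at a quartile, the share at or below it can be larger than the nominal percentage.

```python
import pandas as pd

# Hypothetical aqi readings, sorted for readability
aqi = pd.Series([0, 1, 2, 2, 3, 5, 5, 7, 9, 12, 20, 50])

q1 = aqi.quantile(0.25)   # first quartile
q3 = aqi.quantile(0.75)   # third quartile

# Comparing to a threshold gives booleans; their mean is the share that is True
share_at_or_below_q1 = (aqi <= q1).mean()
share_at_or_below_q3 = (aqi <= q3).mean()

print(q1, share_at_or_below_q1)
print(q3, share_at_or_below_q3)
```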

## **Step 3: Statistical tests**

Next, get some descriptive statistics about the states in the data.

In [6]:
# Get descriptive stats about the states in the data.

epa_data["state_name"].describe()

count            260
unique            52
top       California
freq              66
Name: state_name, dtype: object

There are 260 state values, and 52 of them are unique. California is the most commonly occurring state in the data, with a frequency of 66. (In other words, 66 entries in the data correspond to aqi measurements taken in California.)

**Note**: Sometimes you have to calculate statistics individually. To review that approach, use the numpy library to calculate each of the main statistics in the preceding table for the aqi column.

## **Step 4: Results and evaluation**

In [7]:
# Compute the mean value from the aqi column.

np.mean(epa_data["aqi"])

6.757692307692308

The mean value from the aqi column is approximately 6.76 (rounded to 2 decimal places). This means that the average aqi from the data is approximately 6.76.

Next, compute the median value from the aqi column.

In [8]:
# Compute the median value from the aqi column.

np.median(epa_data["aqi"])

5.0

What do you notice about the median value from the aqi column? This is an important measure for understanding the central location of the data.
The median value for the aqi column is 5.0. This means that about half of the aqi values in the data are at or below 5.
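
In this data the mean (about 6.76) sits above the median (5.0), which is what you would expect when a few large values pull the average up. The sketch below illustrates that effect with made-up numbers (both lists are hypothetical, not the EPA data): one large reading moves the mean noticeably while the median stays put.

```python
import numpy as np

# Hypothetical aqi readings, with and without one unusually large value
without_outlier = [2, 3, 5, 5, 6]
with_outlier = [2, 3, 5, 5, 6, 50]

mean_without = np.mean(without_outlier)
median_without = np.median(without_outlier)

mean_with = np.mean(with_outlier)
median_with = np.median(with_outlier)

print(mean_without, median_without)
print(mean_with, median_with)
```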

Next, identify the minimum value from the aqi column.

In [10]:
# Identify the minimum value from the aqi column.

np.min(epa_data["aqi"])

0

What do you notice about the minimum value from the aqi column? This is an important measure, as it tells you the best air quality observed in the data.
The minimum value for the aqi column is 0. This means that the smallest aqi value in the data is 0.

Now, identify the maximum value from the aqi column.

In [11]:
# Identify the maximum value from the aqi column.

np.max(epa_data["aqi"])

50

The maximum value for the aqi column is 50. This means that the largest aqi value in the data is 50. This is an important measure, as it tells you which value in the data corresponds to the worst air quality observed in the data.

Now, compute the standard deviation for the aqi column.

By default, the numpy library uses 0 as the Delta Degrees of Freedom (`ddof`), while the pandas library uses 1. To get the same value for standard deviation using either library, set the `ddof` parameter to 1 when calculating standard deviation with numpy.

In [12]:
# Compute the standard deviation for the aqi column.

np.std(epa_data["aqi"], ddof=1)

7.0617066788207215

The standard deviation for the aqi column is approximately 7.06 (rounded to 2 decimal places). This is an important measure of how spread out the aqi values are in the data.
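
To see the effect of `ddof` in isolation, the following sketch compares the numpy default, numpy with `ddof=1`, and the pandas default on a small made-up Series (the `values` data is hypothetical). With `ddof=0` the sum of squared deviations is divided by n; with `ddof=1` it is divided by n - 1.

```python
import numpy as np
import pandas as pd

# Hypothetical values: mean is 5 and the squared deviations sum to 32
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
arr = values.to_numpy()

population_std = np.std(arr)       # numpy default, ddof=0: divides by n
sample_std = np.std(arr, ddof=1)   # divides by n - 1
pandas_std = values.std()          # pandas default, ddof=1

print(population_std)              # 2.0
print(sample_std, pandas_std)      # these two agree
```

The difference between the two versions shrinks as the number of observations grows, so it matters most for small samples.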

**References**

[Air Quality Index - A Guide to Air Quality and Your Health](https://www.airnow.gov/sites/default/files/2018-04/aqi_brochure_02_14_0.pdf). (2014, February).

[Numpy.Std — NumPy v1.23 Manual](https://numpy.org/doc/stable/reference/generated/numpy.std.html)

US EPA, OAR. (2014, 8 July). [*Air Data: Air Quality Data Collected at Outdoor Monitors Across the US*](https://www.epa.gov/outdoor-air-quality-data).

## **Step 5: Conclusion**

* "AQI values at or below 100 are generally thought of as satisfactory. When AQI values are above 100, air quality is considered to be unhealthy—at first for certain sensitive groups of people, then for everyone as AQI values increase."
* "An AQI of 100 for carbon monoxide corresponds to a level of 9.4 parts per million."

The average AQI value in the data is approximately 6.76, which is considered safe with respect to carbon monoxide. Further, about 75% of the AQI values are at or below 9.

**Summary**

About 75% of the AQI values in the data are at or below 9, which is considered good air quality.
Funding should be allocated for further investigation of the less healthy regions in order to learn how to improve the conditions.

## Conclusion

**Congratulations!** You've completed this notebook. 

You now understand how to compute descriptive statistics with Python. Going forward, you can start using descriptive statistics to explore and summarize your own datasets.