# IS4487 Week 6 - Data Cleaning

This notebook is designed to help you follow along with the **Week 6 Lecture and Reading**, introducing you to data cleaning in Python.

The practice code demos give you a chance to see working code and can be a source for your lab and assignment work. Each section contains short explanations and annotated code that reflect the steps in the reading.

### Topics for this demo:
- View a real-world data file (from epa.gov)
- Look for incomplete, outlier, or unusable variables

<a href="https://colab.research.google.com/github/Stan-Pugsley/is_4487_base/blob/main/Demos/demo_06_data_cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


### Context: Air Quality Data
This example uses real-world data from the US Environmental Protection Agency (EPA). The columns fall into four groups:
- **Geography** - State, County
- **Day Counts** - Good Days, Moderate Days, Unhealthy Days, etc.
- **Statistics** - Max AQI, 90th % AQI, Median AQI
- **Defining Pollutant** - Days CO, Days NO2, Days Ozone, Days PM2.5, Days PM10. (The defining pollutant is the one that is highest or dominant on a particular day.)



| **AQI Range** | **Category**                   | **Color** | **Health Implications**                                                                 |
| ------------- | ------------------------------ | --------- | --------------------------------------------------------------------------------------- |
| 0 – 50        | Good                           | Green     | Air quality is considered satisfactory, and air pollution poses little or no risk.      |
| 51 – 100      | Moderate                       | Yellow    | Acceptable; some pollutants may be a concern for very sensitive individuals.            |
| 101 – 150     | Unhealthy for Sensitive Groups | Orange    | Sensitive groups (children, elderly, etc.) may experience health effects.               |
| 151 – 200     | Unhealthy                      | Red       | Everyone may begin to experience health effects; sensitive groups may feel more severe. |
| 201 – 300     | Very Unhealthy                 | Purple    | Health alert: everyone may experience more serious health effects.                      |
| 301 – 500     | Hazardous                      | Maroon    | Health warnings of emergency conditions. The entire population is more likely affected. |
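
The ranges in the table above can be applied to an AQI column with `pd.cut`. The following is a minimal sketch, assuming a small made-up sample rather than the EPA file; the `sample` DataFrame and the `Max AQI Category` column name are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Bin edges and labels copied from the AQI table above.
aqi_bins = [0, 50, 100, 150, 200, 300, 500]
aqi_labels = [
    "Good",
    "Moderate",
    "Unhealthy for Sensitive Groups",
    "Unhealthy",
    "Very Unhealthy",
    "Hazardous",
]

# Hypothetical sample values, not taken from the EPA file.
sample = pd.DataFrame({"Max AQI": [42, 51, 150, 201, 480]})

# include_lowest=True keeps an AQI of exactly 0 in the "Good" bin.
# Values above 500 fall outside every bin and become NaN.
sample["Max AQI Category"] = pd.cut(
    sample["Max AQI"], bins=aqi_bins, labels=aqi_labels, include_lowest=True
)
print(sample)
```

Any value that comes back as `NaN` after binning is outside the 0 to 500 scale and is worth a closer look as a possible data quality issue.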

Your task is to look for data quality issues and identify ways to clean them.

### Import the data

Import the data into a dataframe

In [None]:
import pandas as pd

# Load the EPA air quality CSV from GitHub (?raw=true returns the raw file)
df = pd.read_csv('https://github.com/Stan-Pugsley/is_4487_base/blob/main/DataSets/air_quality_by_county.csv?raw=true')
print(df)

Get descriptive statistics for the dataset

In [None]:
df.describe()

Count the records for each state to check whether all 50 states are included

In [None]:
df['State'].value_counts()
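
Reading a long `value_counts()` listing by eye is error-prone, so one option is to compare the states found against a list of expected names. This is a rough sketch on a made-up sample; `sample` and the shortened `expected_states` set are hypothetical, and in the notebook you would use `df['State']` and a full list of state names.

```python
import pandas as pd

# Hypothetical sample; in the notebook, use df['State'] instead.
sample = pd.DataFrame({"State": ["Utah", "Utah", "Idaho", "Nevada"]})

# Hypothetical, shortened list of expected states (illustration only).
expected_states = {"Utah", "Idaho", "Nevada", "Wyoming"}

found_states = set(sample["State"].unique())

# Expected states with no records at all.
missing_states = sorted(expected_states - found_states)
# Values in the data that are not in the expected list.
unexpected_states = sorted(found_states - expected_states)

print("Distinct states found:", sample["State"].nunique())
print("Missing states:", missing_states)
print("Unexpected values:", unexpected_states)
```

The "unexpected" list is useful for spotting misspellings or entries that are not states.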

**Questions:**
- Are there some counties with too few days to be considered a valid sample?
- Are there any outliers?  How should we handle them?
- Are there null/NA values?
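
The null and outlier questions above can be explored with a few lines of pandas. The sketch below uses a made-up sample and the 1.5 × IQR rule, which is one common convention for flagging outliers and not necessarily the approach used in the lecture; the `sample` values are hypothetical.

```python
import pandas as pd

# Hypothetical sample with one missing county and one extreme AQI value.
sample = pd.DataFrame({
    "County": ["A", "B", None, "D", "E", "F"],
    "Max AQI": [80, 95, 90, 110, 100, 900],
})

# Count null/NA values in each column.
null_counts = sample.isna().sum()
print(null_counts)

# Flag outliers with the 1.5 * IQR rule (an assumed convention).
q1 = sample["Max AQI"].quantile(0.25)
q3 = sample["Max AQI"].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = sample[(sample["Max AQI"] < lower) | (sample["Max AQI"] > upper)]
print(outliers)
```

Whether a flagged row should be removed, capped, or kept depends on whether the value is a recording error or a real event, so inspect the rows before deciding.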

In [None]:
# remove records with a missing county
df = df[df['County'].notna()] 

In [None]:
# remove records with fewer than a specified number of days with AQI data
df_filtered = df[df['Days with AQI'] >= 500]
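
After a filter like this, it helps to confirm how many rows were kept and removed. The sketch below uses a made-up sample and a hypothetical `min_days` threshold of 300, chosen only for illustration. It also prints the largest day count, because a threshold above that value would remove every row (a single calendar year, for example, has at most 366 days).

```python
import pandas as pd

# Hypothetical sample; in the notebook, use df instead.
sample = pd.DataFrame({
    "County": ["A", "B", "C", "D"],
    "Days with AQI": [365, 120, 350, 30],
})

# Hypothetical threshold for illustration only.
min_days = 300

# Check the largest value first so the threshold is not set above it.
max_days = sample["Days with AQI"].max()
print("Largest day count:", max_days)

filtered = sample[sample["Days with AQI"] >= min_days]
rows_removed = len(sample) - len(filtered)

print("Rows kept:", len(filtered))
print("Rows removed:", rows_removed)
```

If the filtered result is empty, compare the threshold against the largest day count before assuming the data is at fault.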