<a href="https://colab.research.google.com/github/Gabbie22/is_4487_base/blob/main/Labs/Scripts/lab_11_air_quality_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IS 4487 Lab 11

## Learning Objective

Use Linear Regression to predict the AQI in Utah.

## Outline

- Pull the latest "Daily AQI by County" file from this link: https://aqs.epa.gov/aqsweb/airdata/download_files.html#AQI

- Your target variable will be *AQI*, which is the value of the air quality index

- We will focus the analysis on only the air quality in the state of Utah.  

- Note that there is a several-month lag in preparing data; you should check to see if your file has a full year of data from January to December.  If not, use the previous year.    

- The AQI is divided into six categories:

*Air Quality Index*

| AQI Values | Level of Health Concern |
|------------|-------------------------|
| 0-50 | Good |
| 51-100 | Moderate |
| 101-150 | Unhealthy for Sensitive Groups |
| 151-200 | Unhealthy |
| 201-300 | Very Unhealthy |
| 301-500 | Hazardous |


## Load Libraries

➡️ Assignment Tasks
- Load any necessary libraries

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import files
from sklearn.linear_model import LinearRegression  # linear regression model
from sklearn.model_selection import train_test_split  # 80/20 train/test split
from sklearn.metrics import r2_score  # R squared for evaluating the regression

## Import Data into Dataframe

➡️ Assignment Tasks
- Pull the latest full year of data using the "Daily AQI by County" files from this link: https://aqs.epa.gov/aqsweb/airdata/download_files.html#AQI
- Make sure to UNZIP the file
- Import data from the air quality dataset into a dataframe
- Describe or profile the dataframe

In [14]:

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

# Read the uploaded CSV (fn holds the name of the last uploaded file)
df = pd.read_csv(fn)


Saving daily_aqi_by_county_2024.csv to daily_aqi_by_county_2024 (1).csv
User uploaded file "daily_aqi_by_county_2024 (1).csv" with length 16697925 bytes


In [15]:
print(df.head())

  State Name county Name  State Code  County Code        Date  AQI Category  \
0    Alabama     Baldwin           1            3  2024-01-03   41     Good   
1    Alabama     Baldwin           1            3  2024-01-04   38     Good   
2    Alabama     Baldwin           1            3  2024-01-05   44     Good   
3    Alabama     Baldwin           1            3  2024-01-06    7     Good   
4    Alabama     Baldwin           1            3  2024-01-07   29     Good   

  Defining Parameter Defining Site  Number of Sites Reporting  
0              PM2.5   01-003-0010                          1  
1              PM2.5   01-003-0010                          1  
2              PM2.5   01-003-0010                          1  
3              PM2.5   01-003-0010                          1  
4              PM2.5   01-003-0010                          1  
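
The task list also asks for a description or profile of the dataframe. A minimal sketch, assuming pandas' built-in summaries are sufficient; the small `sample` frame is a hypothetical stand-in for `df` so the cell runs on its own:

```python
import pandas as pd

# Hypothetical stand-in for the uploaded file; in the notebook, use df instead.
sample = pd.DataFrame({
    "State Name": ["Utah", "Utah", "Alabama", "Alabama"],
    "Date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"],
    "AQI": [71, 64, 41, None],
    "Defining Parameter": ["PM2.5", "PM2.5", "PM2.5", "Ozone"],
})

sample.info()                      # column types and non-null counts
summary = sample.describe()        # count, mean, std, min, quartiles, max for numeric columns
null_counts = sample.isna().sum()  # missing values per column
print(summary)
print(null_counts)
```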


## Prepare Data

➡️ Assignment Tasks
- Filter the data to use Utah data only
- Create one dummy variable (true/false) for each of the Defining Parameter values    
- Create variables for month of year, year, and season
- Perform any other data cleanup needed (remove outliers, nulls, etc.)
- After filtering for Utah, remove the geographical variables that remain (county, state) since those non-numeric values can't be used.  Remove any other non-numeric variables.
- Select the data you would like to use in the model.  If you aggregate data, you will have to decide whether to use the min, max or mean value for AQI
- Split the data 80/20 for training and testing

In [18]:
# prompt: take the dataframe above and omit any data with State Name other than Utah as the value

# Assuming 'df' is your DataFrame
df_utah = df[df['State Name'] == 'Utah']

# Now df_utah contains only data for Utah
print(df_utah.head())


       State Name county Name  State Code  County Code        Date  AQI  \
178824       Utah   Box Elder          49            3  2024-01-01   71   
178825       Utah   Box Elder          49            3  2024-01-02   64   
178826       Utah   Box Elder          49            3  2024-01-03   66   
178827       Utah   Box Elder          49            3  2024-01-04   60   
178828       Utah   Box Elder          49            3  2024-01-05   29   

        Category Defining Parameter Defining Site  Number of Sites Reporting  
178824  Moderate              PM2.5   49-003-0005                          2  
178825  Moderate              PM2.5   49-003-0005                          2  
178826  Moderate              PM2.5   49-003-0005                          2  
178827  Moderate              PM2.5   49-003-0005                          2  
178828      Good              Ozone   49-003-7001                          2  


In [21]:
# Keep only the columns needed for modeling and save the result as df_clean.
# Category is omitted here; it is derived directly from AQI.

df_clean = df_utah[['State Name', 'County Code', 'Date', 'AQI', 'Defining Parameter']].copy()
print(df_clean.head())

       State Name  County Code        Date  AQI Defining Parameter
178824       Utah            3  2024-01-01   71              PM2.5
178825       Utah            3  2024-01-02   64              PM2.5
178826       Utah            3  2024-01-03   66              PM2.5
178827       Utah            3  2024-01-04   60              PM2.5
178828       Utah            3  2024-01-05   29              Ozone


In [None]:
#create column
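
One possible way to derive the month, year, season, and dummy columns is sketched below. The `demo` frame is a hypothetical stand-in for `df_clean`, and the season mapping (December through February as winter, and so on) is an assumption, not something the lab specifies:

```python
import pandas as pd

# Hypothetical stand-in for df_clean; in the notebook, start from df_clean instead.
demo = pd.DataFrame({
    "Date": ["2024-01-05", "2024-04-10", "2024-07-15", "2024-10-20"],
    "AQI": [71, 45, 90, 38],
    "Defining Parameter": ["PM2.5", "Ozone", "Ozone", "PM10"],
})

# Parse the date strings so the .dt accessors are available.
demo["Date"] = pd.to_datetime(demo["Date"])
demo["Month"] = demo["Date"].dt.month
demo["Year"] = demo["Date"].dt.year

# Assumed season definition: meteorological seasons.
season_by_month = {12: "Winter", 1: "Winter", 2: "Winter",
                   3: "Spring", 4: "Spring", 5: "Spring",
                   6: "Summer", 7: "Summer", 8: "Summer",
                   9: "Fall", 10: "Fall", 11: "Fall"}
demo["Season"] = demo["Month"].map(season_by_month)

# One true/false column per Defining Parameter value and per season.
demo = pd.get_dummies(demo, columns=["Defining Parameter", "Season"], dtype=bool)
print(demo.head())
```

`pd.get_dummies` prefixes each new column with the original column name, so the PM2.5 flag ends up in `Defining Parameter_PM2.5`.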

In [None]:
#data cleanup
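
A cleanup pass might look like the following sketch. The 1.5 × IQR outlier rule is an assumed choice, and `demo` is hypothetical data standing in for the Utah frame:

```python
import pandas as pd

# Hypothetical stand-in for the Utah data; 480 is a deliberately extreme value.
demo = pd.DataFrame({
    "State Name": ["Utah"] * 8,
    "County Code": [3, 3, 3, 3, 35, 35, 35, 35],
    "AQI": [40, 45, 50, 55, 60, None, 65, 480],
})

# Drop rows with a missing target value.
demo = demo.dropna(subset=["AQI"])

# Assumed outlier rule: keep rows within 1.5 * IQR of the quartiles.
q1, q3 = demo["AQI"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = demo["AQI"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
demo = demo[in_range]

# Remove the geographic identifiers once the data is Utah-only.
demo = demo.drop(columns=["State Name", "County Code"])
print(demo)
```

Whether to drop high readings at all is a judgment call, since an extreme AQI value can be a real event rather than a data error.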

In [None]:
#select final columns for use
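
To keep only model-ready columns, one option is to filter by dtype and then separate the target from the predictors. This is a sketch on a hypothetical prepared frame; the column names `Month`, `Param_Ozone`, and `Season` are illustrative:

```python
import pandas as pd

# Hypothetical prepared frame; the column names are illustrative.
demo = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-05", "2024-04-10", "2024-07-15"]),
    "AQI": [71, 45, 90],
    "Month": [1, 4, 7],
    "Param_Ozone": [False, True, True],
    "Season": ["Winter", "Spring", "Summer"],
})

# Keep only numeric and true/false columns; dates and text are dropped.
model_df = demo.select_dtypes(include=["number", "bool"])

# Separate the target from the predictors.
y = model_df["AQI"]
X = model_df.drop(columns=["AQI"])
print(X.columns.tolist())
```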

In [None]:
#split the data into training and testing datasets
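
The 80/20 split can be sketched with scikit-learn's `train_test_split`. The `X` and `y` below are hypothetical, and `random_state=42` is an arbitrary seed chosen only to make the split repeatable:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical predictors and target: ten rows so the 80/20 split is easy to check.
X = pd.DataFrame({"Month": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
y = pd.Series([70, 65, 50, 45, 48, 60, 75, 72, 55, 40], name="AQI")

# test_size=0.2 holds out 20% of the rows; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```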

## Create Model

➡️ Assignment Tasks
- Create a linear regression model to predict AQI from as many variables as you can use or derive (for example, with sklearn's `LinearRegression`)
- Evaluate the model by displaying the R squared value  
- Visualize the correlation between the target variable and at least one of the independent variables

In [None]:
#create regression or classification model
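
Fitting the model could look like this sketch. The training data is hypothetical and constructed so that AQI equals 30 + 5 × Month exactly, which makes the fitted coefficients easy to verify:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training data in which AQI is exactly 30 + 5 * Month.
X_train = pd.DataFrame({"Month": [1, 2, 3, 4, 5, 6]})
y_train = pd.Series([35, 40, 45, 50, 55, 60], name="AQI")

model = LinearRegression()
model.fit(X_train, y_train)

# Each coefficient is the change in predicted AQI for a one-unit change in that predictor.
print(dict(zip(X_train.columns, model.coef_)), model.intercept_)
```

With real data, `X_train` would hold every prepared predictor column rather than a single one.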

In [None]:
#print the R squared value
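
R squared is usually reported on the held-out test set. A minimal sketch with hypothetical train and test data:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical train and test sets; real values come from the 80/20 split.
X_train = pd.DataFrame({"Month": [1, 2, 3, 4, 5, 6]})
y_train = pd.Series([36, 39, 46, 49, 56, 59])
X_test = pd.DataFrame({"Month": [7, 8]})
y_test = pd.Series([66, 69])

model = LinearRegression().fit(X_train, y_train)

# R squared on held-out data; model.score(X_test, y_test) returns the same value.
r2 = r2_score(y_test, model.predict(X_test))
print(f"R squared: {r2:.3f}")
```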

In [None]:
#visual
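
One way to visualize the relationship between the target and a single predictor is a scatter plot annotated with the Pearson correlation. The data below is hypothetical, and `Month` is only an example predictor:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data; in the notebook, use the prepared Utah frame instead.
demo = pd.DataFrame({
    "Month": [1, 2, 3, 4, 5, 6, 7, 8],
    "AQI": [70, 62, 45, 44, 48, 55, 58, 57],
})

# Pearson correlation between the predictor and the target.
corr = demo["Month"].corr(demo["AQI"])

fig, ax = plt.subplots()
ax.scatter(demo["Month"], demo["AQI"])
ax.set_xlabel("Month")
ax.set_ylabel("AQI")
ax.set_title(f"AQI vs. Month (r = {corr:.2f})")
plt.show()
```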

## Make a prediction

➡️ Assignment Tasks
- What would you predict the average AQI to be in January of the upcoming year?  

In [None]:
#predicted AQI
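
To predict January, build a one-row frame with the same predictor columns used in training and pass it to `predict`. In this sketch the monthly averages are hypothetical and the single predictor `Is_Winter` is an assumed simplification:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical monthly average AQI values for one year.
train = pd.DataFrame({
    "Month": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "AQI": [70, 62, 45, 44, 48, 55, 58, 57, 46, 42, 50, 60],
})

# Assumed feature: a 0/1 flag for the winter months (December through February).
train["Is_Winter"] = train["Month"].isin([12, 1, 2]).astype(int)

model = LinearRegression().fit(train[["Is_Winter"]], train["AQI"])

# January is a winter month, so the flag is 1 in the row to predict.
january = pd.DataFrame({"Is_Winter": [1]})
predicted_aqi = model.predict(january)[0]
print(f"Predicted average January AQI: {predicted_aqi:.1f}")
```

With a single 0/1 predictor, the prediction is simply the mean AQI of the winter months in the training data.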

## OPTIONAL: Compare Air Quality

➡️ Assignment Tasks
- Download the data from several previous years using this website: https://aqs.epa.gov/aqsweb/airdata/download_files.html#AQI
- Append the new data to the previous dataframe
- Use the year as a variable in your regression.  Is year a significant factor in predicting AQI?

In [None]:
#import, append and create new model
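
Appending several years and checking whether year matters could be sketched as follows. The yearly frames are hypothetical stand-ins for files loaded with `pd.read_csv`, and `scipy.stats.linregress` is used here as a simple single-predictor test; a p-value below 0.05 is a conventional, not mandatory, threshold for significance:

```python
import pandas as pd
from scipy import stats

# Hypothetical yearly extracts; in the notebook each would come from pd.read_csv.
df_2022 = pd.DataFrame({"Date": ["2022-01-01", "2022-07-01"], "AQI": [60, 52]})
df_2023 = pd.DataFrame({"Date": ["2023-01-01", "2023-07-01"], "AQI": [64, 50]})
df_2024 = pd.DataFrame({"Date": ["2024-01-01", "2024-07-01"], "AQI": [71, 58]})

# Stack the years into one frame and derive the Year column from the date.
combined = pd.concat([df_2022, df_2023, df_2024], ignore_index=True)
combined["Year"] = pd.to_datetime(combined["Date"]).dt.year

# Regress AQI on Year alone; pvalue tests whether the slope differs from zero.
result = stats.linregress(combined["Year"], combined["AQI"])
print(f"slope per year: {result.slope:.2f}, p-value: {result.pvalue:.3f}")
```

For a regression with several predictors, a library that reports per-coefficient p-values, such as statsmodels, would be a better fit than this one-variable check.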