<h1> AQI Data Preprocessing Notebook </h1>

This Python 3 notebook preprocesses Air Quality Index (AQI) data for Florida from January 1 to July 3, 2020. Its goal is to extract data from the AQI CSV file and summarize AQI data from March 26 to July 3, 2020.

<h2> Libraries </h2>

Before running the cells of this notebook, the following library must be installed in your Python environment:

- ```pandas```

Run the cell below to load the library:

In [1]:
import pandas as pd

<h1> PART 1: Loading the Data </h1>

Since the data to be preprocessed is located in one CSV file, we load that file, `2020_Daily_AQI.csv`. Among other data, it contains the AQI data for Florida from March 26 to July 3, 2020.

In [2]:
df = pd.read_csv('2020_Daily_AQI.csv')
df

Unnamed: 0,State Name,county Name,State Code,County Code,Date,AQI,Category,Defining Parameter,Defining Site,Number of Sites Reporting
0,Alabama,Baldwin,1,3,2020-01-01,48,Good,PM2.5,01-003-0010,1
1,Alabama,Baldwin,1,3,2020-01-04,13,Good,PM2.5,01-003-0010,1
2,Alabama,Baldwin,1,3,2020-01-07,14,Good,PM2.5,01-003-0010,1
3,Alabama,Baldwin,1,3,2020-01-10,39,Good,PM2.5,01-003-0010,1
4,Alabama,Baldwin,1,3,2020-01-13,29,Good,PM2.5,01-003-0010,1
...,...,...,...,...,...,...,...,...,...,...
338190,Wyoming,Weston,56,45,2020-12-27,32,Good,Ozone,56-045-0003,2
338191,Wyoming,Weston,56,45,2020-12-28,30,Good,Ozone,56-045-0003,2
338192,Wyoming,Weston,56,45,2020-12-29,33,Good,Ozone,56-045-0003,2
338193,Wyoming,Weston,56,45,2020-12-30,33,Good,Ozone,56-045-0003,2


<h1> PART 2: Removing Unnecessary Columns </h1>

First, let us look at all the columns of the dataframe.

In [3]:
df.columns

Index(['State Name', 'county Name', 'State Code', 'County Code', 'Date', 'AQI',
       'Category', 'Defining Parameter', 'Defining Site',
       'Number of Sites Reporting'],
      dtype='object')

We can drop the following fields since they are irrelevant to the analysis:
- ```Category```, since we are only concerned with the numerical AQI
- ```Defining Site```, since the monitoring site is not needed for the analysis
- ```Number of Sites Reporting```, since it is also not needed for the analysis

In [4]:
df = df.drop(['Category', 'Defining Site', 'Number of Sites Reporting'], axis = 1)
df

Unnamed: 0,State Name,county Name,State Code,County Code,Date,AQI,Defining Parameter
0,Alabama,Baldwin,1,3,2020-01-01,48,PM2.5
1,Alabama,Baldwin,1,3,2020-01-04,13,PM2.5
2,Alabama,Baldwin,1,3,2020-01-07,14,PM2.5
3,Alabama,Baldwin,1,3,2020-01-10,39,PM2.5
4,Alabama,Baldwin,1,3,2020-01-13,29,PM2.5
...,...,...,...,...,...,...,...
338190,Wyoming,Weston,56,45,2020-12-27,32,Ozone
338191,Wyoming,Weston,56,45,2020-12-28,30,Ozone
338192,Wyoming,Weston,56,45,2020-12-29,33,Ozone
338193,Wyoming,Weston,56,45,2020-12-30,33,Ozone
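
As an alternative to dropping columns after loading, the same result can be reached by reading only the needed columns. The sketch below assumes the column names shown above and uses a two-row in-memory sample (copied from the output above) in place of the real file; `csv_text`, `needed_columns`, and `sample` are illustrative names only.

```python
import io

import pandas as pd

# Two rows copied from the displayed output, standing in for 2020_Daily_AQI.csv
csv_text = (
    "State Name,county Name,State Code,County Code,Date,AQI,Category,"
    "Defining Parameter,Defining Site,Number of Sites Reporting\n"
    "Alabama,Baldwin,1,3,2020-01-01,48,Good,PM2.5,01-003-0010,1\n"
    "Alabama,Baldwin,1,3,2020-01-04,13,Good,PM2.5,01-003-0010,1\n"
)

# Columns kept by the notebook after the drop step
needed_columns = [
    'State Name', 'county Name', 'State Code', 'County Code',
    'Date', 'AQI', 'Defining Parameter',
]

# usecols restricts parsing to the listed columns; file order is preserved
sample = pd.read_csv(io.StringIO(csv_text), usecols=needed_columns)
```

With the real file, `io.StringIO(csv_text)` would be replaced by the path to the CSV.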


<h1> PART 3: Removing Unnecessary Rows </h1>

Although we have dropped the irrelevant columns, irrelevant rows remain. Removing them leaves only the data we need.

In [5]:
# check the current number of rows
# (rows, columns)

df.shape

(338195, 7)

<h3> Defining Parameter </h3>

We are only concerned with AQI values whose defining parameter is PM2.5, so we can drop all rows that have a different defining parameter.

In [6]:
df.drop(df.index[df['Defining Parameter'] != 'PM2.5'], inplace=True)
df

Unnamed: 0,State Name,county Name,State Code,County Code,Date,AQI,Defining Parameter
0,Alabama,Baldwin,1,3,2020-01-01,48,PM2.5
1,Alabama,Baldwin,1,3,2020-01-04,13,PM2.5
2,Alabama,Baldwin,1,3,2020-01-07,14,PM2.5
3,Alabama,Baldwin,1,3,2020-01-10,39,PM2.5
4,Alabama,Baldwin,1,3,2020-01-13,29,PM2.5
...,...,...,...,...,...,...,...
337380,Wyoming,Teton,56,39,2020-10-10,70,PM2.5
337436,Wyoming,Teton,56,39,2020-12-05,43,PM2.5
337439,Wyoming,Teton,56,39,2020-12-08,52,PM2.5
337442,Wyoming,Teton,56,39,2020-12-11,36,PM2.5
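
The same row filter can also be written as a boolean mask that keeps the matching rows instead of dropping the non-matching ones. This is a minimal sketch on a made-up three-row frame (`sample` and `pm25` are illustrative names, and the values are not from the dataset's real row combinations).

```python
import pandas as pd

# Small made-up frame standing in for the AQI data
sample = pd.DataFrame({
    'county Name': ['Baldwin', 'Baldwin', 'Weston'],
    'AQI': [48, 13, 32],
    'Defining Parameter': ['PM2.5', 'Ozone', 'Ozone'],
})

# Keep only PM2.5 rows; .copy() makes the result an independent DataFrame
pm25 = sample[sample['Defining Parameter'] == 'PM2.5'].copy()
```

Taking an explicit copy is optional here, but it keeps later column assignments on the filtered frame from being flagged by pandas as writes to a slice.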


<h3> Date </h3>

Since the AQI data we need is from January 1 to July 3, 2020 only, we discard the data for other dates.

In [7]:
dates = []

# for January
month = '01'
for day in range(1, 10):
    dates.append('2020-' + month + '-0' + str(day))
for day in range(10, 32):
    dates.append('2020-' + month + '-' + str(day))

# for February
month = '02'
for day in range(1, 10):
    dates.append('2020-' + month + '-0' + str(day))
for day in range(10, 30):
    dates.append('2020-' + month + '-' + str(day))

# for March
month = '03'
for day in range(1, 10):
    dates.append('2020-' + month + '-0' + str(day))
for day in range(10, 32):
    dates.append('2020-' + month + '-' + str(day))
    
# for April
month = '04'
for day in range(1, 10):
    dates.append('2020-' + month + '-0' + str(day))
for day in range(10, 31):
    dates.append('2020-' + month + '-' + str(day))
    
# for May
month = '05'
for day in range(1, 10):
    dates.append('2020-' + month + '-0' + str(day))
for day in range(10, 32):
    dates.append('2020-' + month + '-' + str(day))
    
# for June
month = '06'
for day in range(1, 10):
    dates.append('2020-' + month + '-0' + str(day))
for day in range(10, 31):
    dates.append('2020-' + month + '-' + str(day))
    
# for July
month = '07'
for day in range(1, 4):
    dates.append('2020-' + month + '-0' + str(day))
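
The list of dates built by the loops above can also be generated with `pd.date_range`. The sketch below is an equivalent alternative, not the notebook's original method; `date_index` and `dates_alt` are illustrative names.

```python
import pandas as pd

# Inclusive daily range from January 1 to July 3, 2020
date_index = pd.date_range(start='2020-01-01', end='2020-07-03', freq='D')

# Convert to the YYYY-MM-DD strings used in the Date column
dates_alt = date_index.strftime('%Y-%m-%d').tolist()
```

Because the range is calendar-aware, February 29 of the leap year 2020 is included without a hard-coded month length.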

In [8]:
df = df[df.Date.isin(dates)]
df

Unnamed: 0,State Name,county Name,State Code,County Code,Date,AQI,Defining Parameter
0,Alabama,Baldwin,1,3,2020-01-01,48,PM2.5
1,Alabama,Baldwin,1,3,2020-01-04,13,PM2.5
2,Alabama,Baldwin,1,3,2020-01-07,14,PM2.5
3,Alabama,Baldwin,1,3,2020-01-10,39,PM2.5
4,Alabama,Baldwin,1,3,2020-01-13,29,PM2.5
...,...,...,...,...,...,...,...
336017,Wyoming,Sheridan,56,33,2020-01-19,40,PM2.5
336026,Wyoming,Sheridan,56,33,2020-01-28,34,PM2.5
336050,Wyoming,Sheridan,56,33,2020-02-21,61,PM2.5
336077,Wyoming,Sheridan,56,33,2020-03-19,41,PM2.5


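<h3> Reformat Dates </h3>

The `Date` values are shortened from `YYYY-MM-DD` to `MM/DD`, since every remaining row falls in 2020.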

In [9]:
def reformat_date(date):
    temp = date.split('-')
    return temp[1] + '/' + temp[2]
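
The conversion done by `reformat_date` can also be expressed with pandas' datetime accessors. This is a sketch on a made-up Series, assuming the input strings are always in `YYYY-MM-DD` form; `sample_dates` and `reformatted` are illustrative names.

```python
import pandas as pd

# Made-up sample of date strings in the same format as the Date column
sample_dates = pd.Series(['2020-01-03', '2020-06-29', '2020-07-02'])

# Parse the strings as dates, then format them as MM/DD
reformatted = pd.to_datetime(sample_dates, format='%Y-%m-%d').dt.strftime('%m/%d')
```

Parsing with an explicit `format` also surfaces malformed date strings as errors, which the string-splitting approach would pass through silently.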

In [10]:
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df = df.assign(Date=df["Date"].apply(reformat_date))


<h3> Location </h3>

Since the AQI data we need is from Florida only, we discard the data from other locations.

In [11]:
# Florida state code is 12

df = df[df['State Code'] == 12]
df

Unnamed: 0,State Name,county Name,State Code,County Code,Date,AQI,Defining Parameter
45790,Florida,Alachua,12,1,01/03,35,PM2.5
45795,Florida,Alachua,12,1,01/08,52,PM2.5
45796,Florida,Alachua,12,1,01/09,48,PM2.5
45798,Florida,Alachua,12,1,01/11,38,PM2.5
45799,Florida,Alachua,12,1,01/12,37,PM2.5
...,...,...,...,...,...,...,...
59681,Florida,Wakulla,12,129,06/29,60,PM2.5
59682,Florida,Wakulla,12,129,06/30,58,PM2.5
59683,Florida,Wakulla,12,129,07/01,54,PM2.5
59684,Florida,Wakulla,12,129,07/02,64,PM2.5
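
Since the filter relies on the numeric state code, a quick sanity check is to confirm that the selected rows all carry the expected state name. The sketch below uses a made-up three-row frame; `sample`, `florida`, and `state_names` are illustrative names.

```python
import pandas as pd

# Made-up frame with one non-Florida row and two Florida rows
sample = pd.DataFrame({
    'State Name': ['Alabama', 'Florida', 'Florida'],
    'county Name': ['Baldwin', 'Alachua', 'Wakulla'],
    'State Code': [1, 12, 12],
})

# Same filter as in the notebook: state code 12
florida = sample[sample['State Code'] == 12]

# Distinct state names among the selected rows
state_names = florida['State Name'].unique().tolist()
```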


We can now drop the `State Name`, `State Code`, and `County Code` fields. The state fields are redundant because every remaining row belongs to Florida. The county codes are dropped because they are inconsistent with the county codes in the traffic data, which was preprocessed using county names.

In [12]:
df = df.drop(['State Name','State Code', 'County Code'], axis = 1)

# rename returns a new DataFrame; combining the assignment with inplace=True would set df to None
df = df.rename(columns = {'county Name':'COUNTY', 'Date':'DATE', 'Defining Parameter':'PARAMETER'})

df

Unnamed: 0,COUNTY,DATE,AQI,PARAMETER
45790,Alachua,01/03,35,PM2.5
45795,Alachua,01/08,52,PM2.5
45796,Alachua,01/09,48,PM2.5
45798,Alachua,01/11,38,PM2.5
45799,Alachua,01/12,37,PM2.5
...,...,...,...,...
59681,Wakulla,06/29,60,PM2.5
59682,Wakulla,06/30,58,PM2.5
59683,Wakulla,07/01,54,PM2.5
59684,Wakulla,07/02,64,PM2.5
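
When renaming columns, it helps to remember that `DataFrame.rename` either returns a new frame or, with `inplace=True`, modifies the frame and returns `None`, so only one of the two styles should be used at a time. The sketch below shows both behaviors on a made-up one-row frame; all variable names are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({'county Name': ['Alachua'], 'Date': ['01/03'], 'AQI': [35]})

# Without inplace, rename returns a new DataFrame with the new labels
renamed = sample.rename(columns={'county Name': 'COUNTY', 'Date': 'DATE'})

# With inplace=True, the frame itself is modified and the call returns None
in_place_result = sample.rename(columns={'AQI': 'aqi'}, inplace=True)
```

Assigning the result of an `inplace=True` call back to the variable would therefore replace the DataFrame with `None`.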


<h1> PART 4: Exporting Data </h1>

We can now export the preprocessed data to a new CSV file.

In [13]:
df.to_csv('../florida_aqi_data.csv', encoding = 'utf-8', index = False)
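
One way to check an export like this is to read the file back and compare it with what was written. The sketch below does this in a temporary directory with a made-up two-row frame, assuming the renamed column layout `COUNTY`, `DATE`, `AQI`, `PARAMETER`; the file name `florida_aqi_sample.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Made-up frame in the assumed output layout
sample = pd.DataFrame({
    'COUNTY': ['Alachua', 'Wakulla'],
    'DATE': ['01/03', '07/02'],
    'AQI': [35, 64],
    'PARAMETER': ['PM2.5', 'PM2.5'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'florida_aqi_sample.csv')
    # index=False keeps the row index out of the file
    sample.to_csv(path, encoding='utf-8', index=False)
    # Read DATE as text so the MM/DD strings are preserved as written
    round_trip = pd.read_csv(path, dtype={'DATE': str})
```

Writing with `index = False`, as the notebook does, is what prevents an extra unnamed index column from appearing when the file is read again.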