# Florida Traffic and Air Quality Index Analysis Notebook

This Python notebook covers the following steps of the CS 132 21.2 project:
- Cleaning and processing traffic and AQI data further
- Exploratory data analysis
- Using linear regression to model how AQI changes as the number of cars increases
- Visualizing the data

## Preliminaries

This notebook uses the external libraries listed below. Make sure they are installed on your machine before running the subsequent cells. These libraries were installed via `pip`: `pip install <library-name>`
- `pandas`

Run the cell below to load all dependencies and libraries.

In [1]:
# Put the import statements for all the libraries you will use here
import pandas as pd

## Loading Datasets

We define two dataframe variables, which will be used throughout the duration of this project:
- `traffic_data`: pertains to the traffic data in Florida
- `airq_data`: pertains to the air quality data in Florida

In [2]:
traffic_data = pd.read_csv('florida_traffic_data.csv')
airq_data = pd.read_csv('florida_aqi_data.csv')

In [3]:
traffic_data.head()

Unnamed: 0,COUNTY,03/26,03/27,03/28,03/29,03/30,03/31,04/01,04/02,04/03,...,06/28,06/29,06/30,07/01,07/02,07/03,07/04,07/05,07/06,07/07
0,Charlotte,82672,186611,57054,45057,10754599,73959,76644,78887,67982,...,80747,98897,99327,102682,111925,103217,71816,81094,99167,96529
1,Citrus,31192,78843,25315,20754,3692189,29950,31750,32386,29249,...,27751,35886,38459,43349,46666,45621,34863,31295,40701,42047
2,Collier,184823,409884,124242,95336,26338889,170020,171818,170158,159620,...,158620,216490,214114,220510,234920,218501,147360,165241,213985,207772
3,Desoto,18119,43706,13530,11630,2028471,17274,17320,17858,15583,...,14313,18203,18602,18885,20446,18055,12816,14553,18841,17845
4,Glades,3744,8605,2688,2096,408545,3580,3515,3797,3449,...,2771,3963,3847,4235,4159,3614,2638,2901,3808,3913


In [4]:
airq_data.head()

Unnamed: 0,county Name,Date,AQI,Defining Parameter
0,Alachua,03/26,56,PM2.5
1,Alachua,03/27,50,PM2.5
2,Alachua,03/28,49,PM2.5
3,Alachua,03/30,39,PM2.5
4,Alachua,03/31,39,PM2.5


# PART 1: Preprocessing Data

The air quality data covers only certain Florida counties. We therefore filter the traffic data so that it is consistent with the counties recorded in the air quality data.

In [5]:
#determine counties in the air quality data
county_airq = airq_data['county Name'].unique()

county_airq

array(['Alachua', 'Baker', 'Bay', 'Brevard', 'Broward', 'Citrus',
       'Collier', 'Columbia', 'Duval', 'Escambia', 'Hamilton',
       'Hillsborough', 'Holmes', 'Lee', 'Leon', 'Marion', 'Martin',
       'Miami-Dade', 'Orange', 'Palm Beach', 'Pinellas', 'Polk',
       'Santa Rosa', 'Sarasota', 'Seminole', 'Volusia', 'Wakulla'],
      dtype=object)

In [6]:
#determine counties in traffic data
county_traffic = traffic_data['COUNTY'].unique()

county_traffic

array(['Charlotte', 'Citrus', 'Collier', 'Desoto', 'Glades', 'Hendry',
       'Hernando', 'Highlands', 'Hillsborough', 'Lake', 'Lee', 'Manatee',
       'Pasco', 'Pinellas', 'Polk', 'Sarasota', 'Sumter', 'Alachua',
       'Baker', 'Bradford', 'Columbia', 'Dixie', 'Hamilton', 'Lafayette',
       'Levy', 'Madison', 'Marion', 'Suwannee', 'Taylor', 'Bay',
       'Calhoun', 'Escambia', 'Franklin', 'Gadsden', 'Gulf', 'Holmes',
       'Jackson', 'Jefferson', 'Leon', 'Liberty', 'Okaloosa',
       'Santa Rosa', 'Wakulla', 'Walton', 'Washington', 'Brevard', 'Clay',
       'Duval', 'Flagler', 'Nassau', 'Orange', 'Putnam', 'Seminole',
       'St. Johns', 'Volusia', 'Broward', 'Miami-Dade', 'Indian River',
       'Martin', 'Monroe', 'Osceola', 'Palm Beach', 'St. Lucie',
       "Florida's Turnpike"], dtype=object)

In [7]:
#determine counties that are common for both datasets
common_counties = list(set(county_airq).intersection(set(county_traffic)))

print(common_counties)

['Polk', 'Hamilton', 'Lee', 'Palm Beach', 'Miami-Dade', 'Holmes', 'Duval', 'Martin', 'Wakulla', 'Sarasota', 'Pinellas', 'Baker', 'Citrus', 'Leon', 'Seminole', 'Alachua', 'Volusia', 'Santa Rosa', 'Bay', 'Broward', 'Marion', 'Escambia', 'Orange', 'Collier', 'Hillsborough', 'Brevard', 'Columbia']


The variable `common_counties` holds the Florida counties that appear in both the air quality and traffic data. We can now filter both datasets down to these counties.

In [8]:
traffic_data = traffic_data[traffic_data['COUNTY'].isin(common_counties)]

# sort_values returns a new dataframe, so the filtered slice is not modified in place
traffic_data = traffic_data.sort_values('COUNTY')
traffic_data

Unnamed: 0,COUNTY,03/26,03/27,03/28,03/29,03/30,03/31,04/01,04/02,04/03,...,06/28,06/29,06/30,07/01,07/02,07/03,07/04,07/05,07/06,07/07
17,Alachua,92834,245747,80112,68083,11861297,85510,89802,91746,88081,...,116049,125297,122412,131997,143193,137491,90080,118842,131855,120127
18,Baker,2631,8048,2735,2389,250426,2517,2567,2702,2645,...,2759,2965,2929,3014,3288,3183,2641,2550,2857,2707
29,Bay,106805,248579,76060,61116,11912506,95038,105501,103244,92143,...,110211,138276,154222,139730,166732,155771,124145,123179,133025,145192
45,Brevard,168556,541426,171251,142858,28758061,205848,207852,216241,193054,...,252032,287127,291592,295687,320577,305800,215883,242171,281311,275538
55,Broward,495004,1132417,351257,278898,72378812,466482,464936,460035,474892,...,562490,715555,734498,736393,764052,712455,522985,494879,697840,709431
1,Citrus,31192,78843,25315,20754,3692189,29950,31750,32386,29249,...,27751,35886,38459,43349,46666,45621,34863,31295,40701,42047
2,Collier,184823,409884,124242,95336,26338889,170020,171818,170158,159620,...,158620,216490,214114,220510,234920,218501,147360,165241,213985,207772
20,Columbia,81774,221302,71068,64730,9874912,70527,73902,78359,68961,...,100233,115280,108944,122022,144884,139546,87239,136401,126308,105119
47,Duval,378195,885404,278531,219491,41514840,349028,354745,366348,334882,...,224184,318008,405390,426355,445997,455776,322478,335327,471132,471482
31,Escambia,188975,435500,134950,106435,22015349,169279,183747,180247,166519,...,201671,242656,246983,254142,290255,267312,196559,198003,253754,245818


In [9]:
airq_data = airq_data[airq_data['county Name'].isin(common_counties)]

airq_data

Unnamed: 0,county Name,Date,AQI,Defining Parameter
0,Alachua,03/26,56,PM2.5
1,Alachua,03/27,50,PM2.5
2,Alachua,03/28,49,PM2.5
3,Alachua,03/30,39,PM2.5
4,Alachua,03/31,39,PM2.5
...,...,...,...,...
1204,Wakulla,06/29,60,PM2.5
1205,Wakulla,06/30,58,PM2.5
1206,Wakulla,07/01,54,PM2.5
1207,Wakulla,07/02,64,PM2.5


# PART 2: Exploratory Data Analysis

The air quality and traffic data for the selected Florida counties have now been preprocessed and cleaned. Next, we can visualize the trends in both datasets for these counties.

## PART 2.1: Filtering DataFrame by County

We can then filter our dataframes according to a specific county. Let us define a function `filter_data` that performs the following tasks:
- Filter the air quality dataset to a specified county
- Filter the traffic dataset to a specified county
- Keep only the dates recorded in both datasets, since some dates are missing from either the traffic data or the air quality data
- Combine datasets into a single dataframe

In [10]:
def filter_data(traffic, air_quality, county):
    # Rows for the selected county in each dataset
    aqi_rows = air_quality[air_quality['county Name'] == county]
    traffic_rows = traffic[traffic['COUNTY'] == county]

    # Dates are column names in the traffic data and values in the AQI data
    traffic_dates = list(traffic_rows.columns[1:])
    aqi_dates = list(aqi_rows['Date'])

    # Keep only dates present in both datasets (MM/DD strings sort chronologically)
    common_dates = sorted(set(traffic_dates).intersection(aqi_dates))

    # Sort the AQI rows by date so they line up with common_dates
    aqi_rows = aqi_rows[aqi_rows['Date'].isin(common_dates)].sort_values('Date')
    to_return = aqi_rows[['Date', 'AQI']].copy(deep=True)

    # Look up the county's traffic volume for each common date
    to_return['traffic_volume'] = [traffic_rows[d].values[0] for d in common_dates]

    return to_return.reset_index(drop=True)
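
As an alternative to looking up each date in a loop, the same alignment could be sketched with a reshape and a merge. This is only a sketch under assumptions: the toy dataframes and the names `traffic_long`, `combined`, and `alachua` are hypothetical, and it assumes the traffic dataframe contains only the `COUNTY` column plus one column per date.

```python
import pandas as pd

# Toy data in the same shape as the notebook's dataframes (values are made up)
toy_traffic = pd.DataFrame({
    'COUNTY': ['Alachua', 'Baker'],
    '03/26': [100, 10],
    '03/27': [200, 20],
    '03/28': [300, 30],
})
toy_airq = pd.DataFrame({
    'county Name': ['Alachua', 'Alachua', 'Baker'],
    'Date': ['03/26', '03/28', '03/27'],
    'AQI': [56, 49, 40],
})

# Reshape traffic from one column per date to one row per (county, date)
traffic_long = toy_traffic.melt(
    id_vars='COUNTY', var_name='Date', value_name='traffic_volume'
)

# An inner merge keeps only (county, date) pairs present in both datasets
combined = toy_airq.merge(
    traffic_long,
    left_on=['county Name', 'Date'],
    right_on=['COUNTY', 'Date'],
    how='inner',
).drop(columns='COUNTY')

# Select one county, mirroring the columns that filter_data returns
alachua = combined.loc[
    combined['county Name'] == 'Alachua', ['Date', 'AQI', 'traffic_volume']
].reset_index(drop=True)
print(alachua)
```

One possible benefit of this layout is that `combined` holds every county at once, so per-county views become simple row selections instead of repeated function calls.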

We can test our newly created function on a specific county. **For the cells below, you can change the value of the `county` variable and run the cell to see the results for that county.** The list of counties common to both datasets is stored in the `common_counties` variable.

In [11]:
print(common_counties)

['Polk', 'Hamilton', 'Lee', 'Palm Beach', 'Miami-Dade', 'Holmes', 'Duval', 'Martin', 'Wakulla', 'Sarasota', 'Pinellas', 'Baker', 'Citrus', 'Leon', 'Seminole', 'Alachua', 'Volusia', 'Santa Rosa', 'Bay', 'Broward', 'Marion', 'Escambia', 'Orange', 'Collier', 'Hillsborough', 'Brevard', 'Columbia']


In [12]:
#You may change the county variable according to a value in the common_counties variable
county = "Wakulla"

test_var = filter_data(traffic_data, airq_data, county)

test_var

Unnamed: 0,Date,AQI,traffic_volume
0,03/26,58,14785
1,03/27,43,40048
2,03/28,46,12911
3,03/29,42,10903
4,03/30,58,1596578
5,04/03,62,14799
6,04/04,76,11341
7,04/05,60,9052
8,04/08,39,13503
9,04/09,59,13997
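
The 03/30 traffic volume in the output above is roughly two orders of magnitude larger than the surrounding days. The notebook does not say why, so whether it is an error or an aggregate is an open question. A minimal sketch for flagging such values before modeling is shown below. The 10x-median rule and the `suspect` column are assumptions chosen for illustration, not part of the original analysis.

```python
import pandas as pd

# First rows of the Wakulla output shown above
sample = pd.DataFrame({
    'Date': ['03/26', '03/27', '03/28', '03/29', '03/30', '04/03'],
    'AQI': [58, 43, 46, 42, 58, 62],
    'traffic_volume': [14785, 40048, 12911, 10903, 1596578, 14799],
})

# Hypothetical rule: flag days whose volume exceeds 10 times the median
threshold = 10 * sample['traffic_volume'].median()
sample['suspect'] = sample['traffic_volume'] > threshold

print(sample[sample['suspect']])
```

On these six rows only 03/30 is flagged. A different threshold may suit the full dataset better.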
