# Traffic Data Preprocessing Notebook

This Python 3 notebook preprocesses traffic data for Florida from March 26 to July 3, 2020. The goal is to extract data from several CSV and Excel files and summarize traffic by county over that period.

## Libraries

Before running the cells of this notebook, install the following library in your Python environment:
- `pandas`

Run the cell below to load the library.

In [1]:
import pandas as pd

# PART 1: Preprocessing One File

Before processing the other traffic data, we can explore and preprocess one file first. The insights and techniques applied to this file can then be reused for the other data files. Consider `0401.csv`, which contains traffic data for all counties of Florida on April 1, 2020.

In [2]:
#load the contents of April 1, 2020 CSV file
df = pd.read_csv('0401.csv')
df

Unnamed: 0,COUNTY,SITE,BEGDATE,DIR,HR1,HR2,HR3,HR4,HR5,HR6,...,HR20,HR21,HR22,HR23,HR24,TOTVOL,PEAKHR,PEAKVOL,TYPE,TRUCKS
0,93,10,4/1/2020,N,25,13,9,7,9,33,...,347,252,147,98,73,7880,14,662,,
1,93,10,4/1/2020,S,31,17,8,7,11,40,...,347,259,152,113,62,7791,15,645,,
2,87,31,4/1/2020,E,75,46,36,38,113,413,...,616,467,370,232,120,15053,8,1543,,
3,87,31,4/1/2020,W,122,52,32,25,51,151,...,763,493,477,291,210,14595,17,1570,,
4,29,37,4/1/2020,E,7,5,15,16,29,75,...,102,77,43,31,19,2883,9,223,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
525,28,9963,4/1/2020,S,104,92,79,92,166,281,...,351,257,207,158,96,8298,8,622,,
526,93,9964,4/1/2020,N,44,8,20,37,62,146,...,193,143,97,62,51,3840,18,339,,
527,93,9964,4/1/2020,S,28,13,24,30,50,119,...,229,169,119,60,57,4698,16,365,,
528,93,9965,4/1/2020,N,57,46,74,97,118,146,...,129,121,97,94,64,4166,16,310,,


In [3]:
#see all columns of the dataframe
df.columns

Index(['COUNTY', 'SITE', 'BEGDATE', 'DIR', 'HR1', 'HR2', 'HR3', 'HR4', 'HR5',
       'HR6', 'HR7', 'HR8', 'HR9', 'HR10', 'HR11', 'HR12', 'HR13', 'HR14',
       'HR15', 'HR16', 'HR17', 'HR18', 'HR19', 'HR20', 'HR21', 'HR22', 'HR23',
       'HR24', 'TOTVOL', 'PEAKHR', 'PEAKVOL', 'TYPE', 'TRUCKS'],
      dtype='object')

We can drop the following fields, since they are not needed for this analysis:
- `BEGDATE`, since its value is the same in every row of this file
- `HR1`, `HR2`, `HR3`, ..., `HR24`, since we are only concerned with the total volume, which the data frame already provides in the `TOTVOL` field: the total volume of cars at a particular county and site for that day, i.e. the sum of `HR1`, `HR2`, ..., `HR24`
- `TYPE` and `TRUCKS`, since we are only concerned with the total volume and not with the count of trucks at a particular county and site
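
The two assumptions behind this list (a single `BEGDATE` value, and `TOTVOL` equal to the sum of the hourly counts) can be checked before anything is dropped. The sketch below is a minimal illustration on a toy frame with invented numbers, not the contents of `0401.csv`; the names `hour_cols`, `toy`, `single_date`, `totvol_matches`, and `reduced` are hypothetical and do not appear in the original notebook.

```python
import pandas as pd

# Names of the 24 hourly columns, HR1 through HR24, built programmatically.
hour_cols = [f"HR{h}" for h in range(1, 25)]

# Toy stand-in for the real file: two rows with invented counts (not real data).
toy = pd.DataFrame({
    "COUNTY": [93, 93],
    "SITE": [10, 10],
    "BEGDATE": ["4/1/2020", "4/1/2020"],
    "DIR": ["N", "S"],
    **{col: [1, 2] for col in hour_cols},
    "TOTVOL": [24, 48],
})

# Check 1: BEGDATE holds a single value, so it carries no information here.
single_date = toy["BEGDATE"].nunique() == 1

# Check 2: TOTVOL matches the row-wise sum of the hourly columns.
totvol_matches = (toy[hour_cols].sum(axis=1) == toy["TOTVOL"]).all()

# If both checks pass, the columns can be dropped without losing information.
reduced = toy.drop(columns=["BEGDATE", *hour_cols])
print(single_date, totvol_matches, list(reduced.columns))
```

On the real data, the same two expressions could be evaluated on `df` in place of `toy`. Building the column list with a comprehension also avoids typing all 24 names by hand.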

In [4]:
df = df.drop(['BEGDATE', 'HR1', 'HR2', 'HR3', 'HR4', 'HR5', 'HR6', 'HR7', 'HR8', 'HR9', 'HR10', 'HR11', 'HR12',
         'HR13', 'HR14', 'HR15', 'HR16', 'HR17', 'HR18', 'HR19', 'HR20', 'HR21', 'HR22', 'HR23', 'HR24', 
              'TYPE', 'TRUCKS'], axis = 1)

df

Unnamed: 0,COUNTY,SITE,DIR,TOTVOL,PEAKHR,PEAKVOL
0,93,10,N,7880,14,662
1,93,10,S,7791,15,645
2,87,31,E,15053,8,1543
3,87,31,W,14595,17,1570
4,29,37,E,2883,9,223
...,...,...,...,...,...,...
525,28,9963,S,8298,8,622
526,93,9964,N,3840,18,339
527,93,9964,S,4698,16,365
528,93,9965,N,4166,16,310


The remaining columns can be analyzed further to see whether any fields can be *removed* or *grouped*.
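
As one possible way to group the remaining fields, the per-direction rows could be summed into a single daily total per county. This is only a sketch, under the assumption that county-level totals are the target; the notebook's stated goal suggests this, but this section does not confirm it. It uses the first five rows shown in the output above, and `sample` and `county_totals` are hypothetical names.

```python
import pandas as pd

# First five rows of the reduced frame, copied from the output above.
sample = pd.DataFrame({
    "COUNTY": [93, 93, 87, 87, 29],
    "SITE": [10, 10, 31, 31, 37],
    "DIR": ["N", "S", "E", "W", "E"],
    "TOTVOL": [7880, 7791, 15053, 14595, 2883],
})

# Sum TOTVOL over all sites and directions within each county.
county_totals = sample.groupby("COUNTY", as_index=False)["TOTVOL"].sum()
print(county_totals)
```

Grouping this way discards `SITE` and `DIR`, so it is only appropriate if per-site or per-direction detail is not needed later.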