# Traffic Data Preprocessing Notebook

This Python 3 notebook preprocesses traffic data in Florida from March 26 to July 3, 2020. Its goal is to extract data from several CSV and Excel files and summarize the daily traffic volume of each county over that period.

## Libraries

Before running the cells of this notebook, the following libraries must be installed in your Python environment:
- `pandas`
- `tqdm`
- `openpyxl`
- `xlrd`

These libraries can be installed via `pip`: `pip install <library-name>`. Run the cell below to load the libraries used in this notebook.

In [1]:
import pandas as pd
import os
from tqdm.notebook import tqdm

# PART 1: Preprocessing One File

Before processing the other traffic data, we explore and preprocess one file first. The insights and techniques from this file can then be applied to the other data files. Consider `0401.csv`, which holds traffic data for all counties of Florida on April 1, 2020.

In [2]:
#load the contents of April 1, 2020 CSV file
df = pd.read_csv('0401.csv')
df

Unnamed: 0,COUNTY,SITE,BEGDATE,DIR,HR1,HR2,HR3,HR4,HR5,HR6,...,HR20,HR21,HR22,HR23,HR24,TOTVOL,PEAKHR,PEAKVOL,TYPE,TRUCKS
0,93,10,4/1/2020,N,25,13,9,7,9,33,...,347,252,147,98,73,7880,14,662,,
1,93,10,4/1/2020,S,31,17,8,7,11,40,...,347,259,152,113,62,7791,15,645,,
2,87,31,4/1/2020,E,75,46,36,38,113,413,...,616,467,370,232,120,15053,8,1543,,
3,87,31,4/1/2020,W,122,52,32,25,51,151,...,763,493,477,291,210,14595,17,1570,,
4,29,37,4/1/2020,E,7,5,15,16,29,75,...,102,77,43,31,19,2883,9,223,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
525,28,9963,4/1/2020,S,104,92,79,92,166,281,...,351,257,207,158,96,8298,8,622,,
526,93,9964,4/1/2020,N,44,8,20,37,62,146,...,193,143,97,62,51,3840,18,339,,
527,93,9964,4/1/2020,S,28,13,24,30,50,119,...,229,169,119,60,57,4698,16,365,,
528,93,9965,4/1/2020,N,57,46,74,97,118,146,...,129,121,97,94,64,4166,16,310,,


In [3]:
#see all columns of the dataframe
df.columns

Index(['COUNTY', 'SITE', 'BEGDATE', 'DIR', 'HR1', 'HR2', 'HR3', 'HR4', 'HR5',
       'HR6', 'HR7', 'HR8', 'HR9', 'HR10', 'HR11', 'HR12', 'HR13', 'HR14',
       'HR15', 'HR16', 'HR17', 'HR18', 'HR19', 'HR20', 'HR21', 'HR22', 'HR23',
       'HR24', 'TOTVOL', 'PEAKHR', 'PEAKVOL', 'TYPE', 'TRUCKS'],
      dtype='object')

We can drop the following fields since they are not needed for this analysis:
- `BEGDATE`, since it has the same value in every row of this file
- `HR1`, `HR2`, `HR3`, ..., `HR24`, since we are only concerned with the total volume, which is the sum of these hourly counts
- `TYPE` and `TRUCKS`, since we are only concerned with the total volume and not with the count of trucks at a particular county and site

The data frame already has a `TOTVOL` field corresponding to the total volume of cars at a particular county and site for that day.
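The hourly column names can also be generated instead of typed out, and the claim that `TOTVOL` is the sum of the hourly counts can be checked before dropping them. The following is a minimal sketch on a made-up toy frame with only three hourly columns; the names `toy`, `hour_cols`, `totals_match`, and `trimmed` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy frame with 3 hourly columns instead of 24 (illustrative values only)
toy = pd.DataFrame({
    "COUNTY": [93, 93, 87],
    "SITE": [10, 10, 31],
    "HR1": [25, 31, 75],
    "HR2": [13, 17, 46],
    "HR3": [9, 8, 36],
    "TOTVOL": [47, 56, 157],
})

n_hours = 3  # would be 24 for the real files
hour_cols = [f"HR{h}" for h in range(1, n_hours + 1)]

# Check that TOTVOL matches the row-wise sum of the hourly columns
totals_match = toy[hour_cols].sum(axis=1).eq(toy["TOTVOL"]).all()

# Drop the hourly columns using the generated list
trimmed = toy.drop(columns=hour_cols)
```

If `totals_match` turned out to be false for a real file, the hourly columns would be worth keeping until the discrepancy is understood.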

In [4]:
df = df.drop(['BEGDATE', 'HR1', 'HR2', 'HR3', 'HR4', 'HR5', 'HR6', 'HR7', 'HR8', 'HR9', 'HR10', 'HR11', 'HR12',
         'HR13', 'HR14', 'HR15', 'HR16', 'HR17', 'HR18', 'HR19', 'HR20', 'HR21', 'HR22', 'HR23', 'HR24', 
              'TYPE', 'TRUCKS'], axis = 1)

df

Unnamed: 0,COUNTY,SITE,DIR,TOTVOL,PEAKHR,PEAKVOL
0,93,10,N,7880,14,662
1,93,10,S,7791,15,645
2,87,31,E,15053,8,1543
3,87,31,W,14595,17,1570
4,29,37,E,2883,9,223
...,...,...,...,...,...,...
525,28,9963,S,8298,8,622
526,93,9964,N,3840,18,339
527,93,9964,S,4698,16,365
528,93,9965,N,4166,16,310


We can sum the `TOTVOL` variable per county using the `groupby` method.

In [5]:
# select TOTVOL before summing so that only the needed column is aggregated
total_volume_per_county = df.groupby('COUNTY')['TOTVOL'].sum()

total_volume_per_county

COUNTY
1      76644
2      31750
3     171818
4      17320
5       3515
       ...  
90     47043
92     44636
93    517230
94     49875
97    398490
Name: TOTVOL, Length: 64, dtype: int64

In [6]:
#we can then convert total_volume_per_county into a list
lis = list(total_volume_per_county.array)

print(lis)

[76644, 31750, 171818, 17320, 3515, 27978, 26869, 12442, 510096, 45578, 107463, 70901, 125935, 114774, 183116, 97743, 32885, 89802, 2567, 17948, 73902, 2261, 56487, 4005, 36168, 21798, 127293, 22348, 1761, 105501, 15360, 183747, 4078, 34553, 4570, 1864, 49830, 20461, 108567, 2518, 125422, 19397, 15289, 33758, 11994, 207852, 47306, 354745, 22836, 61067, 216261, 27445, 111686, 67214, 132984, 464936, 333001, 22925, 113821, 47043, 44636, 517230, 49875, 398490]


# PART 2: Determining All Counties Available in All Data

Before processing traffic data across all files, we have to determine the counties that appear in any of the files. In the code cell below, we temporarily load every CSV and Excel file in the working directory, collect the county codes each one contains, and add them to the `counties` variable.
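As an alternative to skipping known non-data files by name, the data files could be selected by extension. The sketch below assumes that every `.csv`, `.xls`, and `.xlsx` file in the folder is a traffic data file; `DATA_EXTENSIONS` and `list_data_files` are hypothetical names, and the demonstration uses empty placeholder files in a temporary folder.

```python
from pathlib import Path
import tempfile

DATA_EXTENSIONS = {".csv", ".xls", ".xlsx"}  # assumed set of data file types

def list_data_files(folder):
    """Return the data files in `folder`, sorted by name."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in DATA_EXTENSIONS
    )

# Demonstration on a temporary folder with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ["0401.csv", "0327.xlsx", "rename_files.py", "notes.ipynb"]:
        (Path(tmp) / name).touch()
    found = [p.name for p in list_data_files(tmp)]
```

Here `found` holds only the two data files, so scripts, notebooks, and checkpoint folders are skipped without listing them explicitly.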

In [7]:
counties = []

path = os.getcwd()
dir_list = os.listdir(path)

for i in tqdm(range(len(dir_list))):
    if dir_list[i] in ['.ipynb_checkpoints', 'rename_files.py', 'Traffic Data Preprocessing Notebook.ipynb']:
        continue

    old_file = path + "/" + dir_list[i]
    name, extension = os.path.splitext(old_file)
    
    if extension in ['.xls', '.xlsx']:
        temp = pd.read_excel(dir_list[i])
    else:
        temp = pd.read_csv(dir_list[i])
    
    lis = set(temp['COUNTY'].unique())
    # keep the county codes sorted so that they line up with the sorted totals computed later
    counties = sorted(set(counties).union(lis))

print(counties)

  0%|          | 0/102 [00:00<?, ?it/s]

[1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 26, 27, 28, 29, 30, 32, 33, 34, 35, 36, 37, 38, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 86, 87, 88, 89, 90, 92, 93, 94, 97]


# PART 3: Preprocessing Traffic Data Across Multiple Files

Building on the code used to preprocess a single file, we can now preprocess traffic data across multiple files. The total volume per county computed from each file will be appended as a new column of a pandas dataframe, one column per date.

We first initialize a new pandas `DataFrame` whose first column holds the county codes identified in PART 2.

In [8]:
#initialize the dataframe
traffic_data = pd.DataFrame(data = counties, columns = ['COUNTY'])

traffic_data

Unnamed: 0,COUNTY
0,1
1,2
2,3
3,4
4,5
...,...
59,90
60,92
61,93
62,94


Let us define a function `preprocess_data` which processes the dataframe according to the specifications performed in PART 1.

In [9]:
def preprocess_data(data_frame, counties):
    data_frame = data_frame.drop(['BEGDATE', 'HR1', 'HR2', 'HR3', 'HR4', 'HR5', 'HR6', 'HR7', 'HR8', 'HR9', 'HR10', 'HR11', 'HR12',
         'HR13', 'HR14', 'HR15', 'HR16', 'HR17', 'HR18', 'HR19', 'HR20', 'HR21', 'HR22', 'HR23', 'HR24', 
              'TYPE', 'TRUCKS'], axis = 1)
    
    total_volume_per_county = data_frame.groupby('COUNTY')['TOTVOL'].sum()
    
    for i in counties:
        if i not in total_volume_per_county:
            total_volume_per_county[i] = 0
    
    total_volume_per_county = total_volume_per_county.sort_index()
    to_return = list(total_volume_per_county.array)
    
    return to_return
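The zero-filling loop and the sort can also be expressed with `Series.reindex`. This is a sketch of an alternative, not the function used in this notebook: `county_totals` is a hypothetical name, and unlike `preprocess_data` it would silently drop any county code that is not listed in `counties`.

```python
import pandas as pd

def county_totals(data_frame, counties):
    """Hypothetical variant of preprocess_data built on Series.reindex."""
    totals = data_frame.groupby("COUNTY")["TOTVOL"].sum()
    # reindex orders the result by county code and fills absent counties with 0
    return totals.reindex(sorted(counties), fill_value=0).tolist()

# Toy data: county 5 is absent and should get a total of 0
toy = pd.DataFrame({"COUNTY": [2, 1, 2], "TOTVOL": [10, 5, 7]})
result = county_totals(toy, [1, 2, 5])
```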

We test the function on a particular CSV file, `0616.csv`, which covers traffic data on June 16, 2020.

In [10]:
temp = pd.read_csv('0616.csv')
arr = preprocess_data(temp, counties)

# arr should have one total per county in `counties`, in the same format as the list from PART 1
print(arr)

[100808, 40109, 221256, 18700, 0, 32944, 33091, 13064, 817086, 57712, 155118, 55162, 163209, 161056, 240022, 170596, 88021, 120028, 3075, 29718, 103162, 2605, 77668, 5070, 33507, 29907, 164314, 30872, 1981, 141975, 19130, 264322, 6002, 45444, 6652, 2296, 65912, 28746, 137148, 3451, 238383, 22390, 18495, 51685, 4950, 292767, 61009, 497797, 35269, 140065, 308797, 31838, 145307, 94329, 186662, 731521, 636222, 29549, 96681, 77823, 68212, 592913, 68447, 573334]


Before processing all of the files, we first identify the files whose `BEGDATE` field contains more than one date. Some files hold data for several dates instead of a single one; these files are handled in a separate preprocessing step.
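The check for multiple dates can be isolated in a small helper. A minimal sketch, assuming each file has a `BEGDATE` column; `has_multiple_dates` is a hypothetical name and the two frames are toy data.

```python
import pandas as pd

def has_multiple_dates(data_frame):
    """True when the BEGDATE column holds more than one distinct value."""
    return data_frame["BEGDATE"].nunique() > 1

single = pd.DataFrame({"BEGDATE": ["4/1/2020", "4/1/2020"]})
multi = pd.DataFrame({"BEGDATE": ["3/27/2020", "3/28/2020"]})
```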

In [11]:
multiple_dates_files = []

for i in tqdm(range(len(dir_list))):
    if dir_list[i] in ['.ipynb_checkpoints', 'rename_files.py', 'Traffic Data Preprocessing Notebook.ipynb']:
        continue
    
    old_file = path + "/" + dir_list[i]
    name, extension = os.path.splitext(old_file)

    date = dir_list[i][:dir_list[i].index('.')]
    date = date[:2] + "/" + date[2:]

    if extension in ['.xls', '.xlsx']:
        temp = pd.read_excel(dir_list[i])
    else:
        temp = pd.read_csv(dir_list[i])
    
    if len(temp['BEGDATE'].unique()) > 1:
        multiple_dates_files.append(dir_list[i])

  0%|          | 0/102 [00:00<?, ?it/s]

This Python notebook, the `rename_files` Python script, and the notebook checkpoints folder should not be processed by the `preprocess_data` function. In addition, data files with multiple date values will NOT be processed with `preprocess_data`; they are handled in a separate preprocessing step. Let us define a variable `do_not_open` containing the names of the files that should be skipped when iterating `preprocess_data` over the working directory.

In [12]:
do_not_open = ['.ipynb_checkpoints', 'rename_files.py', 'Traffic Data Preprocessing Notebook.ipynb']
do_not_open += multiple_dates_files

do_not_open

['.ipynb_checkpoints',
 'rename_files.py',
 'Traffic Data Preprocessing Notebook.ipynb',
 '0327.xlsx',
 '0330.csv']

We then iterate the function over all files in the working directory. 
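Each column label is derived from the file name, so `0401.csv` becomes `04/01`. The helper below is a sketch of that step with an added validity check; `date_label` is a hypothetical name, and the four-digit `MMDD` naming is assumed from the file names seen in this notebook.

```python
import os

def date_label(filename):
    """Turn a name such as '0401.csv' into the column label '04/01'."""
    stem, _ = os.path.splitext(os.path.basename(filename))
    if len(stem) != 4 or not stem.isdigit():
        raise ValueError(f"expected an MMDD file name, got {filename!r}")
    return stem[:2] + "/" + stem[2:]
```

Raising on unexpected names makes a stray file in the working directory fail loudly instead of producing a malformed column label.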

In [13]:
path = os.getcwd()
dir_list = os.listdir(path)

for i in tqdm(range(len(dir_list))):
    if dir_list[i] in do_not_open:
        continue
    
    old_file = path + "/" + dir_list[i]
    name, extension = os.path.splitext(old_file)

    date = dir_list[i][:dir_list[i].index('.')]
    date = date[:2] + "/" + date[2:]

    if extension in ['.xls', '.xlsx']:
        temp = pd.read_excel(dir_list[i])
    else:
        temp = pd.read_csv(dir_list[i])
    
    arr = preprocess_data(temp, counties)
    
    traffic_data[date] = arr

  0%|          | 0/102 [00:00<?, ?it/s]

In [14]:
traffic_data

Unnamed: 0,COUNTY,03/26,03/28,03/29,03/31,04/01,04/02,04/03,04/04,04/05,...,06/28,06/29,06/30,07/01,07/02,07/03,07/04,07/05,07/06,07/07
0,1,82672,57054,45057,73959,76644,78887,67982,43315,33219,...,80747,98897,99327,102682,111925,103217,71816,81094,99167,96529
1,2,31192,25315,20754,29950,31750,32386,29249,20714,14305,...,27751,35886,38459,43349,46666,45621,34863,31295,40701,42047
2,3,184823,124242,95336,170020,171818,170158,159620,99468,75727,...,158620,216490,214114,220510,234920,218501,147360,165241,213985,207772
3,4,18119,13530,11630,17274,17320,17858,15583,10876,8235,...,14313,18203,18602,18885,20446,18055,12816,14553,18841,17845
4,5,3744,2688,2096,3580,3515,3797,3449,2150,1621,...,2771,3963,3847,4235,4159,3614,2638,2901,3808,3913
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,90,52623,35397,29343,47285,47043,47679,46231,33231,26128,...,83934,84660,82687,86880,93197,91758,71474,73931,82239,77535
60,92,53261,35434,29510,42069,44636,44974,44931,36691,27092,...,62068,68280,67681,68647,72278,75584,63372,60028,67350,49517
61,93,551676,387554,304216,494897,517230,526698,511822,331615,250331,...,516741,703184,644371,722408,746341,683554,473495,480509,672670,690993
62,94,54314,39864,33883,48901,49875,51583,47443,31054,24792,...,61346,68557,66648,69227,77059,72195,50661,62022,67990,64337


# PART 4: Processing Special Traffic Data Files

## PART 4.1: Processing March 30 Traffic

A prior Exploratory Data Analysis (EDA) showed that the March 30 file **contains traffic data of all counties from JANUARY 1, 2020 to MARCH 30, 2020**, condensed into one file. Let `march30_traffic` hold the dataframe loaded from this CSV file; from there, we can explore ways to extract traffic data for multiple dates and multiple counties.

In [15]:
march30_traffic = pd.read_csv('0330.csv')

march30_traffic.head()

Unnamed: 0,COUNTY,SITE,BEGDATE,DIR,HR1,HR2,HR3,HR4,HR5,HR6,...,HR20,HR21,HR22,HR23,HR24,TOTVOL,PEAKHR,PEAKVOL,TYPE,TRUCKS
0,93,10,1/1/2020,N,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,B,
1,93,10,1/1/2020,S,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,B,
2,93,10,1/2/2020,N,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,B,
3,93,10,1/2/2020,S,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,B,
4,93,10,1/3/2020,N,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,B,


In [16]:
#drop the same unnecessary columns as in the previous files, but keep BEGDATE
march30_traffic = march30_traffic.drop(['HR1', 'HR2', 'HR3', 'HR4', 'HR5', 'HR6', 'HR7', 'HR8', 'HR9', 'HR10', 'HR11', 'HR12',
         'HR13', 'HR14', 'HR15', 'HR16', 'HR17', 'HR18', 'HR19', 'HR20', 'HR21', 'HR22', 'HR23', 'HR24', 
              'TYPE', 'TRUCKS'], axis = 1)

march30_traffic

Unnamed: 0,COUNTY,SITE,BEGDATE,DIR,TOTVOL,PEAKHR,PEAKVOL
0,93,10,1/1/2020,N,0,1,0
1,93,10,1/1/2020,S,0,1,0
2,93,10,1/2/2020,N,0,1,0
3,93,10,1/2/2020,S,0,1,0
4,93,10,1/3/2020,N,0,1,0
...,...,...,...,...,...,...,...
47843,93,9965,3/28/2020,S,3045,7,281
47844,93,9965,3/29/2020,N,2493,15,197
47845,93,9965,3/29/2020,S,3260,18,278
47846,93,9965,3/30/2020,N,4251,17,290


We can then group the modified traffic data dataframe according to the `BEGDATE` and `COUNTY` fields.

In [17]:
stats = march30_traffic.groupby(['BEGDATE', 'COUNTY'])['TOTVOL'].sum()

stats

BEGDATE   COUNTY
1/1/2020  1          93577
          2          27107
          3         236551
          4          15620
          5           3437
                     ...  
3/9/2020  90         97365
          92         82079
          93        873672
          94         88853
          97        855046
Name: TOTVOL, Length: 5758, dtype: int64

We have to get all dates that were present in the grouped dataset.
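Note that `BEGDATE` values are strings, so sorting them orders the dates lexicographically rather than by calendar. The sketch below contrasts the two orders on a few sample strings; the variable names are illustrative.

```python
from datetime import datetime

raw_dates = ["1/10/2020", "1/2/2020", "3/9/2020", "1/1/2020"]

# Plain string sorting puts '1/10/2020' before '1/2/2020'
lexicographic = sorted(raw_dates)

# Sorting by the parsed date gives calendar order
chronological = sorted(raw_dates, key=lambda d: datetime.strptime(d, "%m/%d/%Y"))
```

The order does not affect the final result in this notebook, because the columns are relabelled with zero-padded `MM/DD` strings and sorted again afterwards.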

In [18]:
#get all dates present in the dataset
dates = list(set([x[0] for x in stats.keys()]))

dates.sort()
print(dates)

['1/1/2020', '1/10/2020', '1/11/2020', '1/12/2020', '1/13/2020', '1/14/2020', '1/15/2020', '1/16/2020', '1/17/2020', '1/18/2020', '1/19/2020', '1/2/2020', '1/20/2020', '1/21/2020', '1/22/2020', '1/23/2020', '1/24/2020', '1/25/2020', '1/26/2020', '1/27/2020', '1/28/2020', '1/29/2020', '1/3/2020', '1/30/2020', '1/31/2020', '1/4/2020', '1/5/2020', '1/6/2020', '1/7/2020', '1/8/2020', '1/9/2020', '2/1/2020', '2/10/2020', '2/11/2020', '2/12/2020', '2/13/2020', '2/14/2020', '2/15/2020', '2/16/2020', '2/17/2020', '2/18/2020', '2/19/2020', '2/2/2020', '2/20/2020', '2/21/2020', '2/22/2020', '2/23/2020', '2/24/2020', '2/25/2020', '2/26/2020', '2/27/2020', '2/28/2020', '2/29/2020', '2/3/2020', '2/4/2020', '2/5/2020', '2/6/2020', '2/7/2020', '2/8/2020', '2/9/2020', '3/1/2020', '3/10/2020', '3/11/2020', '3/12/2020', '3/13/2020', '3/14/2020', '3/15/2020', '3/16/2020', '3/17/2020', '3/18/2020', '3/19/2020', '3/2/2020', '3/20/2020', '3/21/2020', '3/22/2020', '3/23/2020', '3/24/2020', '3/25/2020', '3/26

In [19]:
#initialize the dataframe
traffic_data_2 = pd.DataFrame(data = counties, columns = ['COUNTY'])

traffic_data_2

Unnamed: 0,COUNTY
0,1
1,2
2,3
3,4
4,5
...,...
59,90
60,92
61,93
62,94


The code cell below performs the following tasks for each date in the dataframe:
- Extract the traffic data for that date from the grouped data
- Set the total to 0 for any county missing from the extracted data
- Sort the traffic data by county (the counties serve as the index of the extracted data)
- Reformat the date label to `MM/DD` for consistency with the previous dataframe
- Append the extracted data as a new column of the `traffic_data_2` dataframe
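The same steps can be sketched without an explicit loop by pivoting the grouped data with `unstack`. This is an alternative shown on toy data under the assumption that `BEGDATE` strings look like `1/2/2020`; it is not the approach used in the cell that follows, and the values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "BEGDATE": ["1/1/2020", "1/1/2020", "1/2/2020"],
    "COUNTY": [1, 2, 1],
    "TOTVOL": [100, 50, 120],
})
counties = [1, 2, 5]  # county 5 never appears in the toy data

stats = toy.groupby(["BEGDATE", "COUNTY"])["TOTVOL"].sum()

# Pivot dates into columns; missing (date, county) pairs become 0
wide = stats.unstack("BEGDATE", fill_value=0)
# Add counties that never appear and order the rows by county code
wide = wide.reindex(sorted(counties), fill_value=0)
# Relabel '1/2/2020' as '01/02' to match the other dataframe
wide.columns = ["/".join(part.zfill(2) for part in c.split("/")[:2]) for c in wide.columns]
```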

In [20]:
for i in dates:
    another_temp = stats.loc[i]
    
    for j in counties:
        if j not in another_temp:
            another_temp[j] = 0

    another_temp = another_temp.sort_index()
    
    date = i.split('/')
    date = date[0].zfill(2) + '/' + date[1].zfill(2)
    
    traffic_data_2[date] = list(another_temp.array)

We then sort our column headers using the `reindex` method.

In [21]:
traffic_data_2 = traffic_data_2.reindex(sorted(traffic_data_2.columns), axis = 1)
traffic_data_2

Unnamed: 0,01/01,01/02,01/03,01/04,01/05,01/06,01/07,01/08,01/09,01/10,...,03/22,03/23,03/24,03/25,03/26,03/27,03/28,03/29,03/30,COUNTY
0,93577,130263,137751,110805,101379,120925,117384,122200,125525,133191,...,59648,85962,83737,83638,82672,84500,57054,45057,74156,1
1,27107,42907,45616,33070,32908,41046,43203,44375,45091,47476,...,23285,31829,31733,31083,31192,32774,25315,20754,30357,2
2,236551,325565,338452,284312,264567,314038,304515,310861,318654,335605,...,125717,195608,189258,187212,184823,190306,124242,95336,169296,3
3,15620,22611,24145,18099,18292,22028,21675,22571,23220,25535,...,13628,18936,18266,18105,18119,18546,13530,11630,16882,4
4,3437,4675,5495,3620,3310,4655,4581,4488,4883,5273,...,2711,3979,4124,4019,3744,3821,2688,2096,3531,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,84573,104874,104700,92469,80689,90902,89265,91252,91892,93932,...,43986,53793,52954,52117,52623,50638,35397,29343,46551,90
60,67464,81092,86625,76820,74660,78769,80345,81134,83048,87917,...,41152,51161,50072,51673,53261,42690,35434,29510,41692,92
61,496735,738460,766661,636097,560431,722610,722778,716651,725090,775552,...,358677,570068,562458,557656,551676,563303,387554,304216,517117,93
62,64646,86494,90838,79047,76925,77275,72499,74157,75496,85537,...,43555,56539,54300,54646,54314,55891,39864,33883,49856,94


The traffic data for March 26, March 28, and March 29 have also been captured as separate files. We can check whether the values extracted from the March 30 dataset equal the values obtained from those separate files (`0326.csv` for March 26 traffic data, etc.).

In [22]:
traffic_data_2[['03/26', '03/28', '03/29']].equals(traffic_data[['03/26', '03/28', '03/29']])

True
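`DataFrame.equals` is strict: it also requires matching dtypes, so a `False` result does not necessarily mean the numbers differ. The sketch below, on toy values, shows this and uses `pd.testing.assert_frame_equal`, which reports where two frames differ when they do.

```python
import pandas as pd

a = pd.DataFrame({"03/26": [82672, 31192]}, dtype="int64")
b = pd.DataFrame({"03/26": [82672.0, 31192.0]})  # same numbers stored as floats

same_strict = a.equals(b)                  # False: dtypes differ
same_values = a.equals(b.astype("int64"))  # True once dtypes agree

# Raises an AssertionError describing the mismatch if the values differ
pd.testing.assert_frame_equal(a, b, check_dtype=False)
```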

Since the two are equal, we can now drop these columns, along with the `COUNTY` column, before appending our dataframe to the previously generated traffic data, `traffic_data`.

In [23]:
traffic_data_2 = traffic_data_2.drop(['03/26', '03/28', '03/29', 'COUNTY'], axis = 1)

traffic_data_2

Unnamed: 0,01/01,01/02,01/03,01/04,01/05,01/06,01/07,01/08,01/09,01/10,...,03/18,03/19,03/20,03/21,03/22,03/23,03/24,03/25,03/27,03/30
0,93577,130263,137751,110805,101379,120925,117384,122200,125525,133191,...,111942,108365,108773,78079,59648,85962,83737,83638,84500,74156
1,27107,42907,45616,33070,32908,41046,43203,44375,45091,47476,...,39627,38233,39310,28825,23285,31829,31733,31083,32774,30357
2,236551,325565,338452,284312,264567,314038,304515,310861,318654,335605,...,252039,240872,243697,165089,125717,195608,189258,187212,190306,169296
3,15620,22611,24145,18099,18292,22028,21675,22571,23220,25535,...,21288,21222,22357,15906,13628,18936,18266,18105,18546,16882
4,3437,4675,5495,3620,3310,4655,4581,4488,4883,5273,...,4484,4719,4759,3130,2711,3979,4124,4019,3821,3531
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,84573,104874,104700,92469,80689,90902,89265,91252,91892,93932,...,82419,78294,78786,60353,43986,53793,52954,52117,50638,46551
60,67464,81092,86625,76820,74660,78769,80345,81134,83048,87917,...,66025,63976,66871,52346,41152,51161,50072,51673,42690,41692
61,496735,738460,766661,636097,560431,722610,722778,716651,725090,775552,...,718218,694518,703092,476365,358677,570068,562458,557656,563303,517117
62,64646,86494,90838,79047,76925,77275,72499,74157,75496,85537,...,72364,70012,72453,54009,43555,56539,54300,54646,55891,49856


In [24]:
final = pd.concat([traffic_data, traffic_data_2], axis = 1)

final

Unnamed: 0,COUNTY,03/26,03/28,03/29,03/31,04/01,04/02,04/03,04/04,04/05,...,03/18,03/19,03/20,03/21,03/22,03/23,03/24,03/25,03/27,03/30
0,1,82672,57054,45057,73959,76644,78887,67982,43315,33219,...,111942,108365,108773,78079,59648,85962,83737,83638,84500,74156
1,2,31192,25315,20754,29950,31750,32386,29249,20714,14305,...,39627,38233,39310,28825,23285,31829,31733,31083,32774,30357
2,3,184823,124242,95336,170020,171818,170158,159620,99468,75727,...,252039,240872,243697,165089,125717,195608,189258,187212,190306,169296
3,4,18119,13530,11630,17274,17320,17858,15583,10876,8235,...,21288,21222,22357,15906,13628,18936,18266,18105,18546,16882
4,5,3744,2688,2096,3580,3515,3797,3449,2150,1621,...,4484,4719,4759,3130,2711,3979,4124,4019,3821,3531
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,90,52623,35397,29343,47285,47043,47679,46231,33231,26128,...,82419,78294,78786,60353,43986,53793,52954,52117,50638,46551
60,92,53261,35434,29510,42069,44636,44974,44931,36691,27092,...,66025,63976,66871,52346,41152,51161,50072,51673,42690,41692
61,93,551676,387554,304216,494897,517230,526698,511822,331615,250331,...,718218,694518,703092,476365,358677,570068,562458,557656,563303,517117
62,94,54314,39864,33883,48901,49875,51583,47443,31054,24792,...,72364,70012,72453,54009,43555,56539,54300,54646,55891,49856


## PART 4.2: Processing March 27 Traffic

The preprocessing step showed that the March 27 file **contains traffic data of all counties from MARCH 27 TO 29, 2020**, condensed into one file. Let `march27_traffic` hold the dataframe loaded from this Excel file; from there, we can explore ways to extract traffic data for multiple dates and multiple counties.

In [25]:
march27_traffic = pd.read_excel('0327.xlsx')

march27_traffic.head()

Unnamed: 0,COUNTY,SITE,BEGDATE,DIR,HR1,HR2,HR3,HR4,HR5,HR6,...,HR20,HR21,HR22,HR23,HR24,TOTVOL,PEAKHR,PEAKVOL,TYPE,TRUCKS
0,93,10,2020-03-27,N,50,25,15,12,6,34,...,419,307,193,102,79,8961,16,793,S,
1,93,10,2020-03-27,S,39,23,10,14,9,46,...,383,278,186,118,73,8787,13,721,S,
2,87,31,2020-03-27,E,78,61,53,45,92,334,...,630,465,362,257,117,15080,8,1294,S,
3,87,31,2020-03-27,W,106,70,46,35,54,154,...,737,528,488,316,215,15501,17,1468,S,
4,29,37,2020-03-27,E,15,9,13,21,35,57,...,112,63,55,38,33,3015,17,249,S,


In [26]:
#drop the same unnecessary columns as in the previous files, but keep BEGDATE
march27_traffic = march27_traffic.drop(['HR1', 'HR2', 'HR3', 'HR4', 'HR5', 'HR6', 'HR7', 'HR8', 'HR9', 'HR10', 'HR11', 'HR12',
         'HR13', 'HR14', 'HR15', 'HR16', 'HR17', 'HR18', 'HR19', 'HR20', 'HR21', 'HR22', 'HR23', 'HR24', 
              'TYPE', 'TRUCKS'], axis = 1)

march27_traffic

Unnamed: 0,COUNTY,SITE,BEGDATE,DIR,TOTVOL,PEAKHR,PEAKVOL
0,93,10,2020-03-27,N,8961,16,793
1,93,10,2020-03-27,S,8787,13,721
2,87,31,2020-03-27,E,15080,8,1294
3,87,31,2020-03-27,W,15501,17,1468
4,29,37,2020-03-27,E,3015,17,249
...,...,...,...,...,...,...,...
1598,28,9963,2020-03-29,S,6213,17,598
1599,93,9964,2020-03-29,N,2819,16,225
1600,93,9964,2020-03-29,S,3446,17,276
1601,93,9965,2020-03-29,N,2493,15,197


The `BEGDATE` field of this particular file is in the format `%Y-%m-%d` while the date formats of other files are in `%m/%d/%Y`. The following code block changes the formatting of the `BEGDATE` field and then converts it to a string.
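As a small illustration of the conversion, the sketch below parses ISO-style date strings with the format they are actually in and then renders them as `%m/%d/%Y`. It assumes the dates are plain strings; when `read_excel` has already parsed the column into datetimes, the parsing step leaves the values unchanged.

```python
import pandas as pd

iso_dates = pd.Series(["2020-03-27", "2020-03-28", "2020-03-29"])

# Parse with the source format, then format as month/day/year strings
converted = pd.to_datetime(iso_dates, format="%Y-%m-%d").dt.strftime("%m/%d/%Y")
```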

In [27]:
# parse BEGDATE without forcing '%m/%d/%Y' (the source values are in '%Y-%m-%d'), then render it as '%m/%d/%Y'
march27_traffic['BEGDATE'] = pd.to_datetime(march27_traffic['BEGDATE']).dt.strftime('%m/%d/%Y')

In [28]:
march27_traffic

Unnamed: 0,COUNTY,SITE,BEGDATE,DIR,TOTVOL,PEAKHR,PEAKVOL
0,93,10,03/27/2020,N,8961,16,793
1,93,10,03/27/2020,S,8787,13,721
2,87,31,03/27/2020,E,15080,8,1294
3,87,31,03/27/2020,W,15501,17,1468
4,29,37,03/27/2020,E,3015,17,249
...,...,...,...,...,...,...,...
1598,28,9963,03/29/2020,S,6213,17,598
1599,93,9964,03/29/2020,N,2819,16,225
1600,93,9964,03/29/2020,S,3446,17,276
1601,93,9965,03/29/2020,N,2493,15,197


We can now group the traffic data by the `BEGDATE` and `COUNTY` fields and sum the `TOTVOL` field within each group. We then get all dates present in the dataset.

In [29]:
mar27stats = march27_traffic.groupby(['BEGDATE', 'COUNTY'])['TOTVOL'].sum()

mar27stats

BEGDATE     COUNTY
03/27/2020  1          84500
            2          32774
            3         190306
            4          18546
            5           3821
                       ...  
03/29/2020  90         29343
            92         29510
            93        304216
            94         33883
            97        218471
Name: TOTVOL, Length: 192, dtype: int64

In [30]:
#get all dates present in the dataset
mar27dates = list(set([x[0] for x in mar27stats.keys()]))

mar27dates.sort()
print(mar27dates)

['03/27/2020', '03/28/2020', '03/29/2020']


In [31]:
#initialize the dataframe
traffic_data_3 = pd.DataFrame(data = counties, columns = ['COUNTY'])

traffic_data_3

Unnamed: 0,COUNTY
0,1
1,2
2,3
3,4
4,5
...,...
59,90
60,92
61,93
62,94


In [32]:
for i in mar27dates:
    another_temp_2 = mar27stats.loc[i]
    
    for j in counties:
        if j not in another_temp_2:
            another_temp_2[j] = 0

    another_temp_2 = another_temp_2.sort_index()
    
    date = i.split('/')
    date = date[0].zfill(2) + '/' + date[1].zfill(2)
    
    traffic_data_3[date] = list(another_temp_2.array)

In [33]:
traffic_data_3

Unnamed: 0,COUNTY,03/27,03/28,03/29
0,1,84500,57054,45057
1,2,32774,25315,20754
2,3,190306,124242,95336
3,4,18546,13530,11630
4,5,3821,2688,2096
...,...,...,...,...
59,90,50638,35397,29343
60,92,42690,35434,29510
61,93,563303,387554,304216
62,94,55891,39864,33883


The traffic data for March 27, March 28, and March 29 have been retrieved and processed in Part 4.1. We can determine if the data processed in Part 4.2 is the same as the data retrieved and processed from Part 4.1 by using the `.equals` method.

In [34]:
traffic_data_3[['03/27', '03/28', '03/29']].equals(final[['03/27', '03/28', '03/29']])

True

Since they are equal, **we do not have to concatenate or adjoin this dataframe into the existing `final` dataset.** The `final` dataset already covers all traffic data on these particular dates. The code cell below sorts the column headers of the `final` dataset for a cleaner presentation.
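A plain `sorted` on the column labels places `COUNTY` last, because digits sort before uppercase letters. If the county column should come first instead, the date columns can be sorted separately. A minimal sketch on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({"03/26": [1], "COUNTY": [93], "01/01": [2]})

# Sort only the date columns and put COUNTY in front
date_cols = sorted(c for c in toy.columns if c != "COUNTY")
ordered = toy[["COUNTY"] + date_cols]
```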

In [35]:
final = final.reindex(sorted(final.columns), axis = 1)
final

Unnamed: 0,01/01,01/02,01/03,01/04,01/05,01/06,01/07,01/08,01/09,01/10,...,06/29,06/30,07/01,07/02,07/03,07/04,07/05,07/06,07/07,COUNTY
0,93577,130263,137751,110805,101379,120925,117384,122200,125525,133191,...,98897,99327,102682,111925,103217,71816,81094,99167,96529,1
1,27107,42907,45616,33070,32908,41046,43203,44375,45091,47476,...,35886,38459,43349,46666,45621,34863,31295,40701,42047,2
2,236551,325565,338452,284312,264567,314038,304515,310861,318654,335605,...,216490,214114,220510,234920,218501,147360,165241,213985,207772,3
3,15620,22611,24145,18099,18292,22028,21675,22571,23220,25535,...,18203,18602,18885,20446,18055,12816,14553,18841,17845,4
4,3437,4675,5495,3620,3310,4655,4581,4488,4883,5273,...,3963,3847,4235,4159,3614,2638,2901,3808,3913,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,84573,104874,104700,92469,80689,90902,89265,91252,91892,93932,...,84660,82687,86880,93197,91758,71474,73931,82239,77535,90
60,67464,81092,86625,76820,74660,78769,80345,81134,83048,87917,...,68280,67681,68647,72278,75584,63372,60028,67350,49517,92
61,496735,738460,766661,636097,560431,722610,722778,716651,725090,775552,...,703184,644371,722408,746341,683554,473495,480509,672670,690993,93
62,64646,86494,90838,79047,76925,77275,72499,74157,75496,85537,...,68557,66648,69227,77059,72195,50661,62022,67990,64337,94


# PART 5: Mapping County Codes to County Names

The county codes used in the *traffic data* are not consistent with the other data sources collected. To stay consistent with those sources, we convert the *County Code* field into a county name string using the following dictionary/mapping.

This mapping was manually obtained from this [LINK](https://ftp.fdot.gov/file/d/FTP/FDOT/co/planning/transtat/gis/TRANSTAT_metadata/aadt.shp.xml) with the following additions:
- 02: Citrus
- 97: Florida's Turnpike

In [36]:
mapping = {1: 'Charlotte', 2: 'Citrus', 3: 'Collier', 4: 'Desoto', 5: 'Glades', 6: 'Hardee', 7: 'Hendry', 9: 'Highlands',
 12: 'Lee', 8: 'Hernando', 10: 'Hillsborough', 11: 'Lake', 13: 'Manatee', 14: 'Pasco', 15: 'Pinellas', 16: 'Polk',
           17: 'Sarasota', 18: 'Sumter', 26: 'Alachua', 27: 'Baker', 28: 'Bradford', 29: 'Columbia', 30: 'Dixie',
           31: 'Gilchrist', 32: 'Hamilton', 33: 'Lafayette', 34: 'Levy', 35: 'Madison',36: 'Marion', 37: 'Suwannee',
           38: 'Taylor', 39: 'Union', 46: 'Bay', 47: 'Calhoun', 48: 'Escambia', 49: 'Franklin', 50: 'Gadsden', 51: 'Gulf',
           52: 'Holmes', 53: 'Jackson', 54: 'Jefferson', 55: 'Leon', 56: 'Liberty', 57: 'Okaloosa', 58: 'Santa Rosa',
           59: 'Wakulla', 60: 'Walton', 61: 'Washington', 70: 'Brevard', 71: 'Clay', 72: 'Duval', 73: 'Flagler', 74: 'Nassau',
           75: 'Orange', 76: 'Putnam', 77: 'Seminole', 78: 'St. Johns', 79: 'Volusia', 86: 'Broward', 87: 'Miami-Dade',
           88: 'Indian River', 89: 'Martin', 90: 'Monroe', 91: 'Okeechobee', 92: 'Osceola', 93: 'Palm Beach', 94: 'St. Lucie', 
           97: 'Florida\'s Turnpike'}

In [37]:
# assign the result back instead of using inplace=True on a column selection
final['COUNTY'] = final['COUNTY'].replace(mapping)
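`replace` leaves any code that is missing from the mapping unchanged, so an unmapped county would stay numeric without warning. The sketch below uses `Series.map` on toy data to surface such codes; the shortened `mapping` and the code `99` are made up for illustration.

```python
import pandas as pd

mapping = {1: "Charlotte", 2: "Citrus"}       # shortened for illustration
codes = pd.Series([1, 2, 99], name="COUNTY")  # 99 is a made-up code with no name

names = codes.map(mapping)               # unmapped codes become NaN
unmapped = codes[names.isna()].tolist()  # codes that still need a name
```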

In [38]:
final

Unnamed: 0,01/01,01/02,01/03,01/04,01/05,01/06,01/07,01/08,01/09,01/10,...,06/29,06/30,07/01,07/02,07/03,07/04,07/05,07/06,07/07,COUNTY
0,93577,130263,137751,110805,101379,120925,117384,122200,125525,133191,...,98897,99327,102682,111925,103217,71816,81094,99167,96529,Charlotte
1,27107,42907,45616,33070,32908,41046,43203,44375,45091,47476,...,35886,38459,43349,46666,45621,34863,31295,40701,42047,Citrus
2,236551,325565,338452,284312,264567,314038,304515,310861,318654,335605,...,216490,214114,220510,234920,218501,147360,165241,213985,207772,Collier
3,15620,22611,24145,18099,18292,22028,21675,22571,23220,25535,...,18203,18602,18885,20446,18055,12816,14553,18841,17845,Desoto
4,3437,4675,5495,3620,3310,4655,4581,4488,4883,5273,...,3963,3847,4235,4159,3614,2638,2901,3808,3913,Glades
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,84573,104874,104700,92469,80689,90902,89265,91252,91892,93932,...,84660,82687,86880,93197,91758,71474,73931,82239,77535,Monroe
60,67464,81092,86625,76820,74660,78769,80345,81134,83048,87917,...,68280,67681,68647,72278,75584,63372,60028,67350,49517,Osceola
61,496735,738460,766661,636097,560431,722610,722778,716651,725090,775552,...,703184,644371,722408,746341,683554,473495,480509,672670,690993,Palm Beach
62,64646,86494,90838,79047,76925,77275,72499,74157,75496,85537,...,68557,66648,69227,77059,72195,50661,62022,67990,64337,St. Lucie


# PART 6: Exporting Data

We can now export the data to a readable CSV file.

In [39]:
final.to_csv('../florida_traffic_data.csv', encoding = 'utf-8', index = False)
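After exporting, reading the file back is a quick way to confirm that nothing was lost. The sketch below does this for a toy frame written to a temporary folder; `toy_traffic.csv` is a hypothetical file name and the values are taken from the first two rows shown above.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({"01/01": [93577, 27107], "COUNTY": ["Charlotte", "Citrus"]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "toy_traffic.csv")  # hypothetical file name
    toy.to_csv(out_path, encoding="utf-8", index=False)
    reloaded = pd.read_csv(out_path)

# Compare column by column, ignoring integer width differences
round_trip_ok = toy.to_dict("list") == reloaded.to_dict("list")
```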