# Flow Processing Pipeline


## Goal of the pipeline

<font color=red>The goal of this notebook is to process the raw ADT and PeMS data for 2013, 2015, 2017 and 2019 into an understandable format. Every detector of interest is then associated with a network link inside Aimsun. Later, every detector is associated with a road section in order to create heatmaps and compare flows between years.</font>


### Dependent scripts: 
- pems_download.py: automatically downloads PeMS data for the chosen dates.
- pre_process_flow.py: (A) parses the ADT data from xlsx, doc and csv files into csv files, (B) finds the coordinates of the city detectors, and (C) checks and adjusts the locations of the (City + PeMS) detectors to match them to our network using ArcGIS.
- process_flow.py: processes all the traffic flow data from both city and PeMS data files into one big CSV file.
- process_years_together.py: combines the previous scripts and processes the data for 2013, 2015, 2017 and 2019.


### Inputs of the pipeline: 
- PeMS account
- City flow data
- Kimley-Horn flow and speed data

### Outputs of the pipeline: 
- Processed flow data for 2013, 2015, 2017, 2019


### TO DO:
- The process (and the Python functions) should have an exception handler for the case where the files are not located where the code looks for them.
- Run PCA and create a heatmap with ArcGIS (the heatmap will be done by Theo).
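
The first to-do item could be addressed with a small guard that runs before any parsing. The sketch below is only an illustration: `require_folder` and `list_input_files` are hypothetical helpers, not functions from the pipeline scripts, and the list of accepted extensions is an assumption.

```python
from pathlib import Path


def require_folder(folder):
    """Return `folder` as a Path, or raise a clear error if it is missing."""
    path = Path(folder)
    if not path.is_dir():
        raise FileNotFoundError(
            f"Expected input folder not found: {path.resolve()}"
        )
    return path


def list_input_files(folder, suffixes=(".xls", ".xlsx", ".csv")):
    """List the files in `folder` whose extension is in `suffixes`.

    Hypothetical helper: fails early with a readable message instead of
    letting the parser crash later on a missing or empty folder.
    """
    path = require_folder(folder)
    files = sorted(p for p in path.iterdir() if p.suffix.lower() in suffixes)
    if not files:
        raise FileNotFoundError(f"No input files with {suffixes} in {path}")
    return files
```

Calling something like `list_input_files("2013 ADT Data")` at the top of a processing function would surface a misplaced folder immediately.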

## 1. Obtaining Data

### A) PeMS Data Download 
### [This script only works in an IPython notebook; do not test it locally]
    
Source file code: PeMS Download.ipynb <br>
Download traffic data from the PeMS website (pems.dot.ca.gov) for the years 2013, 2015, 2017 and 2019. <br>


We download the data by calling the function download(year, detector_ids) from the pems_download.py file. The process takes about 5 minutes to run. 

The function takes as input:
- The PeMS detector IDs
- The year of the desired data 

The function produces as output:
- The PeMS data file of every requested detector for the given year (and the given days encoded in the URL)
- The files are stored in the download folder as PeMS-ID_YEAR.xlsx (where PeMS-ID is the detector ID given by PeMS).

For the function to work, you need to:
- Log in to PeMS in the same browser that runs this Jupyter notebook

For the pipeline to work:
- The downloaded files must then be moved manually to the folders PeMS/PeMS_YEAR (where YEAR=2013, 2017 or 2019).
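
The manual move could be scripted. The following is a minimal sketch, assuming the downloaded files are named with the numeric detector ID, an underscore and the year (for example `403250_2013.xlsx`); `move_pems_downloads` is a hypothetical helper and both folder arguments are placeholders.

```python
import re
import shutil
from pathlib import Path

# Assumed naming: numeric detector ID, underscore, 4-digit year, ".xlsx".
PEMS_FILE_PATTERN = re.compile(r"^(\d+)_(\d{4})\.xlsx$")


def move_pems_downloads(download_dir, pems_root):
    """Move ID_YEAR.xlsx files from download_dir into pems_root/PeMS_YEAR/."""
    moved = []
    for path in sorted(Path(download_dir).iterdir()):
        match = PEMS_FILE_PATTERN.match(path.name)
        if match is None:
            continue  # not a PeMS export, leave it where it is
        year = match.group(2)
        target_dir = Path(pems_root) / f"PeMS_{year}"
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / path.name
        shutil.move(str(path), str(target))
        moved.append(target)
    return moved
```

Files that do not match the pattern are left untouched, so the function can be pointed at a general download folder.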

### B) Turn city pdf to doc

Some of the flow data that we got from the city came as pdf files. To be able to parse them, we convert them to doc files using an online converter.

### C) Location and structure of the files

#### PeMS files
The PeMS files are in a folder PeMS/PeMS_YEAR/ (where YEAR=2013, 2017 or 2019).
Every file is named PeMS-ID_YEAR.xlsx (where PeMS-ID is the detector ID given by PeMS).

One xlsx file has two sheets:
- PeMS Report Description
- Report Data <br>
    - Contains the traffic flow data
    - Each row gives the number of vehicles observed in one time step (5 minutes), with one column per lane.
    - The first column gives the date and time stamp, and the columns that follow are the lanes (e.g. Lane 1 Flow, Lane 2 Flow).
    - ***Edson Question***: The columns "Flow (Veh/5 Minutes)", "# Lane Points" and "% Observed" are ambiguous; I don't know exactly what they represent. For example, "Flow (Veh/5 Minutes)" does not specify whether it belongs to one lane or combines the previous lane flows.
    ***Theo Answer***: What matters is "Flow (Veh/5 Minutes)", the number of vehicles seen over every lane of the corresponding detector, and "% Observed", which tells how much of the flow comes from vehicles actually sensed rather than from estimates based on other days, used when a technical issue keeps the sensor from detecting every car. 
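
Since the PeMS report uses 5-minute steps while the processed outputs use 15-minute steps, some aggregation is needed. The sketch below assumes the aggregation is a plain sum of three consecutive 5-minute counts and that "% Observed" is summarised by its mean; both choices are assumptions, and the function names are hypothetical.

```python
def aggregate_to_15_minutes(flows_5min):
    """Sum consecutive groups of three 5-minute counts into 15-minute counts."""
    if len(flows_5min) % 3 != 0:
        raise ValueError("Expected a multiple of three 5-minute values")
    return [sum(flows_5min[i:i + 3]) for i in range(0, len(flows_5min), 3)]


def mean_percent_observed(percent_observed):
    """Average the "% Observed" values to summarise data quality."""
    if not percent_observed:
        raise ValueError("No '% Observed' values given")
    return sum(percent_observed) / len(percent_observed)
```

For example, `aggregate_to_15_minutes([1, 2, 3, 4, 5, 6])` gives `[6, 15]`.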

#### City files

The city files are in folders named "Year ADT Data", where Year is 2013, 2015, 2017 or 2019. Almost every file is named using the convention "Main Road Cross 1 Cross 2 Direction". Main Road is the road on which the flow is recorded. Cross 1 and Cross 2 locate the sensor, which sits between the intersection of the main road with Cross 1 and the intersection of the main road with Cross 2. Direction gives the direction of the flow, such as "EB" for eastbound. Note that the files in a "Year ADT Data" folder are of type .xls, .xlsx, .csv or .pdf.

- ***2013 Excel files*** are structured in data sheets. The first sheet, "Summary", contains the main road, cross streets, city information and the start date of the recording. It also summarizes the data of all other sheets in a bar plot of traffic flow vs. time-of-day bins (e.g. Tuesday AM, Wednesday PM) for the different flow directions, and in a line plot of traffic flow vs. hour of day for the different days of the week. The sheets that follow are named "D1", "D2", ..., "DN", where N denotes the Nth day since the start date. Each of these sheets holds two tables, AM counts and PM counts. Each table row gives the traffic flow per 15-minute timestep. The first column is the time of day in hh:mm format, followed by one traffic flow column per direction (NB, SB, EB, WB).
***Theo remark***: There are hidden sheets! This is very important for the parsing!


- ***2015 Excel files*** come from Kimley-Horn. Every Excel file has 6 relevant sheets, including the hidden sheets: ['ns Day 1', 'ns Day 2', 'ns Day 3', 'ew Day 1', 'ew Day 2', 'ew Day 3']. Depending on the direction of the road, one set of sheets is filled and the other is empty. The parsing algorithm therefore first determines the flow direction, which decides the set of sheets to parse. Each sheet has two tables side by side, Northbound (Eastbound) and Southbound (Westbound). In each table, the AM and PM flows are also side by side, sharing the same column. The algorithm therefore extracts the AM and PM flows first and stacks them to get the traffic flow for the whole day. 


- ***2017 Excel files*** are structured in one data sheet containing a header and a traffic flow table. The header gives the start date and time of the recording, the site code and the sensor location; the table gives the traffic flow per 15-minute timestep. The table's first two columns give the date and time, and the following columns give the traffic flow per direction.

- ***2017 PDF files*** are structured with a header and 3 tables of traffic flow data (one table per day, for consecutive days). The header gives the site location and other miscellaneous metadata. Each table is titled with the date and the timestep (15 minutes) of the recording. A table is organized in columns, each representing one hour of the day (0 - 23). For a given column, the first row gives the hour of the day, the second gives the total flow for the hour, and the third through the last row (4 rows in total) give the traffic flow per 15-minute timestep within the hour.

- ***2019 Excel files*** have a structure similar to those of 2013. The data is organized in two types of sheets, "Day N" and "GR N". The "Day N" sheets give traffic flow data in the same fashion as the "DN" sheets of the 2013 Excel files. The day of recording can be found in the header of the two tables. The "GR N" sheets plot the flow data of the corresponding "Day N" sheets as a line plot of flow vs. hour of day for the different flow directions.
***Theo remark***: Same here, there are hidden sheets!

- ***2019 PDF files*** have the same structure as those of 2017. 
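
Two of the layouts above need reshaping before they can be compared with the others: the side-by-side AM/PM tables of the 2015 Excel files, and the hour-per-column tables of the PDF files. The sketch below shows both steps on plain Python lists. It is not the code in pre_process_flow.py; the function names and the input shapes are assumptions chosen for illustration.

```python
def stack_am_pm(am_counts, pm_counts):
    """Stack side-by-side AM and PM columns into one full-day series."""
    return list(am_counts) + list(pm_counts)


def flatten_hourly_columns(hour_columns):
    """Flatten PDF-style columns into a chronological 15-minute series.

    Each column is (hour, hourly_total, [q1, q2, q3, q4]), mirroring the
    rows described for the PDF tables: hour, total, four 15-minute counts.
    """
    series = []
    for hour, total, quarters in sorted(hour_columns, key=lambda c: c[0]):
        if len(quarters) != 4:
            raise ValueError(f"Hour {hour}: expected four 15-minute values")
        if sum(quarters) != total:
            # The hourly total doubles as a cheap check on the parsing.
            raise ValueError(f"Hour {hour}: quarters do not add up to total")
        series.extend(quarters)
    return series
```

Checking the four quarter-hour values against the hourly total is a design choice that can catch misaligned rows when a converted document does not parse cleanly.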



In [1]:
import pems_download as pems

download path/Users/LiJiayi/Downloads


In [2]:
detector_ids = [403250, 403256, 403255, 403257, 418387, 418388, 400376,
               413981, 413980, 413982, 402794, 413983, 413984, 413985,
               413987, 413986, 402796, 413988, 402799, 403251, 403710,
               403254, 403719, 400566, 418420, 418419, 418422, 418423,
               402793, 403226, 414015, 414016, 402795, 402797, 414011,
               402798]
pems.download(2013, detector_ids)
pems.download(2015, detector_ids)
pems.download(2017, detector_ids)

http://pems.dot.ca.gov/?report_form=1&dnode=VDS&content=loops&tab=det_timeseries&export=xls&station_id=403250&s_time_id=1362441600&s_time_id_f=03%2F05%2F2013+00%3A00&e_time_id=1362700740&e_time_id_f=03%2F07%2F2013+23%3A59&tod=all&tod_from=0&tod_to=0&dow_0=on&dow_1=on&dow_2=on&dow_3=on&dow_4=on&dow_5=on&dow_6=on&holidays=on&q=flow&q2=&gn=5min&agg=on&lane1=on&lane2=onlane3=on
http://pems.dot.ca.gov/?report_form=1&dnode=VDS&content=loops&tab=det_timeseries&export=xls&station_id=403256&s_time_id=1362441600&s_time_id_f=03%2F05%2F2013+00%3A00&e_time_id=1362700740&e_time_id_f=03%2F07%2F2013+23%3A59&tod=all&tod_from=0&tod_to=0&dow_0=on&dow_1=on&dow_2=on&dow_3=on&dow_4=on&dow_5=on&dow_6=on&holidays=on&q=flow&q2=&gn=5min&agg=on&lane1=on&lane2=onlane3=on
http://pems.dot.ca.gov/?report_form=1&dnode=VDS&content=loops&tab=det_timeseries&export=xls&station_id=403255&s_time_id=1362441600&s_time_id_f=03%2F05%2F2013+00%3A00&e_time_id=1362700740&e_time_id_f=03%2F07%2F2013+23%3A59&tod=all&tod_from=0&tod_t

http://pems.dot.ca.gov/?report_form=1&dnode=VDS&content=loops&tab=det_timeseries&export=xls&station_id=403719&s_time_id=1362441600&s_time_id_f=03%2F05%2F2013+00%3A00&e_time_id=1362700740&e_time_id_f=03%2F07%2F2013+23%3A59&tod=all&tod_from=0&tod_to=0&dow_0=on&dow_1=on&dow_2=on&dow_3=on&dow_4=on&dow_5=on&dow_6=on&holidays=on&q=flow&q2=&gn=5min&agg=on&lane1=on&lane2=onlane3=on
http://pems.dot.ca.gov/?report_form=1&dnode=VDS&content=loops&tab=det_timeseries&export=xls&station_id=400566&s_time_id=1362441600&s_time_id_f=03%2F05%2F2013+00%3A00&e_time_id=1362700740&e_time_id_f=03%2F07%2F2013+23%3A59&tod=all&tod_from=0&tod_to=0&dow_0=on&dow_1=on&dow_2=on&dow_3=on&dow_4=on&dow_5=on&dow_6=on&holidays=on&q=flow&q2=&gn=5min&agg=on&lane1=on&lane2=onlane3=on
http://pems.dot.ca.gov/?report_form=1&dnode=VDS&content=loops&tab=det_timeseries&export=xls&station_id=418420&s_time_id=1362441600&s_time_id_f=03%2F05%2F2013+00%3A00&e_time_id=1362700740&e_time_id_f=03%2F07%2F2013+23%3A59&tod=all&tod_from=0&tod_t

http://pems.dot.ca.gov/?report_form=1&dnode=VDS&content=loops&tab=det_timeseries&export=xls&station_id=413980&s_time_id=1425340800&s_time_id_f=03%2F03%2F2015+00%3A00&e_time_id=1362700740&e_time_id_f=03%2F07%2F2015+23%3A59&tod=all&tod_from=0&tod_to=0&dow_0=on&dow_1=on&dow_2=on&dow_3=on&dow_4=on&dow_5=on&dow_6=on&holidays=on&q=flow&q2=&gn=5min&agg=on&lane1=on&lane2=onlane3=on
http://pems.dot.ca.gov/?report_form=1&dnode=VDS&content=loops&tab=det_timeseries&export=xls&station_id=413982&s_time_id=1425340800&s_time_id_f=03%2F03%2F2015+00%3A00&e_time_id=1362700740&e_time_id_f=03%2F07%2F2015+23%3A59&tod=all&tod_from=0&tod_to=0&dow_0=on&dow_1=on&dow_2=on&dow_3=on&dow_4=on&dow_5=on&dow_6=on&holidays=on&q=flow&q2=&gn=5min&agg=on&lane1=on&lane2=onlane3=on
http://pems.dot.ca.gov/?report_form=1&dnode=VDS&content=loops&tab=det_timeseries&export=xls&station_id=402794&s_time_id=1425340800&s_time_id_f=03%2F03%2F2015+00%3A00&e_time_id=1362700740&e_time_id_f=03%2F07%2F2015+23%3A59&tod=all&tod_from=0&tod_t

http://pems.dot.ca.gov/?report_form=1&dnode=VDS&content=loops&tab=det_timeseries&export=xls&station_id=414015&s_time_id=1425340800&s_time_id_f=03%2F03%2F2015+00%3A00&e_time_id=1362700740&e_time_id_f=03%2F07%2F2015+23%3A59&tod=all&tod_from=0&tod_to=0&dow_0=on&dow_1=on&dow_2=on&dow_3=on&dow_4=on&dow_5=on&dow_6=on&holidays=on&q=flow&q2=&gn=5min&agg=on&lane1=on&lane2=onlane3=on
http://pems.dot.ca.gov/?report_form=1&dnode=VDS&content=loops&tab=det_timeseries&export=xls&station_id=414016&s_time_id=1425340800&s_time_id_f=03%2F03%2F2015+00%3A00&e_time_id=1362700740&e_time_id_f=03%2F07%2F2015+23%3A59&tod=all&tod_from=0&tod_to=0&dow_0=on&dow_1=on&dow_2=on&dow_3=on&dow_4=on&dow_5=on&dow_6=on&holidays=on&q=flow&q2=&gn=5min&agg=on&lane1=on&lane2=onlane3=on
http://pems.dot.ca.gov/?report_form=1&dnode=VDS&content=loops&tab=det_timeseries&export=xls&station_id=402795&s_time_id=1425340800&s_time_id_f=03%2F03%2F2015+00%3A00&e_time_id=1362700740&e_time_id_f=03%2F07%2F2015+23%3A59&tod=all&tod_from=0&tod_t

http://pems.dot.ca.gov/?report_form=1&dnode=VDS&content=loops&tab=det_timeseries&export=xls&station_id=402796&s_time_id=1488844800&s_time_id_f=03%2F07%2F2017+00%3A00&e_time_id=1489103940&e_time_id_f=03%2F09%2F2017+23%3A59&tod=all&tod_from=0&tod_to=0&dow_0=on&dow_1=on&dow_2=on&dow_3=on&dow_4=on&dow_5=on&dow_6=on&holidays=on&q=flow&q2=&gn=5min&agg=on&lane1=on&lane2=onlane3=on
http://pems.dot.ca.gov/?report_form=1&dnode=VDS&content=loops&tab=det_timeseries&export=xls&station_id=413988&s_time_id=1488844800&s_time_id_f=03%2F07%2F2017+00%3A00&e_time_id=1489103940&e_time_id_f=03%2F09%2F2017+23%3A59&tod=all&tod_from=0&tod_to=0&dow_0=on&dow_1=on&dow_2=on&dow_3=on&dow_4=on&dow_5=on&dow_6=on&holidays=on&q=flow&q2=&gn=5min&agg=on&lane1=on&lane2=onlane3=on
http://pems.dot.ca.gov/?report_form=1&dnode=VDS&content=loops&tab=det_timeseries&export=xls&station_id=402799&s_time_id=1488844800&s_time_id_f=03%2F07%2F2017+00%3A00&e_time_id=1489103940&e_time_id_f=03%2F09%2F2017+23%3A59&tod=all&tod_from=0&tod_t

The IDs of the PeMS detectors were obtained using ArcGIS and an input file, pems_detectors.csv, containing the locations of all the PeMS detectors in California. <br>
***Edson Question:*** Not sure if this suffices. Whom should I ask to learn more about this process?
***Theo Answer:*** Please ask me! We downloaded the list of all detectors and their locations from PeMS, then performed a selection in ArcGIS to find the ones in the area of interest.
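
The selection was done in ArcGIS, but a rough equivalent can be written in Python. The sketch below filters detectors with a latitude/longitude bounding box; the column names `ID`, `Latitude` and `Longitude` are assumptions about pems_detectors.csv, the function is hypothetical, and a rectangular box is only an approximation of an arbitrary selection area.

```python
import csv
import io


def select_detectors_in_bbox(csv_text, south, west, north, east):
    """Return the IDs of detectors whose coordinates fall inside the box.

    Assumed columns (hypothetical): "ID", "Latitude", "Longitude".
    """
    selected = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        lat, lon = float(row["Latitude"]), float(row["Longitude"])
        if south <= lat <= north and west <= lon <= east:
            selected.append(int(row["ID"]))
    return selected
```

With a real file, the text would come from `Path("pems_detectors.csv").read_text()`, and the box limits would be chosen to cover the study area.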

## 2. Parsing city data
Source file code: Pre-processing_flow.ipynb <br>

**To do**: The time format should be the same everywhere after the processing
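
One way to approach this to-do is to funnel every time label through a single normalizing function. The sketch below assumes the target style is the unpadded "H:M" form used in the output column names (for example "0:15"); the list of accepted input formats is hypothetical and would need to match what the raw files actually contain.

```python
from datetime import datetime

# Candidate input formats; hypothetical, extend to match the real files.
KNOWN_TIME_FORMATS = ("%H:%M", "%I:%M %p", "%H:%M:%S")


def normalize_time(label):
    """Convert a time-of-day string to the unpadded "H:M" style."""
    text = label.strip()
    for fmt in KNOWN_TIME_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # try the next candidate format
        return f"{parsed.hour}:{parsed.minute}"
    raise ValueError(f"Unrecognised time format: {label!r}")
```

Raising on unknown formats, rather than passing the label through, makes a new format visible the first time it appears.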

### A) Parse the city data from xlsx, doc and csv files to csv files

#### Process ADT Data files to csv files
Here we process the Excel ADT data files (city data) into CSV files. We do this by calling the function process_adt_data(year) from the pre_process_flow.py python file. Note that one file corresponds to one main road and the traffic flow data recordings in it.

Description of function process_adt_data(year) <br>
The function has input:
- Year which takes values of 2013, 2015, 2017 and 2019
- Excel files located in "Year ADT Data" where Year=2013, 2015, 2017 or 2019

The function has output:
- CSV files located in "Year processed" where Year=2013, 2015, 2017 or 2019 

For the function to work:
- Excel files must be located in "Year ADT Data" folder

For the pipeline to work: [Question for Theo???]
- 2013 CSV files are manually relocated to "City/2013 reformat"
- 2017 and 2019 CSV files are manually relocated to "City/Year reformat/Format from xlsx" where Year=2017 or 2019

**(Done) TO DO 6**: Go over the code, comment it and make it easier to understand. State the inputs and the outputs clearly (this will help a lot for the 2015 data). Ask Zixuan about the work she did on the 2015 data.
<br>
***Edson Question***: The code has been rewritten so that it is easier to understand. Inputs and outputs are stated for the large code segments in the Python files and for the methods called here. Comments were added per code chunk as needed. I asked Zixuan about the 2015 data; she said she processed the speed and flow for 2015. Can I get access to the Dropbox holding the 2015 data?
***Theo Answer***: I sent you the info on Slack.

***Theo Remark***: Maybe "Year ADT Data" and "Year processed" should be parameters of the function process_adt_data.
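
A possible shape for that parameterization is sketched below. `adt_folders` is a hypothetical helper, not part of pre_process_flow.py; the defaults reproduce the folder names used in this notebook, and `root` is an added assumption for running from another working directory.

```python
from pathlib import Path


def adt_folders(year, input_template="{year} ADT Data",
                output_template="{year} processed", root="."):
    """Return (input_folder, output_folder) for one year of ADT data."""
    base = Path(root)
    input_dir = base / input_template.format(year=year)
    output_dir = base / output_template.format(year=year)
    return input_dir, output_dir
```

A function such as process_adt_data could then accept the two templates as optional arguments and resolve them through this helper, so the default behaviour stays unchanged.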

In [1]:
import pre_process_flow as pre_process

In [78]:
pre_process.process_adt_data(2013)
pre_process.process_adt_data(2015)
pre_process.process_adt_data(2017)
pre_process.process_adt_data(2019)

Processing 2013 ADT data
parsing: Auto Mall Pkwy betw. Fremont & I680.xlsx
True
/Users/LiJiayi/Fremont Dropbox/Theophile Cabannes/Private Structured data collection/Data processing/Temporary exports to be copied to processed data/Flow_processed/2013 processed/Auto Mall Pkwy betw. Fremont & I680.csv
Processing 2015 ADT data
parsing: Washington Blvd betw. Driscoll and Paseo Padre.xls
False
/Users/LiJiayi/Fremont Dropbox/Theophile Cabannes/Private Structured data collection/Data processing/Temporary exports to be copied to processed data/Flow_processed/2015 processed/Washington Blvd betw. Driscoll and Paseo Padre.csv
parsing: Paseo Padre Pkwy betw. Mission and Curtner.xls
False
/Users/LiJiayi/Fremont Dropbox/Theophile Cabannes/Private Structured data collection/Data processing/Temporary exports to be copied to processed data/Flow_processed/2015 processed/Paseo Padre Pkwy betw. Mission and Curtner.csv
parsing: Warren Ave betw. Curtner and Warm Springs.xls
False
/Users/LiJiayi/Fremont Dropb

#### Word doc data pre-processing
We process the Word DOC files into CSV files by calling the function process_doc_data(year) from the pre_process_flow.py Python file. Note that 2013 has no DOC files.

Description of function process_doc_data(year) <br>
The function has input:
- Year which takes values 2017 or 2019
- Word DOC files located in the "Year doc" folder where Year=2017 or 2019

The function has output:
- CSV files located in the "Year processed" folder where Year=2017 or 2019

For function to work:
- DOC files must be located in "Year doc" folder where Year=2017 or 2019

For pipeline to work:
- CSV files must be manually relocated to "City/Year reformat/Format from pdf" folder where Year=2017 or 2019

In [3]:
pre_process.process_doc_data(2017)
pre_process.process_doc_data(2019)

# TO DO:
- The following doc files from 2019 were not processed because the textract package did not parse them correctly. They can be found in the 2019 error folder.
    - DURHAM RD BT I-680 AND MISSION BLVD EB.doc
    - MISSION BLVD BT WASHINGTON BLVD AND PINES ST SB.doc
    
Theo: I might have done some work manually here to create the two corresponding csv files. If so, the manual process should be explained.


### B) Find the coordinates of the city detectors
We obtain the coordinates of the detectors by calling the function get_geo_data(year) from the pre_process_flow.py Python file. Internally, it iterates over the ADT files and extracts the addresses of the detectors, which are then passed to the Google API to obtain latitude and longitude coordinates.

Description of get_geo_data(year) <br>
The function has input:
- Year takes values 2013, 2017, 2019
- ADT files (Excel, PDF) located in "Year ADT Data"

The function has output:
- "year_info_coor.csv" containing the coordinates of detectors

For the function to work:
- ADT files must be located in "Year ADT Data" folder

For the pipeline to work:
- NA
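
The address-building step can be illustrated for file names of the form "Main Road betw. Cross 1 and Cross 2". The sketch below is an assumption about how such names could be parsed, not the code of get_geo_data; the regular expression and the `intersection_addresses` helper are hypothetical, and the geocoding call itself is left out because it needs an API key.

```python
import re

# Matches e.g. "Washington Blvd betw. Driscoll and Paseo Padre.xls".
NAME_PATTERN = re.compile(
    r"^(?P<main>.+?)\s+betw\.?\s+(?P<cross1>.+?)\s+(?:and|&)\s+(?P<cross2>.+?)"
    r"\.(?:xlsx|xls|csv)$",
    re.IGNORECASE,
)


def intersection_addresses(file_name, city="Fremont"):
    """Build the two "Main & Cross, City" addresses that bound a detector."""
    match = NAME_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"Unexpected file name: {file_name!r}")
    main = match.group("main")
    return [f"{main} & {match.group(cross)}, {city}"
            for cross in ("cross1", "cross2")]
```

Each returned string can then be sent to a geocoding service. File names that follow another convention (for example the "BT ... AND ..." form) would need their own pattern.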

**(DONE) TO DO 7**: Make sure that the "parsing data" code creates year_info_coor.csv. I think I created the files using some bash scripts and some Excel functions. Here is the Google doc that we used for processing the flow: https://docs.google.com/spreadsheets/d/1tcps-8aorPZLY8nswnNCmjWSJi-7ey8Ps4twWFz2ls0/edit#gid=0
This step is also very important because it is where I gave an ID to every detector. Please check this step with Theo to write out the process. We might need to write a Python function (instead of bash + Excel).
<br>
***Edson***: Geo data is now obtained by parsing the file data or the file name. That is, the address is extracted and then sent to the Google API to get the latitude and longitude.

***Theo remark***: Theo and Edson should discuss this to get the process right.


In [3]:
pre_process.get_geo_data(2013)

Obtaining geo data from 2013 ADT files
processing: Auto Mall Pkwy betw. Fremont & I680.xlsx
main road info: ('Auto Mall Pkwy betw. Fremont & I680.xlsx', 'Fremont', 'Auto Mall Parkway', 'Between Fremont and Osgood', 'Fremont', 'Osgood')
address:  Auto Mall Parkway & Fremont, Fremont
address w coord lat, lng Auto Mall Parkway & Fremont, Fremont 37.5076894 -121.9665398
address:  Auto Mall Parkway & Osgood, Fremont
address w coord lat, lng Auto Mall Parkway & Osgood, Fremont 37.5139048 -121.9426157
<_io.TextIOWrapper name='2013_info_coor.csv' mode='w' encoding='UTF-8'>


In [4]:
pre_process.get_geo_data(2015)

Obtaining geo data from 2015 ADT files
processing: Washington Blvd betw. Driscoll and Paseo Padre.xls
Washington Blvd betw. Driscoll and Paseo Padre.xls
main road info: ('Washington Blvd betw. Driscoll and Paseo Padre.xls', 'Fremont', 'Washington Blvd', 'Driscoll and Paseo Padre', 'Driscoll', 'Paseo Padre')
processing: Paseo Padre Pkwy betw. Mission and Curtner.xls
Paseo Padre Pkwy betw. Mission and Curtner.xls
main road info: ('Paseo Padre Pkwy betw. Mission and Curtner.xls', 'Fremont', 'Paseo Padre Pkwy', 'Mission and Curtner', 'Mission', 'Curtner')
processing: Warren Ave betw. Curtner and Warm Springs.xls
Warren Ave betw. Curtner and Warm Springs.xls
main road info: ('Warren Ave betw. Curtner and Warm Springs.xls', 'Fremont', 'Warren Ave', 'Curtner and Warm Springs', 'Curtner', 'Warm Springs')
processing: Mission Blvd betw. Durham and Curtner.xls
Mission Blvd betw. Durham and Curtner.xls
main road info: ('Mission Blvd betw. Durham and Curtner.xls', 'Fremont', 'Mission Blvd', 'Durham

In [5]:
pre_process.get_geo_data(2017)
pre_process.get_geo_data(2019)

Obtaining geo data from 2017 ADT files
processing: AUTO MALL PKWY BT FREMONT BLVD AND I680.xlsx
main road info: ('AUTO MALL PKWY BT FREMONT BLVD AND I680.xlsx', 'Fremont', 'Auto Mall Pkwy', 'Fremont Blvd And I680', 'Fremont Blvd', 'I680')
address:  Auto Mall Pkwy & Fremont Blvd, Fremont
address w coord lat, lng Auto Mall Pkwy & Fremont Blvd, Fremont 37.51209619999999 -121.9511975
address:  Auto Mall Pkwy & I680, Fremont
address w coord lat, lng Auto Mall Pkwy & I680, Fremont 37.5076894 -121.9665398
<_io.TextIOWrapper name='2017_info_coor.csv' mode='w' encoding='UTF-8'>
Obtaining geo data from 2019 ADT files
processing: Driscoll Rd Bet. Mission Blvd & Paseo Padre Pkwy.xls
main road info: ('Driscoll Rd Bet. Mission Blvd & Paseo Padre Pkwy.xls', 'Fremont', 'Driscoll Rd', 'Mission Blvd & Paseo Padre Pkwy', 'Mission Blvd', 'Paseo Padre Pkwy')
address:  Driscoll Rd & Mission Blvd, Fremont
address w coord lat, lng Driscoll Rd & Mission Blvd, Fremont 37.5497624 -121.9399602
address:  Driscoll 

### C) Check and adjust the locations of the (City + PeMS) detectors to match them to our network using ArcGIS <br>
Done manually in ArcGIS.
1. Export the Aimsun network as a GIS file
2. Import the Aimsun network in ArcGIS
3. Import the detectors in ArcGIS as XY_points
4. Move the detectors so that they lie on the corresponding road of the Aimsun network
5. Associate every detector with the External ID of the Aimsun road (to be done again)

**TO DO THEO**: Add the process to create the detectors inside Aimsun.
Add the process to match the detectors to road sections (and create the file lines_to_detectors.xlsx).


### Later to do: do the spatial join in python
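
As a starting point for this to-do, the sketch below matches a detector to the nearest road section using great-circle distance. It is a strong simplification: each section is reduced to a single midpoint, whereas a real spatial join would use the full line geometry (for instance with a GIS library). The function names, the input shapes and the 50 m threshold are all assumptions.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    radius = 6_371_000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))


def nearest_section(detector, sections, max_distance_m=50.0):
    """Return the external ID of the closest section, or None if too far.

    detector: (lat, lon); sections: {external_id: (lat, lon) of midpoint}.
    """
    best_id, best_dist = None, float("inf")
    for section_id, (lat, lon) in sections.items():
        dist = haversine_m(detector[0], detector[1], lat, lon)
        if dist < best_dist:
            best_id, best_dist = section_id, dist
    return best_id if best_dist <= max_distance_m else None
```

Returning `None` beyond the threshold keeps a misplaced detector from being silently attached to a distant road.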

## 3. Process Data

In [2]:
import process_flow as pf

### A) Process the csv files (city + caltrans) to one big file. <br>
source code: processing flow to one CSV.ipynb <br>

We combine all the flow traffic data into one big CSV file from both city and PeMS data files. This is done by calling the function process_data() from the process_flow.py python file. 

Description of function process_data()
<br>
The function has input:
- "Flow_processed_tmp.csv" file that lists all the processed files from city and PeMS data
- The processed files created from the Parsing Data section

The function has output:
- "Flow_processed_city.csv" containing combined city flow data for all years
- "Flow_processed_PeMS.csv" containing combined PeMS flow data for all years

For the function to work:
- The processed files (input) must be located in the City and PeMS folders
- 2013 city processed files are located in "City/2013 reformat/"
- 2017 and 2019 city processed files that originated from DOC (which originated from PDF) files are located in "City/Year reformat/Format from pdf" folder where Year=2017 or 2019
- 2017 and 2019 city processed files that originated from Excel files are located in "City/Year reformat/Format from xlsx" folder where Year=2017 or 2019
- 2013, 2017 and 2019 PeMS data files are located in "PeMS_Year" folder where Year=2013, 2017 or 2019

For the pipeline to work:
- The output files must remain in the working directory; no moving is necessary.

Structure of output files: 
- Flow_processed_city.csv
    - Contains city traffic data where the rows represent traffic flow. The first 5 columns describe the traffic flow: Year, Name, Id, Direction and Day 1. Name is the name of the file from which the data originated, Id is the Id from the "Flow_processed_tmp.csv" file, Direction is the direction of flow and Day 1 is the start date of the recording. The columns that follow are day-timesteps of flow data. Traffic flow is recorded over 3 days in total, and time progresses in 15-minute steps. Hence the data columns progress as "Day 1 - 0:0", "Day 1 - 0:15", "Day 1 - 0:30",...,"Day 3 - 23:30", "Day 3 - 23:45".
- Flow_processed_PeMS.csv
    - Contains PeMS traffic flow data where the rows represent traffic flow. The first columns are Name, Id and Name PeMS, where Name contains the PeMS detector Id, Id is the Id assigned in "Flow_processed_tmp.csv" and Name PeMS is the road address. The next 6 columns give Observed Year and Day Year for the 3 years 2013, 2017 and 2019: Observed Year is the percentage of observed data and Day Year is the start date of the recording. The columns that follow are Year-Day-timestep: there are 3 years and 3 days, and time progresses in 15-minute steps. Hence the columns progress as "2013-Day 1 - 0:0", "2013-Day 1 - 0:15", "2013-Day 1 - 0:30",...,"2019-Day 3 - 23:30", "2019-Day 3 - 23:45".
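
The column naming described above can be reproduced with a few lines of Python, which is handy for checking that a generated file has the expected header. This is a sketch based only on the label examples given in this section; the function names are hypothetical.

```python
def time_step_labels(days=3, step_minutes=15):
    """Build labels "Day 1 - 0:0", "Day 1 - 0:15", ..., "Day 3 - 23:45"."""
    labels = []
    for day in range(1, days + 1):
        for minutes in range(0, 24 * 60, step_minutes):
            hour, minute = divmod(minutes, 60)
            # Hours and minutes are not zero-padded, as in the examples.
            labels.append(f"Day {day} - {hour}:{minute}")
    return labels


def pems_labels(years=(2013, 2017, 2019)):
    """Prefix every day-timestep label with its year, e.g. "2013-Day 1 - 0:0"."""
    return [f"{year}-{label}" for year in years for label in time_step_labels()]
```

With 96 steps per day, the city file would carry 288 data columns and the PeMS file 864.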

**(DONE) TO DO 8**: Explain the structure of the output files. Also, feel free to add documentation in the Python file (or IPython file). Also explain the input (Flow_processed_tmp.csv) and how it was created (I think it was created from the Google doc https://docs.google.com/spreadsheets/d/1tcps-8aorPZLY8nswnNCmjWSJi-7ey8Ps4twWFz2ls0/edit#gid=0).

***Edson Question***: The Flow_processed_tmp.csv file and the Google doc look the same to me (except for the lat, lng info on the right side of the Google doc). Beyond this, I don't know how the file was created. Whom should I ask for more info?

***Theo Answer***: I think I created the file from the Google doc, but I also created the Google doc during the parsing. In section 2 we can probably create Flow_processed_tmp.csv from year_info.csv.

# added 2015 year

In [3]:
import pandas as pd
import math

In [4]:
line = '69,Washington Blvd betw. Driscoll and Paseo Padre.xls'
year = 2015
id_flow, title = line.split(",")
title = title.replace('\n', '')


In [5]:
pf.process_data()

Id,./2013 ADT Data:

1,Auto Mall Pkwy betw. Fremont & I680.xlsx

3,Driscoll Rd betw. Mission & PPP.xlsx

5,Driscoll Rd betw. PPP & Washington.xlsx

7,Durham Rd.xlsx

9,EB Washington Blvd at Gallegos.xls

11,EB Washington Blvd at Palm.xls

13,EB Washngton Blvd at Olive.xls

15,Mission Blvd betw. Driscoll & I680.xlsx

17,Mission Blvd betw. Durham & I680.xlsx

19,Mission Blvd betw. I680 & I880.xlsx

21,Mission Blvd betw. I680 & Washington.xlsx

23,Mission Blvd betw. Pine & Durham.xlsx

25,Mission Blvd betw. Washington & Pine.xlsx

27,Osgood Blvd betw. Auto Mall & Grimmer.xlsx

29,Osgood Rd betw. Washington & Auto Mall.xlsx

31,Paseo Padre Pkwy betw. Durham & S. Grimmer.xlsx

33,Paseo Padre Pkwy betw. Mission & Curtner.xlsx

35,Paseo Padre Pkwy betw. S. Grimmer & Mission.xlsx

37,Paseo Padre Pkwy betw. Washington & Durham.xlsx

39,S. Grimmer Blvd betw PPP & Osgood.xlsx

41,S. Grimmer Blvd betw. Osgood & Fremont.xlsx

43,WB Washington Blvd at Ellsworth.xls

45,WB Washington Blvd at Galle

### B) Create file that gives traffic flows for specific road sections for every year. <br>
source note: put years together.ipynb

- Use the detectors (lines_to_detectors.csv) and the processed city flow (Flow_processed_city.csv) data to create the flow data for all years in "flow_processed_section.csv".
- Note that the erroneous files are still being skipped, they are: ['DurhamRd I680 MissionBlv EB', 'Mission blvd Pine Washington SB']

# To do 9: the pytogether function should be rewritten, and 2015 should be added
- add the PeMS data
- be more clever about missing data or road section associated with several detectors (take the average)
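
The averaging idea in the last item can be sketched as follows. The representation is an assumption: each detector contributes one list of flows per time step, with `None` marking a missing value, and `average_flows` is a hypothetical helper rather than part of process_years_together.py.

```python
def average_flows(detector_series):
    """Average several detectors' flow series, ignoring missing values.

    detector_series: list of equal-length lists; None marks a missing value.
    Returns one list; a time step is None when every detector is missing it.
    """
    if not detector_series:
        return []
    length = len(detector_series[0])
    if any(len(series) != length for series in detector_series):
        raise ValueError("All series must have the same number of time steps")
    averaged = []
    for values in zip(*detector_series):
        present = [v for v in values if v is not None]
        averaged.append(sum(present) / len(present) if present else None)
    return averaged
```

Keeping `None` for time steps with no data at all, instead of writing 0, avoids confusing a missing measurement with an empty road.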

### To do later, do the spatial join in python

In [6]:
import process_years_together as pytogether

line_to_detectors = 'lines_to_detectors.csv'
flow_processed_city = 'Flow_processed_city.csv'
pytogether.run(line_to_detectors, flow_processed_city)

[4 0.014691227 'Mission blvd WmSpr PPP SB' 'SB' '18' '84' '138']
(18, 84, 138)
[8 0.014910005 'Mission blvd WmSpr PPP SB' 'NB' '17' '83' '137']
(17, 83, 137)
[11 0.011642008 'AutoMailPkwy  FremontBlvd i680' 'WB' '2' '65' '122']


IndexError: index 0 is out of bounds for axis 0 with size 0

## 4. Data Analysis

# TO DO: To be done after the other to-dos have been done.

A) PCA

B) Analyse PCA results using heatmap inside ArcGIS