# Flow Processing Pipeline


## Goal of the pipeline

<font color=red> The goal of this notebook is to associate the raw ADT ground data for 2013, 2015, 2017 and 2019, and the raw speed ground data for 2015, with the Aimsun network in order to calibrate Aimsun simulations. First, the raw data is processed into a common format. Then, every detector is associated with a network link inside Aimsun. Finally, every detector is associated with a road section to create heatmaps and understand how flows evolve over the years.</font>


# TO DO: write this in a better way
### Outputs of the pipeline: 
- One file matching detectors location to Aimsun road section
- One file with the processed flow data for 2013, 2015, 2017, 2019
- One file with the processed speed data for 2015
- One file with the flow data corresponding to road sections for 2013, 2015, 2017 and 2019
- PCA on flow data

### Inputs of the pipeline: 
**Raw data**
- PeMS account [_publicly available_]. An account can be created at http://pems.dot.ca.gov
- PeMS detectors location [_publicly available_]. In the dropbox under `Demand/Flow_speed/Traffic\ flow\ studies/PeMS`
- City average annual daily traffic (AADT) data for 2013, 2017 and 2019 [_given by the city of Fremont_]. Located in the dropbox under `Demand/Flow_speed/Traffic\ flow\ studies/City`
- Kimley-Horn flow and speed data for 2015 [_given by the city of Fremont_]. Located in the dropbox under `Demand/Flow_speed/Traffic\ flow\ studies/Kimley\ Horn\ Data`

**Manually made dataset**
- Aimsun network
- Detectors location
- Road section layer
- Doc files of city ADT data corresponding to the PDF files
- Detectors ID to corresponding flow file name


### Temporary files of the pipeline
- CSV flow data
    - City and Kimley Horn
    - PeMS data
- ...

### Dependent scripts: 
- [pems_download.py](https://github.com/Fremont-project/data-processing/blob/master/fremontdropbox.py): this script automatically downloads PeMS data for the chosen dates.
- [pre_process_flow.py](https://github.com/Fremont-project/data-processing/blob/master/fremontdropbox.py): this script (A) parses the ADT data from xlsx, doc and csv files into csv files, (B) finds the coordinates of the city detectors, and (C) checks and adjusts the locations of the (City + PeMS) detectors to match them to our network using ArcGIS.
- [process_flow.py](https://github.com/Fremont-project/data-processing/blob/master/fremontdropbox.py): this script combines all the traffic flow data from both city and PeMS data files into one big CSV file.
- [process_years_together.py](https://github.com/Fremont-project/data-processing/blob/master/fremontdropbox.py): this script combines the previous scripts and processes the data for 2013, 2015, 2017 and 2019.
- fremontdropbox_flow

### Dependent libraries:
- os
- webbrowser
- time
- requests
- pathlib
- textract
- numpy
- To do

### Work done by the script
1. Obtaining Data
    - Turn city pdf to doc
    - PeMS data download
2. Preprocessing city and Kimley-Horn data
    - Parsing the flow data [We are here]
    - Parsing the speed data
    - Geocoding location of the detectors
    - Manually update the location of the detectors
3. Processing the data
    - Creating one file with the flow for all detectors
    - Creating a layer of road section
    - Matching the detectors to road section
    - Creating one file with the flow over time for every road section
    - Matching the detectors with the corresponding road section in Aimsun
4. Exporting the data
    - Exporting a csv file where Aimsun road sections are matched with detectors
    - Exporting a csv file with flow for the different road sections
    - Exporting traffic flow heatmap
    - Exporting speed heatmap
    - PCA on the traffic flow heatmap over years

### TO DO:
- add exception handling to the process (and to the Python functions) for files that are not located where the code looks for them.
- run PCA and create heatmap with ArcGIS (heatmap will be done by Theo)
- put the correct github link for the scripts in the Dependent scripts paragraph
- Work done by the script
- Dependent libraries
- Put the scripts together in one main script
- add link from work done by the script to section of the iPython notebook

In [1]:
# import os
# import sys
# # Tell this notebook where to look for the fremontdropbox module
# module_path = os.path.abspath(os.path.join('../..'))
# if module_path not in sys.path:
#     sys.path.append(module_path)

from fremontdropbox_flow import get_dropbox_location

dropbox_dir = get_dropbox_location()

## 1. Obtaining Data

### A) PeMS Data Download 
    
Script: pems_download.py <br>
Download traffic data from the PeMS website (pems.dot.ca.gov) for the years 2013, 2015, 2017 and 2019. <br>

We download the data by calling the function download(year, detector_ids, PeMS_dir) from the pems_download.py file. The process takes about 20 minutes to run. 
Run `help(pems.download)` to see its documentation.

### B) Turn city pdf to doc

Some of the flow data that we got from the city came as pdf files. To be able to parse them, we convert them to doc files using an online converter. This must be done before running the code: the doc files are inputs of the pipeline.

### C) Location and structure of the files

#### City files

The city files are in folders named "Year ADT Data", where Year is 2013, 2015, 2017 or 2019. Almost every file is named using the convention "Main Road Cross 1 Cross 2 Direction". Main Road is the road on which the flow is recorded. Cross 1 and Cross 2 locate the sensor, which sits between the intersection of the main road with Cross 1 and the intersection of the main road with Cross 2. Direction gives the direction of the flow, such as "EB" for eastbound. The files in a "Year ADT Data" folder are of type .xls, .xlsx, .csv or .pdf.
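The naming convention above can be turned into a small parser. The sketch below is only an illustration: it assumes the separators "BT" and "AND" seen in the two doc file names quoted later in this notebook (for example "DURHAM RD BT I-680 AND MISSION BLVD EB.doc"), and `parse_adt_filename` and `AdtFileInfo` are hypothetical names that are not part of pre_process_flow.py. Since only "almost every" file follows the convention, the function returns None for names that do not match instead of raising.

```python
import re
from pathlib import Path
from typing import NamedTuple, Optional


class AdtFileInfo(NamedTuple):
    main_road: str
    cross_1: str
    cross_2: str
    direction: str


# Assumed pattern: "<Main Road> BT <Cross 1> AND <Cross 2> <Direction>".
_ADT_NAME = re.compile(
    r"^(?P<main>.+?) BT (?P<cross1>.+?) AND (?P<cross2>.+?) (?P<dir>NB|SB|EB|WB)$",
    re.IGNORECASE,
)


def parse_adt_filename(filename: str) -> Optional[AdtFileInfo]:
    """Return the parts of an ADT file name, or None if it does not match."""
    match = _ADT_NAME.match(Path(filename).stem.strip())
    if match is None:
        return None
    return AdtFileInfo(
        match["main"], match["cross1"], match["cross2"], match["dir"].upper()
    )
```

Files for which the function returns None would still need to be handled by hand or by a second pattern.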

- ***2013 Excel files*** are structured in data sheets. The first sheet, "Summary", contains the main road, cross streets, city information and the start date of the recording. It also summarizes the data from all other sheets in a bar plot of traffic flow vs. time-of-day bins (e.g. Tuesday AM, Wednesday PM) for the different flow directions, and in a line plot of traffic flow vs. hour of day for the different days of the week. The sheets that follow are named "D1", "D2", ..., "DN", where N denotes the Nth day since the start date. Each of these sheets holds two tables, AM counts and PM counts. Each table row gives the traffic flow for one 15-minute timestep. The first column is the time of day in hh:mm format, followed by one traffic flow column per direction (NB, SB, EB, WB).
***Theo remark***: there are hidden sheets! This is very important for the parsing!


- ***2015 Excel files*** come from Kimley Horn. Every Excel file has 6 relevant sheets, including the hidden sheets: ['ns Day 1', 'ns Day 2', 'ns Day 3', 'ew Day 1', 'ew Day 2', 'ew Day 3']. Depending on the direction of the road, one set of sheets (ns or ew) is filled and the other set is empty. The parsing algorithm therefore first determines the flow direction, which decides which set of sheets to parse. Each sheet has two tables side by side: Northbound (Eastbound) and Southbound (Westbound). In each table, the AM and PM flows are also side by side, sharing the same column. The algorithm therefore extracts the AM and PM flows first and stacks them to get the traffic flow for the whole day.
***Theo remark***: in 2015, the only important hidden sheets are the data sheets!


- ***2017 Excel files*** are structured in one data sheet with a header and a table of traffic flow. The header gives the start date and time of the recording, the site code and the sensor location. The table gives the traffic flow per 15-minute timestep: its first two columns give the date and time, and the following columns give the traffic flow per direction.

- ***2017 PDF files*** are structured with a header and 3 tables of traffic flow data (one table per day, for consecutive days). The header gives the site location and other miscellaneous metadata. Each table is titled with the date and timestep (15 minutes) of the recording. A table is organized in columns, each representing one hour of the day (0 - 23). For a given column, the first row gives the hour of the day, the second row gives the total flow for the hour, and the third through last rows (4 rows in total) give the traffic flow per 15-minute timestep within the hour.

- ***2019 Excel files*** have a structure similar to those of 2013. The data is organized in two types of sheets, "Day N" and "GR N". The "Day N" sheets give traffic flow data in the same fashion as the "DN" sheets of the 2013 Excel files. The day of recording can be found in the header of the two tables. The "GR N" sheets plot the flow data of the corresponding "Day N" sheets as a line plot of flow vs. hour of day for the different flow directions.
***Theo remark***: Same here, there are hidden sheets!

- ***2019 PDF files*** have the same structure as those of 2017. 
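
To make the PDF table layout concrete, here is a minimal sketch that flattens one daily table (hours as columns, then hourly totals, then four rows of 15-minute counts) into a chronological series. It assumes the table has already been extracted from the doc file into six rows of integers, and it uses the hourly total as a consistency check on the parsing. `flatten_hourly_table` is a hypothetical helper, not a function of pre_process_flow.py, and the zero-padded hh:mm labels are a choice made for this example.

```python
from typing import List, Sequence, Tuple


def flatten_hourly_table(rows: Sequence[Sequence[int]]) -> List[Tuple[str, int]]:
    """Turn a 6-row hourly table into a chronological list of 15-minute counts.

    rows[0] holds the hours, rows[1] the hourly totals and rows[2:6] the
    four 15-minute counts of each hour.
    """
    if len(rows) != 6:
        raise ValueError(f"expected 6 rows, got {len(rows)}")
    hours, totals, quarters = rows[0], rows[1], rows[2:6]
    series = []
    for col, hour in enumerate(hours):
        counts = [quarter_row[col] for quarter_row in quarters]
        # The hourly total is redundant, so use it to detect parsing errors.
        if sum(counts) != totals[col]:
            raise ValueError(
                f"hour {hour}: quarters sum to {sum(counts)}, total says {totals[col]}"
            )
        for i, count in enumerate(counts):
            series.append((f"{hour:02d}:{15 * i:02d}", count))
    return series
```

A full 24-column table yields 96 entries, which matches the 15-minute resolution of the Excel files.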

In [2]:
import pems_download as pems
help(pems.download)

Help on function download in module pems_download:

download(year, detector_ids, PeMS_dir)
    This function downloads traffic data from the PeMS website (pems.dot.ca.gov).
    This function has for input:
        - PeMS detectors ID: detector_ids (an array of detectors)
        - Year for the desired data: year (one year as a integer, should be 2013, 2015, 2017 or 2019)
    
    This function has for output:
        - All corresponding PeMS detectors data file for the given year (and the given days encoded in the url).
        - Stored in the download folder as PeMS_dir/PeMS_year/PeMS-ID_YEAR.xlsx (where PeMS-ID is the detector ID given by PeMS).
        One xlsx file has two sheets:
            - PeMS Report Description
            - Report Data
                - Contains the traffic flow data
                - Each row gives the number of vehicles observed in one time step (5 minutes) per lane number over the columns.
                - The first column gives the date and time stamp,

In [3]:
# ************* IMPORTANT *************
# --> For this cell to work, you need to log in to PeMS in the same browser that runs this Jupyter notebook

## The IDs of the PeMS detectors were obtained using ArcGIS and an input file, pems_detectors.csv, containing the locations of all the PeMS detectors in California
## this should be done in Python!
detector_ids = [403250, 403256, 403255, 403257, 418387, 418388, 400376,
               413981, 413980, 413982, 402794, 413983, 413984, 413985,
               413987, 413986, 402796, 413988, 402799, 403251, 403710,
               403254, 403719, 400566, 418420, 418419, 418422, 418423,
               402793, 403226, 414015, 414016, 402795, 402797, 414011,
               402798]
PeMS_dir = dropbox_dir + '/Private Structured data collection/Data processing/Auxiliary files/Demand/Flow_speed/PeMS'

# pems.download(2013, detector_ids, PeMS_dir)
# pems.download(2015, detector_ids, PeMS_dir)
# pems.download(2017, detector_ids, PeMS_dir)
# pems.download(2019, detector_ids, PeMS_dir)

# To do:
- Find the list of detector IDs using the project delimitation shapefile.
- Write a test to check that all the files are correct: check the start and end dates in the PeMS Report Description sheet of every file
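
A first step toward the test in the second item could be a presence check on the downloaded files. The sketch below assumes that the path pattern from the `download` help text, PeMS_dir/PeMS_year/PeMS-ID_YEAR.xlsx, expands to names such as `PeMS_2013/403250_2013.xlsx`; that reading is a guess and should be compared with the actual folder. `missing_pems_files` is a hypothetical helper, and it only checks that the files exist, not the dates in the PeMS Report Description sheet.

```python
from pathlib import Path
from typing import Iterable, List


def missing_pems_files(pems_dir: str, year: int, detector_ids: Iterable[int]) -> List[Path]:
    """List the expected PeMS workbooks that are absent from pems_dir."""
    # Assumed layout: <pems_dir>/PeMS_<year>/<detector id>_<year>.xlsx
    year_dir = Path(pems_dir) / f"PeMS_{year}"
    expected = [year_dir / f"{detector_id}_{year}.xlsx" for detector_id in detector_ids]
    return [path for path in expected if not path.is_file()]
```

Calling `missing_pems_files(PeMS_dir, 2013, detector_ids)` after a download should return an empty list if every detector file was saved.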

In [4]:
# ************* TO DO *************
## Write a test function to make sure that all the downloads are correct

## 2. Parsing city data


### A) Parse city and Kimley Horn flow data from xlsx, doc and csv files to csv files

In [5]:
import pre_process_flow as pre_process
print(help(pre_process.process_adt_data))

Help on function process_adt_data in module pre_process_flow:

process_adt_data(year, Processed_dir, Input_dir)
    This function processes the Excel and PDF ADT data files (city data) into CSV files. Note that one file corresponds to one main road and the traffic flow data recordings in it.
    
    This function has input:
        - year which takes values 2013, 2015, 2017 or 2019
        - Processed_dir: path to the output
        - Input_dir: path to the inputs
    
    The function has output:
        - CSV files located in the Processed_dir/Year_processed/ folder where Year=2017 or 2019
    
    For function to work:
        - Files should be located in 
            1. Input_dir/Year\ EXT/ folder if Year=2013, 2017 or 2019 where:
                a. Year=2013, 2017 or 2019 if Ext=ADT Data 
                b. Year=2017 or 2019 if Ext=doc for 2017 and 2019
            2. Input_dir/Raw\ data/ folder if Year=2015

None


# To do
- Redo the parsing so that all the files look the same. The time format should be the same everywhere.
- Remove the unnecessary functions in pre_process
- Take care of the exceptions for
    - DURHAM RD BT I-680 AND MISSION BLVD EB.doc 
    (Change the doc manually to be able to parse it)
    - MISSION BLVD BT WASHINGTON BLVD AND PINES ST SB.doc
    (Change the doc manually to be able to parse it)
- Write a test to check that the parsing of the doc files is correct

In [6]:
ADT_dir = dropbox_dir + '/Private Structured data collection/Data processing/Raw/Demand/Flow_speed'
Processed_dir = dropbox_dir + '/Private Structured data collection/Data processing/Auxiliary files/Demand/Flow_speed/Flow_processed'
City_dir = ADT_dir + "/City"
Kimley_Horn_flow_dir = ADT_dir + "/Kimley Horn Data"

# pre_process.process_adt_data(2013, Processed_dir, City_dir)
# pre_process.process_adt_data(2015, Processed_dir, Kimley_Horn_flow_dir)
# pre_process.process_adt_data(2017, Processed_dir, City_dir)
pre_process.process_adt_data(2019, Processed_dir, City_dir)


ValueError: could not convert string to float: 


### B) Find the coordinates of the city detectors
We obtain the coordinates of the detectors by calling the function get_geo_data(year) from the pre_process_flow.py file (imported as pre_process). Internally, it iterates over the ADT files and extracts the addresses of the detectors, which are then sent to the Google API to obtain latitude and longitude coordinates.

Description of get_geo_data(year) <br>
The function has input:
- Year takes values 2013, 2015, 2017, 2019
- ADT files (Excel, PDF) located in "Year ADT Data"

The function has output:
- "year_info_coor.csv" containing the coordinates of detectors

For the function to work:
- ADT files must be located in "Year ADT Data" folder

For the pipeline to work:
- NA
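
As an illustration of the address-to-coordinates step, here is a minimal sketch using only the standard library. The notebook only says "Google API", so the use of the Google Geocoding web service endpoint is an assumption, and get_geo_data may well rely on a client library instead. `geocode` and `extract_lat_lng` are hypothetical names, and the API key has to be supplied by the caller.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional, Tuple

# Assumed endpoint: the Google Geocoding web service.
GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def extract_lat_lng(payload: dict) -> Optional[Tuple[float, float]]:
    """Return (lat, lng) of the first result, or None if nothing was found."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


def geocode(address: str, api_key: str) -> Optional[Tuple[float, float]]:
    """Query the geocoding endpoint for one address (needs network access)."""
    query = urllib.parse.urlencode({"address": address, "key": api_key})
    with urllib.request.urlopen(f"{GEOCODE_URL}?{query}", timeout=10) as response:
        return extract_lat_lng(json.load(response))
```

Keeping the response parsing in its own function makes it possible to test that part without network access or an API key.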

**(DONE) TO DO 7**: make sure that the "parsing data" code creates year_info_coor.csv. I think that I created the files using some bash scripts and some Excel functions. Here is the google doc that we used for processing the flow: https://docs.google.com/spreadsheets/d/1tcps-8aorPZLY8nswnNCmjWSJi-7ey8Ps4twWFz2ls0/edit#gid=0
Also, this step is very important because this is where I gave an ID to every detector. Please check this step with Theo to write out the process. We might need to write a function in Python (instead of bash + Excel).
<br>
***Edson***: Geo data is now obtained by parsing the file data or the file name. That is, the address is extracted and then used with the Google API to get the latitude and longitude.

***Theo remark***: Theo and Edson should discuss this to get the process right.

# To do:
- The google doc should be created during the parsing.

In [7]:
pre_process.get_geo_data(2013)
pre_process.get_geo_data(2015)
pre_process.get_geo_data(2017)
pre_process.get_geo_data(2019)

Obtaining geo data from 2013 ADT files


NameError: name 'ADT_dir' is not defined

# To do:
- Make the function above work (do not forget to also get the location for doc files)

# Theo: I am here

### C) Check and adjust the locations of the (City + PeMS) detectors to match them to our network using ArcGIS <br>
Done manually in ArcGIS.
1. Export Aimsun network as GIS file
2. Import Aimsun network in ArcGIS
3. Import detectors in ArcGIS as XY_points
4. Move detectors to put them on corresponding road in Aimsun
5. Associate every detector with the External ID of its Aimsun road (to be done again)

**TO DO THEO**: Add the process to create the detectors inside Aimsun.
Add the process to match the detectors to road sections (and create the file lines_to_detectors.xlsx).


### Later to do: do the spatial join in python
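
One possible starting point for a Python version of the spatial join is a nearest-segment search. The sketch below is only an illustration under strong assumptions: detector and road coordinates are already in a projected (planar) coordinate system, and each road section is available as a list of points. `point_segment_distance` and `nearest_section` are hypothetical names. A production version would more likely use a GIS library with a spatial index.

```python
import math
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float]


def point_segment_distance(p: Point, a: Point, b: Point) -> float:
    """Euclidean distance from point p to segment a-b in planar coordinates."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to its endpoints.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def nearest_section(detector: Point, sections: Dict[str, Sequence[Point]]) -> Tuple[str, float]:
    """Return (section id, distance) of the polyline closest to the detector."""
    best_id, best_dist = None, math.inf
    for section_id, polyline in sections.items():
        for a, b in zip(polyline, polyline[1:]):
            dist = point_segment_distance(detector, a, b)
            if dist < best_dist:
                best_id, best_dist = section_id, dist
    if best_id is None:
        raise ValueError("no section with at least two points")
    return best_id, best_dist
```

Returning the distance along with the section id makes it easy to flag detectors that are far from any road and still need a manual check.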

## 3. Process Data

In [None]:
import process_flow as pf

### A) Process the csv files (city + caltrans) to one big file. <br>
source code: processing flow to one CSV.ipynb <br>

We combine all the flow traffic data into one big CSV file from both city and PeMS data files. This is done by calling the function process_data() from the process_flow.py python file. 

Description of function process_data()
<br>
The function has input:
- "Flow_processed_tmp.csv" file that lists all the processed files from city and PeMS data
- The processed files created from the Parsing Data section

The function has output:
- "Flow_processed_city.csv" containing combined city flow data for all years
- "Flow_processed_PeMS.csv" containing combined PeMS flow data for all years

For the function to work:
- The processed files (input) must be located in the City and PeMS folders
- 2013 city processed files are located in "City/2013 reformat/"
- 2017 and 2019 city processed files that originated from DOC (which originated from PDF) files are located in "City/Year reformat/Format from pdf" folder where Year=2017 or 2019
- 2017 and 2019 city processed files that originated from Excel files are located in "City/Year reformat/Format from xlsx" folder where Year=2017 or 2019
- 2013, 2017 and 2019 PeMS data files are located in "PeMS_Year" folder where Year=2013, 2017 or 2019

For the pipeline to work:
- The output files must remain in the working directory; they do not need to be moved.

Structure of output files: 
- Flow_processed_city.csv
    - contains city traffic data where the rows represent traffic flow. The first 5 columns describe the traffic flow: Year, Name, Id, Direction and Day 1, where Name is the name of the file from which the data originated, Id is the Id from the "Flow_processed_tmp.csv" file, Direction is the direction of flow and Day 1 is the start date of the recording. The columns that follow are day-timesteps of flow data. Traffic flow is recorded over 3 days in 15-minute steps, so the data columns progress as "Day 1 - 0:0", "Day 1 - 0:15", "Day 1 - 0:30",...,"Day 3 - 23:30", "Day 3 - 23:45".
- Flow_processed_PeMS.csv
    - contains PeMS traffic flow data where the rows represent traffic flow. The first columns are Name, Id and Name PeMS, where Name contains the PeMS detector Id, Id is the Id assigned in "Flow_processed_tmp.csv" and Name PeMS is the road address. The next 6 columns give Observed Year and Day Year for each of the 3 years 2013, 2017 and 2019, where Observed Year is the percentage of observed data and Day Year is the start date of the recording. The columns that follow are Year-Day-timestep: there are 3 years and 3 days, and time progresses in 15-minute steps. Hence the columns progress as "2013-Day 1 - 0:0", "2013-Day 1 - 0:15", "2013-Day 1 - 0:30",...,"2019-Day 3 - 23:30", "2019-Day 3 - 23:45".
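
The column layout described above can be reproduced with a short sketch, which may help when checking the header of the output files. It assumes that hours and minutes are written without zero padding, as in the quoted labels "Day 1 - 0:0" and "Day 3 - 23:45". `day_time_columns` and `pems_columns` are hypothetical helpers, not functions of process_flow.py.

```python
from typing import Iterable, List


def day_time_columns(n_days: int = 3, step_minutes: int = 15) -> List[str]:
    """Build labels such as "Day 1 - 0:0", "Day 1 - 0:15", ..., "Day 3 - 23:45"."""
    labels = []
    for day in range(1, n_days + 1):
        for minutes in range(0, 24 * 60, step_minutes):
            hour, minute = divmod(minutes, 60)
            # Assumption: no zero padding, following the labels quoted above.
            labels.append(f"Day {day} - {hour}:{minute}")
    return labels


def pems_columns(years: Iterable[int] = (2013, 2017, 2019)) -> List[str]:
    """Prefix the day-time labels with each year, as in the PeMS output file."""
    return [f"{year}-{label}" for year in years for label in day_time_columns()]
```

With the defaults this gives 288 data columns for the city file (3 days of 96 steps) and 864 for the PeMS file (3 years of 288 columns).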

**(DONE) TO DO 8**: Explain the structure of the output files. Also, feel free to document the doc in the python file (or iPython file). Explain also the input (Flow_processed_tmp.csv) and how it was created (I think it was created from the google doc https://docs.google.com/spreadsheets/d/1tcps-8aorPZLY8nswnNCmjWSJi-7ey8Ps4twWFz2ls0/edit#gid=0).

***Edson Question***: the Flow_processed_tmp.csv file and the google doc seem the same to me (except for the lat, lng info on the right side of the google doc). Beyond this, I don't know how the file was created. Who to ask for more info?

***Theo Answer***: I guess I have created the file from the google doc. But I have also created the google doc during the parsing. Probably in 2) we can create Flow_processed_tmp.csv from year_info.csv

# added 2015 year

In [None]:
import process_flow as pf
pf.process_data()

### B) Create file that gives traffic flows for specific road sections for every year. <br>
source note: put years together.ipynb

- Use detectors (lines_to_detectors.csv) and flow processed city (flow_processed_city.csv) data to create all years flow data in "flow_processed_section.csv"
- Note that the erroneous files are still being skipped, they are: ['DurhamRd I680 MissionBlv EB', 'Mission blvd Pine Washington SB']

# To do 9: the function pytogether should be written again. + add 2015
- add the PeMS data
- be more clever about missing data or road section associated with several detectors (take the average)
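
The averaging idea in the last item could look like the sketch below. It assumes that each detector's flow is a list with one value per time step, with None for missing data, and that a mapping from detector ID to road section is available (as in lines_to_detectors.csv). `average_section_flows` is a hypothetical function, not part of process_years_together.py.

```python
import math
from collections import defaultdict
from typing import Dict, List, Optional, Sequence


def average_section_flows(
    detector_to_section: Dict[str, str],
    detector_flows: Dict[str, Sequence[Optional[float]]],
) -> Dict[str, List[float]]:
    """Average, per time step, the flows of all detectors on each road section.

    Missing values (None) are ignored; a time step with no valid value
    becomes NaN. Detectors without flow data are skipped.
    """
    grouped = defaultdict(list)
    for detector_id, section_id in detector_to_section.items():
        if detector_id in detector_flows:
            grouped[section_id].append(detector_flows[detector_id])

    averaged = {}
    for section_id, series_list in grouped.items():
        averaged[section_id] = []
        # zip(*...) walks through the time steps of all detectors in parallel.
        for values in zip(*series_list):
            valid = [v for v in values if v is not None]
            averaged[section_id].append(sum(valid) / len(valid) if valid else math.nan)
    return averaged
```

Using NaN for time steps without any valid value keeps the gap visible in the output instead of silently reporting a flow of zero.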

### To do later, do the spatial join in python

In [None]:
import process_years_together as pytogether

line_to_detectors = 'lines_to_detectors.csv'
flow_processed_city = 'Flow_processed_city.csv'
pytogether.run(line_to_detectors, flow_processed_city)

## 4. Data Analysis

# TO DO: To be done after the other to-dos have been done.

A) PCA

B) Analyse PCA results using heatmap inside ArcGIS