# Introduction
This notebook retrieves public transit data from TDX. Because obtaining the complete bilateral public travel data takes several steps, a notebook demonstrates the whole process more clearly than Python script files.

Specifically, there are three steps:
1. Fetch the data from TDX for all of the bilateral pairs (the pairs are split into several files because of API service limits), and stack the results.
2. Check for missing travel times in either direction. If such pairs exist, collect them and rerun with different departure times.
3. Some pairs will still have no entry. These are handled in two ways: filled with walking time (if the walking time is less than 30 minutes) or looked up manually on Google Maps (there should be few enough that this is feasible).

Finally, merging all the data gives the public travel time data. Optionally, the result can be turned into a matrix.

# Retrieve Data From TDX
The routing service can only be accessed through API calls, so we briefly introduce the process.
First, we use our "client ID" and "client secret" to obtain an access token, which lasts for 24 hours. We then attach this access token to each API request for a routing result.
In our implementation, for storage efficiency, we fetch the results of "A to B" and "B to A" for a pair of points A and B at the same time and store them in the same row.
Therefore, for a single pair of points, the code simply calls functions as in the following block.

In [None]:
# Single Pair Example
import TDX_retriever as tr
import pandas as pd

# your API credentials
client_id = "your-client-id"
client_secret = "your-client-secret"
TDX = tr.TDX_retriever(client_id, client_secret)

# some centroids as example
c1 = [24.9788580602204, 121.55598430669878]
c2 = [24.92583789587648, 121.34128216256848]

# get routing result from TDX
# note that the returned result is filtered, not the original response
single_pair_res = TDX.get_transport_result(c1, c2)

# save as a dataframe
cols = [
    'A_lat', 'A_lon', 'B_lat', 'B_lon',
    'AB_travel_time', 'AB_ttl_cost', 'AB_transfer_cnt', 'AB_route',
    'BA_travel_time', 'BA_ttl_cost', 'BA_transfer_cnt', 'BA_route'
]
df = pd.DataFrame(columns=cols)
df.loc[len(df)] = single_pair_res

# show the df
df
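
The token handling described at the start of this section can be sketched as follows. This is only an illustration, not the code inside `TDX_retriever`: `TokenCache`, `fetch_token`, and `auth_header` are hypothetical names, the request that exchanges the client ID and secret for a token is left to the caller-supplied `fetch_token` function, and the Bearer header scheme is an assumption. The sketch shows one idea only: reuse a token until shortly before its 24-hour lifetime ends.

```python
import time


class TokenCache:
    """Reuse an access token until shortly before it expires (hypothetical helper)."""

    def __init__(self, fetch_token, lifetime_sec=24 * 3600, margin_sec=60,
                 clock=time.time):
        self._fetch_token = fetch_token    # callable that requests a fresh token
        self._lifetime_sec = lifetime_sec  # 24 hours, as stated in the text
        self._margin_sec = margin_sec      # refresh slightly early to be safe
        self._clock = clock                # injectable so the logic can be tested
        self._token = None
        self._expires_at = 0.0

    def get(self):
        """Return the cached token, fetching a new one when needed."""
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            self._token = self._fetch_token()
            self._expires_at = now + self._lifetime_sec - self._margin_sec
        return self._token


def auth_header(token):
    """Build the header that attaches the token to an API request (assumed scheme)."""
    return {"Authorization": f"Bearer {token}"}
```

With such a cache, a long run over many pairs would request a new token roughly once a day instead of once per pair.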

For a list of pairs, we call the functions in basically the same way. However, because of the API access limit, we have to split the list into multiple sub-files. When running the code, we therefore provide the serial number of the sub-file as a command-line argument. The other required command-line argument is the path to your API information. The following example command runs the code on the 7th sub-file.
```
python TDX_retriever.py your_api_info.json 7
```
The following block is an example of the layout of the API info JSON file.

In [None]:
{
    "client_id": "your-client-id",
    "client_secret": "your-client-secret"
}
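
As a rough sketch of how a script could consume these two command-line arguments and the JSON file, consider the block below. It is not the actual argument handling in `TDX_retriever.py`, which we do not reproduce here; `parse_cli`, `load_api_info`, and the argument names `api_info` and `file_no` are hypothetical.

```python
import argparse
import json


def parse_cli(argv=None):
    """Parse the two positional arguments: API info path and sub-file number."""
    parser = argparse.ArgumentParser(
        description="Fetch TDX routing results for one sub-file.")
    parser.add_argument(
        "api_info", help="path to the JSON file with client_id and client_secret")
    parser.add_argument(
        "file_no", type=int, help="serial number of the sub-file to process")
    return parser.parse_args(argv)


def load_api_info(path):
    """Read the credentials and fail early if a key is missing."""
    with open(path, encoding="utf-8") as f:
        info = json.load(f)
    missing = {"client_id", "client_secret"} - set(info)
    if missing:
        raise KeyError(f"missing keys in {path}: {sorted(missing)}")
    return info["client_id"], info["client_secret"]
```

Keeping the credentials in a separate JSON file also keeps them out of the script and out of the command history.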

In the actual implementation, we ran the command through batch script files (".bat" files) rather than typing it by hand. As mentioned above, there are multiple sub-files, and batch files make rerunning the command much easier. Before showing an example batch file, here are some notes:
* Anything after "@REM" is a comment.
* In our case, the batch file is named "run_py_key1.bat" and reads the API information from a JSON file called "api_key1.json".
* We pass the serial number as a command-line argument to the batch file, which forwards it to the Python command.

Batch files are also executed from the command line. Following the earlier example, the command to run the batch file is as follows:
```
run_py_key1.bat 7
```
The block below is an example of our batch file. Running on macOS requires ".sh" files instead, which need some modification, but the main idea is the same.

In [None]:
@REM set the file name as a variable, for example, "fname" would be "run_py_key1"
set fname=%~n0
@REM %fname:~7,4% is extracting 4 characters,
@REM starting from the 8th character from file name "run_py_key1"
@REM therefore "%fname:~7,4%" would be "key1"
python TDX_retriever.py env\api_%fname:~7,4%.json %1
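
For readers less familiar with batch substring syntax, the same extraction can be written in Python. This is only an illustration of what `%~n0` and `%fname:~7,4%` compute; the function names are hypothetical and are not part of our code.

```python
from pathlib import PureWindowsPath


def key_from_script_name(script_path, start=7, length=4):
    """Mimic %fname:~7,4%: take `length` characters beginning at offset `start`."""
    fname = PureWindowsPath(script_path).stem  # like %~n0: name without extension
    return fname[start:start + length]


def api_info_path(script_path):
    """Rebuild the JSON path that the batch file passes to the Python script."""
    return f"env\\api_{key_from_script_name(script_path)}.json"
```

Because the width is fixed at 4 characters, a file named "run_py_key10.bat" would also yield "key1", so this naming scheme only distinguishes single-digit keys.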

Since we use VS Code as our main editor, its "tasks" feature makes executing batch files even easier. To use it, create a file called "tasks.json" and save it in a folder called ".vscode". The next block is an example of the "tasks.json" file.

In [None]:
{
    "version": "2.0.0",
    "tasks": [
        {
            "label": "get TDX with key 1 on file 7",
            "type": "shell",
            "command": "env\\run_py_key1.bat 7"
        },
        {
            "label": "get TDX with key 1 on file 8",
            "type": "shell",
            "command": "env\\run_py_key1.bat 8"
        },
        {
            "label": "run all",
            "dependsOn": [
                "get TDX with key 1 on file 7",
                "get TDX with key 1 on file 8"
            ],
            "dependsOrder": "parallel",
            "presentation": {
                "reveal": "always",
                "revealProblems": "onProblem",
                "panel": "new"
            }
        }
    ]
}
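
With 20 sub-files, writing one task per file by hand becomes repetitive. A small generator such as the following could produce the same structure; `build_tasks` and its parameters are hypothetical, and the labels and commands simply follow the pattern of the example above.

```python
import json


def build_tasks(file_numbers, key_no=1):
    """Build tasks.json content: one shell task per sub-file plus a "run all" task."""
    tasks = [
        {
            "label": f"get TDX with key {key_no} on file {n}",
            "type": "shell",
            "command": f"env\\run_py_key{key_no}.bat {n}",
        }
        for n in file_numbers
    ]
    run_all = {
        "label": "run all",
        "dependsOn": [t["label"] for t in tasks],
        "dependsOrder": "parallel",
        "presentation": {
            "reveal": "always",
            "revealProblems": "onProblem",
            "panel": "new",
        },
    }
    return {"version": "2.0.0", "tasks": tasks + [run_all]}


# JSON text for sub-files 7 and 8, matching the hand-written example
tasks_json = json.dumps(build_tasks([7, 8]), indent=4)
```

Writing `tasks_json` to ".vscode/tasks.json" would then replace the manual editing step.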

FINAL REMARK: the Helper_tdx class in helper.py provides several useful tools for this process.

# Fill Missing
Starting from this section, we will use the code blocks to demonstrate our steps and the actual code execution.

The process in this section is as follows:
1. Stack data.
2. (Optional, missing pair fill) Sometimes TDX returns only part of the provided pairs. This can be checked against the walking data, which has the full pair list before calibration. If pairs are missing, create rows for them with missing values so that the later fill steps can handle them.
3. Keep only the pairs within the needed counties.
4. (first missing value fill) Access TDX with different departure times to fill the missing data.
5. (second missing value fill) Fill the missing values with walking data if the walking time is less than 30 minutes.
6. (third missing value fill) Get the remaining missing values from Google Maps.

We use OSRM walking data to fill the public transit data because of a limitation of the TDX platform: when two given points are so close that no public transit is available, TDX reports a missing value. Besides, since we set the first mile to 30 minutes when retrieving data from TDX, filling missing values with walking times below 30 minutes is a reasonable choice.

### Stack results
After retrieving data from TDX, there are multiple result files, so we first stack them back into the full list and then run the checks.

In [None]:
import helper
import os

# ======== File and Folder settings ========
FOLDER_PUBLIC_RAW = 'JJinTP_data_TW/public_data/Raw/'
FOLDER_PUBLIC_SCRATCH = 'JJinTP_data_TW/public_data/Scratch/'
FOLDER_PUBLIC_MAIN = 'JJinTP_data_TW/public_data/Main/'

# The calibration data keeps only the counties we need
FILE_CALIB = "JJinTP_data_TW/calibration_data_TP.csv"

# This is the full list of all counties in the Taipei Metropolitan,
# with the longitude and latitude of the county centroids
FILE_VILL_CENTROID = "JJinTP_data_TW/village_centroid_TP.csv"

# The walking data by OSRM
FILE_WALKING = os.path.join(FOLDER_PUBLIC_MAIN, 'travel_walking.csv')

hpt = helper.Helper_public_travel(
    calib_fpath=FILE_CALIB,
    centroid_path=FILE_VILL_CENTROID,
    walk_fpath=FILE_WALKING,
    public_merged_fname='merged_public.csv'
)

# For reference, column names in the TDX response:
# A_villcode, B_villcode, A_lat, A_lon, B_lat, B_lon,
# AB_travel_time, AB_ttl_cost, AB_transfer_cnt, AB_route,
# BA_travel_time, BA_ttl_cost, BA_transfer_cnt, BA_route

# Stack the data and write the stacked file to the output folder.
# file_cnt = 20 because we split the original list into 20 sub-files.
# To generate a new file, manually remove the existing one;
# otherwise, this step will be skipped.
hpt.merge_public_files(
    source_path=FOLDER_PUBLIC_RAW,
    out_path=FOLDER_PUBLIC_SCRATCH,
    start_time='10am', file_cnt=20
)  # creates "merged_public.csv" in the scratch folder.

The file already exists in JJinTP_data_TW/public_data/Scratch/, skipping this step...
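
The stacking step, including the skip-if-exists behaviour shown in the output above, can be sketched with the standard library alone. This is not the implementation of `merge_public_files`; `stack_csv_files` is a hypothetical function, and it assumes that all sub-files share the same header row.

```python
import csv
import os


def stack_csv_files(source_paths, out_fpath):
    """Concatenate CSV files with identical headers; skip if the output exists."""
    if os.path.exists(out_fpath):
        print(f"The file already exists at {out_fpath}, skipping this step...")
        return False
    header = None
    with open(out_fpath, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in source_paths:
            with open(path, newline="", encoding="utf-8") as src:
                reader = csv.reader(src)
                file_header = next(reader, None)
                if file_header is None:  # empty file, nothing to stack
                    continue
                if header is None:
                    header = file_header
                    writer.writerow(header)  # write the header only once
                elif file_header != header:
                    raise ValueError(f"unexpected columns in {path}")
                writer.writerows(reader)
    return True
```

Refusing to overwrite an existing output means the expensive retrieval results are never silently replaced; a fresh stack requires deleting the file first.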


### Optional missing pair check
Occasionally, TDX returns fewer results than the number of pairs given. A difference between the row counts of the input and output files indicates the need for this optional step. Duplicated records were also found.
Instead of filling the missing pairs for each sub-file, we do it once after stacking all sub-files, which makes checking and filling more efficient.

In [2]:
hpt.get_missing_pairs(
    stacked_public_path=FOLDER_PUBLIC_SCRATCH,
    walking_fpath=FILE_WALKING
)  # this function will replace the "merged_public.csv"

There are 40500 missing pairs...
Missing pairs restored.
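
The idea behind this check is a comparison of pair keys: every pair in the full list that does not appear in the stacked result gets a placeholder row. The sketch below illustrates that logic on plain dictionaries. It is not the code of `get_missing_pairs`; `restore_missing_pairs` is hypothetical, and keeping the first record of a duplicated pair is our assumption.

```python
def restore_missing_pairs(public_rows, all_pairs, value_cols):
    """Drop duplicated pairs and add an all-missing row for every absent pair."""
    seen = {}
    for row in public_rows:
        key = (row["A_villcode"], row["B_villcode"])
        seen.setdefault(key, row)  # keep the first record of a duplicated pair
    n_missing = 0
    for a_code, b_code in all_pairs:
        if (a_code, b_code) not in seen:
            placeholder = {"A_villcode": a_code, "B_villcode": b_code}
            placeholder.update({col: None for col in value_cols})
            seen[(a_code, b_code)] = placeholder
            n_missing += 1
    return list(seen.values()), n_missing
```

Here `all_pairs` would come from the walking data, which holds the full pair list, and the returned count corresponds to the number reported in the output above.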


### Calibrate the counties
We use the list of counties from the calibration data to keep only the pairs whose two points are both in the calibrated list.

In [5]:
# this function will replace the "merged_public.csv"
hpt.calibrate_counties_used(FOLDER_PUBLIC_SCRATCH)

(776881, 14)
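
The filter itself is simple, and the resulting row count can be sanity-checked against the number of unordered pairs. The following is a sketch with hypothetical function names, not the code of `calibrate_counties_used`.

```python
def keep_calibrated_pairs(rows, calibrated_codes):
    """Keep the pairs whose two end points are both in the calibrated list."""
    keep = set(calibrated_codes)
    return [
        r for r in rows
        if r["A_villcode"] in keep and r["B_villcode"] in keep
    ]


def n_unordered_pairs(n_points):
    """Number of distinct unordered pairs among n_points points."""
    return n_points * (n_points - 1) // 2
```

As a check, `n_unordered_pairs(1247)` equals 776881, so the row count printed above would match 1,247 calibrated units if every unordered pair is present. That unit count is an inference from the arithmetic, not a figure stated in this notebook.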


### 1st missing fill: rerun TDX
This part requires several reruns with different departure times. We use batch files (or shell scripts on macOS) to run the code.

In [None]:
# 1. get 10:30 am
hpt.get_rerun_pairs(
    data_path=FOLDER_PUBLIC_SCRATCH,
    data_fname="merged_public.csv",
    out_path=FOLDER_PUBLIC_SCRATCH,
    target_time="1030am"
)  # generates rerun_TDX_1030am.csv

In [None]:
@REM Batch file: run_py_key1.bat
@echo off
set fname=%~n0
python get_public.py api_%fname:~7,4%.json rerun_TDX_1030am.csv fill_1030am.csv ^
    --depart_time "T10:30:00" ^
    --centroid_path "JJinTP_data_TW/public_data/Scratch/" ^
    --out_path "JJinTP_data_TW/public_data/Scratch/"

In [None]:
#!/bin/bash
# steps in command-line:
# 1. chmod +x ./JJinTP_data_TW/public_data/tdx_api/run_py_key1.sh
# 2. ./JJinTP_data_TW/public_data/tdx_api/run_py_key1.sh

# Extract filename without extension
fname=$(basename "$0" .sh)

TIME_H="10"
TIME_M="30"

# Run Python script with formatted parameters
python get_public.py "api_${fname:7:4}.json" \
    rerun_TDX_${TIME_H}${TIME_M}am.csv fill_${TIME_H}${TIME_M}am.csv \
    --depart_time "T${TIME_H}:${TIME_M}:00" \
    --centroid_path "./JJinTP_data_TW/public_data/Scratch/" \
    --out_path "./JJinTP_data_TW/public_data/Scratch/"

In [None]:
# 2. get 11:00am
hpt.get_rerun_pairs(
    data_path=FOLDER_PUBLIC_SCRATCH,
    data_fname="fill_1030am.csv",
    out_path=FOLDER_PUBLIC_SCRATCH,
    target_time="1100am"
)  # generates rerun_TDX_1100am.csv
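
Conceptually, each rerun selects the pairs that still lack a travel time in at least one direction and queries them again with a new departure time. The sketch below illustrates the selection and the mapping from a label such as "1030am" to the `--depart_time` value used in the scripts above. All three functions are hypothetical and are not the code of `get_rerun_pairs`; how missing values are encoded in the real files is an assumption.

```python
import math
from datetime import datetime


def is_missing(value):
    """Treat None, empty strings, and NaN as missing (assumed encodings)."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)


def select_rerun_pairs(rows):
    """Return the pairs lacking a travel time in at least one direction."""
    return [
        r for r in rows
        if is_missing(r["AB_travel_time"]) or is_missing(r["BA_travel_time"])
    ]


def depart_time_arg(label):
    """Convert a label such as "1030am" into the form "T10:30:00"."""
    return datetime.strptime(label, "%I%M%p").strftime("T%H:%M:%S")
```

Each round is expected to shrink the list, which is why the 11:00 am rerun starts from the 10:30 am fill file rather than from the original stacked data.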

### 2nd missing fill: walking data
Fill the pairs that are missing in both directions and have a walking time of less than 30 minutes.

In [None]:
hpt.fill_with_walk(
    data_path=FOLDER_PUBLIC_SCRATCH,
    data_fname="fill_1100am.csv",
    t_limit_minute=30,
    out_path=FOLDER_PUBLIC_SCRATCH
)  # this function generates "fill_walk.csv"
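
The rule applied here can be written in a few lines. The sketch below is not the code of `fill_with_walk`: `fill_both_missing_with_walk` is a hypothetical name, `walk_minutes` is an assumed lookup from a pair to its walking time, and both travel times are assumed to be in minutes.

```python
def fill_both_missing_with_walk(rows, walk_minutes, t_limit_minute=30):
    """Fill pairs missing in both directions when the walk is under the limit."""
    filled = []
    for row in rows:
        row = dict(row)  # copy so the input rows stay unchanged
        walk = walk_minutes.get((row["A_villcode"], row["B_villcode"]))
        both_missing = (
            row["AB_travel_time"] is None and row["BA_travel_time"] is None
        )
        if both_missing and walk is not None and walk < t_limit_minute:
            # walking takes the same route both ways, so use one value for both
            row["AB_travel_time"] = walk
            row["BA_travel_time"] = walk
        filled.append(row)
    return filled
```

Pairs with a value in only one direction are left untouched at this stage.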

### 3rd missing fill: manual check on Google Maps
Only 20 pairs were left after the TDX rerun step (with departure times 10:30 am and 11:00 am), so we fill the remaining missing pairs with Google Maps.

Note that no file is generated if it already exists in the folder, since we don't want the manually checked results overwritten by an accidental click.
<!-- 
Why could these pairs be resolved with Google Maps when the official routing service could not give a valid response? We do not know. However, since Google Maps' public transit routing works almost perfectly in Taiwan (in our experience), we think the results are credible.
-->

In [2]:
hpt.get_manual_check(
    data_path=FOLDER_PUBLIC_SCRATCH,
    data_fname="fill_1100am.csv",
    out_path=FOLDER_PUBLIC_SCRATCH,
)  # generates fill_manual_check.csv

### Merge all the fill files with the main
This step patches all the fill files (fill_1030am, fill_1100am, fill_manual_check) into the main data. Some pairs might still have missing data in one direction; we fill these with the travel time of the reverse direction.

In [None]:
# Notice that starting from the merged_public.csv, all of the villcode
# pairs are sorted, i.e. the A_villcode will be the smaller one.

# the order matters, should follow the steps above
fill_list = ['fill_1030am.csv', 'fill_1100am.csv', 'fill_manual_check.csv']

hpt.make_public_main(
    stacked_public_path=FOLDER_PUBLIC_SCRATCH,
    fill_file_path=FOLDER_PUBLIC_SCRATCH,
    fill_file_list=fill_list,
    out_path=FOLDER_PUBLIC_MAIN
)  # generates public_travel_time.csv

hpt.update_with_walk(
    final_public_fpath=os.path.join(FOLDER_PUBLIC_MAIN, "public_travel_time.csv"),
)

There are 0 pairs with missing.
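
To make the patching and the reverse-direction fill concrete, here is a sketch on plain dictionaries. It is not the code of `make_public_main` or `update_with_walk`; `patch_and_mirror` is hypothetical. In particular, it assumes that earlier fill files take priority and that later files only fill values that are still missing, which is one possible reading of "the order matters".

```python
def patch_and_mirror(main_rows, fill_tables):
    """Apply fill tables in order, then copy the reverse direction into gaps."""
    time_cols = ("AB_travel_time", "BA_travel_time")
    merged = {(r["A_villcode"], r["B_villcode"]): dict(r) for r in main_rows}
    for table in fill_tables:  # assumed: later tables only fill remaining gaps
        for fill in table:
            target = merged.get((fill["A_villcode"], fill["B_villcode"]))
            if target is None:
                continue
            for col in time_cols:
                if target[col] is None and fill.get(col) is not None:
                    target[col] = fill[col]
    for row in merged.values():
        ab, ba = row["AB_travel_time"], row["BA_travel_time"]
        if ab is None and ba is not None:
            row["AB_travel_time"] = ba
        elif ba is None and ab is not None:
            row["BA_travel_time"] = ab
    n_missing = sum(
        1 for r in merged.values()
        if r["AB_travel_time"] is None or r["BA_travel_time"] is None
    )
    return list(merged.values()), n_missing
```

The returned count plays the role of the "pairs with missing" figure printed above: the procedure is complete once it reaches zero.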


# This completes the public data part.
<!--jupyter nbconvert --to html public_data_procedure.ipynb -->