# Preparing Uber Movement travel time data
---

### Overview
1. Inspect raw data
2. Process raw data
3. Process all geographic data
4. Training, validation, testing data
---

We start by importing the hyperparameters and modules.

In [1]:
import hyper
import prep_uber_movement
import pandas as pd

HYPER = hyper.HyperParameter()

### 1. Inspect raw data
Let us first inspect the available raw data files for an example city. This helps us understand the raw data and identify our features and labels. Data is available for the following cities:

Amsterdam, Atlanta, Auckland, Bagalore, Bogota, Boston, Brisbane, Bristol, Brussels, Cairo, Cape Town, Cincinnati, Guadalajara, Hyderabad, Johannesburg and Pretoria, Kolkata, Leeds, London, Los Angeles, Madrid, Manchester, Melbourne, Mexico City, Miami, Mumbai, Nairobi, New Delhi, Orlando, Paris, Perth, Pittsburgh, San Francisco, Santiago De Chile, Sao Paulo, Seattle, Stockholm, Sydney, Taipei, Tampa Bay, Toronto, Vienna, 'Washington D.C.', 'West Midlands, UK'.

The raw data has several characteristics worth noting:

* Travel time data is described by four distinct values: mean, standard deviation, geometric mean and geometric standard deviation. These are our labels.
* Our features are the hour of day, a source ID and a destination ID. The filename of each .csv file contains further metadata that is useful for describing features: the year, the quarter of the year and the day type (weekday or weekend).
* The geojson file maps each city zone ID to a set of latitude and longitude coordinates, which describe a two-dimensional polygon representing that zone.
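
The filename metadata mentioned above could be extracted roughly as follows. This is a minimal sketch, not the code in `prep_uber_movement`: the filename pattern, the example filename and the helper name `parse_filename_metadata` are assumptions made for illustration and may need adjusting to the actual files.

```python
import re

# Assumed filename pattern (hypothetical, adjust to the actual files):
#   <city>-<zone type>-<year>-<quarter>-Only<Weekdays|Weekends>-HourlyAggregate.csv
FILENAME_PATTERN = re.compile(
    r"-(?P<year>\d{4})-(?P<quarter>[1-4])-Only(?P<daytype>Weekdays|Weekends)-"
)


def parse_filename_metadata(filename):
    """Return the year, quarter and day type encoded in a travel time filename."""
    match = FILENAME_PATTERN.search(filename)
    if match is None:
        raise ValueError(f"Unexpected filename format: {filename}")
    return {
        "year": int(match.group("year")),
        "quarter": int(match.group("quarter")),
        "daytype": match.group("daytype").lower(),
    }


# Hypothetical example filename, used only to exercise the parser
example = "santiago-hexclusters-2020-1-OnlyWeekends-HourlyAggregate.csv"
metadata = parse_filename_metadata(example)
print(metadata)
```

The three parsed values can then be attached to every row of the corresponding travel time table as additional feature columns.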

In [2]:
# choose a city from the list of available ones above
city = 'Santiago De Chile'

# call the import data function
df_geojson = prep_uber_movement.import_geojson(HYPER, city)
df_csv_dict_list = prep_uber_movement.import_csvdata(HYPER, city)
df_csv_dict = df_csv_dict_list[0]
df_csv = df_csv_dict['df']

# set maximum column width to see more of geojson
pd.set_option('display.max_colwidth', 400)

# print filename
print(df_csv_dict['filename'])
display(df_csv)
display(df_geojson)

KeyError: 'Santiago De Chile'

### 2. Process raw data

The data is already clean. The only part that must be processed is the geojson coordinates describing each city zone polygon with latitudes and longitudes. We write a recursive function that traverses the json files and extracts only the latitude and longitude coordinates mapped to each city zone ID.
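
A recursive extraction of this kind might look like the sketch below. It is not the implementation in `prep_uber_movement`; the function name `extract_lat_lon`, the `MOVEMENT_ID` property name and the toy coordinates are assumptions for illustration. It relies on the GeoJSON convention that each position is stored as `[longitude, latitude]`.

```python
def extract_lat_lon(coordinates, latitudes=None, longitudes=None):
    """Recursively collect latitudes and longitudes from nested GeoJSON coordinates."""
    if latitudes is None:
        latitudes, longitudes = [], []
    # Base case: a position is a flat list of numbers [longitude, latitude, ...]
    if coordinates and isinstance(coordinates[0], (int, float)):
        longitudes.append(coordinates[0])
        latitudes.append(coordinates[1])
    else:
        # Recursive case: descend into rings, polygons or multi-polygons
        for item in coordinates:
            extract_lat_lon(item, latitudes, longitudes)
    return latitudes, longitudes


# Toy feature with made-up coordinates; the property name is an assumption
feature = {
    "properties": {"MOVEMENT_ID": "1"},
    "geometry": {
        "type": "MultiPolygon",
        "coordinates": [
            [[[-70.50, -33.41], [-70.49, -33.40], [-70.48, -33.41], [-70.50, -33.41]]]
        ],
    },
}

lats, lons = extract_lat_lon(feature["geometry"]["coordinates"])
print(feature["properties"]["MOVEMENT_ID"], lats, lons)
```

Because the function only distinguishes between a position and a list of deeper lists, the same code handles `Polygon` and `MultiPolygon` geometries without checking the geometry type.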

The format of the files resulting from this step is shown below.

In [3]:
df_augmented_csvdata = prep_uber_movement.process_csvdata(df_csv_dict, city)
df_latitudes, df_longitudes = prep_uber_movement.process_geojson(df_geojson)

display(df_augmented_csvdata)
display(df_latitudes)
display(df_longitudes)

| | sourceid | dstid | hod | mean_travel_time | standard_deviation_travel_time | geometric_mean_travel_time | geometric_standard_deviation_travel_time | year | quarter | daytype | city |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 364 | 523 | 10 | 463.21 | 106.06 | 450.49 | 1.27 | 2020 | 1 | weekends | Santiago De Chile |
| 1 | 360 | 563 | 10 | 422.82 | 263.68 | 378.38 | 1.52 | 2020 | 1 | weekends | Santiago De Chile |
| 2 | 411 | 39 | 11 | 1123.33 | 237.19 | 1099.77 | 1.23 | 2020 | 1 | weekends | Santiago De Chile |
| 3 | 412 | 29 | 11 | 847.92 | 360.00 | 796.87 | 1.38 | 2020 | 1 | weekends | Santiago De Chile |
| 4 | 384 | 323 | 10 | 718.39 | 194.36 | 695.59 | 1.28 | 2020 | 1 | weekends | Santiago De Chile |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4740940 | 705 | 750 | 14 | 1087.76 | 194.36 | 1071.25 | 1.19 | 2020 | 1 | weekends | Santiago De Chile |
| 4740941 | 364 | 481 | 12 | 1181.83 | 280.74 | 1149.64 | 1.26 | 2020 | 1 | weekends | Santiago De Chile |
| 4740942 | 31 | 155 | 7 | 702.46 | 251.17 | 648.58 | 1.59 | 2020 | 1 | weekends | Santiago De Chile |
| 4740943 | 32 | 145 | 7 | 943.33 | 156.35 | 929.80 | 1.19 | 2020 | 1 | weekends | Santiago De Chile |


| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 857 | 858 | 859 | 860 | 861 | 862 | 863 | 864 | 865 | 866 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -33.413271 | -33.381629 | -33.382148 | -33.237370 | -33.363130 | -33.324829 | -33.325472 | -33.251488 | -33.207584 | -33.180751 | ... | -33.556529 | -33.561790 | -33.550502 | -33.556417 | -33.558483 | -33.552066 | -33.555423 | -33.555423 | -33.559108 | -33.554364 |
| 1 | -33.411482 | -33.381501 | -33.382176 | -33.237609 | -33.362549 | -33.332767 | -33.318158 | -33.250846 | -33.205678 | -33.181894 | ... | -33.556582 | -33.561790 | -33.550489 | -33.556508 | -33.558590 | -33.552307 | -33.555247 | -33.557529 | -33.557529 | -33.554931 |
| 2 | -33.409830 | -33.381256 | -33.382515 | -33.238659 | -33.360728 | -33.335196 | -33.316544 | -33.249816 | -33.204387 | -33.183403 | ... | -33.557187 | -33.561874 | -33.550496 | -33.556548 | -33.558772 | -33.552596 | -33.555131 | -33.559108 | -33.555423 | -33.554995 |
| 3 | -33.406758 | -33.380947 | -33.382846 | -33.240361 | -33.358650 | -33.336900 | -33.318717 | -33.248468 | -33.203974 | -33.185337 | ... | -33.557241 | -33.561958 | -33.550503 | -33.556582 | -33.558936 | -33.553018 | -33.555069 | -33.560773 | -33.553550 | -33.555857 |
| 4 | -33.406457 | -33.380680 | -33.383235 | -33.241323 | -33.358189 | -33.337482 | -33.345531 | -33.246615 | -33.203897 | -33.186151 | ... | -33.559627 | -33.562042 | -33.550524 | -33.556723 | -33.559630 | -33.553559 | -33.554953 | -33.562642 | -33.551122 | -33.556801 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 344 | | | | | | | | | | | ... | | | | | | | | | | |
| 345 | | | | | | | | | | | ... | | | | | | | | | | |
| 346 | | | | | | | | | | | ... | | | | | | | | | | |
| 347 | | | | | | | | | | | ... | | | | | | | | | | |


| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 857 | 858 | 859 | 860 | 861 | 862 | 863 | 864 | 865 | 866 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -70.502431 | -70.481813 | -70.508380 | -70.696061 | -70.508914 | -70.499148 | -70.450683 | -70.717409 | -70.684300 | -70.645738 | ... | -70.585965 | -70.595492 | -70.613742 | -70.561028 | -70.575938 | -70.668194 | -70.675992 | -70.675992 | -70.678909 | -70.654847 |
| 1 | -70.501605 | -70.482387 | -70.508312 | -70.695396 | -70.508523 | -70.488909 | -70.379501 | -70.716570 | -70.683521 | -70.645922 | ... | -70.586034 | -70.595493 | -70.613670 | -70.560980 | -70.576761 | -70.667008 | -70.676018 | -70.677667 | -70.677667 | -70.654967 |
| 2 | -70.498702 | -70.482879 | -70.507478 | -70.694430 | -70.509173 | -70.483799 | -70.320754 | -70.715085 | -70.682943 | -70.646068 | ... | -70.585932 | -70.596320 | -70.613134 | -70.561028 | -70.578352 | -70.665641 | -70.676245 | -70.678909 | -70.675992 | -70.654918 |
| 3 | -70.497354 | -70.483069 | -70.506985 | -70.694308 | -70.509875 | -70.482651 | -70.280546 | -70.713078 | -70.682457 | -70.646060 | ... | -70.585999 | -70.597135 | -70.612582 | -70.561179 | -70.579654 | -70.663092 | -70.676525 | -70.680191 | -70.674523 | -70.655168 |
| 4 | -70.497221 | -70.484274 | -70.506507 | -70.694167 | -70.509982 | -70.483064 | -70.277062 | -70.710174 | -70.681820 | -70.646237 | ... | -70.585576 | -70.597985 | -70.612048 | -70.562175 | -70.585364 | -70.659782 | -70.677081 | -70.681701 | -70.672630 | -70.655397 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 344 | | | | | | | | | | | ... | | | | | | | | | | |
| 345 | | | | | | | | | | | ... | | | | | | | | | | |
| 346 | | | | | | | | | | | ... | | | | | | | | | | |
| 347 | | | | | | | | | | | ... | | | | | | | | | | |
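
The latitude and longitude tables above have one column per city zone ID and one row per polygon vertex, and the cells in the last rows are empty for the zones shown. One plausible way to obtain this layout, sketched here under the assumption that zones with fewer vertices are simply padded with missing values, is to build the table from per-zone lists of different lengths. The toy values and the name `zone_latitudes` are made up for illustration.

```python
import pandas as pd

# Toy input: zone ID -> list of polygon vertex latitudes (made-up values)
zone_latitudes = {
    "1": [-33.41, -33.40, -33.41],
    "2": [-33.38, -33.37, -33.36, -33.37, -33.38],
}

# Wrapping each list in a Series lets pandas align on the vertex index
# and fill positions that are missing in shorter polygons with NaN.
df_lat = pd.DataFrame(
    {zone: pd.Series(values) for zone, values in zone_latitudes.items()}
)
print(df_lat)
```

The number of rows then equals the vertex count of the largest polygon, and every shorter column ends in missing values.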


### 3. Process all geographic data

Now that we know what our raw data looks like and which format we want for our geographic data, we can process the geojson files of all cities into this format.

In [4]:
prep_uber_movement.process_all_raw_geojson_data(HYPER)

### 4. Training, validation, testing data

In [2]:
df_train, df_val, df_test = prep_uber_movement.train_val_test_split(HYPER)
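
How `train_val_test_split` partitions the data is not shown here. As a point of reference only, a plain random split of a travel time table could be sketched as follows; the function name, the split fractions and the seed are assumptions, and the actual function may split differently, for example by city or by time period.

```python
import pandas as pd


def random_train_val_test_split(df, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle the rows and cut them into train, validation and test parts."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    df_test_part = shuffled.iloc[:n_test]
    df_val_part = shuffled.iloc[n_test:n_test + n_val]
    df_train_part = shuffled.iloc[n_test + n_val:]
    return df_train_part, df_val_part, df_test_part


# Toy table with the feature columns named in this notebook (made-up values)
toy = pd.DataFrame({
    "sourceid": range(100),
    "dstid": range(100, 200),
    "hod": [h % 24 for h in range(100)],
})
train_part, val_part, test_part = random_train_val_test_split(toy)
print(len(train_part), len(val_part), len(test_part))
```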