# Preparing Uber Movement travel time data
---

### Overview
1. Inspect raw data
2. Process raw data
3. Process all geographic data
4. Training, validation, testing data
5. Shuffle data files
---

We start by importing the hyperparameters and modules.

In [1]:
import hyper
import prep_uber_movement
import pandas as pd

HYPER = hyper.HyperParameter()

### 1. Inspect raw data
Let us first inspect the raw data files available for an example city. This helps us understand the raw data and identify our features and labels. Data is available for the following cities:

Amsterdam, Atlanta, Auckland, Bangalore, Bogota, Boston, Brisbane, Bristol, Brussels, Cairo, Cape Town, Cincinnati, Guadalajara, Hyderabad, Johannesburg and Pretoria, Kolkata, Leeds, London, Los Angeles, Madrid, Manchester, Melbourne, Mexico City, Miami, Mumbai, Nairobi, New Delhi, Orlando, Paris, Perth, Pittsburgh, San Francisco, Santiago De Chile, Sao Paulo, Seattle, Stockholm, Sydney, Taipei, Tampa Bay, Toronto, Vienna, 'Washington D.C.', 'West Midlands, UK'.

The raw data shows a number of characteristics worth noting:

* Travel time data is described by four distinct values: mean, standard deviation, geometric mean and geometric standard deviation. These are our labels.
* Our features are the hour of day, a source ID and a destination ID. The filename of each .csv file contains further metadata that is useful for describing features: the year, the quarter of the year and the day type (weekdays or weekends).
* The geojson file further maps a set of latitude and longitude coordinates to each city zone ID. These describe the vertices of a two-dimensional polygon representing each zone.
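As a rough illustration of how the filename metadata could be turned into features, here is a minimal sketch. The helper `parse_filename` and the regular expression are hypothetical and not part of `prep_uber_movement`; the pattern is inferred from the single example filename shown in this notebook, and the `OnlyWeekends` token as well as the encoding of weekends as 0 are assumptions (the processed data further down only confirms that `OnlyWeekdays` maps to `daytype` 1).

```python
import re

# Assumed pattern, inferred from one example filename:
# <city>-<zone_type>-<year>-<quarter>-<DayType>-HourlyAggregate.csv
FILENAME_PATTERN = re.compile(
    r"^(?P<city>.+?)-(?P<zone_type>[^-]+)-(?P<year>\d{4})-(?P<quarter>[1-4])"
    r"-(?P<daytype>OnlyWeekdays|OnlyWeekends)-HourlyAggregate\.csv$"
)


def parse_filename(filename):
    """Extract year, quarter and day type from a travel time csv filename."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unexpected filename format: {filename}")
    return {
        "city": match["city"],
        "zone_type": match["zone_type"],
        "year": int(match["year"]),
        "quarter_of_year": int(match["quarter"]),
        # Assumption: weekdays are encoded as 1, weekends as 0.
        "daytype": 1 if match["daytype"] == "OnlyWeekdays" else 0,
    }


meta = parse_filename(
    "auckland-statistical_area-2018-3-OnlyWeekdays-HourlyAggregate.csv"
)
print(meta)
```

Raising an error on an unexpected filename, rather than returning partial metadata, keeps silently mislabeled rows out of the feature table.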

In [2]:
# choose a city from the list of available ones above
city = 'Auckland'

# call the import data function
df_geojson = prep_uber_movement.import_geojson(HYPER, city)
df_csv_dict_list = prep_uber_movement.import_csvdata(HYPER, city)
df_csv_dict = df_csv_dict_list[0]
df_csv = df_csv_dict['df']

# set maximum column width to see more of geojson
pd.set_option('display.max_colwidth', 400)

# print filename
print(df_csv_dict['filename'])
display(df_csv)
display(df_geojson)

auckland-statistical_area-2018-3-OnlyWeekdays-HourlyAggregate.csv


Unnamed: 0,sourceid,dstid,hod,mean_travel_time,standard_deviation_travel_time,geometric_mean_travel_time,geometric_standard_deviation_travel_time
0,244,53,18,783.79,272.53,744.81,1.36
1,70,411,17,487.42,168.31,467.12,1.31
2,420,269,1,701.24,183.47,686.00,1.21
3,404,429,1,165.06,297.99,66.58,3.19
4,194,315,13,887.41,253.55,855.97,1.30
...,...,...,...,...,...,...,...
1092292,377,374,3,906.22,205.24,880.78,1.28
1092293,32,122,12,196.91,143.02,150.82,2.26
1092294,10,342,12,562.04,250.74,514.73,1.56
1092295,365,494,3,901.75,392.59,846.00,1.39


Unnamed: 0,type,features
0,FeatureCollection,"{'type': 'Feature', 'properties': {'SA22019_V1': '157500', 'MOVEMENT_ID': '1', 'DISPLAY_NAME': 'Chapel Downs'}, 'geometry': {'type': 'Polygon', 'coordinates': [[[174.8998339, -36.9709799], [174.898904, -36.9739314], [174.8977483, -36.9781633], [174.90213549999999, -36.9807583], [174.9046005, -36.9718473], [174.906655, -36.9645751], [174.9019628, -36.9637275], [174.9012983, -36.965881], [174.89..."
1,FeatureCollection,"{'type': 'Feature', 'properties': {'SA22019_V1': '157600', 'MOVEMENT_ID': '2', 'DISPLAY_NAME': 'Wiri West'}, 'geometry': {'type': 'Polygon', 'coordinates': [[[174.8403848, -37.0183103], [174.8412096, -37.0183095], [174.8415799, -37.0191238], [174.8420459, -37.0193849], [174.8434829, -37.0195604], [174.8450429, -37.0187793], [174.8457613, -37.0192887], [174.8469823, -37.0191245], [174.8478889, ..."
2,FeatureCollection,"{'type': 'Feature', 'properties': {'SA22019_V1': '162400', 'MOVEMENT_ID': '3', 'DISPLAY_NAME': 'Glenbrook'}, 'geometry': {'type': 'Polygon', 'coordinates': [[[174.7387411, -37.237636], [174.7401261, -37.2366038], [174.7408903, -37.2358501], [174.7440681, -37.2372232], [174.7436192, -37.2380705], [174.7495996, -37.2400172], [174.7475195, -37.2430371], [174.7467526, -37.2448268], [174.7473993, -..."
3,FeatureCollection,"{'type': 'Feature', 'properties': {'SA22019_V1': '162500', 'MOVEMENT_ID': '4', 'DISPLAY_NAME': 'Hingaia'}, 'geometry': {'type': 'Polygon', 'coordinates': [[[174.9074569, -37.0858398], [174.9072241, -37.0841591], [174.9077382, -37.0834414], [174.9183611, -37.077399], [174.9233672, -37.0745959], [174.927033, -37.0737593], [174.9289477, -37.0734856], [174.9245255, -37.0622618], [174.9237101, -37...."
4,FeatureCollection,"{'type': 'Feature', 'properties': {'SA22019_V1': '162700', 'MOVEMENT_ID': '5', 'DISPLAY_NAME': 'Kawakawa Bay-Orere'}, 'geometry': {'type': 'Polygon', 'coordinates': [[[175.0947102, -37.0367461], [175.0954846, -37.0370574], [175.0943899, -37.0373393], [175.0946923, -37.0382573], [175.0943656, -37.0389124], [175.0945893, -37.0399217], [175.0949579, -37.0401748], [175.0948447, -37.0409254], [175...."
...,...,...
538,FeatureCollection,"{'type': 'Feature', 'properties': {'SA22019_V1': '166400', 'MOVEMENT_ID': '539', 'DISPLAY_NAME': 'Ararimu'}, 'geometry': {'type': 'Polygon', 'coordinates': [[[175.0091439, -37.1229777], [175.0003397, -37.1232229], [174.9999081, -37.1243699], [175.0006101, -37.1248682], [175.0013595, -37.1262076], [175.0010292, -37.1267168], [175.0012389, -37.1270705], [175.0013827, -37.1285493], [175.0012915, ..."
539,FeatureCollection,"{'type': 'Feature', 'properties': {'SA22019_V1': '170300', 'MOVEMENT_ID': '540', 'DISPLAY_NAME': 'Tuakau South'}, 'geometry': {'type': 'Polygon', 'coordinates': [[[174.9291153, -37.2849776], [174.9321526, -37.2874313], [174.9334829, -37.2889231], [174.9342623, -37.2896191], [174.9391871, -37.2857452], [174.9412072, -37.2840975], [174.9429772, -37.2850842], [174.9438527, -37.2840672], [174.9452..."
540,FeatureCollection,"{'type': 'Feature', 'properties': {'SA22019_V1': '170000', 'MOVEMENT_ID': '541', 'DISPLAY_NAME': 'Tuakau North'}, 'geometry': {'type': 'Polygon', 'coordinates': [[[174.9362994, -37.2607948], [174.9361191, -37.2612239], [174.9443631, -37.2634499], [174.9474673, -37.2647495], [174.9491497, -37.2653783], [174.9497097, -37.263996], [174.9511267, -37.2646257], [174.9510411, -37.265322], [174.953687..."
541,FeatureCollection,"{'type': 'Feature', 'properties': {'SA22019_V1': '170200', 'MOVEMENT_ID': '542', 'DISPLAY_NAME': 'Pokeno Rural'}, 'geometry': {'type': 'Polygon', 'coordinates': [[[174.9902581, -37.2165855], [174.9818741, -37.2227402], [174.9769676, -37.2266871], [174.974538, -37.228343], [174.9781589, -37.233632], [174.9769923, -37.2417426], [174.977289, -37.2426364], [174.9776944, -37.2430861], [174.9784346,..."


### 2. Process raw data

The travel time data is already clean and is only augmented with the city ID and the metadata from the filename. The only part that requires real processing is the set of geojson coordinates describing each city zone polygon with latitudes and longitudes. We write a recursive function that traverses the geojson files and extracts only the latitude and longitude coordinates mapped to each city zone ID.

The format of the files resulting from this step is shown below.
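The recursive extraction described above can be sketched as follows. This is not the implementation in `prep_uber_movement.process_geojson`; the function names `extract_lat_lon` and `zones_to_frames` are hypothetical, and the sketch assumes each feature carries a `MOVEMENT_ID` property and stores points in GeoJSON order (longitude first, then latitude), as in the raw data displayed earlier. Zones with fewer polygon points are padded with NaN, which matches the trailing NaN rows in the output below.

```python
import pandas as pd


def extract_lat_lon(coordinates, latitudes, longitudes):
    """Recursively walk nested coordinate lists and collect every point."""
    # Base case: a single position, i.e. a list starting with a number.
    if coordinates and isinstance(coordinates[0], (int, float)):
        longitudes.append(coordinates[0])  # GeoJSON stores longitude first
        latitudes.append(coordinates[1])
        return
    # Recursive case: a list of rings, polygons or positions.
    for item in coordinates:
        extract_lat_lon(item, latitudes, longitudes)


def zones_to_frames(features):
    """Build one latitude and one longitude table with a column per zone ID."""
    lat_cols, lon_cols = {}, {}
    for feature in features:
        zone_id = int(feature["properties"]["MOVEMENT_ID"])
        lats, lons = [], []
        extract_lat_lon(feature["geometry"]["coordinates"], lats, lons)
        lat_cols[zone_id] = pd.Series(lats, dtype=float)
        lon_cols[zone_id] = pd.Series(lons, dtype=float)
    # Series of unequal length are aligned on their index and padded with NaN.
    return pd.DataFrame(lat_cols), pd.DataFrame(lon_cols)


# Two shortened example zones, using the first points from the raw data above.
features = [
    {"properties": {"MOVEMENT_ID": "1"},
     "geometry": {"type": "Polygon", "coordinates": [[
         [174.8998339, -36.9709799],
         [174.898904, -36.9739314],
         [174.8977483, -36.9781633]]]}},
    {"properties": {"MOVEMENT_ID": "2"},
     "geometry": {"type": "Polygon", "coordinates": [[
         [174.8403848, -37.0183103],
         [174.8412096, -37.0183095]]]}},
]

df_lat, df_lon = zones_to_frames(features)
print(df_lat)
print(df_lon)
```

Because the recursion only checks whether the first element is a number, the same function handles both `Polygon` and more deeply nested `MultiPolygon` geometries without special cases.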

In [3]:
df_augmented_csvdata = prep_uber_movement.process_csvdata(HYPER, df_csv_dict, city)
df_latitudes, df_longitudes = prep_uber_movement.process_geojson(df_geojson)

display(df_augmented_csvdata)
display(df_latitudes)
display(df_longitudes)

Unnamed: 0,city_id,source_id,destination_id,year,quarter_of_year,daytype,hour_of_day,mean_travel_time,standard_deviation_travel_time,geometric_mean_travel_time,geometric_standard_deviation_travel_time
0,4,244,53,2018,3,1,18,783.79,272.53,744.81,1.36
1,4,70,411,2018,3,1,17,487.42,168.31,467.12,1.31
2,4,420,269,2018,3,1,1,701.24,183.47,686.00,1.21
3,4,404,429,2018,3,1,1,165.06,297.99,66.58,3.19
4,4,194,315,2018,3,1,13,887.41,253.55,855.97,1.30
...,...,...,...,...,...,...,...,...,...,...,...
1092292,4,377,374,2018,3,1,3,906.22,205.24,880.78,1.28
1092293,4,32,122,2018,3,1,12,196.91,143.02,150.82,2.26
1092294,4,10,342,2018,3,1,12,562.04,250.74,514.73,1.56
1092295,4,365,494,2018,3,1,3,901.75,392.59,846.00,1.39


Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,534,535,536,537,538,539,540,541,542,543
0,-36.970980,-37.018310,-37.237636,-37.085840,-37.036746,-36.728759,-36.784310,-36.894186,-36.856385,-36.915605,...,-37.190833,-37.210547,-37.204284,-37.211235,-37.144819,-37.122978,-37.284978,-37.260795,-37.216586,-37.250862
1,-36.973931,-37.018310,-37.236604,-37.084159,-37.037057,-36.730135,-36.783853,-36.893222,-36.858080,-36.915779,...,-37.190733,-37.210655,-37.204450,-37.221437,-37.157962,-37.123223,-37.287431,-37.261224,-37.222740,-37.257656
2,-36.978163,-37.019124,-37.235850,-37.083441,-37.037339,-36.731323,-36.782666,-36.893974,-36.858420,-36.916105,...,-37.191070,-37.211235,-37.205321,-37.221231,-37.161115,-37.124370,-37.288923,-37.263450,-37.226687,-37.252555
3,-36.980758,-37.019385,-37.237223,-37.077399,-37.038257,-36.731050,-36.782458,-36.894260,-36.859484,-36.917474,...,-37.191342,-37.204284,-37.207400,-37.220075,-37.164574,-37.124868,-37.289619,-37.264750,-37.228343,-37.256334
4,-36.971847,-37.019560,-37.238070,-37.074596,-37.038912,-36.730427,-36.780501,-36.894152,-36.855396,-36.918903,...,-37.192316,-37.201776,-37.208049,-37.218504,-37.165564,-37.126208,-37.285745,-37.265378,-37.233632,-37.256714
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2401,,,,,,,,,,,...,,,,,,,,,,
2402,,,,,,,,,,,...,,,,,,,,,,
2403,,,,,,,,,,,...,,,,,,,,,,
2404,,,,,,,,,,,...,,,,,,,,,,


Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,534,535,536,537,538,539,540,541,542,543
0,174.899834,174.840385,174.738741,174.907457,175.094710,174.731174,174.670124,174.715359,174.765717,174.832900,...,174.905904,174.889357,174.900837,174.900197,174.970514,175.009144,174.929115,174.936299,174.990258,175.002068
1,174.898904,174.841210,174.740126,174.907224,175.095485,174.732256,174.672097,174.716352,174.763788,174.831160,...,174.906407,174.889615,174.903326,174.899261,174.978588,175.000340,174.932153,174.936119,174.981874,175.007812
2,174.897748,174.841580,174.740890,174.907738,175.094390,174.733769,174.675544,174.717542,174.763587,174.831029,...,174.907749,174.900197,174.904019,174.901211,174.980697,174.999908,174.933483,174.944363,174.976968,175.015086
3,174.902135,174.842046,174.744068,174.918361,175.094692,174.735345,174.675940,174.718474,174.765441,174.836653,...,174.908188,174.900837,174.904550,174.901311,174.983132,175.000610,174.934262,174.947467,174.974538,175.019183
4,174.904600,174.843483,174.743619,174.923367,175.094366,174.737363,174.678580,174.719398,174.768807,174.839967,...,174.909006,174.901061,174.904912,174.901250,174.983728,175.001360,174.939187,174.949150,174.978159,175.023153
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2401,,,,,,,,,,,...,,,,,,,,,,
2402,,,,,,,,,,,...,,,,,,,,,,
2403,,,,,,,,,,,...,,,,,,,,,,
2404,,,,,,,,,,,...,,,,,,,,,,


### 3. Process all geographic data

Now that we know what our raw data looks like and which format we want for our geographic data, we can continue by processing the geojson files of all cities into this format.

In [4]:
df_city_id_mapping, df_geographic_info = prep_uber_movement.process_geographic_information(HYPER)

display(df_city_id_mapping)
display(df_geographic_info)

Unnamed: 0,city_id
Guadalajara,0
Stockholm,1
San Francisco,2
Perth,3
Auckland,4
Boston,5
Brussels,6
London,7
Miami,8
Leeds,9


Unnamed: 0,x_cord_1,x_cord_2,x_cord_3,x_cord_4,x_cord_5,x_cord_6,x_cord_7,x_cord_8,x_cord_9,x_cord_10,...,z_cord_290,z_cord_291,z_cord_292,z_cord_293,z_cord_294,z_cord_295,z_cord_296,z_cord_297,z_cord_298,z_cord_299
0,0.590381,0.590303,0.590438,0.590119,0.590361,0.590260,0.590540,0.590568,0.590443,0.590445,...,0.805492,0.805164,0.805160,0.805520,0.807008,0.807081,0.807136,0.807085,0.806968,0.806991
1,0.590422,0.590334,0.590476,0.590174,0.590398,0.590270,0.590553,0.590668,0.590517,0.590486,...,0.805455,0.805144,0.805106,0.805489,0.806959,0.807065,0.807066,0.807061,0.806949,0.806964
2,0.590388,0.590329,0.590481,0.590176,0.590388,0.590270,0.590527,0.590666,0.590575,0.590561,...,0.805442,0.805163,0.805228,0.805491,0.806961,0.807057,0.807043,0.807062,0.806911,0.806971
3,0.590398,0.590332,0.590461,0.590194,0.590422,0.590347,0.590565,0.590631,0.590580,0.590611,...,0.805395,0.805160,0.805291,0.805479,0.806926,0.807018,0.807029,0.807061,0.806907,0.806970
4,0.590361,0.590305,0.590506,0.590229,0.590440,0.590355,0.590651,0.590669,0.590626,0.590645,...,0.805364,0.805264,0.805308,0.805391,0.806940,0.806989,0.807005,0.807040,0.806861,0.806971
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65,,,,,,,,,,,...,,,,,,,,,,
66,,,,,,,,,,,...,,,,,,,,,,
67,,,,,,,,,,,...,,,,,,,,,,
68,,,,,,,,,,,...,,,,,,,,,,


### 4. Training, validation, testing data

In [5]:
df_train, df_val, df_test = prep_uber_movement.train_val_test_split(HYPER)

display(df_train)
display(df_val)
display(df_test)

The following cities are chosen for test ['Perth']
Training data   :    27% 
 Validation data :    27% 
 Testing data    :    47% 



Unnamed: 0,city_id,source_id,destination_id,year,quarter_of_year,daytype,hour_of_day,mean_travel_time,standard_deviation_travel_time,geometric_mean_travel_time,geometric_standard_deviation_travel_time
5351103,7,889,179,2020,1,0,22,1654.67,135.92,1649.06,1.09
5583705,7,428,407,2018,3,0,21,2138.92,365.20,2114.97,1.15
975097,7,432,977,2018,3,0,20,1360.28,281.46,1336.12,1.20
4844840,7,85,343,2018,3,0,3,1726.83,441.61,1682.69,1.24
4263862,7,365,913,2018,3,0,15,1542.69,388.50,1505.59,1.23
...,...,...,...,...,...,...,...,...,...,...,...
5068601,7,23,929,2018,3,0,8,1722.86,542.28,1641.76,1.36
883652,7,666,840,2020,1,1,5,932.29,500.90,858.44,1.43
1194086,6,462,454,2019,4,0,0,124.73,73.65,109.88,1.62
4426700,7,179,320,2018,3,0,0,1263.59,262.55,1238.67,1.22


Unnamed: 0,city_id,source_id,destination_id,year,quarter_of_year,daytype,hour_of_day,mean_travel_time,standard_deviation_travel_time,geometric_mean_travel_time,geometric_standard_deviation_travel_time
3833224,7,897,232,2018,3,0,13,1331.40,229.59,1312.13,1.19
2894836,7,677,354,2018,3,0,20,1765.33,324.10,1736.77,1.20
2008085,7,573,325,2018,3,0,6,1386.95,353.12,1358.53,1.20
4584045,7,517,86,2018,3,0,13,486.90,199.52,451.95,1.50
4018609,7,210,147,2016,1,0,13,2784.67,391.00,2756.67,1.15
...,...,...,...,...,...,...,...,...,...,...,...
828253,7,816,639,2018,3,0,12,1677.91,393.60,1635.20,1.25
1861151,6,592,109,2019,4,1,10,742.67,269.00,697.86,1.42
2262998,5,47,1220,2019,4,1,0,2438.38,177.15,2432.05,1.07
3053009,7,146,808,2018,3,0,0,1227.50,97.72,1223.47,1.09


Unnamed: 0,city_id,source_id,destination_id,year,quarter_of_year,daytype,hour_of_day,mean_travel_time,standard_deviation_travel_time,geometric_mean_travel_time,geometric_standard_deviation_travel_time
706004,7,282,927,2017,2,0,18,864.18,348.14,812.07,1.40
5511094,7,900,230,2017,2,0,3,593.93,289.56,552.99,1.41
1230158,7,638,132,2020,1,1,4,1606.25,558.85,1534.45,1.33
2599933,7,202,415,2017,2,0,16,1540.91,387.71,1505.69,1.22
6283812,7,331,876,2020,1,1,15,1964.21,460.52,1908.48,1.28
...,...,...,...,...,...,...,...,...,...,...,...
5431566,7,628,87,2017,2,0,21,296.16,183.37,229.19,2.31
2170712,7,582,509,2017,2,1,22,1048.17,232.86,1027.24,1.21
2314831,7,634,697,2017,2,0,10,772.38,171.66,755.77,1.22
1063749,7,926,755,2020,1,1,17,1229.15,347.05,1185.91,1.30


### 5. Shuffle data files

In [None]:
prep_uber_movement.shuffle_data_files(HYPER)