This notebook covers the preprocessing needed before a model can be trained to predict arrival times.

First, let's import some libraries and load the data.

In [3]:
import time
from datetime import datetime

import numpy as np
import pandas as pd
import seaborn as sns  # needed for the sns.set call below

sns.set(style="darkgrid")
data = pd.read_csv('../bus203.csv')
data.head()

Unnamed: 0.1,Unnamed: 0,timestamp,event,line,vehicle_id,longitude,latitude,journey_number
0,0,2018-02-16T09:31:00.0000000+01:00,JourneyStartedEvent,203,5434,58.417046,15.62424,1
1,1,2018-02-16T09:31:00.0000000+01:00,ObservedPositionEvent,203,5434,58.417046,15.62424,1
2,2,2018-02-16T09:31:00.0000000+01:00,ArrivedEvent,203,5434,58.417046,15.62424,1
3,3,2018-02-16T09:31:00.0000000+01:00,EnteredEvent,203,5434,58.417046,15.62424,1
4,4,2018-02-16T09:31:01.0000000+01:00,ObservedPositionEvent,203,5434,58.417046,15.62424,1


The goal is to estimate the arrival time given a coordinate, but to do that we first need to annotate the trajectories in the data with the actual arrival times. The cell below processes the data to extract the time until the next stop and writes the result to the file "data.pkl", which can then be loaded into a data frame. Consequently, this cell only needs to be run when the data changes. It took me 20 minutes to run, and the resulting file is a 4 GB pickle file that contains less information than the 1 GB CSV file.

In [None]:
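def parse_time(dt_str):
    # Drop the fractional seconds and UTC offset, e.g. ".0000000+01:00".
    dt, _, _ = dt_str.partition(".")
    return datetime.strptime(dt, "%Y-%m-%dT%H:%M:%S")

pickle_file = 'data.pkl'
t0 = time.time()
# Placeholder "next stop" for rows after the last EnteredEvent in the data.
last_stop_timestamp = datetime.utcnow()
data_to_process = data # data.loc[data['journey_number'] == 1]
# One value per row; .size would count rows * columns.
ts = pd.DataFrame(np.zeros(len(data_to_process)), columns=list('t'))

# Walk the rows backwards so that the most recently seen EnteredEvent
# is the next stop for every row that precedes it.
for i, d in reversed(list(data_to_process.iterrows())):
    t = parse_time(d['timestamp'])
    if d['event'] == 'EnteredEvent':
        last_stop_timestamp = t
    else:
        # total_seconds() keeps whole days; .seconds wraps around at 24 h.
        ts.iloc[i] = (last_stop_timestamp - t).total_seconds()

elapsed = time.time() - t0
print("Data processed in", elapsed, "seconds")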


# NOTE: 'segment_number' and 'speed' do not appear in the data.head() output
# above; they must be present in `data` for this selection to work.
gp_df = pd.concat([
    data_to_process['latitude'],
    data_to_process['longitude'],
    data_to_process['journey_number'],
    data_to_process['segment_number'],
    data_to_process['speed'],
    ts],
    axis=1)

# Shorten the column names; 'speed' and 't' keep their names.
gp_renamed = gp_df.rename(columns={
    'latitude': 'lat',
    'longitude': 'long',
    'journey_number': 'traj',
    'segment_number': 'seg'})

gp_renamed.to_pickle(pickle_file)


The processed data can now be loaded with `df = pd.read_pickle('data.pkl')`.
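
Since the row-by-row loop above takes a long time, a vectorized version may be worth trying. The following is only a sketch, not the method used to produce "data.pkl": the function name `seconds_to_next_stop` and the `demo` frame are made up for illustration, and it assumes the same `timestamp` and `event` columns as shown in the output of `data.head()`. It differs from the loop in one respect: rows after the last `EnteredEvent` come out as NaN instead of being measured against the current time.

```python
import pandas as pd

def seconds_to_next_stop(df):
    """Seconds from each row until the next 'EnteredEvent' row (hypothetical helper)."""
    # Drop the fractional seconds and UTC offset, as parse_time does.
    ts = pd.to_datetime(df["timestamp"].str.partition(".")[0],
                        format="%Y-%m-%dT%H:%M:%S")
    # Keep timestamps only on stop events, then fill each one backwards
    # over the rows that precede it.
    next_stop = ts.where(df["event"] == "EnteredEvent").bfill()
    # Rows with no later stop have no next_stop and become NaN.
    return (next_stop - ts).dt.total_seconds()

# Tiny made-up example in the same format as the bus data.
demo = pd.DataFrame({
    "timestamp": ["2018-02-16T09:31:00.0000000+01:00",
                  "2018-02-16T09:31:05.0000000+01:00",
                  "2018-02-16T09:31:20.0000000+01:00",
                  "2018-02-16T09:31:30.0000000+01:00"],
    "event": ["ObservedPositionEvent", "ObservedPositionEvent",
              "EnteredEvent", "ObservedPositionEvent"],
})
print(seconds_to_next_stop(demo).tolist())  # [20.0, 15.0, 0.0, nan]
```

Like the loop, this sketch does not separate journeys, so the next stop for a row may belong to a different journey if the rows of several journeys are interleaved.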