# Label each row with features required to run ANN notebooks

The new features are:
* The time left until the next bus stop (seconds)
* The time it takes to travel the full segment (seconds)
* The time from the start of the journey to the start of the current segment (seconds)

In [1]:
import numpy as np
import pandas as pds
import datetime as dt
import time

In [2]:
df = pds.read_csv('bus203_all.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,timestamp,event,vehicle_id,line,longitude,latitude,direction,speed,station,journey_number,segment_number
0,0,2018-02-16T04:48:40.0000000+01:00,JourneyStartedEvent,5432,203,58.414238,15.571015,-1.000000,-1.00,,1,1
1,1,2018-02-16T04:48:40.0000000+01:00,ObservedPositionEvent,5432,0,58.414238,15.571015,147.300003,0.00,,1,1
2,2,2018-02-16T04:48:40.0000000+01:00,ArrivedEvent,5432,203,58.414238,15.571015,-1.000000,-1.00,Rydsvägens ändhållpl.,1,1
3,3,2018-02-16T04:48:41.0000000+01:00,ObservedPositionEvent,5432,0,58.414246,15.571012,147.300003,0.00,,1,1
4,4,2018-02-16T04:48:42.0000000+01:00,ObservedPositionEvent,5432,0,58.414249,15.571008,147.300003,0.00,,1,1
5,5,2018-02-16T04:48:43.0000000+01:00,ObservedPositionEvent,5432,0,58.414257,15.571004,147.300003,0.00,,1,1
6,6,2018-02-16T04:48:44.0000000+01:00,ObservedPositionEvent,5432,0,58.414257,15.571006,147.300003,0.00,,1,1
7,7,2018-02-16T04:48:45.0000000+01:00,ObservedPositionEvent,5432,0,58.414261,15.571008,147.300003,0.00,,1,1
8,8,2018-02-16T04:48:46.0000000+01:00,ObservedPositionEvent,5432,0,58.414261,15.571010,147.300003,0.00,,1,1
9,9,2018-02-16T04:48:47.0000000+01:00,ObservedPositionEvent,5432,0,58.414261,15.571012,147.300003,0.00,,1,1


Ignore entries that are not `ObservedPositionEvent`

In [3]:
df = df[df['event'] == "ObservedPositionEvent"]

Filtering leaves the original row indexes untouched, so reset them.

In [4]:
df = df.reset_index(drop=True)
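
The reset matters because later cells write results by row position. A minimal sketch on a made-up frame (the `toy` data below is hypothetical, not taken from `bus203_all.csv`) shows what the filter does to the index and how the reset renumbers it:

```python
import pandas as pds

# Hypothetical toy frame standing in for the bus data.
toy = pds.DataFrame({
    "event": ["JourneyStartedEvent", "ObservedPositionEvent",
              "ArrivedEvent", "ObservedPositionEvent"],
    "speed": [-1.0, 0.0, -1.0, 3.5],
})

# Boolean filtering keeps the original row labels of the surviving rows.
filtered = toy[toy["event"] == "ObservedPositionEvent"]
print(list(filtered.index))

# reset_index(drop=True) renumbers the rows from 0 and discards the old labels.
renumbered = filtered.reset_index(drop=True)
print(list(renumbered.index))
```

After the reset, row labels and row positions coincide, which is what allows a label from `iterrows()` to be used with `.iloc`.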

Convert `timestamp` to pandas datetime object.

In [5]:
# Parse as UTC (the strings carry a +01:00 offset), then convert to local Swedish time.
df['timestamp'] = pds.to_datetime(df['timestamp'], utc=True).dt.tz_convert("Europe/Stockholm")
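
To check the conversion in isolation, here is a small sketch on two sample strings in the same format as the `timestamp` column. It assumes a pandas version where `to_datetime` accepts `utc=True`; the `raw` series is made up for illustration.

```python
import pandas as pds

# Two sample strings in the same format as the `timestamp` column.
raw = pds.Series([
    "2018-02-16T04:48:40.0000000+01:00",
    "2018-02-16T04:48:41.0000000+01:00",
])

# utc=True normalises every parsed value to UTC, whatever offset the string carries.
as_utc = pds.to_datetime(raw, utc=True)

# tz_convert expresses the same instants in local Swedish time.
local = as_utc.dt.tz_convert("Europe/Stockholm")
print(local.iloc[0])

# Subtracting two timestamps gives a Timedelta, which total_seconds() turns into a float.
print((local.iloc[1] - local.iloc[0]).total_seconds())
```

Converting the time zone only changes how the instants are displayed, so differences between timestamps are unaffected.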

This looks like a triple loop, but the two outer loops only group the rows by journey and then by segment. The innermost loop does all the work and visits every row exactly once, so the running time should be linear in the number of rows. The run used only ~50% of my 8 GB of RAM but took ~30 minutes.

In [6]:
time_left = pds.DataFrame(np.zeros(len(df.index)), columns=['time_left'])
segment_time = pds.DataFrame(np.zeros(len(df.index)), columns=['segment_time'])
# Time since journey start
tsjs = pds.DataFrame(np.zeros(len(df.index)), columns=['tsjs'])


t0 = time.time()

for j, df_j in df.groupby('journey_number'):
    journey_start = df_j['timestamp'].iloc[0]
    for k, df_s in df_j.groupby('segment_number'):
        end_time = df_s['timestamp'].iloc[-1]
        start_time = df_s['timestamp'].iloc[0]
        for idx, row in df_s.iterrows():
            # Subtracting two timestamps gives a Timedelta;
            # total_seconds() converts that Timedelta to seconds
            time_left.iloc[idx] = (end_time - row['timestamp']).total_seconds()
            segment_time.iloc[idx] = (end_time - start_time).total_seconds()
            tsjs.iloc[idx] = (start_time - journey_start).total_seconds()

elapsed = time.time() - t0
print("Data processed in", elapsed, " seconds")

Data processed in 1968.3259329795837  seconds
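
As an alternative to the row-by-row loop, the same three features can probably be computed with `groupby(...).transform(...)`. This is a sketch, not the method used in this notebook. It has not been timed on the full dataset, the `toy` frame is made up, and it assumes (as the loop does) that rows are already in time order within each segment.

```python
import pandas as pds

# Hypothetical toy data: one journey with two segments.
toy = pds.DataFrame({
    "journey_number": [1, 1, 1, 1, 1],
    "segment_number": [1, 1, 1, 2, 2],
    "timestamp": pds.to_datetime([
        "2018-02-16 04:48:40", "2018-02-16 04:48:41", "2018-02-16 04:48:43",
        "2018-02-16 04:48:50", "2018-02-16 04:48:55",
    ]),
})

# First and last timestamp of each (journey, segment) group, broadcast to every row.
seg = toy.groupby(["journey_number", "segment_number"])["timestamp"]
seg_start = seg.transform("first")
seg_end = seg.transform("last")

# First timestamp of each journey, broadcast to every row.
journey_start = toy.groupby("journey_number")["timestamp"].transform("first")

# Same definitions as in the loop, applied to whole columns at once.
toy["time_left"] = (seg_end - toy["timestamp"]).dt.total_seconds()
toy["segment_time"] = (seg_end - seg_start).dt.total_seconds()
toy["tsjs"] = (seg_start - journey_start).dt.total_seconds()
print(toy)
```

Because the features are written straight into the frame, this variant would not need the separate `time_left`, `segment_time` and `tsjs` frames or the later `concat`.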


Add new features to dataframe `df`

In [7]:
data = pds.concat([df, time_left, segment_time, tsjs], axis=1)
data.head()

Unnamed: 0.1,Unnamed: 0,timestamp,event,vehicle_id,line,longitude,latitude,direction,speed,station,journey_number,segment_number,time_left,segment_time,tsjs
0,1,2018-02-16 04:48:40+01:00,ObservedPositionEvent,5432,0,58.414238,15.571015,147.300003,0.0,,1,1,71.0,71.0,0.0
1,3,2018-02-16 04:48:41+01:00,ObservedPositionEvent,5432,0,58.414246,15.571012,147.300003,0.0,,1,1,70.0,71.0,0.0
2,4,2018-02-16 04:48:42+01:00,ObservedPositionEvent,5432,0,58.414249,15.571008,147.300003,0.0,,1,1,69.0,71.0,0.0
3,5,2018-02-16 04:48:43+01:00,ObservedPositionEvent,5432,0,58.414257,15.571004,147.300003,0.0,,1,1,68.0,71.0,0.0
4,6,2018-02-16 04:48:44+01:00,ObservedPositionEvent,5432,0,58.414257,15.571006,147.300003,0.0,,1,1,67.0,71.0,0.0


Rename some columns so they match the names used in the GP model and are easier to follow.

In [8]:
data.rename(columns={'longitude': 'lon', 'latitude': 'lat', 'segment_number': 'seg', 'journey_number': 'journey'}, inplace=True)
data.head()

Unnamed: 0.1,Unnamed: 0,timestamp,event,vehicle_id,line,lon,lat,direction,speed,station,journey,seg,time_left,segment_time,tsjs
0,1,2018-02-16 04:48:40+01:00,ObservedPositionEvent,5432,0,58.414238,15.571015,147.300003,0.0,,1,1,71.0,71.0,0.0
1,3,2018-02-16 04:48:41+01:00,ObservedPositionEvent,5432,0,58.414246,15.571012,147.300003,0.0,,1,1,70.0,71.0,0.0
2,4,2018-02-16 04:48:42+01:00,ObservedPositionEvent,5432,0,58.414249,15.571008,147.300003,0.0,,1,1,69.0,71.0,0.0
3,5,2018-02-16 04:48:43+01:00,ObservedPositionEvent,5432,0,58.414257,15.571004,147.300003,0.0,,1,1,68.0,71.0,0.0
4,6,2018-02-16 04:48:44+01:00,ObservedPositionEvent,5432,0,58.414257,15.571006,147.300003,0.0,,1,1,67.0,71.0,0.0


... remove unwanted columns ...

In [9]:
data = data.drop(columns=['Unnamed: 0', 'event', 'vehicle_id', 'line', 'station'])
data.head()

Unnamed: 0,timestamp,lon,lat,direction,speed,journey,seg,time_left,segment_time,tsjs
0,2018-02-16 04:48:40+01:00,58.414238,15.571015,147.300003,0.0,1,1,71.0,71.0,0.0
1,2018-02-16 04:48:41+01:00,58.414246,15.571012,147.300003,0.0,1,1,70.0,71.0,0.0
2,2018-02-16 04:48:42+01:00,58.414249,15.571008,147.300003,0.0,1,1,69.0,71.0,0.0
3,2018-02-16 04:48:43+01:00,58.414257,15.571004,147.300003,0.0,1,1,68.0,71.0,0.0
4,2018-02-16 04:48:44+01:00,58.414257,15.571006,147.300003,0.0,1,1,67.0,71.0,0.0


... and save it.

In [None]:
data.to_pickle('ANN_dataset.pkl')