In this notebook, I will try to get an understanding of the given dataset.


In [1]:
# Importing the modules that might be used
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import dask.dataframe as dd

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

## Exploring the dataset
We are given three datasets:
* Leave times
* Trips 
* Vehicles

I will load them into the notebook and examine each column individually.

In [2]:
# error_bad_lines=False skips lines with too many fields instead of raising an error

trips = pd.read_csv('/home/faye/data/rt_trips_DB_2018.txt', sep=';',error_bad_lines=False)
vehicles = pd.read_csv('/home/faye/data/rt_vehicles_DB_2018.txt', sep=';',error_bad_lines=False)
leave_times = dd.read_csv('/home/faye/data/rt_leavetimes_DB_2018.txt', sep=';',error_bad_lines=False)
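A note on versions: `error_bad_lines` was deprecated in pandas 1.3 in favour of `on_bad_lines` and removed in pandas 2.0. The sketch below shows the equivalent call on a small in-memory sample. The sample text and the name `sample_df` are made up for illustration; the real files are read from the paths above.

```python
import io
import pandas as pd

# Hypothetical sample: the third data line has one field too many.
sample = io.StringIO(
    "TRIPID;LINEID;DIRECTION\n"
    "1;68;1\n"
    "2;25B;2\n"
    "3;45A;2;extra\n"
    "4;14;1\n"
)

# on_bad_lines='skip' drops malformed lines instead of raising an error
# (requires pandas >= 1.3).
sample_df = pd.read_csv(sample, sep=';', on_bad_lines='skip')
print(sample_df.shape)
```

The malformed line is silently dropped, so it is worth comparing the row count with the line count of the raw file if lost rows matter.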

## Leave Times Data

#### Understanding the features:
* DATASOURCE: Unique Bus Operator Code
* DAYOFSERVICE: Day of service. One day of service could last more than 24 hours.
* TRIPID: Refers to a unique Trip. Will be investigated further.
* PROGRNUMBER: Sequential position of the stop point within the trip.
* STOPPOINTID: Unique stop point code
* PLANNEDTIME_ARR: Planned arrival time at the stop point, in seconds
* PLANNEDTIME_DEP: Planned departure time from the stop point, in seconds
* ACTUALTIME_ARR: Actual arrival time at the stop point, in seconds
* ACTUALTIME_DEP: Actual departure time from the stop point, in seconds
* VEHICLEID: Unique vehicle code arriving at this stop point
* PASSENGERS: Number of passengers on board (previous link)
* PASSENGERSIN: Number of boarded passengers
* PASSENGERSOUT: Number of descended passengers
* DISTANCE: Distance measured from the beginning of the trip
* SUPPRESSED: When the trip is partially suppressed, indicates whether the previous link is suppressed (0 = achieved, 1 = suppressed)
* JUSTIFICATIONID: Fault code
* LASTUPDATE: Time of the last record update
* NOTE: Free note. The documentation lists this as a string, but it is read here as float64, most likely because the values are all missing.
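Since the time columns are stored in seconds and a day of service can run past 24 hours, it can help to turn them into real timestamps. The sketch below assumes the seconds are counted from midnight at the start of the service day; the documentation summarised above does not confirm this. The frame `sample` and the column `PLANNED_ARR_TS` are hypothetical names, and the two values are taken from outputs shown in this notebook.

```python
import pandas as pd

# Toy rows using values that appear in this notebook's outputs.
sample = pd.DataFrame({
    "DAYOFSERVICE": ["01-JAN-18 00:00:00", "07-FEB-18 00:00:00"],
    "PLANNEDTIME_ARR": [48030, 87245],
})

# Assumption: the time columns count seconds from midnight at the start of
# the service day, so values above 86400 fall on the next calendar day.
service_day = pd.to_datetime(sample["DAYOFSERVICE"], format="%d-%b-%y %H:%M:%S")
sample["PLANNED_ARR_TS"] = service_day + pd.to_timedelta(sample["PLANNEDTIME_ARR"], unit="s")
print(sample)
```

Under that assumption, the second row (87245 seconds) lands shortly after midnight on 8 February, even though its day of service is 7 February.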

#### Exploring Leave Times Data

* check the first 5 rows
* last 5 rows
* check rows and columns
* check data types
* check missing data
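The row, column and missing-data checks in this list can be wrapped in one helper. This is only a sketch: `overview` is a hypothetical name, and the toy frame stands in for `leave_times`. It relies on `isnull()`, `sum()` and `len()` being available on both pandas and dask dataframes, and calls `.compute()` only when the result is lazy.

```python
import pandas as pd

def overview(df):
    """Return (n_rows, n_columns, missing_per_column) for a pandas or dask dataframe."""
    n_rows = len(df)            # for dask this triggers a pass over the data
    n_cols = len(df.columns)
    missing = df.isnull().sum()
    # dask returns a lazy result; pandas returns the Series directly.
    if hasattr(missing, "compute"):
        missing = missing.compute()
    return n_rows, n_cols, missing

# Toy frame standing in for leave_times (values are illustrative).
toy = pd.DataFrame({
    "TRIPID": [5972116, 5966674, 5959105],
    "PASSENGERS": [None, None, None],
    "NOTE": [None, None, None],
})
n_rows, n_cols, missing = overview(toy)
print(n_rows, n_cols)
print(missing)
```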

In [3]:
leave_times.head(5)

Unnamed: 0,DATASOURCE,DAYOFSERVICE,TRIPID,PROGRNUMBER,STOPPOINTID,PLANNEDTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_ARR,ACTUALTIME_DEP,VEHICLEID,PASSENGERS,PASSENGERSIN,PASSENGERSOUT,DISTANCE,SUPPRESSED,JUSTIFICATIONID,LASTUPDATE,NOTE
0,DB,01-JAN-18 00:00:00,5972116,12,119,48030,48030,48012,48012,2693211,,,,,,,08-JAN-18 17:21:10,
1,DB,01-JAN-18 00:00:00,5966674,12,119,54001,54001,54023,54023,2693267,,,,,,,08-JAN-18 17:21:10,
2,DB,01-JAN-18 00:00:00,5959105,12,119,60001,60001,59955,59955,2693263,,,,,,,08-JAN-18 17:21:10,
3,DB,01-JAN-18 00:00:00,5966888,12,119,58801,58801,58771,58771,2693284,,,,,,,08-JAN-18 17:21:10,
4,DB,01-JAN-18 00:00:00,5965960,12,119,56401,56401,56309,56323,2693209,,,,,,,08-JAN-18 17:21:10,


In [4]:
leave_times.tail(5)

Unnamed: 0,DATASOURCE,DAYOFSERVICE,TRIPID,PROGRNUMBER,STOPPOINTID,PLANNEDTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_ARR,ACTUALTIME_DEP,VEHICLEID,PASSENGERS,PASSENGERSIN,PASSENGERSOUT,DISTANCE,SUPPRESSED,JUSTIFICATIONID,LASTUPDATE,NOTE
281017,DB,31-DEC-18 00:00:00,8588153,78,4383,28605,28605,28998,29013,3265721,,,,,,,16-JAN-19 18:27:21,
281018,DB,31-DEC-18 00:00:00,8587459,78,4383,22695,22695,23247,23247,3265687,,,,,,,16-JAN-19 18:27:21,
281019,DB,31-DEC-18 00:00:00,8586183,78,4383,51481,51481,52237,52283,2693229,,,,,,,16-JAN-19 18:27:21,
281020,DB,31-DEC-18 00:00:00,8589374,23,7053,53659,53659,53525,53525,3265669,,,,,,,16-JAN-19 18:27:21,
281021,DB,31-DEC-18 00:00:00,8589372,24,2088,46383,46383,46315,46325,3265669,,,,,,,16-JAN-19 18:27:21,


In [5]:
leave_times.dtypes

DATASOURCE          object
DAYOFSERVICE        object
TRIPID               int64
PROGRNUMBER          int64
STOPPOINTID          int64
PLANNEDTIME_ARR      int64
PLANNEDTIME_DEP      int64
ACTUALTIME_ARR       int64
ACTUALTIME_DEP       int64
VEHICLEID            int64
PASSENGERS         float64
PASSENGERSIN       float64
PASSENGERSOUT      float64
DISTANCE           float64
SUPPRESSED         float64
JUSTIFICATIONID    float64
LASTUPDATE          object
NOTE               float64
dtype: object

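# Number of unique values for each feature.
# Each .compute() call triggers a separate pass over the dask dataframe.
for col in leave_times.columns:
    num = leave_times[col].nunique().compute()
    line = f'{col}\t{num}'
    print(line)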

## Trips Data

Each row represents one trip (route).
#### Understanding the features:
* DATASOURCE: Unique Bus Operator Code
* DAYOFSERVICE: Day of service. One day of service could last more than 24 hours
* TRIPID: Unique Trip code
* LINEID: Unique Line code
* ROUTEID: Unique route code
* DIRECTION: Route direction: IB = inbound / going / northbound / eastbound, OB = outbound / back / southbound / westbound
* PLANNEDTIME_ARR: Planned arrival time of the trip, in seconds
* PLANNEDTIME_DEP: Planned departure time of the trip, in seconds
* ACTUALTIME_ARR: Actual arrival time of the trip, in seconds
* ACTUALTIME_DEP: Actual departure time of the trip, in seconds
* BASIN: basin code
* TENDERLOT: tender lot
* SUPPRESSED: The whole trip has been suppressed (0 = achieved, 1 = suppressed)
* JUSTIFICATIONID: Fault code
* LASTUPDATE: Time of the last record update 
* NOTE: Free note

In [6]:
trips.shape

(2182637, 16)

In [7]:
trips.head(5)

Unnamed: 0,DATASOURCE,DAYOFSERVICE,TRIPID,LINEID,ROUTEID,DIRECTION,PLANNEDTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_ARR,ACTUALTIME_DEP,BASIN,TENDERLOT,SUPPRESSED,JUSTIFICATIONID,LASTUPDATE,NOTE
0,DB,07-FEB-18 00:00:00,6253783,68,68_80,1,87245,84600,87524.0,84600.0,BasDef,,,,28-FEB-18 12:05:11,",2967409,"
1,DB,07-FEB-18 00:00:00,6262138,25B,25B_271,2,30517,26460,32752.0,,BasDef,,,,28-FEB-18 12:05:11,",2580260,"
2,DB,07-FEB-18 00:00:00,6254942,45A,45A_70,2,35512,32100,36329.0,32082.0,BasDef,,,,28-FEB-18 12:05:11,",2448968,"
3,DB,07-FEB-18 00:00:00,6259460,25A,25A_273,1,57261,54420,58463.0,54443.0,BasDef,,,,28-FEB-18 12:05:11,",3094242,"
4,DB,07-FEB-18 00:00:00,6253175,14,14_15,1,85383,81600,84682.0,81608.0,BasDef,,,,28-FEB-18 12:05:11,",2526331,"


In [8]:
trips.tail(5)

Unnamed: 0,DATASOURCE,DAYOFSERVICE,TRIPID,LINEID,ROUTEID,DIRECTION,PLANNEDTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_ARR,ACTUALTIME_DEP,BASIN,TENDERLOT,SUPPRESSED,JUSTIFICATIONID,LASTUPDATE,NOTE
2182632,DB,14-MAY-18 00:00:00,6765849,123,123_36,2,61560,57840,61365.0,57859.0,BasDef,,,,26-JUN-18 09:13:13,",3216350,"
2182633,DB,14-MAY-18 00:00:00,6765469,75,75_17,1,53416,48600,,48823.0,BasDef,,,,26-JUN-18 09:13:13,",2865284,"
2182634,DB,14-MAY-18 00:00:00,6765486,33D,33D_62,2,29460,26400,29904.0,,BasDef,,,,26-JUN-18 09:13:13,",3077688,"
2182635,DB,14-MAY-18 00:00:00,6764987,70,70_60,1,65277,60600,66341.0,,BasDef,,,,26-JUN-18 09:13:13,",3208841,"
2182636,DB,14-MAY-18 00:00:00,6765012,27,27_19,1,47722,41700,47508.0,41642.0,BasDef,,,,26-JUN-18 09:13:13,",2960092,"


In [9]:
trips.dtypes

DATASOURCE          object
DAYOFSERVICE        object
TRIPID               int64
LINEID              object
ROUTEID             object
DIRECTION            int64
PLANNEDTIME_ARR      int64
PLANNEDTIME_DEP      int64
ACTUALTIME_ARR     float64
ACTUALTIME_DEP     float64
BASIN               object
TENDERLOT          float64
SUPPRESSED         float64
JUSTIFICATIONID    float64
LASTUPDATE          object
NOTE                object
dtype: object

In [11]:
# Unique values for each feature
trips.nunique()

DATASOURCE              1
DAYOFSERVICE          360
TRIPID             658964
LINEID                130
ROUTEID               588
DIRECTION               2
PLANNEDTIME_ARR     64461
PLANNEDTIME_DEP       791
ACTUALTIME_ARR      68122
ACTUALTIME_DEP      66771
BASIN                   1
TENDERLOT               0
SUPPRESSED              1
JUSTIFICATIONID      3526
LASTUPDATE            360
NOTE                46690
dtype: int64
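TRIPID has 658,964 distinct values across 2,182,637 rows, so despite its description it does not identify a row by itself. One hypothesis, not confirmed by the documentation, is that the pair (DAYOFSERVICE, TRIPID) is the actual key. A minimal sketch of how that could be checked, using a hypothetical helper `is_unique_key` and a toy frame in place of `trips`:

```python
import pandas as pd

# Toy frame: TRIPID 100 is reused on two different service days.
toy_trips = pd.DataFrame({
    "DAYOFSERVICE": ["07-FEB-18 00:00:00", "08-FEB-18 00:00:00", "08-FEB-18 00:00:00"],
    "TRIPID": [100, 100, 101],
})

def is_unique_key(df, cols):
    """True if no two rows share the same combination of values in cols."""
    return not df.duplicated(subset=cols).any()

print(is_unique_key(toy_trips, ["TRIPID"]))
print(is_unique_key(toy_trips, ["DAYOFSERVICE", "TRIPID"]))
```

Running `is_unique_key(trips, ['DAYOFSERVICE', 'TRIPID'])` on the real data would confirm or reject the hypothesis.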

In [12]:
# Missing values for each feature
trips.isnull().sum()

DATASOURCE               0
DAYOFSERVICE             0
TRIPID                   0
LINEID                   0
ROUTEID                  0
DIRECTION                0
PLANNEDTIME_ARR          0
PLANNEDTIME_DEP          0
ACTUALTIME_ARR      137207
ACTUALTIME_DEP      164551
BASIN                    0
TENDERLOT          2182637
SUPPRESSED         2178304
JUSTIFICATIONID    2178307
LASTUPDATE               0
NOTE                     0
dtype: int64

* Three features have a single distinct value and will be dropped: BASIN, SUPPRESSED and DATASOURCE. SUPPRESSED is also missing in over 99% of rows.
* TENDERLOT is missing in 100% of rows, so this feature will be dropped. In business terms, tender lot may refer to whether or not the trip is a paid-for service. If so, it makes sense that it is hidden for privacy. It doesn't relate much to the project.
* ACTUALTIME_ARR is missing in 137,207 rows, which is only about 6% of the data.
* ACTUALTIME_DEP is similar: 164,551 rows are missing, about 7.5% of the data.
* LASTUPDATE: according to the documentation, this is used to detect new data availability.
* SUPPRESSED doesn't look like it gives any meaningful data, and it is not clear why a trip would be suppressed. There are no 1.0 values at all, so no row in the trips dataset is marked as suppressed; every non-missing value is 0. As shown above, LINEID, TRIPID, etc. don't have any null values. From the documentation, it looks like a NaN value may just mean that the trip is partially suppressed, which might explain why TENDERLOT is 100% missing.
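The percentages and drop candidates discussed above can be derived in one step rather than read off two separate outputs. The sketch below uses a hypothetical helper `column_report` and a toy frame in place of `trips`; the values are illustrative.

```python
import pandas as pd

def column_report(df):
    """Per-column share of missing values (%) and number of distinct non-null values."""
    return pd.DataFrame({
        "missing_pct": (df.isnull().mean() * 100).round(2),
        "n_unique": df.nunique(),
    })

# Toy frame standing in for trips (values are illustrative).
toy = pd.DataFrame({
    "DATASOURCE": ["DB", "DB", "DB", "DB"],
    "TENDERLOT": [None, None, None, None],
    "ACTUALTIME_ARR": [87524.0, None, 36329.0, 58463.0],
    "LINEID": ["68", "25B", "45A", "25A"],
})
report = column_report(toy)

# Candidates for dropping: entirely missing, or a single value with no gaps.
all_missing = report.index[report["missing_pct"] == 100].tolist()
constant = report.index[(report["n_unique"] == 1) & (report["missing_pct"] == 0)].tolist()
print(report)
print(all_missing, constant)
```

A column like SUPPRESSED, with one distinct value but mostly missing rows, falls in neither list, so it still needs a manual decision.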

In [22]:
# Inspect the rows where SUPPRESSED is not missing.
# Per the documentation, 0 = achieved and 1 = suppressed.
non_missing_suppressed = trips.loc[trips['SUPPRESSED'].notna(), 'SUPPRESSED']
for value in non_missing_suppressed:
    if value == 1.0:
        print(value)
    else:
        print("No 1.0")

No 1.0
... (identical line printed for every other non-missing row)
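The same check can be done without printing one line per row. This is a sketch on a toy column standing in for `trips['SUPPRESSED']`; the names `suppressed`, `counts` and `any_suppressed` are hypothetical.

```python
import pandas as pd

# Toy column standing in for trips['SUPPRESSED'] (values are illustrative).
suppressed = pd.Series([None, 0.0, None, 0.0, None], name="SUPPRESSED", dtype="float64")

# Count each distinct value, keeping NaN as its own category.
counts = suppressed.value_counts(dropna=False)
# True if at least one row is flagged as suppressed.
any_suppressed = bool((suppressed == 1.0).any())
print(counts)
print(any_suppressed)
```

On the real column, `value_counts(dropna=False)` gives the number of 0, 1 and missing rows in a single short output.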

In [14]:
trips.loc[1783]

DATASOURCE                         DB
DAYOFSERVICE       18-FEB-18 00:00:00
TRIPID                        6269995
LINEID                            45A
ROUTEID                        45A_60
DIRECTION                           1
PLANNEDTIME_ARR                 68767
PLANNEDTIME_DEP                 65700
ACTUALTIME_ARR                    NaN
ACTUALTIME_DEP                  65746
BASIN                          BasDef
TENDERLOT                         NaN
SUPPRESSED                          0
JUSTIFICATIONID                194642
LASTUPDATE         26-FEB-18 11:09:33
NOTE                        ,2428302,
Name: 1783, dtype: object

## Vehicles Data

#### Understanding the features:
* DATASOURCE: Unique Bus Operator Code
* DAYOFSERVICE: Day of service
* VEHICLEID: Unique vehicle code
* DISTANCE: Distance travelled by the vehicle in the corresponding day
* MINUTES: Time worked by the vehicle in the corresponding day
* LASTUPDATE: Time of the last record update
* NOTE: Free note

In [15]:
vehicles.shape

(272622, 7)

In [16]:
vehicles.head(5)

Unnamed: 0,DATASOURCE,DAYOFSERVICE,VEHICLEID,DISTANCE,MINUTES,LASTUPDATE,NOTE
0,DB,23-NOV-18 00:00:00,3303848,286166,58849,04-DEC-18 08:03:09,
1,DB,23-NOV-18 00:00:00,3303847,259545,56828,04-DEC-18 08:03:09,
2,DB,28-FEB-18 00:00:00,2868329,103096,40967,08-MAR-18 10:35:59,
3,DB,28-FEB-18 00:00:00,2868330,147277,43599,08-MAR-18 10:35:59,
4,DB,28-FEB-18 00:00:00,2868331,224682,40447,08-MAR-18 10:35:59,


In [17]:
vehicles.tail(5)

Unnamed: 0,DATASOURCE,DAYOFSERVICE,VEHICLEID,DISTANCE,MINUTES,LASTUPDATE,NOTE
272617,DB,29-DEC-18 00:00:00,3393878,264237,62320,16-JAN-19 18:00:42,
272618,DB,29-DEC-18 00:00:00,3394105,250335,52254,16-JAN-19 18:00:42,
272619,DB,29-DEC-18 00:00:00,3394109,172539,44349,16-JAN-19 18:00:42,
272620,DB,29-DEC-18 00:00:00,3394130,188057,38948,16-JAN-19 18:00:42,
272621,DB,29-DEC-18 00:00:00,3394131,291697,63677,16-JAN-19 18:00:42,


In [18]:
vehicles.dtypes

DATASOURCE       object
DAYOFSERVICE     object
VEHICLEID         int64
DISTANCE          int64
MINUTES           int64
LASTUPDATE       object
NOTE            float64
dtype: object

In [19]:
#  Unique values for each feature
vehicles.nunique()

DATASOURCE           1
DAYOFSERVICE       360
VEHICLEID         1152
DISTANCE        170498
MINUTES          57523
LASTUPDATE         360
NOTE                 0
dtype: int64

In [20]:
# Missing values for each feature
vehicles.isnull().sum()

DATASOURCE           0
DAYOFSERVICE         0
VEHICLEID            0
DISTANCE             0
MINUTES              0
LASTUPDATE           0
NOTE            272622
dtype: int64

* DATASOURCE has only 1 unique value, so it will probably be dropped.
* NOTE is missing in all rows, so it will be dropped.
* The other features have no missing values.