# Data Quality Preparation
In this section I will clean and prepare the vehicles and trips tables for the prediction model. It is divided into two parts, as the two tables are cleaned and prepared separately.

#### Introduction:

### Trip Dataset
Each row represents one trip (route).
#### Understanding the features:
* DATASOURCE: Unique Bus Operator Code
* DAYOFSERVICE: Day of service. One day of service could last more than 24 hours
* TRIPID: Unique Trip code
* LINEID: Unique Line code
* ROUTEID: Unique route code
* DIRECTION: Route direction: (2)IB = inbound / going / northbound / eastbound, (1)OB = outbound / back / southbound / westbound
* PLANNEDTIME_ARR: Planned arrival time of the trip, in seconds
* PLANNEDTIME_DEP: Planned departure time of the trip, in seconds
* ACTUALTIME_ARR: Actual arrival time of the trip, in seconds
* ACTUALTIME_DEP: Actual departure time of the trip, in seconds
* BASIN: basin code
* TENDERLOT: tender lot
* SUPPRESSED: The whole trip has been suppressed (0 = achieved, 1 = suppressed)
* JUSTIFICATIONID: Fault code
* LASTUPDATE: Time of the last record update 
* NOTE: Free note
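
Because one day of service can last more than 24 hours, the four time fields are probably best read as seconds elapsed since the start of the service day rather than as clock times (this reading is an assumption based on the DAYOFSERVICE description). A minimal sketch of what that means in pandas, using the planned departure and arrival values from the first row of the trips preview further down:

```python
import pandas as pd

# Planned departure and arrival of one trip, in seconds
planned = pd.Series([84600, 87245], index=["PLANNEDTIME_DEP", "PLANNEDTIME_ARR"])

# Vectorised conversion; values above 86400 roll over into a "1 days" component
as_timedelta = pd.to_timedelta(planned, unit="s")
print(as_timedelta)

# Planned trip duration in minutes, computed from the raw seconds
duration_min = (planned["PLANNEDTIME_ARR"] - planned["PLANNEDTIME_DEP"]) / 60
print(duration_min)
```

The arrival of this trip falls after midnight, so any later step that discards the days component also discards the information that it belongs to the end of the service day.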

### Vehicles Dataset
Each record of this table represents the service of one vehicle for one day of service and tells about
the overall distance and time worked by the associated vehicle in that specific day.

#### Understanding the features:
* DATASOURCE: Unique Bus Operator Code
* DAYOFSERVICE: Day of service
* VEHICLEID: Unique vehicle code arriving at this stop point
* DISTANCE: Distance travelled by the vehicle in the corresponding day
* MINUTES: Time worked by the vehicle in the corresponding day
* LASTUPDATE: Time of the last record update
* NOTE: Free note

In [1]:
# Importing the modules that might be used
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import dask.dataframe as dd
import sqlite3
from sqlalchemy import create_engine
from pprint import pprint

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
leave_times = pd.read_feather('/home/faye/data/leavetimes.feather')

In [3]:
# Loading in the data
# on_bad_lines='skip' (pandas >= 1.3) replaces the deprecated error_bad_lines=False
trips = pd.read_csv('/home/faye/data/rt_trips_DB_2018.txt', sep=';', on_bad_lines='skip')
vehicles = pd.read_csv('/home/faye/data/rt_vehicles_DB_2018.txt', sep=';', on_bad_lines='skip')
# leave_times_db = create_engine('sqlite:///rt_leavetimes_DB_2018.db')
# The connection is needed by the read_sql call below
con = sqlite3.connect('/home/faye/notebooks/rt_leavetimes_DB_2018.db')

# Intended result: for every TRIPID, the row with the maximum PROGRNUMBER
leavetimes = pd.read_sql("select * from chunk_sql group by TRIPID order by max(PROGRNUMBER);", con)

## Trips 
*Exploring the data*
<br><br>

In [4]:
trips.head(50)

Unnamed: 0,DATASOURCE,DAYOFSERVICE,TRIPID,LINEID,ROUTEID,DIRECTION,PLANNEDTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_ARR,ACTUALTIME_DEP,BASIN,TENDERLOT,SUPPRESSED,JUSTIFICATIONID,LASTUPDATE,NOTE
0,DB,07-FEB-18 00:00:00,6253783,68,68_80,1,87245,84600,87524.0,84600.0,BasDef,,,,28-FEB-18 12:05:11,",2967409,"
1,DB,07-FEB-18 00:00:00,6262138,25B,25B_271,2,30517,26460,32752.0,,BasDef,,,,28-FEB-18 12:05:11,",2580260,"
2,DB,07-FEB-18 00:00:00,6254942,45A,45A_70,2,35512,32100,36329.0,32082.0,BasDef,,,,28-FEB-18 12:05:11,",2448968,"
3,DB,07-FEB-18 00:00:00,6259460,25A,25A_273,1,57261,54420,58463.0,54443.0,BasDef,,,,28-FEB-18 12:05:11,",3094242,"
4,DB,07-FEB-18 00:00:00,6253175,14,14_15,1,85383,81600,84682.0,81608.0,BasDef,,,,28-FEB-18 12:05:11,",2526331,"
5,DB,07-FEB-18 00:00:00,6248240,77A,77A_28,2,41648,37200,42019.0,37538.0,BasDef,,,,28-FEB-18 12:05:11,",2966500,"
6,DB,07-FEB-18 00:00:00,6251760,39,39_21,2,34768,28920,35709.0,28929.0,BasDef,,,,28-FEB-18 12:05:11,",2422850,"
7,DB,07-FEB-18 00:00:00,6262909,16,16_20,1,43936,38880,43721.0,38955.0,BasDef,,,,28-FEB-18 12:05:11,",3045261,"
8,DB,07-FEB-18 00:00:00,6251147,40D,40D_102,1,33558,30600,34540.0,31185.0,BasDef,,,,28-FEB-18 12:05:11,",2837123,"
9,DB,07-FEB-18 00:00:00,6249435,27B,27B_34,2,52682,49800,53281.0,49974.0,BasDef,,,,28-FEB-18 12:05:11,",2329338,"


In [5]:
trips.tail(50)

Unnamed: 0,DATASOURCE,DAYOFSERVICE,TRIPID,LINEID,ROUTEID,DIRECTION,PLANNEDTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_ARR,ACTUALTIME_DEP,BASIN,TENDERLOT,SUPPRESSED,JUSTIFICATIONID,LASTUPDATE,NOTE
2182587,DB,14-MAY-18 00:00:00,6759988,13,13_60,1,36878,30600,37027.0,30675.0,BasDef,,,,26-JUN-18 09:13:13,",2585326,"
2182588,DB,14-MAY-18 00:00:00,6760080,140,140_19,1,65691,61800,65742.0,61903.0,BasDef,,,,26-JUN-18 09:13:13,",3107607,"
2182589,DB,14-MAY-18 00:00:00,6760086,111,111_8,2,47924,45000,47856.0,45022.0,BasDef,,,,26-JUN-18 09:13:13,",2799029,"
2182590,DB,14-MAY-18 00:00:00,6760100,123,123_34,1,37796,34080,38199.0,34015.0,BasDef,,,,26-JUN-18 09:13:13,",3224378,"
2182591,DB,14-MAY-18 00:00:00,6760154,76,76_37,2,32300,29400,32323.0,29403.0,BasDef,,,,26-JUN-18 09:13:13,",1866120,"
2182592,DB,14-MAY-18 00:00:00,6763776,33,33_72,2,44067,38700,44405.0,,BasDef,,,,26-JUN-18 09:13:13,",2887911,"
2182593,DB,14-MAY-18 00:00:00,6763777,33,33_45,1,50207,45000,50026.0,44986.0,BasDef,,,,26-JUN-18 09:13:13,",2887912,"
2182594,DB,14-MAY-18 00:00:00,6763797,84,84_31,2,45325,40200,45657.0,40330.0,BasDef,,,,26-JUN-18 09:13:13,",2817661,"
2182595,DB,14-MAY-18 00:00:00,6763806,151,151_15,1,28267,24600,28932.0,24609.0,BasDef,,,,26-JUN-18 09:13:13,",2602899,"
2182596,DB,14-MAY-18 00:00:00,6763831,151,151_16,1,72032,68400,71940.0,68426.0,BasDef,,,,26-JUN-18 09:13:13,",2604009,"


In [6]:
# Columns and rows
trips.shape

(2182637, 16)

In [7]:
trips.dtypes

DATASOURCE          object
DAYOFSERVICE        object
TRIPID               int64
LINEID              object
ROUTEID             object
DIRECTION            int64
PLANNEDTIME_ARR      int64
PLANNEDTIME_DEP      int64
ACTUALTIME_ARR     float64
ACTUALTIME_DEP     float64
BASIN               object
TENDERLOT          float64
SUPPRESSED         float64
JUSTIFICATIONID    float64
LASTUPDATE          object
NOTE                object
dtype: object

In [8]:
trips.isnull().sum()

DATASOURCE               0
DAYOFSERVICE             0
TRIPID                   0
LINEID                   0
ROUTEID                  0
DIRECTION                0
PLANNEDTIME_ARR          0
PLANNEDTIME_DEP          0
ACTUALTIME_ARR      137207
ACTUALTIME_DEP      164551
BASIN                    0
TENDERLOT          2182637
SUPPRESSED         2178304
JUSTIFICATIONID    2178307
LASTUPDATE               0
NOTE                     0
dtype: int64

In [9]:
trips.nunique()

DATASOURCE              1
DAYOFSERVICE          360
TRIPID             658964
LINEID                130
ROUTEID               588
DIRECTION               2
PLANNEDTIME_ARR     64461
PLANNEDTIME_DEP       791
ACTUALTIME_ARR      68122
ACTUALTIME_DEP      66771
BASIN                   1
TENDERLOT               0
SUPPRESSED              1
JUSTIFICATIONID      3526
LASTUPDATE            360
NOTE                46690
dtype: int64

In [10]:
# trips_columns = list(trips.columns.values)

In [11]:
# Separating features
categorical_columns = ['DATASOURCE','TRIPID','LINEID','ROUTEID','DIRECTION','BASIN','TENDERLOT','JUSTIFICATIONID','NOTE','SUPPRESSED']
time_columns = ['PLANNEDTIME_ARR','PLANNEDTIME_DEP','ACTUALTIME_ARR','ACTUALTIME_DEP']
datetime = ['DAYOFSERVICE','LASTUPDATE']

# The seconds in time_columns are converted with pd.to_timedelta in a later cell,
# once the dtypes have been checked. The name `datetime` must keep pointing at the
# column list above, so the datetime module is not imported here.

In [12]:
# Converting objects into categorical types
for column in categorical_columns:
    trips[column] = trips[column].astype('category')

# Converting objects into datetime series
for column2 in datetime:
    trips[column2] = pd.to_datetime(trips[column2])

In [13]:
trips.dtypes

DATASOURCE               category
DAYOFSERVICE       datetime64[ns]
TRIPID                   category
LINEID                   category
ROUTEID                  category
DIRECTION                category
PLANNEDTIME_ARR             int64
PLANNEDTIME_DEP             int64
ACTUALTIME_ARR            float64
ACTUALTIME_DEP            float64
BASIN                    category
TENDERLOT                category
SUPPRESSED               category
JUSTIFICATIONID          category
LASTUPDATE         datetime64[ns]
NOTE                     category
dtype: object

In [None]:
# Converting the time columns from seconds into times. The days part should be ignored.
for column3 in time_columns:
    # pd.to_timedelta works on the whole column at once, so no row-wise apply is needed
    trips[column3] = pd.to_timedelta(trips[column3], unit='s')
    trips[column3] = trips[column3] - pd.to_timedelta(trips[column3].dt.days, unit='d')

In [None]:
trips.head(10)

In [None]:
trips_sorted = trips.sort_values(by='TRIPID')
trips_sorted.head(50)

In [None]:
trip77a = trips_sorted.loc[trips_sorted['LINEID']=='77A']
# Build the date mask from trip77a itself rather than from the full trips_sorted frame
trip77a = trip77a.loc[trip77a['DAYOFSERVICE']=='2018-01-03']
trip77a = trip77a.sort_values(by='PLANNEDTIME_ARR')
trip77a.head(50)

## Note about this specific route
Here we took line 77A on a single date (January 3, 2018) and sorted it by PLANNEDTIME_ARR. It has a total of 101 rows. The first few PLANNEDTIME_ARR values may belong to the last trips that ran that day: arrivals after midnight lose their days part in the conversion above, so they sort to the top. In any case, the table holds every trip of a line for one day with all of its times, so it is a matter of matching it with the leavetimes dataset. This will be explored at the very end.
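
The sorting effect can be reproduced on a few values. This is a small sketch rather than the notebook's actual data flow: the three arrival times are sample values in seconds, and the last one lies past the 24-hour mark.

```python
import pandas as pd

# Sample planned arrivals in seconds; 87245 is after midnight (greater than 86400)
arr = pd.to_timedelta(pd.Series([30517, 57261, 87245]), unit="s")

# Same normalisation as in the conversion cell: drop the days component
clock = arr - pd.to_timedelta(arr.dt.days, unit="D")

print(clock.sort_values())  # the after-midnight arrival now sorts first
print(arr.sort_values())    # the untouched durations keep service-day order
```

If service-day order matters for a later step, sorting on the original seconds (or on the timedelta before the days are removed) avoids the problem.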

## Duplicates
Checking whether the trips dataset has any duplicates. There shouldn't be any, since each row signifies a single trip.

We see there aren't any duplicates, which is a good sign for the data.

In [None]:
duplicates = trips[trips.duplicated(keep=False)]
print("Number of rows that are duplicates: ", duplicates.shape[0])
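
The check above only flags rows that are identical in every column. A stricter variant looks for repeated keys. The sketch below assumes that a trip is identified by the pair (DAYOFSERVICE, TRIPID), which is a guess based on TRIPID having far fewer unique values than there are rows; the toy frame is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "DAYOFSERVICE": ["2018-02-07", "2018-02-07", "2018-02-08"],
    "TRIPID": [6253783, 6253783, 6253783],
    "LASTUPDATE": ["2018-02-28", "2018-03-01", "2018-02-28"],
})

# Rows identical in every column (what the notebook cell checks)
full_row_dupes = toy[toy.duplicated(keep=False)]

# Rows sharing the assumed key, even if other columns differ
key_dupes = toy[toy.duplicated(subset=["DAYOFSERVICE", "TRIPID"], keep=False)]

print(len(full_row_dupes), len(key_dupes))
```

Here the first two rows share a key but differ in LASTUPDATE, so only the key-based check reports them.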

## Checking descriptive columns

In [None]:
trips[datetime].describe(datetime_is_numeric=True).T

In [None]:
trips[time_columns].describe().T

In [None]:
trips[categorical_columns].describe().T

Cardinality of the categorical features, looking for those with <= 1 unique values:
* DATASOURCE has 1 unique value.
* TRIPID has more than 600k unique values.
* LINEID has 130 unique values.
* ROUTEID has 588 unique values.
* DIRECTION has 2 unique values.
* BASIN has 1 unique value.
* TENDERLOT has no unique values, meaning that it is entirely null.
* JUSTIFICATIONID has 3526 unique values across 4330 non-null rows.
* NOTE has more than 46k unique values.
* SUPPRESSED has 1 unique value across 4333 non-null rows.

**RESULT :** As we can see, DATASOURCE, BASIN, TENDERLOT and SUPPRESSED have 1 or 0 unique values. The same value (or no value at all) runs through the whole dataset, so these will be dropped.
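
The same selection can be made programmatically instead of by reading the describe table. A minimal sketch on a hypothetical toy frame that mimics four of the columns:

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    "DATASOURCE": ["DB", "DB", "DB"],
    "LINEID": ["68", "25B", "45A"],
    "TENDERLOT": [np.nan, np.nan, np.nan],
    "SUPPRESSED": [np.nan, 0.0, np.nan],
})

# nunique() ignores NaN by default, so an all-null column counts as 0
n_unique = toy.nunique()
low_cardinality = n_unique[n_unique <= 1].index.tolist()

reduced = toy.drop(columns=low_cardinality)
print(low_cardinality)
print(reduced.columns.tolist())
```

One caveat: a column with a single non-null value plus many nulls (like SUPPRESSED) is not strictly constant, since the null/non-null split can still carry information, so it is worth a look before dropping.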

## Logical Integrity Tests

List of integrity tests to be performed:
* Make sure LASTUPDATE doesn't occur before DAYOFSERVICE

#### TEST 1: Ensure LASTUPDATE doesn't occur before DAYOFSERVICE

In [None]:
test1 = trips[trips['DAYOFSERVICE'] > trips['LASTUPDATE']]
print("Number of rows where LASTUPDATE is earlier than DAYOFSERVICE: ", test1.shape[0])

#### TEST 2: Ensure timetables match the specific route with PLANNEDTIME_DEP
This is to check that the timetables match with all of the routes. We take a look at PLANNEDTIME_DEP to check this. 

In [None]:
# Need to get a list of all of the routes and their timetables, then compare it with the routes 
# and their planned departure times

# Make a dataframe with just LINEID and PLANNEDTIME_DEP and make a list of all of the routes, order them
data = {'LINEID': trips['LINEID'], 'PLANNEDTIME_DEP': trips['PLANNEDTIME_DEP']}
line_dep = pd.DataFrame(data)
list_routes = list(line_dep.LINEID.unique())
list_routes = sorted(list_routes)

In [None]:
# on_bad_lines='skip' (pandas >= 1.3) replaces the deprecated error_bad_lines=False
trips_gtfs = pd.read_csv('/home/faye/Data-Analytics-CityRoute/Dublin_Bus_GTFS/17-07-2021/stop_times.txt', sep=',', on_bad_lines='skip')

In [None]:
trips_gtfs.head(20)

### Process to get the times:
* Kept only the rows with stop_sequence = 1
* Reset the index (not strictly needed, because my previous plan was scrapped)
* Made three lists: service, route, and direction. Service indicates whether it's a weekday or weekend service (see calendar.txt for more info), route is the route number, and direction is inbound or outbound (I/O)
* Looped through the trip_id values
    * split the value by `.`
    * split element 2 of that split (counting from 0) by `-`
    * added element 1 of the first split to service
    * added element 1 of the second split to route
    * added element 4 of the first split to direction

**Extra info:**
* I parsed this from the 2021 dataset. Some route_id values won't match the 2018 data, because routes such as H1, H2 and H3 are old routes under new names. I was told the names were changed because the new company that owns Dublin Bus wants to change the system again, but that this was halted. So, for example, H1 used to be 29A (I think).
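
The parsing steps above can also be written without an explicit loop. This is only a sketch: the two trip_id strings are hypothetical, shaped so that the indices used in the notebook's loop (1, 2 and 4) pick out the service, route and direction.

```python
import pandas as pd

# Hypothetical trip_id values with five dot-separated parts
trip_ids = pd.Series(["1001.y1003.60-13-d12-1.18.I", "1002.y1004.60-H1-d12-1.240.O"])

# First split on "." into columns 0..4
parts = trip_ids.str.split(".", expand=True)

parsed = pd.DataFrame({
    "service_id": parts[1],
    # Second split on "-"; element 1 is the route number
    "route_id": parts[2].str.split("-").str[1],
    "direction": parts[4],
})
print(parsed)
```

If some trip_id values have fewer parts than expected, the missing pieces come back as NaN or None instead of raising an error, which makes malformed ids easy to find afterwards.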

In [None]:
# Store the stop_sequence = 1 first
trips_1 = trips_gtfs.loc[trips_gtfs['stop_sequence']==1]

In [None]:
# Resetting the index for this new dataframe
trips_1 = trips_1.reset_index(drop=True)

In [None]:
trips_1.head(5)

In [None]:
# Making a list for calendar, route number, inbound/outbound
serviceid = []
routeid = []
direction = []

for value in trips_1.trip_id:
    id_ = value.split(".")
    id_2 = id_[2].split("-")
    serviceid += [id_[1]]
    routeid += [id_2[1]]
    direction += [id_[4]]

In [None]:
trips_1['service_id'] = np.array(serviceid)
trips_1['route_id'] = np.array(routeid)
trips_1['direction'] = np.array(direction)

In [None]:
trips_2 = trips_1
# trips_duplicate = trips_2[trips_2.duplicated(keep=False)]

In [None]:
trips_2 = pd.DataFrame({'departure_time': trips_2['departure_time'], 'service_id': trips_2['service_id'], 
                       'route_id': trips_2['route_id'], 'stop_headsign': trips_2['stop_headsign'], 
                       'direction': trips_2['direction']})

In [None]:
trips_2[trips_2.duplicated(keep=False)]
trips_2 = trips_2.drop_duplicates()

In [None]:
trips_2

In [None]:
trips_2.loc[trips_2['route_id']=='H1'].head(50)

<br><br>
I decided against going through with this logical integrity test. Because some of the routes have been replaced, we can't match them side by side. After looking at this website: https://www.transportforireland.ie/h-spine/ , it became apparent that the departure times will be different: the timetables changed along with the route names, so the test isn't possible. I will, however, replace H1 with 29A, H2 with 32, and H3 with 31/31A. It also looks like H9 is a completely new route.

I will save the new dataset for anyone who wants to use it. This will be done in the data quality plan. 
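
A minimal sketch of the renaming, assuming the mapping listed above. H3 covers two old routes, so it is kept as a combined label here; how to split it is left open. The variable names and the sample route list are hypothetical.

```python
import pandas as pd

# Mapping taken from the note above; "31/31A" is a placeholder label, not a real LINEID
h_spine_to_2018 = {"H1": "29A", "H2": "32", "H3": "31/31A"}

gtfs_routes = pd.Series(["H1", "H2", "H3", "H9", "13"], name="route_id")

# Values not in the mapping (H9, 13) are left unchanged
legacy_routes = gtfs_routes.replace(h_spine_to_2018)
print(legacy_routes.tolist())
```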
<br><br>

## Vehicles
*Exploring the data*
<br><br>

In [None]:
vehicles.head(10)

In [None]:
vehicles.tail(10)

In [None]:
vehicles.shape

In [None]:
vehicles.dtypes

In [None]:
vehicles.isnull().sum()

In [None]:
vehicles.nunique()

In [None]:
# Creating columns for different datatypes
categorical_column = ['DATASOURCE', 'VEHICLEID']
time_columns = ['MINUTES']
numerical_columns = ['DISTANCE']
datetime_columns = ['DAYOFSERVICE', 'LASTUPDATE']

In [None]:
# Converting objects into categorical types
for column in categorical_column:
    vehicles[column] = vehicles[column].astype('category')

# Converting objects into datetime series
for column2 in datetime_columns:
    vehicles[column2] = pd.to_datetime(vehicles[column2])

# Converting the time columns from seconds into times. The days part should be ignored.
for column3 in time_columns:
    # pd.to_timedelta works on the whole column at once, so no row-wise apply is needed
    vehicles[column3] = pd.to_timedelta(vehicles[column3], unit='s')
    vehicles[column3] = vehicles[column3] - pd.to_timedelta(vehicles[column3].dt.days, unit='d')

In [None]:
vehicles.head(10)

In [None]:
vehicles.dtypes

## Duplicates
Checking whether the vehicles dataset has any duplicates. There shouldn't be any: each row represents one vehicle on one day of service.

Again, there are no duplicates in this vehicles dataset.

In [None]:
duplicates = vehicles[vehicles.duplicated(keep=False)]
print("Number of rows that are duplicates: ", duplicates.shape[0])

## Checking descriptive columns

In [None]:
vehicles[datetime_columns].describe(datetime_is_numeric=True).T

In [None]:
vehicles[categorical_column].describe().T

In [None]:
vehicles[time_columns].describe().T

In [None]:
vehicles[numerical_columns].describe().T

We see a couple of features with a cardinality of <= 1:
* DATASOURCE has 1 unique value.
* NOTE has 0 unique values, and 100% of its values are missing.

**RESULT :** These two features will be investigated further, but both will most likely be dropped: a constant feature tells the model nothing about the target, and NOTE is entirely missing.

## Logical Integrity Tests

List of integrity tests to be performed:
* Make sure LASTUPDATE doesn't occur before DAYOFSERVICE
* Ensure that the dates match with the missing dates from the trips table

**Test 1: Ensuring LASTUPDATE doesn't occur before DAYOFSERVICE.**

As we can see, no LASTUPDATE falls before its DAYOFSERVICE. This is a good indication, and no alterations are needed.

In [None]:
test1 = vehicles[vehicles['DAYOFSERVICE'] > vehicles['LASTUPDATE']]
print("Number of rows where LASTUPDATE is earlier than DAYOFSERVICE: ", test1.shape[0])

**Test 2: Ensure that the dates match with the missing dates from the trips table.** 

As we can see, the dates that are missing from the vehicles table match with the missing dates from the trips table. This is a good sign. Therefore, no further actions are to be done.


In [None]:
daysofservice_vec = vehicles['DAYOFSERVICE']
daysofservice_vec = daysofservice_vec.sort_values()
pd.date_range(start = '2018-01-01', end = '2018-12-31' ).difference(daysofservice_vec)

In [None]:
# Same check on the trips table, for comparison
pd.date_range(start = '2018-01-01', end = '2018-12-31' ).difference(trips['DAYOFSERVICE'])
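
Rather than comparing the two outputs by eye, the gaps can be compared directly. A small sketch under stated assumptions: the calendar is shortened to one week and the day lists are hypothetical.

```python
import pandas as pd

calendar = pd.date_range(start="2018-01-01", end="2018-01-07")  # shortened for the example

# Hypothetical days of service present in each table (vehicles repeats one day)
trips_days = pd.to_datetime(pd.Series(
    ["2018-01-01", "2018-01-02", "2018-01-04", "2018-01-05", "2018-01-07"]))
vehicles_days = pd.to_datetime(pd.Series(
    ["2018-01-01", "2018-01-02", "2018-01-02", "2018-01-04", "2018-01-05", "2018-01-07"]))

missing_trips = calendar.difference(pd.DatetimeIndex(trips_days.unique()))
missing_vehicles = calendar.difference(pd.DatetimeIndex(vehicles_days.unique()))

# True when both tables are missing exactly the same dates
same_gaps = missing_trips.equals(missing_vehicles)

# Dates missing from one table but not the other (empty when the gaps agree)
only_in_one = missing_trips.symmetric_difference(missing_vehicles)
print(same_gaps, list(only_in_one))
```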

In [None]:
vehicles['DISTANCE'].sum()

An interesting observation here. I'm not sure if this was talked about before, but the total distance is surprisingly high. The documentation states that DISTANCE is in km or miles, but the sum is far too high for either. If the values are treated as metres and the total is converted to km, it comes to around 48 million km, which is close to the 2019 figure Brian mentioned in the presentation. Worth discussing, I feel.

# Data Quality Report 

### Trips

**Feature  - &emsp;Issue &emsp; - Solution**
<br>
* DATASOURCE - constant column - remove feature
<br>
* DAYOFSERVICE - 5 of 365 days missing - investigate if the missing 5 days are holidays
<br>
* TRIPID -  None detected - no action needed
<br>
* LINEID - None detected - no action needed
<br>
* ROUTEID - None detected - no action needed
<br>
* DIRECTION - None detected - no action needed
<br>
* PLANNEDTIME_DEP - None detected - no action needed
<br>
* PLANNEDTIME_ARR - None detected - no action needed
<br>
* ACTUALTIME_DEP - about 7.5% missing values - investigate and find average and impute
<br>
* ACTUALTIME_ARR - about 6.3% missing values - investigate and find average and impute
<br>
* BASIN - constant column - remove feature
<br>
* TENDERLOT - 100% missing values - investigate and remove feature
<br>
* SUPPRESSED - 99% missing values - investigate and remove feature
<br>
* JUSTIFICATIONID - 99% missing values - investigate and remove feature
<br>
* LASTUPDATE - None detected - no action needed
<br>
* NOTE - None detected - investigate
<br>
* Trips from GTFS feed - nonexisting routes on old timetable - insert new features and finalize dataset
<br>
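
The missing-value percentages in this report can be recomputed from the counts printed by `trips.isnull().sum()` and `trips.shape` earlier in the notebook. A short sketch using those printed numbers:

```python
import pandas as pd

n_rows = 2182637  # from trips.shape

# Null counts copied from the isnull().sum() output
null_counts = pd.Series({
    "ACTUALTIME_ARR": 137207,
    "ACTUALTIME_DEP": 164551,
    "TENDERLOT": 2182637,
    "SUPPRESSED": 2178304,
    "JUSTIFICATIONID": 2178307,
})

pct_missing = (null_counts / n_rows * 100).round(2)
print(pct_missing)
```

With these counts, ACTUALTIME_ARR comes out at about 6.3% missing and ACTUALTIME_DEP at about 7.5%. On the live dataframe the same figures are available as `trips.isnull().mean() * 100`.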

# Data Quality Plan 

### Trips

**DATASOURCE:** The feature only has one unique value which is DB. Since we know that the source is constant throughout, we will remove this feature from the dataframe.

In [None]:
trips.pop('DATASOURCE')

In [None]:
trips.head(5)

**DAYOFSERVICE:** We see that there are 5 days that aren't accounted for. Let us see which dates weren't included.

The dates that were not included:
* March 1 2018
* March 2 2018
* December 9 2018
* December 10 2018
* December 25 2018

This just means that there were no trips planned for these days. Will it affect the prediction model? Probably not.

In [None]:
daysofservice = trips['DAYOFSERVICE']
daysofservice = daysofservice.sort_values()

In [None]:
#daysofservice.unique()

In [None]:
pd.date_range(start = '2018-01-01', end = '2018-12-31' ).difference(daysofservice)

**ACTUALTIME_DEP:** We will investigate the roughly 7.5% of values that are missing (164,551 of 2,182,637 rows). First, we make a dataframe containing only the rows where ACTUALTIME_DEP is missing.

These missing values need to be dealt with. To do that, the leavetimes table is loaded with an SQL query: "For every TRIPID, load the row with the maximum PROGRNUMBER."

In the first 60 rows it looks unusual that the trips show only one PROGRNUMBER, as if it were a mistake. However, after investigating, some trips only had one stop and there were no further records.

Then, we make a separate dataframe holding only the leavetimes rows for those TRIPIDs, and replace the missing ACTUALTIME_DEP in trips with the ACTUALTIME_DEP from leavetimes. The two dataframes should line up once both are sorted by TRIPID.
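
The fill described above can be sketched on toy data. This is only an illustration under stated assumptions: the frames `trips_toy` and `leave_toy` and the helper column `LEAVETIMES_DEP` are hypothetical, and a trip is assumed to be identified by the pair (DAYOFSERVICE, TRIPID), since the same TRIPID appears on many days of service.

```python
import pandas as pd
import numpy as np

trips_toy = pd.DataFrame({
    "DAYOFSERVICE": pd.to_datetime(["2018-01-01", "2018-01-01", "2018-01-02"]),
    "TRIPID": [1, 2, 1],
    "ACTUALTIME_DEP": [30000.0, np.nan, np.nan],
})
leave_toy = pd.DataFrame({
    "DAYOFSERVICE": pd.to_datetime(["2018-01-01"] * 3 + ["2018-01-02"] * 2),
    "TRIPID": [2, 2, 2, 1, 1],
    "PROGRNUMBER": [1, 2, 3, 1, 2],
    "ACTUALTIME_DEP": [40000, 40100, 40250, 50000, 50090],
})

keys = ["DAYOFSERVICE", "TRIPID"]

# Row with the highest PROGRNUMBER for each (day, trip) pair
last_stop = (
    leave_toy.sort_values("PROGRNUMBER")
    .groupby(keys)
    .tail(1)[keys + ["ACTUALTIME_DEP"]]
    .rename(columns={"ACTUALTIME_DEP": "LEAVETIMES_DEP"})
)

# Left merge keeps every trip; only the missing departures are filled
filled = trips_toy.merge(last_stop, on=keys, how="left")
filled["ACTUALTIME_DEP"] = filled["ACTUALTIME_DEP"].fillna(filled["LEAVETIMES_DEP"])
filled = filled.drop(columns="LEAVETIMES_DEP")
print(filled)
```

Taking `head(1)` after the sort instead of `tail(1)` would pick the first recorded stop rather than the last; which of the two better represents a trip's departure is worth checking against the data.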

In [None]:
actual_dep = trips[trips['ACTUALTIME_DEP'].isna()]
tripday = [[day for day in actual_dep.DAYOFSERVICE], [id_ for id_ in actual_dep.TRIPID]]

In [None]:
# Converting to timedelta and dropping the days part
for col in ['PLANNEDTIME_ARR', 'PLANNEDTIME_DEP', 'ACTUALTIME_DEP']:
    leavetimes[col] = pd.to_timedelta(leavetimes[col], unit='s')
    leavetimes[col] = leavetimes[col] - pd.to_timedelta(leavetimes[col].dt.days, unit='d')

In [None]:
leavetimes.shape

In [None]:
# Making an empty dataframe
leavetrips = pd.DataFrame()
# df.append(leavetimes[leavetimes['TRIPID'].isin(tripday[1])])

In [None]:
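leave_times = leave_times[(leave_times['TRIPID'].isin(tripday[1]))]
leave_times = leave_times.sort_values(by='PROGRNUMBER')
# Last recorded stop (highest PROGRNUMBER) of each trip
leave = leave_times.groupby(['TRIPID']).tail(1)
# DataFrame.append returns a new frame rather than modifying in place, so assign the result
leavetrips = pd.concat([leavetrips, leave])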

In [None]:
actual_dep = actual_dep.sort_values(by='TRIPID')
leavetrips = leavetrips.sort_values(by='TRIPID')

In [None]:
actual_dep.head(5)

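rows = []
for day, trip_id in zip(tripday[0], tripday[1]):
    match = leavetimes.loc[(leavetimes['DAYOFSERVICE']==day) & (leavetimes['TRIPID']==trip_id)]
    if not match.empty:  # skip trips with no leavetimes record
        rows.append(match.sort_values(by='PROGRNUMBER').iloc[[-1]])
df = pd.concat(rows) if rows else pd.DataFrame()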

leavetimes.loc[(leavetimes['DAYOFSERVICE']==tripday[0][0]) & (leavetimes['TRIPID']==tripday[1][0])].sort_values(by='PROGRNUMBER').iloc[[-1]]

**ACTUALTIME_ARR - TO DO**

**BASIN:** This was considered a constant column and so this feature will be dropped. After investigating it from data understanding, having a constant column would not affect the prediction model as it only has one value throughout. Therefore, it will be dropped.

In [None]:
trips.pop('BASIN')

In [None]:
trips.head(1)

**TENDERLOT:** Since 100% of TENDERLOT values are missing, it will be dropped. Trying to relate the documentation's "tender lot" to business terms, it may indicate whether or not the trip is run as a paid-for service. If that is the case, it makes sense that it's hidden for privacy. It doesn't relate much to the project.

In [None]:
trips.pop('TENDERLOT')
trips.head(1)

**SUPPRESSED:** 99% of SUPPRESSED values are missing, and it doesn't look like it gives any meaningful data. I'm not sure why a trip has to be suppressed. From the documentation, it looks like the NaN values just mean the trip is partially suppressed, which may also explain why TENDERLOT is 100% missing (personal information). Like TENDERLOT, it doesn't relate much to the project, so it will be dropped.

In [None]:
trips.pop('SUPPRESSED')
trips.head(1)

**JUSTIFICATIONID:** From the documentation, JUSTIFICATIONID looks like a key that links to another table, one we are not given, that holds some kind of fault code. Since the majority of its values are missing, the feature is probably not relevant for the model and will be dropped.

In [None]:
trips.pop('JUSTIFICATIONID')
trips.head(1)

**NOTE:** I will investigate the meaning of the numbers

In [None]:
trips.dtypes

In [None]:
# Number of distinct NOTE values (the conversion of NOTE to int64 is still to do)
trips['NOTE'].nunique()
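
One possible way to turn NOTE into integers, sketched on a few values. The first two strings are copied from the trips preview; the third is a hypothetical entry with two numbers, included to show what happens when a value does not hold a single integer. Whether such entries exist in the real column has not been checked.

```python
import pandas as pd

note = pd.Series([",2967409,", ",2580260,", ",2448968,2448969,"])

# Remove the leading and trailing commas
stripped = note.str.strip(",")

# Anything that is not a single number becomes missing instead of raising
note_id = pd.to_numeric(stripped, errors="coerce").astype("Int64")
print(note_id)
```

The nullable `Int64` dtype is used instead of `int64` so that entries that fail to parse can stay missing rather than forcing the column to float.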