# Data Quality Preparation
In this section I will clean and prepare the vehicles and trips tables for the prediction model. The section is divided into two parts because the two tables are cleaned and prepared separately.

#### Introduction:

### Trip Dataset
Each row represents one trip (route).
#### Understanding the features:
* DATASOURCE: Unique Bus Operator Code
* DAYOFSERVICE: Day of service. One day of service could last more than 24 hours
* TRIPID: Unique Trip code
* LINEID: Unique Line code
* ROUTEID: Unique route code
* DIRECTION: Route direction: (2)IB = inbound / going / northbound / eastbound, (1)OB = outbound / back / southbound / westbound
* PLANNEDTIME_ARR: Planned arrival time of the trip, in seconds
* PLANNEDTIME_DEP: Planned departure time of the trip, in seconds
* ACTUALTIME_ARR: Actual arrival time of the trip, in seconds
* ACTUALTIME_DEP: Actual departure time of the trip, in seconds
* BASIN: basin code
* TENDERLOT: tender lot
* SUPPRESSED: The whole trip has been suppressed (0 = achieved, 1 = suppressed)
* JUSTIFICATIONID: Fault code
* LASTUPDATE: Time of the last record update 
* NOTE: Free note

### Vehicles Dataset
Each record of this table represents the service of one vehicle for one day of service and tells about
the overall distance and time worked by the associated vehicle in that specific day.

#### Understanding the features:
* DATASOURCE: Unique Bus Operator Code
* DAYOFSERVICE: Day of service
* VEHICLEID: Unique vehicle code arriving at this stop point
* DISTANCE: Distance travelled by the vehicle in the corresponding day
* MINUTES: Time worked by the vehicle in the corresponding day
* LASTUPDATE: Time of the last record update
* NOTE: Free note

In [1]:
# Importing the modules that might be used
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import dask.dataframe as dd
import sqlite3
from sqlalchemy import create_engine
from pprint import pprint

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
leavetimes = pd.read_feather('/home/faye/data/leavetimes.feather')

leavetimes = leavetimes.sort_values(by='PROGRNUMBER')

In [None]:
last_lines = leavetimes.groupby('TRIPID').apply(pd.DataFrame.sort_values, 'PROGRNUMBER')

In [None]:
last_lines

In [None]:
leavetimes.iloc[[9804909]]

In [None]:
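# Loading in the data; malformed lines are skipped with a warning
# (on_bad_lines replaces the deprecated error_bad_lines argument, pandas >= 1.3)
trips = pd.read_csv('/home/faye/data/rt_trips_DB_2018.txt', sep=';', on_bad_lines='warn')
vehicles = pd.read_csv('/home/faye/data/rt_vehicles_DB_2018.txt', sep=';', on_bad_lines='warn')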

## Trips 
*Exploring the data*
<br><br>

In [None]:
trips.head(50)

In [None]:
trips.tail(50)

In [None]:
# Columns and rows
trips.shape

In [None]:
trips.dtypes

In [None]:
trips.isnull().sum()

In [None]:
trips.nunique()

In [None]:
# trips_columns = list(trips.columns.values)

In [None]:
# Separating features by type
categorical_columns = ['DATASOURCE','TRIPID','LINEID','ROUTEID','DIRECTION','BASIN','TENDERLOT','JUSTIFICATIONID','NOTE','SUPPRESSED']
time_columns = ['PLANNEDTIME_ARR','PLANNEDTIME_DEP','ACTUALTIME_ARR','ACTUALTIME_DEP']
datetime = ['DAYOFSERVICE','LASTUPDATE']

# Previewing the seconds-to-time conversion on the first rows of each time column.
# The columns themselves stay numeric here; they are converted in a later cell.
# The standard-library datetime module is deliberately not imported, because
# `import datetime` would overwrite the `datetime` list defined above.
for column in time_columns:
    print(pd.to_timedelta(trips[column].head(), unit='s'))

In [None]:
# Converting objects into categorical types
for column in categorical_columns:
    trips[column] = trips[column].astype('category')

# Converting objects into datetime series
for column2 in datetime:
    trips[column2] = pd.to_datetime(trips[column2])

In [None]:
trips.dtypes

In [None]:
# Converting the time columns from seconds into times. The days part should be ignored.
for column3 in time_columns:
    trips[column3] = pd.to_timedelta(trips[column3], unit='s')
    trips[column3] = trips[column3] - pd.to_timedelta(trips[column3].dt.days, unit='d')

In [None]:
trips.head(10)

In [None]:
trips_sorted = trips.sort_values(by='TRIPID')
trips_sorted.head(50)

In [None]:
# Filter line 77A on a single day of service, then order by planned arrival
trip77a = trips_sorted.loc[trips_sorted['LINEID']=='77A']
trip77a = trip77a.loc[trip77a['DAYOFSERVICE']=='2018-01-03']
trip77a = trip77a.sort_values(by='PLANNEDTIME_ARR')
trip77a.head(50)

## Note about this specific route
Here we took line 77A on a single date (January 3, 2018) and sorted it by PLANNEDTIME_ARR. It has a total of 101 rows. The first few values of PLANNEDTIME_ARR may belong to the last trips that ran that day and simply sorted to the top. In any case, we see that the table provides, for each route on a given day, all of the trip times. It is then a matter of matching it with the leavetimes dataset, which will be explored at the very end.
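
A minimal sketch of one possible explanation, using made-up values rather than the real table: if the planned times count seconds from the start of the service day, a trip after midnight has a value above 86400, and once the day component is dropped it wraps around to an early clock time that sorts first.

```python
import pandas as pd

# Hypothetical planned arrival times, in seconds since the start of the service day
seconds = pd.Series([23400, 61200, 88200], name="PLANNEDTIME_ARR")

as_timedelta = pd.to_timedelta(seconds, unit="s")
# Drop the day component, as is done for the time columns in this notebook
clock_time = as_timedelta - pd.to_timedelta(as_timedelta.dt.days, unit="D")

result = pd.DataFrame({"seconds": seconds, "clock_time": clock_time})
# The trip with the largest number of seconds ends up at the top
print(result.sort_values(by="clock_time"))
```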

## Duplicates
Checking whether the trips dataset has any duplicates. There shouldn't be any, since each trip signifies only one route.

We see there aren't any duplicates so this is a good sign for the data. 

In [None]:
duplicates = trips[trips.duplicated(keep=False)]
print("Number of rows that are duplicates: ", duplicates.shape[0])

## Checking descriptive columns

In [None]:
trips[datetime].describe(datetime_is_numeric=True).T

In [None]:
trips[time_columns].describe().T

In [None]:
trips[categorical_columns].describe().T

We see there are a couple of features with a cardinality of 1 or less. The number of unique values per feature:
* DATASOURCE has 1 unique value.
* TRIPID has more than 600k unique values.
* LINEID has 130 unique values.
* ROUTEID has 588 unique values.
* DIRECTION has 2 unique values.
* BASIN has 1 unique value.
* TENDERLOT has no unique values meaning that it is just null.
* JUSTIFICATIONID has 4330 unique values.
* NOTE has more than 46k unique values.
* SUPPRESSED has 4333 unique values.

**RESULT:** As we can see, DATASOURCE, BASIN, TENDERLOT and SUPPRESSED have 1 or 0 unique values. This means the same value is used throughout the dataset. Therefore, these will be dropped.
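
As a rough sketch, low-cardinality features can also be found programmatically instead of being read off the table. The frame below is a made-up stand-in for trips, and the names toy, constant_columns, and reduced are only for illustration.

```python
import pandas as pd

# Toy frame standing in for the trips table (values are made up)
toy = pd.DataFrame({
    "DATASOURCE": ["DB", "DB", "DB"],
    "LINEID": ["77A", "46A", "77A"],
    "BASIN": ["BasDef", "BasDef", "BasDef"],
    "TENDERLOT": [None, None, None],
})

# nunique() ignores missing values, so an all-null column counts as 0
cardinality = toy.nunique()
constant_columns = cardinality[cardinality <= 1].index.tolist()

# Drop every column that carries at most one distinct value
reduced = toy.drop(columns=constant_columns)
print(constant_columns)
print(reduced.columns.tolist())
```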

## Logical Integrity Tests

List of integrity tests to be performed:
* Make sure LASTUPDATE dates don't occur before DAYOFSERVICE

#### TEST 1: Ensure LASTUPDATE doesn't occur before DAYOFSERVICE

In [None]:
test1 = trips[trips['DAYOFSERVICE'] > trips['LASTUPDATE']]
print("Number of rows where LASTUPDATE occurs before DAYOFSERVICE: ", test1.shape[0])

#### TEST 2: Ensure timetables match the specific route with PLANNEDTIME_DEP
This is to check that the timetables match with all of the routes. We take a look at PLANNEDTIME_DEP to check this. 

In [None]:
# Need to get a list of all of the routes and their timetables, then compare it with the routes 
# and their planned departure times

# Make a dataframe with just LINEID and PLANNEDTIME_DEP and make a list of all of the routes, order them
data = {'LINEID': trips['LINEID'], 'PLANNEDTIME_DEP': trips['PLANNEDTIME_DEP']}
line_dep = pd.DataFrame(data)
list_routes = list(line_dep.LINEID.unique())
list_routes = sorted(list_routes)

In [None]:
trips_gtfs = pd.read_csv('/home/faye/Data-Analytics-CityRoute/Dublin_Bus_GTFS/17-07-2021/stop_times.txt', sep=',', on_bad_lines='warn')

In [None]:
trips_gtfs.head(20)

### Process to get the times:
* Got all of the rows where stop_sequence is 1
* Reset the index (not strictly needed, since my previous plan was scrapped)
* Made three lists: service, route, and direction. Service indicates whether it's a weekday or weekend service (see calendar.txt for more info), route indicates the route number, and direction indicates inbound or outbound (I/O)
* Looped through the trip_id values
    * split the value by .
    * split the element at index 2 by -
    * added the element at index 1 to service
    * added the element at index 1 of the - split to route
    * added the element at index 4 to direction
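
The parsing steps above can be sketched on a couple of trip_id strings. The two IDs below are hypothetical and only assume the dot- and dash-separated layout described in the list; real values in stop_times.txt may differ.

```python
import pandas as pd

# Hypothetical trip_id values following the layout described above
trip_ids = pd.Series(
    ["1381339.3.10-101-e19-1.261.I", "1381340.y1003.60-H1-d12-1.15.O"],
    name="trip_id",
)

# Split on "." first, then split the element at index 2 on "-"
parts = trip_ids.str.split(".", expand=True)
parsed = pd.DataFrame({
    "service_id": parts[1],
    "route_id": parts[2].str.split("-", expand=True)[1],
    "direction": parts[4],
})
print(parsed)
```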
    
**Extra info:**
* I parsed this from the 2021 dataset. Some route_id values won't match the 2018 data because some routes, such as H3, are old routes under a new name. I was told the names were changed because the new company that now owns Dublin Bus wants to change the system again, but the change was halted. So, for example, H1 used to be 29A (I think).

In [None]:
# Store the stop_sequence = 1 first
trips_1 = trips_gtfs.loc[trips_gtfs['stop_sequence']==1]

In [None]:
# Resetting the index for this new dataframe
trips_1 = trips_1.reset_index(drop=True)

In [None]:
trips_1.head(5)

In [None]:
# Making a list for calendar (service), route number, inbound/outbound
serviceid = []
routeid = []
direction = []

for value in trips_1.trip_id:
    id_ = value.split(".")       # dot-separated fields of the trip_id
    id_2 = id_[2].split("-")     # dash-separated fields of the route part
    serviceid.append(id_[1])
    routeid.append(id_2[1])
    direction.append(id_[4])

In [None]:
trips_1['service_id'] = np.array(serviceid)
trips_1['route_id'] = np.array(routeid)
trips_1['direction'] = np.array(direction)

In [None]:
trips_2 = trips_1
# trips_duplicate = trips_2[trips_2.duplicated(keep=False)]

In [None]:
trips_2 = pd.DataFrame({'departure_time': trips_2['departure_time'], 'service_id': trips_2['service_id'], 
                       'route_id': trips_2['route_id'], 'stop_headsign': trips_2['stop_headsign'], 
                       'direction': trips_2['direction']})

In [None]:
trips_2[trips_2.duplicated(keep=False)]
trips_2 = trips_2.drop_duplicates()

In [None]:
trips_2

In [None]:
trips_2.loc[trips_2['route_id']=='H1'].head(50)

<br><br>
I decided against carrying out this logical integrity test. Because some routes have been replaced by new ones, we can't match them side by side. After looking at https://www.transportforireland.ie/h-spine/ , it became apparent that the departure times are different: the timetables were changed, so it isn't possible to perform this test. I will, however, replace H1 with 29A, H2 with 32, and H3 with 31/31A. It also looks like H9 is a completely new route.
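
A minimal sketch of the replacement, assuming a simple one-to-one lookup is enough. The rows are made up, the column route_id_2018 is a hypothetical name, and treating 31/31A as a single label is an assumption.

```python
import pandas as pd

# Replacements taken from the note above; "31/31A" is kept as one label,
# which is an assumption about how H3 should be handled
new_to_old = {"H1": "29A", "H2": "32", "H3": "31/31A"}

# Made-up rows standing in for the parsed GTFS timetable
timetable = pd.DataFrame({
    "route_id": ["H1", "H2", "H3", "H9", "46A"],
    "departure_time": ["06:30:00", "06:45:00", "07:00:00", "07:15:00", "07:30:00"],
})

# Routes without a known predecessor (for example H9) keep their name
timetable["route_id_2018"] = timetable["route_id"].replace(new_to_old)
print(timetable)
```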

I will save the new dataset for anyone who wants to use it. This will be done in the data quality plan. 
<br><br>

## Vehicles
*Exploring the data*
<br><br>

In [None]:
vehicles.head(10)

In [None]:
vehicles.tail(10)

In [None]:
vehicles.shape

In [None]:
vehicles.dtypes

In [None]:
vehicles.isnull().sum()

In [None]:
# Creating columns for different datatypes
categorical_column = ['DATASOURCE', 'VEHICLEID']
time_columns = ['MINUTES']
numerical_columns = ['DISTANCE']
datetime_columns = ['DAYOFSERVICE', 'LASTUPDATE']

In [None]:
# Converting objects into categorical types
for column in categorical_column:
    vehicles[column] = vehicles[column].astype('category')

# Converting objects into datetime series
for column2 in datetime_columns:
    vehicles[column2] = pd.to_datetime(vehicles[column2])

# Converting the time columns from seconds into times. The days part should be ignored.
for column3 in time_columns:
    vehicles[column3] = pd.to_timedelta(vehicles[column3], unit='s')
    vehicles[column3] = vehicles[column3] - pd.to_timedelta(vehicles[column3].dt.days, unit='d')

In [None]:
vehicles.head(10)

In [None]:
vehicles.dtypes

## Duplicates
Checking whether the vehicles dataset has any duplicates. There shouldn't be any, since each record represents one vehicle on one day of service.

Again, there are no duplicates in this vehicles dataset.

In [None]:
duplicates = vehicles[vehicles.duplicated(keep=False)]
print("Number of rows that are duplicates: ", duplicates.shape[0])

## Checking descriptive columns

In [None]:
vehicles[datetime_columns].describe(datetime_is_numeric=True).T

In [None]:
vehicles[categorical_column].describe().T

In [None]:
vehicles[time_columns].describe().T

In [None]:
vehicles[numerical_columns].describe().T

# Data Quality Report 

### Trips

**Feature  - &emsp;Issue &emsp; - Solution**
<br>
* DATASOURCE - constant column - remove feature
<br>
* DAYOFSERVICE - 5 of the 365 days are missing - investigate if the missing 5 days are holidays
<br>
* TRIPID -  None detected - no action needed
<br>
* LINEID - None detected - no action needed
<br>
* ROUTEID - None detected - no action needed
<br>
* DIRECTION - None detected - no action needed
<br>
* PLANNEDTIME_DEP - None detected - no action needed
<br>
* PLANNEDTIME_ARR - None detected - no action needed
<br>
* ACTUALTIME_DEP - 6% missing values - investigate and find average and impute
<br>
* ACTUALTIME_ARR - 7% missing values - investigate and find average and impute
<br>
* BASIN - constant column - remove feature
<br>
* TENDERLOT - 100% missing values - investigate and remove feature
<br>
* SUPPRESSED - 99% missing values - investigate and remove feature
<br>
* JUSTIFICATIONID - 99% missing values - investigate and remove feature
<br>
* LASTUPDATE - None detected - no action needed
<br>
* NOTE - None detected - investigate
<br>
* Trips from GTFS feed - nonexisting routes on old timetable - insert new features and finalize dataset
<br>
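
The missing-value percentages in a list like this can be computed directly. A small sketch on made-up rows (the frame toy is a stand-in, not the real trips table):

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the trips table
toy = pd.DataFrame({
    "TRIPID": [1, 2, 3, 4],
    "ACTUALTIME_DEP": [23400.0, np.nan, 61200.0, 70000.0],
    "TENDERLOT": [np.nan, np.nan, np.nan, np.nan],
})

# Share of missing values per column, expressed as a percentage
missing_pct = toy.isnull().mean() * 100
print(missing_pct.round(1))
```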

# Data Quality Plan 

### Trips

**DATASOURCE:** The feature only has one unique value which is DB. Since we know that the source is constant throughout, we will remove this feature from the dataframe.

In [None]:
trips.pop('DATASOURCE')

In [None]:
trips.head(5)

**DAYOFSERVICE:** We see that there are 5 days that aren't accounted for. Let us see which dates weren't included.

The dates that were not included:
* March 1 2018
* March 2 2018
* December 9 2018
* December 10 2018
* December 25 2018

This just means that there were no trips planned for these days. Will it affect the prediction model? Probably not.

In [None]:
daysofservice = trips['DAYOFSERVICE']
daysofservice = daysofservice.sort_values()

In [None]:
#daysofservice.unique()

In [None]:
pd.date_range(start = '2018-01-01', end = '2018-12-31' ).difference(daysofservice)

**ACTUALTIME_DEP:** We will investigate the 6% of values that are missing. First, we make a dataframe containing only the rows where ACTUALTIME_DEP is missing. Then, for each of those trips, we query the leavetimes table by DAYOFSERVICE and TRIPID, keep the row with the last PROGRNUMBER, and add it to a dataframe.
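
The lookup described above can also be done without a loop. This is a sketch on made-up data, assuming leavetimes has DAYOFSERVICE, TRIPID, and PROGRNUMBER columns as used elsewhere in this notebook; the ACTUALTIME_ARR column and the names leavetimes_toy, missing_dep, last_stops, and matched are illustrative assumptions.

```python
import pandas as pd

# Made-up rows standing in for the leavetimes table
leavetimes_toy = pd.DataFrame({
    "DAYOFSERVICE": pd.to_datetime(["2018-01-03"] * 5),
    "TRIPID": [101, 101, 101, 202, 202],
    "PROGRNUMBER": [2, 1, 3, 1, 2],
    "ACTUALTIME_ARR": [23500, 23400, 23650, 40000, 40120],
})

# Made-up trips whose ACTUALTIME_DEP is missing
missing_dep = pd.DataFrame({
    "DAYOFSERVICE": pd.to_datetime(["2018-01-03", "2018-01-03"]),
    "TRIPID": [101, 202],
})

# Keep only the row with the highest PROGRNUMBER for each (day, trip) pair
last_stops = (
    leavetimes_toy.sort_values(by="PROGRNUMBER")
    .drop_duplicates(subset=["DAYOFSERVICE", "TRIPID"], keep="last")
)

# A single merge replaces the row-by-row lookups
matched = missing_dep.merge(last_stops, on=["DAYOFSERVICE", "TRIPID"], how="left")
print(matched)
```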

In [None]:
actual_dep = trips[trips['ACTUALTIME_DEP'].isna()]
actual_dep

In [None]:
tripday = [[day for day in actual_dep.DAYOFSERVICE], [id_ for id_ in actual_dep.TRIPID]]

In [None]:
columns = list(leavetimes.columns)
df = pd.DataFrame(columns=columns)

In [None]:
leavetimes['DAYOFSERVICE'] = pd.to_datetime(leavetimes['DAYOFSERVICE'])

In [None]:
x = leavetimes.loc[(leavetimes['DAYOFSERVICE']==tripday[0][0]) & (leavetimes['TRIPID']==tripday[1][0])]
x = x.sort_values(by='PROGRNUMBER')
# x.iloc[[-1]]

In [None]:
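# Collect the last stop of each trip; rows are concatenated once because DataFrame.append is deprecated
rows = []
for day, trip_id in zip(tripday[0], tripday[1]):
    match = leavetimes.loc[(leavetimes['DAYOFSERVICE']==day) & (leavetimes['TRIPID']==trip_id)]
    rows.append(match.sort_values(by='PROGRNUMBER').tail(1))
df = pd.concat([df] + rows)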

In [None]:
leavetimes.loc[(leavetimes['DAYOFSERVICE']==tripday[0][0]) & (leavetimes['TRIPID']==tripday[1][0])].sort_values(by='PROGRNUMBER').iloc[[-1]]

In [None]:
n = leavetimes.reset_index()

In [None]:
k = pd.DataFrame(l.tail(1).reset_index(drop=True))

In [None]:
k.tail(50)

In [None]:
leavet