# Data Quality Preparation
In this section I will be cleaning and preparing the vehicles and trips tables for the prediction model. It is divided into two parts, as the two tables are cleaned and prepared separately.

#### Introduction:

### Trip Dataset
Each row represents one trip (route).
#### Understanding the features:
* DATASOURCE: Unique Bus Operator Code
* DAYOFSERVICE: Day of service. One day of service could last more than 24 hours
* TRIPID: Unique Trip code
* LINEID: Unique Line code
* ROUTEID: Unique route code
* DIRECTION: Route direction: (2)IB = inbound / going / northbound / eastbound, (1)OB = outbound / back / southbound / westbound
* PLANNEDTIME_ARR: Planned arrival time of the trip, in seconds
* PLANNEDTIME_DEP: Planned departure time of the trip, in seconds
* ACTUALTIME_ARR: Actual arrival time of the trip, in seconds
* ACTUALTIME_DEP: Actual departure time of the trip, in seconds
* BASIN: basin code
* TENDERLOT: tender lot
* SUPPRESSED: The whole trip has been suppressed (0 = achieved, 1 = suppressed)
* JUSTIFICATIONID: Fault code
* LASTUPDATE: Time of the last record update 
* NOTE: Free note
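
Because one day of service can run past midnight, the planned and actual times can exceed 86400 seconds. The following is a minimal sketch of turning such values into a time of day. The first two values are PLANNEDTIME_ARR values from the sample rows shown further down, the third is a missing value, and the helper name `seconds_to_time_of_day` is hypothetical:

```python
import numpy as np
import pandas as pd

SECONDS_PER_DAY = 24 * 60 * 60

def seconds_to_time_of_day(seconds: pd.Series) -> pd.Series:
    """Convert seconds since the start of the service day to a time of day.

    Values of 86400 or more (trips that run past midnight) wrap around,
    and missing values become NaT.
    """
    return pd.to_timedelta(seconds % SECONDS_PER_DAY, unit='s')

example = pd.Series([87245, 30517, np.nan])
time_of_day = seconds_to_time_of_day(example)
print(time_of_day)
```

Taking the remainder before the conversion keeps the whole step vectorised, which matters on a table with more than two million rows.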

### Vehicles Dataset
Each record of this table represents the service of one vehicle for one day of service and tells about
the overall distance and time worked by the associated vehicle in that specific day.

#### Understanding the features:
* DATASOURCE: Unique Bus Operator Code
* DAYOFSERVICE: Day of service
* VEHICLEID: Unique vehicle code arriving at this stop point
* DISTANCE: Distance travelled by the vehicle in the corresponding day
* MINUTES: Time worked by the vehicle in the corresponding day
* LASTUPDATE: Time of the last record update
* NOTE: Free note

In [1]:
# Importing the modules that might be used
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import dask.dataframe as dd
import sqlite3
from sqlalchemy import create_engine
from pprint import pprint

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Loading in the data
trips = pd.read_csv('/home/faye/data/rt_trips_DB_2018.txt', sep=';',error_bad_lines=False)
vehicles = pd.read_csv('/home/faye/data/rt_vehicles_DB_2018.txt', sep=';',error_bad_lines=False)

## Trips 
*Exploring the data*
<br><br>

In [3]:
trips.head(50)

Unnamed: 0,DATASOURCE,DAYOFSERVICE,TRIPID,LINEID,ROUTEID,DIRECTION,PLANNEDTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_ARR,ACTUALTIME_DEP,BASIN,TENDERLOT,SUPPRESSED,JUSTIFICATIONID,LASTUPDATE,NOTE
0,DB,07-FEB-18 00:00:00,6253783,68,68_80,1,87245,84600,87524.0,84600.0,BasDef,,,,28-FEB-18 12:05:11,",2967409,"
1,DB,07-FEB-18 00:00:00,6262138,25B,25B_271,2,30517,26460,32752.0,,BasDef,,,,28-FEB-18 12:05:11,",2580260,"
2,DB,07-FEB-18 00:00:00,6254942,45A,45A_70,2,35512,32100,36329.0,32082.0,BasDef,,,,28-FEB-18 12:05:11,",2448968,"
3,DB,07-FEB-18 00:00:00,6259460,25A,25A_273,1,57261,54420,58463.0,54443.0,BasDef,,,,28-FEB-18 12:05:11,",3094242,"
4,DB,07-FEB-18 00:00:00,6253175,14,14_15,1,85383,81600,84682.0,81608.0,BasDef,,,,28-FEB-18 12:05:11,",2526331,"
5,DB,07-FEB-18 00:00:00,6248240,77A,77A_28,2,41648,37200,42019.0,37538.0,BasDef,,,,28-FEB-18 12:05:11,",2966500,"
6,DB,07-FEB-18 00:00:00,6251760,39,39_21,2,34768,28920,35709.0,28929.0,BasDef,,,,28-FEB-18 12:05:11,",2422850,"
7,DB,07-FEB-18 00:00:00,6262909,16,16_20,1,43936,38880,43721.0,38955.0,BasDef,,,,28-FEB-18 12:05:11,",3045261,"
8,DB,07-FEB-18 00:00:00,6251147,40D,40D_102,1,33558,30600,34540.0,31185.0,BasDef,,,,28-FEB-18 12:05:11,",2837123,"
9,DB,07-FEB-18 00:00:00,6249435,27B,27B_34,2,52682,49800,53281.0,49974.0,BasDef,,,,28-FEB-18 12:05:11,",2329338,"


In [4]:
trips.tail(50)

Unnamed: 0,DATASOURCE,DAYOFSERVICE,TRIPID,LINEID,ROUTEID,DIRECTION,PLANNEDTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_ARR,ACTUALTIME_DEP,BASIN,TENDERLOT,SUPPRESSED,JUSTIFICATIONID,LASTUPDATE,NOTE
2182587,DB,14-MAY-18 00:00:00,6759988,13,13_60,1,36878,30600,37027.0,30675.0,BasDef,,,,26-JUN-18 09:13:13,",2585326,"
2182588,DB,14-MAY-18 00:00:00,6760080,140,140_19,1,65691,61800,65742.0,61903.0,BasDef,,,,26-JUN-18 09:13:13,",3107607,"
2182589,DB,14-MAY-18 00:00:00,6760086,111,111_8,2,47924,45000,47856.0,45022.0,BasDef,,,,26-JUN-18 09:13:13,",2799029,"
2182590,DB,14-MAY-18 00:00:00,6760100,123,123_34,1,37796,34080,38199.0,34015.0,BasDef,,,,26-JUN-18 09:13:13,",3224378,"
2182591,DB,14-MAY-18 00:00:00,6760154,76,76_37,2,32300,29400,32323.0,29403.0,BasDef,,,,26-JUN-18 09:13:13,",1866120,"
2182592,DB,14-MAY-18 00:00:00,6763776,33,33_72,2,44067,38700,44405.0,,BasDef,,,,26-JUN-18 09:13:13,",2887911,"
2182593,DB,14-MAY-18 00:00:00,6763777,33,33_45,1,50207,45000,50026.0,44986.0,BasDef,,,,26-JUN-18 09:13:13,",2887912,"
2182594,DB,14-MAY-18 00:00:00,6763797,84,84_31,2,45325,40200,45657.0,40330.0,BasDef,,,,26-JUN-18 09:13:13,",2817661,"
2182595,DB,14-MAY-18 00:00:00,6763806,151,151_15,1,28267,24600,28932.0,24609.0,BasDef,,,,26-JUN-18 09:13:13,",2602899,"
2182596,DB,14-MAY-18 00:00:00,6763831,151,151_16,1,72032,68400,71940.0,68426.0,BasDef,,,,26-JUN-18 09:13:13,",2604009,"


In [5]:
# Rows and columns
trips.shape

(2182637, 16)

In [6]:
trips.dtypes

DATASOURCE          object
DAYOFSERVICE        object
TRIPID               int64
LINEID              object
ROUTEID             object
DIRECTION            int64
PLANNEDTIME_ARR      int64
PLANNEDTIME_DEP      int64
ACTUALTIME_ARR     float64
ACTUALTIME_DEP     float64
BASIN               object
TENDERLOT          float64
SUPPRESSED         float64
JUSTIFICATIONID    float64
LASTUPDATE          object
NOTE                object
dtype: object

In [7]:
trips.isnull().sum()

DATASOURCE               0
DAYOFSERVICE             0
TRIPID                   0
LINEID                   0
ROUTEID                  0
DIRECTION                0
PLANNEDTIME_ARR          0
PLANNEDTIME_DEP          0
ACTUALTIME_ARR      137207
ACTUALTIME_DEP      164551
BASIN                    0
TENDERLOT          2182637
SUPPRESSED         2178304
JUSTIFICATIONID    2178307
LASTUPDATE               0
NOTE                     0
dtype: int64

In [8]:
trips.nunique()

DATASOURCE              1
DAYOFSERVICE          360
TRIPID             658964
LINEID                130
ROUTEID               588
DIRECTION               2
PLANNEDTIME_ARR     64461
PLANNEDTIME_DEP       791
ACTUALTIME_ARR      68122
ACTUALTIME_DEP      66771
BASIN                   1
TENDERLOT               0
SUPPRESSED              1
JUSTIFICATIONID      3526
LASTUPDATE            360
NOTE                46690
dtype: int64

In [9]:
# trips_columns = list(trips.columns.values)

In [10]:
# Separating features
categorical_columns = ['DATASOURCE','TRIPID','LINEID','ROUTEID','DIRECTION','BASIN','TENDERLOT','JUSTIFICATIONID','NOTE','SUPPRESSED']
time_columns = ['PLANNEDTIME_ARR','PLANNEDTIME_DEP','ACTUALTIME_ARR','ACTUALTIME_DEP']
datetime = ['DAYOFSERVICE','LASTUPDATE']

# NOTE: the seconds-to-time conversion is done in a later cell with pd.to_timedelta.
# A row-by-row replace() loop built on `import datetime` is avoided here: the import
# would shadow the `datetime` list above, and the loop is far too slow on ~2.18 million rows.

In [11]:
# Converting objects into categorical types
for column in categorical_columns:
    trips[column] = trips[column].astype('category')

# Converting objects into datetime series
for column2 in datetime:
    trips[column2] = pd.to_datetime(trips[column2])

In [12]:
trips.dtypes

DATASOURCE               category
DAYOFSERVICE       datetime64[ns]
TRIPID                   category
LINEID                   category
ROUTEID                  category
DIRECTION                category
PLANNEDTIME_ARR             int64
PLANNEDTIME_DEP             int64
ACTUALTIME_ARR            float64
ACTUALTIME_DEP            float64
BASIN                    category
TENDERLOT                category
SUPPRESSED               category
JUSTIFICATIONID          category
LASTUPDATE         datetime64[ns]
NOTE                     category
dtype: object

In [13]:
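# Converting the time columns from seconds into times. The days part should be ignored.
for column3 in time_columns:
    # Vectorised conversion of the whole column (missing values become NaT)
    trips[column3] = pd.to_timedelta(trips[column3], unit='s')
    # Drop the days part so that only the time of day remains
    trips[column3] = trips[column3] - pd.to_timedelta(trips[column3].dt.days, unit='d')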

In [14]:
trips.head(10)

Unnamed: 0,DATASOURCE,DAYOFSERVICE,TRIPID,LINEID,ROUTEID,DIRECTION,PLANNEDTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_ARR,ACTUALTIME_DEP,BASIN,TENDERLOT,SUPPRESSED,JUSTIFICATIONID,LASTUPDATE,NOTE
0,DB,2018-02-07,6253783,68,68_80,1,0 days 00:14:05,0 days 23:30:00,0 days 00:18:44,0 days 23:30:00,BasDef,,,,2018-02-28 12:05:11,",2967409,"
1,DB,2018-02-07,6262138,25B,25B_271,2,0 days 08:28:37,0 days 07:21:00,0 days 09:05:52,NaT,BasDef,,,,2018-02-28 12:05:11,",2580260,"
2,DB,2018-02-07,6254942,45A,45A_70,2,0 days 09:51:52,0 days 08:55:00,0 days 10:05:29,0 days 08:54:42,BasDef,,,,2018-02-28 12:05:11,",2448968,"
3,DB,2018-02-07,6259460,25A,25A_273,1,0 days 15:54:21,0 days 15:07:00,0 days 16:14:23,0 days 15:07:23,BasDef,,,,2018-02-28 12:05:11,",3094242,"
4,DB,2018-02-07,6253175,14,14_15,1,0 days 23:43:03,0 days 22:40:00,0 days 23:31:22,0 days 22:40:08,BasDef,,,,2018-02-28 12:05:11,",2526331,"
5,DB,2018-02-07,6248240,77A,77A_28,2,0 days 11:34:08,0 days 10:20:00,0 days 11:40:19,0 days 10:25:38,BasDef,,,,2018-02-28 12:05:11,",2966500,"
6,DB,2018-02-07,6251760,39,39_21,2,0 days 09:39:28,0 days 08:02:00,0 days 09:55:09,0 days 08:02:09,BasDef,,,,2018-02-28 12:05:11,",2422850,"
7,DB,2018-02-07,6262909,16,16_20,1,0 days 12:12:16,0 days 10:48:00,0 days 12:08:41,0 days 10:49:15,BasDef,,,,2018-02-28 12:05:11,",3045261,"
8,DB,2018-02-07,6251147,40D,40D_102,1,0 days 09:19:18,0 days 08:30:00,0 days 09:35:40,0 days 08:39:45,BasDef,,,,2018-02-28 12:05:11,",2837123,"
9,DB,2018-02-07,6249435,27B,27B_34,2,0 days 14:38:02,0 days 13:50:00,0 days 14:48:01,0 days 13:52:54,BasDef,,,,2018-02-28 12:05:11,",2329338,"


In [22]:
trips_sorted = trips.sort_values(by='TRIPID')
trips_sorted.head(50)

Unnamed: 0,DATASOURCE,DAYOFSERVICE,TRIPID,LINEID,ROUTEID,DIRECTION,PLANNEDTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_ARR,ACTUALTIME_DEP,BASIN,TENDERLOT,SUPPRESSED,JUSTIFICATIONID,LASTUPDATE,NOTE
681660,DB,2018-01-06,5955221,14,14_15,1,0 days 11:49:13,0 days 10:30:00,0 days 11:41:18,0 days 10:30:36,BasDef,,,,2018-01-15 08:25:25,",2513511,"
723197,DB,2018-01-06,5955222,14,14_16,2,0 days 13:17:47,0 days 12:00:00,0 days 13:15:07,0 days 11:59:42,BasDef,,,,2018-01-15 08:25:25,",2513514,"
684398,DB,2018-01-06,5955223,14,14_15,1,0 days 15:01:45,0 days 13:40:00,0 days 15:02:08,0 days 13:39:31,BasDef,,,,2018-01-15 08:25:25,",2513515,"
719581,DB,2018-01-06,5955224,14,14_16,2,0 days 16:41:34,0 days 15:20:00,0 days 16:40:25,0 days 15:20:03,BasDef,,,,2018-01-15 08:25:25,",2429419,"
674864,DB,2018-01-06,5955225,14,14_15,1,0 days 18:15:51,0 days 17:00:00,0 days 18:08:52,0 days 16:59:21,BasDef,,,,2018-01-15 08:25:25,",2513518,"
2064218,DB,2018-01-03,5955226,77A,77A_29,1,0 days 08:18:00,0 days 07:20:00,0 days 08:33:31,0 days 07:20:07,BasDef,,,,2018-01-11 18:46:25,",2958050,"
2158096,DB,2018-01-02,5955226,77A,77A_29,1,0 days 08:18:00,0 days 07:20:00,0 days 08:18:23,0 days 07:19:45,BasDef,,,,2018-01-09 20:18:05,",2958050,"
231424,DB,2018-01-05,5955226,77A,77A_29,1,0 days 08:18:00,0 days 07:20:00,0 days 08:29:39,0 days 07:20:32,BasDef,,,,2018-01-12 20:27:15,",2958050,"
737493,DB,2018-01-04,5955226,77A,77A_29,1,0 days 08:18:00,0 days 07:20:00,0 days 08:28:26,0 days 07:19:54,BasDef,,,,2018-01-11 18:55:15,",2958050,"
225052,DB,2018-01-05,5955227,77A,77A_28,2,0 days 10:32:05,0 days 09:00:00,0 days 10:11:17,0 days 08:59:53,BasDef,,,,2018-01-12 20:27:15,",2432285,"


In [34]:
trip77a = trips_sorted.loc[trips_sorted['LINEID']=='77A']
trip77a = trip77a.loc[trip77a['DAYOFSERVICE']=='2018-01-03']
trip77a = trip77a.sort_values(by='PLANNEDTIME_ARR')
trip77a.head(50)

Unnamed: 0,DATASOURCE,DAYOFSERVICE,TRIPID,LINEID,ROUTEID,DIRECTION,PLANNEDTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_ARR,ACTUALTIME_DEP,BASIN,TENDERLOT,SUPPRESSED,JUSTIFICATIONID,LASTUPDATE,NOTE
694969,DB,2018-01-03,5965215,77A,77A_29,1,0 days 00:04:24,0 days 23:00:00,NaT,0 days 22:58:27,BasDef,,,,2018-01-11 18:46:25,",2960125,"
2059759,DB,2018-01-03,5969376,77A,77A_28,2,0 days 00:25:09,0 days 23:30:00,0 days 00:14:56,0 days 23:29:40,BasDef,,,,2018-01-11 18:46:25,",2965215,"
692364,DB,2018-01-03,5964114,77A,77A_29,1,0 days 00:29:24,0 days 23:25:00,NaT,0 days 23:25:04,BasDef,,,,2018-01-11 18:46:25,",2959823,"
707213,DB,2018-01-03,5957107,77A,77A_29,1,0 days 06:38:00,0 days 05:40:00,0 days 06:33:55,0 days 05:44:10,BasDef,,,,2018-01-11 18:46:25,",2966449,"
714026,DB,2018-01-03,5959625,77A,77A_29,1,0 days 06:58:00,0 days 06:00:00,0 days 06:54:40,0 days 05:59:25,BasDef,,,,2018-01-11 18:46:25,",2969650,"
703209,DB,2018-01-03,5956817,77A,77A_29,1,0 days 07:18:00,0 days 06:20:00,0 days 07:08:01,0 days 06:19:26,BasDef,,,,2018-01-11 18:46:25,",2964882,"
2062834,DB,2018-01-03,5964103,77A,77A_28,2,0 days 07:30:26,0 days 06:00:00,0 days 07:05:14,0 days 06:00:09,BasDef,,,,2018-01-11 18:46:25,",2959799,"
697807,DB,2018-01-03,5961549,77A,77A_29,1,0 days 07:38:00,0 days 06:40:00,0 days 07:47:24,0 days 06:40:07,BasDef,,,,2018-01-11 18:46:25,",2962842,"
2049037,DB,2018-01-03,5971472,77A,77A_28,2,0 days 07:50:26,0 days 06:20:00,0 days 07:27:45,0 days 06:20:03,BasDef,,,,2018-01-11 18:46:25,",2959038,"
701813,DB,2018-01-03,5957040,77A,77A_29,1,0 days 07:58:00,0 days 07:00:00,0 days 07:57:58,0 days 06:58:59,BasDef,,,,2018-01-11 18:46:25,",2964410,"


## Note about this specific route
Here we took line 77A on a single date (January 3, 2018) and sorted it by PLANNEDTIME_ARR. It has a total of 101 rows. The first few PLANNEDTIME_ARR values fall just after midnight while their PLANNEDTIME_DEP values are around 23:00, so these are probably the last trips of that service day: they arrive after midnight and, because the days part was dropped, they sort to the top. In any case, we see that the data covers each route for one day with all of its times. It is then a matter of matching it with the leavetimes dataset. This will be explored at the very end.
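
One way to test the guess above is to flag trips whose planned arrival, once the days part is dropped, is earlier than the planned departure. Those would be the trips that cross midnight. This is a minimal sketch on three rows copied from the 77A output; the column name `CROSSES_MIDNIGHT` is hypothetical and not part of the dataset:

```python
import pandas as pd

sample = pd.DataFrame({
    'TRIPID': [5965215, 5969376, 5957107],
    'PLANNEDTIME_DEP': pd.to_timedelta(['23:00:00', '23:30:00', '05:40:00']),
    'PLANNEDTIME_ARR': pd.to_timedelta(['00:04:24', '00:25:09', '06:38:00']),
})

# A wrapped arrival earlier than the departure suggests the trip crossed midnight.
sample['CROSSES_MIDNIGHT'] = sample['PLANNEDTIME_ARR'] < sample['PLANNEDTIME_DEP']

# Sorting by departure instead keeps the late-evening trips at the end of the day.
print(sample.sort_values(by='PLANNEDTIME_DEP'))
```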

## Duplicates
Exploring whether the trips dataset has any duplicates. There shouldn't be any, since each row signifies a single trip.

We see there aren't any duplicates so this is a good sign for the data. 

In [17]:
duplicates = trips[trips.duplicated(keep=False)]
print("Number of rows that are duplicates: ", duplicates.shape[0])

Number of rows that are duplicates:  0
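
`duplicated()` with no arguments only catches rows that are identical in every column. Since the sorted output above shows the same TRIPID on several days, a stricter check could look for repeats of the (DAYOFSERVICE, TRIPID) pair. This is a sketch on made-up rows, and treating that pair as the unique key is an assumption rather than something the dataset description states:

```python
import pandas as pd

# Made-up rows: TRIPID 5955226 runs on two different days, which is allowed,
# while the last two rows share the same (DAYOFSERVICE, TRIPID) pair.
sample = pd.DataFrame({
    'DAYOFSERVICE': pd.to_datetime(['2018-01-02', '2018-01-03', '2018-01-03', '2018-01-03']),
    'TRIPID': [5955226, 5955226, 5955227, 5955227],
    'ACTUALTIME_ARR': [29903.0, 30811.0, 36677.0, 36700.0],
})

key = ['DAYOFSERVICE', 'TRIPID']  # assumed unique key
key_duplicates = sample[sample.duplicated(subset=key, keep=False)]
print("Rows sharing a (DAYOFSERVICE, TRIPID) pair:", key_duplicates.shape[0])
```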


## Checking descriptive columns

In [18]:
trips[datetime].describe(datetime_is_numeric=True).T

Unnamed: 0,count,mean,min,25%,50%,75%,max
DAYOFSERVICE,2182637,2018-06-28 17:25:38.969786880,2018-01-01 00:00:00,2018-03-31 00:00:00,2018-06-28 00:00:00,2018-09-26 00:00:00,2018-12-31 00:00:00
LASTUPDATE,2182637,2018-07-16 10:25:44.996667392,2018-01-08 17:21:10,2018-04-10 15:08:12,2018-07-13 19:47:00,2018-10-23 06:57:21,2019-01-16 18:27:21


In [19]:
trips[time_columns].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
PLANNEDTIME_ARR,2182637,0 days 14:58:33.083917298,0 days 05:01:58.847667725,0 days 00:00:00,0 days 10:47:14,0 days 15:08:10,0 days 18:58:40,0 days 23:59:59
PLANNEDTIME_DEP,2182637,0 days 14:18:14.341037927,0 days 04:53:08.417924110,0 days 00:05:00,0 days 10:00:00,0 days 14:20:00,0 days 18:05:00,0 days 23:50:00
ACTUALTIME_ARR,2045430,0 days 15:00:19.765848745,0 days 04:59:23.886687638,0 days 00:00:00,0 days 10:48:42,0 days 15:07:59,0 days 19:00:00,0 days 23:59:59
ACTUALTIME_DEP,2018086,0 days 14:21:03.389364972,0 days 04:52:25.239888791,0 days 00:00:07,0 days 10:07:17.250000,0 days 14:20:09,0 days 18:09:41,0 days 23:59:58


In [20]:
trips[categorical_columns].describe().T

Unnamed: 0,count,unique,top,freq
DATASOURCE,2182637,1,DB,2182637.0
TRIPID,2182637,658964,7324685,19.0
LINEID,2182637,130,46A,76728.0
ROUTEID,2182637,588,46A_74,37182.0
DIRECTION,2182637,2,2,1100273.0
BASIN,2182637,1,BasDef,2182637.0
TENDERLOT,0,0,,
JUSTIFICATIONID,4330,3526,203708,19.0
NOTE,2182637,46690,",2326480,",244.0
SUPPRESSED,4333,1,0,4333.0


The cardinalities of the categorical features are:
* DATASOURCE has 1 unique value.
* TRIPID has more than 600k unique values.
* LINEID has 130 unique values.
* ROUTEID has 588 unique values.
* DIRECTION has 2 unique values.
* BASIN has 1 unique value.
* TENDERLOT has no unique values, meaning that it is entirely null.
* JUSTIFICATIONID has 3526 unique values (4330 non-null rows).
* NOTE has more than 46k unique values.
* SUPPRESSED has 1 unique value (4333 non-null rows).

**RESULT :** As we can see, DATASOURCE, BASIN, TENDERLOT and SUPPRESSED have 1 or 0 unique values. Each of them holds a single value (or only nulls) throughout the dataset, so they carry no information for the model. Therefore, these will be dropped.
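
A minimal sketch of how such columns could be found and dropped automatically instead of being listed by hand. The helper name `constant_columns` and the small sample frame are illustrative only:

```python
import numpy as np
import pandas as pd

def constant_columns(df: pd.DataFrame) -> list:
    """Return the columns with at most one distinct non-null value."""
    counts = df.nunique()  # nunique() ignores NaN by default
    return counts[counts <= 1].index.tolist()

sample = pd.DataFrame({
    'DATASOURCE': ['DB', 'DB', 'DB'],
    'LINEID': ['68', '25B', '45A'],
    'TENDERLOT': [np.nan, np.nan, np.nan],
    'SUPPRESSED': [0.0, np.nan, np.nan],
})

to_drop = constant_columns(sample)
cleaned = sample.drop(columns=to_drop)
print(to_drop)
```

Note that a column such as SUPPRESSED, which mixes one value with nulls, is also caught by this rule, so it is worth checking whether the nulls themselves carry meaning before dropping it.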

## Logical Integrity Tests

List of integrity tests to be performed:
* Make sure LASTUPDATE dates don't occur before DAYOFSERVICE
* Ensure that the timetables match the specific route and their PLANNEDTIME_DEP
* Ensure that the ROUTEID has the matching route number in LINEID
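
The third test can be sketched by comparing LINEID with the part of ROUTEID before the underscore. This assumes every ROUTEID follows the `<LINEID>_<number>` pattern seen in the sample rows (for example `25B_271`). The last row below is deliberately made up to show a mismatch:

```python
import pandas as pd

sample = pd.DataFrame({
    'LINEID': ['68', '25B', '45A', '14'],
    'ROUTEID': ['68_80', '25B_271', '45A_70', '15_15'],  # last row is invented
})

# Take everything before the first underscore and compare it with LINEID.
# astype(str) is used so the comparison also works on categorical columns.
route_prefix = sample['ROUTEID'].astype(str).str.split('_').str[0]
mismatches = sample[route_prefix != sample['LINEID'].astype(str)]
print("Rows where ROUTEID does not match LINEID:", mismatches.shape[0])
```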

#### TEST 1: Ensure LASTUPDATE doesn't occur before DAYOFSERVICE

In [40]:
test1 = trips[trips['DAYOFSERVICE'] > trips['LASTUPDATE']]
print("Number of rows where LASTUPDATE is earlier than DAYOFSERVICE: ", test1.shape[0])

Number of rows where LASTUPDATE is earlier than DAYOFSERVICE:  0


#### TEST 2: Ensure timetables match the specific route with PLANNEDTIME_DEP
This is to check that the timetables match with all of the routes. We take a look at PLANNEDTIME_DEP to check this. 
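
A minimal sketch of the comparison itself, assuming the timetable for one line is already available as a list of `'0 days HH:MM:SS'` strings. The timetable values and the planned departures below are made up for illustration:

```python
import pandas as pd

# Hypothetical timetable for one line, as '0 days HH:MM:SS' strings.
timetable = pd.to_timedelta(['0 days 08:20:00', '0 days 09:20:00', '0 days 10:20:00'])

# Hypothetical planned departures for the same line, in seconds.
planned_dep = pd.Series(pd.to_timedelta([30000, 33600, 37200, 40000], unit='s'))

# Departures that do not appear anywhere in the timetable.
unmatched = planned_dep[~planned_dep.isin(timetable)]
print(unmatched)
```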

In [46]:
# Need to get a list of all of the routes and their timetables, then compare it with the routes 
# and their planned departure times

# Make a dataframe with just LINEID and PLANNEDTIME_DEP and make a list of all of the routes, order them
data = {'LINEID': trips['LINEID'], 'PLANNEDTIME_DEP': trips['PLANNEDTIME_DEP']}
line_dep = pd.DataFrame(data)
list_routes = list(line_dep.LINEID.unique())
list_routes = sorted(list_routes)


In [73]:
!pip3 install beautifulsoup4


* Connect to the Dublin Bus website
* Get the URL of the page that shows the timetable for one route
* Loop through list_routes and append each route to the end of the URL
* Get the following:
    * times
    * day
* Search for the day by the 'strong' tag
* Search for the times by 'div class="time"'
* Store the time values in a dataframe:
    * Weekday (Monday - Friday)
    * Saturday
    * Sunday
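
The parsing step in the plan above can be sketched offline on a small HTML fragment. The fragment is invented, so the real page structure may differ, and the standard library's `HTMLParser` is used here instead of BeautifulSoup so that the example runs without extra packages. The class name `TimeDivParser` is hypothetical:

```python
from datetime import timedelta
from html.parser import HTMLParser

class TimeDivParser(HTMLParser):
    """Collect the text of every <div class="time"> element."""

    def __init__(self):
        super().__init__()
        self.in_time_div = False
        self.times = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div' and dict(attrs).get('class') == 'time':
            self.in_time_div = True

    def handle_endtag(self, tag):
        if tag == 'div':
            self.in_time_div = False

    def handle_data(self, data):
        if self.in_time_div and data.strip():
            self.times.append(data.strip())

# Invented fragment; the real timetable page may be structured differently.
html = ('<strong>Monday to Friday</strong>'
        '<div class="time">08:20</div><div class="time">09:20</div>')

parser = TimeDivParser()
parser.feed(html)

# Turn each 'HH:MM' string into a timedelta since midnight.
departures = []
for text in parser.times:
    hours, minutes = text.split(':')
    departures.append(timedelta(hours=int(hours), minutes=int(minutes)))
print(departures)
```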

In [127]:
# Scrape the timetable page for one route and collect the times listed on it
import requests
from bs4 import BeautifulSoup

# URL of the timetable page for a single route
url = 'https://www.dublinbus.ie/Your-Journey1/Timetables/All-Timetables/25/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
find = soup.find(id='ctl00_FullRegion_MainRegion_MainContentRegion_MainBodyRegion_Timetable1_ctl02_DataList1_ctl00_dlTimeTop')

# Need to find a way to get only the integers and put them in a list
times_weekday = soup.find_all("div", {"class": "time"})

In [147]:
# Let's convert times_weekday into a list of strings
times_weekday = [str(value) for value in times_weekday]

# Then, using regex, get the integers, build a string version of a timedelta and put it in a list
import re
weekdays = []
for value in times_weekday:
    parse = re.findall(r'\d+', value)  # raw string avoids an invalid escape warning
    weekdays.append('0 days ' + parse[0] + ':' + parse[1] + ':00')

In [153]:
# These are not all weekdays
weekdays

['0 days 08:20:00',
 '0 days 09:20:00',
 '0 days 10:20:00',
 '0 days 11:20:00',
 '0 days 12:20:00',
 '0 days 13:20:00',
 '0 days 14:20:00',
 '0 days 15:20:00',
 '0 days 16:20:00',
 '0 days 16:50:00',
 '0 days 17:20:00',
 '0 days 18:30:00',
 '0 days 19:30:00',
 '0 days 20:30:00',
 '0 days 21:30:00',
 '0 days 22:30:00',
 '0 days 23:20:00',
 '0 days 07:05:00',
 '0 days 08:05:00',
 '0 days 09:05:00',
 '0 days 10:05:00',
 '0 days 11:05:00',
 '0 days 12:05:00',
 '0 days 13:05:00',
 '0 days 14:05:00',
 '0 days 15:05:00',
 '0 days 16:05:00',
 '0 days 17:05:00',
 '0 days 18:05:00',
 '0 days 19:05:00',
 '0 days 20:05:00',
 '0 days 21:05:00',
 '0 days 22:05:00',
 '0 days 23:05:00',
 '0 days 11:40:00',
 '0 days 13:40:00',
 '0 days 15:40:00',
 '0 days 17:40:00',
 '0 days 19:40:00',
 '0 days 21:40:00',
 '0 days 23:25:00',
 '0 days 06:40:00',
 '0 days 06:55:00',
 '0 days 07:10:00',
 '0 days 08:05:00',
 '0 days 09:20:00',
 '0 days 10:20:00',
 '0 days 11:20:00',
 '0 days 12:20:00',
 '0 days 13:20:00',


#### Converting seconds into times and into datetime series
trips1 = trips
trips2 = trips1['PLANNEDTIME_ARR']
trips2 = trips2.apply(pd.to_timedelta, unit='s')
trips2 = trips2 - pd.to_timedelta(trips2.dt.days, unit='d')
trips2.head()