## Cleaning Notebook
- 1.0 Setup Workspace
- 2.0 Cleaning Leavetimes Dataset
- 3.0 Cleaning Trips Dataset
- 4.0 Cleaning Vehicles Dataset
- 5.0 Cleaning Weather Dataset

### 1.0 Setup Workspace

In [1]:
# import required packages
import pandas as pd
import numpy as np
import sqlite3
import pymysql
from datetime import datetime
import time
import pickle

### 2.0 Cleaning Leavetimes Dataset

- The leavetimes dataset, with over 116 million rows, is too large to load all at once, so we load only a small subset for the data understanding section before deciding how to handle it properly

- Trips and vehicles tables are of a manageable size, and can be fully loaded in
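One way to inspect a file of this size without holding it all in memory is to read it in chunks. The following is a minimal sketch, not the approach used in this notebook: the helper name `null_counts_in_chunks`, the chunk size, and the in-memory sample are assumptions made for illustration.

```python
import io
import pandas as pd

def null_counts_in_chunks(source, sep=";", chunksize=1_000_000):
    """Accumulate row count and per-column null counts one chunk at a time."""
    total_rows = 0
    null_counts = None
    for chunk in pd.read_csv(source, sep=sep, chunksize=chunksize):
        total_rows += len(chunk)
        counts = chunk.isnull().sum()
        # Add this chunk's counts to the running totals
        null_counts = counts if null_counts is None else null_counts.add(counts, fill_value=0)
    return total_rows, null_counts.astype("int64")

# Tiny hypothetical stand-in for the real semicolon-separated file
sample = io.StringIO("TRIPID;PASSENGERS\n1;\n2;\n3;\n")
rows, nulls = null_counts_in_chunks(sample, chunksize=2)
print(rows)
print(nulls)
```

For the real file, `source` would be the path passed to `pd.read_csv`, and the chunk size would be tuned to the available memory.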

In [2]:
# Loading in leavetimes dataset
df_leavetimes = pd.read_csv(r'C:\Users\jason\OneDrive - University College Dublin\Documents\MSc Computer Science\Summer Semester\Data\GTFS Data\2018 Data\rt_leavetimes_DB_2018.txt', sep=';')

In [3]:
# # Replacing carriage return with blank values
# df_leavetimes.replace(to_replace=r'[\n\r\t]', value='', regex=True, inplace=True)

In [4]:
# # Replacing blank cells with Nan values
# df_leavetimes.replace("", np.nan, inplace=True)

#### Check for Null Values

In [5]:
df_leavetimes.isnull().sum()

DATASOURCE                 0
DAYOFSERVICE               0
TRIPID                     0
PROGRNUMBER                0
STOPPOINTID                0
PLANNEDTIME_ARR            0
PLANNEDTIME_DEP            0
ACTUALTIME_ARR             0
ACTUALTIME_DEP             0
VEHICLEID                  0
PASSENGERS         116949113
PASSENGERSIN       116949113
PASSENGERSOUT      116949113
DISTANCE           116949113
SUPPRESSED         116360453
JUSTIFICATIONID    116360526
LASTUPDATE                 0
NOTE               116949113
dtype: int64

- There are null values for passengers, passengersin, passengersout, distance, suppressed, justificationid, note.

#### We will now check what % of missing values each column has

In [6]:
# % missing on passengers
passengers_missing = 100 * (df_leavetimes['PASSENGERS'].isnull().sum()/df_leavetimes.shape[0])
passengers_missing

100.0

In [7]:
# % missing on passengersin
passengersin_missing = 100 * (df_leavetimes['PASSENGERSIN'].isnull().sum()/df_leavetimes.shape[0])
passengersin_missing

100.0

In [8]:
# % missing on passengersout
passengersout_missing = 100 * (df_leavetimes['PASSENGERSOUT'].isnull().sum()/df_leavetimes.shape[0])
passengersout_missing

100.0

In [9]:
# % missing on distance
distance_missing = 100 * (df_leavetimes['DISTANCE'].isnull().sum()/df_leavetimes.shape[0])
distance_missing

100.0

In [10]:
# % missing on suppressed
suppressed_missing = 100 * (df_leavetimes['SUPPRESSED'].isnull().sum()/df_leavetimes.shape[0])
suppressed_missing

99.49665287328857

In [11]:
# % missing on justificationid
justificationid_missing = 100 * (df_leavetimes['JUSTIFICATIONID'].isnull().sum()/df_leavetimes.shape[0])
justificationid_missing

99.49671529359954

In [12]:
# % missing on note
note_missing = 100 * (df_leavetimes['NOTE'].isnull().sum()/df_leavetimes.shape[0])
note_missing

100.0
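The per-column cells above can also be condensed into a single expression, since the mean of a boolean null mask is the fraction of missing values. A small sketch, with `missing_percentages` and the demo frame being hypothetical names and data used only for illustration:

```python
import numpy as np
import pandas as pd

def missing_percentages(df):
    """Return the percentage of null values per column, largest first."""
    return (df.isnull().mean() * 100).sort_values(ascending=False)

# Hypothetical miniature of the leavetimes columns
demo = pd.DataFrame({
    "TRIPID": [1, 2, 3, 4],
    "SUPPRESSED": [np.nan, np.nan, np.nan, 0.0],
    "NOTE": [np.nan] * 4,
})
pct = missing_percentages(demo)
print(pct)
```

Calling `missing_percentages(df_leavetimes)` would give all of the percentages in one output.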

#### Checking for duplicate rows

In [13]:
# # Checking duplicate rows
# df_leavetimes.duplicated().sum()

#### Dropping the necessary columns

In [14]:
# Dropping columns

df_leavetimes = df_leavetimes.drop(columns=["DATASOURCE","PASSENGERS", "PASSENGERSIN", "PASSENGERSOUT", "DISTANCE", "JUSTIFICATIONID", "SUPPRESSED", "NOTE"])
df_leavetimes.head(5)

Unnamed: 0,DAYOFSERVICE,TRIPID,PROGRNUMBER,STOPPOINTID,PLANNEDTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_ARR,ACTUALTIME_DEP,VEHICLEID,LASTUPDATE
0,01-JAN-18 00:00:00,5972116,12,119,48030,48030,48012,48012,2693211,08-JAN-18 17:21:10
1,01-JAN-18 00:00:00,5966674,12,119,54001,54001,54023,54023,2693267,08-JAN-18 17:21:10
2,01-JAN-18 00:00:00,5959105,12,119,60001,60001,59955,59955,2693263,08-JAN-18 17:21:10
3,01-JAN-18 00:00:00,5966888,12,119,58801,58801,58771,58771,2693284,08-JAN-18 17:21:10
4,01-JAN-18 00:00:00,5965960,12,119,56401,56401,56309,56323,2693209,08-JAN-18 17:21:10


In [15]:
df_leavetimes.isnull().sum()

DAYOFSERVICE       0
TRIPID             0
PROGRNUMBER        0
STOPPOINTID        0
PLANNEDTIME_ARR    0
PLANNEDTIME_DEP    0
ACTUALTIME_ARR     0
ACTUALTIME_DEP     0
VEHICLEID          0
LASTUPDATE         0
dtype: int64

- There are no more null values, moving on to the data types

#### Setting up Data Types

In [16]:
# Finding the data types of each column
df_leavetimes.dtypes

DAYOFSERVICE       object
TRIPID              int64
PROGRNUMBER         int64
STOPPOINTID         int64
PLANNEDTIME_ARR     int64
PLANNEDTIME_DEP     int64
ACTUALTIME_ARR      int64
ACTUALTIME_DEP      int64
VEHICLEID           int64
LASTUPDATE         object
dtype: object

In [17]:
# Select columns containing categorical data
categorical_columns = df_leavetimes[["TRIPID", "PROGRNUMBER", "STOPPOINTID", "PLANNEDTIME_ARR", "PLANNEDTIME_DEP", "ACTUALTIME_ARR", "ACTUALTIME_DEP", "VEHICLEID"]].columns

# Convert data type to category for these columns
for column in categorical_columns:
    df_leavetimes[column] = df_leavetimes[column].astype('category')

In [18]:
# Converting dayofservice to datetime64
df_leavetimes['DAYOFSERVICE'] = pd.to_datetime(df_leavetimes['DAYOFSERVICE'], infer_datetime_format=True)

# Converting lastupdate to datetime64
df_leavetimes['LASTUPDATE'] = pd.to_datetime(df_leavetimes['LASTUPDATE'], infer_datetime_format=True)
df_leavetimes.dtypes

DAYOFSERVICE       datetime64[ns]
TRIPID                   category
PROGRNUMBER              category
STOPPOINTID              category
PLANNEDTIME_ARR          category
PLANNEDTIME_DEP          category
ACTUALTIME_ARR           category
ACTUALTIME_DEP           category
VEHICLEID                category
LASTUPDATE         datetime64[ns]
dtype: object
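In recent pandas releases `infer_datetime_format` is deprecated, so passing an explicit format may be the more robust option. The sketch below assumes the timestamps always follow the pattern seen in the sample rows (for example `01-JAN-18 00:00:00`); the constant name `DB_TIMESTAMP_FORMAT` and the demo frame are illustrative only.

```python
import pandas as pd

# Assumed from the sample rows: day, abbreviated month, two-digit year, time
DB_TIMESTAMP_FORMAT = "%d-%b-%y %H:%M:%S"

demo = pd.DataFrame({
    "DAYOFSERVICE": ["01-JAN-18 00:00:00", "07-FEB-18 00:00:00"],
    "LASTUPDATE": ["08-JAN-18 17:21:10", "28-FEB-18 12:05:11"],
})

# Parse both timestamp columns with the explicit format
for column in ["DAYOFSERVICE", "LASTUPDATE"]:
    demo[column] = pd.to_datetime(demo[column], format=DB_TIMESTAMP_FORMAT)

print(demo.dtypes)
```

An explicit format also avoids per-row format guessing, which can matter on a dataframe with over 100 million rows.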

In [19]:
df_leavetimes.describe().T


Unnamed: 0,count,unique,top,freq,first,last
DAYOFSERVICE,116949113.0,360.0,2018-02-12 00:00:00,384590.0,2018-01-01 00:00:00,2018-12-31 00:00:00
TRIPID,116949113.0,658961.0,7325784.0,1924.0,NaT,NaT
PROGRNUMBER,116949113.0,109.0,14.0,2108194.0,NaT,NaT
STOPPOINTID,116949113.0,4774.0,1479.0,214973.0,NaT,NaT
PLANNEDTIME_ARR,116949113.0,72712.0,61200.0,28108.0,NaT,NaT
PLANNEDTIME_DEP,116949113.0,72712.0,61200.0,28108.0,NaT,NaT
ACTUALTIME_ARR,116949113.0,74361.0,64806.0,2492.0,NaT,NaT
ACTUALTIME_DEP,116949113.0,74363.0,64815.0,2499.0,NaT,NaT
VEHICLEID,116949113.0,1151.0,2868400.0,179716.0,NaT,NaT
LASTUPDATE,116949113.0,360.0,2018-02-28 13:18:29,384590.0,2018-01-08 17:21:10,2019-01-16 18:27:21


In [20]:
# Write the cleaned sample df to a new csv file
df_leavetimes.to_csv(r'C:\Users\jason\OneDrive - University College Dublin\Documents\MSc Computer Science\Summer Semester\Data\Notebooks\CleanedCSVs\rt_leavetimes_DB_2018_cleaned_sample.csv')
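Note that `to_csv` writes the index as an extra column by default, and a CSV file does not store dtypes, so the category and datetime conversions above have to be reapplied on load. A minimal sketch of a round trip that accounts for both, using an in-memory buffer and a hypothetical two-row frame in place of the real file:

```python
import io
import pandas as pd

demo = pd.DataFrame({
    "DAYOFSERVICE": pd.to_datetime(["2018-01-01", "2018-01-02"]),
    "TRIPID": pd.Series([5972116, 5966674], dtype="category"),
})

buffer = io.StringIO()
demo.to_csv(buffer, index=False)  # index=False avoids an extra unnamed column on reload
buffer.seek(0)

# Dates and category dtype must be requested again when reading the CSV back
reloaded = pd.read_csv(buffer, parse_dates=["DAYOFSERVICE"], dtype={"TRIPID": "category"})
print(reloaded.dtypes)
```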

### Summary of data quality plan for sample of Leavetimes:

<table>
    <tr>
        <th>Feature</th><th>Feature Type</th><th>Data Quality Issue</th><th>Handling Strategy</th>
    </tr>
    <tr>
        <td>datasource</td><td>Category</td><td>Identical Values</td><td>This feature should be dropped as the values are all the same and add nothing useful to the dataset</td>
    </tr>
    <tr>
        <td>dayofservice</td><td>Datetime64</td><td>None</td><td>This feature will be left the same</td>
    </tr>
    <tr>
        <td>tripid</td><td>Category</td><td>None</td><td>This feature will be left the same</td>
    </tr>
    <tr>
        <td>progrnumber</td><td>Category</td><td>None</td><td>This feature will be left the same</td>
    </tr>
    <tr>
        <td>stoppointid</td><td>Category</td><td>None</td><td>This feature will be left the same</td>
    </tr>
    <tr>
        <td>plannedtime_arr</td><td>Category</td><td>None</td><td>This feature will be left the same</td>
    </tr>
    <tr>
        <td>plannedtime_dep</td><td>Category</td><td>None</td><td>This feature will be left the same</td>
    </tr>
    <tr>
        <td>actualtime_arr</td><td>Category</td><td>None</td><td>This feature will be left the same</td>
    </tr>
    <tr>
        <td>actualtime_dep</td><td>Category</td><td>None</td><td>This feature will be left the same</td>
    </tr>
    <tr>
        <td>vehicleid</td><td>Category</td><td>None</td><td>This feature will be left the same</td>
    </tr>
    <tr>
        <td>passengers</td><td>Category</td><td>Missing Values 100%</td><td>This feature is completely missing and therefore can provide no useful information for further analysis and should be dropped</td>
    </tr>
    <tr>
        <td>passengersin</td><td>Category</td><td>Missing Values 100%</td><td>This feature is completely missing and therefore can provide no useful information for further analysis and should be dropped</td>
    </tr>
    <tr>
        <td>passengersout</td><td>Category</td><td>Missing Values 100%</td><td>This feature is completely missing and therefore can provide no useful information for further analysis and should be dropped</td>
    </tr>
    <tr>
        <td>distance</td><td>Category</td><td>Missing Values 100%</td><td>This feature is completely missing and therefore can provide no useful information for further analysis and should be dropped</td>
    </tr>
    <tr>
        <td>suppressed</td><td>Category</td><td>Missing Values 99.5%</td><td>This feature is nearly completely missing. It might be possible to impute the affected times from the trips dataset, but the column is dropped in this notebook.</td>
    </tr>
    <tr>
        <td>justificationid</td><td>Category</td><td>Missing Values 99.5%</td><td>This feature is nearly completely missing and therefore provides little useful information for further analysis, so it should be dropped</td>
    </tr>
    <tr>
        <td>lastupdate</td><td>Datetime64</td><td>None</td><td>This feature will be left the same</td>
    </tr>
    <tr>
        <td>note</td><td>Category</td><td>Missing Values 100%</td><td>This feature is completely missing and provides no useful information and should be dropped</td>
    </tr>
</table>

### 3.0 Cleaning Trips Dataset

In [21]:
# Loading in trips dataset
df_trips = pd.read_csv(r'C:\Users\jason\OneDrive - University College Dublin\Documents\MSc Computer Science\Summer Semester\Data\GTFS Data\2018 Data\rt_trips_DB_2018.txt', sep=';')

In [22]:
# # Replacing carriage return with blank values
# df_trips.replace(to_replace=r'[\n\r\t]', value='', regex=True, inplace=True)

In [23]:
# # Replacing blank cells with Nan values
# df_trips.replace("", np.nan, inplace=True)

In [24]:
df_trips.head(20)

Unnamed: 0,DATASOURCE,DAYOFSERVICE,TRIPID,LINEID,ROUTEID,DIRECTION,PLANNEDTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_ARR,ACTUALTIME_DEP,BASIN,TENDERLOT,SUPPRESSED,JUSTIFICATIONID,LASTUPDATE,NOTE
0,DB,07-FEB-18 00:00:00,6253783,68,68_80,1,87245,84600,87524.0,84600.0,BasDef,,,,28-FEB-18 12:05:11,",2967409,"
1,DB,07-FEB-18 00:00:00,6262138,25B,25B_271,2,30517,26460,32752.0,,BasDef,,,,28-FEB-18 12:05:11,",2580260,"
2,DB,07-FEB-18 00:00:00,6254942,45A,45A_70,2,35512,32100,36329.0,32082.0,BasDef,,,,28-FEB-18 12:05:11,",2448968,"
3,DB,07-FEB-18 00:00:00,6259460,25A,25A_273,1,57261,54420,58463.0,54443.0,BasDef,,,,28-FEB-18 12:05:11,",3094242,"
4,DB,07-FEB-18 00:00:00,6253175,14,14_15,1,85383,81600,84682.0,81608.0,BasDef,,,,28-FEB-18 12:05:11,",2526331,"
5,DB,07-FEB-18 00:00:00,6248240,77A,77A_28,2,41648,37200,42019.0,37538.0,BasDef,,,,28-FEB-18 12:05:11,",2966500,"
6,DB,07-FEB-18 00:00:00,6251760,39,39_21,2,34768,28920,35709.0,28929.0,BasDef,,,,28-FEB-18 12:05:11,",2422850,"
7,DB,07-FEB-18 00:00:00,6262909,16,16_20,1,43936,38880,43721.0,38955.0,BasDef,,,,28-FEB-18 12:05:11,",3045261,"
8,DB,07-FEB-18 00:00:00,6251147,40D,40D_102,1,33558,30600,34540.0,31185.0,BasDef,,,,28-FEB-18 12:05:11,",2837123,"
9,DB,07-FEB-18 00:00:00,6249435,27B,27B_34,2,52682,49800,53281.0,49974.0,BasDef,,,,28-FEB-18 12:05:11,",2329338,"


In [25]:
# Checking the rows and columns of this dataframe
df_trips.shape

(2182637, 16)

In [26]:
df_trips.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
TRIPID,2182637.0,7149479.0,731442.265466,5955221.0,6592584.0,7110090.0,8018241.0,8592207.0
DIRECTION,2182637.0,1.504103,0.499983,1.0,1.0,2.0,2.0,2.0
PLANNEDTIME_ARR,2182637.0,55078.75,17424.822304,18203.0,39683.0,55276.0,69006.0,91680.0
PLANNEDTIME_DEP,2182637.0,51533.29,17571.637274,16200.0,36000.0,51600.0,65160.0,87600.0
ACTUALTIME_ARR,2045430.0,55062.39,17338.699154,17546.0,39668.0,55189.0,69019.0,94378.0
ACTUALTIME_DEP,2018086.0,51706.12,17526.948184,15974.0,36518.0,51618.0,65398.0,97177.0
TENDERLOT,0.0,,,,,,,
SUPPRESSED,4333.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
JUSTIFICATIONID,4330.0,237554.9,34585.051789,177856.0,205506.5,233851.5,271506.75,297899.0


#### Check for Null Values

In [27]:
df_trips.isnull().sum()

DATASOURCE               0
DAYOFSERVICE             0
TRIPID                   0
LINEID                   0
ROUTEID                  0
DIRECTION                0
PLANNEDTIME_ARR          0
PLANNEDTIME_DEP          0
ACTUALTIME_ARR      137207
ACTUALTIME_DEP      164551
BASIN                    0
TENDERLOT          2182637
SUPPRESSED         2178304
JUSTIFICATIONID    2178307
LASTUPDATE               0
NOTE                     0
dtype: int64

- There are null values for ACTUALTIME_ARR, ACTUALTIME_DEP, TENDERLOT, SUPPRESSED & JUSTIFICATIONID

#### We will now check what % of missing values each column has

In [28]:
# % missing on ACTUALTIME_ARR
actualtime_arr_missing = 100 * (df_trips['ACTUALTIME_ARR'].isnull().sum()/df_trips.shape[0])
actualtime_arr_missing

6.28629497254926

In [29]:
# % missing on ACTUALTIME_DEP
actualtime_dep_missing = 100 * (df_trips['ACTUALTIME_DEP'].isnull().sum()/df_trips.shape[0])
actualtime_dep_missing

7.5390914751284805

In [30]:
# % missing on tenderlot
tenderlot_missing = 100 * (df_trips['TENDERLOT'].isnull().sum()/df_trips.shape[0])
tenderlot_missing

100.0

In [31]:
# % missing on suppressed
suppressed_missing = 100 * (df_trips['SUPPRESSED'].isnull().sum()/df_trips.shape[0])
suppressed_missing

99.80147867006744

In [32]:
# % missing on justificationid
justificationid_missing = 100 * (df_trips['JUSTIFICATIONID'].isnull().sum()/df_trips.shape[0])
justificationid_missing

99.8016161184842

#### Checking for duplicate rows

In [33]:
# Checking duplicate rows
df_trips.duplicated().sum()

0

#### Dropping the necessary columns

In [34]:
# Dropping columns

df_trips = df_trips.drop(columns=["DATASOURCE","BASIN", "TENDERLOT","SUPPRESSED","JUSTIFICATIONID"])
df_trips.head(5)

Unnamed: 0,DAYOFSERVICE,TRIPID,LINEID,ROUTEID,DIRECTION,PLANNEDTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_ARR,ACTUALTIME_DEP,LASTUPDATE,NOTE
0,07-FEB-18 00:00:00,6253783,68,68_80,1,87245,84600,87524.0,84600.0,28-FEB-18 12:05:11,",2967409,"
1,07-FEB-18 00:00:00,6262138,25B,25B_271,2,30517,26460,32752.0,,28-FEB-18 12:05:11,",2580260,"
2,07-FEB-18 00:00:00,6254942,45A,45A_70,2,35512,32100,36329.0,32082.0,28-FEB-18 12:05:11,",2448968,"
3,07-FEB-18 00:00:00,6259460,25A,25A_273,1,57261,54420,58463.0,54443.0,28-FEB-18 12:05:11,",3094242,"
4,07-FEB-18 00:00:00,6253175,14,14_15,1,85383,81600,84682.0,81608.0,28-FEB-18 12:05:11,",2526331,"


In [35]:
df_trips.isnull().sum()

DAYOFSERVICE            0
TRIPID                  0
LINEID                  0
ROUTEID                 0
DIRECTION               0
PLANNEDTIME_ARR         0
PLANNEDTIME_DEP         0
ACTUALTIME_ARR     137207
ACTUALTIME_DEP     164551
LASTUPDATE              0
NOTE                    0
dtype: int64

- Null values remain only in ACTUALTIME_ARR and ACTUALTIME_DEP, affecting 6.3% and 7.5% of rows respectively.
- We will now drop all rows that still contain null values.

In [36]:
# Dropping rows with null values
df_trips = df_trips.dropna()

In [37]:
df_trips.isnull().sum()

DAYOFSERVICE       0
TRIPID             0
LINEID             0
ROUTEID            0
DIRECTION          0
PLANNEDTIME_ARR    0
PLANNEDTIME_DEP    0
ACTUALTIME_ARR     0
ACTUALTIME_DEP     0
LASTUPDATE         0
NOTE               0
dtype: int64

- No null values remain for trips
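Because a row can be missing one or both actual times, the share of rows removed by `dropna()` is not simply the sum of the two column percentages. A short sketch of how the actual row loss could be measured; `drop_missing_actual_times` and the demo data are hypothetical, and the subset is restricted to the two columns known to contain nulls:

```python
import numpy as np
import pandas as pd

def drop_missing_actual_times(df):
    """Drop rows lacking either actual time and report the share of rows removed."""
    before = len(df)
    cleaned = df.dropna(subset=["ACTUALTIME_ARR", "ACTUALTIME_DEP"])
    removed_pct = 100 * (before - len(cleaned)) / before
    return cleaned, removed_pct

# Hypothetical rows: one missing a departure, one missing both times
demo = pd.DataFrame({
    "TRIPID": [1, 2, 3, 4],
    "ACTUALTIME_ARR": [87524.0, 32752.0, np.nan, 58463.0],
    "ACTUALTIME_DEP": [84600.0, np.nan, np.nan, 54443.0],
})
cleaned, removed_pct = drop_missing_actual_times(demo)
print(len(cleaned), removed_pct)
```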

#### Checking and Changing datatypes

In [38]:
# Finding the data types of each column
df_trips.dtypes

DAYOFSERVICE        object
TRIPID               int64
LINEID              object
ROUTEID             object
DIRECTION            int64
PLANNEDTIME_ARR      int64
PLANNEDTIME_DEP      int64
ACTUALTIME_ARR     float64
ACTUALTIME_DEP     float64
LASTUPDATE          object
NOTE                object
dtype: object

In [39]:
# Convert all types to category
for column in df_trips:
    df_trips[column] = df_trips[column].astype('category')  

In [40]:
# Converting dayofservice to datetime64
df_trips['DAYOFSERVICE'] = pd.to_datetime(df_trips['DAYOFSERVICE'], infer_datetime_format=True)

# Converting lastupdate to datetime64
df_trips['LASTUPDATE'] = pd.to_datetime(df_trips['LASTUPDATE'], infer_datetime_format=True)
df_trips.dtypes

DAYOFSERVICE       category
TRIPID             category
LINEID             category
ROUTEID            category
DIRECTION          category
PLANNEDTIME_ARR    category
PLANNEDTIME_DEP    category
ACTUALTIME_ARR     category
ACTUALTIME_DEP     category
LASTUPDATE         category
NOTE               category
dtype: object
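The dtypes output above still lists DAYOFSERVICE and LASTUPDATE as category rather than datetime64, which suggests that casting every column to category first interferes with the later datetime conversion. One possible way to avoid this is to skip the date columns in the category loop. The sketch below assumes the same timestamp pattern as the sample rows; `set_trip_dtypes` and `DATE_COLUMNS` are hypothetical names. The vehicles dataframe later in the notebook follows the same pattern.

```python
import pandas as pd

DATE_COLUMNS = ["DAYOFSERVICE", "LASTUPDATE"]

def set_trip_dtypes(df, date_format="%d-%b-%y %H:%M:%S"):
    """Parse the date columns as datetimes and cast everything else to category."""
    df = df.copy()
    for column in df.columns:
        if column in DATE_COLUMNS:
            df[column] = pd.to_datetime(df[column], format=date_format)
        else:
            df[column] = df[column].astype("category")
    return df

# Hypothetical two-row extract of the trips data
demo = pd.DataFrame({
    "DAYOFSERVICE": ["07-FEB-18 00:00:00", "07-FEB-18 00:00:00"],
    "LINEID": ["68", "45A"],
    "DIRECTION": [1, 2],
    "LASTUPDATE": ["28-FEB-18 12:05:11", "28-FEB-18 12:05:11"],
})
typed = set_trip_dtypes(demo)
print(typed.dtypes)
```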

In [41]:
df_trips.head(5)

Unnamed: 0,DAYOFSERVICE,TRIPID,LINEID,ROUTEID,DIRECTION,PLANNEDTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_ARR,ACTUALTIME_DEP,LASTUPDATE,NOTE
0,2018-02-07,6253783,68,68_80,1,87245,84600,87524.0,84600.0,2018-02-28 12:05:11,",2967409,"
2,2018-02-07,6254942,45A,45A_70,2,35512,32100,36329.0,32082.0,2018-02-28 12:05:11,",2448968,"
3,2018-02-07,6259460,25A,25A_273,1,57261,54420,58463.0,54443.0,2018-02-28 12:05:11,",3094242,"
4,2018-02-07,6253175,14,14_15,1,85383,81600,84682.0,81608.0,2018-02-28 12:05:11,",2526331,"
5,2018-02-07,6248240,77A,77A_28,2,41648,37200,42019.0,37538.0,2018-02-28 12:05:11,",2966500,"


In [42]:
# Write the cleaned df to a new csv file
df_trips.to_csv(r'C:\Users\jason\OneDrive - University College Dublin\Documents\MSc Computer Science\Summer Semester\Data\Notebooks\CleanedCSVs\rt_trips_DB_2018_cleaned.csv')

### Summary of data quality plan for Trips Dataset:

<table>
    <tr>
        <th>Feature</th><th>Feature Type</th><th>Data Quality Issue</th><th>Handling Strategy</th>
    </tr>
    <tr>
        <td>datasource</td><td>Category</td><td>Identical Values</td><td>This feature should be dropped as the values are all the same and add nothing useful to the dataset</td>
    </tr>
    <tr>
        <td>dayofservice</td><td>Datetime64</td><td>None</td><td>This feature will be left the same</td>
    </tr>
    <tr>
        <td>tripid</td><td>Category</td><td>None</td><td>This feature will be left the same</td>
    </tr>
    <tr>
        <td>lineid</td><td>Category</td><td>None</td><td>This feature will be left the same</td>
    </tr>
    <tr>
        <td>routeid</td><td>Category</td><td>None</td><td>This feature will be left the same</td>
    </tr>
    <tr>
        <td>direction</td><td>Category</td><td>None</td><td>This feature will be left the same</td>
    </tr>
    <tr>
        <td>plannedtime_arr</td><td>Category</td><td>None</td><td>This feature will be left the same</td>
    </tr>
    <tr>
        <td>plannedtime_dep</td><td>Category</td><td>None</td><td>This feature will be left the same</td>
    </tr>
    <tr>
        <td>actualtime_arr</td><td>Category</td><td>Missing Values 6.3%</td><td>Dropping rows with null values</td>
    </tr>
    <tr>
        <td>actualtime_dep</td><td>Category</td><td>Missing Values 7.5%</td><td>Dropping rows with null values</td>
    </tr>
    <tr>
        <td>basin</td><td>Category</td><td>Identical Values</td><td>This feature should be dropped as the values are all the same and add nothing useful to the dataset</td>
    </tr>
    <tr>
        <td>tenderlot</td><td>Category</td><td>Missing Values 100%</td><td>This feature is completely missing and therefore can provide no useful information for further analysis and should be dropped</td>
    </tr>
    <tr>
        <td>suppressed</td><td>Category</td><td>Missing Values 99.8%</td><td>This feature is over 99% missing. The suppressed flag indicates that the actual arrival/departure time is missing. If this feature cannot be recovered, it can be dropped too.</td>
    </tr>
    <tr>
        <td>justificationid</td><td>Category</td><td>Missing Values 99.8%</td><td>This feature refers to a table that is not made available and is also nearly completely missing. This feature will be dropped.</td>
    </tr>
    <tr>
        <td>lastupdate</td><td>Datetime64</td><td>None</td><td>This feature will be left the same</td>
    </tr>
    <tr>
        <td>note</td><td>Category</td><td>None</td><td>This feature will be left the same</td>
    </tr>
</table>

### 4.0 Cleaning Vehicles Dataset

In [43]:
# Loading in vehicles dataset
df_vehicles = pd.read_csv(r'C:\Users\jason\OneDrive - University College Dublin\Documents\MSc Computer Science\Summer Semester\Data\GTFS Data\2018 Data\rt_vehicles_DB_2018.txt', sep=';')

In [44]:
# # Replacing carriage return with blank values
# df_vehicles.replace(to_replace=r'[\n\r\t]', value='', regex=True, inplace=True)

In [45]:
# # Replacing blank cells with Nan values
# df_vehicles.replace("", np.nan, inplace=True)

In [46]:
# Checking the rows and columns of this dataframe
df_vehicles.shape

(272622, 7)

In [47]:
df_vehicles.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
VEHICLEID,272622.0,1928436.0,810440.716935,1000110.0,1000948.0,2172252.0,2693217.0,3394131.0
DISTANCE,272622.0,177486.3,71334.432383,0.0,126765.0,177420.0,227126.25,473864.0
MINUTES,272622.0,45597.64,14881.389868,-70145.0,39380.0,46994.5,57911.0,76651.0
NOTE,0.0,,,,,,,
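The `describe()` output shows a minimum MINUTES value of -70145, which may be worth checking before the column is cast to category. A minimal sketch of such a check; the helper `count_negative` and the demo rows are hypothetical, and whether negative values are actually invalid for this feature is an assumption.

```python
import pandas as pd

def count_negative(df, column):
    """Count rows where a numeric column is below zero."""
    return int((df[column] < 0).sum())

# Hypothetical rows, including the negative minimum seen in describe()
demo = pd.DataFrame({
    "VEHICLEID": [3303848, 3303847, 2868329],
    "MINUTES": [58849, -70145, 40967],
})
negative_rows = count_negative(demo, "MINUTES")
print(negative_rows)
```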


#### Check for Null Values

In [48]:
df_vehicles.isnull().sum()

DATASOURCE           0
DAYOFSERVICE         0
VEHICLEID            0
DISTANCE             0
MINUTES              0
LASTUPDATE           0
NOTE            272622
dtype: int64

- There are null values for note

#### We will now check what % of missing values note has

In [49]:
# % missing on note
note_missing = 100 * (df_vehicles['NOTE'].isnull().sum()/df_vehicles.shape[0])
note_missing

100.0

#### Dropping the necessary columns

In [50]:
# Dropping columns

df_vehicles = df_vehicles.drop(columns=["DATASOURCE","NOTE"])
df_vehicles.head(5)

Unnamed: 0,DAYOFSERVICE,VEHICLEID,DISTANCE,MINUTES,LASTUPDATE
0,23-NOV-18 00:00:00,3303848,286166,58849,04-DEC-18 08:03:09
1,23-NOV-18 00:00:00,3303847,259545,56828,04-DEC-18 08:03:09
2,28-FEB-18 00:00:00,2868329,103096,40967,08-MAR-18 10:35:59
3,28-FEB-18 00:00:00,2868330,147277,43599,08-MAR-18 10:35:59
4,28-FEB-18 00:00:00,2868331,224682,40447,08-MAR-18 10:35:59


#### Checking for duplicate rows

In [51]:
# Checking duplicate rows
df_vehicles.duplicated().sum()

0

#### Checking and changing datatypes

In [52]:
# Finding the data types of each column
df_vehicles.dtypes

DAYOFSERVICE    object
VEHICLEID        int64
DISTANCE         int64
MINUTES          int64
LASTUPDATE      object
dtype: object

In [53]:
# Convert all types to category
for column in df_vehicles:
    df_vehicles[column] = df_vehicles[column].astype('category')  

In [54]:
# Converting dayofservice to datetime64
df_vehicles['DAYOFSERVICE'] = pd.to_datetime(df_vehicles['DAYOFSERVICE'], infer_datetime_format=True)

# Converting lastupdate to datetime64
df_vehicles['LASTUPDATE'] = pd.to_datetime(df_vehicles['LASTUPDATE'], infer_datetime_format=True)
df_vehicles.dtypes

DAYOFSERVICE    category
VEHICLEID       category
DISTANCE        category
MINUTES         category
LASTUPDATE      category
dtype: object

In [55]:
# Write the cleaned df to a new csv file
df_vehicles.to_csv(r'C:\Users\jason\OneDrive - University College Dublin\Documents\MSc Computer Science\Summer Semester\Data\Notebooks\CleanedCSVs\rt_vehicles_DB_2018_cleaned.csv')

### Summary of data quality plan for Vehicles Dataset:

<table>
    <tr>
        <th>Feature</th><th>Feature Type</th><th>Data Quality Issue</th><th>Handling Strategy</th>
    </tr>
    <tr>
        <td>datasource</td><td>Category</td><td>Identical Values</td><td>This feature should be dropped as the values are all the same and add nothing useful to the dataset</td>
    </tr>
    <tr>
        <td>dayofservice</td><td>Datetime64</td><td>None</td><td>This feature will be left the same</td>
    </tr>
    <tr>
        <td>vehicleid</td><td>Category</td><td>None</td><td>This feature will be left the same</td>
    </tr>
    <tr>
        <td>distance</td><td>Category</td><td>None</td><td>This feature will be left the same</td>
    </tr>
    <tr>
        <td>minutes</td><td>Category</td><td>None</td><td>This feature will be left the same</td>
    </tr>
    <tr>
        <td>lastupdate</td><td>Datetime64</td><td>None</td><td>This feature will be left the same</td>
    </tr>
    <tr>
        <td>note</td><td>Category</td><td>Missing Values 100%</td><td>This feature is completely missing and therefore can provide no useful information for further analysis and should be dropped</td>
    </tr>
</table>

### 5.0 Cleaning Weather Dataset

In [56]:
#read in weather csv
df_weather = pd.read_csv(r'C:\Users\jason\OneDrive - University College Dublin\Documents\MSc Computer Science\Summer Semester\Data\Notebooks\weather2018.csv')

In [57]:
# Checking the rows and columns of this dataframe
df_weather.shape

(8760, 7)

In [58]:
# Exploring the dataframe
df_weather.head(10)

Unnamed: 0,latitude,longitude,date,rain,temp,msl,wdsp
0,53.364,-6.35,01-jan-2018 00:00,0.0,4.6,991.0,
1,53.364,-6.35,01-jan-2018 01:00,0.1,4.7,991.1,
2,53.364,-6.35,01-jan-2018 02:00,0.0,4.8,991.1,
3,53.364,-6.35,01-jan-2018 03:00,0.0,4.9,990.7,
4,53.364,-6.35,01-jan-2018 04:00,0.0,5.3,990.3,
5,53.364,-6.35,01-jan-2018 05:00,0.0,5.1,990.0,
6,53.364,-6.35,01-jan-2018 06:00,0.0,5.0,990.9,
7,53.364,-6.35,01-jan-2018 07:00,0.0,4.6,991.2,
8,53.364,-6.35,01-jan-2018 08:00,0.0,4.3,991.8,
9,53.364,-6.35,01-jan-2018 09:00,0.0,4.5,992.4,


In [59]:
# Checking for null values
df_weather.isnull().sum()

latitude        0
longitude       0
date            0
rain            0
temp            0
msl             0
wdsp         8760
dtype: int64

In [60]:
# Checking for % null values on wdsp
wdsp_missing = 100 * (df_weather['wdsp'].isnull().sum()/df_weather.shape[0])
wdsp_missing

100.0

- 100% of values are missing on column wdsp, dropping this column

#### Dropping the necessary columns

In [61]:
# Dropping columns

df_weather = df_weather.drop(columns=["wdsp","latitude","longitude"])
df_weather.head(5)

Unnamed: 0,date,rain,temp,msl
0,01-jan-2018 00:00,0.0,4.6,991.0
1,01-jan-2018 01:00,0.1,4.7,991.1
2,01-jan-2018 02:00,0.0,4.8,991.1
3,01-jan-2018 03:00,0.0,4.9,990.7
4,01-jan-2018 04:00,0.0,5.3,990.3


#### Setting datatypes

In [62]:
df_weather.dtypes

date     object
rain     object
temp    float64
msl     float64
dtype: object

In [63]:
# Converting column 'date' to datetime64
df_weather['date'] = pd.to_datetime(df_weather['date'], infer_datetime_format=True)
df_weather.dtypes

date    datetime64[ns]
rain            object
temp           float64
msl            float64
dtype: object
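The `rain` column is still of type object, which usually means at least one value could not be read as a number. A hedged sketch of how it could be coerced to a numeric type; the placeholder string in the demo data is an assumption, since the actual non-numeric values in the weather file are not shown here.

```python
import pandas as pd

# Hypothetical data: the "missing" placeholder stands in for whatever
# non-numeric value makes the real column load as object
demo = pd.DataFrame({"rain": ["0.0", "0.1", "missing"], "temp": [4.6, 4.7, 4.8]})

# Values that cannot be parsed become NaN instead of raising an error
demo["rain"] = pd.to_numeric(demo["rain"], errors="coerce")
unparsed = int(demo["rain"].isnull().sum())
print(demo.dtypes)
print(unparsed)
```

Counting the NaN values introduced by the coercion shows how many readings would need to be handled afterwards.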

In [64]:
# Write the dataframe to a new csv file
df_weather.to_csv(r'C:\Users\jason\OneDrive - University College Dublin\Documents\MSc Computer Science\Summer Semester\Data\Notebooks\CleanedCSVs\weather2018_cleaned.csv')
df_weather.head()

Unnamed: 0,date,rain,temp,msl
0,2018-01-01 00:00:00,0.0,4.6,991.0
1,2018-01-01 01:00:00,0.1,4.7,991.1
2,2018-01-01 02:00:00,0.0,4.8,991.1
3,2018-01-01 03:00:00,0.0,4.9,990.7
4,2018-01-01 04:00:00,0.0,5.3,990.3


### Summary of data quality plan for Weather Dataset:

<table>
    <tr>
        <th>Feature</th><th>Feature Type</th><th>Data Quality Issue</th><th>Handling Strategy</th>
    </tr>
    <tr>
        <td>latitude</td><td>Continuous</td><td>Constant</td><td>This feature is dropped because it is constant across all rows and adds no information</td>
    </tr>
    <tr>
        <td>longitude</td><td>Continuous</td><td>Constant</td><td>This feature is dropped because it is constant across all rows and adds no information</td>
    </tr>
    <tr>
        <td>date</td><td>Datetime64</td><td>None</td><td>This feature will be left the same</td>
    </tr>
    <tr>
        <td>rain</td><td>Continuous</td><td>None</td><td>This feature will be left the same</td>
    </tr>
    <tr>
        <td>temp</td><td>Continuous</td><td>None</td><td>This feature will be left the same</td>
    </tr>
    <tr>
        <td>msl</td><td>Continuous</td><td>None</td><td>This feature will be left the same</td>
    </tr>
    <tr>
        <td>wdsp</td><td>Continuous</td><td>Missing Values 100%</td><td>This feature is completely missing and therefore can provide no useful information for further analysis and should be dropped</td>
    </tr>
</table>