<a id="top"></a>

This notebook cleans and prepares the data in the leavetimes dataset.  
The cleaned data will be used to build a model that predicts the travel time for a bus route.

***

# Import Packages

In [2]:
import pandas as pd
import datetime

***

# Load Original Data File

In [None]:
# load in original data file
# this may take a couple of minutes and can result in a memory error if many notebooks are in use
t1 = datetime.datetime.now()
# on_bad_lines='skip' (pandas 1.3+) replaces error_bad_lines=False, which was removed in pandas 2.0
df_leavetimes = pd.read_csv('/home/faye/data/rt_leavetimes_DB_2018.txt', sep=';', on_bad_lines='skip')
diff = datetime.datetime.now() - t1
print(f"Time to read: {diff}")
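
If the full read runs out of memory, one option is to read the file in chunks and keep only the columns that are needed. The following is a minimal sketch under assumptions: the helper name `read_in_chunks`, the chunk size, and the tiny in-memory sample are all hypothetical and stand in for the real file. Concatenating the chunks still needs the final DataFrame to fit in memory, so the saving comes mainly from `usecols` or from reducing each chunk before it is appended.

```python
import io

import pandas as pd


def read_in_chunks(source, chunksize=1_000_000, usecols=None):
    """Read a ';'-separated file chunk by chunk and return one DataFrame."""
    reader = pd.read_csv(source, sep=';', usecols=usecols, chunksize=chunksize)
    # each chunk is an ordinary DataFrame; it could be filtered or downcast here
    chunks = [chunk for chunk in reader]
    return pd.concat(chunks, ignore_index=True)


# hypothetical stand-in for the real file, using the same separator
sample = io.StringIO(
    "DATASOURCE;TRIPID;PROGRNUMBER;NOTE\n"
    "DB;5972116;12;\n"
    "DB;5966674;12;\n"
    "DB;5959105;12;\n"
)

df_sample = read_in_chunks(
    sample, chunksize=2, usecols=['DATASOURCE', 'TRIPID', 'PROGRNUMBER']
)
print(df_sample.shape)
```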

In [3]:
# load in original data file using a feather file
t1 = datetime.datetime.now()
df_leavetimes = pd.read_feather('/home/faye/data/leavetimes.feather')
diff = datetime.datetime.now() - t1
print(f"Time to read: {diff}")

Time to read: 0:01:22.023991


***

<a id="contents"></a>
# Contents

- [1. Data Quality Report](#data_quality_report)
    - [1.1. Overview of the Dataset](#overview)
    - [1.2. Feature Data Types](#feature_data_types)
    - [1.3. Duplicates and Constants](#duplicates_and_constants)
    - [1.4. Descriptive Statistics for Continuous Features](#desc_stats_continuous)
    - [1.5. Descriptive Statistics for Categorical Features](#desc_stats_categorical)
    - [1.6. Plotting Continuous Features](#plotting_continuous)
    - [1.7. Plotting Categorical Features](#plotting_categorical)
    - [1.8. Logical Integrity Check](#logical_integrity_check)
- [2. Data Quality Plan](#data_quality_plan)
    - [2.1. The Plan](#the_plan)
    - [2.2. Applying the Solutions](#applying_solutions)

***

<a id="data_quality_report"></a>
# 1. Data Quality Report
[Back to contents](#contents)

<a id="overview"></a>
## 1.1. Overview of the Dataset
[Top of section](#data_quality_report)

> Each entry in the leavetimes dataset contains the actual data for one leave time of one vehicle at one stop point of a route.

In [4]:
# print the number of rows and features
num_rows = df_leavetimes.shape[0]
features = df_leavetimes.shape[1]
print(f"The dataset has {num_rows} rows with {features} features.")

The dataset has 116949113 rows with 18 features.


In [5]:
# print the first 5 rows of the dataset
print("The first 5 Rows are:")
df_leavetimes.head(5)

The first 5 Rows are:


Unnamed: 0,DATASOURCE,DAYOFSERVICE,TRIPID,PROGRNUMBER,STOPPOINTID,PLANNEDTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_ARR,ACTUALTIME_DEP,VEHICLEID,PASSENGERS,PASSENGERSIN,PASSENGERSOUT,DISTANCE,SUPPRESSED,JUSTIFICATIONID,LASTUPDATE,NOTE
0,DB,01-JAN-18 00:00:00,5972116,12,119,48030,48030,48012,48012,2693211,,,,,,,08-JAN-18 17:21:10,
1,DB,01-JAN-18 00:00:00,5966674,12,119,54001,54001,54023,54023,2693267,,,,,,,08-JAN-18 17:21:10,
2,DB,01-JAN-18 00:00:00,5959105,12,119,60001,60001,59955,59955,2693263,,,,,,,08-JAN-18 17:21:10,
3,DB,01-JAN-18 00:00:00,5966888,12,119,58801,58801,58771,58771,2693284,,,,,,,08-JAN-18 17:21:10,
4,DB,01-JAN-18 00:00:00,5965960,12,119,56401,56401,56309,56323,2693209,,,,,,,08-JAN-18 17:21:10,


In [12]:
df_leavetimes.head(5)

Unnamed: 0,DATASOURCE,DAYOFSERVICE,TRIPID,PROGRNUMBER,STOPPOINTID,PLANNEDTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_ARR,ACTUALTIME_DEP,VEHICLEID,PASSENGERS,PASSENGERSIN,PASSENGERSOUT,DISTANCE,SUPPRESSED,JUSTIFICATIONID,LASTUPDATE,NOTE
0,DB,2018-01-01,5972116,12,119,1970-01-01 13:20:30,1970-01-01 13:20:30,1970-01-01 13:20:12,1970-01-01 13:20:12,2693211,,,,,,,2018-01-08 17:21:10,
1,DB,2018-01-01,5966674,12,119,1970-01-01 15:00:01,1970-01-01 15:00:01,1970-01-01 15:00:23,1970-01-01 15:00:23,2693267,,,,,,,2018-01-08 17:21:10,
2,DB,2018-01-01,5959105,12,119,1970-01-01 16:40:01,1970-01-01 16:40:01,1970-01-01 16:39:15,1970-01-01 16:39:15,2693263,,,,,,,2018-01-08 17:21:10,
3,DB,2018-01-01,5966888,12,119,1970-01-01 16:20:01,1970-01-01 16:20:01,1970-01-01 16:19:31,1970-01-01 16:19:31,2693284,,,,,,,2018-01-08 17:21:10,
4,DB,2018-01-01,5965960,12,119,1970-01-01 15:40:01,1970-01-01 15:40:01,1970-01-01 15:38:29,1970-01-01 15:38:43,2693209,,,,,,,2018-01-08 17:21:10,
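
The 1970-01-01 timestamps in this output are what results when the seconds values are read as offsets from the Unix epoch (48030 seconds is 13:20:30). A minimal sketch of the difference between that and anchoring the time to the service day is shown here. It assumes the time features hold seconds counted from midnight of DAYOFSERVICE, which the data description does not state explicitly; the sample frame is hypothetical and reuses two values from the first rows.

```python
import pandas as pd

# hypothetical sample reusing values from the first rows of the dataset
df_sample = pd.DataFrame({
    'DAYOFSERVICE': pd.to_datetime(['2018-01-01', '2018-01-01']),
    'PLANNEDTIME_ARR': [48030, 54001],
})

# seconds read as an offset from the Unix epoch: the date part is always 1970-01-01
epoch_based = pd.to_datetime(df_sample['PLANNEDTIME_ARR'], unit='s')

# seconds read as an offset from midnight of the service day (assumption)
anchored = df_sample['DAYOFSERVICE'] + pd.to_timedelta(
    df_sample['PLANNEDTIME_ARR'], unit='s'
)

print(epoch_based.iloc[0])  # 1970-01-01 13:20:30
print(anchored.iloc[0])     # 2018-01-01 13:20:30
```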


In [6]:
# print the last 5 rows of the dataset
print("The last 5 Rows are:")
df_leavetimes.tail(5)

The last 5 Rows are:


Unnamed: 0,DATASOURCE,DAYOFSERVICE,TRIPID,PROGRNUMBER,STOPPOINTID,PLANNEDTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_ARR,ACTUALTIME_DEP,VEHICLEID,PASSENGERS,PASSENGERSIN,PASSENGERSOUT,DISTANCE,SUPPRESSED,JUSTIFICATIONID,LASTUPDATE,NOTE
116949108,DB,31-DEC-18 00:00:00,8588153,78,4383,28605,28605,28998,29013,3265721,,,,,,,16-JAN-19 18:27:21,
116949109,DB,31-DEC-18 00:00:00,8587459,78,4383,22695,22695,23247,23247,3265687,,,,,,,16-JAN-19 18:27:21,
116949110,DB,31-DEC-18 00:00:00,8586183,78,4383,51481,51481,52237,52283,2693229,,,,,,,16-JAN-19 18:27:21,
116949111,DB,31-DEC-18 00:00:00,8589374,23,7053,53659,53659,53525,53525,3265669,,,,,,,16-JAN-19 18:27:21,
116949112,DB,31-DEC-18 00:00:00,8589372,24,2088,46383,46383,46315,46325,3265669,,,,,,,16-JAN-19 18:27:21,


<a id="feature_data_types"></a>
## 1.2. Feature Data Types
[Top of section](#data_quality_report)

In [7]:
# print the data type for each feature
df_leavetimes.dtypes

DATASOURCE          object
DAYOFSERVICE        object
TRIPID               int64
PROGRNUMBER          int64
STOPPOINTID          int64
PLANNEDTIME_ARR      int64
PLANNEDTIME_DEP      int64
ACTUALTIME_ARR       int64
ACTUALTIME_DEP       int64
VEHICLEID            int64
PASSENGERS         float64
PASSENGERSIN       float64
PASSENGERSOUT      float64
DISTANCE           float64
SUPPRESSED         float64
JUSTIFICATIONID    float64
LASTUPDATE          object
NOTE               float64
dtype: object

- I am not concerned with the DATASOURCE feature, so I will leave it as an object.  
- DAYOFSERVICE and LASTUPDATE are dates, so I will convert these to datetime objects.  
- TRIPID, STOPPOINTID, and VEHICLEID are all unique ID numbers. I will leave these as int64.  
- PROGRNUMBER is the sequential position of the stop point in the trip. I will leave this as int64.  
- PLANNEDTIME_ARR, PLANNEDTIME_DEP, ACTUALTIME_ARR, and ACTUALTIME_DEP are measured in seconds. I will leave these as int64 for the moment.  
- PASSENGERS, PASSENGERSIN, and PASSENGERSOUT describe the number of passengers on board, boarding, and alighting. I will change these to int64.  
- DISTANCE describes the distance measured from the beginning of the trip. I will leave this as float64.  
- SUPPRESSED is a flag feature (0 = achieved, 1 = suppressed). If the leave time is planned and achieved, the record contains both the planned and actual data. If the leave time is planned and suppressed, the record contains only the planned data and the SUPPRESSED flag is set. When a trip is partially suppressed, the flag indicates that the previous link is suppressed. I will convert this feature to boolean.  
- The JUSTIFICATIONID feature is simply described as 'fault code' and listed as a number. I will examine this feature.
- The NOTE feature is described as 'free note'. I will also need to examine this feature.
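
One caveat for the passenger features: a plain int64 column cannot hold missing values, and the rows shown earlier have NaN in these columns. A minimal sketch of one way around this, using pandas' nullable `Int64` dtype, follows. The sample values are hypothetical, and this is an assumption about how the conversion could be done rather than the approach taken later in this notebook.

```python
import numpy as np
import pandas as pd

# hypothetical sample: counts stored as float64 because of the missing values
df_sample = pd.DataFrame({
    'PASSENGERS': [12.0, np.nan, 7.0],
    'PASSENGERSIN': [3.0, np.nan, 0.0],
    'PASSENGERSOUT': [1.0, np.nan, 2.0],
})

passenger_cols = ['PASSENGERS', 'PASSENGERSIN', 'PASSENGERSOUT']

# 'Int64' (capital I) is the nullable integer dtype; NaN becomes <NA>
df_sample = df_sample.astype({col: 'Int64' for col in passenger_cols})

print(df_sample.dtypes)
```

Calling `.astype('int64')` on the same columns would raise an error, because NaN has no integer representation.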

### 1.2.1. Convert DAYOFSERVICE and LASTUPDATE to datetime objects

In [12]:
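# convert DAYOFSERVICE and LASTUPDATE to datetime objects
# the raw strings look like '01-JAN-18 00:00:00', so the format is given explicitly
date_format = '%d-%b-%y %H:%M:%S'
df_leavetimes['DAYOFSERVICE'] = pd.to_datetime(df_leavetimes['DAYOFSERVICE'], format=date_format)
df_leavetimes['LASTUPDATE'] = pd.to_datetime(df_leavetimes['LASTUPDATE'], format=date_format)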

### 1.2.2. Convert SUPPRESSED to boolean

In [13]:
# convert SUPPRESSED to boolean
df_leavetimes['SUPPRESSED'] = df_leavetimes['SUPPRESSED'].astype('boolean')
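
The nullable `'boolean'` dtype matters here because the flag column contains missing values. A minimal sketch on a hypothetical three-value sample shows the difference from NumPy's plain `bool`, which silently turns NaN into `True`:

```python
import numpy as np
import pandas as pd

# hypothetical sample of the flag: achieved, suppressed, missing
flags = pd.Series([0.0, 1.0, np.nan], name='SUPPRESSED')

# nullable boolean keeps the missing value as <NA>
as_boolean = flags.astype('boolean')

# plain bool has no missing value, so NaN is converted to True
as_bool = flags.astype(bool)

print(as_boolean.tolist())  # [False, True, <NA>]
print(as_bool.tolist())     # [False, True, True]
```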

### 1.2.3. Examine JUSTIFICATIONID and NOTE

In [14]:
# print the unique values of the JUSTIFICATIONID feature
df_leavetimes['JUSTIFICATIONID'].unique()

array([           nan, 4.84981386e+17, 4.84981386e+17, ...,
       4.84981387e+17, 4.84981387e+17, 4.84981387e+17])

In [15]:
# print the number of unique values for the JUSTIFICATIONID feature
len(df_leavetimes['JUSTIFICATIONID'].unique())

34966

> In the Concept Design document for the database from which this data comes, JUSTIFICATIONID is used as a foreign key to a justifications table whose data we do not have.  
> As we are missing the data to which this feature relates, I believe it will be unusable for modelling, and therefore I will drop this feature later. For the moment I will leave it as float64.
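
A side note on the float64 representation: the values printed above are around 4.8e17, which is beyond 2^53 (about 9.0e15), the largest range in which float64 represents every integer exactly. At this magnitude adjacent float64 values are 64 apart, so distinct integer IDs can collapse into the same float and the unique count may be understated. The sketch below illustrates this with hypothetical IDs of the same magnitude; they are not taken from the dataset, and it assumes the IDs are integers in the source file.

```python
import pandas as pd

# hypothetical IDs of the same magnitude as the values printed above
ids = [484981386000000001, 484981386000000002, 484981386000000003]

as_int = pd.Series(ids, dtype='int64')
as_float = as_int.astype('float64')

print(as_int.nunique())    # 3
print(as_float.nunique())  # 1, all three round to the same float64 value
```

If the feature were ever needed, reading it as a string would preserve the exact values.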

In [16]:
# print the unique values for the NOTE feature
df_leavetimes['NOTE'].unique()

array([nan])

> The NOTE feature is a constant column of missing values. This will be dealt with later so I will leave the data type as float64.
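
The same check can be run over every feature at once. A minimal sketch, assuming a hypothetical helper `constant_columns` and a small made-up sample frame, counts distinct values per column with NaN treated as a value, so an all-missing column such as NOTE is reported as constant:

```python
import numpy as np
import pandas as pd


def constant_columns(df):
    """Return the names of columns that hold at most one distinct value."""
    # dropna=False counts NaN as a value, so an all-NaN column has a count of 1
    counts = df.nunique(dropna=False)
    return counts[counts <= 1].index.tolist()


# hypothetical sample with one varying and two constant columns
df_sample = pd.DataFrame({
    'DATASOURCE': ['DB', 'DB', 'DB'],
    'TRIPID': [5972116, 5966674, 5959105],
    'NOTE': [np.nan, np.nan, np.nan],
})

print(constant_columns(df_sample))  # ['DATASOURCE', 'NOTE']
```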

### 1.2.4. Separate Categorical and Continuous Features

In [17]:
# display the datatype of each feature
df_leavetimes.dtypes

DATASOURCE                 object
DAYOFSERVICE       datetime64[ns]
TRIPID                      int64
PROGRNUMBER                 int64
STOPPOINTID                 int64
PLANNEDTIME_ARR             int64
PLANNEDTIME_DEP             int64
ACTUALTIME_ARR              int64
ACTUALTIME_DEP              int64
VEHICLEID                   int64
PASSENGERS                float64
PASSENGERSIN              float64
PASSENGERSOUT             float64
DISTANCE                  float64
SUPPRESSED                boolean
JUSTIFICATIONID           float64
LASTUPDATE         datetime64[ns]
NOTE                      float64
dtype: object

In [18]:
# Select columns containing categorical data
# (built from the names directly, which avoids copying part of the DataFrame)
categorical_columns = pd.Index(['DATASOURCE', 'SUPPRESSED'])

In [19]:
# Select columns containing continuous data
# (built from the names directly, which avoids copying most of the DataFrame)
continuous_columns = pd.Index(
    ['DAYOFSERVICE', 'TRIPID', 'PROGRNUMBER', 'STOPPOINTID',
     'PLANNEDTIME_ARR', 'PLANNEDTIME_DEP', 'ACTUALTIME_ARR',
     'ACTUALTIME_DEP', 'VEHICLEID', 'PASSENGERS', 'PASSENGERSIN',
     'PASSENGERSOUT', 'DISTANCE', 'JUSTIFICATIONID',
     'LASTUPDATE', 'NOTE']
)

<a id="duplicates_and_constants"></a>
## 1.3. Duplicates and Constants
[Top of section](#data_quality_report)

<a id="desc_stats_continuous"></a>
## 1.4. Descriptive Statistics for Continuous Features
[Top of section](#data_quality_report)

<a id="desc_stats_categorical"></a>
## 1.5. Descriptive Statistics for Categorical Features
[Top of section](#data_quality_report)

<a id="plotting_continuous"></a>
## 1.6. Plotting Continuous Features
[Top of section](#data_quality_report)

<a id="plotting_categorical"></a>
## 1.7. Plotting Categorical Features
[Top of section](#data_quality_report)

<a id="logical_integrity_check"></a>
## 1.8. Logical Integrity Check
[Top of section](#data_quality_report)

<a id="data_quality_plan"></a>
# 2. Data Quality Plan
[Back to contents](#contents)

<a id="the_plan"></a>
## 2.1. The Plan
[Top of section](#data_quality_plan)

<a id="applying_solutions"></a>
## 2.2. Applying the Solutions
[Top of section](#data_quality_plan)

***

[Back to top](#top)