<a id="top"></a>

The aim of this notebook is to clean and prepare the data in the leavetimes dataset.  
The cleaned data will be used to build a model that predicts the travel time for a bus route.

***

# Import Packages

In [1]:
import pandas as pd
import datetime
import matplotlib.pyplot as plt
import calendar
import numpy as np
import time

***

# Bold Print Function

In [2]:
def print_bold(string):
    """
    Function to print a given string in bold text.
    """
    print("\033[1m" + string + "\033[0m")

***

# Load Original Data File

In [3]:
# load in original data file using pandas
# this may take a couple of minutes and can result in a memory error if many notebooks are in use
t1 = datetime.datetime.now()
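# on_bad_lines='skip' replaces error_bad_lines=False, which was deprecated in pandas 1.3 and removed in 2.0
df_leavetimes = pd.read_csv('/home/faye/data/rt_leavetimes_DB_2018.txt', sep=';', on_bad_lines='skip')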
diff = datetime.datetime.now() - t1
print(f"Time to read: {diff}")

MemoryError: Unable to allocate 6.10 GiB for an array with shape (7, 116949113) and data type float64
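
The MemoryError above comes from loading every column of the file at full width in one go. One possible way to reduce the footprint is to read only the columns that are needed, give them narrower dtypes, and read the file in chunks. The sketch below uses a tiny in-memory stand-in for the real file; the column subset, the dtypes, and the chunk size are assumptions for illustration, not values taken from the original data.

```python
import io
import pandas as pd

# Tiny stand-in for the real ';'-separated file (values are made up for illustration).
sample = io.StringIO(
    "DAYOFSERVICE;TRIPID;PROGRNUMBER;ACTUALTIME_ARR;ACTUALTIME_DEP;NOTE\n"
    "2018-01-01;5972116;12;48012;48012;\n"
    "2018-01-01;5966674;12;54023;54023;\n"
    "2018-01-02;5959105;13;59955;59955;\n"
)

# Assumed compact dtypes for the columns we want to keep.
dtypes = {
    "TRIPID": "int32",
    "PROGRNUMBER": "int16",
    "ACTUALTIME_ARR": "int32",
    "ACTUALTIME_DEP": "int32",
}
usecols = ["DAYOFSERVICE", *dtypes]

# Read a bounded number of rows at a time instead of the whole file at once.
chunks = pd.read_csv(sample, sep=";", usecols=usecols, dtype=dtypes, chunksize=2)
df_small = pd.concat(chunks, ignore_index=True)
print(df_small.dtypes)
```

Concatenating the chunks still needs enough memory for the reduced dataframe, so the saving comes mainly from the dropped columns and the smaller dtypes.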

In [None]:
# load in original data file using a feather file
t1 = datetime.datetime.now()
df_leavetimes = pd.read_feather('/home/faye/data/leavetimes.feather')
diff = datetime.datetime.now() - t1
print(f"Time to read: {diff}")

***

<a id="contents"></a>
# Contents

- [1. Data Quality Report](#data_quality_report)
    - [1.1. Overview of the Dataset](#overview)
    - [1.2. Feature Data Types](#feature_data_types)
    - [1.3. Duplicates and Constants](#duplicates_and_constants)
    - [1.4. Descriptive Statistics for Continuous Features](#desc_stats_continuous)
    - [1.5. Descriptive Statistics for Categorical Features](#desc_stats_categorical)
    - [1.6. Plotting Continuous Features](#plotting_continuous)
    - [1.7. Plotting Categorical Features](#plotting_categorical)
    - [1.8. Logical Integrity Check](#logical_integrity_check)
    
    
- [2. Data Quality Plan](#data_quality_plan)
    - [2.1. The Plan](#the_plan)
    - [2.2. Applying the Solutions](#applying_solutions)  
  
  
- [3. Transforming and Extending Features](#transforming_and_extending)
    - [3.1. Add feature: MONTH](#add_month)
    - [3.2. Add feature: DAYOFWEEK](#add_dayofweek)
    - [3.3. Add feature: DWELLTIME](#add_dwelltime)
    - [3.4. Add feature: PLANNEDTIME_TRAVEL](#add_plannedtime_travel)
    - [3.5. Add feature: ACTUALTIME_TRAVEL](#add_actualtime_travel)
    - [3.6. Add feature: IS_HOLIDAY](#add_isholiday)
    - [3.7. Add feature: IS_WEEKDAY](#add_isweekday)

***

<a id="data_quality_report"></a>
# 1. Data Quality Report
[Back to contents](#contents)

<a id="overview"></a>
## 1.1. Overview of the Dataset
[Top of section](#data_quality_report)

> Each entry in the leavetimes dataset contains the actual data for one leave time of one vehicle at one stop point of a route.

In [None]:
# print the number of rows and features
num_rows = df_leavetimes.shape[0]
features = df_leavetimes.shape[1]
print(f"The dataset has {num_rows} rows with {features} features.")

In [None]:
# print the first 5 rows of the dataset
print("The first 5 Rows are:")
df_leavetimes.head(5)

In [None]:
# print the last 5 rows of the dataset
print("The last 5 Rows are:")
df_leavetimes.tail(5)

***

<a id="feature_data_types"></a>
## 1.2. Feature Data Types
[Top of section](#data_quality_report)

In [None]:
# print the data type for each feature
df_leavetimes.dtypes

- I am not concerned with the feature DATASOURCE so I will leave it as an object.  
- DAYOFSERVICE and LASTUPDATE are dates so I will convert these to datetime objects.  
- TRIPID, STOPPOINTID, and VEHICLEID are all unique ID numbers. I will convert these to type category.  
- PROGRNUMBER is the sequential position of the stop point in the trip. I will convert this to category.  
- PLANNEDTIME_ARR, PLANNEDTIME_DEP, ACTUALTIME_ARR, and ACTUALTIME_DEP are measured in seconds. I will leave these as int64 for the moment.  
- PASSENGERS, PASSENGERSIN, and PASSENGERSOUT describe the number of passengers on board, boarding, and descending. I will leave these features as float64.  
- DISTANCE describes the distance measured from the beginning of the trip. I will leave this as float64.  
- SUPPRESSED is a flag feature (0 = achieved, 1 = suppressed). If the leave time is planned and achieved, the record contains both the planned and actual data. If the leave time is planned and suppressed, the record contains only the planned data and the SUPPRESSED flag is set. When the trip is partially suppressed, the flag indicates that the previous link is suppressed. I will convert this feature to boolean.  
- The JUSTIFICATIONID feature is simply described as 'fault code' and listed as a number. I will examine this feature.
- The NOTE feature is described as 'free note', I will also need to examine this feature.

### 1.2.1. Convert DAYOFSERVICE and LASTUPDATE to datetime objects

In [None]:
# convert DAYOFSERVICE and LASTUPDATE to datetime objects
# pd.to_datetime is used because astype('datetime64') without a unit is rejected by recent pandas versions
df_leavetimes['DAYOFSERVICE'] = pd.to_datetime(df_leavetimes['DAYOFSERVICE'])
df_leavetimes['LASTUPDATE'] = pd.to_datetime(df_leavetimes['LASTUPDATE'])

### 1.2.2. Convert TRIPID, STOPPOINTID, VEHICLEID, and PROGRNUMBER to category

In [None]:
# convert the selected columns to type category
cols = ['TRIPID', 'STOPPOINTID', 'VEHICLEID', 'PROGRNUMBER']

for col in cols:
    df_leavetimes[col] = df_leavetimes[col].astype('category')

### 1.2.3. Convert SUPPRESSED to boolean

In [None]:
# convert SUPPRESSED to boolean
df_leavetimes['SUPPRESSED'] = df_leavetimes['SUPPRESSED'].astype('boolean')
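
A short sketch of why the nullable `'boolean'` dtype is used here rather than plain `bool`, assuming SUPPRESSED arrives as 0.0, 1.0, or missing. The toy series below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for SUPPRESSED as read from file: 0.0, 1.0 or missing.
suppressed = pd.Series([0.0, 1.0, np.nan, 0.0])

as_nullable = suppressed.astype("boolean")  # missing values stay missing (<NA>)
as_plain = suppressed.astype(bool)          # NaN is truthy, so it silently becomes True

print(as_nullable.isna().sum())
print(as_plain.tolist())
```

With plain `bool`, every missing entry would be counted as suppressed, which would distort the frequency table later on.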

### 1.2.4. Examine JUSTIFICATIONID and NOTE

#### JUSTIFICATIONID

In [None]:
# print the unique values of the JUSTIFICATIONID feature
df_leavetimes['JUSTIFICATIONID'].unique()

In [None]:
# print the number of unique values for the JUSTIFICATIONID feature
len(df_leavetimes['JUSTIFICATIONID'].unique())

> In the Concept Design document for the database from which this data comes, we see that JUSTIFICATIONID is used as a foreign key to link to a justifications table whose data we do not have.  
As we are missing the data to which this feature relates, I believe it will be unusable for modelling and I will therefore drop it later.  
For the moment I will convert this feature to type category, as it is an identification number.

In [None]:
# convert JUSTIFICATIONID to type category
df_leavetimes['JUSTIFICATIONID'] = df_leavetimes['JUSTIFICATIONID'].astype('category')

#### NOTE

In [None]:
# print the unique values for the NOTE feature
df_leavetimes['NOTE'].unique()

> The NOTE feature is a constant column of missing values. This will be dealt with later so I will leave the data type as float64.

### 1.2.5. Converted Feature Data Types

In [None]:
# display the datatype of each feature
df_leavetimes.dtypes

In [None]:
# select all categorical columns
categorical_columns = df_leavetimes[
    ['DATASOURCE', 'TRIPID', 'PROGRNUMBER', 'STOPPOINTID', 
     'VEHICLEID', 'SUPPRESSED', 'JUSTIFICATIONID'
    ]
].columns

In [None]:
# select all continuous columns
continuous_columns = df_leavetimes[
    ['DAYOFSERVICE', 'PLANNEDTIME_ARR', 'PLANNEDTIME_DEP', 'ACTUALTIME_ARR', 'ACTUALTIME_DEP', 
     'PASSENGERS', 'PASSENGERSIN', 'PASSENGERSOUT', 'DISTANCE', 'LASTUPDATE', 'NOTE']
].columns

***

<a id="duplicates_and_constants"></a>
## 1.3. Duplicates and Constants
[Top of section](#data_quality_report)

### 1.3.1. Duplicate Rows

In [None]:
# Find number of duplicate rows
num_duplicate_rows = df_leavetimes.duplicated().sum()
print(f"There are {num_duplicate_rows} duplicated rows in this dataset (excluding the first occurrence of each).")
num_duplicate_rows_inclusive = df_leavetimes[df_leavetimes.duplicated(keep=False)].shape[0]
print(f"There are {num_duplicate_rows_inclusive} duplicated rows in this dataset (including the first occurrence of each).")

> There are no duplicate rows in this dataset.
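
The two counts printed above differ only in the `keep` argument of `duplicated()`. A small sketch on made-up rows shows the difference:

```python
import pandas as pd

# Made-up rows: (1, 5) appears three times, (2, 6) once.
toy = pd.DataFrame({"TRIPID": [1, 1, 2, 1], "PROGRNUMBER": [5, 5, 6, 5]})

# keep='first' (default): every repeat after the first occurrence is flagged
repeats_only = toy.duplicated().sum()
# keep=False: every member of a duplicated group is flagged, including the first
all_members = toy.duplicated(keep=False).sum()

print(repeats_only, all_members)
```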

### 1.3.2. Duplicate Columns

> There do not appear to be any duplicate columns in this dataset.

### 1.3.3. Constant Columns

In [None]:
# Print Categorical Descriptive Statistics
df_leavetimes[categorical_columns].describe().T

> From this we see that DATASOURCE is a constant column, as it has only 1 unique value. It will be dropped.

In [None]:
# Check that the continuous features have a standard deviation greater than zero
df_leavetimes[continuous_columns].std()

> From the standard deviation of the continuous features above we can see that PASSENGERS, PASSENGERSIN, PASSENGERSOUT, DISTANCE, and NOTE are constant columns of null values. These will be dropped.
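
As an alternative to reading constant columns off `describe()` and `std()`, they could be detected in one pass by counting distinct values. This is only a sketch: `find_constant_columns` is a hypothetical helper, and the toy values (including the DATASOURCE value) are made up.

```python
import numpy as np
import pandas as pd

def find_constant_columns(df):
    """Return names of columns with at most one distinct value (NaN counts as a value)."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]

toy = pd.DataFrame({
    "DATASOURCE": ["DB", "DB", "DB"],          # one repeated value
    "NOTE": [np.nan, np.nan, np.nan],          # entirely missing
    "ACTUALTIME_ARR": [48012, 54023, 59955],   # genuinely varying
})

print(find_constant_columns(toy))
```

Counting with `dropna=False` treats an all-null column as constant, which matches how PASSENGERS, DISTANCE, and NOTE are handled here.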

### 1.3.4. Drop Constant Columns

In [None]:
# drop constant columns
constant_columns = [
    'DATASOURCE', 'PASSENGERS', 'PASSENGERSIN', 'PASSENGERSOUT',
    'DISTANCE', 'NOTE'
]

df_leavetimes = df_leavetimes.drop(columns=constant_columns)

## Save Cleaned Data Frame - 1

In [None]:
# reorder columns
reordered_columns = [
    'DAYOFSERVICE', 'LASTUPDATE', 'TRIPID', 'PROGRNUMBER', 'STOPPOINTID', 
    'VEHICLEID', 'JUSTIFICATIONID', 'SUPPRESSED',
    'PLANNEDTIME_ARR', 'ACTUALTIME_ARR', 
    'PLANNEDTIME_DEP', 'ACTUALTIME_DEP',    
]

df_leavetimes = df_leavetimes[reordered_columns]

In [None]:
# save dataframe to csv
df_leavetimes.to_csv('/home/faye/data/leavetimes_cleaned_1.csv')

In [None]:
# save dataframe to feather
df_leavetimes.to_feather('/home/faye/data/leavetimes_cleaned_1.feather')

# Load Cleaned Data (if needed)

In [3]:
# load in cleaned data from feather file
df_leavetimes = pd.read_feather('/home/faye/data/leavetimes_cleaned_1.feather')

In [4]:
df_leavetimes.head(5)

Unnamed: 0,DAYOFSERVICE,LASTUPDATE,TRIPID,PROGRNUMBER,STOPPOINTID,VEHICLEID,JUSTIFICATIONID,SUPPRESSED,PLANNEDTIME_ARR,ACTUALTIME_ARR,PLANNEDTIME_DEP,ACTUALTIME_DEP
0,2018-01-01,2018-01-08 17:21:10,5972116,12,119,2693211,,,48030,48012,48030,48012
1,2018-01-01,2018-01-08 17:21:10,5966674,12,119,2693267,,,54001,54023,54001,54023
2,2018-01-01,2018-01-08 17:21:10,5959105,12,119,2693263,,,60001,59955,60001,59955
3,2018-01-01,2018-01-08 17:21:10,5966888,12,119,2693284,,,58801,58771,58801,58771
4,2018-01-01,2018-01-08 17:21:10,5965960,12,119,2693209,,,56401,56309,56401,56323


In [5]:
df_leavetimes.dtypes

DAYOFSERVICE       datetime64[ns]
LASTUPDATE         datetime64[ns]
TRIPID                   category
PROGRNUMBER              category
STOPPOINTID              category
VEHICLEID                category
JUSTIFICATIONID          category
SUPPRESSED                boolean
PLANNEDTIME_ARR             int64
ACTUALTIME_ARR              int64
PLANNEDTIME_DEP             int64
ACTUALTIME_DEP              int64
dtype: object

<a id="desc_stats_continuous"></a>
## 1.4. Descriptive Statistics for Continuous Features
[Top of section](#data_quality_report)

In [None]:
# Select columns containing continuous data
continuous_columns = df_leavetimes[
    ['DAYOFSERVICE', 'LASTUPDATE',
     'PLANNEDTIME_ARR', 'ACTUALTIME_ARR', 
     'PLANNEDTIME_DEP', 'ACTUALTIME_DEP']
].columns

### 1.4.1. Descriptive Statistics

In [None]:
# print table with descriptive statistics for all the continuous features
continuous_feat_desc = df_leavetimes[continuous_columns].describe(datetime_is_numeric=True).T

print_bold("Descriptive Statistics for Continuous Features")
continuous_feat_desc

> From the min and max for DAYOFSERVICE we can see that the data ranges from the 1st January to 31st December for the year 2018. By looking at the percentiles it also appears that the data is evenly spread throughout the year.

In [None]:
# convert the min ACTUALTIME_ARR and max ACTUALTIME_DEP to HH:MM:SS
print_bold("min ACTUALTIME_ARR and max ACTUALTIME_DEP")
print(datetime.timedelta(seconds=15974))
print(datetime.timedelta(seconds=97177))
# convert the median ACTUALTIME_ARR and ACTUALTIME_DEP to HH:MM:SS
print_bold("median for ACTUALTIME_ARR and ACTUALTIME_DEP")
print(datetime.timedelta(seconds=53511))
print(datetime.timedelta(seconds=53526))

> If we look at the min for ACTUALTIME_ARR and max for ACTUALTIME_DEP and convert the seconds to time we can see that the leave times range from 04:26 to 02:59 the following day. If we look at the median value for ACTUALTIME_ARR and ACTUALTIME_DEP it appears that the data is evenly spread throughout this range.
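
Because the times are stored as seconds and can run past 24 hours, a small helper makes the conversion reusable instead of hard-coding each value. `seconds_to_clock` is a hypothetical name used only for this sketch.

```python
import datetime

def seconds_to_clock(seconds):
    """Format a seconds count as H:MM:SS, noting when it runs into the following day."""
    days, remainder = divmod(int(seconds), 86400)
    clock = str(datetime.timedelta(seconds=remainder))
    return f"{clock} (+{days} day)" if days else clock

print(seconds_to_clock(15974))  # min ACTUALTIME_ARR quoted above
print(seconds_to_clock(97177))  # max ACTUALTIME_DEP quoted above
```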

### 1.4.2. Range of Values

In [None]:
# for each continuous feature display the range of values and the number of instances for the 15 most common values
for feat in continuous_columns:
    
    print_bold(feat)
    print("Range {} is: {}" .format( feat, (df_leavetimes[feat].max() - df_leavetimes[feat].min()) ) )
    print("-"*10)
    
    print_bold("{0:10.5} {1}" .format("Value", "Number of Instances") )
    print(df_leavetimes[feat].value_counts().nlargest(15), "\n\n\n")
    print("-"*50)

In [None]:
# check for null values for each continuous feature
print_bold("The sum of null values for Continuous Features")
print("-"*50)
print(df_leavetimes[continuous_columns].isnull().sum())

> From the above we can see that we have no null values for the continuous features.

In [None]:
# print the cardinality of each continuous feature
features_cardinality = list(df_leavetimes[continuous_columns].columns.values)

print_bold("{0:35} {1}" .format("Feature","Cardinality") )
print("{0:35} {1}" .format("-------","-----------") )

for c in features_cardinality:
    print("{0:35} {1}" .format( c, (len(df_leavetimes[c].unique()))) )

> We can see from the cardinality for DAYOFSERVICE that we have almost a full year's worth of data for the leavetimes.

<a id="desc_stats_categorical"></a>
## 1.5. Descriptive Statistics for Categorical Features
[Top of section](#data_quality_report)

In [None]:
# Select columns containing categorical data
categorical_columns = df_leavetimes[
    ['TRIPID', 'PROGRNUMBER', 'STOPPOINTID', 'VEHICLEID', 
     'JUSTIFICATIONID', 'SUPPRESSED']
].columns

In [None]:
# print the descriptive statistics for all the categorical features
categorical_feat_desc = df_leavetimes[categorical_columns].describe().T
categorical_feat_desc

In [None]:
# check for null values for each categorical feature
print_bold("The sum of null values for Categorical Features")
print("-"*50)
print(df_leavetimes[categorical_columns].isnull().sum())

> From the above we can see that there is a very high number of entries with null values for the features JUSTIFICATIONID and SUPPRESSED.

In [None]:
# print the % of null values for JUSTIFICATIONID and SUPPRESSED
print_bold("The percentage of null values for JUSTIFICATIONID and SUPPRESSED")
print("-"*50)
print("{:20}{:.2f}" .format("JUSTIFICATIONID", df_leavetimes['JUSTIFICATIONID'].isnull().sum() / df_leavetimes.shape[0] * 100) )
print("{:20}{:.2f}" .format("SUPPRESSED", df_leavetimes['SUPPRESSED'].isnull().sum() / df_leavetimes.shape[0] * 100) )

> We have an extremely high number of missing values for both the JUSTIFICATIONID and SUPPRESSED features.  
I don't believe JUSTIFICATIONID will prove useful for modelling due to the high number of missing values and the fact that it references data which we do not have.  
If we cannot infer the missing values for SUPPRESSED then I do not believe this feature will prove useful for any predictive modelling.

In [None]:
# frequency table for categorical features
for feat in categorical_columns:
    title = "Frequency Table for " + feat + ":"
    print_bold(title)
    print("-"*50)
    print(df_leavetimes[feat].value_counts(normalize=True) * 100)
    print()

> Nearly three quarters of the non-null values that we have for SUPPRESSED are False, meaning those leave times were achieved. The remaining quarter or so were suppressed.

<a id="plotting_continuous"></a>
## 1.6. Plotting Continuous Features
[Top of section](#data_quality_report)

In [None]:
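# Plot histograms for the four time features (values are in seconds)
time_columns = ['PLANNEDTIME_ARR', 'ACTUALTIME_ARR', 'PLANNEDTIME_DEP', 'ACTUALTIME_DEP']
axes = df_leavetimes[time_columns].hist(layout=(2, 2), figsize=(30,30), bins=12)
# label every subplot; plt.xlabel would only label the last one
for ax in axes.ravel():
    ax.set_xlabel("Time (seconds)")
#plt.savefig('continuous_histograms_1-1.pdf')
plt.show()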

> From the plots above we can see that we have an increase in the number of entries where the arrival and departure times are around 30,000 and 60,000-70,000 seconds.

In [None]:
# convert seconds to HH:MM:SS
print_bold("{:10}{}" .format("Seconds", "HH:MM:SS"))
print("{:<10}{}" .format(30000, datetime.timedelta(seconds=30000)))
print("{:<10}{}" .format(60000, datetime.timedelta(seconds=60000)))
print("{:<10}{}" .format(70000, datetime.timedelta(seconds=70000)))

> These times roughly correspond to rush hour.

<a id="plotting_categorical"></a>
## 1.7. Plotting Categorical Features
[Top of section](#data_quality_report)

> The only categorical feature I see any purpose in plotting is SUPPRESSED. I do not believe there is anything to gain from plotting the ID numbers or the sequential stop number (PROGRNUMBER).

In [None]:
# plot the feature SUPPRESSED
f = df_leavetimes['SUPPRESSED'].value_counts().plot(kind='bar', figsize=(10,10))
plt.show()

<a id="logical_integrity_check"></a>
## 1.8. Logical Integrity Check
[Top of section](#data_quality_report)

### 1.8.1. Check 1: Check that LASTUPDATE >= DAYOFSERVICE

In [None]:
# create a dataframe for check 1
check_1 = df_leavetimes[['DAYOFSERVICE','LASTUPDATE']][df_leavetimes['LASTUPDATE']<df_leavetimes['DAYOFSERVICE']]
print(f"Number of rows failing the check: {check_1.shape[0]}")
check_1

### 1.8.2. Check 2: Check that ACTUALTIME_DEP >= ACTUALTIME_ARR

In [6]:
# create a dataframe for check 2
check_2 = df_leavetimes[['ACTUALTIME_ARR','ACTUALTIME_DEP']][df_leavetimes['ACTUALTIME_DEP']<df_leavetimes['ACTUALTIME_ARR']]
print(f"Number of rows failing the check: {check_2.shape[0]}")
check_2

Number of rows failing the check: 3


Unnamed: 0,ACTUALTIME_ARR,ACTUALTIME_DEP
74759275,46208,46207
81013777,39260,39258
86217868,71991,71989


In [None]:
# print the full rows that fail check 2
df_leavetimes.iloc[[74759275,81013777,86217868]]

<a id="data_quality_plan"></a>
# 2. Data Quality Plan
[Back to contents](#contents)

<a id="the_plan"></a>
## 2.1. The Plan
[Top of section](#data_quality_plan)

| Variable Name           | Data Quality Issue | Handling Strategy      |
| :---------------------- | :----------------- | :--------------------- |
| DAYOFSERVICE            | none               | do nothing             |
| LASTUPDATE              | irrelevant for modelling| drop feature      |
| TRIPID                  | none               | do nothing             |
| PROGRNUMBER             | none               | do nothing             |
| STOPPOINTID             | none               | do nothing             |
| VEHICLEID               | none               | do nothing             |
| JUSTIFICATIONID         | 99.5% missing values| drop feature          |
| SUPPRESSED              | 99.5% missing values| drop feature          |
| PLANNEDTIME_ARR         | none               | do nothing             |
| ACTUALTIME_ARR          | logical integrity  | swap with ACTUALTIME_DEP    |
| PLANNEDTIME_DEP         | none               | do nothing             |
| ACTUALTIME_DEP          | logical integrity  | swap with ACTUALTIME_ARR    |

<a id="applying_solutions"></a>
## 2.2. Applying the Solutions
[Top of section](#data_quality_plan)

### 2.2.1. Drop Features: LASTUPDATE, JUSTIFICATIONID, and SUPPRESSED

In [7]:
# drop features LASTUPDATE, JUSTIFICATIONID, and SUPPRESSED
df_leavetimes = df_leavetimes.drop(columns=['LASTUPDATE','JUSTIFICATIONID','SUPPRESSED'])

### 2.2.2. Swap illogical rows from Logical Integrity Check 2

In [8]:
# print rows failing logical integrity check
df_leavetimes[['ACTUALTIME_ARR','ACTUALTIME_DEP']].iloc[[74759275,81013777,86217868]]

Unnamed: 0,ACTUALTIME_ARR,ACTUALTIME_DEP
74759275,46208,46207
81013777,39260,39258
86217868,71991,71989


In [9]:
# swap ACTUALTIME_ARR and ACTUALTIME_DEP values for illogical rows
# .loc assigns on the dataframe itself, avoiding the chained assignment that raises SettingWithCopyWarning
swap_cols = ['ACTUALTIME_ARR', 'ACTUALTIME_DEP']
df_leavetimes.loc[check_2.index, swap_cols] = check_2[['ACTUALTIME_DEP', 'ACTUALTIME_ARR']].to_numpy()


In [10]:
# check values have been swapped
df_leavetimes[['ACTUALTIME_ARR','ACTUALTIME_DEP']].iloc[[74759275,81013777,86217868]]

Unnamed: 0,ACTUALTIME_ARR,ACTUALTIME_DEP
74759275,46207,46208
81013777,39258,39260
86217868,71989,71991
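
The swap can also be written with a boolean mask instead of stored row labels, so it does not depend on `check_2` still being in memory. The sketch below runs on a toy dataframe built from the three failing value pairs plus one valid row; it is an illustration, not the code that produced the output above.

```python
import pandas as pd

toy = pd.DataFrame({
    "ACTUALTIME_ARR": [46208, 39258, 71991],
    "ACTUALTIME_DEP": [46207, 39260, 71989],
})

cols = ["ACTUALTIME_ARR", "ACTUALTIME_DEP"]
bad = toy["ACTUALTIME_DEP"] < toy["ACTUALTIME_ARR"]

# Reverse the column order on the right-hand side; .to_numpy() stops pandas
# from re-aligning the values back to their original column names.
toy.loc[bad, cols] = toy.loc[bad, cols[::-1]].to_numpy()
print(toy)
```

Rows that already pass the check (the middle row here) are left untouched.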


***

## Save Cleaned Data Frame - 2

In [11]:
# save dataframe to csv
df_leavetimes.to_csv('/home/faye/data/leavetimes_cleaned_2.csv')

OSError: [Errno 28] No space left on device

In [None]:
# save dataframe to feather
# to_feather has no index argument and needs a default index, so reset it before writing
df_leavetimes.reset_index(drop=True).to_feather('/home/faye/data/leavetimes_cleaned_2.feather')

***

# Load Cleaned Data (if needed)

In [12]:
# load cleaned data from csv file
df_leavetimes = pd.read_csv('/home/faye/data/leavetimes_cleaned_2.csv')

In [2]:
# load in cleaned data from feather file
df_leavetimes = pd.read_feather('/home/faye/data/leavetimes_cleaned_2.feather')

In [3]:
df_leavetimes.dtypes

DAYOFSERVICE       object
TRIPID              int64
PROGRNUMBER         int64
STOPPOINTID         int64
VEHICLEID           int64
PLANNEDTIME_ARR     int64
ACTUALTIME_ARR      int64
PLANNEDTIME_DEP     int64
ACTUALTIME_DEP      int64
dtype: object

In [4]:
df_leavetimes['DAYOFSERVICE'] = pd.to_datetime(df_leavetimes['DAYOFSERVICE'])

In [5]:
cat_cols = ['TRIPID','PROGRNUMBER','STOPPOINTID','VEHICLEID']
for col in cat_cols:
    df_leavetimes[col] = df_leavetimes[col].astype('category')

In [16]:
df_leavetimes.keys()

Index(['Unnamed: 0', 'DAYOFSERVICE', 'TRIPID', 'PROGRNUMBER', 'STOPPOINTID',
       'VEHICLEID', 'PLANNEDTIME_ARR', 'ACTUALTIME_ARR', 'PLANNEDTIME_DEP',
       'ACTUALTIME_DEP'],
      dtype='object')

In [20]:
df_leavetimes = df_leavetimes.drop(columns=['Unnamed: 0'])
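
The dtype fixes and the extra `Unnamed: 0` column above are side effects of the CSV round trip: `to_csv()` writes the index as an unnamed first column, and a CSV stores no type information. A sketch of restoring most of this at read time, using a small in-memory stand-in for the saved file (values are illustrative):

```python
import io
import pandas as pd

# Stand-in for a CSV written by to_csv() with its default index column.
csv_text = io.StringIO(
    ",DAYOFSERVICE,TRIPID,PROGRNUMBER,ACTUALTIME_ARR\n"
    "0,2018-01-01,5972116,12,48012\n"
    "1,2018-01-01,5966674,12,54023\n"
)

# index_col=0 reuses the saved index instead of producing 'Unnamed: 0';
# parse_dates restores the datetime column.
df_toy = pd.read_csv(csv_text, index_col=0, parse_dates=["DAYOFSERVICE"])

# ID columns still need converting back to category after loading.
for col in ["TRIPID", "PROGRNUMBER"]:
    df_toy[col] = df_toy[col].astype("category")

print(df_toy.dtypes)
```

Writing with `to_csv(path, index=False)` in the first place would avoid the extra column altogether.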

In [6]:
df_leavetimes.dtypes

DAYOFSERVICE       datetime64[ns]
TRIPID                   category
PROGRNUMBER              category
STOPPOINTID              category
VEHICLEID                category
PLANNEDTIME_ARR             int64
ACTUALTIME_ARR              int64
PLANNEDTIME_DEP             int64
ACTUALTIME_DEP              int64
dtype: object

***

<a id="transforming_and_extending"></a>
# 3. Transforming and Extending Features
[Back to contents](#contents)

New features to add:
1. MONTH
2. DAYOFWEEK
3. DWELLTIME
4. PLANNEDTIME_TRAVEL
5. ACTUALTIME_TRAVEL
6. IS_HOLIDAY
7. IS_WEEKDAY

<a id="add_month"></a>
## 3.1. Add feature: MONTH
[Top of section](#transforming_and_extending)

In [7]:
# add feature MONTHOFSERVICE
df_leavetimes['MONTHOFSERVICE'] = df_leavetimes['DAYOFSERVICE'].dt.strftime('%B')

In [8]:
# save cleaned data in feather file
df_leavetimes.to_feather('/home/faye/data/leavetimes_1_extra_features.feather')

<a id="add_dayofweek"></a>
## 3.2. Add feature: DAYOFWEEK
[Top of section](#transforming_and_extending)

In [9]:
# add feature DAYOFWEEK (vectorised equivalent of looking up calendar.day_name row by row)
df_leavetimes['DAYOFWEEK'] = df_leavetimes['DAYOFSERVICE'].dt.day_name()

In [10]:
# save cleaned data in feather file
df_leavetimes.to_feather('/home/faye/data/leavetimes_2_extra_features.feather')

<a id="add_dwelltime"></a>
## 3.3. Add feature: DWELLTIME
[Top of section](#transforming_and_extending)

In [11]:
# add feature DWELLTIME
df_leavetimes['DWELLTIME'] = df_leavetimes['ACTUALTIME_DEP'] - df_leavetimes['ACTUALTIME_ARR']

In [12]:
# save cleaned data in feather file
df_leavetimes.to_feather('/home/faye/data/leavetimes_3_extra_features.feather')

***

In [2]:
# read cleaned data in feather file
df_leavetimes = pd.read_feather('/home/faye/data/leavetimes_3_extra_features.feather')

In [3]:
df_leavetimes.dtypes

DAYOFSERVICE       datetime64[ns]
TRIPID                   category
PROGRNUMBER              category
STOPPOINTID              category
VEHICLEID                category
PLANNEDTIME_ARR             int64
ACTUALTIME_ARR              int64
PLANNEDTIME_DEP             int64
ACTUALTIME_DEP              int64
MONTHOFSERVICE             object
DAYOFWEEK                  object
DWELLTIME                   int64
dtype: object

<a id="add_plannedtime_travel"></a>
## 3.4. Add feature: PLANNEDTIME_TRAVEL
[Top of section](#transforming_and_extending)

In [None]:
# load data in feather file
df_leavetimes = pd.read_feather('/home/faye/data/leavetimes_3_extra_features.feather')

In [4]:
# sort sequentially by TRIPID, DAYOFSERVICE, PROGRNUMBER
df_leavetimes = df_leavetimes.sort_values(['TRIPID','DAYOFSERVICE','PROGRNUMBER'])

In [5]:
# calculate planned travel time
df_leavetimes['PLANNEDTIME_TRAVEL'] = (df_leavetimes['PLANNEDTIME_ARR']-df_leavetimes['PLANNEDTIME_DEP'].shift()).fillna(0)
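
One thing to watch with the plain `shift()` above: after sorting, the first stop of each trip sits directly below the last stop of the previous trip, so its travel time is computed against a different trip's departure. Only the very first row of the dataframe gets the `fillna(0)`. A possible variant, sketched here on made-up data, shifts within each (TRIPID, DAYOFSERVICE) group so that every first stop gets 0. The grouping keys are my assumption about what identifies one run of a trip.

```python
import pandas as pd

# Made-up data: trip 1 has three stops, trip 2 has two.
toy = pd.DataFrame({
    "TRIPID": [1, 1, 1, 2, 2],
    "DAYOFSERVICE": pd.to_datetime(["2018-01-01"] * 5),
    "PROGRNUMBER": [1, 2, 3, 1, 2],
    "PLANNEDTIME_ARR": [100, 160, 230, 500, 570],
    "PLANNEDTIME_DEP": [100, 165, 230, 500, 570],
})
toy = toy.sort_values(["TRIPID", "DAYOFSERVICE", "PROGRNUMBER"])

# Previous stop's planned departure, looked up within each (trip, day) only.
prev_dep = toy.groupby(["TRIPID", "DAYOFSERVICE"], observed=True)["PLANNEDTIME_DEP"].shift()
toy["PLANNEDTIME_TRAVEL"] = (toy["PLANNEDTIME_ARR"] - prev_dep).fillna(0)

print(toy["PLANNEDTIME_TRAVEL"].tolist())
```

With the ungrouped shift, the first stop of trip 2 would instead get 500 - 230 = 270 seconds. The same pattern would apply to ACTUALTIME_TRAVEL.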

In [16]:
# save cleaned data in csv file
df_leavetimes.to_csv('/home/faye/data/leavetimes_4_extra_features.csv')

***

In [None]:
df_leavetimes = pd.read_csv('/home/faye/data/leavetimes_4_extra_features.csv')

<a id="add_actualtime_travel"></a>
## 3.5. Add feature: ACTUALTIME_TRAVEL
[Top of section](#transforming_and_extending)

In [6]:
# calculate actual travel time
df_leavetimes['ACTUALTIME_TRAVEL'] = (df_leavetimes['ACTUALTIME_ARR']-df_leavetimes['ACTUALTIME_DEP'].shift()).fillna(0)

In [None]:
# save cleaned data in feather file
df_leavetimes.to_feather('/home/faye/data/leavetimes_5_extra_features.feather')

<a id="add_isholiday"></a>
## 3.6. Add feature: IS_HOLIDAY
[Top of section](#transforming_and_extending)

In [7]:
import holidays

In [8]:
# Getting the Irish holidays - including bank holidays
# https://towardsdatascience.com/5-minute-guide-to-detecting-holidays-in-python-c270f8479387
irish_holidays_2018 = []
for date in holidays.Ireland(years=2018).items():
    irish_holidays_2018.append(str(date[0]))

In [9]:
# add feature IS_HOLIDAY (vectorised: compare the date part against the holiday dates)
holiday_dates = pd.to_datetime(irish_holidays_2018)
df_leavetimes['IS_HOLIDAY'] = df_leavetimes['DAYOFSERVICE'].dt.normalize().isin(holiday_dates).astype('int64')

In [None]:
# save cleaned data in feather file
df_leavetimes.to_feather('/home/faye/data/leavetimes_6_extra_features.feather')

<a id="add_isweekday"></a>
## 3.7. Add feature: IS_WEEKDAY
[Top of section](#transforming_and_extending)

In [10]:
# add feature IS_WEEKDAY (Monday=0 ... Friday=4 count as weekdays)
df_leavetimes['IS_WEEKDAY'] = (df_leavetimes['DAYOFSERVICE'].dt.weekday < 5).astype('int64')

In [None]:
# save cleaned data in feather file
df_leavetimes.to_feather('/home/faye/data/leavetimes_7_extra_features.feather')

***

# Reorder Columns and Save Dataset
[Back to contents](#contents)

In [11]:
df_leavetimes.dtypes

DAYOFSERVICE          datetime64[ns]
TRIPID                      category
PROGRNUMBER                 category
STOPPOINTID                 category
VEHICLEID                   category
PLANNEDTIME_ARR                int64
ACTUALTIME_ARR                 int64
PLANNEDTIME_DEP                int64
ACTUALTIME_DEP                 int64
MONTHOFSERVICE                object
DAYOFWEEK                     object
DWELLTIME                      int64
PLANNEDTIME_TRAVEL           float64
ACTUALTIME_TRAVEL            float64
IS_HOLIDAY                     int64
IS_WEEKDAY                     int64
dtype: object

In [12]:
# reorder columns
reorder_cols = [
    'DAYOFSERVICE', 'DAYOFWEEK', 'MONTHOFSERVICE', 'TRIPID', 'PROGRNUMBER', 'STOPPOINTID', 'VEHICLEID',
    'PLANNEDTIME_ARR', 'ACTUALTIME_ARR', 'PLANNEDTIME_DEP', 'ACTUALTIME_DEP', 'DWELLTIME',
    'PLANNEDTIME_TRAVEL', 'ACTUALTIME_TRAVEL', 'IS_HOLIDAY', 'IS_WEEKDAY'
]

# apply the new column order
df_leavetimes = df_leavetimes[reorder_cols]

In [14]:
# save dataframe to csv
df_leavetimes.to_csv('/home/faye/data/leavetimes_extra_features.csv')

In [None]:
# save dataframe to feather
df_leavetimes.to_feather('/home/faye/data/leavetimes_extra_features.feather')

***

# Save dataframe to DB

In [15]:
import sqlite3
import sqlalchemy

In [None]:
# convert continuous features to int32
continuous_cols = [
    'PLANNEDTIME_ARR', 'ACTUALTIME_ARR', 'PLANNEDTIME_DEP', 'ACTUALTIME_DEP', 
    'DWELLTIME', 'PLANNEDTIME_TRAVEL', 'ACTUALTIME_TRAVEL'
]

for col in continuous_cols:
    df_leavetimes[col] = df_leavetimes[col].astype('int32')
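
The notebook stops before the actual write to the database. A minimal sketch of what that step might look like with `DataFrame.to_sql` and SQLite follows. The table name `leavetimes`, the in-memory database, and the chunk size are all assumptions for illustration; the toy dataframe stands in for `df_leavetimes`.

```python
import sqlite3
import pandas as pd

# Toy stand-in for df_leavetimes (values are illustrative).
toy = pd.DataFrame({
    "DAYOFSERVICE": ["2018-01-01", "2018-01-01"],
    "TRIPID": [5972116, 5966674],
    "ACTUALTIME_ARR": [48012, 54023],
    "DWELLTIME": [0, 0],
})

# In-memory database for illustration; a file path would persist the table.
conn = sqlite3.connect(":memory:")

# chunksize bounds how many rows are written per batch, which matters for a large dataframe.
toy.to_sql("leavetimes", conn, if_exists="replace", index=False, chunksize=1000)

# Read the row count back to confirm the write.
row_count = pd.read_sql("SELECT COUNT(*) AS n FROM leavetimes", conn)["n"].iloc[0]
print(row_count)
conn.close()
```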

***

[Back to top](#top)