### Process dataset.

- read data from pkl format onto a dataframe
- replace null/nan values with median
- remove columns that do not have predictive power
- extract extra features from the datetime column
- remove columns that have data leakage from target
- convert columns with text information to columns with categorical types
- convert object types to integer and float
- target encode some categorical columns
- one hot encode some categorical columns
- save the dataframe for further analysis
- issue: saving to Cleaned_Trips2016_10.csv does not preserve the dtypes of the variables
- solution: save the dataframe in feather format

In [1]:
import sys
import os

current_path = os.path.abspath(os.path.join('..'))
if current_path not in sys.path:
    sys.path.append(current_path)
import pandas as pd
import numpy as np
from helper.imports import *
from helper.structured  import *
from IPython.display import display
import matplotlib.pyplot as plt 
%matplotlib inline



In [2]:
# Helper to display a dataframe without truncating rows or columns.
def display_all(df):
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000): 
        display(df)

# Metadata
<table style="width:100%">
    <tr><td><b>OperationDate</b></td> <td>The operation date</td></tr>
    <tr><td><b>Day Type</b></td> <td>Weekend or Weekday</td></tr>
    <tr><td><b>Vehicle No</b></td> <td>Vehicle unique identifier</td></tr>
    <tr><td><b>Block No</b></td> <td>Block unique identifier</td></tr>
    <tr><td><b>Pattern</b></td> <td>The scheduled trip identifier (from starting station to ending station), line number</td></tr>
    <tr><td><b>Trip</b></td> <td>The scheduled trip identifier (from starting station to ending station), line number, and starting time</td></tr>
    <tr><td><b>Line</b></td> <td>The bus line (route) number, e.g. 010</td></tr>
    <tr><td><b>Timing Point</b></td> <td>A particular bus stop that the bus can stop at so that the driver is not too ahead of schedule.</td></tr>
    <tr><td><b>Min Stop No</b></td> <td>Always going to be 1.</td></tr>
    <tr><td><b>Max Stop No</b></td> <td>Basically just the number of total stops that this particular bus line has.</td></tr>
    <tr><td><b>Stop Name</b></td> <td>The stop name</td></tr>
    <tr><td><b>Arrive Load Compensated</b></td> <td>How many people were calculated to be on the bus at this particular bus stop arrival.</td></tr>
    <tr><td><b>Ons Load Compensated</b></td> <td>How many people got on the bus at this particular bus stop arrival.</td></tr>
    <tr><td><b>Offs Load Compensated</b></td> <td>How many people got off the bus at this particular bus stop arrival.</td></tr>
    <tr><td><b>Leave Load Compensated</b></td> <td>How many people were still on the bus after it leaves the bus stop.</td></tr>
    <tr><td><b>Scheduled Arrive Time</b></td> <td>When the bus was scheduled to arrive. (ignore)</td></tr>
    <tr><td><b>Actual Arrive Time</b></td> <td>When the bus actually arrived. (ignore)</td></tr>
    <tr><td><b>Scheduled Leave Time</b></td> <td>When the bus was scheduled to leave. (ignore)</td></tr>
    <tr><td><b>Actual Leave Time</b></td> <td>When the bus actually left. (ignore)</td></tr>
    <tr><td><b>WC Lift Activated</b></td> <td>Did the wheelchair lift activate (Y/N)</td></tr>
    <tr><td><b>Bike Loaded</b></td> <td>Was a bike added  (Y/N)</td></tr>
    <tr><td><b>Bike Unloaded</b></td> <td>Was a bike unloaded (Y/N)</td></tr>
    <tr><td><b>Dwell Time</b></td> <td>How long the bus stayed at this stop.</td></tr>
    <tr><td><b>Arrive Delay</b></td> <td>How far behind in seconds was the bus arrival according to schedule.</td></tr>
    <tr><td><b>Departure Delay</b></td> <td>How far behind in seconds was the bus departure according to schedule.</td></tr>
    <tr><td><b>Ons and Offs Compensated</b></td> <td>Ons + Offs</td></tr>
    <tr><td><b>Temp</b></td> <td>Temperature</td></tr>
    <tr><td><b>Dew</b></td> <td>Was it a "dewy" kinda day.</td></tr>
    <tr><td><b>Humidity</b></td> <td>What the humidity of that day was. (I assume this is grams of water vapor / cubic meter of air )</td></tr>
    <tr><td><b>Conditions</b></td> <td>Self-explanatory</td></tr>
    <tr><td><b>Visibility</b></td> <td>Not sure what units this uses</td></tr>
    <tr><td><b>Wind Speed</b></td> <td>Wind kilometres / hour?</td></tr>
    <tr><td><b>Stop To Stop Time</b></td> <td>Number of seconds from stop A to stop B.</td></tr>
    <tr><td><b>Travel Time</b></td> <td>Stop to Stop Time + Dwell Time</td></tr>
    <tr><td><b>Leg</b></td> <td>e.g., Leg 2 is the second stop of this overall line</td></tr>
    <tr><td><b>Origin</b></td> <td>Bus stop identifier that was point A in this set.</td></tr>
    <tr><td><b>Origin Lat/Long</b></td> <td>The geographic latitude and longitude for the origin of this trip.</td></tr>
    <tr><td><b>Destination</b></td> <td>Bus stop identifier that was point B in this set.</td></tr>
    <tr><td><b>Destination Lat/Long</b></td> <td>The geographic latitude and longitude for the destination of this trip.</td></tr>
    <tr><td><b>Start.Stop</b></td> <td>A->B</td></tr>
    <tr><td><b>Distance.GglMps</b></td> <td>The distance from A to B in metres according to Google</td></tr>
    <tr><td><b>Duration.GglMps</b></td> <td>The number of seconds that it would take from A to B according to Google</td></tr>
    <tr><td><b>Scheduled Headway</b></td> <td>The recommended number of seconds between this vehicle and the vehicle ahead of it.</td></tr>
    <tr><td><b>Actual Headway</b></td> <td>The actual number of seconds between this vehicle and the vehicle ahead of it.</td></tr>
    <tr><td><b>Headway Within Target</b></td> <td>Is the headway within target (1/0).</td></tr>
    <tr><td><b>Headway Offset</b></td> <td>Scheduled headway minus actual headway.</td></tr>
    <tr><td><b>Bus Bunching Flag</b></td> <td>Is the bus in front of this bus, and too close? (1/0) (also using this column should be considered cheating!).</td></tr>
    <tr><td><b>Bus Gapping Flag</b></td> <td>Is the bus in front of this bus too far away? (1/0).</td></tr>
    <tr><td><b>Trip Pattern Completeness</b></td> <td>The % of completeness of the number of stops visited. (using this column would probably be a bit of information leakage)</td></tr>
    <tr><td><b>Next Leg Bunching Flag</b></td> <td>Is the next trip going to suffer from bus bunching? (B->C)</td></tr>
    <tr><td><b>Next Next Leg Bunching Flag</b></td> <td>Is the trip after the next one going to suffer from bus bunching? (C->D)</td></tr>
    <tr><td><b>Next Next Next Leg Bunching Flag</b></td> <td>Is the trip two after the next one going to suffer from bus bunching? (D->E)</td></tr>

    
</table>

In [3]:
# Read the data from pkl file.
df = pd.read_pickle(f'{current_path}/data/original/Trips2016_10.pkl')

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142640 entries, 0 to 142639
Data columns (total 59 columns):
 #   Column                          Non-Null Count   Dtype 
---  ------                          --------------   ----- 
 0   OperationDate                   142640 non-null  object
 1   DayType                         142640 non-null  object
 2   VehicleNo                       142640 non-null  object
 3   BlockNo                         142640 non-null  object
 4   Pattern                         142640 non-null  object
 5   Trip                            142640 non-null  object
 6   Line                            142640 non-null  object
 7   TimingPoint                     142640 non-null  object
 8   MinStopNo                       142640 non-null  int64 
 9   MaxStopNo                       142640 non-null  int64 
 10  StopName                        142640 non-null  object
 11  ArriveLoadCompensated           142640 non-null  object
 12  OnsLoadCompensated            

#### Change dates to datetime format

In [5]:
df['OperationDate'] = pd.to_datetime(df['OperationDate'])

#### Use a helper function to add new columns to data from the Operation date.

In [6]:
add_datepart(df,'OperationDate')
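
The helper's source is not shown here, so the following is only a minimal sketch of the kind of expansion it appears to perform, judging by the `Operation*` columns in the output further down. The function name `add_date_features` and its `prefix` argument are hypothetical, and only a subset of the generated columns is reproduced.

```python
import pandas as pd

def add_date_features(df, column, prefix):
    """Hypothetical stand-in for add_datepart: expand one date column into parts."""
    dates = pd.to_datetime(df[column])
    df[prefix + "Year"] = dates.dt.year
    df[prefix + "Month"] = dates.dt.month
    df[prefix + "Day"] = dates.dt.day
    df[prefix + "Dayofweek"] = dates.dt.dayofweek  # Monday = 0
    df[prefix + "Dayofyear"] = dates.dt.dayofyear
    df[prefix + "Is_month_start"] = dates.dt.is_month_start
    df[prefix + "Is_month_end"] = dates.dt.is_month_end
    # Whole seconds since the Unix epoch.
    epoch = pd.Timestamp("1970-01-01")
    df[prefix + "Elapsed"] = (dates - epoch) // pd.Timedelta(seconds=1)
    # The original date column is dropped once its parts are extracted.
    return df.drop(columns=[column])

sample = pd.DataFrame({"OperationDate": ["2016-09-01", "2016-12-31"]})
sample = add_date_features(sample, "OperationDate", "Operation")
print(sample.T)
```

For 2016-09-01 this gives day of week 3, day of year 245 and an elapsed value of 1472688000, which agrees with the first rows of the displayed dataframe.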

In [7]:
display_all(df)
#df.head(3).transpose()

Unnamed: 0,DayType,VehicleNo,Week,BlockNo,Pattern,Trip,Line,TimingPoint,MinStopNo,MaxStopNo,StopName,ArriveLoadCompensated,OnsLoadCompensated,OffsLoadCompensated,LeaveLoadCompensated,ScheduledArriveTime,ActualArriveTime,ScheduledLeaveTime,ActualLeaveTime,WCLiftActivated,BikeLoaded,BikeUnloaded,DwellTime,ArriveDelay,DepartureDelay,OnsAndOffsCompensated,Temp,Dew,Humidity,Conditions,Visibility,Wind.Speed,StopToStopTime,TravelTime,TripLeg,Origin,OriginLat,OriginLong,Destination,GPSLat,GPSLong,Start.Stop,Distance.GglMps,Duration.GglMps,ScheduledHeadway,ActualHeadway,HeadwayWithinTarget,HeadwayOffset,BusBunchingFlag,BusGappingFlag,TripPatternCompleteness,StartScheduledLeaveTimeSeconds,StartActualLeaveTimeSeconds,StopScheduledArriveTiemSeconds,StopActualArriveTimeSeconds,TripLegInt,NextLegBunchingFlag,NextNextLegBunchingFlag,NextNextNextLegBunchingFlag,OperationYear,OperationMonth,OperationDay,OperationDayofweek,OperationDayofyear,OperationIs_month_end,OperationIs_month_start,OperationIs_quarter_end,OperationIs_quarter_start,OperationIs_year_end,OperationIs_year_start,OperationElapsed
0,Mon-Fri,2553,35,34132,010 - NB1DT,010 - NB1DT - 04:53,010,N,1,63,WB SW MARINE DR AT HEATHER ST,1,0,0,1,04:55:34,04:54:41,04:55:34,04:54:41,N,N,N,0,42,5,0,12.9,FALSE,96,Mainly Clear,48.3,10,35,35,2,55896,49.20914,-123.11976,55670,49.20741,-123.12395,55896 -> 55670,361,37,600,518,1,82,0,0,100,17662,17657,17734,17692,2,0,0,0,2016,9,1,3,245,False,True,False,False,False,False,1472688000
1,Mon-Fri,2553,35,34132,010 - NB1DT,010 - NB1DT - 04:53,010,N,1,63,WB SW MARINE DR AT LAUREL ST,1,1,0,2,04:56:24,04:55:15,04:56:24,04:55:23,N,N,N,0,71,42,1,12.9,FALSE,96,Mainly Clear,48.3,10,21,21,3,55670,49.20741,-123.12395,55898,49.20618,-123.12682,55670 -> 55898,250,24,600,514,1,86,0,0,100,17734,17692,17784,17713,3,0,0,0,2016,9,1,3,245,False,True,False,False,False,False,1472688000
2,Mon-Fri,2553,35,34132,010 - NB1DT,010 - NB1DT - 04:53,010,N,1,63,WB SW MARINE DR AT OAK ST,2,0,0,2,04:57:37,04:56:01,04:57:37,04:56:01,N,N,N,13,81,58,0,12.9,FALSE,96,Mainly Clear,48.3,10,63,50,4,55898,49.20618,-123.12682,54736,49.20453,-123.13107,55898 -> 54736,359,37,600,525,1,75,0,0,100,17784,17726,17857,17776,4,0,0,0,2016,9,1,3,245,False,True,False,False,False,False,1472688000
3,Mon-Fri,2553,35,34132,010 - NB1DT,010 - NB1DT - 04:53,010,N,1,63,WB SW MARINE DR AT GRANVILLE ST,-1,0,0,-1,04:59:54,04:59:47,04:59:54,04:59:47,N,N,N,0,-9,-22,0,12.9,FALSE,96,Mainly Clear,48.3,10,22,22,8,54681,49.20417,-123.13726,54682,49.20505,-123.13988,54681 -> 54682,214,22,600,531,1,69,0,0,100,17959,17981,17994,18003,8,0,0,0,2016,9,1,3,245,False,True,False,False,False,False,1472688000
4,Mon-Fri,2553,35,34132,010 - NB1DT,010 - NB1DT - 04:53,010,N,1,63,NB GRANVILLE ST AT W 71 AVE,-1,2,0,1,05:00:41,05:00:33,05:00:41,05:00:54,N,N,N,0,8,-9,2,13.1,FALSE,96,Mainly Clear,48.3,7,30,30,9,54682,49.20505,-123.13988,54692,49.20743,-123.14041,54682 -> 54692,289,43,600,535,1,65,0,0,100,17994,18003,18041,18033,9,0,0,0,2016,9,1,3,245,False,True,False,False,False,False,1472688000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
142635,Saturday,2562,52,21633,010 - SB1DT,010 - SB1DT - 15:02,010,N,1,63,EB SW MARINE DR AT MONTCALM ST,17,0,4,13,15:46:53,15:47:17,15:46:53,15:47:34,N,N,N,0,-25,-49,-4,0.6,FALSE,97,"""Rain,Snow,Fog""",4,18,50,50,52,54545,49.20573,-123.14077,54547,49.20359,-123.13671,54545 -> 54547,387,39,900,1064,1,-164,0,0,100,56739,56788,56813,56838,52,0,0,0,2016,12,31,5,366,True,False,True,False,True,False,1483142400
142636,Saturday,2562,52,21633,010 - SB1DT,010 - SB1DT - 15:02,010,N,1,63,EB SW MARINE DR AT LAUREL ST,19,0,0,19,15:50:33,15:51:22,15:50:33,15:51:22,N,N,N,64,-53,-105,0,0.6,FALSE,97,"""Rain,Snow,Fog""",4,18,114,50,56,54664,49.20435,-123.13068,55818,49.20644,-123.1257,54664 -> 55818,431,42,900,1096,0,-196,0,1,100,56931,57036,57033,57086,56,0,0,0,2016,12,31,5,366,True,False,True,False,True,False,1483142400
142637,Saturday,2562,52,21633,010 - SB1DT,010 - SB1DT - 15:02,010,N,1,63,EB SW MARINE DR AT HEATHER ST,19,5,2,22,15:51:10,15:51:41,15:51:10,15:51:56,N,N,N,0,-29,-53,3,0.6,FALSE,97,"""Rain,Snow,Fog""",4,18,13,13,57,55818,49.20644,-123.1257,55815,49.20718,-123.12387,55818 -> 55815,158,15,900,1077,1,-177,0,0,100,57033,57086,57070,57099,57,0,0,0,2016,12,31,5,366,True,False,True,False,True,False,1483142400
142638,Saturday,2562,52,21633,010 - SB1DT,010 - SB1DT - 15:02,010,N,1,63,EB SW MARINE DR AT W 70 AVE,22,0,3,19,15:52:14,15:52:32,15:52:14,15:52:47,N,N,N,21,-20,-50,-3,0.6,FALSE,97,"""Rain,Snow,Fog""",4,18,55,34,58,55815,49.20718,-123.12387,55816,49.2085,-123.12075,55815 -> 55816,270,27,900,1089,0,-189,0,1,100,57070,57120,57134,57154,58,0,0,0,2016,12,31,5,366,True,False,True,False,True,False,1483142400


### Data Preprocessing

In [8]:
# Check for NaN values in the data.
df.isnull().sum()

DayType                      0
VehicleNo                    0
Week                         0
BlockNo                      0
Pattern                      0
                            ..
OperationIs_quarter_end      0
OperationIs_quarter_start    0
OperationIs_year_end         0
OperationIs_year_start       0
OperationElapsed             0
Length: 71, dtype: int64

In [9]:
# Drop every row that contains a NaN value.
df = df.dropna()

#### Drop the Dew column
The column holds a single value ('FALSE'), so it has no predictive power.

In [10]:
df.Dew.unique()

array(['FALSE'], dtype=object)
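
A column with a single distinct value cannot help a model. Rather than checking columns one at a time, a quick scan can list all of them. This is a sketch on toy data; the `toy` frame and the `constant_cols` name are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy frame mimicking two constant columns and one informative column.
toy = pd.DataFrame({
    "Dew": ["FALSE", "FALSE", "FALSE"],
    "MinStopNo": [1, 1, 1],
    "Temp": [12.9, 13.1, 0.6],
})

# A column is constant when it has at most one distinct value (NaN counted as a value).
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) <= 1]
print(constant_cols)
```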

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 138752 entries, 0 to 142639
Data columns (total 71 columns):
 #   Column                          Non-Null Count   Dtype 
---  ------                          --------------   ----- 
 0   DayType                         138752 non-null  object
 1   VehicleNo                       138752 non-null  object
 2   Week                            138752 non-null  UInt32
 3   BlockNo                         138752 non-null  object
 4   Pattern                         138752 non-null  object
 5   Trip                            138752 non-null  object
 6   Line                            138752 non-null  object
 7   TimingPoint                     138752 non-null  object
 8   MinStopNo                       138752 non-null  int64 
 9   MaxStopNo                       138752 non-null  int64 
 10  StopName                        138752 non-null  object
 11  ArriveLoadCompensated           138752 non-null  object
 12  OnsLoadCompensated            

In [12]:
# Drop the columns that are redundant, carry no information, or leak the target.
drop_cols = ['Line', 'Trip', 'TimingPoint', 'MinStopNo', 'MaxStopNo',
             'ScheduledArriveTime', 'ActualArriveTime', 'ScheduledLeaveTime', 'ActualLeaveTime',
             'Origin', 'Dew', 'Destination', 'Start.Stop', 'BusBunchingFlag',
             'TripPatternCompleteness', 'TripLegInt']
df = df.drop(drop_cols, axis=1)

In [13]:
# Change dtype from object to float.
float_cols = ['Temp', 'Visibility', 'OriginLat', 'OriginLong', 'GPSLat', 'GPSLong']
df[float_cols] = df[float_cols].astype('float64')

In [14]:
# Change dtype from object to integer and float. The ScheduledHeadway, ActualHeadway and
# HeadwayOffset columns contain 'NULL' strings, so pd.to_numeric is used with errors='coerce',
# which turns every value that cannot be parsed into NaN. The NaNs are handled in a later cell.
featnum = ['VehicleNo', 'BlockNo', 'ArriveLoadCompensated', 'OnsLoadCompensated', 'OffsLoadCompensated',
           'LeaveLoadCompensated', 'OnsAndOffsCompensated', 'DwellTime', 'TravelTime', 'Humidity', 'Wind.Speed',
           'TripLeg', 'ScheduledHeadway', 'ActualHeadway', 'HeadwayOffset']
for col in featnum:
    df[col] = pd.to_numeric(df[col], errors='coerce')
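
To make the effect of `errors='coerce'` concrete, here is a small sketch on a made-up series (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# A string column where one entry is the literal text 'NULL'.
raw = pd.Series(["600", "518", "NULL", "82"], name="ScheduledHeadway")

# Unparseable entries become NaN instead of raising an error.
numeric = pd.to_numeric(raw, errors="coerce")
print(numeric)
print("NaN count:", numeric.isna().sum())
```

Because NaN is a float, the resulting column is `float64` for as long as it contains missing values.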

In [15]:
df.isnull().sum()

DayType                              0
VehicleNo                            0
Week                                 0
BlockNo                              0
Pattern                              0
StopName                             0
ArriveLoadCompensated                0
OnsLoadCompensated                   0
OffsLoadCompensated                  0
LeaveLoadCompensated                 0
WCLiftActivated                      0
BikeLoaded                           0
BikeUnloaded                         0
DwellTime                            0
ArriveDelay                          0
DepartureDelay                       0
OnsAndOffsCompensated                0
Temp                                 0
Humidity                             0
Conditions                           0
Visibility                           0
Wind.Speed                           0
StopToStopTime                       0
TravelTime                           0
TripLeg                              0
OriginLat                

In [16]:
# Replace the NaN values with the median of each column.
for col in ['ScheduledHeadway','ActualHeadway','HeadwayOffset']:
    df[col] = df[col].fillna(df[col].median())
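
A minimal sketch of median imputation on toy data. The extra `ActualHeadway_na` indicator column is an assumption added for illustration (the notebook does not create it); it records which rows were imputed in case that information is useful later.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"ActualHeadway": [518.0, np.nan, 514.0, 535.0]})

# Optional: remember which rows were missing before filling them.
toy["ActualHeadway_na"] = toy["ActualHeadway"].isna()

# median() skips NaN by default, so it is computed from the observed values only.
median = toy["ActualHeadway"].median()
toy["ActualHeadway"] = toy["ActualHeadway"].fillna(median)
print(toy)
```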

In [17]:
df.dtypes

DayType                            object
VehicleNo                           int64
Week                               UInt32
BlockNo                             int64
Pattern                            object
StopName                           object
ArriveLoadCompensated               int64
OnsLoadCompensated                  int64
OffsLoadCompensated                 int64
LeaveLoadCompensated                int64
WCLiftActivated                    object
BikeLoaded                         object
BikeUnloaded                       object
DwellTime                           int64
ArriveDelay                         int64
DepartureDelay                      int64
OnsAndOffsCompensated               int64
Temp                              float64
Humidity                            int64
Conditions                         object
Visibility                        float64
Wind.Speed                          int64
StopToStopTime                      int64
TravelTime                        

#### Convert all the string columns to categorical values.

In [18]:
train_cats(df)
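
The source of `train_cats` is not shown here. Judging by the ordered `category` dtypes in the later outputs, it behaves roughly like the sketch below, where `to_categories` is a hypothetical stand-in rather than the helper itself.

```python
import pandas as pd

def to_categories(df):
    """Hypothetical stand-in for train_cats: string columns become ordered categoricals, in place."""
    for name in df.columns:
        if pd.api.types.is_string_dtype(df[name]):
            df[name] = df[name].astype("category").cat.as_ordered()

toy = pd.DataFrame({
    "DayType": ["Mon-Fri", "Saturday", "Mon-Fri"],
    "DwellTime": [0, 13, 0],
})
to_categories(toy)
print(toy.dtypes)
print(toy["DayType"].cat.codes.tolist())  # integer code behind each category
```

Numeric columns are left untouched; only the string columns are converted.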

#### Target encode the StopName column
Use the category_encoders package.

In [19]:
df.StopName.unique()

['WB SW MARINE DR AT HEATHER ST', 'WB SW MARINE DR AT LAUREL ST', 'WB SW MARINE DR AT OAK ST', 'WB SW MARINE DR AT GRANVILLE ST', 'NB GRANVILLE ST AT W 71 AVE', ..., 'WB W PENDER ST AT GRANVILLE ST', 'SB HOWE ST AT W GEORGIA ST', 'SB HOWE ST AT ROBSON ST', 'SB HOWE ST AT NELSON ST', 'SB HOWE ST AT DAVIE ST']
Length: 83
Categories (83, object): ['EB SW MARINE DR AT ASH ST' < 'EB SW MARINE DR AT HEATHER ST' < 'EB SW MARINE DR AT LAUREL ST' < 'EB SW MARINE DR AT MONTCALM ST' ... 'WB SW MARINE DR AT HEATHER ST' < 'WB SW MARINE DR AT LAUREL ST' < 'WB SW MARINE DR AT OAK ST' < 'WB W PENDER ST AT GRANVILLE ST']

In [20]:
from category_encoders.target_encoder import TargetEncoder
df = df.reset_index()  # TargetEncoder appears to need a gap-free index (dropna left gaps); this also adds an 'index' column
targetfeatures = ['StopName']
encoder = TargetEncoder(cols=targetfeatures)
encoder.fit(df, df['NextLegBunchingFlag'])
df_encoded = encoder.transform(df, df['NextLegBunchingFlag'])



In [21]:
df_encoded.StopName.unique()

array([0.01992, 0.02408, 0.     , 0.03918, 0.04541, 0.051  , 0.03476, 0.05744, 0.06434, 0.06287, 0.06133,
       0.06093, 0.06388, 0.04554, 0.05042, 0.05093, 0.04949, 0.04552, 0.04941, 0.04846, 0.0514 , 0.05284,
       0.04851, 0.0394 , 0.04049, 0.04432, 0.04379, 0.04386, 0.04561, 0.25   , 0.01695, 0.02564, 0.01709,
       0.05033, 0.049  , 0.05329, 0.05647, 0.05755, 0.03913, 0.04573, 0.04858, 0.0464 , 0.04757, 0.05203,
       0.0509 , 0.052  , 0.05528, 0.05747, 0.05911, 0.06076, 0.04324, 0.04543, 0.04874, 0.04934, 0.05101,
       0.04263, 0.05425, 0.05531, 0.05702, 0.05924, 0.06096, 0.05937, 0.07206, 0.06827, 0.07219, 0.06995,
       0.01786])
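
Each encoded value is, roughly, the share of rows at that stop whose target flag is 1. The sketch below shows the plain, unsmoothed version of the idea using a pandas group-by on toy data. It is not the `TargetEncoder` implementation, which as far as I know also blends each category mean with the global mean, so its numbers will differ for rare categories.

```python
import pandas as pd

toy = pd.DataFrame({
    "StopName": ["A", "A", "A", "A", "B", "B", "C"],
    "NextLegBunchingFlag": [0, 0, 0, 1, 0, 1, 0],
})

# Mean of the target per category: the unsmoothed target encoding.
stop_means = toy.groupby("StopName")["NextLegBunchingFlag"].mean()

# Replace each stop name with the mean target observed for that stop.
toy["StopName_encoded"] = toy["StopName"].map(stop_means)
print(toy)
```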

#### One hot encode the DayType column
- Use pandas get_dummies function

In [22]:
df_encoded.DayType.unique()

['Mon-Fri', 'Saturday', 'Sunday/Holidays', 'XMAS Day', 'Boxing Day']
Categories (5, object): ['Boxing Day' < 'Mon-Fri' < 'Saturday' < 'Sunday/Holidays' < 'XMAS Day']

In [23]:
temp = pd.get_dummies(df_encoded['DayType'])
temp.head()

Unnamed: 0,Boxing Day,Mon-Fri,Saturday,Sunday/Holidays,XMAS Day
0,0,1,0,0,0
1,0,1,0,0,0
2,0,1,0,0,0
3,0,1,0,0,0
4,0,1,0,0,0


In [24]:
frames = [df_encoded, temp]
df_encoded = pd.concat(frames,axis=1, join='outer', ignore_index=False)

In [25]:
df_encoded = df_encoded.drop('DayType',axis=1)
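
As an alternative sketch, the three steps above (build the dummies, concatenate, drop the original column) can be done in one `get_dummies` call by passing `columns=`. The `prefix` used here is an assumption: it yields column names such as `DayType_Saturday` instead of the bare `Saturday` produced above.

```python
import pandas as pd

toy = pd.DataFrame({
    "DayType": ["Mon-Fri", "Saturday", "Mon-Fri"],
    "DwellTime": [0, 13, 0],
})

# columns= encodes only the listed columns and removes the originals.
encoded = pd.get_dummies(toy, columns=["DayType"], prefix="DayType", dtype=int)
print(encoded)
```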

In [26]:
df_encoded.Week.unique()

<IntegerArray>
[35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52]
Length: 18, dtype: UInt32

In [27]:
display_all(df_encoded.dtypes)

index                                int64
VehicleNo                            int64
Week                                UInt32
BlockNo                              int64
Pattern                           category
StopName                           float64
ArriveLoadCompensated                int64
OnsLoadCompensated                   int64
OffsLoadCompensated                  int64
LeaveLoadCompensated                 int64
WCLiftActivated                   category
BikeLoaded                        category
BikeUnloaded                      category
DwellTime                            int64
ArriveDelay                          int64
DepartureDelay                       int64
OnsAndOffsCompensated                int64
Temp                               float64
Humidity                             int64
Conditions                        category
Visibility                         float64
Wind.Speed                           int64
StopToStopTime                       int64
TravelTime 

### Save the cleaned dataset.
- saving in csv format loses the declared dtypes of the variables
- note that category columns come back as object (and Week comes back as int64 instead of UInt32) when the file is read back.

In [28]:
df_encoded.to_csv(f'{current_path}/data/processed/Cleaned_Trips2016_10.csv', index=False)

In [29]:
df_all = pd.read_csv(f'{current_path}/data/processed/Cleaned_Trips2016_10.csv')

In [30]:
display_all(df_all.dtypes)

index                               int64
VehicleNo                           int64
Week                                int64
BlockNo                             int64
Pattern                            object
StopName                          float64
ArriveLoadCompensated               int64
OnsLoadCompensated                  int64
OffsLoadCompensated                 int64
LeaveLoadCompensated                int64
WCLiftActivated                    object
BikeLoaded                         object
BikeUnloaded                       object
DwellTime                           int64
ArriveDelay                         int64
DepartureDelay                      int64
OnsAndOffsCompensated               int64
Temp                              float64
Humidity                            int64
Conditions                         object
Visibility                        float64
Wind.Speed                          int64
StopToStopTime                      int64
TravelTime                        
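
The dtype loss can be reproduced on a small in-memory example. The sketch also shows one possible workaround if CSV has to be kept: passing an explicit `dtype` mapping to `read_csv`. The toy columns are illustrative, and this approach does not restore the ordering of the categories.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "Pattern": pd.Categorical(["010 - NB1DT", "010 - SB1DT"]),
    "Week": pd.array([35, 52], dtype="UInt32"),
})

# Round-trip through CSV text in memory.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)  # dtypes are inferred again from the text
buffer.seek(0)
typed = pd.read_csv(buffer, dtype={"Pattern": "category", "Week": "UInt32"})

print(plain.dtypes)
print(typed.dtypes)
```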

### Save in feather format.
- feather preserves the column dtypes, so all processing done so far is still valid after the file is read back.

In [31]:
# Use a .feather extension so the feather file does not overwrite the CSV saved above.
df_encoded.to_feather(f'{current_path}/data/processed/Cleaned_Trips2016_10.feather')

In [32]:
df_all = pd.read_feather(f'{current_path}/data/processed/Cleaned_Trips2016_10.feather')

In [33]:
display_all(df_all.dtypes)

index                                int64
VehicleNo                            int64
Week                                UInt32
BlockNo                              int64
Pattern                           category
StopName                           float64
ArriveLoadCompensated                int64
OnsLoadCompensated                   int64
OffsLoadCompensated                  int64
LeaveLoadCompensated                 int64
WCLiftActivated                   category
BikeLoaded                        category
BikeUnloaded                      category
DwellTime                            int64
ArriveDelay                          int64
DepartureDelay                       int64
OnsAndOffsCompensated                int64
Temp                               float64
Humidity                             int64
Conditions                        category
Visibility                         float64
Wind.Speed                           int64
StopToStopTime                       int64
TravelTime 