# Star ratings prediction data preparation:

Task Manager/Author: [Leon Hamnett](https://www.linkedin.com/in/leon-hamnett/) 

## Introduction:

This notebook can be used to load data regarding the irap star rating from csv files, clean the data and then encode the input features in an appropiate fashion and then output a parquet file which can be used for model comparisons/training/evaluation in a different notebook.

This notebook has the following steps:
1. Load the irap star ratings data from the CSV files.
2. Clean the data to leave only rows with valid star ratings for all four categories of star rating.
3. Choose a sample of the data (Optional)
4. Choose columns to include in the star ratings prediction models (This can be decided later during model comparisons , if unsure include all columns).
5. Encode the columns in an appropiate fashion:


    1. Option 1 - Ordinal encoding for all categorical inputs

    2. Option 2 - Apply Ordinal encoding to ordinal inputs and apply target encoding to   all nominal inputs with respect to a specific star rating:
    
    a) Target encoding with respect to Car star rating
    
    b) Target encoding w.r.t Motorcycle star rating
    
    c) Target encoding w.r.t Pedestrian star rating
    
    d) Target encoding w.r.t Bicycle star rating 
    
    3. Option 3- Apply One-hot encoding to all categorical 
    

6. Save the dataset into parquet format for portability and ease of loading

### Summary:

Any required packages should be automatically installed via the !pip commands.
For reference the following packages are used and more information on specific functions can be found within the associated documentation:

pandas - package for data manipulation and data analysis [docs](https://pandas.pydata.org/docs/) 

numpy - package for numerical and mathematical operations over arrays of values 
[docs](https://numpy.org/doc/stable/reference/index.html)

sklearn - package that provides a large number of useful machine learning processes and models [docs](https://scikit-learn.org/stable/modules/classes.html) 

dask - package that allows distributed/multi-threaded data manipulation and machine learning. Mainly used for data manipulation and a few functions which operate smoother than the sklearn counterparts. [docs](https://docs.dask.org/en/latest/) 

cython - wrapper that converts python code into c++ code. Used to reduce the time needed to process significant volumes of data. [docs](https://cython.readthedocs.io/en/latest/)

## Stage 1: Load star ratings data from CSV(s):

In [29]:
#install relevant libraries:
!pip install pandas
!pip install numpy
!pip install "dask[complete]"
!pip install Cython



In [1]:
#import libaries
import pandas as pd
import numpy as np
from dask import dataframe as dd
from dask.diagnostics import ProgressBar


In [31]:
#Cython - Cython package is needed but cannot be directly imported as it only uses python magic commands
#enable cython for faster processing
%reload_ext Cython

In [32]:
#list of column names to import from csvs
important_columns = ['car_star_rating_star', 'motorcycle_star_rating_star', 'pedestrian_star_rating_star', 'bicycle_star_rating_star', 'vehicle_flow', 'motorcycle_percent', 'ped_peak_hour_flow_across', 'ped_peak_hour_flow_along_driver_side', 'ped_peak_hour_flow_along_passenger_side', 'bicycle_peak_hour_flow', 'carriageway', 'intersection_type', 'sidewalk_driver_side', 'sidewalk_passenger_side', 'facilities_for_bicycles', 'paved_shoulder_driver_side', 'paved_shoulder_passenger_side', 'property_access_points', 'roadside_severity_driver_side_object', 'skid_resistance_grip', 'curvature', 'median_type', 'number_of_lanes', 'ped_fencing', 'intersection_quality', 'intersecting_road_volume', 'roadside_severity_driver_side_distance', 'roadside_severity_passenger_side_distance', 'operating_speed_85th_percentile', 'operating_speed_mean', 'speed_limit', 'school_zone_crossing_supervisor', 'grade', 'lane_width', 'road_condition', 'ped_crossing_quality', 'sight_distance', 'quality_of_curve', 'area_type', 'shoulder_rumble_strips', 'street_lighting', 'speed_management', 'intersection_channelisation', 'delineation', 'school_zone_warning', 'upgrade_cost', 'motorcycle_observed_flow', 'bicycle_observed_flow', 'ped_observed_flow_across', 'ped_observed_flow_along_driver_side', 'ped_observed_flow_along_passenger_side', 'land_use_driver_side', 'land_use_passenger_side', 'motorcycle_speed_limit', 'truck_speed_limit', 'roadside_severity_passenger_side_object', 'service_road', 'roadworks', 'differential_speed_limits', 'centreline_rumble_strips', 'ped_crossing_facilities_inspected_road', 'ped_crossing_facilities_intersecting_road', 'vehicle_parking', 'facilities_for_motorised_two_wheelers', 'roads_that_cars_can_read', 'car_star_rating_policy_target', 'motorcycle_star_rating_policy_targer', 'ped_star_rating_policy_target', 'bicycle_star_rating_policy_target', 'annual_fatality_growth_multiplier']

Now we load the star ratings data from the csv files. The csv files should be placed in the same directory that the notebook is being run from. 

In [35]:
#delete
import os
os.chdir('..')
print(os.getcwd())

/home/leon/Documents/datscience practise/Odema - irap/star ratings


In [36]:
#load both csvs into dask dataframe
star_rating_dtypes = {'car_star_rating_star':np.float64, 'motorcycle_star_rating_star':np.float64, 'pedestrian_star_rating_star':np.float64, 'bicycle_star_rating_star':np.float64} #error if dtypes not specified
#create separate dataframes for each csv - can be extended for additional csvs
df1 = dd.read_csv('omdena_301.csv',sep='\t',usecols=important_columns,dtype=np.float64,error_bad_lines=False)
df2 = dd.read_csv('omdena_302.csv',sep='\t',usecols=important_columns,dtype=np.float64,error_bad_lines=False)
#add the rows from the second dataframe into the first to create a dataframe with all rows
df_comb = df1.append(df2)

## Stage 2: Cleaning the data:

Now we have all of the data into a single dataframe, we will clean the data and select only rows that have a valid star ratings value (1,2,3,4,5) for all four star ratings categories (car,motorcycle,pedestrian,bicycle). We use the cython wrapper to turn normal python code into c++ code which is much more efficient and so we can handle the large amount of star ratings data in a much faster and more efficient way.

We define two functions using the cython c++ wrapper:

"good_star" returns a 1 if the star rating value is valid

"good_all" returns a 1 if all star ratings are valid

In [37]:
%%cython
def good_star(self):
    if self in [1.0,2.0,3.0,4.0,5.0]:
        return 1
    else:
        return 0 
    
def good_all(a,b,c,d):
    if (a == 1) & (b == 1) & (c ==1) & (d==1):
        return 1
    else:
        return 0

Now we apply the above functions to the dataset:

In [38]:
# for each star rating, we check if it is good (1-5) and if so assign a 1
# we then perform an AND operation on all four categories, if a 1 is returned that row has "good values" for all four categories and so is useful

# we choose axis = 1 to apply the function over each row
# for each row we create a four new columns each containing a 1 if the star rating is valid
# we then create a new aggregate column that gives a 1 if the 4 other columns have a 1
#the entire dataset is then subset based on the presence of a 1 in this column

df_agg = df_comb.copy()

with ProgressBar():
    df_agg['good_car'] = df_agg.apply(lambda row: good_star(row['car_star_rating_star']),axis=1) 
    print('good car done')
    df_agg['good_mb'] = df_agg.apply(lambda row: good_star(row['motorcycle_star_rating_star']),axis=1)
    print('good mb done')
    df_agg['good_ped']= df_agg.apply(lambda row: good_star(row['pedestrian_star_rating_star']),axis=1)
    print('good ped done')
    df_agg['good_bike'] = df_agg.apply(lambda row: good_star(row['bicycle_star_rating_star']),axis=1)
    print('good bike done')
    df_agg['good_all'] = df_agg.apply(lambda row: good_all(row['good_car'],row['good_mb'],row['good_ped'],row['good_bike']),axis=1)
    print('good all done')
    df_agg_good_sub = df_agg.query('good_all==1') #choose only those rows which have a valid star rating for all four categories
    print('good dataset subset done')

You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
  Before: .apply(func)
  After:  .apply(func, meta=(None, 'int64'))



good car done
good mb done
good ped done
good bike done
good all done
good dataset subset done


In [2]:
#reorder columns and drop the uneeded "good" columns eg good_car,good_bike,good_all etc
df_agg_good_sub = df_agg_good_sub[important_columns]
df_agg_good_sub.columns

NameError: name 'df_agg_good_sub' is not defined

## Stage 3: Choosing a sample of the data (Optional):

Now we have the dataset with only rows that have a valid star rating for all four star rating categories. We can choose a sample of all the data, this stage is optional but significantly speeds up the later parts of the notebook when comparing the performance of different models.



In [40]:
with ProgressBar():
    df_agg_good_sub_samp = df_agg_good_sub.sample(frac=0.1) #choose fraction of all rows, frac = 0.1 = 10% of all rows

In [None]:
#repartition and reset index for easier computing in the following steps
#14min for both
with ProgressBar():
    df_agg_good_sub_samp_repart = df_agg_good_sub_samp.repartition(partition_size='100MB',force=True)
    print('repartitioning: done')
    df_agg_good_sub_samp_repart = df_agg_good_sub_samp_repart.reset_index()
    print('resetting index: done')

## Loading/Saving the Data:

Comment out the cell as required

In [41]:
#saving the data: # change back to good
with ProgressBar():
    dd.to_parquet(df_agg_good_sub_samp,'good_parquet')

[########################################] | 100% Completed | 14min 52.9s


In [21]:
#loading the data: #change back to good
with ProgressBar():
    df_agg_good_sub_samp = dd.read_parquet('good_parquet')  

In [22]:
df_agg_good_sub_samp.compute().shape

(82860, 70)

## Stage 4: Choose columns to include:

In [23]:
cols_to_include = ['car_star_rating_star', 'motorcycle_star_rating_star',
       'pedestrian_star_rating_star', 'bicycle_star_rating_star','vehicle_flow', 
        'motorcycle_percent', 'ped_peak_hour_flow_across',
       'ped_peak_hour_flow_along_driver_side',
       'ped_peak_hour_flow_along_passenger_side', 'bicycle_peak_hour_flow',
       'carriageway', 'intersection_type', 'sidewalk_driver_side',
       'sidewalk_passenger_side', 'facilities_for_bicycles',
       'paved_shoulder_driver_side', 'paved_shoulder_passenger_side',
       'property_access_points', 'roadside_severity_driver_side_object',
       'skid_resistance_grip', 'curvature', 'median_type', 'number_of_lanes',
       'ped_fencing', 'intersection_quality', 'intersecting_road_volume',
       'roadside_severity_driver_side_distance',
       'roadside_severity_passenger_side_distance',
       'operating_speed_85th_percentile', 'operating_speed_mean',
       'speed_limit', 'school_zone_crossing_supervisor', 'grade', 'lane_width',
       'road_condition', 'ped_crossing_quality', 'sight_distance',
       'quality_of_curve', 'area_type', 'shoulder_rumble_strips',
       'street_lighting', 'speed_management', 'intersection_channelisation',
       'delineation', 'school_zone_warning',
       'motorcycle_observed_flow', 'bicycle_observed_flow',
       'ped_observed_flow_across', 'ped_observed_flow_along_driver_side',
       'ped_observed_flow_along_passenger_side', 'land_use_driver_side',
       'land_use_passenger_side', 'motorcycle_speed_limit',
       'truck_speed_limit', 'roadside_severity_passenger_side_object',
       'service_road', 'roadworks', 'differential_speed_limits',
       'centreline_rumble_strips', 'ped_crossing_facilities_inspected_road',
       'ped_crossing_facilities_intersecting_road', 'vehicle_parking',
       'facilities_for_motorised_two_wheelers', 'roads_that_cars_can_read']

#create new dataframe with only the selected columns
df_agg_good_sub_select = df_agg_good_sub_samp[cols_to_include]

## Stage 5: Encoding the columns:

Now we need to encode all of the input features so that they can be fed into a machine learning model easily.

First we separate the dataframe into input features (X) and target values (Y) which correspond to the irap road features as X and the star ratings as Y:

In [24]:
#split into features and targets
with ProgressBar():
    df_y = df_agg_good_sub_select.iloc[:,0:4]# choose targets by selecting first 4 columns
    df_x = df_agg_good_sub_select.iloc[:,4:] #choose inputs, by selecting the rest of the columns
    #compute() turns dask dataframe into pandas dataframe
    df_x , df_y = dd.compute(df_x,df_y)

[########################################] | 100% Completed |  3.0s


In [25]:
#check the columns are as expected:
print('df x columns : \n')
print(df_x.columns)
print('\n df y columns : \n')
print(df_y.columns)

df x columns : 

Index(['vehicle_flow', 'motorcycle_percent', 'ped_peak_hour_flow_across',
       'ped_peak_hour_flow_along_driver_side',
       'ped_peak_hour_flow_along_passenger_side', 'bicycle_peak_hour_flow',
       'carriageway', 'intersection_type', 'sidewalk_driver_side',
       'sidewalk_passenger_side', 'facilities_for_bicycles',
       'paved_shoulder_driver_side', 'paved_shoulder_passenger_side',
       'property_access_points', 'roadside_severity_driver_side_object',
       'skid_resistance_grip', 'curvature', 'median_type', 'number_of_lanes',
       'ped_fencing', 'intersection_quality', 'intersecting_road_volume',
       'roadside_severity_driver_side_distance',
       'roadside_severity_passenger_side_distance',
       'operating_speed_85th_percentile', 'operating_speed_mean',
       'speed_limit', 'school_zone_crossing_supervisor', 'grade', 'lane_width',
       'road_condition', 'ped_crossing_quality', 'sight_distance',
       'quality_of_curve', 'area_type', 'shoulder

Now we must encode the variables (transform them from the original irap categories into standardised numbers whilst retaining as much information as possible both between different features and for categories within the same input feature).

There are two different options here which are discussed in more detail within the task9 - star rating prediction report:

Option A: All categorical input features are encoded as ordinal

Option B: Ordinal categorical input features are encoded as ordinal like before
          Nominal categorical input features have target encoding applied.
          
Option C: One-hot encoding is applied to all categorical input features
          
For both of the above options, the only input feature that is numerical and not categorical (vehicle count) is encoded with minmax encoding. This avoids giving undue weight to the vehicle count as it can be many times larger than the other input features.

First we divide up the different input features:


In [9]:
#split into numerical,ordinal,and nominal input features and decide which transforms to be applied
#numerical input features
num_cols = ['vehicle_flow']

#ordinal categorical input features
ordinal_cols = ['motorcycle_percent', #normal - transform to n-1 and include zero value
                'ped_peak_hour_flow_across',#normal
               'ped_peak_hour_flow_along_driver_side',#normal
               'ped_peak_hour_flow_along_passenger_side',#normal
               'bicycle_peak_hour_flow',#normal
                'curvature', #convert to int
                'ped_fencing', #normal
                'grade', #transform and add 1
                'shoulder_rumble_strips', #normal
                'street_lighting', #normal
                'speed_management', #normal
                'intersection_channelisation', #normal
                'motorcycle_observed_flow', # normal
                'bicycle_observed_flow', #normal
                'ped_observed_flow_across', #normal
                'ped_observed_flow_along_driver_side',#normal
                'ped_observed_flow_along_passenger_side',#normal
                'roadworks', #normal
                'differential_speed_limits', #normal
                'centreline_rumble_strips', #normal
                'paved_shoulder_driver_side', #convert to int
                'paved_shoulder_passenger_side', #convert to int
                'skid_resistance_grip', #convert to int
                'intersecting_road_volume', #map reverse
                'roadside_severity_driver_side_distance',#map reverse
                'roadside_severity_passenger_side_distance', #map reverse
                'lane_width', #convert to int
                'road_condition', #covert to int
                'sight_distance', #convert to int
                'delineation' #convert to int
                   ]

#nominal categorical input features
nominal_cols = ['carriageway', 
               'intersection_type',
               'sidewalk_driver_side', 
                'sidewalk_passenger_side',
                'facilities_for_bicycles', 
                'property_access_points', 
                'roadside_severity_driver_side_object', 
                'median_type', 
                'number_of_lanes', 
                'intersection_quality', 
                'operating_speed_85th_percentile',
                'operating_speed_mean',
                'speed_limit',
                'school_zone_crossing_supervisor',
                'ped_crossing_quality',
                'quality_of_curve',
                'area_type',
                'school_zone_warning',
                'land_use_driver_side',
                'land_use_passenger_side',
                'motorcycle_speed_limit',
                'truck_speed_limit',
                'roadside_severity_passenger_side_object',
                'service_road',
                 'ped_crossing_facilities_inspected_road',
                'ped_crossing_facilities_intersecting_road',
                'vehicle_parking',
                'facilities_for_motorised_two_wheelers',
                'roads_that_cars_can_read']
               

In [12]:
#install sklearn
!pip install -U scikit-learn 


Collecting scikit-learn
  Downloading scikit_learn-0.24.0-cp38-cp38-manylinux2010_x86_64.whl (24.9 MB)
[K     |████████████████████████████████| 24.9 MB 15.1 MB/s eta 0:00:01   |██▌                             | 2.0 MB 1.4 MB/s eta 0:00:17     |██████▏                         | 4.8 MB 1.9 MB/s eta 0:00:11     |██████▌                         | 5.1 MB 1.9 MB/s eta 0:00:11     |███████                         | 5.4 MB 1.9 MB/s eta 0:00:11     |█████████▌                      | 7.4 MB 1.9 MB/s eta 0:00:10     |██████████▎                     | 8.0 MB 1.9 MB/s eta 0:00:10     |███████████████                 | 11.7 MB 7.2 MB/s eta 0:00:02     |███████████████████▎            | 15.0 MB 4.6 MB/s eta 0:00:03     |████████████████████▌           | 16.0 MB 4.6 MB/s eta 0:00:02     |██████████████████████████      | 20.3 MB 7.5 MB/s eta 0:00:01     |█████████████████████████████   | 22.6 MB 7.5 MB/s eta 0:00:01     |███████████████████████████████▌| 24.5 MB 15.1 MB/s eta 0:00:01
Installing coll

In [26]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import MinMaxScaler

### Option A: All categorical inputs ordinal encoding:

In [None]:
# if saving encoding as joblib uncomment the code below:
# import joblib
# from joblib import dump

In [27]:
#create dataframe copy of df_x so that encodings can be applied
df_x_enc = df_x.copy() 

In [29]:
os.getcwd()

'/home/leon/Documents/datscience practise/Odema - irap/star ratings'

In [18]:
#create ordinal encoder instance
#if an uknown value is seen during the .transform function a value of -1 will be passed to make it easy to raise en error during the pipeline
OE = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1) 
for col in df_x_enc.columns[1:]: #skip vehicle flow , loop through columns applying ordinal encoding
    #apply ordinal encoding:
    df_x_enc[col] = df_x_enc[col].astype(str) #set as string so it can be passed to the encoder correctly
    OE.fit(X=df_x_enc[[col]]) #fit ordinal encoder on data
    #df_x_enc[col] = OE.transform(df_x_enc[[col]]) #transform data and save it as the matching column in df_x_enc
    print(col,' encodings done')
    
    #if the encoding scheme for each input feature needs to be saved, 
    # the following code will save csv files for each input feature with the encoding information:
    
    #for savings csvs
#   ordinal_encoding_df = pd.DataFrame() #create empty df
#   uniq_vals = df_x_enc[col].unique() #get unique input values for each feature
#   ordinal_encoding_df['uniq_vals'] = uniq_vals #add the unique values into df

#   trans_vals = OE.transform(ordinal_encoding_df[['uniq_vals']]) #encode input vals
#   ordinal_encoding_df['trans_vals'] = trans_vals #add transformed values to dataframe
    
#   filename = col+'_encodings.csv'
#   ordinal_encoding_df.to_csv(filename) #save dataframe with original and encoded value for each input feature


#   uniq_vals = df_x_enc[col].unique() #get unique input values for each feature
#   ordinal_encoding_df['uniq_vals'] = uniq_vals #add the unique values into df

#   trans_vals = OE.transform(ordinal_encoding_df[['uniq_vals']]) #encode input vals
#   ordinal_encoding_df['trans_vals'] = trans_vals #add transformed values to dataframe
    
#   ordinal_encoding_df.to_csv(filename) #save dataframe with original and encoded value

#     #for savings joblibs
#     filename = col+'_ord_fit.joblib'
#     dump(OE,filename)
df_x_enc.head()

motorcycle_observed_flow  encodings done
bicycle_observed_flow  encodings done
ped_observed_flow_across  encodings done
ped_observed_flow_along_driver_side  encodings done
ped_observed_flow_along_passenger_side  encodings done
land_use_driver_side  encodings done
land_use_passenger_side  encodings done
motorcycle_speed_limit  encodings done
truck_speed_limit  encodings done
roadside_severity_passenger_side_object  encodings done
service_road  encodings done
roadworks  encodings done
differential_speed_limits  encodings done
centreline_rumble_strips  encodings done
ped_crossing_facilities_inspected_road  encodings done
ped_crossing_facilities_intersecting_road  encodings done
vehicle_parking  encodings done
facilities_for_motorised_two_wheelers  encodings done
roads_that_cars_can_read  encodings done


In [28]:
#we also apply minmax encoding to scale the vehicle count between (0,1)
minmax = MinMaxScaler() #call instance of minmax scaler
minmax.fit(df_x_enc[['vehicle_flow']]) # fit on data 
df_x_enc['vehicle_flow'] = minmax.transform(df_x_enc[['vehicle_flow']]) #transform and save
dump(minmax,'vehicle_flow_min_max_fit.joblib')                                          
df_x_enc.head()

Unnamed: 0,vehicle_flow,motorcycle_percent,ped_peak_hour_flow_across,ped_peak_hour_flow_along_driver_side,ped_peak_hour_flow_along_passenger_side,bicycle_peak_hour_flow,carriageway,intersection_type,sidewalk_driver_side,sidewalk_passenger_side,...,roadside_severity_passenger_side_object,service_road,roadworks,differential_speed_limits,centreline_rumble_strips,ped_crossing_facilities_inspected_road,ped_crossing_facilities_intersecting_road,vehicle_parking,facilities_for_motorised_two_wheelers,roads_that_cars_can_read
10504,0.024584,6.0,3.0,4.0,4.0,3.0,3.0,12.0,7.0,7.0,...,17.0,1.0,1.0,1.0,1.0,7.0,7.0,1.0,6.0,0.0
1965,0.014181,4.0,3.0,4.0,4.0,4.0,3.0,12.0,5.0,5.0,...,11.0,2.0,1.0,1.0,1.0,7.0,7.0,1.0,6.0,1.0
8385,0.041313,7.0,3.0,3.0,3.0,3.0,3.0,12.0,5.0,5.0,...,12.0,1.0,1.0,1.0,1.0,5.0,7.0,3.0,6.0,2.0
2405,0.071748,4.0,2.0,1.0,3.0,4.0,2.0,12.0,5.0,3.0,...,12.0,1.0,1.0,1.0,1.0,5.0,5.0,1.0,6.0,2.0
11214,0.018403,6.0,1.0,2.0,2.0,2.0,3.0,12.0,7.0,7.0,...,11.0,1.0,1.0,1.0,1.0,7.0,7.0,1.0,6.0,0.0


Now we recombine the encoded x features with the y features and save the file:

In [17]:
df_both = pd.concat([df_y,df_x_enc],axis = 1) # combine y df and x df by adding encoded x columns to original ys
df_both.head()

Unnamed: 0,car_star_rating_star,motorcycle_star_rating_star,pedestrian_star_rating_star,bicycle_star_rating_star,vehicle_flow,motorcycle_percent,ped_peak_hour_flow_across,ped_peak_hour_flow_along_driver_side,ped_peak_hour_flow_along_passenger_side,bicycle_peak_hour_flow,...,roadside_severity_passenger_side_object,service_road,roadworks,differential_speed_limits,centreline_rumble_strips,ped_crossing_facilities_inspected_road,ped_crossing_facilities_intersecting_road,vehicle_parking,facilities_for_motorised_two_wheelers,roads_that_cars_can_read
10504,1.0,1.0,1.0,1.0,0.024584,3.0,4.0,5.0,5.0,2.0,...,8.0,0.0,0.0,0.0,0.0,9.0,8.0,0.0,5.0,0.0
1965,3.0,2.0,2.0,3.0,0.014181,1.0,4.0,5.0,5.0,3.0,...,2.0,1.0,0.0,0.0,0.0,9.0,8.0,0.0,5.0,1.0
8385,4.0,3.0,2.0,3.0,0.041313,4.0,4.0,4.0,4.0,2.0,...,3.0,0.0,0.0,0.0,0.0,7.0,8.0,2.0,5.0,2.0
2405,4.0,4.0,2.0,4.0,0.071748,1.0,3.0,0.0,4.0,3.0,...,3.0,0.0,0.0,0.0,0.0,7.0,6.0,0.0,5.0,2.0
11214,2.0,1.0,1.0,1.0,0.018403,3.0,0.0,3.0,3.0,1.0,...,2.0,0.0,0.0,0.0,0.0,9.0,8.0,0.0,5.0,0.0


In [18]:
df_both.to_parquet('all_cat_ord_enc_num_minmax_enc') #filename: all categorical ordinal encoding , numerical min max encoding

### Option B: Nominal categorical inputs have target encoding applied and Ordinal categorical inputs have ordinal encoding applied as before:

In [19]:
df_x_enc = df_x.copy()
#split into datasets based on the type on encoding to be applied:

#nominal input features
df_x_nom = df_x_enc[nominal_cols]
print('Nominal columns : \n',df_x_nom.columns)

#ordinal input features
df_x_ord = df_x_enc[ordinal_cols]
print('Ordinal columns : \n',df_x_ord.columns)

#numerical input features
df_x_numerical = df_x_enc[num_cols]
print('Numerical columns : \n',df_x_numerical.columns)


Nominal columns : 
 Index(['carriageway', 'intersection_type', 'sidewalk_driver_side',
       'sidewalk_passenger_side', 'facilities_for_bicycles',
       'property_access_points', 'roadside_severity_driver_side_object',
       'median_type', 'number_of_lanes', 'intersection_quality',
       'operating_speed_85th_percentile', 'operating_speed_mean',
       'speed_limit', 'school_zone_crossing_supervisor',
       'ped_crossing_quality', 'quality_of_curve', 'area_type',
       'land_use_passenger_side', 'motorcycle_speed_limit',
       'truck_speed_limit', 'roadside_severity_passenger_side_object',
       'service_road', 'ped_crossing_facilities_inspected_road',
       'ped_crossing_facilities_intersecting_road', 'vehicle_parking',
       'facilities_for_motorised_two_wheelers', 'roads_that_cars_can_read'],
      dtype='object')
Ordinal columns : 
 Index(['motorcycle_percent', 'ped_peak_hour_flow_across',
       'ped_peak_hour_flow_along_driver_side',
       'ped_peak_hour_flow_along_pas

We can encode the ordinal and numerical input features using the same methodology as before:

In [20]:
#encode the ordinal features as before
df_x_ord_enc = df_x_ord.copy()

OE = OrdinalEncoder() #create ordinal encoder instance
for col in df_x_ord_enc.columns: #skip vehicle flow , loop through columns applying ordinal encoding
    #apply ordinal encoding:
    df_x_ord_enc[col] = df_x_ord_enc[col].astype(str) #set as string so it can be passed to the encoder correctly
    OE.fit(X=df_x_ord_enc[[col]]) #fit ordinal encoder on data
    df_x_ord_enc[col] = OE.transform(df_x_ord_enc[[col]]) #transform data and save it as the matching column in df_x_enc
    print(col,' encodings done')
    
df_x_ord_enc.head()

motorcycle_percent  encodings done
ped_peak_hour_flow_across  encodings done
ped_peak_hour_flow_along_driver_side  encodings done
ped_peak_hour_flow_along_passenger_side  encodings done
bicycle_peak_hour_flow  encodings done
curvature  encodings done
ped_fencing  encodings done
grade  encodings done
shoulder_rumble_strips  encodings done
street_lighting  encodings done
speed_management  encodings done
intersection_channelisation  encodings done
motorcycle_observed_flow  encodings done
bicycle_observed_flow  encodings done
ped_observed_flow_across  encodings done
ped_observed_flow_along_driver_side  encodings done
ped_observed_flow_along_passenger_side  encodings done
roadworks  encodings done
differential_speed_limits  encodings done
centreline_rumble_strips  encodings done
paved_shoulder_driver_side  encodings done
paved_shoulder_passenger_side  encodings done
skid_resistance_grip  encodings done
intersecting_road_volume  encodings done
roadside_severity_driver_side_distance  encoding

Unnamed: 0,motorcycle_percent,ped_peak_hour_flow_across,ped_peak_hour_flow_along_driver_side,ped_peak_hour_flow_along_passenger_side,bicycle_peak_hour_flow,curvature,ped_fencing,grade,shoulder_rumble_strips,street_lighting,...,paved_shoulder_driver_side,paved_shoulder_passenger_side,skid_resistance_grip,intersecting_road_volume,roadside_severity_driver_side_distance,roadside_severity_passenger_side_distance,lane_width,road_condition,sight_distance,delineation
10504,3.0,4.0,5.0,5.0,2.0,1.0,0.0,1.0,0.0,0.0,...,3.0,3.0,0.0,6.0,0.0,3.0,0.0,0.0,0.0,1.0
1965,1.0,4.0,5.0,5.0,3.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,5.0,1.0,1.0,0.0,0.0,0.0,0.0
8385,4.0,4.0,4.0,4.0,2.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,6.0,1.0,1.0,0.0,0.0,0.0,0.0
2405,1.0,3.0,0.0,4.0,3.0,0.0,0.0,0.0,0.0,1.0,...,3.0,0.0,0.0,6.0,0.0,1.0,0.0,0.0,0.0,0.0
11214,3.0,0.0,3.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,...,3.0,3.0,1.0,6.0,3.0,1.0,0.0,0.0,0.0,1.0


In [21]:
# encode numerical input features as before:
df_x_numerical_enc = df_x_numerical.copy()
minmax = MinMaxScaler() #call instance of minmax scaler
minmax.fit(df_x_numerical[['vehicle_flow']]) # fit on data 
df_x_numerical_enc['vehicle_flow'] = minmax.transform(df_x_enc[['vehicle_flow']])

Now we need to apply target encoding to all of the nominal input features. 

In [22]:
!pip install category-encoders



In [23]:
#custom wrapper so that multiclass targets can be target encoded without losing one target label
# based on this code:
# http://contrib.scikit-learn.org/category_encoders/targetencoder.html

import copy
from category_encoders import utils
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import StratifiedKFold
import category_encoders as encoders
import pandas as pd
import numpy as np


class PolynomialWrapper2(BaseEstimator, TransformerMixin):
    
    def __init__(self, feature_encoder):
        self.feature_encoder = feature_encoder
        self.feature_encoders = {}
        self.label_encoder = None

    def fit(self, X, y, **kwargs):
        # unite the input into pandas types
        X = utils.convert_input(X)
        y = utils.convert_input(y)
        y.columns = ['target']

        # apply one-hot-encoder on the label
        self.label_encoder = encoders.OneHotEncoder(handle_missing='error', handle_unknown='error', cols=['target'], drop_invariant=True,
                                                    use_cat_names=True)
        labels = self.label_encoder.fit_transform(y)
        labels.columns = [column[7:] for column in labels.columns]
        labels = labels.iloc[:, 1:]  # drop one label , was labels = labels.iloc[:, 1:] 

        # train the feature encoders
        for class_name, label in labels.iteritems():
            self.feature_encoders[class_name] = copy.deepcopy(self.feature_encoder).fit(X, label)

    def transform(self, X):
        # unite the input into pandas types
        X = utils.convert_input(X)

        # initialization
        encoded = None
        feature_encoder = None
        all_new_features = pd.DataFrame()

        # transform the features
        for class_name, feature_encoder in self.feature_encoders.items():
            encoded = feature_encoder.transform(X)

            # decorate the encoded features with the label class suffix
            new_features = encoded[feature_encoder.cols]
            new_features.columns = [str(column) + '_' + class_name for column in new_features.columns]

            all_new_features = pd.concat((all_new_features, new_features), axis=1)

        # add features that were not encoded
        result = pd.concat((encoded[encoded.columns[~encoded.columns.isin(feature_encoder.cols)]], all_new_features), axis=1)

        return result

    def fit_transform(self, X, y=None, **fit_params):
        # When we are training the feature encoders, we have to use fit_transform() method on the features.

        # unite the input into pandas types
        X = utils.convert_input(X)
        y = utils.convert_input(y)
        y.columns = ['target']

        # apply one-hot-encoder on the label
        self.label_encoder = encoders.OneHotEncoder(handle_missing='error', handle_unknown='error', cols=['target'], drop_invariant=True,
                                                    use_cat_names=True)
        labels = self.label_encoder.fit_transform(y)
        labels.columns = [column[7:] for column in labels.columns]
        labels = labels.iloc[:, 0:]  # keep all labels , was labels = labels.iloc[:, 1:] 

        # initialization of the feature encoders
        encoded = None
        feature_encoder = None
        all_new_features = pd.DataFrame()

        # fit_transform the feature encoders
        for class_name, label in labels.iteritems():
            feature_encoder = copy.deepcopy(self.feature_encoder)
            encoded = feature_encoder.fit_transform(X, label)

            # decorate the encoded features with the label class suffix
            new_features = encoded[feature_encoder.cols]
            new_features.columns = [str(column) + '_' + class_name for column in new_features.columns]

            all_new_features = pd.concat((all_new_features, new_features), axis=1)
            self.feature_encoders[class_name] = feature_encoder

        # add features that were not encoded
        result = pd.concat((encoded[encoded.columns[~encoded.columns.isin(feature_encoder.cols)]], all_new_features), axis=1)

        return result

In [24]:
from category_encoders import TargetEncoder

ModuleNotFoundError: No module named 'PolynomialWrapper2'

In [25]:
#set all columns as string so that they can be target encoded
df_x_nom_enc = df_x_nom.copy()
for col in df_x_nom_enc.columns:
    df_x_nom_enc[col] = df_x_nom_enc[col].astype(str) 

#### Target encoding applied to nominal input features with respect to car star rating:

In [26]:
TE = TargetEncoder(cols=df_x_nom_enc.columns)
PW = PolynomialWrapper2(TE) #adds multi-class encoding to target encoder

df_x_nom_enc_car = df_x_nom_enc.copy()

#create df with nominal input features encoded based on car star rating probability
df_x_nom_enc_car = PW.fit_transform(df_x_nom_enc_car,df_y.car_star_rating_star)

#reorder columns for easier readability
col_order = df_x_nom_enc_car.columns[:].sort_values().values
df_x_nom_enc_car = df_x_nom_enc_car[col_order]
df_x_nom_enc_car.head()

  elif pd.api.types.is_categorical(cols):
  elif pd.api.types.is_categorical(cols):
  elif pd.api.types.is_categorical(cols):
  elif pd.api.types.is_categorical(cols):
  elif pd.api.types.is_categorical(cols):


Unnamed: 0,area_type_1.0,area_type_2.0,area_type_3.0,area_type_4.0,area_type_5.0,carriageway_1.0,carriageway_2.0,carriageway_3.0,carriageway_4.0,carriageway_5.0,...,truck_speed_limit_1.0,truck_speed_limit_2.0,truck_speed_limit_3.0,truck_speed_limit_4.0,truck_speed_limit_5.0,vehicle_parking_1.0,vehicle_parking_2.0,vehicle_parking_3.0,vehicle_parking_4.0,vehicle_parking_5.0
10504,0.345459,0.287588,0.283935,0.061777,0.02124,0.282561,0.255863,0.337043,0.096948,0.027586,...,0.584965,0.260553,0.143255,0.011137,9e-05,0.283704,0.254386,0.33491,0.095828,0.031172
1965,0.345459,0.287588,0.283935,0.061777,0.02124,0.282561,0.255863,0.337043,0.096948,0.027586,...,0.285203,0.349735,0.304646,0.045864,0.014552,0.283704,0.254386,0.33491,0.095828,0.031172
8385,0.130314,0.17474,0.441865,0.191045,0.062036,0.282561,0.255863,0.337043,0.096948,0.027586,...,0.120827,0.136785,0.399471,0.247242,0.095676,0.188341,0.20085,0.383762,0.180788,0.046259
2405,0.130314,0.17474,0.441865,0.191045,0.062036,0.198457,0.204154,0.372819,0.158813,0.065757,...,0.210207,0.220914,0.454616,0.099334,0.01493,0.283704,0.254386,0.33491,0.095828,0.031172
11214,0.345459,0.287588,0.283935,0.061777,0.02124,0.282561,0.255863,0.337043,0.096948,0.027586,...,0.584965,0.260553,0.143255,0.011137,9e-05,0.283704,0.254386,0.33491,0.095828,0.031172


We will also create a dataframe which includes the original values for each of the nominal input features so that the variation in the input feature can be related to the change in probability for the car star rating.

In [30]:
#combine original values with target encoding
df_x_nom_enc_car_plus_original = pd.concat([df_x_nom_enc,df_x_nom_enc_car],axis=1)

#reorder
col_order = df_x_nom_enc_car_plus_original.columns[:].sort_values().values
df_x_nom_enc_car_plus_original = df_x_nom_enc_car_plus_original[col_order]

#display
df_x_nom_enc_car_plus_original.head()

Unnamed: 0,area_type,area_type_1.0,area_type_2.0,area_type_3.0,area_type_4.0,area_type_5.0,carriageway,carriageway_1.0,carriageway_2.0,carriageway_3.0,...,truck_speed_limit_2.0,truck_speed_limit_3.0,truck_speed_limit_4.0,truck_speed_limit_5.0,vehicle_parking,vehicle_parking_1.0,vehicle_parking_2.0,vehicle_parking_3.0,vehicle_parking_4.0,vehicle_parking_5.0
10504,1.0,0.345459,0.287588,0.283935,0.061777,0.02124,3.0,0.282561,0.255863,0.337043,...,0.260553,0.143255,0.011137,9e-05,1.0,0.283704,0.254386,0.33491,0.095828,0.031172
1965,1.0,0.345459,0.287588,0.283935,0.061777,0.02124,3.0,0.282561,0.255863,0.337043,...,0.349735,0.304646,0.045864,0.014552,1.0,0.283704,0.254386,0.33491,0.095828,0.031172
8385,2.0,0.130314,0.17474,0.441865,0.191045,0.062036,3.0,0.282561,0.255863,0.337043,...,0.136785,0.399471,0.247242,0.095676,3.0,0.188341,0.20085,0.383762,0.180788,0.046259
2405,2.0,0.130314,0.17474,0.441865,0.191045,0.062036,2.0,0.198457,0.204154,0.372819,...,0.220914,0.454616,0.099334,0.01493,1.0,0.283704,0.254386,0.33491,0.095828,0.031172
11214,1.0,0.345459,0.287588,0.283935,0.061777,0.02124,3.0,0.282561,0.255863,0.337043,...,0.260553,0.143255,0.011137,9e-05,1.0,0.283704,0.254386,0.33491,0.095828,0.031172


Note that there will be many duplicate rows in the above dataframe and for each input feature, the data can be reduced to dimensions (a,6) where there are 'a' rows which is the number of unique values for that specific input feature and there are 6 columns (original input value plus a probability column for each star rating (1,2,3,4,5). 

Now we combine the car star ratings data, with the encoded numerical input feature (vehicle count) , ordinal encoded ordinal input features and target encoded nominal input features (with respect to car star rating) to create a new data set: 

In [27]:
df_targ_enc_car = pd.concat([df_y['car_star_rating_star'],
                            df_x_numerical_enc['vehicle_flow'],
                            df_x_ord_enc,
                            df_x_nom_enc_car],axis=1)

df_targ_enc_car.head()

Unnamed: 0,car_star_rating_star,vehicle_flow,motorcycle_percent,ped_peak_hour_flow_across,ped_peak_hour_flow_along_driver_side,ped_peak_hour_flow_along_passenger_side,bicycle_peak_hour_flow,curvature,ped_fencing,grade,...,truck_speed_limit_1.0,truck_speed_limit_2.0,truck_speed_limit_3.0,truck_speed_limit_4.0,truck_speed_limit_5.0,vehicle_parking_1.0,vehicle_parking_2.0,vehicle_parking_3.0,vehicle_parking_4.0,vehicle_parking_5.0
10504,1.0,0.024584,3.0,4.0,5.0,5.0,2.0,1.0,0.0,1.0,...,0.584965,0.260553,0.143255,0.011137,9e-05,0.283704,0.254386,0.33491,0.095828,0.031172
1965,3.0,0.014181,1.0,4.0,5.0,5.0,3.0,0.0,0.0,0.0,...,0.285203,0.349735,0.304646,0.045864,0.014552,0.283704,0.254386,0.33491,0.095828,0.031172
8385,4.0,0.041313,4.0,4.0,4.0,4.0,2.0,0.0,0.0,0.0,...,0.120827,0.136785,0.399471,0.247242,0.095676,0.188341,0.20085,0.383762,0.180788,0.046259
2405,4.0,0.071748,1.0,3.0,0.0,4.0,3.0,0.0,0.0,0.0,...,0.210207,0.220914,0.454616,0.099334,0.01493,0.283704,0.254386,0.33491,0.095828,0.031172
11214,2.0,0.018403,3.0,0.0,3.0,3.0,1.0,0.0,0.0,0.0,...,0.584965,0.260553,0.143255,0.011137,9e-05,0.283704,0.254386,0.33491,0.095828,0.031172


In [None]:
df_targ_enc_car.to_parquet('car_target_enc_ord_ordenc_num_minmaxenc') 
# filename car star ratings,ordinal has ordinal encoding, nominal has target encoding, 
#numerical has minmax encoding

#### Target encoding(motorcycle star rating):

In [28]:
TE = TargetEncoder(cols=df_x_nom_enc.columns)
PW = PolynomialWrapper2(TE) #adds multi-class encoding to target encoder

df_x_nom_enc_motorbike = df_x_nom_enc.copy()

#create df with nominal input features encoded based on car star rating probability
df_x_nom_enc_motorbike = PW.fit_transform(df_x_nom_enc_motorbike,df_y.motorcycle_star_rating_star)

#reorder columns for easier readability
col_order = df_x_nom_enc_motorbike.columns[:].sort_values().values
df_x_nom_enc_motorbike = df_x_nom_enc_motorbike[col_order]
df_x_nom_enc_motorbike.head()

  elif pd.api.types.is_categorical(cols):
  elif pd.api.types.is_categorical(cols):
  elif pd.api.types.is_categorical(cols):
  elif pd.api.types.is_categorical(cols):
  elif pd.api.types.is_categorical(cols):


Unnamed: 0,area_type_1.0,area_type_2.0,area_type_3.0,area_type_4.0,area_type_5.0,carriageway_1.0,carriageway_2.0,carriageway_3.0,carriageway_4.0,carriageway_5.0,...,truck_speed_limit_1.0,truck_speed_limit_2.0,truck_speed_limit_3.0,truck_speed_limit_4.0,truck_speed_limit_5.0,vehicle_parking_1.0,vehicle_parking_2.0,vehicle_parking_3.0,vehicle_parking_4.0,vehicle_parking_5.0
10504,0.436522,0.30673,0.21248,0.034241,0.010027,0.361322,0.279821,0.274264,0.068796,0.015798,...,0.697683,0.236662,0.063679,0.001976,0.0,0.369003,0.288079,0.268177,0.061468,0.013272
1965,0.436522,0.30673,0.21248,0.034241,0.010027,0.361322,0.279821,0.274264,0.068796,0.015798,...,0.379555,0.393153,0.215483,0.011332,0.000477,0.369003,0.288079,0.268177,0.061468,0.013272
8385,0.194261,0.232604,0.401994,0.143244,0.027897,0.361322,0.279821,0.274264,0.068796,0.015798,...,0.166716,0.176644,0.40984,0.204074,0.042727,0.250649,0.226811,0.351192,0.14279,0.028558
2405,0.194261,0.232604,0.401994,0.143244,0.027897,0.284748,0.269436,0.329496,0.096973,0.019347,...,0.284142,0.271413,0.39216,0.049132,0.003153,0.369003,0.288079,0.268177,0.061468,0.013272
11214,0.436522,0.30673,0.21248,0.034241,0.010027,0.361322,0.279821,0.274264,0.068796,0.015798,...,0.697683,0.236662,0.063679,0.001976,0.0,0.369003,0.288079,0.268177,0.061468,0.013272


We will also create a dataframe which includes the original values for each of the nominal input features so that the variation in the input feature can be related to the change in probability for the motorcycle star rating.

In [31]:
#combine original values with target encoding
df_x_nom_enc_motorbike_plus_original = pd.concat([df_x_nom_enc,df_x_nom_enc_motorbike],axis=1)

#reorder
col_order = df_x_nom_enc_motorbike_plus_original.columns[:].sort_values().values
df_x_nom_enc_motorbike_plus_original  = df_x_nom_enc_motorbike_plus_original[col_order]

#display
df_x_nom_enc_motorbike_plus_original.head()

Unnamed: 0,area_type,area_type_1.0,area_type_2.0,area_type_3.0,area_type_4.0,area_type_5.0,carriageway,carriageway_1.0,carriageway_2.0,carriageway_3.0,...,truck_speed_limit_2.0,truck_speed_limit_3.0,truck_speed_limit_4.0,truck_speed_limit_5.0,vehicle_parking,vehicle_parking_1.0,vehicle_parking_2.0,vehicle_parking_3.0,vehicle_parking_4.0,vehicle_parking_5.0
10504,1.0,0.436522,0.30673,0.21248,0.034241,0.010027,3.0,0.361322,0.279821,0.274264,...,0.236662,0.063679,0.001976,0.0,1.0,0.369003,0.288079,0.268177,0.061468,0.013272
1965,1.0,0.436522,0.30673,0.21248,0.034241,0.010027,3.0,0.361322,0.279821,0.274264,...,0.393153,0.215483,0.011332,0.000477,1.0,0.369003,0.288079,0.268177,0.061468,0.013272
8385,2.0,0.194261,0.232604,0.401994,0.143244,0.027897,3.0,0.361322,0.279821,0.274264,...,0.176644,0.40984,0.204074,0.042727,3.0,0.250649,0.226811,0.351192,0.14279,0.028558
2405,2.0,0.194261,0.232604,0.401994,0.143244,0.027897,2.0,0.284748,0.269436,0.329496,...,0.271413,0.39216,0.049132,0.003153,1.0,0.369003,0.288079,0.268177,0.061468,0.013272
11214,1.0,0.436522,0.30673,0.21248,0.034241,0.010027,3.0,0.361322,0.279821,0.274264,...,0.236662,0.063679,0.001976,0.0,1.0,0.369003,0.288079,0.268177,0.061468,0.013272


Now we combine the motorcycle star ratings data, with the encoded numerical input feature (vehicle count) , ordinal encoded ordinal input features and target encoded nominal input features (with respect to motorcycle star rating) to create a new data set: 

In [35]:
df_targ_enc_motorcycle = pd.concat([df_y['motorcycle_star_rating_star'],
                            df_x_numerical_enc['vehicle_flow'],
                            df_x_ord_enc,
                            df_x_nom_enc_motorbike],axis=1)

df_targ_enc_motorcycle.head()

Unnamed: 0,motorcycle_star_rating_star,vehicle_flow,motorcycle_percent,ped_peak_hour_flow_across,ped_peak_hour_flow_along_driver_side,ped_peak_hour_flow_along_passenger_side,bicycle_peak_hour_flow,curvature,ped_fencing,grade,...,truck_speed_limit_1.0,truck_speed_limit_2.0,truck_speed_limit_3.0,truck_speed_limit_4.0,truck_speed_limit_5.0,vehicle_parking_1.0,vehicle_parking_2.0,vehicle_parking_3.0,vehicle_parking_4.0,vehicle_parking_5.0
10504,1.0,0.024584,3.0,4.0,5.0,5.0,2.0,1.0,0.0,1.0,...,0.697683,0.236662,0.063679,0.001976,0.0,0.369003,0.288079,0.268177,0.061468,0.013272
1965,2.0,0.014181,1.0,4.0,5.0,5.0,3.0,0.0,0.0,0.0,...,0.379555,0.393153,0.215483,0.011332,0.000477,0.369003,0.288079,0.268177,0.061468,0.013272
8385,3.0,0.041313,4.0,4.0,4.0,4.0,2.0,0.0,0.0,0.0,...,0.166716,0.176644,0.40984,0.204074,0.042727,0.250649,0.226811,0.351192,0.14279,0.028558
2405,4.0,0.071748,1.0,3.0,0.0,4.0,3.0,0.0,0.0,0.0,...,0.284142,0.271413,0.39216,0.049132,0.003153,0.369003,0.288079,0.268177,0.061468,0.013272
11214,1.0,0.018403,3.0,0.0,3.0,3.0,1.0,0.0,0.0,0.0,...,0.697683,0.236662,0.063679,0.001976,0.0,0.369003,0.288079,0.268177,0.061468,0.013272


In [None]:
df_targ_enc_motorcycle.to_parquet('motorcycle_target_enc_ord_ordenc_num_minmaxenc'') 
# filename motorcycle star ratings,ordinal has ordinal encoding, nominal has target encoding, 
#numerical has minmax encoding

#### Target encoding(pedestrian star rating):

In [29]:
TE = TargetEncoder(cols=df_x_nom_enc.columns)
PW = PolynomialWrapper2(TE) #adds multi-class encoding to target encoder

df_x_nom_enc_pedest = df_x_nom_enc.copy()

#create df with nominal input features encoded based on car star rating probability
df_x_nom_enc_pedest = PW.fit_transform(df_x_nom_enc_pedest,df_y.pedestrian_star_rating_star)

#reorder columns for easier readability
col_order = df_x_nom_enc_pedest.columns[:].sort_values().values
df_x_nom_enc_pedest = df_x_nom_enc_pedest[col_order]
df_x_nom_enc_pedest.head()

  elif pd.api.types.is_categorical(cols):
  elif pd.api.types.is_categorical(cols):
  elif pd.api.types.is_categorical(cols):
  elif pd.api.types.is_categorical(cols):
  elif pd.api.types.is_categorical(cols):


Unnamed: 0,area_type_1.0,area_type_2.0,area_type_3.0,area_type_4.0,area_type_5.0,carriageway_1.0,carriageway_2.0,carriageway_3.0,carriageway_4.0,carriageway_5.0,...,truck_speed_limit_1.0,truck_speed_limit_2.0,truck_speed_limit_3.0,truck_speed_limit_4.0,truck_speed_limit_5.0,vehicle_parking_1.0,vehicle_parking_2.0,vehicle_parking_3.0,vehicle_parking_4.0,vehicle_parking_5.0
10504,0.620999,0.279815,0.075886,0.016887,0.006413,0.486513,0.316392,0.141035,0.042113,0.013946,...,0.570684,0.417011,0.010239,0.001078,0.000988,0.528154,0.303531,0.121943,0.03261,0.013762
1965,0.620999,0.279815,0.075886,0.016887,0.006413,0.486513,0.316392,0.141035,0.042113,0.013946,...,0.742053,0.206656,0.044254,0.004175,0.002863,0.528154,0.303531,0.121943,0.03261,0.013762
8385,0.316742,0.325372,0.23512,0.089265,0.033502,0.486513,0.316392,0.141035,0.042113,0.013946,...,0.282321,0.255846,0.281144,0.140609,0.040079,0.397923,0.298796,0.20793,0.073401,0.021949
2405,0.316742,0.325372,0.23512,0.089265,0.033502,0.614481,0.200831,0.111573,0.048902,0.024214,...,0.500595,0.339817,0.134785,0.018142,0.006662,0.528154,0.303531,0.121943,0.03261,0.013762
11214,0.620999,0.279815,0.075886,0.016887,0.006413,0.486513,0.316392,0.141035,0.042113,0.013946,...,0.570684,0.417011,0.010239,0.001078,0.000988,0.528154,0.303531,0.121943,0.03261,0.013762


We will also create a dataframe which includes the original values for each of the nominal input features so that the variation in the input feature can be related to the change in probability for the pedestrian star rating.

In [32]:
#combine original values with target encoding
df_x_nom_enc_pedest_plus_original = pd.concat([df_x_nom_enc,df_x_nom_enc_pedest],axis=1)

#reorder
col_order = df_x_nom_enc_pedest_plus_original.columns[:].sort_values().values
df_x_nom_enc_pedest_plus_original  = df_x_nom_enc_pedest_plus_original[col_order]

#display
df_x_nom_enc_pedest_plus_original.head()

Unnamed: 0,area_type,area_type_1.0,area_type_2.0,area_type_3.0,area_type_4.0,area_type_5.0,carriageway,carriageway_1.0,carriageway_2.0,carriageway_3.0,...,truck_speed_limit_2.0,truck_speed_limit_3.0,truck_speed_limit_4.0,truck_speed_limit_5.0,vehicle_parking,vehicle_parking_1.0,vehicle_parking_2.0,vehicle_parking_3.0,vehicle_parking_4.0,vehicle_parking_5.0
10504,1.0,0.620999,0.279815,0.075886,0.016887,0.006413,3.0,0.486513,0.316392,0.141035,...,0.417011,0.010239,0.001078,0.000988,1.0,0.528154,0.303531,0.121943,0.03261,0.013762
1965,1.0,0.620999,0.279815,0.075886,0.016887,0.006413,3.0,0.486513,0.316392,0.141035,...,0.206656,0.044254,0.004175,0.002863,1.0,0.528154,0.303531,0.121943,0.03261,0.013762
8385,2.0,0.316742,0.325372,0.23512,0.089265,0.033502,3.0,0.486513,0.316392,0.141035,...,0.255846,0.281144,0.140609,0.040079,3.0,0.397923,0.298796,0.20793,0.073401,0.021949
2405,2.0,0.316742,0.325372,0.23512,0.089265,0.033502,2.0,0.614481,0.200831,0.111573,...,0.339817,0.134785,0.018142,0.006662,1.0,0.528154,0.303531,0.121943,0.03261,0.013762
11214,1.0,0.620999,0.279815,0.075886,0.016887,0.006413,3.0,0.486513,0.316392,0.141035,...,0.417011,0.010239,0.001078,0.000988,1.0,0.528154,0.303531,0.121943,0.03261,0.013762


Now we combine the pedestrian star ratings data, with the encoded numerical input feature (vehicle count) , ordinal encoded ordinal input features and target encoded nominal input features (with respect to pedestrian star rating) to create a new data set: 

In [34]:
df_targ_enc_pedest = pd.concat([df_y['pedestrian_star_rating_star'],
                            df_x_numerical_enc['vehicle_flow'],
                            df_x_ord_enc,
                            df_x_nom_enc_pedest],axis=1)

df_targ_enc_pedest.head()

Unnamed: 0,pedestrian_star_rating_star,vehicle_flow,motorcycle_percent,ped_peak_hour_flow_across,ped_peak_hour_flow_along_driver_side,ped_peak_hour_flow_along_passenger_side,bicycle_peak_hour_flow,curvature,ped_fencing,grade,...,truck_speed_limit_1.0,truck_speed_limit_2.0,truck_speed_limit_3.0,truck_speed_limit_4.0,truck_speed_limit_5.0,vehicle_parking_1.0,vehicle_parking_2.0,vehicle_parking_3.0,vehicle_parking_4.0,vehicle_parking_5.0
10504,1.0,0.024584,3.0,4.0,5.0,5.0,2.0,1.0,0.0,1.0,...,0.570684,0.417011,0.010239,0.001078,0.000988,0.528154,0.303531,0.121943,0.03261,0.013762
1965,2.0,0.014181,1.0,4.0,5.0,5.0,3.0,0.0,0.0,0.0,...,0.742053,0.206656,0.044254,0.004175,0.002863,0.528154,0.303531,0.121943,0.03261,0.013762
8385,2.0,0.041313,4.0,4.0,4.0,4.0,2.0,0.0,0.0,0.0,...,0.282321,0.255846,0.281144,0.140609,0.040079,0.397923,0.298796,0.20793,0.073401,0.021949
2405,2.0,0.071748,1.0,3.0,0.0,4.0,3.0,0.0,0.0,0.0,...,0.500595,0.339817,0.134785,0.018142,0.006662,0.528154,0.303531,0.121943,0.03261,0.013762
11214,1.0,0.018403,3.0,0.0,3.0,3.0,1.0,0.0,0.0,0.0,...,0.570684,0.417011,0.010239,0.001078,0.000988,0.528154,0.303531,0.121943,0.03261,0.013762


In [None]:
df_targ_enc_pedest.to_parquet('pedest_target_enc_ord_ordenc_num_minmaxenc') 
# filename pedestrian star ratings,ordinal has ordinal encoding, nominal has target encoding, 
#numerical has minmax encoding

#### Target encoding(Bicycle star rating):

In [36]:
TE = TargetEncoder(cols=df_x_nom_enc.columns)
PW = PolynomialWrapper2(TE) #adds multi-class encoding to target encoder

df_x_nom_enc_bicycle = df_x_nom_enc.copy()

#create df with nominal input features encoded based on car star rating probability
df_x_nom_enc_bicycle = PW.fit_transform(df_x_nom_enc_bicycle,df_y.bicycle_star_rating_star)

#reorder columns for easier readability
col_order = df_x_nom_enc_bicycle.columns[:].sort_values().values
df_x_nom_enc_bicycle = df_x_nom_enc_bicycle[col_order]
df_x_nom_enc_bicycle.head()

  elif pd.api.types.is_categorical(cols):
  elif pd.api.types.is_categorical(cols):
  elif pd.api.types.is_categorical(cols):
  elif pd.api.types.is_categorical(cols):
  elif pd.api.types.is_categorical(cols):


Unnamed: 0,area_type_1.0,area_type_2.0,area_type_3.0,area_type_4.0,area_type_5.0,carriageway_1.0,carriageway_2.0,carriageway_3.0,carriageway_4.0,carriageway_5.0,...,truck_speed_limit_1.0,truck_speed_limit_2.0,truck_speed_limit_3.0,truck_speed_limit_4.0,truck_speed_limit_5.0,vehicle_parking_1.0,vehicle_parking_2.0,vehicle_parking_3.0,vehicle_parking_4.0,vehicle_parking_5.0
10504,0.503644,0.252356,0.145106,0.031637,0.067257,0.398047,0.255986,0.227879,0.052247,0.065841,...,0.762888,0.226154,0.005928,0.001168,0.003862,0.423902,0.258078,0.206961,0.046876,0.064182
1965,0.503644,0.252356,0.145106,0.031637,0.067257,0.398047,0.255986,0.227879,0.052247,0.065841,...,0.514642,0.332797,0.120654,0.002743,0.029164,0.423902,0.258078,0.206961,0.046876,0.064182
8385,0.243145,0.255438,0.355084,0.091589,0.054743,0.398047,0.255986,0.227879,0.052247,0.065841,...,0.16576,0.192896,0.415944,0.156126,0.069275,0.316025,0.257966,0.31697,0.069389,0.039651
2405,0.243145,0.255438,0.355084,0.091589,0.054743,0.425994,0.256499,0.215193,0.06457,0.037745,...,0.380264,0.284975,0.292291,0.028432,0.014038,0.423902,0.258078,0.206961,0.046876,0.064182
11214,0.503644,0.252356,0.145106,0.031637,0.067257,0.398047,0.255986,0.227879,0.052247,0.065841,...,0.762888,0.226154,0.005928,0.001168,0.003862,0.423902,0.258078,0.206961,0.046876,0.064182


We will also create a dataframe which includes the original values for each of the nominal input features so that the variation in the input feature can be related to the change in probability for the bicycle star rating.

In [37]:
#combine original values with target encoding
df_x_nom_enc_bicycle_plus_original = pd.concat([df_x_nom_enc,df_x_nom_enc_bicycle],axis=1)

#reorder
col_order = df_x_nom_enc_bicycle_plus_original.columns[:].sort_values().values
df_x_nom_enc_bicycle_plus_original  = df_x_nom_enc_bicycle_plus_original[col_order]

#display
df_x_nom_enc_bicycle_plus_original.head()

Unnamed: 0,area_type,area_type_1.0,area_type_2.0,area_type_3.0,area_type_4.0,area_type_5.0,carriageway,carriageway_1.0,carriageway_2.0,carriageway_3.0,...,truck_speed_limit_2.0,truck_speed_limit_3.0,truck_speed_limit_4.0,truck_speed_limit_5.0,vehicle_parking,vehicle_parking_1.0,vehicle_parking_2.0,vehicle_parking_3.0,vehicle_parking_4.0,vehicle_parking_5.0
10504,1.0,0.503644,0.252356,0.145106,0.031637,0.067257,3.0,0.398047,0.255986,0.227879,...,0.226154,0.005928,0.001168,0.003862,1.0,0.423902,0.258078,0.206961,0.046876,0.064182
1965,1.0,0.503644,0.252356,0.145106,0.031637,0.067257,3.0,0.398047,0.255986,0.227879,...,0.332797,0.120654,0.002743,0.029164,1.0,0.423902,0.258078,0.206961,0.046876,0.064182
8385,2.0,0.243145,0.255438,0.355084,0.091589,0.054743,3.0,0.398047,0.255986,0.227879,...,0.192896,0.415944,0.156126,0.069275,3.0,0.316025,0.257966,0.31697,0.069389,0.039651
2405,2.0,0.243145,0.255438,0.355084,0.091589,0.054743,2.0,0.425994,0.256499,0.215193,...,0.284975,0.292291,0.028432,0.014038,1.0,0.423902,0.258078,0.206961,0.046876,0.064182
11214,1.0,0.503644,0.252356,0.145106,0.031637,0.067257,3.0,0.398047,0.255986,0.227879,...,0.226154,0.005928,0.001168,0.003862,1.0,0.423902,0.258078,0.206961,0.046876,0.064182


Now we combine the pedestrian star ratings data, with the encoded numerical input feature (vehicle count) , ordinal encoded ordinal input features and target encoded nominal input features (with respect to pedestrian star rating) to create a new data set: 

In [38]:
df_targ_enc_bicycle = pd.concat([df_y['bicycle_star_rating_star'],
                            df_x_numerical_enc['vehicle_flow'],
                            df_x_ord_enc,
                            df_x_nom_enc_bicycle],axis=1)

df_targ_enc_bicycle.head()

Unnamed: 0,bicycle_star_rating_star,vehicle_flow,motorcycle_percent,ped_peak_hour_flow_across,ped_peak_hour_flow_along_driver_side,ped_peak_hour_flow_along_passenger_side,bicycle_peak_hour_flow,curvature,ped_fencing,grade,...,truck_speed_limit_1.0,truck_speed_limit_2.0,truck_speed_limit_3.0,truck_speed_limit_4.0,truck_speed_limit_5.0,vehicle_parking_1.0,vehicle_parking_2.0,vehicle_parking_3.0,vehicle_parking_4.0,vehicle_parking_5.0
10504,1.0,0.024584,3.0,4.0,5.0,5.0,2.0,1.0,0.0,1.0,...,0.762888,0.226154,0.005928,0.001168,0.003862,0.423902,0.258078,0.206961,0.046876,0.064182
1965,3.0,0.014181,1.0,4.0,5.0,5.0,3.0,0.0,0.0,0.0,...,0.514642,0.332797,0.120654,0.002743,0.029164,0.423902,0.258078,0.206961,0.046876,0.064182
8385,3.0,0.041313,4.0,4.0,4.0,4.0,2.0,0.0,0.0,0.0,...,0.16576,0.192896,0.415944,0.156126,0.069275,0.316025,0.257966,0.31697,0.069389,0.039651
2405,4.0,0.071748,1.0,3.0,0.0,4.0,3.0,0.0,0.0,0.0,...,0.380264,0.284975,0.292291,0.028432,0.014038,0.423902,0.258078,0.206961,0.046876,0.064182
11214,1.0,0.018403,3.0,0.0,3.0,3.0,1.0,0.0,0.0,0.0,...,0.762888,0.226154,0.005928,0.001168,0.003862,0.423902,0.258078,0.206961,0.046876,0.064182


In [None]:
df_targ_enc_bicycle.to_parquet('bicycle_target_enc_ord_ordenc_num_minmaxenc') 
# filename bicycle star ratings,ordinal has ordinal encoding, nominal has target encoding, 
#numerical has minmax encoding

### Option C: One-hot encoding is applied to all categorical inputs:

In [40]:
!pip install dask-ml

In [50]:
from dask_ml import preprocessing as dp

In [52]:
df_x_one_hot_enc = df_x.copy()
df_x_one_hot_enc = df_x_one_hot_enc.drop(columns='vehicle_flow')

for col in df_x_one_hot_enc: #set all columns as categorical
    df_x_one_hot_enc[col] = df_x_one_hot_enc[col].astype('category')

#use one hot encoding
DE = dp.DummyEncoder(drop_first=False) #keep all input categories
df_x_one_hot_enc = DE.fit_transform(df_x_one_hot_enc)

#we also apply minmax encoding to scale the vehicle count between (0,1)
minmax = MinMaxScaler() #call instance of minmax scaler
minmax.fit(df_x[['vehicle_flow']]) # fit on data 
df_x_numerical_enc['vehicle_flow'] = minmax.transform(df_x[['vehicle_flow']])

df_x_one_hot_enc.head()

Index(['motorcycle_percent_3.0', 'motorcycle_percent_4.0',
       'motorcycle_percent_5.0', 'motorcycle_percent_6.0',
       'motorcycle_percent_7.0', 'motorcycle_percent_8.0',
       'motorcycle_percent_9.0', 'ped_peak_hour_flow_across_1.0',
       'ped_peak_hour_flow_across_2.0', 'ped_peak_hour_flow_across_3.0',
       ...
       'vehicle_parking_3.0', 'facilities_for_motorised_two_wheelers_1.0',
       'facilities_for_motorised_two_wheelers_2.0',
       'facilities_for_motorised_two_wheelers_3.0',
       'facilities_for_motorised_two_wheelers_4.0',
       'facilities_for_motorised_two_wheelers_5.0',
       'facilities_for_motorised_two_wheelers_6.0',
       'roads_that_cars_can_read_0.0', 'roads_that_cars_can_read_1.0',
       'roads_that_cars_can_read_2.0'],
      dtype='object', length=433)


Unnamed: 0,motorcycle_percent_3.0,motorcycle_percent_4.0,motorcycle_percent_5.0,motorcycle_percent_6.0,motorcycle_percent_7.0,motorcycle_percent_8.0,motorcycle_percent_9.0,ped_peak_hour_flow_across_1.0,ped_peak_hour_flow_across_2.0,ped_peak_hour_flow_across_3.0,...,vehicle_parking_3.0,facilities_for_motorised_two_wheelers_1.0,facilities_for_motorised_two_wheelers_2.0,facilities_for_motorised_two_wheelers_3.0,facilities_for_motorised_two_wheelers_4.0,facilities_for_motorised_two_wheelers_5.0,facilities_for_motorised_two_wheelers_6.0,roads_that_cars_can_read_0.0,roads_that_cars_can_read_1.0,roads_that_cars_can_read_2.0
10504,0,0,0,1,0,0,0,0,0,1,...,0,0,0,0,0,0,1,1,0,0
1965,0,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,1,0,1,0
8385,0,0,0,0,1,0,0,0,0,1,...,1,0,0,0,0,0,1,0,0,1
2405,0,1,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,1,0,0,1
11214,0,0,0,1,0,0,0,1,0,0,...,0,0,0,0,0,0,1,1,0,0


Now we add back in the vehicle count and star ratings:

In [54]:
df_one_hot = pd.concat([df_y,
                        df_x_numerical_enc['vehicle_flow'],
                       df_x_one_hot_enc],axis=1)
df_one_hot.head()

Unnamed: 0,car_star_rating_star,motorcycle_star_rating_star,pedestrian_star_rating_star,bicycle_star_rating_star,vehicle_flow,motorcycle_percent_3.0,motorcycle_percent_4.0,motorcycle_percent_5.0,motorcycle_percent_6.0,motorcycle_percent_7.0,...,vehicle_parking_3.0,facilities_for_motorised_two_wheelers_1.0,facilities_for_motorised_two_wheelers_2.0,facilities_for_motorised_two_wheelers_3.0,facilities_for_motorised_two_wheelers_4.0,facilities_for_motorised_two_wheelers_5.0,facilities_for_motorised_two_wheelers_6.0,roads_that_cars_can_read_0.0,roads_that_cars_can_read_1.0,roads_that_cars_can_read_2.0
10504,1.0,1.0,1.0,1.0,0.024584,0,0,0,1,0,...,0,0,0,0,0,0,1,1,0,0
1965,3.0,2.0,2.0,3.0,0.014181,0,1,0,0,0,...,0,0,0,0,0,0,1,0,1,0
8385,4.0,3.0,2.0,3.0,0.041313,0,0,0,0,1,...,1,0,0,0,0,0,1,0,0,1
2405,4.0,4.0,2.0,4.0,0.071748,0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,1
11214,2.0,1.0,1.0,1.0,0.018403,0,0,0,1,0,...,0,0,0,0,0,0,1,1,0,0


In [None]:
df_one_hot.to_parquet('all_star_ord_nom_one_hot_enc_num_minmax_enc')