# Ride-Hailing Platform Price Predictor

## Description
### Objective
`Predict ride prices on a ride-hailing platform using linear regression models.`
### Problem Statement
`1. Which columns affect the price significantly?`
<br>
`2. Is linear regression suitable for this kind of dataset?`
<br>
`3. How large is the error compared to the actual value?`
<br>
`4. How can the error between the predicted price and the actual price be reduced?`

### Working Area
#### 1. Import Libraries

In [199]:
# Import and define libraries
import joblib
import json
import pandas as pd
import numpy as np

# Split and standardize the dataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OrdinalEncoder
from feature_engine.outliers import Winsorizer

# Regression Problems
from sklearn.linear_model import LinearRegression

# Evaluate Regression Models
from sklearn.metrics import mean_absolute_error

#### 2. Data Loading
Before running this notebook, save the dataset as `rideshare_kaggle.csv` in the same folder as the notebook.

In [200]:
# Load Dataset
df_ori = pd.read_csv('rideshare_kaggle.csv')

#### 3. Exploratory Data Analysis (EDA)

In [201]:
# Copy the dataset and take a brief look
df = df_ori.copy()
pd.set_option('display.max_columns', None) # Show all columns when a DataFrame is displayed
df.head(20)

Unnamed: 0,id,timestamp,hour,day,month,datetime,timezone,source,destination,cab_type,product_id,name,price,distance,surge_multiplier,latitude,longitude,temperature,apparentTemperature,short_summary,long_summary,precipIntensity,precipProbability,humidity,windSpeed,windGust,windGustTime,visibility,temperatureHigh,temperatureHighTime,temperatureLow,temperatureLowTime,apparentTemperatureHigh,apparentTemperatureHighTime,apparentTemperatureLow,apparentTemperatureLowTime,icon,dewPoint,pressure,windBearing,cloudCover,uvIndex,visibility.1,ozone,sunriseTime,sunsetTime,moonPhase,precipIntensityMax,uvIndexTime,temperatureMin,temperatureMinTime,temperatureMax,temperatureMaxTime,apparentTemperatureMin,apparentTemperatureMinTime,apparentTemperatureMax,apparentTemperatureMaxTime
0,424553bb-7174-41ea-aeb4-fe06d4f4b9d7,1544953000.0,9,16,12,2018-12-16 09:30:07,America/New_York,Haymarket Square,North Station,Lyft,lyft_line,Shared,5.0,0.44,1.0,42.2148,-71.033,42.34,37.12,Mostly Cloudy,Rain throughout the day.,0.0,0.0,0.68,8.66,9.17,1545015600,10.0,43.68,1544968800,34.19,1545048000,37.95,1544968800,27.39,1545044400,partly-cloudy-night,32.7,1021.98,57,0.72,0,10.0,303.8,1544962084,1544994864,0.3,0.1276,1544979600,39.89,1545012000,43.68,1544968800,33.73,1545012000,38.07,1544958000
1,4bd23055-6827-41c6-b23b-3c491f24e74d,1543284000.0,2,27,11,2018-11-27 02:00:23,America/New_York,Haymarket Square,North Station,Lyft,lyft_premier,Lux,11.0,0.44,1.0,42.2148,-71.033,43.58,37.35,Rain,"Rain until morning, starting again in the eve...",0.1299,1.0,0.94,11.98,11.98,1543291200,4.786,47.3,1543251600,42.1,1543298400,43.92,1543251600,36.2,1543291200,rain,41.83,1003.97,90,1.0,0,4.786,291.1,1543232969,1543266992,0.64,0.13,1543251600,40.49,1543233600,47.3,1543251600,36.2,1543291200,43.92,1543251600
2,981a3613-77af-4620-a42a-0c0866077d1e,1543367000.0,1,28,11,2018-11-28 01:00:22,America/New_York,Haymarket Square,North Station,Lyft,lyft,Lyft,7.0,0.44,1.0,42.2148,-71.033,38.33,32.93,Clear,Light rain in the morning.,0.0,0.0,0.75,7.33,7.33,1543334400,10.0,47.55,1543320000,33.1,1543402800,44.12,1543320000,29.11,1543392000,clear-night,31.1,992.28,240,0.03,0,10.0,315.7,1543319437,1543353364,0.68,0.1064,1543338000,35.36,1543377600,47.55,1543320000,31.04,1543377600,44.12,1543320000
3,c2d88af2-d278-4bfd-a8d0-29ca77cc5512,1543554000.0,4,30,11,2018-11-30 04:53:02,America/New_York,Haymarket Square,North Station,Lyft,lyft_luxsuv,Lux Black XL,26.0,0.44,1.0,42.2148,-71.033,34.38,29.63,Clear,Partly cloudy throughout the day.,0.0,0.0,0.73,5.28,5.28,1543514400,10.0,45.03,1543510800,28.9,1543579200,38.53,1543510800,26.2,1543575600,clear-night,26.64,1013.73,310,0.0,0,10.0,291.1,1543492370,1543526114,0.75,0.0,1543507200,34.67,1543550400,45.03,1543510800,30.3,1543550400,38.53,1543510800
4,e0126e1f-8ca9-4f2e-82b3-50505a09db9a,1543463000.0,3,29,11,2018-11-29 03:49:20,America/New_York,Haymarket Square,North Station,Lyft,lyft_plus,Lyft XL,9.0,0.44,1.0,42.2148,-71.033,37.44,30.88,Partly Cloudy,Mostly cloudy throughout the day.,0.0,0.0,0.7,9.14,9.14,1543446000,10.0,42.18,1543420800,36.71,1543478400,35.75,1543420800,30.29,1543460400,partly-cloudy-night,28.61,998.36,303,0.44,0,10.0,347.7,1543405904,1543439738,0.72,0.0001,1543420800,33.1,1543402800,42.18,1543420800,29.11,1543392000,35.75,1543420800
5,f6f6d7e4-3e18-4922-a5f5-181cdd3fa6f2,1545071000.0,18,17,12,2018-12-17 18:25:12,America/New_York,Haymarket Square,North Station,Lyft,lyft_lux,Lux Black,16.5,0.44,1.0,42.2148,-71.033,38.75,33.51,Overcast,Light rain in the morning and overnight.,0.0,0.0,0.84,7.19,8.88,1545022800,8.325,40.61,1545076800,24.07,1545130800,34.97,1545080400,12.04,1545134400,cloudy,34.41,1000.46,294,1.0,1,8.325,335.8,1545048523,1545081282,0.33,0.0221,1545066000,34.19,1545048000,40.66,1545022800,27.39,1545044400,34.97,1545080400
6,462816a3-820d-408b-8549-0b39e82f65ac,1543209000.0,5,26,11,2018-11-26 05:03:00,America/New_York,Back Bay,Northeastern University,Lyft,lyft_plus,Lyft XL,10.5,1.08,1.0,42.3503,-71.081,41.99,41.99,Overcast,"Rain until morning, starting again in the eve...",0.0,0.0,0.91,0.53,0.88,1543287600,4.675,46.46,1543255200,42.17,1543298400,43.81,1543251600,37.08,1543298400,cloudy,39.54,1014.11,91,1.0,0,4.675,312.3,1543233004,1543266980,0.64,0.1245,1543251600,40.67,1543233600,46.46,1543255200,37.45,1543291200,43.81,1543251600
7,474d6376-bc59-4ec9-bf57-4e6d6faeb165,1543780000.0,19,2,12,2018-12-02 19:53:04,America/New_York,Back Bay,Northeastern University,Lyft,lyft_lux,Lux Black,16.5,1.08,1.0,42.3503,-71.081,49.88,49.22,Light Rain,Light rain until evening.,0.0246,1.0,0.93,3.38,3.38,1543755600,3.052,50.8,1543788000,44.97,1543816800,50.13,1543788000,45.62,1543816800,rain,48.02,1004.33,159,1.0,0,3.052,282.5,1543751798,1543785242,0.86,0.0916,1543770000,36.32,1543726800,50.8,1543788000,35.84,1543748400,50.13,1543788000
8,4f9fee41-fde3-4767-bbf1-a00e108701fb,1543818000.0,6,3,12,2018-12-03 06:28:02,America/New_York,Back Bay,Northeastern University,Lyft,lyft_line,Shared,3.0,1.08,1.0,42.3503,-71.081,45.58,45.58,Foggy,Foggy in the morning.,0.0,0.0,0.96,1.25,2.09,1543856400,1.413,57.02,1543852800,33.74,1543921200,56.35,1543852800,28.53,1543914000,fog,44.5,1001.06,307,1.0,0,1.413,290.9,1543838259,1543871628,0.89,0.0004,1543852800,43.09,1543896000,57.02,1543852800,39.9,1543896000,56.35,1543852800
9,8612d909-98b8-4454-a093-30bd48de0cb3,1543316000.0,10,27,11,2018-11-27 10:45:22,America/New_York,Back Bay,Northeastern University,Lyft,lyft_luxsuv,Lux Black XL,27.5,1.08,1.0,42.3503,-71.081,45.45,41.77,Light Rain,Light rain in the morning.,0.0624,1.0,0.93,6.87,7.42,1543338000,2.686,46.91,1543320000,33.82,1543399200,44.01,1543320000,30.19,1543399200,rain,43.52,989.98,79,1.0,0,2.686,296.2,1543319472,1543353352,0.68,0.1425,1543338000,36.34,1543377600,46.91,1543320000,32.43,1543377600,44.01,1543320000


In [202]:
# Check the format for each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 693071 entries, 0 to 693070
Data columns (total 57 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   id                           693071 non-null  object 
 1   timestamp                    693071 non-null  float64
 2   hour                         693071 non-null  int64  
 3   day                          693071 non-null  int64  
 4   month                        693071 non-null  int64  
 5   datetime                     693071 non-null  object 
 6   timezone                     693071 non-null  object 
 7   source                       693071 non-null  object 
 8   destination                  693071 non-null  object 
 9   cab_type                     693071 non-null  object 
 10  product_id                   693071 non-null  object 
 11  name                         693071 non-null  object 
 12  price                        637976 non-null  float64
 13 

In [203]:
# Count the number of unique values for each column
df.nunique()

id                             693071
timestamp                       36179
hour                               24
day                                17
month                               2
datetime                        31350
timezone                            1
source                             12
destination                        12
cab_type                            2
product_id                         13
name                               13
price                             147
distance                          549
surge_multiplier                    7
latitude                           11
longitude                          12
temperature                       308
apparentTemperature               319
short_summary                       9
long_summary                       11
precipIntensity                    63
precipProbability                  29
humidity                           51
windSpeed                         291
windGust                          286
windGustTime

In [204]:
# Count the rows in which the visibility and visibility.1 columns hold the same value
(df['visibility'] == df['visibility.1']).sum()

693071

Statement :
<br>
The number of rows in which visibility equals visibility.1 is the same as the total number of rows (693071), so the two columns are identical.
<br>
In addition, the number of unique values in the id column equals the number of non-null values in that column, which indicates that no id is duplicated in the dataset.
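
Both checks can also be written more directly. The sketch below uses a small made-up frame (the values are illustrative, not the real dataset) and assumes only pandas: `Series.equals` compares the two columns in one call, and `Series.duplicated` counts repeated ids.

```python
import pandas as pd

# Toy frame standing in for df; the values are illustrative only
toy = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd'],
    'visibility': [10.0, 4.786, 10.0, 8.325],
    'visibility.1': [10.0, 4.786, 10.0, 8.325],
})

# True only if every row matches (NaN in the same position counts as equal)
same_visibility = toy['visibility'].equals(toy['visibility.1'])

# Number of rows whose id already appeared earlier in the frame
n_duplicate_ids = int(toy['id'].duplicated().sum())

print(same_visibility, n_duplicate_ids)
```

Unlike the `==` comparison, `equals` treats missing values in the same position as equal, which matters if a column contains NaN.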

Description :
<br>
`First phase of column filtering`
<br>
Some of the columns contain data that is irrelevant to the price.
<br>
Therefore, these columns need to be removed.
<br>
The removed columns and the reason for each are listed below:
1. id: only an identifier, which does not affect the price.
2. timezone: contains a single value.
3. timestamp, hour, day, and month: the datetime column already represents the time.
4. latitude and longitude: source, destination, and distance already represent the location.
5. short_summary, long_summary, and icon: the other weather indicators (temperature, humidity, windSpeed, etc.) already represent the information in these summaries.
6. visibility.1: a duplicate of the visibility column.
7. every ...Time column, such as temperatureMinTime and apparentTemperatureMinTime: I was unable to interpret this data.
8. product_id: for the Uber cab_type, product_id holds identifier values that do not represent anything meaningful.
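
Some of the rules in this list can be checked programmatically. The following is a minimal sketch on a made-up frame (the rows are illustrative): it finds columns with a single unique value and columns whose name ends in `Time`. The names `constant_cols` and `time_cols` are hypothetical helpers, not part of the notebook.

```python
import pandas as pd

# Toy frame with a few columns named like those in the dataset
toy = pd.DataFrame({
    'timezone': ['America/New_York'] * 3,
    'distance': [0.44, 1.08, 1.08],
    'windGustTime': [1545015600, 1543291200, 1543334400],
    'sunriseTime': [1544962084, 1543232969, 1543319437],
})

# Columns holding a single unique value carry no information for the model
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]

# Columns whose name ends in 'Time' (Unix-timestamp style columns)
time_cols = [col for col in toy.columns if col.endswith('Time')]

print(constant_cols)
print(time_cols)
```

A name-based rule like this should be reviewed by hand: in this notebook, uvIndexTime matches the pattern but is kept in the dataset.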


In [205]:
# Remove the columns listed above
df = df.drop(columns = [
    'id', 'timezone', 'timestamp', 'hour', 'day', 'month',
    'latitude', 'longitude', 'short_summary', 'long_summary', 'icon',
    'visibility.1', 'product_id',
    'temperatureHighTime', 'temperatureLowTime',
    'apparentTemperatureHighTime', 'apparentTemperatureLowTime',
    'sunriseTime', 'sunsetTime',
    'temperatureMinTime', 'temperatureMaxTime',
    'apparentTemperatureMinTime', 'apparentTemperatureMaxTime',
    'windGustTime'])
# Brief check of the data type of each remaining column
df.dtypes

datetime                    object
source                      object
destination                 object
cab_type                    object
name                        object
price                      float64
distance                   float64
surge_multiplier           float64
temperature                float64
apparentTemperature        float64
precipIntensity            float64
precipProbability          float64
humidity                   float64
windSpeed                  float64
windGust                   float64
visibility                 float64
temperatureHigh            float64
temperatureLow             float64
apparentTemperatureHigh    float64
apparentTemperatureLow     float64
dewPoint                   float64
pressure                   float64
windBearing                  int64
cloudCover                 float64
uvIndex                      int64
ozone                      float64
moonPhase                  float64
precipIntensityMax         float64
uvIndexTime         

In [206]:
# Convert the datetime column from object to datetime
df['datetime'] = pd.to_datetime(df['datetime'])

In [207]:
# Check the number of missing values
df.isnull().sum().sort_values( ascending = False ) # Sort from the highest number to the lowest number

price                      55095
datetime                       0
ozone                          0
apparentTemperatureLow         0
dewPoint                       0
pressure                       0
windBearing                    0
cloudCover                     0
uvIndex                        0
moonPhase                      0
temperatureLow                 0
precipIntensityMax             0
uvIndexTime                    0
temperatureMin                 0
temperatureMax                 0
apparentTemperatureMin         0
apparentTemperatureHigh        0
temperatureHigh                0
source                         0
visibility                     0
windGust                       0
windSpeed                      0
humidity                       0
precipProbability              0
precipIntensity                0
apparentTemperature            0
temperature                    0
surge_multiplier               0
distance                       0
name                           0
cab_type  

Description :
<br>
Because the test data must not be manipulated, missing values and outliers are handled after the train/test split and before the model is trained.

#### 4. Data Preprocessing

In [208]:
# Get data for model inference
df_no_missing = df.dropna() # Keep only complete rows so that the inference set has no missing values
data_inf = df_no_missing.sample(15, random_state = 1) # 15 rows are held out as the inference set

# Remove Inference-Set from Dataset
data_train_test = df.drop(data_inf.index)

In [209]:
# Reset the index of each dataset to reduce the chance of index-related errors in later cells
data_train_test.reset_index(drop = True , inplace = True)
data_inf.reset_index(drop = True , inplace = True)

In [210]:
# Splitting the Features and Target
X = data_train_test.drop(['price'], axis = 1) # Features
y = pd.DataFrame(data_train_test['price']) # Target

In [211]:
# Splitting between Train-Set and Test-Set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1) # 20% of overall data will be the test dataset
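
As a sanity check on the split sizes, the sketch below runs `train_test_split` on a placeholder array with one entry per row of `data_train_test`, that is, the 693071 original rows minus the 15 inference rows. The array `rows` is only a stand-in for the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder with one entry per row of data_train_test (693071 - 15 rows)
rows = np.arange(693071 - 15)

rows_train, rows_test = train_test_split(rows, test_size = 0.2, random_state = 1)

print(len(rows_train), len(rows_test))
```

The test size obtained this way should agree with the counts the notebook reports for `y_test`: 127501 non-null prices plus 11111 null prices, or 138612 rows.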

In [212]:
# Checking the number of missing values for every dataset (1)
X_train.isnull().sum().sort_values( ascending = False ) # Feature train set

datetime                   0
source                     0
apparentTemperatureMin     0
temperatureMax             0
temperatureMin             0
uvIndexTime                0
precipIntensityMax         0
moonPhase                  0
ozone                      0
uvIndex                    0
cloudCover                 0
windBearing                0
pressure                   0
dewPoint                   0
apparentTemperatureLow     0
apparentTemperatureHigh    0
temperatureLow             0
temperatureHigh            0
visibility                 0
windGust                   0
windSpeed                  0
humidity                   0
precipProbability          0
precipIntensity            0
apparentTemperature        0
temperature                0
surge_multiplier           0
distance                   0
name                       0
cab_type                   0
destination                0
apparentTemperatureMax     0
dtype: int64

In [213]:
# Checking the number of missing values for every dataset (2)
X_test.isnull().sum().sort_values( ascending = False ) # Feature test set

datetime                   0
source                     0
apparentTemperatureMin     0
temperatureMax             0
temperatureMin             0
uvIndexTime                0
precipIntensityMax         0
moonPhase                  0
ozone                      0
uvIndex                    0
cloudCover                 0
windBearing                0
pressure                   0
dewPoint                   0
apparentTemperatureLow     0
apparentTemperatureHigh    0
temperatureLow             0
temperatureHigh            0
visibility                 0
windGust                   0
windSpeed                  0
humidity                   0
precipProbability          0
precipIntensity            0
apparentTemperature        0
temperature                0
surge_multiplier           0
distance                   0
name                       0
cab_type                   0
destination                0
apparentTemperatureMax     0
dtype: int64

In [214]:
# Checking the number of missing values for every dataset (3)
y_train.isnull().sum() # Target train set

price    43984
dtype: int64

In [215]:
# Checking the number of missing values for every dataset (4)
y_test.isnull().sum() # Target test set

price    11111
dtype: int64

Statement :
<br>
The feature train and test sets contain no null values.
<br>
The target, however, contains 43984 null values in the train set and 11111 null values in the test set.
<br>
Out of several hundred thousand rows, tens of thousands have a null price, so dropping those rows would discard a substantial amount of data. Therefore, several ways of filling the null values are tried instead.
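
To put these counts in proportion, the share of rows with a missing price can be worked out from numbers already reported in the notebook (693071 rows in total, 15 of them held out for inference). This is only arithmetic on the reported counts.

```python
# Counts reported in the notebook outputs
n_rows = 693071 - 15          # rows in data_train_test after removing the inference set
n_missing_train = 43984       # null prices in y_train
n_missing_test = 11111        # null prices in y_test

n_missing = n_missing_train + n_missing_test
share_missing = n_missing / n_rows * 100

print(n_missing)                 # 55095, the count df.isnull().sum() reported for price
print(round(share_missing, 2))   # about 7.95 (% of rows)
```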

In [216]:
# Check the distribution of the target before and after splitting
# Note: a notebook cell renders only its last expression, so the output below is y_test.describe()
data_train_test['price'].describe()
y_train.describe()
y_test.describe()

Unnamed: 0,price
count,127501.0
mean,16.542926
std,9.319643
min,2.5
25%,9.0
50%,13.5
75%,22.5
max,92.0


In [217]:
# Check the skewness of the target before and after splitting
print('Skewness Value before splitting =', data_train_test['price'].skew())
print('Skewness Value after splitting (train set) =', y_train.skew())
print('Skewness Value after splitting (test set) =', y_test.skew())

Skewness Value before splitting = 1.0457373430289802
Skewness Value after splitting (train set) = price    1.047316
dtype: float64
Skewness Value after splitting (test set) = price    1.039417
dtype: float64


Statement :
<br>
The mean, standard deviation, minimum, maximum, Q1, median (Q2), and Q3 of the target differ only slightly before and after splitting, so the data distributions can be considered the same.
<br>
The skewness values of the three datasets are also very close to each other (about 1.04 to 1.05), which supports that conclusion.
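
Because a notebook cell renders only its last expression, comparing several `describe()` results is easier when they are placed side by side. The sketch below does that on synthetic prices (randomly generated, not the real data); the column labels `all`, `train`, and `test` are chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic right-skewed prices standing in for the real target
rng = np.random.default_rng(1)
price = pd.Series(rng.gamma(shape = 3.0, scale = 5.0, size = 1000), name = 'price')

price_train, price_test = train_test_split(price, test_size = 0.2, random_state = 1)

# One column of summary statistics per dataset, plus skewness as an extra row
summary = pd.concat(
    [price.describe(), price_train.describe(), price_test.describe()],
    axis = 1, keys = ['all', 'train', 'test'])
summary.loc['skew'] = [price.skew(), price_train.skew(), price_test.skew()]

print(summary)
```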

#### 4. Data Preprocessing I - Fill NA value
The first trial fills the null target values with the mean.

In [218]:
# Fill the null value for every target dataset
y_train_mean = y_train.fillna(y_train.mean())
y_test_mean = y_test.fillna(y_test.mean())

In [219]:
# Cap outliers in the train target (IQR rule) instead of dropping them, so that no rows are removed
windsoriser_mean = Winsorizer(capping_method='iqr', tail = 'both', fold = 1.5)
windsoriser_mean.fit(y_train_mean)
y_train_mean_final = windsoriser_mean.transform(y_train_mean)
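
For intuition, IQR capping with `fold = 1.5` replaces values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR with those limits. The sketch below reproduces that rule by hand with pandas on five illustrative prices; it is meant to mirror what the Winsorizer is configured to do here, not to replace it.

```python
import pandas as pd

# Five illustrative prices; 92.0 is far above the rest
price = pd.Series([2.5, 9.0, 13.5, 22.5, 92.0])

q1, q3 = price.quantile(0.25), price.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr   # values below this limit are raised to it
upper = q3 + 1.5 * iqr   # values above this limit are lowered to it

capped = price.clip(lower = lower, upper = upper)
print(lower, upper)
print(capped.tolist())
```

Only the largest value is changed in this example; the number of rows stays the same, which is the reason for capping rather than trimming.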

In [220]:
# Check the correlation for the second phase of column filtering
df_mean = pd.concat([X_train, y_train_mean_final], axis = 1)
mean_corr_linear = df_mean.corr(method = 'pearson', numeric_only = True) # Linear correlation; numeric_only (pandas >= 1.5) skips the object and datetime columns
abs(mean_corr_linear['price']).sort_values(ascending = False)

price                      1.000000
distance                   0.328607
surge_multiplier           0.202508
visibility                 0.002116
humidity                   0.001775
apparentTemperatureLow     0.001597
pressure                   0.001417
moonPhase                  0.001345
temperatureMin             0.001157
apparentTemperatureMin     0.000922
temperatureLow             0.000767
windBearing                0.000763
uvIndexTime                0.000749
dewPoint                   0.000747
precipProbability          0.000744
windGust                   0.000726
precipIntensity            0.000458
temperatureHigh            0.000411
temperatureMax             0.000388
windSpeed                  0.000317
apparentTemperature        0.000273
apparentTemperatureMax     0.000198
uvIndex                    0.000198
temperature                0.000190
cloudCover                 0.000172
apparentTemperatureHigh    0.000114
precipIntensityMax         0.000095
ozone                      0

Statement :
<br>
Unfortunately, no column has a high linear correlation with the price column. Therefore, the columns with the highest correlation are chosen.

In [221]:
# distance and surge_multiplier are chosen as numerical features, and all remaining categorical columns are kept
feature_mean = ['distance' , 'surge_multiplier' , 'source' , 'destination' , 'cab_type' , 'name']
X_train_final_mean = X_train[feature_mean]
X_test_final_mean = X_test[feature_mean]

In [222]:
# Split the numerical and categorical column
num_columns_mean = X_train_final_mean.select_dtypes(include=np.number).columns.tolist()
cat_columns_mean = X_train_final_mean.select_dtypes(include=['object']).columns.tolist()

# For numerical column
X_train_mean_num = X_train_final_mean[num_columns_mean]
X_test_mean_num = X_test_final_mean[num_columns_mean]

# For categorical column
X_train_mean_cat = X_train_final_mean[cat_columns_mean]
X_test_mean_cat = X_test_final_mean[cat_columns_mean]

In [223]:
# Check train set data distribution
X_train_mean_num.skew()

distance            0.836915
surge_multiplier    8.297322
dtype: float64

Statement :
<br>
Min-max normalization is used to scale the numerical features, because their skewness values (greater than 0.5 or less than -0.5) indicate that the data is not normally distributed.
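
Min-max normalization rescales each feature with x' = (x - min) / (max - min), where min and max are taken from the train set. The sketch below checks that formula against `MinMaxScaler` on a few illustrative values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative train and test values for one numerical feature
train = np.array([[0.44], [1.08], [2.50], [7.86]])
test = np.array([[0.02], [3.00]])

scaler = MinMaxScaler().fit(train)   # learns min and max from the train set only

# Same transformation written out by hand
manual_test = (test - train.min(axis = 0)) / (train.max(axis = 0) - train.min(axis = 0))

print(scaler.transform(test).ravel())
print(manual_test.ravel())
```

Test values outside the train range, such as 0.02 here, map outside [0, 1]; this is expected when the scaler is fitted on the train set only.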

#### 4. Data Preprocessing I - Feature Scaling

In [224]:
# Perform a value scaling
scaler_mean = MinMaxScaler()
scaler_mean.fit(X_train_mean_num)

# Making new dataset that contains a scaling value
X_train_mean_num_scaled = scaler_mean.transform(X_train_mean_num)
X_test_mean_num_scaled = scaler_mean.transform(X_test_mean_num)

#### 4. Data Preprocessing I - Feature Encoding

In [225]:
# Define the encoder for the categorical columns (one category list per column: source, destination, cab_type, name)
encoder_mean = OrdinalEncoder(categories = [['Theatre District', 'Haymarket Square', 'North End', 'Beacon Hill', 'Financial District', 'South Station', 'West End', 'Back Bay', 'Northeastern University', 'Fenway', 'Boston University', 'North Station'],
                                           ['South Station', 'Financial District', 'Haymarket Square', 'North Station', 'Boston University', 'Back Bay', 'Northeastern University', 'Theatre District', 'Beacon Hill', 'North End', 'West End', 'Fenway'],
                                           ['Uber', 'Lyft'],
                                           ['Black SUV', 'Lux Black XL', 'UberPool', 'WAV', 'Lyft XL', 'Lux', 'Black', 'Lyft', 'Taxi', 'UberX', 'UberXL', 'Shared', 'Lux Black']])

encoder_mean.fit(X_train_mean_cat) # Fit the encoder on the train set

# Making a new encoding dataset
X_train_mean_cat_encoded = encoder_mean.transform(X_train_mean_cat)
X_test_mean_cat_encoded = encoder_mean.transform(X_test_mean_cat)
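
With an explicit `categories` list, `OrdinalEncoder` maps each category to its position in that list. A small sketch with the two `cab_type` values makes the mapping visible; the toy frame is illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({'cab_type': ['Lyft', 'Uber', 'Lyft']})

# 'Uber' is listed first, so it becomes 0.0 and 'Lyft' becomes 1.0
encoder = OrdinalEncoder(categories = [['Uber', 'Lyft']])
encoded = encoder.fit_transform(toy)

print(encoded.ravel())   # [1. 0. 1.]
```

A linear model treats these codes as numbers, so the order of each list implies an ordering between categories. One-hot encoding is a common alternative when no natural order exists.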

In [226]:
# Combine the numerical and categorical columns
X_train_mean_endmost = np.concatenate([X_train_mean_num_scaled, X_train_mean_cat_encoded], axis=1)
X_test_mean_endmost = np.concatenate([X_test_mean_num_scaled, X_test_mean_cat_encoded], axis=1)

In [227]:
# Build a DataFrame from the final train features
df_X_train_mean_endmost = pd.DataFrame(X_train_mean_endmost, columns=num_columns_mean + cat_columns_mean) # Pass a flat list; wrapping it in another list creates MultiIndex columns

# Brief check
df_X_train_mean_endmost

Unnamed: 0,distance,surge_multiplier,source,destination,cab_type,name
0,0.067602,0.0,0.0,0.0,0.0,0.0
1,0.128827,0.0,1.0,1.0,1.0,1.0
2,0.133929,0.0,2.0,1.0,0.0,0.0
3,0.167092,0.0,3.0,2.0,1.0,1.0
4,0.156888,0.0,4.0,2.0,0.0,2.0
...,...,...,...,...,...,...
554439,0.352041,0.0,11.0,6.0,0.0,8.0
554440,0.177296,0.0,7.0,4.0,0.0,0.0
554441,0.059949,0.0,1.0,3.0,0.0,10.0
554442,0.190051,0.0,7.0,4.0,0.0,10.0


#### 5. Model Definition and Training I
<br>
A linear regression model is trained and then used to predict the values in the test set.

In [228]:
# Model Definition
model_lin_reg_mean = LinearRegression() # Linear Regression Model

# Model Training
model_lin_reg_mean.fit(X_train_mean_endmost, y_train_mean)

#### 6. Model Evaluation I
<br>
Mean Absolute Error (MAE) is used as the metric because the outliers were handled with the capping method.
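
MAE is the average absolute difference between the actual and the predicted values. The sketch below computes it by hand and with `mean_absolute_error` on made-up numbers.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up actual and predicted prices
y_true = np.array([5.0, 11.0, 7.0, 26.0])
y_pred = np.array([6.5, 10.0, 9.0, 20.0])

mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)   # signature is (y_true, y_pred)

print(mae_manual, mae_sklearn)
```

Because the absolute difference is symmetric, swapping the two arguments gives the same MAE, although passing the actual values first matches the function signature.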

In [229]:
# Model Evaluation
y_pred_mean_train = model_lin_reg_mean.predict(X_train_mean_endmost) # Predict train set
y_pred_mean_test = model_lin_reg_mean.predict(X_test_mean_endmost) # Predict test set
MAE_mean_train_pred = mean_absolute_error(y_train_mean, y_pred_mean_train)
MAE_mean_test_pred = mean_absolute_error(y_test_mean, y_pred_mean_test)
print('Error - Train Set : ', MAE_mean_train_pred)
print('Error - Test Set  : ', MAE_mean_test_pred)
print('Error Gap (%) :', (MAE_mean_test_pred - MAE_mean_train_pred) / MAE_mean_test_pred * 100)

Error - Train Set :  6.530336402998687
Error - Test Set  :  6.538314610290558
Error Gap (%) : 0.12202238294429088
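
The error gap is the relative difference between the test and train MAE. Recomputing it from the two printed values:

```python
# MAE values printed by the evaluation cell
mae_train = 6.530336402998687
mae_test = 6.538314610290558

gap_pct = (mae_test - mae_train) / mae_test * 100
print(gap_pct)   # about 0.122 (%)
```

A gap this small suggests that the model is not overfitting; it says nothing about whether the error itself is acceptable.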


In [230]:
# Express the error as a percentage of the target range in the train and test sets
range_y_train_mean = y_train_mean.max() - y_train_mean.min()
range_y_test_mean = y_test_mean.max() - y_test_mean.min()
print('Train Prediction Error (%):', MAE_mean_train_pred / range_y_train_mean * 100)
print('Test Prediction Error (%):', MAE_mean_test_pred / range_y_test_mean * 100)

Train Prediction Error (%): price    6.874038
dtype: float64
Test Prediction Error (%): price    7.305379
dtype: float64
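
These percentages divide the MAE by the range of the target. The test value can be reproduced from numbers already shown, assuming the earlier `describe()` output (count 127501) is that of `y_test`: it reports a minimum of 2.5 and a maximum of 92.0, and filling nulls with the mean changes neither.

```python
# Values taken from earlier outputs in this notebook
mae_test = 6.538314610290558
price_min, price_max = 2.5, 92.0   # from the describe() output of the test target

error_pct = mae_test / (price_max - price_min) * 100
print(round(error_pct, 6))   # 7.305379, matching the printed test value
```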


#### 4. Data Preprocessing II - Fill NA value
The second trial fills the null target values with the median.

In [231]:
# Fill the null value for every target dataset
y_train_med = y_train.fillna(y_train.median())
y_test_med = y_test.fillna(y_test.median())

In [232]:
# Cap outliers in the train target (IQR rule) instead of dropping them, so that no rows are removed
windsoriser_med = Winsorizer(capping_method='iqr', tail = 'both', fold = 1.5)
windsoriser_med.fit(y_train_med)
y_train_med_final = windsoriser_med.transform(y_train_med)

In [233]:
# Check the correlation for the second phase of column filtering
df_med = pd.concat([X_train, y_train_med_final], axis = 1)
med_corr_linear = df_med.corr(method = 'pearson', numeric_only = True) # Linear correlation; numeric_only (pandas >= 1.5) skips the object and datetime columns
abs(med_corr_linear['price']).sort_values(ascending = False)

price                      1.000000
distance                   0.327093
surge_multiplier           0.205852
visibility                 0.002291
humidity                   0.001868
apparentTemperatureLow     0.001607
pressure                   0.001402
moonPhase                  0.001250
temperatureMin             0.001233
apparentTemperatureMin     0.001005
dewPoint                   0.000878
precipProbability          0.000855
temperatureLow             0.000724
windGust                   0.000708
uvIndexTime                0.000603
windBearing                0.000594
precipIntensity            0.000512
temperatureHigh            0.000472
temperatureMax             0.000470
cloudCover                 0.000382
windSpeed                  0.000328
uvIndex                    0.000203
apparentTemperature        0.000147
apparentTemperatureMax     0.000146
apparentTemperatureHigh    0.000087
ozone                      0.000085
temperature                0.000062
precipIntensityMax         0

In [234]:
med_corr_non_linear = df_med.corr(method = 'spearman', numeric_only = True) # Rank (monotonic, possibly non-linear) correlation
abs(med_corr_non_linear['price']).sort_values(ascending = False)

price                      1.000000
distance                   0.311207
surge_multiplier           0.163776
pressure                   0.001706
apparentTemperatureLow     0.001703
moonPhase                  0.001660
humidity                   0.001226
visibility                 0.000966
temperatureLow             0.000957
temperatureMin             0.000830
uvIndex                    0.000783
cloudCover                 0.000665
precipIntensityMax         0.000639
precipProbability          0.000562
apparentTemperatureHigh    0.000538
ozone                      0.000531
precipIntensity            0.000524
apparentTemperatureMax     0.000479
apparentTemperatureMin     0.000441
temperatureHigh            0.000422
windBearing                0.000375
dewPoint                   0.000351
temperatureMax             0.000305
apparentTemperature        0.000293
temperature                0.000222
uvIndexTime                0.000044
windSpeed                  0.000028
windGust                   0

Statement :
<br>
Unfortunately, no column has a high correlation with the price column, either linear (Pearson) or non-linear (Spearman). Therefore, the columns with the highest correlation are chosen.
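
The difference between the two coefficients is easiest to see on a relationship that is monotonic but not linear. In the sketch below (toy data with `y = x ** 3`), Spearman's rank correlation is 1 while Pearson's is lower.

```python
import numpy as np
import pandas as pd

# Toy data: y grows monotonically with x, but not along a straight line
toy = pd.DataFrame({'x': np.arange(1, 11, dtype = float)})
toy['y'] = toy['x'] ** 3

pearson = toy.corr(method = 'pearson').loc['x', 'y']
spearman = toy.corr(method = 'spearman').loc['x', 'y']

print(pearson, spearman)
```

In this notebook both coefficients are close to zero for every weather column, so neither a linear nor a monotonic relationship with price is visible for them.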

In [235]:
# distance and surge_multiplier are chosen as numerical features, and all remaining categorical columns are kept
feature_med = ['distance' , 'surge_multiplier' , 'source' , 'destination' , 'cab_type' , 'name']
X_train_final_med = X_train[feature_med]
X_test_final_med = X_test[feature_med]

In [236]:
# Split the numerical and categorical column
num_columns_med = X_train_final_med.select_dtypes(include=np.number).columns.tolist()
cat_columns_med = X_train_final_med.select_dtypes(include=['object']).columns.tolist()

# For numerical column
X_train_med_num = X_train_final_med[num_columns_med]
X_test_med_num = X_test_final_med[num_columns_med]

# For categorical column
X_train_med_cat = X_train_final_med[cat_columns_med]
X_test_med_cat = X_test_final_med[cat_columns_med]

In [237]:
# Check train set data distribution
X_train_med_num.skew()

distance            0.836915
surge_multiplier    8.297322
dtype: float64

Statement :
<br>
Min-max normalization is used to scale the numerical features, because their skewness values (greater than 0.5 or less than -0.5) indicate that the data is not normally distributed.

#### 4. Data Preprocessing II - Feature Scaling

In [238]:
# Perform a value scaling
scaler_med = MinMaxScaler()
scaler_med.fit(X_train_med_num)

# Making new dataset that contains a scaling value
X_train_med_num_scaled = scaler_med.transform(X_train_med_num)
X_test_med_num_scaled = scaler_med.transform(X_test_med_num)

#### 4. Data Preprocessing II - Feature Encoding

In [239]:
# Define the encoder for the categorical columns (one category list per column: source, destination, cab_type, name)
encoder_med = OrdinalEncoder(categories = [['Theatre District', 'Haymarket Square', 'North End', 'Beacon Hill', 'Financial District', 'South Station', 'West End', 'Back Bay', 'Northeastern University', 'Fenway', 'Boston University', 'North Station'],
                                           ['South Station', 'Financial District', 'Haymarket Square', 'North Station', 'Boston University', 'Back Bay', 'Northeastern University', 'Theatre District', 'Beacon Hill', 'North End', 'West End', 'Fenway'],
                                           ['Uber', 'Lyft'],
                                           ['Black SUV', 'Lux Black XL', 'UberPool', 'WAV', 'Lyft XL', 'Lux', 'Black', 'Lyft', 'Taxi', 'UberX', 'UberXL', 'Shared', 'Lux Black']])

encoder_med.fit(X_train_med_cat) # Fit the encoder on the train set

# Making a new encoding dataset
X_train_med_cat_encoded = encoder_med.transform(X_train_med_cat)
X_test_med_cat_encoded = encoder_med.transform(X_test_med_cat)

In [240]:
# Combine the numerical and categorical columns
X_train_med_endmost = np.concatenate([X_train_med_num_scaled, X_train_med_cat_encoded], axis=1)
X_test_med_endmost = np.concatenate([X_test_med_num_scaled, X_test_med_cat_encoded], axis=1)

In [241]:
# Build a DataFrame from the final train features
df_X_train_med_endmost = pd.DataFrame(X_train_med_endmost, columns=num_columns_med + cat_columns_med) # Pass a flat list; wrapping it in another list creates MultiIndex columns

# Brief check
df_X_train_med_endmost

Unnamed: 0,distance,surge_multiplier,source,destination,cab_type,name
0,0.067602,0.0,0.0,0.0,0.0,0.0
1,0.128827,0.0,1.0,1.0,1.0,1.0
2,0.133929,0.0,2.0,1.0,0.0,0.0
3,0.167092,0.0,3.0,2.0,1.0,1.0
4,0.156888,0.0,4.0,2.0,0.0,2.0
...,...,...,...,...,...,...
554439,0.352041,0.0,11.0,6.0,0.0,8.0
554440,0.177296,0.0,7.0,4.0,0.0,0.0
554441,0.059949,0.0,1.0,3.0,0.0,10.0
554442,0.190051,0.0,7.0,4.0,0.0,10.0


#### 5. Model Definition and Training II
<br>
A Linear Regression model is trained on the train set and then used to predict the prices in the test set.

In [242]:
# Model Definition
model_lin_reg_med = LinearRegression() # Linear Regression Model

# Model Training
model_lin_reg_med.fit(X_train_med_endmost, y_train_med)

#### 6. Model Evaluation II
<br>
Mean Absolute Error (MAE) is used as the metric because the outliers were handled with the capping method.
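
For reference, with $y_i$ the actual price, $\hat{y}_i$ the predicted price, and $n$ the number of rows, MAE is defined as:

$$
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|
$$

The result is in the same unit as the price, so an MAE of 6.45 means the prediction is off by 6.45 price units on average.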

In [243]:
# Model Evaluation
y_pred_med_train = model_lin_reg_med.predict(X_train_med_endmost) # Predict train set
y_pred_med_test = model_lin_reg_med.predict(X_test_med_endmost) # Predict test set
MAE_med_train_pred = mean_absolute_error(y_train_med, y_pred_med_train) # Signature is (y_true, y_pred); MAE is symmetric, so the value is unchanged
MAE_med_test_pred = mean_absolute_error(y_test_med, y_pred_med_test)
print('Error - Train Set : ', MAE_med_train_pred)
print('Error - Test Set  : ', MAE_med_test_pred)
print('Error Gap (%) :', (MAE_med_test_pred - MAE_med_train_pred) / MAE_med_test_pred * 100)

Error - Train Set :  6.44544553745487
Error - Test Set  :  6.450130585162604
Error Gap (%) : 0.0726349280200802


In [244]:
# Express each MAE as a percentage of the price range in its own set
range_y_train_med = y_train_med.max() - y_train_med.min()
range_y_test_med = y_test_med.max() - y_test_med.min()
print('Train Prediction Error (%):', MAE_med_train_pred / range_y_train_med * 100)
print('Test Prediction Error (%):', MAE_med_test_pred / range_y_test_med * 100)

Train Prediction Error (%): price    6.78468
dtype: float64
Test Prediction Error (%): price    7.20685
dtype: float64


Statement :
<br>
The gap between the train and test MAE is less than 1% (about 0.07%), and the MAE is about 7% of the price range in each set (6.78% for train, 7.21% for test), which is well under 30%. Therefore, the model is considered a good fit.
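
The two ratios behind this statement can be wrapped in a small helper. This is a minimal sketch on made-up numbers; `fit_report` and its dictionary keys are hypothetical names, not part of the notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error


def fit_report(y_train, pred_train, y_test, pred_test):
    """Return the train/test MAE, their relative gap, and MAE as % of target range."""
    # Flatten so that single-column DataFrames also yield plain floats
    y_train = np.asarray(y_train, dtype=float).ravel()
    y_test = np.asarray(y_test, dtype=float).ravel()
    pred_train = np.asarray(pred_train, dtype=float).ravel()
    pred_test = np.asarray(pred_test, dtype=float).ravel()

    mae_train = mean_absolute_error(y_train, pred_train)
    mae_test = mean_absolute_error(y_test, pred_test)
    return {
        "mae_train": mae_train,
        "mae_test": mae_test,
        # Same formula as the notebook: gap relative to the test MAE
        "gap_pct": (mae_test - mae_train) / mae_test * 100,
        "train_range_pct": mae_train / (y_train.max() - y_train.min()) * 100,
        "test_range_pct": mae_test / (y_test.max() - y_test.min()) * 100,
    }


# Toy values (hypothetical), not taken from the ride-hailing dataset
report = fit_report(
    y_train=[10, 20, 30, 40], pred_train=[12, 18, 33, 37],
    y_test=[15, 25, 35], pred_test=[18, 22, 38],
)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

Flattening the inputs means the percentages come back as plain floats rather than as one-element Series like `price    6.78468`.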

Statement about Median and Mean as missing-value imputation :
<br>
Both give a good fit based on the MAE metric, but imputing missing values with the median gives slightly better results than the mean: the MAE is lower for both the train and test predictions, and the range ratios are correspondingly lower.

In [245]:
# Save the model trained on median-imputed data, with its scaler, encoder, and column lists
with open('model_lin_reg.pkl', 'wb') as file_1:
  joblib.dump(model_lin_reg_med, file_1)

with open('model_scaler.pkl', 'wb') as file_2:
  joblib.dump(scaler_med, file_2)

with open('model_encoder.pkl', 'wb') as file_3:
  joblib.dump(encoder_med, file_3)

with open('list_num_cols.txt', 'w') as file_4:
  json.dump(num_columns_med, file_4)

with open('list_cat_cols.txt', 'w') as file_5:
  json.dump(cat_columns_med, file_5)

#### 7. Model Inference

In [246]:
# Load the model
with open('model_lin_reg.pkl', 'rb') as file_1:
  model_lin_reg_med = joblib.load(file_1)

with open('model_scaler.pkl', 'rb') as file_2:
  model_scaler_med = joblib.load(file_2)

with open('model_encoder.pkl', 'rb') as file_3:
  model_encoder_med = joblib.load(file_3)

with open('list_num_cols.txt', 'r') as file_4:
  list_num_cols_med = json.load(file_4)

with open('list_cat_cols.txt', 'r') as file_5:
  list_cat_cols_med = json.load(file_5)

In [247]:
# Split the inference set into numerical and categorical columns, using the loaded column lists
data_inf_num = data_inf[list_num_cols_med]
data_inf_cat = data_inf[list_cat_cols_med]

# Brief check of the numerical columns of the inference set
data_inf_num

Unnamed: 0,distance,surge_multiplier
0,2.16,1.0
1,2.25,1.0
2,1.56,1.0
3,0.74,1.0
4,3.14,1.0
5,2.87,1.0
6,2.52,1.0
7,3.04,1.0
8,2.12,1.0
9,2.66,1.0


In [248]:
# Scale and encode the features with the loaded scaler and encoder
data_inf_num_scaled = model_scaler_med.transform(data_inf_num)
data_inf_cat_encoded = model_encoder_med.transform(data_inf_cat)

# Recombine all columns
data_inf_final = np.concatenate([data_inf_num_scaled, data_inf_cat_encoded], axis=1)

In [249]:
# Predict using Linear Regression Method
y_pred_inf = model_lin_reg_med.predict(data_inf_final)

In [250]:
# Making new DataFrame with predict results
y_pred_inf_df = pd.DataFrame(y_pred_inf, columns=['price (Prediction)'])

In [251]:
# Combine the inference set and the prediction results for comparison
df_inf_with_pred = pd.concat([data_inf, y_pred_inf_df], axis=1)

# Keep only the actual and predicted price columns of the inference set
compare_last = ['price', 'price (Prediction)']
df_inf_compare = df_inf_with_pred[compare_last]
df_inf_compare # Show the dataframe

Unnamed: 0,price,price (Prediction)
0,16.5,19.111963
1,10.5,16.598414
2,8.0,10.365688
3,3.5,9.052911
4,26.0,14.398244
5,16.5,20.62647
6,16.5,20.112434
7,10.5,14.979739
8,16.5,18.066459
9,17.5,12.6004


Statement :
<br>
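Unfortunately, the gap between each predicted price and the actual price still feels too large. For example, row 4 has an actual price of 26.0 but a predicted price of about 14.40.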
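
To put a number on that impression, the ten rows shown above can be scored directly. The sketch below copies the values from the comparison table; the `abs_error` column name is an illustrative choice, not part of the original notebook.

```python
import pandas as pd

# Actual and predicted prices copied from the comparison table above
compare = pd.DataFrame({
    "price": [16.5, 10.5, 8.0, 3.5, 26.0, 16.5, 16.5, 10.5, 16.5, 17.5],
    "price (Prediction)": [
        19.111963, 16.598414, 10.365688, 9.052911, 14.398244,
        20.62647, 20.112434, 14.979739, 18.066459, 12.6004,
    ],
})

# Absolute error per row (hypothetical helper column)
compare["abs_error"] = (compare["price"] - compare["price (Prediction)"]).abs()

mae_inf = compare["abs_error"].mean()       # MAE over the ten inference rows
worst_row = compare["abs_error"].idxmax()   # index of the largest miss

print(compare)
print("MAE on inference rows:", round(mae_inf, 3))
print("Row with the largest error:", worst_row)
```

With only ten rows this is a spot check rather than an evaluation, but it puts the visual impression on the same scale as the train and test MAE.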

#### 8. Overall Analysis / Conclusion
Even though the model is considered a good fit by the MAE metrics, the gap between the predicted and actual prices still feels too large to call it a real fit.
<br>
My guess is that there is data leakage, since the null values in the train and test sets were handled in the same way.
<br>
If the data were split into train and test sets after the rows with null values had been separated out, the model would likely give better results.
<br>
<br>
`Illustration`
<br>
Train set -> contains all of the null data
<br>
Test set -> contains no null data
<br>
Inference set -> contains no null data
<br>
<br>
Moreover, no column has a strong correlation (above 0.75 or below -0.75) with the target column.
<br>
Further trials, data exploration, and model adjustments are needed to obtain a model with better results.
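
The split described in the illustration above could look roughly like the sketch below. It is untested on the real dataset and rests on assumptions: the nulls are taken to be in the `price` column, the toy values are made up, and names such as `null_rows` and `complete_rows` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical values); `price` is assumed to be the column with nulls
toy = pd.DataFrame({
    "distance": [0.5, 1.1, 2.3, 3.0, 1.7, 2.8, 0.9, 1.4, 2.0, 3.3],
    "price": [5.0, np.nan, 16.5, 22.0, np.nan, 19.0, 7.0, 9.5, np.nan, 26.0],
})

# 1. Separate the rows with a null price before any splitting
null_rows = toy[toy["price"].isna()]
complete_rows = toy.dropna(subset=["price"])

# 2. Split only the complete rows, so the test set never holds imputed prices
train_part, test_set = train_test_split(
    complete_rows, test_size=0.3, random_state=42
)

# 3. Impute the null rows with a statistic computed from the train part only
train_median = train_part["price"].median()
null_rows_filled = null_rows.fillna({"price": train_median})

# 4. The train set receives all of the formerly null rows
train_set = pd.concat([train_part, null_rows_filled]).reset_index(drop=True)

print("Train rows:", len(train_set), "| Test rows:", len(test_set))
print("Nulls left in test set:", int(test_set["price"].isna().sum()))
```

An inference set would be sampled from `complete_rows` in the same way as the test set. Whether this actually narrows the gap between predicted and actual prices would still need to be checked.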