# **Deployment Getaround**

## **Project**
When using Getaround, drivers book cars for a specific time period, from an hour to a few days long. They are supposed to return the car on time, but drivers are sometimes late for checkout.

Late returns at checkout can generate high friction for the next driver if the car was supposed to be rented again on the same day: customer service often reports users who were unsatisfied because they had to wait for the car to come back from the previous rental, or who even had to cancel their rental because the car wasn’t returned on time.

To mitigate these issues, we’ve decided to implement a minimum delay between two rentals. A car won’t be displayed in the search results if the requested checkin or checkout time is too close to an already booked rental.
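The rule above can be sketched in a few lines. This is only an illustration of the idea, not Getaround's implementation: the function name `is_listed`, the booking representation and the 60-minute delay are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def is_listed(requested_checkin, requested_checkout, bookings, min_delay):
    """Return True if the car may be shown for the requested window.

    bookings: list of (checkin, checkout) datetime pairs already booked.
    min_delay: timedelta that must separate two consecutive rentals.
    """
    for booked_checkin, booked_checkout in bookings:
        # The request must either end min_delay before the booking starts,
        # or start min_delay after the booking ends.
        ends_early_enough = requested_checkout + min_delay <= booked_checkin
        starts_late_enough = requested_checkin >= booked_checkout + min_delay
        if not (ends_early_enough or starts_late_enough):
            return False
    return True


# Hypothetical data: one existing rental from 09:00 to 12:00.
bookings = [(datetime(2024, 1, 10, 9, 0), datetime(2024, 1, 10, 12, 0))]
delay = timedelta(minutes=60)

# Request starting 30 minutes after the previous checkout: too close.
too_close = is_listed(datetime(2024, 1, 10, 12, 30), datetime(2024, 1, 10, 15, 0), bookings, delay)
# Request starting 90 minutes after the previous checkout: far enough.
far_enough = is_listed(datetime(2024, 1, 10, 13, 30), datetime(2024, 1, 10, 15, 0), bookings, delay)
print(too_close, far_enough)
```

A longer `min_delay` hides more cars from search results, which is exactly the revenue trade-off the dashboard is meant to quantify.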

This solves the late checkout issue but may also hurt Getaround's and owners' revenue: we need to find the right trade-off.

## **Goals**

This project aims to build an online dashboard with data analysis to understand how long the minimum delay should be and what the scope of the feature should be (Connect cars only, Connect and mobile, ...).
The second part is a machine learning model that predicts the rental price for car owners. It will be stored and made accessible via an API.

This notebook consists of:

1. **Exploratory Data Analysis (EDA) of the Dataset**

2. **Data Preprocessing, Model Training, and Performance Analysis**

3. **Feature Importance and production**


In [1]:
import pandas as pd
import plotly.express as px 
import plotly.graph_objects as go
import numpy as np
import os
import mlflow

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import  OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import r2_score

### 1. **Exploratory Data Analysis (EDA) of the Dataset**

In [2]:
df = pd.read_csv("get_around_pricing_project.csv", index_col=0)

In [3]:
print("Number of rows and columns: {}".format(df.shape))
print()

print("Displaying a sample of the dataset: ")
display(df.head())
print()

print("Basic Statistics: ")
data_desc = df.describe(include='all')
display(data_desc)
print()

print("Missing values by category (in %):")
display(100*df.isnull().sum()/df.shape[0])

Number of rows and columns: (4843, 14)

Displaying a sample of the dataset: 


Unnamed: 0,model_key,mileage,engine_power,fuel,paint_color,car_type,private_parking_available,has_gps,has_air_conditioning,automatic_car,has_getaround_connect,has_speed_regulator,winter_tires,rental_price_per_day
0,Citroën,140411,100,diesel,black,convertible,True,True,False,False,True,True,True,106
1,Citroën,13929,317,petrol,grey,convertible,True,True,False,False,False,True,True,264
2,Citroën,183297,120,diesel,white,convertible,False,False,False,False,True,False,True,101
3,Citroën,128035,135,diesel,red,convertible,True,True,False,False,True,True,True,158
4,Citroën,97097,160,diesel,silver,convertible,True,True,False,False,False,True,True,183



Basic Statistics: 


Unnamed: 0,model_key,mileage,engine_power,fuel,paint_color,car_type,private_parking_available,has_gps,has_air_conditioning,automatic_car,has_getaround_connect,has_speed_regulator,winter_tires,rental_price_per_day
count,4843,4843.0,4843.0,4843,4843,4843,4843,4843,4843,4843,4843,4843,4843,4843.0
unique,28,,,4,10,8,2,2,2,2,2,2,2,
top,Citroën,,,diesel,black,estate,True,True,False,False,False,False,True,
freq,969,,,4641,1633,1606,2662,3839,3865,3881,2613,3674,4514,
mean,,140962.8,128.98823,,,,,,,,,,,121.214536
std,,60196.74,38.99336,,,,,,,,,,,33.568268
min,,-64.0,0.0,,,,,,,,,,,10.0
25%,,102913.5,100.0,,,,,,,,,,,104.0
50%,,141080.0,120.0,,,,,,,,,,,119.0
75%,,175195.5,135.0,,,,,,,,,,,136.0



Missing values by category (in %):


model_key                    0.0
mileage                      0.0
engine_power                 0.0
fuel                         0.0
paint_color                  0.0
car_type                     0.0
private_parking_available    0.0
has_gps                      0.0
has_air_conditioning         0.0
automatic_car                0.0
has_getaround_connect        0.0
has_speed_regulator          0.0
winter_tires                 0.0
rental_price_per_day         0.0
dtype: float64
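The statistics above show a minimum `mileage` of -64 and a minimum `engine_power` of 0, which look like data-entry errors rather than real cars. A minimal sketch of how such rows could be flagged before training is given below. The plausibility rules (mileage must not be negative, engine power must be positive) are assumptions, and `sample` is a tiny stand-in for `df`.

```python
import pandas as pd

# Tiny stand-in frame; in the notebook this would be `df`.
sample = pd.DataFrame({
    "mileage": [140411, -64, 97097],
    "engine_power": [100, 120, 0],
    "rental_price_per_day": [106, 101, 183],
})

# Assumed plausibility rules, not taken from the source.
implausible = (sample["mileage"] < 0) | (sample["engine_power"] <= 0)
print("Rows flagged:", int(implausible.sum()))

# Keep only the plausible rows.
clean = sample.loc[~implausible].reset_index(drop=True)
print(clean)
```

Whether to drop or correct these rows is a judgment call; with only a handful of them out of 4843, either choice is unlikely to change the model much.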

In [4]:
# Creating scatter plots to visualize the relationship between price and some features of the dataset. 

fig1 = px.scatter(df, x="model_key", y="rental_price_per_day")
fig2 = px.scatter(df, x="mileage", y="rental_price_per_day")
fig3 = px.scatter(df, x="engine_power", y="rental_price_per_day")
fig4 = px.scatter(df, x="has_getaround_connect", y="rental_price_per_day")


fig1.show()
fig2.show()
fig3.show()
fig4.show()

### 2. **Data Preprocessing, Model Training, and Performance Analysis**

a. Preprocessing

In [5]:
# Separate target variable Y from features X
print("Separating labels from features...")
target_variable = "rental_price_per_day"


X = df.drop(target_variable, axis = 1)
Y = df.loc[:,target_variable]

print("...Done.")
print()

print('Y : ')
print(Y.head())
print()
print('X :')
print(X.head())

Separating labels from features...
...Done.

Y : 
0    106
1    264
2    101
3    158
4    183
Name: rental_price_per_day, dtype: int64

X :
  model_key  mileage  engine_power    fuel paint_color     car_type  \
0   Citroën   140411           100  diesel       black  convertible   
1   Citroën    13929           317  petrol        grey  convertible   
2   Citroën   183297           120  diesel       white  convertible   
3   Citroën   128035           135  diesel         red  convertible   
4   Citroën    97097           160  diesel      silver  convertible   

   private_parking_available  has_gps  has_air_conditioning  automatic_car  \
0                       True     True                 False          False   
1                       True     True                 False          False   
2                      False    False                 False          False   
3                       True     True                 False          False   
4                       True     True     

In [6]:
print("Dividing into train and test sets...")
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
print("...Done.")
print()

Dividing into train and test sets...
...Done.



In [7]:
numeric_features = []
categorical_features = []
for i,t in X.dtypes.items():
    if ('float' in str(t)) or ('int' in str(t)) :
        numeric_features.append(i)
    else :
        categorical_features.append(i)

print('Found numeric features ', numeric_features)
print('Found categorical features ', categorical_features)

Found numeric features  ['mileage', 'engine_power']
Found categorical features  ['model_key', 'fuel', 'paint_color', 'car_type', 'private_parking_available', 'has_gps', 'has_air_conditioning', 'automatic_car', 'has_getaround_connect', 'has_speed_regulator', 'winter_tires']
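The same split can be obtained without inspecting dtype strings, using `DataFrame.select_dtypes`. The sketch below runs on a small stand-in frame (`X_demo` is hypothetical). Boolean columns are not matched by `"number"`, so they land in the categorical list, as in the output above.

```python
import pandas as pd

# Small stand-in for X with one column of each kind.
X_demo = pd.DataFrame({
    "model_key": ["Citroën", "Peugeot"],
    "mileage": [140411, 46963],
    "engine_power": [100, 140],
    "has_gps": [True, False],
})

# "number" selects integer and float columns, but not booleans or strings.
numeric_cols = X_demo.select_dtypes(include="number").columns.tolist()
categorical_cols = X_demo.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```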


In [8]:
# Standardize numeric features by removing the mean and scaling to unit variance.
numeric_transformer = StandardScaler()

# We use OneHotEncoder to create a binary column for each category.
categorical_transformer = Pipeline(
    steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')), 
    ('encoder', OneHotEncoder(drop='first')) 
    ])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

In [9]:
# Fit the preprocessor on the train set only, then apply it to both sets

print("Performing preprocessings on train set...")
print(X_train.head())
X_train = preprocessor.fit_transform(X_train)
print('...Done.')
print(X_train[0:5])
print()

# Apply the already-fitted preprocessor to the test set (no refit)
print("Performing preprocessings on test set...")
print(X_test.head())
X_test = preprocessor.transform(X_test) 


print('...Done.')
print(X_test[0:5,:]) 
print()

Performing preprocessings on train set...
     model_key  mileage  engine_power    fuel paint_color car_type  \
1215   Renault   119515           135  diesel        grey   estate   
432    Citroën   234365           135  diesel       black   estate   
4244       BMW    77356           105  diesel       black      suv   
289    Peugeot   181297           105  diesel       brown   estate   
2585   Citroën   144089           137  petrol       black    sedan   

      private_parking_available  has_gps  has_air_conditioning  automatic_car  \
1215                      False     True                 False          False   
432                        True     True                 False          False   
4244                      False     True                 False          False   
289                       False     True                 False          False   
2585                       True     True                 False          False   

      has_getaround_connect  has_speed_regulator  

b. Model Training

In [10]:
scores_df = pd.DataFrame(columns = ['model', 'R2_score', 'set'])

1. Linear Regression

In [11]:
# Train model
print("Train model...")
lr = LinearRegression()
lr.fit(X_train, Y_train)
print("...Done.")

scores_df = pd.concat([scores_df ,  pd.DataFrame([{'model': 'LIN_REG', 'R2_score': lr.score(X_train, Y_train), 'set': 'train'}])], ignore_index = True)
scores_df = pd.concat([scores_df , pd.DataFrame([{'model': 'LIN_REG', 'R2_score': lr.score(X_test, Y_test), 'set': 'test'}])], ignore_index = True)
scores_df

Train model...
...Done.



The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.



Unnamed: 0,model,R2_score,set
0,LIN_REG,0.71401,train
1,LIN_REG,0.693716,test


In [12]:
print("10-fold cross-validation...")
scores = cross_val_score(lr, X_train, Y_train, cv=10)
print('The cross-validated score is : ', scores.mean())
print('The standard deviation is : ', scores.std())

10-fold cross-validation...
The cross-validated score is :  0.7011647379430068
The standard deviation is :  0.053442139155884534
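The cross-validation above runs on `X_train` after it was scaled and encoded by a preprocessor fitted on the whole training set, so every validation fold has already contributed to the scaling statistics. A common way to avoid this is to put the preprocessing and the estimator in one `Pipeline`, so that the preprocessing is refit inside each fold. The sketch below shows this on synthetic data: the column names mirror the dataset, but the values and the formula for `y_demo` are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the raw (unprocessed) features and target.
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    "mileage": rng.integers(10_000, 250_000, size=n),
    "engine_power": rng.integers(70, 250, size=n),
    "fuel": rng.choice(["diesel", "petrol"], size=n),
})
y_demo = (50 + 0.5 * X_demo["engine_power"]
          - 1e-4 * X_demo["mileage"]
          + rng.normal(0, 5, size=n))

preprocessor_demo = ColumnTransformer([
    ("num", StandardScaler(), ["mileage", "engine_power"]),
    ("cat", OneHotEncoder(drop="first"), ["fuel"]),
])

# Preprocessing and model in one estimator: each CV fold refits both.
model = Pipeline([("prep", preprocessor_demo), ("reg", LinearRegression())])

cv_scores = cross_val_score(model, X_demo, y_demo, cv=10)
print("Folds:", len(cv_scores))
print("Mean R2:", cv_scores.mean())
```

With only standard scaling and one-hot encoding the effect on the score is usually small, but the pipeline form also makes the fitted model easier to ship as a single object.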


2. Bagging LR

In [13]:
lr2 = LinearRegression() 
bag_lr = BaggingRegressor(lr2)

params = {
'n_estimators': [20,40,60] # n_estimators is a hyperparameter of the ensemble method
}

gridsearch = GridSearchCV(bag_lr, param_grid = params, cv = 10, verbose=1) # cv : the number of folds to be used for CV
gridsearch.fit(X_train, Y_train)
print("...Done.")
print("Best hyperparameters : ", gridsearch.best_params_)
print("Best validation score : ", gridsearch.best_score_)

scores_df = pd.concat([scores_df ,  pd.DataFrame([{'model': 'BAG_LIN_REG', 'R2_score': gridsearch.score(X_train, Y_train), 'set': 'train'}])], ignore_index = True)
scores_df = pd.concat([scores_df , pd.DataFrame([{'model': 'BAG_LIN_REG', 'R2_score': gridsearch.score(X_test, Y_test), 'set': 'test'}])], ignore_index = True)
scores_df

Fitting 10 folds for each of 3 candidates, totalling 30 fits
...Done.
Best hyperparameters :  {'n_estimators': 20}
Best validation score :  0.7017776267786474


Unnamed: 0,model,R2_score,set
0,LIN_REG,0.71401,train
1,LIN_REG,0.693716,test
2,BAG_LIN_REG,0.713634,train
3,BAG_LIN_REG,0.694825,test


3. Adaboost LR

In [14]:
lr3 = LinearRegression() 
ada_lr = AdaBoostRegressor(lr3)

params = {
'n_estimators': [20,40,60] # n_estimators is a hyperparameter of the ensemble method
}

gridsearch2 = GridSearchCV(ada_lr, param_grid = params, cv = 10, verbose=1) # cv : the number of folds to be used for CV
gridsearch2.fit(X_train, Y_train)
print("...Done.")
print("Best hyperparameters : ", gridsearch2.best_params_)
print("Best validation score : ", gridsearch2.best_score_)

scores_df = pd.concat([scores_df ,  pd.DataFrame([{'model': 'ADA_LIN_REG', 'R2_score': gridsearch2.score(X_train, Y_train), 'set': 'train'}])], ignore_index = True)
scores_df = pd.concat([scores_df , pd.DataFrame([{'model': 'ADA_LIN_REG', 'R2_score': gridsearch2.score(X_test, Y_test), 'set': 'test'}])], ignore_index = True)
scores_df

Fitting 10 folds for each of 3 candidates, totalling 30 fits
...Done.
Best hyperparameters :  {'n_estimators': 20}
Best validation score :  0.6397292444355039


Unnamed: 0,model,R2_score,set
0,LIN_REG,0.71401,train
1,LIN_REG,0.693716,test
2,BAG_LIN_REG,0.713634,train
3,BAG_LIN_REG,0.694825,test
4,ADA_LIN_REG,0.648361,train
5,ADA_LIN_REG,0.608686,test


4. Random Forest

In [15]:
regressor = RandomForestRegressor()

# Grid of values to be tested
params = {
    'max_depth': [2, 4, 6, 8, 10],
    'min_samples_leaf': [1, 2, 5],
    'min_samples_split': [2, 4, 8],
    'n_estimators': [10, 20, 40, 60, 80, 100]
}
gridsearch1 = GridSearchCV(regressor, param_grid = params, cv = 3) # cv : the number of folds to be used for CV
gridsearch1.fit(X_train, Y_train)
print("...Done.")
print("Best hyperparameters : ", gridsearch1.best_params_)
print("Best validation score : ", gridsearch1.best_score_)

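# Score the random forest search (gridsearch1), not the bagging search (gridsearch).
# The RAN_FOR rows in the recorded output below were produced with `gridsearch`,
# which is why they duplicate the BAG_LIN_REG rows; rerun this cell to refresh them.
scores_df = pd.concat([scores_df ,  pd.DataFrame([{'model': 'RAN_FOR', 'R2_score': gridsearch1.score(X_train, Y_train), 'set': 'train'}])], ignore_index = True)
scores_df = pd.concat([scores_df , pd.DataFrame([{'model': 'RAN_FOR', 'R2_score': gridsearch1.score(X_test, Y_test), 'set': 'test'}])], ignore_index = True)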
scores_df

...Done.
Best hyperparameters :  {'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 4, 'n_estimators': 80}
Best validation score :  0.7361671295192046


Unnamed: 0,model,R2_score,set
0,LIN_REG,0.71401,train
1,LIN_REG,0.693716,test
2,BAG_LIN_REG,0.713634,train
3,BAG_LIN_REG,0.694825,test
4,ADA_LIN_REG,0.648361,train
5,ADA_LIN_REG,0.608686,test
6,RAN_FOR,0.713634,train
7,RAN_FOR,0.694825,test


c. Performance analysis

In [16]:
scores_df

Unnamed: 0,model,R2_score,set
0,LIN_REG,0.71401,train
1,LIN_REG,0.693716,test
2,BAG_LIN_REG,0.713634,train
3,BAG_LIN_REG,0.694825,test
4,ADA_LIN_REG,0.648361,train
5,ADA_LIN_REG,0.608686,test
6,RAN_FOR,0.713634,train
7,RAN_FOR,0.694825,test


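* The recorded R2 scores range from about 0.61 (AdaBoost, test set) to 0.71 (linear regression, train set). Plain and bagged linear regression are on par (about 0.69 on the test set), and AdaBoost is clearly worse.
* The `RAN_FOR` rows are identical to the `BAG_LIN_REG` rows because they were computed from `gridsearch` (the bagging search) instead of `gridsearch1`, so they do not measure the random forest. Its only valid figure above is the 3-fold validation score of 0.736, which is higher than the 10-fold score of 0.701 for linear regression.
* For linear regression, the train (0.714), test (0.694) and cross-validated (0.701, standard deviation 0.053) scores are close, so there is no sign of overfitting.
* We retain the linear regression model, which is simple and exposes the coefficients used in the next section. The random forest should be re-scored before it is ruled out.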
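Building the score table from a dictionary of fitted estimators keeps each row tied to the model it names, and avoids repeated `pd.concat` calls on an initially empty frame. The sketch below is one possible layout, run on synthetic data from `make_regression`. The names `fitted` and `scores_demo` are hypothetical, and the hyperparameters are arbitrary.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the preprocessed features and target.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# One entry per model: the label and the estimator cannot drift apart.
fitted = {
    "LIN_REG": LinearRegression().fit(X_tr, y_tr),
    "RAN_FOR": RandomForestRegressor(n_estimators=20, random_state=0).fit(X_tr, y_tr),
}

rows = []
for name, estimator in fitted.items():
    rows.append({"model": name, "R2_score": estimator.score(X_tr, y_tr), "set": "train"})
    rows.append({"model": name, "R2_score": estimator.score(X_te, y_te), "set": "test"})

# Build the frame once, from a list of records.
scores_demo = pd.DataFrame(rows)
print(scores_demo)
```

A fitted `GridSearchCV` object can be stored in the dictionary in the same way, since it also exposes `score`.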

### 3. **Feature Importance and production**

In [17]:
# Focus on coefficients of the model
column_names = []
for name, pipeline, features_list in preprocessor.transformers_: # loop over pipelines
    if name == 'num': # if pipeline is for numeric variables
        features = features_list # just get the names of columns to which it has been applied
    else: # if pipeline is for categorical variables
        features = pipeline.named_steps['encoder'].get_feature_names_out() # get output columns names from OneHotEncoder
    column_names.extend(features) # concatenate features names

print("Names of columns corresponding to each coefficient: ", column_names)

coefs = pd.DataFrame(index = column_names, data = lr.coef_.transpose(), columns=["coefficients"])

# Sort by absolute value (ascending) to compare the weight of each feature
feature_importance = abs(coefs).sort_values(by = 'coefficients')
feature_importance

Names of columns corresponding to each coefficient:  ['mileage', 'engine_power', 'x0_Audi', 'x0_BMW', 'x0_Citroën', 'x0_Ferrari', 'x0_Fiat', 'x0_Ford', 'x0_Honda', 'x0_KIA Motors', 'x0_Lamborghini', 'x0_Lexus', 'x0_Maserati', 'x0_Mazda', 'x0_Mercedes', 'x0_Mini', 'x0_Mitsubishi', 'x0_Nissan', 'x0_Opel', 'x0_PGO', 'x0_Peugeot', 'x0_Porsche', 'x0_Renault', 'x0_SEAT', 'x0_Subaru', 'x0_Suzuki', 'x0_Toyota', 'x0_Volkswagen', 'x0_Yamaha', 'x1_electro', 'x1_hybrid_petrol', 'x1_petrol', 'x2_black', 'x2_blue', 'x2_brown', 'x2_green', 'x2_grey', 'x2_orange', 'x2_red', 'x2_silver', 'x2_white', 'x3_coupe', 'x3_estate', 'x3_hatchback', 'x3_sedan', 'x3_subcompact', 'x3_suv', 'x3_van', 'x4_True', 'x5_True', 'x6_True', 'x7_True', 'x8_True', 'x9_True', 'x10_True']


Unnamed: 0,coefficients
x2_grey,0.242179
x2_black,0.682024
x6_True,0.776546
x4_True,1.369021
x3_coupe,1.422885
x0_Lamborghini,1.589023
x2_brown,1.602146
x3_suv,2.161971
x0_Nissan,2.228806
x2_red,2.304588


In [18]:
# Let's display a graph
fig = px.bar(coefs, barmode="group")

fig.show()

The production part was done separately, in the directory Mlflow/Train_model.

### TEST API

In [25]:
import requests

response = requests.post("https://project-getaround-api-5156ee192f6a.herokuapp.com/predict", json={
    "inputs": [
    ["Citroën",140411,100,"diesel","black","convertible",True,True,False,False,True,True,True],
    ["Peugeot",46963,140,"diesel","orange","convertible",False,True,False,False,False,True,True]
]
})
print(response.json())

{'prediction': [120.84584642619095, 146.74887764341486]}
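When the rows come from a DataFrame rather than hand-written lists, the values are often NumPy scalars, which the standard `json` encoder rejects. A minimal sketch of a payload builder follows. `build_payload` and `COLUMNS` are hypothetical names, and the column order is assumed to be the feature order of the dataset, which matches the example request above.

```python
import json

# Assumed feature order, taken from the dataset columns (target excluded).
COLUMNS = [
    "model_key", "mileage", "engine_power", "fuel", "paint_color", "car_type",
    "private_parking_available", "has_gps", "has_air_conditioning",
    "automatic_car", "has_getaround_connect", "has_speed_regulator", "winter_tires",
]


def build_payload(rows):
    """Turn a list of dict-like rows into the {"inputs": [[...], ...]} body."""
    inputs = []
    for row in rows:
        values = []
        for col in COLUMNS:
            value = row[col]
            # NumPy scalars expose .item(), which returns a plain Python value.
            values.append(value.item() if hasattr(value, "item") else value)
        inputs.append(values)
    return {"inputs": inputs}


first_row = {
    "model_key": "Citroën", "mileage": 140411, "engine_power": 100,
    "fuel": "diesel", "paint_color": "black", "car_type": "convertible",
    "private_parking_available": True, "has_gps": True,
    "has_air_conditioning": False, "automatic_car": False,
    "has_getaround_connect": True, "has_speed_regulator": True,
    "winter_tires": True,
}

payload = build_payload([first_row])
body = json.dumps(payload)  # raises TypeError if a value is not serializable
print(body)
```

The resulting dictionary can be passed as `json=payload` to `requests.post`; adding a `timeout` argument and calling `response.raise_for_status()` before reading the JSON makes failures easier to spot.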
