# GETAROUND Project: deployment machine learning module

The following analysis is being done as a mandatory project for certification bloc 5 (Machine Learning Engineer at Jedha).

In the scope of the project: 

Part 1 : Implementation of a minimum delay between two rentals 

1. Understanding the business context of the given data through EDA:
    - Analysis of delays in car returns and conflicts (getaround_delay_analysis dataset)
    - Simulating minimum thresholds of delay for better decision-making
    - Conclusions
2. Creating a visual dashboard with Streamlit (hosted on Hugging Face)

Part 2 : Machine Learning pricing optimization model - training and deployment

3. Building an ML model for pricing optimization on the basis of the given data
    - Analysis of the pricing dataset and data pre-processing
    - Training 3 different ML models for pricing optimization
    - Best model selection based on evaluation metrics
4. Building an API with a /predict endpoint for pricing predictions based on the previously created ML model
5. Preparing the API Documentation to provide clear usage instructions at /docs.
6. Deployment: hosting everything online

# Part 2 : Machine Learning pricing optimization model - training and deployment

Step 1: Analysis of the pricing dataset

In [1]:
# Setting up the environment

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error, r2_score
import joblib


In [2]:
# Loading the dataset
df = pd.read_csv("../data/get_around_pricing_project.csv")
df.head()


,Unnamed: 0,model_key,mileage,engine_power,fuel,paint_color,car_type,private_parking_available,has_gps,has_air_conditioning,automatic_car,has_getaround_connect,has_speed_regulator,winter_tires,rental_price_per_day
0,0,Citroën,140411,100,diesel,black,convertible,True,True,False,False,True,True,True,106
1,1,Citroën,13929,317,petrol,grey,convertible,True,True,False,False,False,True,True,264
2,2,Citroën,183297,120,diesel,white,convertible,False,False,False,False,True,False,True,101
3,3,Citroën,128035,135,diesel,red,convertible,True,True,False,False,True,True,True,158
4,4,Citroën,97097,160,diesel,silver,convertible,True,True,False,False,False,True,True,183


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4843 entries, 0 to 4842
Data columns (total 15 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Unnamed: 0                 4843 non-null   int64 
 1   model_key                  4843 non-null   object
 2   mileage                    4843 non-null   int64 
 3   engine_power               4843 non-null   int64 
 4   fuel                       4843 non-null   object
 5   paint_color                4843 non-null   object
 6   car_type                   4843 non-null   object
 7   private_parking_available  4843 non-null   bool  
 8   has_gps                    4843 non-null   bool  
 9   has_air_conditioning       4843 non-null   bool  
 10  automatic_car              4843 non-null   bool  
 11  has_getaround_connect      4843 non-null   bool  
 12  has_speed_regulator        4843 non-null   bool  
 13  winter_tires               4843 non-null   bool  
 14  rental_p

In [4]:
# Checking for missing values
df.isnull().sum()

Unnamed: 0                   0
model_key                    0
mileage                      0
engine_power                 0
fuel                         0
paint_color                  0
car_type                     0
private_parking_available    0
has_gps                      0
has_air_conditioning         0
automatic_car                0
has_getaround_connect        0
has_speed_regulator          0
winter_tires                 0
rental_price_per_day         0
dtype: int64

In [5]:
# Checking basic statistics
df.describe()

,Unnamed: 0,mileage,engine_power,rental_price_per_day
count,4843.0,4843.0,4843.0,4843.0
mean,2421.0,140962.8,128.98823,121.214536
std,1398.198007,60196.74,38.99336,33.568268
min,0.0,-64.0,0.0,10.0
25%,1210.5,102913.5,100.0,104.0
50%,2421.0,141080.0,120.0,119.0
75%,3631.5,175195.5,135.0,136.0
max,4842.0,1000376.0,423.0,422.0


Interpretation:
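The dataset has no missing values and covers the vehicle attributes needed to build a pricing model. A few extreme values in the statistics above (a minimum mileage of -64, a minimum engine_power of 0 and a maximum mileage of 1,000,376) look implausible and deserve a check before modelling.
The dataset contains 4,843 rows and 15 columns (one of them, Unnamed: 0, is a leftover row index), including:
- Target Variable: rental_price_per_day (the value to be predicted)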
- Features:
    - Numerical: mileage, engine_power
    - Categorical: model_key, fuel, paint_color, car_type
    - Boolean: private_parking_available, has_gps, has_air_conditioning, automatic_car, has_getaround_connect, has_speed_regulator, winter_tires
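The out-of-range values visible in the df.describe() output can be located with a simple mask. The sketch below is only an illustration: the helper name flag_implausible_rows and the 500,000 upper bound on mileage are assumptions, not part of the original notebook, and the toy rows are made up around the reported extremes.

```python
import pandas as pd


def flag_implausible_rows(df: pd.DataFrame, max_mileage: int = 500_000) -> pd.Series:
    """Return a boolean mask marking rows with implausible values.

    The max_mileage threshold is an illustrative assumption.
    """
    return (
        (df["mileage"] < 0)
        | (df["mileage"] > max_mileage)
        | (df["engine_power"] <= 0)
    )


# Toy rows for illustration only (not actual rows of the dataset)
toy = pd.DataFrame(
    {
        "mileage": [140411, -64, 1000376, 97097],
        "engine_power": [100, 120, 135, 0],
    }
)
mask = flag_implausible_rows(toy)
print(toy[mask])
```

Whether such rows should be dropped, capped or kept is a modelling decision that depends on how many there are in the real data.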

Step 2: Data Preprocessing

In [6]:
# Dropping unnecessary columns
df.drop(columns=["Unnamed: 0"], inplace=True)

In [7]:
# Encoding categorical variables (This will convert model_key, fuel, paint_color, 
# and car_type into binary columns)
df_encoded = pd.get_dummies(df, drop_first=True)
df_encoded.head()

,mileage,engine_power,private_parking_available,has_gps,has_air_conditioning,automatic_car,has_getaround_connect,has_speed_regulator,winter_tires,rental_price_per_day,...,paint_color_red,paint_color_silver,paint_color_white,car_type_coupe,car_type_estate,car_type_hatchback,car_type_sedan,car_type_subcompact,car_type_suv,car_type_van
0,140411,100,True,True,False,False,True,True,True,106,...,False,False,False,False,False,False,False,False,False,False
1,13929,317,True,True,False,False,False,True,True,264,...,False,False,False,False,False,False,False,False,False,False
2,183297,120,False,False,False,False,True,False,True,101,...,False,False,True,False,False,False,False,False,False,False
3,128035,135,True,True,False,False,True,True,True,158,...,True,False,False,False,False,False,False,False,False,False
4,97097,160,True,True,False,False,False,True,True,183,...,False,True,False,False,False,False,False,False,False,False
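With drop_first=True, pd.get_dummies omits the first category (in sorted order) of each encoded column. That category becomes the implicit baseline, represented by zeros in all remaining dummy columns. A small illustration on toy data:

```python
import pandas as pd

# Toy column for illustration only
toy = pd.DataFrame({"fuel": ["diesel", "petrol", "electro", "diesel"]})

# "diesel" sorts first, so it is dropped and becomes the baseline
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())  # ['fuel_electro', 'fuel_petrol']
```

This is why a baseline category such as diesel has no column of its own in the encoded frame.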


In [8]:
# Scaling numerical features (mileage and engine_power)

scaler = StandardScaler()
df_encoded[['mileage', 'engine_power']] = scaler.fit_transform(df_encoded[['mileage', 'engine_power']])
df_encoded.head()

,mileage,engine_power,private_parking_available,has_gps,has_air_conditioning,automatic_car,has_getaround_connect,has_speed_regulator,winter_tires,rental_price_per_day,...,paint_color_red,paint_color_silver,paint_color_white,car_type_coupe,car_type_estate,car_type_hatchback,car_type_sedan,car_type_subcompact,car_type_suv,car_type_van
0,-0.009168,-0.743491,True,True,False,False,True,True,True,106,...,False,False,False,False,False,False,False,False,False,False
1,-2.110528,4.822133,True,True,False,False,False,True,True,264,...,False,False,False,False,False,False,False,False,False,False
2,0.703337,-0.23053,False,False,False,False,True,False,True,101,...,False,False,True,False,False,False,False,False,False,False
3,-0.214781,0.15419,True,True,False,False,True,True,True,158,...,True,False,False,False,False,False,False,False,False,False
4,-0.728782,0.795391,True,True,False,False,False,True,True,183,...,False,True,False,False,False,False,False,False,False,False
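StandardScaler replaces each value x with z = (x - mean) / std, where std is the population standard deviation (ddof=0). Note that df.describe() reports the sample standard deviation (ddof=1), so the two differ slightly. A sketch of the same computation with numpy, on toy values:

```python
import numpy as np

# Toy engine_power values for illustration only
x = np.array([100.0, 120.0, 135.0, 160.0, 317.0])

# np.std defaults to ddof=0, which matches StandardScaler
z = (x - x.mean()) / x.std()
print(z.round(3))
```

After the transformation the column has mean 0 and standard deviation 1, which is what the scaled mileage and engine_power columns above reflect.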


In [9]:
# Defining features and target variable
X = df_encoded.drop(columns=['rental_price_per_day'])
y = df_encoded['rental_price_per_day']

In [10]:
# Splitting the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Now X_train, X_test, y_train, and y_test are ready for model training and evaluation
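One design point worth noting: in the cells above, the scaler is fitted on the full dataset before the split, so test-set statistics influence the scaling. A common alternative, sketched here on synthetic data as an assumption rather than as the notebook's method, is to split first and fit the scaler on the training rows only. All variable names in the sketch are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only
rng = np.random.default_rng(42)
toy = pd.DataFrame(
    {
        "mileage": rng.uniform(10_000, 200_000, size=100),
        "engine_power": rng.uniform(70, 300, size=100),
        "rental_price_per_day": rng.uniform(50, 250, size=100),
    }
)
X_toy = toy.drop(columns=["rental_price_per_day"])
y_toy = toy["rental_price_per_day"]

# Split first ...
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# ... then learn the scaling statistics from the training rows only
num_cols = ["mileage", "engine_power"]
toy_scaler = StandardScaler().fit(X_tr[num_cols])

X_tr_scaled = X_tr.copy()
X_te_scaled = X_te.copy()
X_tr_scaled[num_cols] = toy_scaler.transform(X_tr[num_cols])
X_te_scaled[num_cols] = toy_scaler.transform(X_te[num_cols])
```

Whichever variant is used, the same fitted scaler also has to be applied to any new input at prediction time, so it is worth keeping alongside the model.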

Step 3: Building the ML model for price optimization

In [13]:
# Training and evaluating the first model - Linear Regression

lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Predict
y_pred_lr = lr_model.predict(X_test)

# Evaluation metrics
# Note: mean_squared_error returns the MSE (squared units); its square root is the RMSE
mse_lr = mean_squared_error(y_test, y_pred_lr)
mae_lr = mean_absolute_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)

print(f"Linear Regression MSE: {mse_lr:.2f}")
print(f"Linear Regression MAE: {mae_lr:.2f}")
print(f"Linear Regression R2: {r2_lr:.2f}")


Linear Regression MSE: 322.58
Linear Regression MAE: 12.12
Linear Regression R2: 0.69


In [14]:
# Training and evaluating the second model - Random Forest Regressor

model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

# Evaluation metrics
# Note: mean_squared_error returns the MSE (squared units); its square root is the RMSE
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Random Forest MSE: {mse:.2f}")
print(f"Random Forest MAE: {mae:.2f}")
print(f"Random Forest R2: {r2:.2f}")


Random Forest MSE: 282.31
Random Forest MAE: 10.71
Random Forest R2: 0.73


In [15]:
# Training and evaluating the third model - XGBoost Regressor

xgb_model = XGBRegressor(random_state=42)
xgb_model.fit(X_train, y_train)

y_pred_xgb = xgb_model.predict(X_test)

# Evaluation metrics
# Note: mean_squared_error returns the MSE (squared units); its square root is the RMSE
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
mae_xgb = mean_absolute_error(y_test, y_pred_xgb)
r2_xgb = r2_score(y_test, y_pred_xgb)

print(f"XGBoost MSE: {mse_xgb:.2f}")
print(f"XGBoost MAE: {mae_xgb:.2f}")
print(f"XGBoost R2: {r2_xgb:.2f}")


XGBoost MSE: 281.60
XGBoost MAE: 10.56
XGBoost R2: 0.73
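As a reminder, mean_squared_error returns the mean of the squared errors (MSE). The root mean squared error (RMSE) is its square root and is expressed in the same unit as the target. A short sketch on toy numbers, using only numpy:

```python
import numpy as np

# Toy prices for illustration only
y_true = np.array([100.0, 120.0, 150.0, 180.0])
y_hat = np.array([110.0, 115.0, 140.0, 200.0])

mse = np.mean((y_true - y_hat) ** 2)  # mean of squared errors
rmse = np.sqrt(mse)                   # back in the unit of the target
print(f"MSE: {mse:.2f}, RMSE: {rmse:.2f}")  # MSE: 156.25, RMSE: 12.50
```

Applied to the figures above, the value 281.60 reported for XGBoost corresponds to a root mean squared error of about 16.78 in rental price per day.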


Let's try some further tuning to see whether the model performance, as measured by the above-mentioned metrics, can be improved.

In [None]:
# Hyperparameter tuning for XGBoost Regressor using GridSearchCV

from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV

# Define the model
xgb = XGBRegressor(random_state=42)

# Define the hyperparameter grid
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.3],
    'n_estimators': [100, 200, 300],
    'subsample': [0.8, 1.0]
}

# Set up GridSearchCV
grid_search = GridSearchCV(
    estimator=xgb,
    param_grid=param_grid,
    cv=5,
    scoring='neg_root_mean_squared_error',
    verbose=1,
    n_jobs=-1
)

# Fit to training data
grid_search.fit(X_train, y_train)

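# Best model and score
# best_score_ is the mean cross-validated score on the training folds;
# the sign is flipped because the scorer returns the negated RMSE
print("Best RMSE:", -grid_search.best_score_)
print("Best Parameters:", grid_search.best_params_)

# Additional metrics can be calculated by predicting on the test set
# (mean_absolute_error and r2_score were already imported in the first cell)

best_model = grid_search.best_estimator_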

y_pred = best_model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Optimized MAE: {mae:.2f}")
print(f"Optimized R² Score: {r2:.2f}")





Fitting 5 folds for each of 54 candidates, totalling 270 fits
Best RMSE: 16.331990814208986
Best Parameters: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 300, 'subsample': 1.0}
Optimized MAE: 10.10
Optimized R² Score: 0.76
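The "54 candidates, totalling 270 fits" line in the log follows directly from the size of the grid: every combination of the listed values is one candidate, and each candidate is fitted once per cross-validation fold. A quick check with the standard library:

```python
from itertools import product

param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.3],
    "n_estimators": [100, 200, 300],
    "subsample": [0.8, 1.0],
}
cv_folds = 5

# Every combination of hyperparameter values is one candidate
candidates = list(product(*param_grid.values()))
total_fits = len(candidates) * cv_folds
print(len(candidates), total_fits)  # 54 270
```

With the default refit=True, GridSearchCV additionally refits the best candidate once on the whole training set, which is the estimator exposed as best_estimator_.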


Interpretation:

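Comparing the test-set metrics of the three models, XGBoost performs best, although only narrowly ahead of Random Forest (MAE 10.56 vs 10.71, R2 0.73 for both), while Linear Regression trails (MAE 12.12, R2 0.69). Hyperparameter tuning improves XGBoost further: on the test set, MAE drops from 10.56 to 10.10 and R2 rises from 0.73 to 0.76. Note that the best RMSE reported by the grid search (16.33) is a cross-validated score computed on the training folds, so it is not directly comparable with the test-set figures of the untuned models. The optimized model is therefore the one saved for deployment.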

Step 4: Save the optimal trained ML model for pricing

In [19]:
feature_order = X.columns.tolist()
print("Feature order:", feature_order)

Feature order: ['mileage', 'engine_power', 'private_parking_available', 'has_gps', 'has_air_conditioning', 'automatic_car', 'has_getaround_connect', 'has_speed_regulator', 'winter_tires', 'model_key_Audi', 'model_key_BMW', 'model_key_Citroën', 'model_key_Ferrari', 'model_key_Fiat', 'model_key_Ford', 'model_key_Honda', 'model_key_KIA Motors', 'model_key_Lamborghini', 'model_key_Lexus', 'model_key_Maserati', 'model_key_Mazda', 'model_key_Mercedes', 'model_key_Mini', 'model_key_Mitsubishi', 'model_key_Nissan', 'model_key_Opel', 'model_key_PGO', 'model_key_Peugeot', 'model_key_Porsche', 'model_key_Renault', 'model_key_SEAT', 'model_key_Subaru', 'model_key_Suzuki', 'model_key_Toyota', 'model_key_Volkswagen', 'model_key_Yamaha', 'fuel_electro', 'fuel_hybrid_petrol', 'fuel_petrol', 'paint_color_black', 'paint_color_blue', 'paint_color_brown', 'paint_color_green', 'paint_color_grey', 'paint_color_orange', 'paint_color_red', 'paint_color_silver', 'paint_color_white', 'car_type_coupe', 'car_type

In [23]:
# Save the best optimized model
# Note: feature_order is only kept in memory here; it is not written to disk by this cell
model = best_model
feature_order = X_train.columns.tolist()
joblib.dump(best_model, '../models/best_xgb_model.pkl')


['../models/best_xgb_model.pkl']

Both the selected model and the feature order are needed to build and deploy the API. The model, trained on the known data, can then generate predictions from new input supplied by the user (data the model has never seen). That input must, however, contain exactly the same features, in the same order, as the data used to train the model; otherwise no valid prediction can be returned.

To make input easier for the user and to avoid potential errors, a smart input encoder was added to the API so that requests can use user-friendly, meaningful values.
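The API code itself is not shown in this notebook. As a rough idea of what such an encoder might do, the sketch below one-hot encodes a single record and aligns it to the training feature order. The helper name encode_input, the shortened feature list and the sample record are all hypothetical, and scaling of mileage and engine_power is deliberately left out.

```python
import pandas as pd


def encode_input(raw: dict, feature_order: list[str]) -> pd.DataFrame:
    """Hypothetical helper: one-hot encode one record and align it to feature_order."""
    row = pd.get_dummies(pd.DataFrame([raw]))
    # Columns unseen in training are dropped; missing training columns are filled with 0
    return row.reindex(columns=feature_order, fill_value=0)


# Shortened, illustrative feature list (the real one comes from X_train.columns)
feature_order = [
    "mileage", "engine_power", "has_gps",
    "fuel_petrol", "fuel_electro", "car_type_suv",
]

# Made-up request payload using human-readable values
raw = {
    "mileage": 50000,
    "engine_power": 120,
    "has_gps": True,
    "fuel": "petrol",
    "car_type": "suv",
}
encoded = encode_input(raw, feature_order)
print(encoded.columns.tolist() == feature_order)  # True
```

A baseline category that was dropped during training (for example diesel) simply yields zeros in every fuel column after the reindex, which matches the drop_first encoding used above. In a real service the numeric columns would also need the same scaling as the training data before calling predict.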