# Pricing Optimization Model

This notebook builds a machine learning model that predicts the optimal daily rental price for a vehicle from its characteristics.

The model will later be served through an API to automate pricing recommendations at scale.

## Imports

In [62]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

import mlflow
import mlflow.sklearn

import joblib

In [63]:
mlflow.set_tracking_uri("file:../mlruns")

## Data Loading

In [64]:
df_price = pd.read_csv("../data/get_around_pricing_project.csv")

df_price.head()

,Unnamed: 0,model_key,mileage,engine_power,fuel,paint_color,car_type,private_parking_available,has_gps,has_air_conditioning,automatic_car,has_getaround_connect,has_speed_regulator,winter_tires,rental_price_per_day
0,0,Citroën,140411,100,diesel,black,convertible,True,True,False,False,True,True,True,106
1,1,Citroën,13929,317,petrol,grey,convertible,True,True,False,False,False,True,True,264
2,2,Citroën,183297,120,diesel,white,convertible,False,False,False,False,True,False,True,101
3,3,Citroën,128035,135,diesel,red,convertible,True,True,False,False,True,True,True,158
4,4,Citroën,97097,160,diesel,silver,convertible,True,True,False,False,False,True,True,183


## Data Overview

In [65]:
df_price.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4843 entries, 0 to 4842
Data columns (total 15 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Unnamed: 0                 4843 non-null   int64 
 1   model_key                  4843 non-null   object
 2   mileage                    4843 non-null   int64 
 3   engine_power               4843 non-null   int64 
 4   fuel                       4843 non-null   object
 5   paint_color                4843 non-null   object
 6   car_type                   4843 non-null   object
 7   private_parking_available  4843 non-null   bool  
 8   has_gps                    4843 non-null   bool  
 9   has_air_conditioning       4843 non-null   bool  
 10  automatic_car              4843 non-null   bool  
 11  has_getaround_connect      4843 non-null   bool  
 12  has_speed_regulator        4843 non-null   bool  
 13  winter_tires               4843 non-null   bool  
 14  rental_p

In [66]:
print("Number of rows : {}".format(df_price.shape[0]))
print()

print(" Basic statistics: ")
display(df_price.describe(include='all'))


print("Percentage of missing values : ")
display(100*df_price.isnull().sum()/df_price.shape[0])

Number of rows : 4843

 Basic statistics: 


,Unnamed: 0,model_key,mileage,engine_power,fuel,paint_color,car_type,private_parking_available,has_gps,has_air_conditioning,automatic_car,has_getaround_connect,has_speed_regulator,winter_tires,rental_price_per_day
count,4843.0,4843,4843.0,4843.0,4843,4843,4843,4843,4843,4843,4843,4843,4843,4843,4843.0
unique,,28,,,4,10,8,2,2,2,2,2,2,2,
top,,Citroën,,,diesel,black,estate,True,True,False,False,False,False,True,
freq,,969,,,4641,1633,1606,2662,3839,3865,3881,2613,3674,4514,
mean,2421.0,,140962.8,128.98823,,,,,,,,,,,121.214536
std,1398.198007,,60196.74,38.99336,,,,,,,,,,,33.568268
min,0.0,,-64.0,0.0,,,,,,,,,,,10.0
25%,1210.5,,102913.5,100.0,,,,,,,,,,,104.0
50%,2421.0,,141080.0,120.0,,,,,,,,,,,119.0
75%,3631.5,,175195.5,135.0,,,,,,,,,,,136.0


Percentage of missing values : 


Unnamed: 0                   0.0
model_key                    0.0
mileage                      0.0
engine_power                 0.0
fuel                         0.0
paint_color                  0.0
car_type                     0.0
private_parking_available    0.0
has_gps                      0.0
has_air_conditioning         0.0
automatic_car                0.0
has_getaround_connect        0.0
has_speed_regulator          0.0
winter_tires                 0.0
rental_price_per_day         0.0
dtype: float64
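
The summary statistics above show a minimum `mileage` of -64 and a minimum `engine_power` of 0, neither of which is physically plausible. The notebook does not filter these rows. What follows is only a minimal sketch of how such rows could be flagged, using a small hypothetical frame (`toy` and the two plausibility rules are assumptions for illustration):

```python
import pandas as pd

# Hypothetical rows mimicking three columns of the pricing dataset.
toy = pd.DataFrame(
    {
        "mileage": [140411, -64, 97097],
        "engine_power": [100, 120, 0],
        "rental_price_per_day": [106, 101, 183],
    }
)

# Assumed rules: mileage cannot be negative, engine power must be positive.
implausible = (toy["mileage"] < 0) | (toy["engine_power"] <= 0)

print(f"Implausible rows: {implausible.sum()}")

# Keep only the rows that pass both rules.
toy_clean = toy.loc[~implausible].reset_index(drop=True)
```

Whether to drop, cap, or keep such rows is a modeling decision; the sketch only shows how to find them.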

## Feature Selection

`Unnamed: 0` is a technical row identifier with no predictive value, and `model_key` is a high-cardinality variable (28 distinct values). Both are removed to improve model generalization and simplify deployment.

In [67]:
df_model = df_price.drop(
    columns=[
        "Unnamed: 0",
        "model_key"
    ]
)

df_model.columns

Index(['mileage', 'engine_power', 'fuel', 'paint_color', 'car_type',
       'private_parking_available', 'has_gps', 'has_air_conditioning',
       'automatic_car', 'has_getaround_connect', 'has_speed_regulator',
       'winter_tires', 'rental_price_per_day'],
      dtype='object')
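
A quick way to see which columns count as high-cardinality is to count distinct values per column. The sketch below uses a small hypothetical frame (`sample` and its values are made up for illustration); in the notebook the same call could be applied to `df_price`:

```python
import pandas as pd

# Hypothetical sample; the values are illustrative, not taken from the dataset.
sample = pd.DataFrame(
    {
        "model_key": ["Citroën", "Peugeot", "BMW", "Audi"],
        "fuel": ["diesel", "petrol", "diesel", "diesel"],
        "car_type": ["estate", "sedan", "estate", "suv"],
    }
)

# Number of distinct values per column, highest first.
cardinality = sample.nunique().sort_values(ascending=False)
print(cardinality)
```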

## Target Variable

The objective of this model is to predict the optimal rental price per day for a given vehicle based on its characteristics.

In [68]:
target = "rental_price_per_day"

## Define X/y

In [69]:
X = df_model.drop(columns=[target])
y = df_model[target]

## Identify types of features

In [70]:
categorical_features = [
    "fuel",
    "paint_color",
    "car_type"
]

numeric_features = [
    "mileage",
    "engine_power"
]

binary_features = [
    "private_parking_available",
    "has_gps",
    "has_air_conditioning",
    "automatic_car",
    "has_getaround_connect",
    "has_speed_regulator",
    "winter_tires"
]

## Preprocessing

In [71]:
preprocessor = ColumnTransformer(
    transformers=[
        (
            "cat",
            OneHotEncoder(handle_unknown="ignore"),
            categorical_features
        ),
        (
            "num",
            StandardScaler(),
            numeric_features
        ),
        (
            "bin",
            "passthrough",
            binary_features
        )
    ]
)
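
The `handle_unknown="ignore"` setting matters for deployment: a category that the encoder never saw during fitting is encoded as all zeros instead of raising an error. A small sketch on made-up values (the `train` and `new` frames are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and incoming data for a single categorical column.
train = pd.DataFrame({"fuel": ["diesel", "petrol", "diesel"]})
new = pd.DataFrame({"fuel": ["petrol", "hydrogen"]})  # "hydrogen" was never seen

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The encoder returns a sparse matrix by default; convert it for display.
encoded = encoder.transform(new).toarray()

print(encoder.categories_)
print(encoded)
```

The row for the unseen category contains only zeros, so the downstream model still receives a valid input vector.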

## Train-Test Split

We split the dataset into training and testing sets in order to evaluate model generalization performance.

In [72]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

## Modeling Pipeline

Preprocessing and model training are combined into a single pipeline to ensure consistent transformations between training and production environments.

### Pipeline Linear Regression

In [73]:
lr_pipeline = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("model", LinearRegression())
    ]
)

lr_pipeline.fit(X_train, y_train)

y_pred_lr = lr_pipeline.predict(X_test)

mae_lr = mean_absolute_error(y_test, y_pred_lr)

print(f"Linear Regression MAE: {mae_lr:.2f}")

Linear Regression MAE: 13.06


### Pipeline Gradient Boosting 

In [74]:
gb_pipeline = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("model", HistGradientBoostingRegressor())
    ]
)

gb_pipeline.fit(X_train, y_train)

y_pred_gb = gb_pipeline.predict(X_test)

mae_gb = mean_absolute_error(y_test, y_pred_gb)

print(f"Gradient Boosting MAE: {mae_gb:.2f}")

Gradient Boosting MAE: 11.57


### Interpretation

On the test set, the gradient boosting model achieved a lower Mean Absolute Error (11.57) than the linear regression model (13.06).

This suggests that gradient boosting captures non-linear relationships between vehicle characteristics and rental prices that the linear model misses.

The gradient boosting model was therefore selected for deployment.

## Model Training with MLflow

The selected gradient boosting model is trained and tracked using MLflow to ensure reproducibility and facilitate deployment.

In [75]:
best_pipeline = gb_pipeline

# The context manager closes the run even if a step below raises.
with mlflow.start_run():
    # Refit on the training split so the logged model matches the logged metric.
    best_pipeline.fit(X_train, y_train)

    y_pred = best_pipeline.predict(X_test)
    mae = mean_absolute_error(y_test, y_pred)

    mlflow.log_metric("mae", mae)
    mlflow.sklearn.log_model(
        best_pipeline,
        name="pricing_model"
    )

print(f"Final MAE: {mae:.2f}")



Final MAE: 11.57


## Model Export

The trained pipeline is exported as a serialized artifact to be reused by the prediction API.

In [76]:
joblib.dump(
    best_pipeline,
    "../api/model/model.joblib"
)

['../api/model/model.joblib']
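
Because the exported artifact contains both the preprocessing and the model, the API can pass raw feature values straight to `predict`. The sketch below illustrates the round trip with a tiny stand-in pipeline saved to a temporary directory; the stand-in, its two features, and the `request` row are assumptions for illustration and not the notebook's actual model or API code:

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training rows with one categorical and one numeric feature.
toy = pd.DataFrame(
    {
        "fuel": ["diesel", "petrol", "diesel", "petrol"],
        "mileage": [140411, 13929, 183297, 97097],
        "rental_price_per_day": [106, 264, 101, 183],
    }
)

# Stand-in pipeline with the same step names as the notebook's pipelines.
stand_in = Pipeline(
    steps=[
        (
            "preprocessor",
            ColumnTransformer(
                transformers=[
                    ("cat", OneHotEncoder(handle_unknown="ignore"), ["fuel"]),
                    ("num", "passthrough", ["mileage"]),
                ]
            ),
        ),
        ("model", LinearRegression()),
    ]
)
stand_in.fit(toy[["fuel", "mileage"]], toy["rental_price_per_day"])

# Save and reload, as the API would do once at startup.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.joblib"
    joblib.dump(stand_in, path)
    loaded = joblib.load(path)

# One incoming request, expressed as a one-row DataFrame with raw values.
request = pd.DataFrame([{"fuel": "diesel", "mileage": 120000}])
prediction = loaded.predict(request)
print(f"Predicted price per day: {prediction[0]:.2f}")
```

The request frame must use the same column names as the training data, since `ColumnTransformer` selects columns by name.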