![snap](https://lever-client-logos.s3.amazonaws.com/2bd4cdf9-37f2-497f-9096-c2793296a75f-1568844229943.png)

# GetAround 

[GetAround](https://www.getaround.com/?wpsrc=Google+Organic+Search) is the Airbnb for cars. You can rent cars from any person for a few hours to a few days! Founded in 2009, the company has grown rapidly. In 2019, it counted over 5 million users and about 20K available cars worldwide.

As Jedha's partner, they offered this great challenge:

## Context 

When renting a car, our users have to complete a checkin flow at the beginning of the rental and a checkout flow at the end of the rental in order to:

* Assess the state of the car and notify other parties of pre-existing damages or damages that occurred during the rental.
* Compare fuel levels.
* Measure how many kilometers were driven.

The checkin and checkout of our rentals can be done with three distinct flows:
* **📱 Mobile** rental agreement on native apps: driver and owner meet and both sign the rental agreement on the owner’s smartphone
* **Connect:** the driver doesn’t meet the owner and opens the car with his smartphone
* **📝 Paper** contract (negligible)

## Project 🚧

For this case study, we suggest that you put yourselves in our shoes, and run an analysis we made back in 2017 🔮 🪄

When using Getaround, drivers book cars for a specific time period, from an hour to a few days long. They are supposed to bring back the car on time, but it happens from time to time that drivers are late for the checkout.

Late returns at checkout can generate high friction for the next driver if the car was supposed to be rented again on the same day: Customer Service often reports users who are dissatisfied because they had to wait for the car to come back from the previous rental, or who even had to cancel their rental because the car wasn’t returned on time.


## Goals 🎯

In order to mitigate those issues we’ve decided to implement a minimum delay between two rentals. A car won’t be displayed in the search results if the requested checkin or checkout times are too close to an already booked rental.

It solves the late checkout issue but also potentially hurts Getaround's and owners' revenue: we need to find the right trade-off.
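
As an illustration of the rule described above, here is a minimal sketch of the visibility check. The function name `is_listed`, its arguments, and the representation of bookings as `(start, end)` tuples are assumptions made for this example, not Getaround's actual implementation.

```python
from datetime import datetime, timedelta


def is_listed(requested_start, requested_end, booked_rentals, min_delay):
    """Return True if the car may appear in search results for the request.

    booked_rentals is a list of (start, end) datetime tuples (hypothetical format).
    min_delay is the minimum timedelta required between two rentals.
    """
    for booked_start, booked_end in booked_rentals:
        # The request must end at least min_delay before the booking starts...
        ends_early_enough = requested_end + min_delay <= booked_start
        # ...or start at least min_delay after the booking ends.
        starts_late_enough = requested_start >= booked_end + min_delay
        if not (ends_early_enough or starts_late_enough):
            return False
    return True


booked = [(datetime(2017, 6, 1, 10, 0), datetime(2017, 6, 1, 12, 0))]
request = (datetime(2017, 6, 1, 12, 30), datetime(2017, 6, 1, 14, 0))

# With a 60-minute minimum delay the car is hidden; with 30 minutes it is shown.
print(is_listed(*request, booked, timedelta(minutes=60)))
print(is_listed(*request, booked, timedelta(minutes=30)))
```

A larger `min_delay` hides the car for more requests, which is the revenue side of the trade-off.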

**Our Product Manager still needs to decide:**
* **threshold:** how long should the minimum delay be?
* **scope:** should we enable the feature for all cars or only for Connect cars?

In order to help them make the right decision, they are asking you for some data insights. Here are the first analyses they could think of, to kickstart the discussion. Don’t hesitate to perform additional analysis that you find relevant.

* Which share of our owners’ revenue would potentially be affected by the feature?
* How many rentals would be affected by the feature depending on the threshold and scope we choose?
* How often are drivers late for the next check-in? How does it impact the next driver?
* How many problematic cases will it solve depending on the chosen threshold and scope?

### Web dashboard

First build a dashboard that will help the Product Management team with the above questions. You can use `streamlit` or any other technology that you see fit.


### Machine Learning - `/predict` endpoint

In addition to the above question, the Data Science team is working on *pricing optimization*. They have gathered some data to suggest optimum prices for car owners using Machine Learning. 

You should provide at least **one endpoint** `/predict`. The full URL would look something like this: `https://your-url.com/predict`.

This endpoint accepts the **POST method** with JSON input data and it should return the predictions. We assume **inputs will always be well formatted**, which means you do not have to manage errors. We leave error handling as a bonus.

Input example:

```
{
  "input": [[7.0, 0.27, 0.36, 20.7, 0.045, 45.0, 170.0, 1.001, 3.0, 0.45, 8.8], [7.0, 0.27, 0.36, 20.7, 0.045, 45.0, 170.0, 1.001, 3.0, 0.45, 8.8]]
}
```

The response should be a JSON with one key `prediction` corresponding to the prediction.

Response example:

```
{
  "prediction":[6,6]
}
```
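
The request and response contract above can be sketched independently of any web framework. In the example below, `handle_predict` and `predict_fn` are hypothetical names, and the 400/422 status codes are one possible choice for the optional error handling; the function would still need to be wired to the route of whichever framework you pick.

```python
import json
from typing import Callable, List, Tuple


def handle_predict(body: str, predict_fn: Callable[[List[list]], list]) -> Tuple[int, str]:
    """Turn a raw JSON request body into (status_code, JSON response body)."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return 400, json.dumps({"error": "body is not valid JSON"})

    rows = payload.get("input") if isinstance(payload, dict) else None
    if not isinstance(rows, list) or not rows or not all(isinstance(r, list) for r in rows):
        return 422, json.dumps({"error": "'input' must be a non-empty list of lists"})

    # Convert to plain floats: model outputs are often NumPy scalars,
    # which the json module cannot serialize.
    predictions = [float(p) for p in predict_fn(rows)]
    return 200, json.dumps({"prediction": predictions})


# Stand-in for a trained model (assumption): always predicts 6.
def dummy_model(rows):
    return [6 for _ in rows]


status, response = handle_predict('{"input": [[7.0, 0.27], [7.0, 0.27]]}', dummy_model)
print(status, response)
```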

### Documentation page

You need to provide users with **documentation** for your API.

It has to be located at the `/docs` path of your website. If we take the URL example above, it should be located directly at `https://your-url.com/docs`.

This short documentation should include at least:
- An h1 title: the title is up to you.
- A description of every endpoint the user can call, with the endpoint name, the HTTP method, the required input and the expected output (you can give an example).

You are free to add any other relevant information and to style your HTML as you wish.

### Online production

You have to **host your API online**. We recommend using [Hugging Face](https://huggingface.co/spaces), as it is free of charge, but you are free to choose any other hosting provider.

## Helpers 🦮

To help you start with this project we provide you with some pieces of advice:

* Spend some time understanding the data
* Don't overlook the Data Analysis part: there are a lot of insights to find.
* Data Analysis should take 2 to 5 hours
* Machine Learning should take 3 to 6 hours
* You are not required to use libraries such as `mlflow` to handle your Machine Learning workflow, but we definitely advise you to do so.


### Share your code

To be evaluated, do not forget to share your code in a [GitHub](https://github.com/) repository. You can create a [`README.md`](https://guides.github.com/features/mastering-markdown/) file with a quick description of the project, how to set it up locally, and the online URL.

## Deliverable 📬

To complete this project, you should deliver:

- A **dashboard** in production (accessible via a web page for example)
- The **whole code** stored in a **GitHub repository**. You will include the repository's URL.
- A **documented online API** on a Hugging Face server (or any other provider you choose) containing at least **one `/predict` endpoint** that respects the technical description above. We should be able to request the API endpoint `/predict` using `curl`:

```shell
$ curl -i -H "Content-Type: application/json" -X POST -d '{"input": [[7.0, 0.27, 0.36, 20.7, 0.045, 45.0, 170.0, 1.001, 3.0, 0.45, 8.8]]}' http://your-url/predict
```

Or Python:

```python
import requests

response = requests.post("https://your-url/predict", json={
    "input": [[7.0, 0.27, 0.36, 20.7, 0.045, 45.0, 170.0, 1.001, 3.0, 0.45, 8.8]]
})
print(response.json())
```

## Data 

There are two files you need to download: 

* [Delay Analysis](https://full-stack-assets.s3.eu-west-3.amazonaws.com/Deployment/get_around_delay_analysis.xlsx) 👈 Data Analysis 
* [Pricing Optimization](https://full-stack-assets.s3.eu-west-3.amazonaws.com/Deployment/get_around_pricing_project.csv) 👈 Machine Learning 


Happy coding! 👩‍💻

In [22]:
# Machine Learning - Pricing Optimization
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score
import joblib

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import r2_score


In [3]:
get_around_data = pd.read_excel('../data/get_around_delay_analysis.xlsx')
get_around_data.head()

Unnamed: 0,rental_id,car_id,checkin_type,state,delay_at_checkout_in_minutes,previous_ended_rental_id,time_delta_with_previous_rental_in_minutes
0,505000,363965,mobile,canceled,,,
1,507750,269550,mobile,ended,-81.0,,
2,508131,359049,connect,ended,70.0,,
3,508865,299063,connect,canceled,,,
4,511440,313932,mobile,ended,,,


In [4]:
print('Dataset shape:')
display(get_around_data.shape)
print('\n')

print('Basics statistics:')
display(get_around_data.info())
print('\n')
display(get_around_data.describe(include='all'))
print('\n')

print('Percentage of missing values:')
display(100 * get_around_data.isnull().sum() / get_around_data.shape[0])

Dataset shape:


(21310, 7)



Basics statistics:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21310 entries, 0 to 21309
Data columns (total 7 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   rental_id                                   21310 non-null  int64  
 1   car_id                                      21310 non-null  int64  
 2   checkin_type                                21310 non-null  object 
 3   state                                       21310 non-null  object 
 4   delay_at_checkout_in_minutes                16346 non-null  float64
 5   previous_ended_rental_id                    1841 non-null   float64
 6   time_delta_with_previous_rental_in_minutes  1841 non-null   float64
dtypes: float64(3), int64(2), object(2)
memory usage: 1.1+ MB


None





Unnamed: 0,rental_id,car_id,checkin_type,state,delay_at_checkout_in_minutes,previous_ended_rental_id,time_delta_with_previous_rental_in_minutes
count,21310.0,21310.0,21310,21310,16346.0,1841.0,1841.0
unique,,,2,2,,,
top,,,mobile,ended,,,
freq,,,17003,18045,,,
mean,549712.880338,350030.603426,,,59.701517,550127.411733,279.28843
std,13863.446964,58206.249765,,,1002.561635,13184.023111,254.594486
min,504806.0,159250.0,,,-22433.0,505628.0,0.0
25%,540613.25,317639.0,,,-36.0,540896.0,60.0
50%,550350.0,368717.0,,,9.0,550567.0,180.0
75%,560468.5,394928.0,,,67.0,560823.0,540.0




Percentage of missing values:


rental_id                                      0.000000
car_id                                         0.000000
checkin_type                                   0.000000
state                                          0.000000
delay_at_checkout_in_minutes                  23.294228
previous_ended_rental_id                      91.360863
time_delta_with_previous_rental_in_minutes    91.360863
dtype: float64

# EDA

In [5]:
# Distribution of checkin types
fig1 = px.histogram(get_around_data, x='checkin_type', title='Distribution Checkin Type')
fig1.show()

# Distribution of rental states
fig2 = px.histogram(get_around_data, x='state', title='Distribution State')
fig2.show()

# Observations:

* Mobile >> Connect (many more mobile rentals)
* Ended >> Canceled (most rentals end normally)

In [6]:
fig = px.histogram(get_around_data, x='delay_at_checkout_in_minutes', color='checkin_type', nbins=30)
fig.show()

### Most returns cluster around 0 (on time), with negative values (early) and positive values (late).

In [7]:
get_around_data.tail()

Unnamed: 0,rental_id,car_id,checkin_type,state,delay_at_checkout_in_minutes,previous_ended_rental_id,time_delta_with_previous_rental_in_minutes
21305,573446,380069,mobile,ended,,573429.0,300.0
21306,573790,341965,mobile,ended,-337.0,,
21307,573791,364890,mobile,ended,144.0,,
21308,574852,362531,connect,ended,-76.0,,
21309,575056,351549,connect,ended,35.0,,


In [8]:
print(get_around_data.loc[get_around_data['state']=='canceled',"delay_at_checkout_in_minutes"].isna().sum())
print(get_around_data.loc[get_around_data['state']=='ended',"delay_at_checkout_in_minutes"].isna().sum())

get_around_data.loc[get_around_data['state']=='canceled',:].describe()

3264
1700


Unnamed: 0,rental_id,car_id,delay_at_checkout_in_minutes,previous_ended_rental_id,time_delta_with_previous_rental_in_minutes
count,3265.0,3265.0,1.0,229.0,229.0
mean,548637.68392,350585.309954,-17468.0,550913.327511,294.89083
std,14907.810897,57254.052866,,11955.3976,250.591601
min,504871.0,159533.0,-17468.0,509972.0,0.0
25%,539183.0,317572.0,-17468.0,543706.0,60.0
50%,549700.0,368593.0,-17468.0,550970.0,210.0
75%,560563.0,394869.0,-17468.0,560395.0,570.0
max,576195.0,416935.0,-17468.0,574540.0,720.0


In [9]:
filtered_data = get_around_data[get_around_data['checkin_type'] == 'connect']
fig = px.histogram(data_frame=filtered_data, x='delay_at_checkout_in_minutes', marginal='violin')
fig.show()

filtered_data = get_around_data[get_around_data['checkin_type'] == 'mobile']
fig = px.histogram(data_frame=filtered_data, x='delay_at_checkout_in_minutes', marginal='violin')
fig.show()

# Pricing


In [10]:
# Explore the pricing dataset
pricing_data = pd.read_csv('../data/get_around_pricing_project.csv')
print("Shape pricing:", pricing_data.shape)
print("\nColumns:", pricing_data.columns.tolist())
pricing_data.head()

Shape pricing: (4843, 15)

Columns: ['Unnamed: 0', 'model_key', 'mileage', 'engine_power', 'fuel', 'paint_color', 'car_type', 'private_parking_available', 'has_gps', 'has_air_conditioning', 'automatic_car', 'has_getaround_connect', 'has_speed_regulator', 'winter_tires', 'rental_price_per_day']


Unnamed: 0.1,Unnamed: 0,model_key,mileage,engine_power,fuel,paint_color,car_type,private_parking_available,has_gps,has_air_conditioning,automatic_car,has_getaround_connect,has_speed_regulator,winter_tires,rental_price_per_day
0,0,Citroën,140411,100,diesel,black,convertible,True,True,False,False,True,True,True,106
1,1,Citroën,13929,317,petrol,grey,convertible,True,True,False,False,False,True,True,264
2,2,Citroën,183297,120,diesel,white,convertible,False,False,False,False,True,False,True,101
3,3,Citroën,128035,135,diesel,red,convertible,True,True,False,False,True,True,True,158
4,4,Citroën,97097,160,diesel,silver,convertible,True,True,False,False,False,True,True,183


In [11]:
print("Missing values:")
print(pricing_data.isnull().sum().sum())
print(f"\nMean price: {pricing_data['rental_price_per_day'].mean():.0f}€")
print(f"Median price: {pricing_data['rental_price_per_day'].median():.0f}€")
print(f"Range: {pricing_data['rental_price_per_day'].min()}€ - {pricing_data['rental_price_per_day'].max()}€")

Missing values:
0

Mean price: 121€
Median price: 119€
Range: 10€ - 422€


In [12]:
pricing_data.info()
pricing_data.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4843 entries, 0 to 4842
Data columns (total 15 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Unnamed: 0                 4843 non-null   int64 
 1   model_key                  4843 non-null   object
 2   mileage                    4843 non-null   int64 
 3   engine_power               4843 non-null   int64 
 4   fuel                       4843 non-null   object
 5   paint_color                4843 non-null   object
 6   car_type                   4843 non-null   object
 7   private_parking_available  4843 non-null   bool  
 8   has_gps                    4843 non-null   bool  
 9   has_air_conditioning       4843 non-null   bool  
 10  automatic_car              4843 non-null   bool  
 11  has_getaround_connect      4843 non-null   bool  
 12  has_speed_regulator        4843 non-null   bool  
 13  winter_tires               4843 non-null   bool  
 14  rental_p

Unnamed: 0.1,Unnamed: 0,mileage,engine_power,rental_price_per_day
count,4843.0,4843.0,4843.0,4843.0
mean,2421.0,140962.8,128.98823,121.214536
std,1398.198007,60196.74,38.99336,33.568268
min,0.0,-64.0,0.0,10.0
25%,1210.5,102913.5,100.0,104.0
50%,2421.0,141080.0,120.0,119.0
75%,3631.5,175195.5,135.0,136.0
max,4842.0,1000376.0,423.0,422.0


In [13]:
fig = px.histogram(pricing_data, "rental_price_per_day")
fig.show()

fig = px.box(pricing_data, y="rental_price_per_day", color="fuel")
fig.show()

In [14]:
# Inspect the rows with a negative mileage, which cannot be valid
pricing_data.loc[pricing_data['mileage'] < 0, :]

Unnamed: 0.1,Unnamed: 0,model_key,mileage,engine_power,fuel,paint_color,car_type,private_parking_available,has_gps,has_air_conditioning,automatic_car,has_getaround_connect,has_speed_regulator,winter_tires,rental_price_per_day
2938,2938,Renault,-64,230,diesel,black,sedan,True,True,False,True,False,False,True,274
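
The `describe()` output above also shows a minimum `engine_power` of 0, which looks as implausible as the negative mileage. One way to handle such rows, sketched here as an assumption since the rest of the notebook trains on the full dataset, is to drop them before modeling. The helper name `drop_implausible_rows` and the toy data are hypothetical.

```python
import pandas as pd


def drop_implausible_rows(cars: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows with a non-negative mileage and a positive engine power."""
    valid = (cars["mileage"] >= 0) & (cars["engine_power"] > 0)
    return cars[valid].reset_index(drop=True)


# Toy data mimicking the two suspicious patterns seen in the pricing dataset.
toy_cars = pd.DataFrame({
    "mileage": [140411, -64, 97097],
    "engine_power": [100, 230, 0],
    "rental_price_per_day": [106, 274, 183],
})
print(drop_implausible_rows(toy_cars))
```

Whether to drop, correct, or keep these rows is a judgment call; with only a handful of affected rows the effect on the model is likely small.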


# EDA conclusion
# Delay dataset: 21,310 rentals, 23% of checkout delays missing (3,264 canceled rentals and 1,700 ended rentals)
# Pricing dataset: 4,843 cars, no missing values
# Ready for the business analysis and the ML

# Question 1: Which share of our owners’ revenue would potentially be affected by the feature?

In [15]:
threshold = 60
# Work on a copy so that adding a column does not trigger SettingWithCopyWarning
get_around_data_ended = get_around_data[get_around_data['state']=='ended'].copy()
get_around_data_ended['affected_by_threshold'] = get_around_data_ended['time_delta_with_previous_rental_in_minutes'] < threshold



In [None]:
nb_affected=get_around_data_ended['affected_by_threshold'].sum()
print(f'Number of affected rentals: {nb_affected}')

GA_connect=get_around_data_ended[get_around_data_ended['checkin_type']=='connect']
GA_connect_threshold=GA_connect['affected_by_threshold'].sum()
print(f'Affected CONNECT rentals: {GA_connect_threshold} - Affected MOBILE rentals: {nb_affected-GA_connect_threshold}')

Number of affected rentals: 358
Affected CONNECT rentals: 156 - Affected MOBILE rentals: 202


In [17]:
share_affected=(get_around_data_ended['affected_by_threshold'].sum()/len(get_around_data_ended['affected_by_threshold']))*100
print(f'Percentage of affected rentals: {round(share_affected,2)} %')

Percentage of affected rentals: 1.98 %


###  1.98% of ended rentals are affected by a 60-minute threshold.

# How many rentals would be affected by the feature depending on the threshold and scope we choose?

In [None]:
threshold=60
get_around_data_ended=get_around_data[get_around_data['state']=='ended'].copy()
get_around_data_ended['affected_by_threshold']=get_around_data_ended['time_delta_with_previous_rental_in_minutes']<threshold

nb_affected=get_around_data_ended['affected_by_threshold'].sum()
print(f'Number of affected rentals: {nb_affected}')

GA_connect=get_around_data_ended[get_around_data_ended['checkin_type']=='connect']
GA_connect_threshold=GA_connect['affected_by_threshold'].sum()
print(f'Affected CONNECT rentals: {GA_connect_threshold} - Affected MOBILE rentals: {nb_affected-GA_connect_threshold}')

Number of affected rentals: 358
Affected CONNECT rentals: 156 - Affected MOBILE rentals: 202
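
The cell above fixes the threshold at 60 minutes. Since the question asks how the count varies with both threshold and scope, a sweep over several values may be more informative. The sketch below reuses the column names of the delay dataset; the function name `affected_by_threshold_and_scope`, the scope labels, and the toy data are assumptions made for illustration.

```python
import pandas as pd


def affected_by_threshold_and_scope(rentals, thresholds, scopes=("all", "connect")):
    """Count ended rentals whose gap to the previous rental is below each threshold."""
    ended = rentals[rentals["state"] == "ended"]
    rows = []
    for scope in scopes:
        # "all" keeps every checkin type; any other label filters on checkin_type.
        subset = ended if scope == "all" else ended[ended["checkin_type"] == scope]
        gaps = subset["time_delta_with_previous_rental_in_minutes"]
        for threshold in thresholds:
            # Missing gaps (no previous rental) compare as False and are not counted.
            rows.append({
                "scope": scope,
                "threshold": threshold,
                "affected": int((gaps < threshold).sum()),
            })
    return pd.DataFrame(rows)


# Toy data with the same columns as the delay dataset.
toy_rentals = pd.DataFrame({
    "state": ["ended", "ended", "ended", "canceled", "ended"],
    "checkin_type": ["mobile", "connect", "connect", "connect", "mobile"],
    "time_delta_with_previous_rental_in_minutes": [30, 90, None, 10, 200],
})
print(affected_by_threshold_and_scope(toy_rentals, thresholds=[60, 120]))
```

On the real data, calling the function with `get_around_data` and a range such as `range(0, 721, 30)` would give one row per threshold and scope, ready to plot in the dashboard.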


# How often are drivers late for the next check-in? How does it impact the next driver?

In [20]:
# Question 3: how often a late checkout impacts the next driver
# A rental is flagged when its checkout delay exceeds its gap to the previous rental
# (comparisons involving NaN evaluate to False, so rentals with a missing value are not flagged)
get_around_data_ended['late_impact_next_driver']=(get_around_data_ended['delay_at_checkout_in_minutes']>get_around_data_ended['time_delta_with_previous_rental_in_minutes'])

late_impact_next_driver=get_around_data_ended['late_impact_next_driver'].sum()
share_late_impact=late_impact_next_driver/len(get_around_data_ended)*100

print(f'Percentage of rentals impacted by a late checkout: {round(share_late_impact,2)}%')
print(f'That is {late_impact_next_driver} late checkouts')

Percentage of rentals impacted by a late checkout: 1.5%
That is 270 late checkouts
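
The cell above compares each rental's own checkout delay with the gap that precedes it. Another reading of the columns, offered here as an assumption rather than as the method used in this notebook, is that the delay that matters to a driver is the one of the *previous* rental, which can be looked up through `previous_ended_rental_id`. The sketch below joins each rental to its predecessor and derives how long the next driver would have waited. The names `next_driver_wait`, `previous_delay_in_minutes`, `wait_in_minutes` and `next_driver_impacted` are hypothetical.

```python
import pandas as pd


def next_driver_wait(rentals: pd.DataFrame) -> pd.DataFrame:
    """Attach the previous rental's checkout delay and the resulting wait time."""
    previous = rentals[["rental_id", "delay_at_checkout_in_minutes"]].rename(columns={
        "rental_id": "previous_ended_rental_id",
        "delay_at_checkout_in_minutes": "previous_delay_in_minutes",
    })
    # previous_ended_rental_id is stored as float because it contains NaN.
    previous["previous_ended_rental_id"] = previous["previous_ended_rental_id"].astype("float64")

    chained = rentals.dropna(subset=["previous_ended_rental_id"]).merge(
        previous, on="previous_ended_rental_id", how="left"
    )
    # Positive wait: the previous driver returned the car after the next checkin time.
    chained["wait_in_minutes"] = (
        chained["previous_delay_in_minutes"]
        - chained["time_delta_with_previous_rental_in_minutes"]
    )
    chained["next_driver_impacted"] = chained["wait_in_minutes"] > 0
    return chained


# Toy data: rental 2 follows rental 1, rental 3 follows rental 2.
toy_chain = pd.DataFrame({
    "rental_id": [1, 2, 3, 4],
    "delay_at_checkout_in_minutes": [90, 10, None, -5],
    "previous_ended_rental_id": [None, 1, 2, None],
    "time_delta_with_previous_rental_in_minutes": [None, 60, 30, None],
})
print(next_driver_wait(toy_chain)[["rental_id", "wait_in_minutes", "next_driver_impacted"]])
```

The distribution of positive `wait_in_minutes` values would then describe how the next driver is impacted, not only how often.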


# How many problematic cases will it solve depending on the chosen threshold and scope?

In [21]:
# Work on a copy so that adding a column does not trigger SettingWithCopyWarning
get_around_data_ended_late=get_around_data_ended[get_around_data_ended['late_impact_next_driver']].copy()
get_around_data_ended_late['location_problem_avoided']=get_around_data_ended_late['time_delta_with_previous_rental_in_minutes']<threshold

avoided=get_around_data_ended_late['location_problem_avoided'].sum()
print(f'The threshold would avoid {avoided} problematic rentals')

get_around_data_ended_late_connect=get_around_data_ended_late[get_around_data_ended_late['checkin_type']=='connect']
avoided_connect=get_around_data_ended_late_connect['location_problem_avoided'].sum()
print(f'The threshold would avoid {avoided_connect} problematic rentals for CONNECT and {avoided-avoided_connect} for MOBILE.')

The threshold would avoid 176 problematic rentals
The threshold would avoid 63 problematic rentals for CONNECT and 113 for MOBILE.
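
To compare thresholds and scopes side by side, the same logic can be wrapped in a function. This is a sketch that keeps the notebook's definition of a problematic case (checkout delay greater than the gap to the previous rental) and counts a case as solved when that gap is below the threshold. The function name `solved_cases`, the scope labels, and the toy data are assumptions.

```python
import pandas as pd


def solved_cases(rentals, thresholds, scopes=("all", "connect")):
    """Count problematic ended rentals and how many each threshold would solve."""
    ended = rentals[rentals["state"] == "ended"]
    delay = ended["delay_at_checkout_in_minutes"]
    gap = ended["time_delta_with_previous_rental_in_minutes"]
    # Comparisons with NaN are False, so incomplete rows are never problematic.
    problematic = ended[delay > gap]

    rows = []
    for scope in scopes:
        subset = problematic if scope == "all" else problematic[problematic["checkin_type"] == scope]
        gaps = subset["time_delta_with_previous_rental_in_minutes"]
        for threshold in thresholds:
            rows.append({
                "scope": scope,
                "threshold": threshold,
                "problematic": len(subset),
                "solved": int((gaps < threshold).sum()),
            })
    return pd.DataFrame(rows)


# Toy data with the same columns as the delay dataset.
toy_delays = pd.DataFrame({
    "state": ["ended", "ended", "ended", "ended", "canceled"],
    "checkin_type": ["connect", "mobile", "connect", "mobile", "connect"],
    "delay_at_checkout_in_minutes": [100, 50, 10, None, 500],
    "time_delta_with_previous_rental_in_minutes": [30, 45, 60, 20, 10],
})
print(solved_cases(toy_delays, thresholds=[40, 60]))
```

Plotting `solved` against the number of affected rentals for each threshold is one way to show the Product Manager the trade-off.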



# Machine Learning - Pricing Optimization

In [26]:
# Load the pricing data from the data folder
pricing=pd.read_csv('../data/get_around_pricing_project.csv')
pricing.drop("Unnamed: 0",axis=1,inplace=True)
print(f'Shape: {pricing.shape}')
pricing.head()

Shape: (4843, 14)


Unnamed: 0,model_key,mileage,engine_power,fuel,paint_color,car_type,private_parking_available,has_gps,has_air_conditioning,automatic_car,has_getaround_connect,has_speed_regulator,winter_tires,rental_price_per_day
0,Citroën,140411,100,diesel,black,convertible,True,True,False,False,True,True,True,106
1,Citroën,13929,317,petrol,grey,convertible,True,True,False,False,False,True,True,264
2,Citroën,183297,120,diesel,white,convertible,False,False,False,False,True,False,True,101
3,Citroën,128035,135,diesel,red,convertible,True,True,False,False,True,True,True,158
4,Citroën,97097,160,diesel,silver,convertible,True,True,False,False,False,True,True,183


In [27]:
# EDA 
print(f'Shape: {pricing.shape}')
print(f'\nMissing values:\n{pricing.isnull().sum()}')
print(f'\nTarget stats:')
print(pricing['rental_price_per_day'].describe())

Shape: (4843, 14)

Missing values:
model_key                    0
mileage                      0
engine_power                 0
fuel                         0
paint_color                  0
car_type                     0
private_parking_available    0
has_gps                      0
has_air_conditioning         0
automatic_car                0
has_getaround_connect        0
has_speed_regulator          0
winter_tires                 0
rental_price_per_day         0
dtype: int64

Target stats:
count    4843.000000
mean      121.214536
std        33.568268
min        10.000000
25%       104.000000
50%       119.000000
75%       136.000000
max       422.000000
Name: rental_price_per_day, dtype: float64


In [28]:
# Data types
print('Data types:')
print(pricing.dtypes)
print(f'\nCategorical features:')
cat_cols=pricing.select_dtypes(include=['object','bool']).columns.tolist()
print(cat_cols)

Data types:
model_key                    object
mileage                       int64
engine_power                  int64
fuel                         object
paint_color                  object
car_type                     object
private_parking_available      bool
has_gps                        bool
has_air_conditioning           bool
automatic_car                  bool
has_getaround_connect          bool
has_speed_regulator            bool
winter_tires                   bool
rental_price_per_day          int64
dtype: object

Categorical features:
['model_key', 'fuel', 'paint_color', 'car_type', 'private_parking_available', 'has_gps', 'has_air_conditioning', 'automatic_car', 'has_getaround_connect', 'has_speed_regulator', 'winter_tires']


## Preprocessing

In [None]:
target="rental_price_per_day"
X=pricing.drop(target,axis=1)
y=pricing[target]

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)
print(f'Train: {X_train.shape}, Test: {X_test.shape}')

Train: (3874, 13), Test: (969, 13)


In [30]:
# Pipeline 
features=X
numeric_features=features.select_dtypes(include=['int64','float64']).columns.tolist()
categorical_features=features.select_dtypes(include=['object','bool']).columns.tolist()

print(f'Numeric: {numeric_features}')
print(f'Categorical: {categorical_features}')

preprocessor=ColumnTransformer([
    ("num","passthrough",numeric_features),
    ("cat",OneHotEncoder(handle_unknown="ignore"),categorical_features)
])

model=Pipeline([
    ("preprocessing",preprocessor),
    ("regressor",RandomForestRegressor(n_estimators=100,random_state=42))
])

# Training
model.fit(X_train,y_train)
print("Model trained!")

Numeric: ['mileage', 'engine_power']
Categorical: ['model_key', 'fuel', 'paint_color', 'car_type', 'private_parking_available', 'has_gps', 'has_air_conditioning', 'automatic_car', 'has_getaround_connect', 'has_speed_regulator', 'winter_tires']
Model trained!


In [31]:
# Evaluation
y_pred=model.predict(X_test)
r2=r2_score(y_test,y_pred)

print(f'R2 Score: {r2:.3f}')
print(f'Mean Absolute Error: {abs(y_test-y_pred).mean():.2f}€')
print(f'Target mean: {y_test.mean():.2f}€')

R2 Score: 0.734
Mean Absolute Error: 10.70€
Target mean: 120.57€
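
For reference, the two metrics printed above follow the standard definitions, where $y_i$ is the true price, $\hat{y}_i$ the predicted price, $\bar{y}$ the mean of the true prices and $n$ the number of test rows:

$$
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

Relating the error to the average price on the test set gives a sense of scale:

$$
\frac{\mathrm{MAE}}{\bar{y}} = \frac{10.70}{120.57} \approx 0.089
$$

So the predictions are off by roughly 9% of the mean daily price on average.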


## Model Saving

In [32]:
# Save the model
import joblib

joblib.dump(model,"getaround_pricing_model.pkl")
print("Model saved: getaround_pricing_model.pkl")

# Test the saved model
loaded_model=joblib.load("getaround_pricing_model.pkl")
test_prediction=loaded_model.predict(X_test[:1])
print(f"Test prediction: {test_prediction[0]:.2f}€")

Model saved: getaround_pricing_model.pkl
Test prediction: 136.57€
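
The saved pipeline selects features by column name, whereas the `/predict` endpoint receives plain lists of values. A possible bridge, sketched here under the assumption that clients send the 13 features in the same order as the training columns, is to rebuild a DataFrame before calling `predict`. The names `FEATURE_COLUMNS` and `rows_to_frame` are hypothetical.

```python
import pandas as pd

# Column order taken from the training data (target excluded).
FEATURE_COLUMNS = [
    "model_key", "mileage", "engine_power", "fuel", "paint_color", "car_type",
    "private_parking_available", "has_gps", "has_air_conditioning",
    "automatic_car", "has_getaround_connect", "has_speed_regulator", "winter_tires",
]


def rows_to_frame(rows):
    """Convert a list of feature lists into a DataFrame the pipeline can consume."""
    for row in rows:
        if len(row) != len(FEATURE_COLUMNS):
            raise ValueError(
                f"expected {len(FEATURE_COLUMNS)} values per row, got {len(row)}"
            )
    return pd.DataFrame(rows, columns=FEATURE_COLUMNS)


# First row of the pricing dataset, without the target.
example_rows = [["Citroën", 140411, 100, "diesel", "black", "convertible",
                 True, True, False, False, True, True, True]]
frame = rows_to_frame(example_rows)
print(frame.shape)
# With the model loaded above, predictions would then come from:
# loaded_model.predict(frame)
```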
