![snap](https://lever-client-logos.s3.amazonaws.com/2bd4cdf9-37f2-497f-9096-c2793296a75f-1568844229943.png)

# GetAround 

[GetAround](https://www.getaround.com/?wpsrc=Google+Organic+Search) is the Airbnb for cars. You can rent cars from any person for a few hours to a few days! Founded in 2009, the company has grown rapidly. In 2019, it counted over 5 million users and about 20K available cars worldwide. 

As Jedha's partner, they offered this great challenge: 

## Context 

When renting a car, our users have to complete a checkin flow at the beginning of the rental and a checkout flow at the end of the rental in order to:

* Assess the state of the car and notify other parties of pre-existing damages or damages that occurred during the rental.
* Compare fuel levels.
* Measure how many kilometers were driven.

The checkin and checkout of our rentals can be done with three distinct flows:
* **📱 Mobile** rental agreement on native apps: driver and owner meet and both sign the rental agreement on the owner’s smartphone
* **Connect:** the driver doesn’t meet the owner and opens the car with his smartphone
* **📝 Paper** contract (negligible)

## Project 🚧

For this case study, we suggest that you put yourselves in our shoes, and run an analysis we made back in 2017 🔮 🪄

When using Getaround, drivers book cars for a specific time period, from an hour to a few days long. They are supposed to bring back the car on time, but it happens from time to time that drivers are late for the checkout.

Late returns at checkout can generate high friction for the next driver if the car was supposed to be rented again on the same day: customer service often reports users who are unsatisfied because they had to wait for the car to come back from the previous rental, or users who even had to cancel their rental because the car wasn’t returned on time.


## Goals 🎯

In order to mitigate those issues we’ve decided to implement a minimum delay between two rentals. A car won’t be displayed in the search results if the requested checkin or checkout times are too close to an already booked rental.

It solves the late checkout issue but also potentially hurts Getaround/owners' revenues: we need to find the right trade-off.

**Our Product Manager still needs to decide:**
* **threshold:** how long should the minimum delay be?
* **scope:** should we enable the feature for all cars, or only for Connect cars?

In order to help them make the right decision, they are asking you for some data insights. Here are the first analyses they could think of, to kickstart the discussion. Don’t hesitate to perform any additional analyses that you find relevant.

* Which share of our owner’s revenue would potentially be affected by the feature?
* How many rentals would be affected by the feature depending on the threshold and scope we choose?
* How often are drivers late for the next check-in? How does it impact the next driver?
* How many problematic cases will it solve depending on the chosen threshold and scope?

### Web dashboard

First, build a dashboard that will help the Product Management team with the above questions. You can use `streamlit` or any other technology that you see fit. 


### Machine Learning - `/predict` endpoint

In addition to the above question, the Data Science team is working on *pricing optimization*. They have gathered some data to suggest optimum prices for car owners using Machine Learning. 

You should provide at least **one endpoint**, `/predict`. The full URL would look something like this: `https://your-url.com/predict`.

This endpoint accepts the **POST method** with JSON input data and should return the predictions. We assume **inputs will always be well formatted**, so you do not have to handle errors. We leave error handling as a bonus.

Input example:

```
{
  "input": [[7.0, 0.27, 0.36, 20.7, 0.045, 45.0, 170.0, 1.001, 3.0, 0.45, 8.8], [7.0, 0.27, 0.36, 20.7, 0.045, 45.0, 170.0, 1.001, 3.0, 0.45, 8.8]]
}
```

The response should be a JSON with one key `prediction` corresponding to the prediction.

Response example:

```
{
  "prediction":[6,6]
}
```
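
The request/response contract above can be pictured with a minimal, framework-free sketch. `handle_predict` and `DummyModel` are hypothetical names introduced for illustration only; the stand-in model simply returns the rounded mean of each row, whereas a real deployment would load a trained model and wire the function to a web framework route.

```python
import json


class DummyModel:
    """Hypothetical stand-in for a trained model: predicts the rounded mean of each row."""

    def predict(self, rows):
        return [round(sum(row) / len(row)) for row in rows]


def handle_predict(body: str, model) -> str:
    """Parse a JSON request body and return the JSON response body."""
    payload = json.loads(body)
    rows = payload["input"]            # list of feature rows, one per prediction
    predictions = model.predict(rows)  # one prediction per row
    return json.dumps({"prediction": list(predictions)})


request_body = json.dumps({"input": [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]})
print(handle_predict(request_body, DummyModel()))
```

The response always carries exactly one prediction per input row, which is the property a client relies on when it zips inputs with predictions.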

### Documentation page

You need to provide the users with a **documentation** about your API.

It has to be located at the `/docs` path of your website. With the URL example above, it should be located directly at `https://your-url.com/docs`.

This small documentation should at least include:
- An h1 title: the title is up to you.
- A description of every endpoint the user can call, with the endpoint name, the HTTP method, the required input and the expected output (you can give an example).

You are free to add any other relevant information and style your HTML as you wish.

### Online production

You have to **host your API online**. We recommend using [Heroku](https://www.heroku.com/) as it is free of charge, but you are free to choose any other hosting provider.

## Helpers 🦮

To help you start with this project, here are some pieces of advice:

* Spend some time understanding the data 
* Don't overlook the Data Analysis part: there are a lot of insights to find. 
* Data Analysis should take 2 to 5 hours 
* Machine Learning should take 3 to 6 hours 
* You are not obliged to use libraries such as `mlflow` to handle your Machine Learning workflow, but we definitely advise you to do so.


### Share your code

In order to be evaluated, do not forget to share your code in a [GitHub](https://github.com/) repository. You can create a [`README.md`](https://guides.github.com/features/mastering-markdown/) file with a quick description of the project, how to set it up locally, and the online URL.

## Deliverable 📬

To complete this project, you should deliver:

- A **dashboard** in production (accessible via a web page for example)
- The **whole code** stored in a **GitHub repository**. You will include the repository's URL.
- A **documented online API** on a Heroku server (or any other provider you choose) containing at least **one `/predict` endpoint** that respects the technical description above. We should be able to request the API endpoint `/predict` using `curl`:

```shell
$ curl -i -H "Content-Type: application/json" -X POST -d '{"input": [[7.0, 0.27, 0.36, 20.7, 0.045, 45.0, 170.0, 1.001, 3.0, 0.45, 8.8]]}' http://your-url/predict
```

Or Python:

```python
import requests

response = requests.post("https://your-url/predict", json={
    "input": [[7.0, 0.27, 0.36, 20.7, 0.045, 45.0, 170.0, 1.001, 3.0, 0.45, 8.8]]
})
print(response.json())
```

## Data 

There are two files you need to download: 

* [Delay Analysis](https://full-stack-assets.s3.eu-west-3.amazonaws.com/Deployment/get_around_delay_analysis.xlsx) 👈 Data Analysis 
* [Pricing Optimization](https://full-stack-assets.s3.eu-west-3.amazonaws.com/Deployment/get_around_pricing_project.csv) 👈 Machine Learning 


Happy coding! 👩‍💻

## EDA

In [1]:
import requests
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

In [2]:
data_file_delay = 'get_around_delay_analysis.xlsx'
data_file_pricing = 'get_around_pricing_project_train.csv'
def load_data():
    data_delay = pd.read_excel(data_file_delay)
    data_pricing = pd.read_csv(data_file_pricing)
    return data_delay, data_pricing

In [3]:
df_delay, df_pricing = load_data()
print('df_delay')
display(df_delay.head())
print('df_pricing')
display(df_pricing.head())

df_delay


Unnamed: 0,rental_id,car_id,checkin_type,state,delay_at_checkout_in_minutes,previous_ended_rental_id,time_delta_with_previous_rental_in_minutes
0,505000,363965,mobile,canceled,,,
1,507750,269550,mobile,ended,-81.0,,
2,508131,359049,connect,ended,70.0,,
3,508865,299063,connect,canceled,,,
4,511440,313932,mobile,ended,,,


df_pricing


Unnamed: 0.1,Unnamed: 0,model_key,mileage,engine_power,fuel,paint_color,car_type,private_parking_available,has_gps,has_air_conditioning,automatic_car,has_getaround_connect,has_speed_regulator,winter_tires,rental_price_per_day
0,0,Citroën,140411,100,diesel,black,convertible,True,True,False,False,True,True,True,106
1,1,Citroën,13929,317,petrol,grey,convertible,True,True,False,False,False,True,True,264
2,2,Citroën,183297,120,diesel,white,convertible,False,False,False,False,True,False,True,101
3,3,Citroën,128035,135,diesel,red,convertible,True,True,False,False,True,True,True,158
4,4,Citroën,97097,160,diesel,silver,convertible,True,True,False,False,False,True,True,183


In [22]:
df_delay.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21310 entries, 0 to 21309
Data columns (total 7 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   rental_id                                   21310 non-null  int64  
 1   car_id                                      21310 non-null  int64  
 2   checkin_type                                21310 non-null  object 
 3   state                                       21310 non-null  object 
 4   delay_at_checkout_in_minutes                16346 non-null  float64
 5   previous_ended_rental_id                    1841 non-null   float64
 6   time_delta_with_previous_rental_in_minutes  1841 non-null   float64
dtypes: float64(3), int64(2), object(2)
memory usage: 1.1+ MB


In [23]:
df_pricing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4837 entries, 0 to 4836
Data columns (total 15 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Unnamed: 0                 4837 non-null   int64 
 1   model_key                  4837 non-null   object
 2   mileage                    4837 non-null   int64 
 3   engine_power               4837 non-null   int64 
 4   fuel                       4837 non-null   object
 5   paint_color                4837 non-null   object
 6   car_type                   4837 non-null   object
 7   private_parking_available  4837 non-null   bool  
 8   has_gps                    4837 non-null   bool  
 9   has_air_conditioning       4837 non-null   bool  
 10  automatic_car              4837 non-null   bool  
 11  has_getaround_connect      4837 non-null   bool  
 12  has_speed_regulator        4837 non-null   bool  
 13  winter_tires               4837 non-null   bool  
 14  rental_p

In [24]:
df_delay.describe(include='all')

Unnamed: 0,rental_id,car_id,checkin_type,state,delay_at_checkout_in_minutes,previous_ended_rental_id,time_delta_with_previous_rental_in_minutes
count,21310.0,21310.0,21310,21310,16346.0,1841.0,1841.0
unique,,,2,2,,,
top,,,mobile,ended,,,
freq,,,17003,18045,,,
mean,549712.880338,350030.603426,,,59.701517,550127.411733,279.28843
std,13863.446964,58206.249765,,,1002.561635,13184.023111,254.594486
min,504806.0,159250.0,,,-22433.0,505628.0,0.0
25%,540613.25,317639.0,,,-36.0,540896.0,60.0
50%,550350.0,368717.0,,,9.0,550567.0,180.0
75%,560468.5,394928.0,,,67.0,560823.0,540.0


In [25]:
df_pricing.describe(include='all')

Unnamed: 0.1,Unnamed: 0,model_key,mileage,engine_power,fuel,paint_color,car_type,private_parking_available,has_gps,has_air_conditioning,automatic_car,has_getaround_connect,has_speed_regulator,winter_tires,rental_price_per_day
count,4837.0,4837,4837.0,4837.0,4837,4837,4837,4837,4837,4837,4837,4837,4837,4837,4837.0
unique,,28,,,4,10,8,2,2,2,2,2,2,2,
top,,Citroën,,,diesel,black,estate,True,True,False,False,False,False,True,
freq,,969,,,4635,1632,1606,2660,3833,3859,3875,2608,3668,4508,
mean,2418.0,,141055.0,129.003515,,,,,,,,,,,121.204879
std,1396.465956,,60140.27,39.008941,,,,,,,,,,,33.58565
min,0.0,,-64.0,0.0,,,,,,,,,,,10.0
25%,1209.0,,103077.0,100.0,,,,,,,,,,,104.0
50%,2418.0,,141120.0,120.0,,,,,,,,,,,119.0
75%,3627.0,,175217.0,135.0,,,,,,,,,,,136.0


In [26]:
fig = px.scatter(df_pricing, x='mileage', y='rental_price_per_day', trendline='ols')
fig.show()

In [27]:
power = 100
fig = px.scatter(df_pricing[df_pricing['engine_power'] == power],
                 x='mileage',
                 y='rental_price_per_day',
                 trendline='ols',
                 title=f'Rental price by mileage for an engine power of {power}')
fig.show()

In [28]:
fig = px.scatter(df_delay,
                 x='time_delta_with_previous_rental_in_minutes',
                 y='delay_at_checkout_in_minutes',
                 color='checkin_type',
                 range_y=[-5000,12000])
fig.show()

In [42]:
# Compute the proportions
# Note: total also counts rentals with no recorded delay (NaN)
total = len(df_delay)
f_retards = len(df_delay[df_delay['delay_at_checkout_in_minutes'] > 0])
f_non_retards = len(df_delay[df_delay['delay_at_checkout_in_minutes'] <= 0])

# Build a DataFrame for the pie chart
df_delay_pie = pd.DataFrame({
    'Type': ['Late', 'On time'],
    'Proportion': [f_retards / total, f_non_retards / total]
})

# Draw the pie chart
fig = px.pie(df_delay_pie, values='Proportion', names='Type')
fig.show()
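
The cell above divides by `len(df_delay)`, which also counts rentals without a recorded delay, so the two proportions do not sum to 1 (the pie chart renormalizes them when drawing). As a minimal sketch, assuming the goal is the share of late checkouts among rentals with a known delay, the missing values can be dropped first. The `toy` frame and the `late_share` helper are made up for illustration.

```python
import pandas as pd

# Toy data standing in for df_delay (values are made up for illustration)
toy = pd.DataFrame({
    "delay_at_checkout_in_minutes": [-81.0, 70.0, None, 15.0, None, -5.0],
})


def late_share(df: pd.DataFrame) -> float:
    """Share of late checkouts among rentals whose delay was recorded."""
    delays = df["delay_at_checkout_in_minutes"].dropna()
    return float((delays > 0).mean())


print(late_share(toy))  # 2 late checkouts out of 4 recorded delays
```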

In [19]:
import plotly.graph_objs as go

def calculate_proportions(df, checkin_type=None):
    if checkin_type:
        df = df[df['checkin_type'] == checkin_type]
    f_retards = len(df[df['delay_at_checkout_in_minutes'] > 0])
    f_non_retards = len(df[df['delay_at_checkout_in_minutes'] <= 0])
    return [f_non_retards, f_retards]

proportions_tout = calculate_proportions(df_delay)
proportions_mobile = calculate_proportions(df_delay, 'mobile')
proportions_connect = calculate_proportions(df_delay, 'connect')

fig = go.Figure()
fig.add_trace(go.Pie(labels=['On time', 'Late'], values=proportions_tout, name="Mobile & Connect", sort=False))
fig.add_trace(go.Pie(labels=['On time', 'Late'], values=proportions_mobile, name="Mobile", visible=False, sort=False))
fig.add_trace(go.Pie(labels=['On time', 'Late'], values=proportions_connect, name="Connect", visible=False, sort=False))

# Update the traces for each button
fig.update_traces(hole=.4, hoverinfo="label+percent+name")

# Add the buttons that switch between charts
fig.update_layout(
    title_text="Share of late checkouts",
    updatemenus=[{
        "buttons": [
            {
                "label": "Mobile & Connect",
                "method": "update",
                "args": [{"visible": [True, False, False]}]
            },
            {
                "label": "Mobile",
                "method": "update",
                "args": [{"visible": [False, True, False]}]
            },
            {
                "label": "Connect",
                "method": "update",
                "args": [{"visible": [False, False, True]}]
            }
        ],
        "direction": "down",
        "showactive": True,
    }]
)

fig.show()

In [51]:
fig = px.box(df_delay, y='delay_at_checkout_in_minutes', color='checkin_type', range_y=[-25000,72000])
fig.show()

In [8]:
df_delay.head()

Unnamed: 0,rental_id,car_id,checkin_type,state,delay_at_checkout_in_minutes,previous_ended_rental_id,time_delta_with_previous_rental_in_minutes
0,505000,363965,mobile,canceled,,,
1,507750,269550,mobile,ended,-81.0,,
2,508131,359049,connect,ended,70.0,,
3,508865,299063,connect,canceled,,,
4,511440,313932,mobile,ended,,,


In [9]:
# Share of delays above each threshold
thresholds_in_minutes = [0, 50, 100, 150, 200, 250, 300, 350, 400]
scope = ['all', 'only_connect']

proportion_de_retards = np.zeros_like(thresholds_in_minutes)*1.0
df_delay2 = df_delay.copy()
mask = (df_delay2['delay_at_checkout_in_minutes'].notnull())
df_delay2 = df_delay2[mask]

for index in range(len(thresholds_in_minutes)):
    df_delay2[f"retard_trop_grand_{thresholds_in_minutes[index]}"] = df_delay2["delay_at_checkout_in_minutes"].apply(
                                                                        lambda x: 1.0 if x > thresholds_in_minutes[index] else 0.0)
    proportion_de_retards[index] = df_delay2[f"retard_trop_grand_{thresholds_in_minutes[index]}"].mean()

proportion_de_retards = pd.DataFrame(proportion_de_retards, columns=['prop_retards']).reset_index()
proportion_de_retards['index'] = proportion_de_retards['index'].apply(lambda i: thresholds_in_minutes[i])
proportion_de_retards.head()

Unnamed: 0,index,prop_retards
0,0,0.575309
1,50,0.296709
2,100,0.182124
3,150,0.126575
4,200,0.096721


In [10]:
proportion_de_retards_connect = np.zeros_like(thresholds_in_minutes)*1.0
df_delay3 = df_delay.copy()
mask = (df_delay3['delay_at_checkout_in_minutes'].notnull())
df_delay3 = df_delay3[mask]
mask2 = (df_delay3['checkin_type'] == 'connect')
df_delay3 = df_delay3[mask2]

for index in range(len(thresholds_in_minutes)):
    df_delay3[f"retard_trop_grand_{thresholds_in_minutes[index]}"] = df_delay3["delay_at_checkout_in_minutes"].apply(
                                                                        lambda x: 1.0 if x > thresholds_in_minutes[index] else 0.0)
    proportion_de_retards_connect[index] = df_delay3[f"retard_trop_grand_{thresholds_in_minutes[index]}"].mean()

proportion_de_retards_connect = pd.DataFrame(proportion_de_retards_connect, columns=['prop_retards']).reset_index()
proportion_de_retards_connect['index'] = proportion_de_retards_connect['index'].apply(lambda i: thresholds_in_minutes[i])

proportion_de_retards['prop_retards_connect'] = proportion_de_retards_connect['prop_retards']

In [11]:
proportion_de_retards_mobile = np.zeros_like(thresholds_in_minutes)*1.0
df_delay4 = df_delay.copy()
mask = (df_delay4['delay_at_checkout_in_minutes'].notnull())
df_delay4 = df_delay4[mask]
mask2 = (df_delay4['checkin_type'] == 'mobile')
df_delay4 = df_delay4[mask2]

for index in range(len(thresholds_in_minutes)):
    df_delay4[f"retard_trop_grand_{thresholds_in_minutes[index]}"] = df_delay4["delay_at_checkout_in_minutes"].apply(
                                                                        lambda x: 1.0 if x > thresholds_in_minutes[index] else 0.0)
    proportion_de_retards_mobile[index] = df_delay4[f"retard_trop_grand_{thresholds_in_minutes[index]}"].mean()

proportion_de_retards_mobile = pd.DataFrame(proportion_de_retards_mobile, columns=['prop_retards']).reset_index()
proportion_de_retards_mobile['index'] = proportion_de_retards_mobile['index'].apply(lambda i: thresholds_in_minutes[i])

proportion_de_retards['prop_retards_mobile'] = proportion_de_retards_mobile['prop_retards']

In [12]:
proportion_de_retards

Unnamed: 0,index,prop_retards,prop_retards_connect,prop_retards_mobile
0,0,0.575309,0.428865,0.613798
1,50,0.296709,0.189594,0.324861
2,100,0.182124,0.097884,0.204265
3,150,0.126575,0.056731,0.144932
4,200,0.096721,0.035273,0.112871
5,250,0.07794,0.022928,0.092398
6,300,0.064175,0.015285,0.077024
7,350,0.055977,0.011758,0.067599
8,400,0.049859,0.010288,0.06026


In [38]:
thresholds_in_minutes = [0, 50, 100, 150, 200, 250, 300, 350, 400]
scope = ['all', 'only_connect']

mask = (df_delay['delay_at_checkout_in_minutes'].notnull())
mask_connect = (df_delay['checkin_type'] == 'connect')
mask_mobile = (df_delay['checkin_type'] == 'mobile')

proportion_de_retards = pd.DataFrame({
    'thresholds': thresholds_in_minutes,
    'prop_retards': [df_delay.loc[mask, 'delay_at_checkout_in_minutes'].gt(threshold).mean() for threshold in thresholds_in_minutes],
    'prop_retards_connect': [df_delay.loc[mask & mask_connect, 'delay_at_checkout_in_minutes'].gt(threshold).mean() for threshold in thresholds_in_minutes],
    'prop_retards_mobile': [df_delay.loc[mask & mask_mobile, 'delay_at_checkout_in_minutes'].gt(threshold).mean() for threshold in thresholds_in_minutes]
})

proportion_de_retards


Unnamed: 0,thresholds,prop_retards,prop_retards_connect,prop_retards_mobile
0,0,0.575309,0.428865,0.613798
1,50,0.296709,0.189594,0.324861
2,100,0.182124,0.097884,0.204265
3,150,0.126575,0.056731,0.144932
4,200,0.096721,0.035273,0.112871
5,250,0.07794,0.022928,0.092398
6,300,0.064175,0.015285,0.077024
7,350,0.055977,0.011758,0.067599
8,400,0.049859,0.010288,0.06026


In [41]:
def calculate_proportions2(df):
    thresholds_in_minutes = [0, 50, 100, 150, 200, 250, 300, 350, 400]
    mask = (df['delay_at_checkout_in_minutes'].notnull())
    mask_connect = (df['checkin_type'] == 'connect')
    mask_mobile = (df['checkin_type'] == 'mobile')
    proportions = pd.DataFrame({
        'thresholds': thresholds_in_minutes,
        'prop_retards': [df.loc[mask, 'delay_at_checkout_in_minutes'].
                        gt(threshold).mean() for threshold in thresholds_in_minutes],
        'prop_retards_connect': [df.loc[mask & mask_connect, 'delay_at_checkout_in_minutes'].
                                gt(threshold).mean() for threshold in thresholds_in_minutes],
        'prop_retards_mobile': [df.loc[mask & mask_mobile, 'delay_at_checkout_in_minutes'].
                                gt(threshold).mean() for threshold in thresholds_in_minutes]
    })
    return proportions

proportion_de_retards = calculate_proportions2(df_delay)

fig5 = go.Figure()
fig5.add_trace(
    go.Bar(
        x=proportion_de_retards["thresholds"],
        y=proportion_de_retards["prop_retards"],
        name="Share of late checkouts"
    )
)
fig5.add_trace(
    go.Bar(
        x=proportion_de_retards["thresholds"],
        y=proportion_de_retards["prop_retards_mobile"],
        name="Share of late checkouts (only mobile)"
    )
)
fig5.add_trace(
    go.Bar(
        x=proportion_de_retards["thresholds"],
        y=proportion_de_retards["prop_retards_connect"],
        name="Share of late checkouts (only connect)"
    )
)
fig5.update_layout(
    title='Share of late checkouts by threshold and scope',
    xaxis_title='Threshold (minutes)',
    yaxis_title='Share of late checkouts',
    bargap=0.2
)
fig5.show()

In [14]:
def new_delay(row, threshold):
    if threshold > row['time_delta_with_previous_rental_in_minutes']:
        return row['delay_at_checkout_in_minutes'] - (threshold - row['time_delta_with_previous_rental_in_minutes'])
    else:
        return row['delay_at_checkout_in_minutes']
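
The row-wise function above can also be expressed column-wise. The following is a sketch, not the notebook's own code: `new_delay_vectorized` is a hypothetical helper that applies the same rule with `numpy.where`, and the `toy` rows are made up. Missing gaps compare as `False`, so those rows keep their original delay, as in the row-wise version.

```python
import numpy as np
import pandas as pd


def new_delay_vectorized(df: pd.DataFrame, threshold: float) -> pd.Series:
    """Column-wise equivalent of the row-wise new_delay function."""
    delta = df["time_delta_with_previous_rental_in_minutes"]
    delay = df["delay_at_checkout_in_minutes"]
    # Where the gap is shorter than the threshold, subtract the missing minutes
    adjusted = delay - (threshold - delta)
    return pd.Series(np.where(threshold > delta, adjusted, delay), index=df.index)


# Made-up rows: delay at checkout and gap to the previous rental, in minutes
toy = pd.DataFrame({
    "delay_at_checkout_in_minutes": [-7.0, 58.0, 30.0],
    "time_delta_with_previous_rental_in_minutes": [90.0, 420.0, 150.0],
})
print(new_delay_vectorized(toy, threshold=200).tolist())  # [-117.0, 58.0, -20.0]
```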

In [19]:
threshold = 200

# Build the working frame first, then derive the mask from it so that the indexes align
df_delay2 = df_delay[['delay_at_checkout_in_minutes', 'time_delta_with_previous_rental_in_minutes']].reset_index().copy()
mask = (df_delay2['delay_at_checkout_in_minutes'].notnull()
         & df_delay2['time_delta_with_previous_rental_in_minutes'].notnull())
df_delay2 = df_delay2[mask].copy()
df_delay2['is_delayed_0'] = (df_delay2['delay_at_checkout_in_minutes'] > 0)
df_delay2['new_delay_200'] = df_delay2.apply(lambda row: new_delay(row, threshold), axis=1)
df_delay2['is_delayed_200'] = df_delay2['new_delay_200'] > 0
# df_delay2['is_delayed_300'] = df_delay2['new_delay_300'] > 0
# df_delay2['is_delayed_400'] = df_delay2['new_delay_400'] > 0

In [20]:
df_delay2.head()

Unnamed: 0,index,delay_at_checkout_in_minutes,time_delta_with_previous_rental_in_minutes,is_delayed_0,new_delay_200,is_delayed_200
6,6,-15.0,570.0,False,-15.0,False
19,19,58.0,420.0,True,58.0,True
40,40,-76.0,330.0,False,-76.0,False
64,64,-6.0,630.0,False,-6.0,False
74,74,-7.0,90.0,False,-117.0,False


1) Which share of the owners' revenue is potentially affected by the feature?

=> This is hard to answer with the current data: the delay dataset contains no price column. With more information, we could make a better estimate.

2) How many rentals are affected by the feature depending on the scope and threshold?

=> The 1st chart shows that delays are less spread out for the "connect" scope than for the "mobile" scope. The 4th chart shows that as the threshold increases, fewer and fewer rentals end with a delay above it; the feature would affect about half of the rentals.
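
To put a number on this answer, one option is to count the rentals whose gap to the previous rental is shorter than the threshold, since those are the bookings the feature would hide. This is a sketch under assumptions: `affected_rentals` is a hypothetical helper, the `toy` data is made up, and rentals with a missing gap are assumed to be unaffected.

```python
import pandas as pd

# Made-up sample with the same column names as df_delay
toy = pd.DataFrame({
    "checkin_type": ["mobile", "connect", "connect", "mobile", "connect"],
    "time_delta_with_previous_rental_in_minutes": [30.0, 90.0, None, 240.0, 0.0],
})


def affected_rentals(df: pd.DataFrame, threshold: float, scope: str = "all") -> int:
    """Count rentals whose gap to the previous rental is shorter than the threshold."""
    gap = df["time_delta_with_previous_rental_in_minutes"]
    affected = gap < threshold  # NaN compares as False: assumed unaffected
    if scope == "only_connect":
        affected = affected & (df["checkin_type"] == "connect")
    return int(affected.sum())


for scope in ("all", "only_connect"):
    print(scope, {t: affected_rentals(toy, t, scope) for t in (60, 120, 300)})
```

Running the same function on `df_delay` for each threshold and scope would give the counts the Product Manager asked for.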

3) How often are drivers late for the next check-in? How does it impact the next driver?

=> According to the 2nd chart, 57.5% of checkouts are late. According to the 3rd chart, for the "mobile" type quite a few delays exceed 12*60 = 720 minutes (12 hours being the "limit" of the time delta), against only 13 for the "connect" type. According to the illustration diagram, the next customer is impacted when the delay is greater than the time delta. The 1st chart then shows more impacted rentals when the time delta is small (0 - 200 minutes), which justifies introducing a threshold.
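
The rule "the next customer is impacted when the delay is greater than the time delta" can be checked rental by rental by joining each rental to the one that preceded it. The sketch below is one possible way to do it, not the analysis actually run here: `flag_impacted` and the column `previous_delay_in_minutes` are hypothetical names, and the `toy` rows are made up.

```python
import pandas as pd

# Made-up sample: rental 2 follows rental 1, rental 4 follows rental 3
toy = pd.DataFrame({
    "rental_id": [1, 2, 3, 4],
    "delay_at_checkout_in_minutes": [90.0, -10.0, 20.0, 5.0],
    "previous_ended_rental_id": [float("nan"), 1.0, float("nan"), 3.0],
    "time_delta_with_previous_rental_in_minutes": [float("nan"), 60.0, float("nan"), 120.0],
})


def flag_impacted(df: pd.DataFrame) -> pd.DataFrame:
    """Attach the previous rental's delay and flag rentals that had to wait."""
    previous = df[["rental_id", "delay_at_checkout_in_minutes"]].rename(columns={
        "rental_id": "previous_ended_rental_id",
        "delay_at_checkout_in_minutes": "previous_delay_in_minutes",
    })
    # Keep only rentals that have a known previous rental
    followers = df.dropna(subset=["previous_ended_rental_id"]).astype(
        {"previous_ended_rental_id": "int64"}
    )
    merged = followers.merge(previous, on="previous_ended_rental_id", how="left")
    # Impacted when the previous driver returned the car after the planned gap
    merged["impacted"] = (
        merged["previous_delay_in_minutes"]
        > merged["time_delta_with_previous_rental_in_minutes"]
    )
    return merged


print(flag_impacted(toy)[["rental_id", "previous_delay_in_minutes", "impacted"]])
```

In the toy data, rental 2 is impacted (the previous car came back 90 minutes late for a 60-minute gap) and rental 4 is not.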

4) How many problematic cases would be solved depending on the chosen threshold and scope?

=> The problematic cases are the ones shown in the illustration diagram. The 4th chart shows that switching to the "connect" type alone would take us from 61% of late checkouts to 43%. With a 100-minute threshold, the number of late checkouts is divided by 3 or even 4. As the number of late checkouts goes down, so does the number of problematic cases (again referring to the illustration).
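
A more direct count is possible once each rental is paired with the previous rental's delay. As a sketch, assuming a case is problematic when the previous delay exceeds the gap, and solved when the gap is shorter than the threshold (the booking would not have been offered), the tally looks like this. The pairs and the `solved_cases` helper are made up for illustration.

```python
# Each pair is (previous rental's delay, gap before the next rental), in minutes.
# Values are made up for illustration.
cases = [(90, 60), (20, 120), (200, 150), (45, 30), (400, 300)]


def solved_cases(pairs, threshold):
    """Count problematic cases (delay > gap) that a minimum gap would have prevented."""
    problematic = [(delay, gap) for delay, gap in pairs if delay > gap]
    solved = [pair for pair in problematic if pair[1] < threshold]
    return len(solved), len(problematic)


for threshold in (60, 120, 180, 360):
    solved, total = solved_cases(cases, threshold)
    print(f"threshold={threshold}: {solved}/{total} problematic cases avoided")
```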

## Machine Learning

### Baseline model: linear regression

In [53]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

X = df_pricing.drop('rental_price_per_day', axis=1)
y = df_pricing['rental_price_per_day']

categorical_features = ['model_key', 'fuel', 'paint_color', 'car_type',
                        'private_parking_available', 'has_gps', 'has_air_conditioning',
                        'automatic_car', 'has_getaround_connect', 'has_speed_regulator', 'winter_tires']
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

numeric_features = ['mileage', 'engine_power']
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

regressor = Pipeline(steps=[('preprocessor', preprocessor),
                            ('regressor', LinearRegression())])

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

regressor.fit(X_train, y_train)

In [54]:
print("R2 score on training set : ", regressor.score(X_train, y_train))
print("R2 score on test set : ", regressor.score(X_val, y_val))

R2 score on training set :  0.7282045193657689
R2 score on test set :  0.6461847017959849


### XGBoost

In [37]:
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from xgboost import XGBRegressor

X = df_pricing.drop('rental_price_per_day', axis=1)
y = df_pricing['rental_price_per_day']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

categorical_features = ['model_key', 'fuel', 'paint_color', 'car_type']
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

boolean_features = ['private_parking_available', 'has_gps', 'has_air_conditioning', 'automatic_car',
                    'has_getaround_connect', 'has_speed_regulator', 'winter_tires']
boolean_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(drop='first'))
])

numeric_features = ['mileage', 'engine_power']
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features),
        ('bool', boolean_transformer, boolean_features)
    ])

param_grid = {
    'regressor__n_estimators': [100, 500, 1000],
    'regressor__learning_rate': [0.01, 0.05, 0.1],
    'regressor__max_depth': [3, 4, 5],
}

xgbmodel = XGBRegressor(objective='reg:squarederror')

regressor = Pipeline(steps=[('preprocessor', preprocessor),
                            ('regressor', xgbmodel)])

gridsearch = GridSearchCV(estimator=regressor,
                           param_grid=param_grid,
                           scoring='neg_mean_squared_error',
                           cv=3,
                           verbose=1)

gridsearch.fit(X_train, y_train)

Fitting 3 folds for each of 27 candidates, totalling 81 fits


In [38]:
y_train_pred = gridsearch.predict(X_train)
y_test_pred = gridsearch.predict(X_test)
print("Best parameters:", gridsearch.best_params_)
print("Best cross-validation score: {:.2f}".format(gridsearch.best_score_))

Best parameters: {'regressor__learning_rate': 0.1, 'regressor__max_depth': 4, 'regressor__n_estimators': 500}
Best cross-validation score: -249.27


In [52]:
print("NMSE on training set : ", gridsearch.score(X_train, y_train))
print("NMSE on test set : ", gridsearch.score(X_test, y_test))

NMSE on training set :  -115.50510318374461
NMSE on test set :  -344.85507703935497
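
The scores above are negative mean squared errors, because the grid search uses `scoring='neg_mean_squared_error'`. Taking the square root of the sign-flipped value gives an RMSE in the same unit as `rental_price_per_day`, which is easier to read. The helper name `neg_mse_to_rmse` is introduced here for illustration only.

```python
import math

# Scores reported above (scikit-learn's neg_mean_squared_error is -MSE)
neg_mse_scores = {
    "cross-validation": -249.27,
    "training set": -115.50510318374461,
    "test set": -344.85507703935497,
}


def neg_mse_to_rmse(score: float) -> float:
    """Convert a negative MSE score into an RMSE in the target's units."""
    return math.sqrt(-score)


for name, score in neg_mse_scores.items():
    print(f"{name}: RMSE = {neg_mse_to_rmse(score):.2f}")
```

The gap between the training RMSE (about 10.75) and the test RMSE (about 18.57) is consistent with the gap between the R2 scores reported for the same model.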


In [55]:
import json

json_data = {
  "input": [4842,'Audi',195840,160,'diesel','grey','van',True,True,False,False,True,False,True]
}
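# Note: 'age' is not among the columns selected by the ColumnTransformer, so the pipeline ignores it
feature_names = ['age', 'model_key', 'mileage', 'engine_power', 'fuel', 'paint_color', 'car_type',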
                 'private_parking_available', 'has_gps', 'has_air_conditioning',
                 'automatic_car', 'has_getaround_connect', 'has_speed_regulator', 'winter_tires']
input_data = pd.DataFrame([json_data['input']], columns=feature_names)

predicted_price = gridsearch.predict(input_data)

output_json = {
  "prediction": predicted_price.tolist()
}

print(json.dumps(output_json))


{"prediction": [127.94544219970703]}


In [56]:
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from xgboost import XGBRegressor

X = df_pricing.drop('rental_price_per_day', axis=1)
y = df_pricing['rental_price_per_day']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

categorical_features = ['model_key', 'fuel', 'paint_color', 'car_type']
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

boolean_features = ['private_parking_available', 'has_gps', 'has_air_conditioning', 'automatic_car',
                    'has_getaround_connect', 'has_speed_regulator', 'winter_tires']
boolean_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(drop='first'))
])

numeric_features = ['mileage', 'engine_power']
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features),
        ('bool', boolean_transformer, boolean_features)
    ])

xgbmodel = XGBRegressor(objective='reg:squarederror',
                        n_estimators = 500,
                        learning_rate = 0.1,
                        max_depth = 4)

regressor = Pipeline(steps=[('preprocessor', preprocessor),
                            ('regressor', xgbmodel)])

regressor.fit(X_train, y_train)

In [57]:
print("R2 score on training set : ", regressor.score(X_train, y_train))
print("R2 score on test set : ", regressor.score(X_test, y_test))

R2 score on training set :  0.8959475686331715
R2 score on test set :  0.712159761434431


In [58]:
import json

json_data = {
  "input": [4842,'Audi',195840,160,'diesel','grey','van',True,True,False,False,True,False,True]
}
feature_names = ['age', 'model_key', 'mileage', 'engine_power', 'fuel', 'paint_color', 'car_type',
                 'private_parking_available', 'has_gps', 'has_air_conditioning',
                 'automatic_car', 'has_getaround_connect', 'has_speed_regulator', 'winter_tires']
input_data = pd.DataFrame([json_data['input']], columns=feature_names)

predicted_price = regressor.predict(input_data)

output_json = {
  "prediction": predicted_price.tolist()
}

print(json.dumps(output_json))

{"prediction": [127.94544219970703]}


In [59]:
import os
from joblib import dump

print("Saving model...")
dump(regressor, 'pricing_model.joblib')
print(f"Model has been saved here: {os.getcwd()}")

Saving model...
Model has been saved here: c:\Users\pierr\OneDrive\Documents\Formation\Formation Jedha\Jedha_Training\Projets Portfolio\Bloc Déploiement\Projet8_GetAround_Analysis


In [2]:
import pandas as pd
from typing import List
import joblib
from pydantic import BaseModel
import warnings
warnings.filterwarnings("ignore")

class VehicleData(BaseModel):
    age: int
    model_key: str
    mileage: int
    engine_power: int
    fuel: str
    paint_color: str
    car_type: str
    private_parking_available: bool
    has_gps: bool
    has_air_conditioning: bool
    automatic_car: bool
    has_getaround_connect: bool
    has_speed_regulator: bool
    winter_tires: bool

class InputData(BaseModel):
    input: List[VehicleData]

data = {
    "input": [[4842,'Audi',195840,160,'diesel','grey','van',True,True,False,False,True,False,True],
            [4838,'Toyota',39743,110,'diesel','black','van',False,True,False,False,False,False,True]]
}

def predict(data: dict):
    if 'input' not in data:
        raise ValueError("The 'input' key is missing from the provided data.")
    
    vehicle_dicts = [dict(zip(['age', 'model_key', 'mileage', 'engine_power',
                               'fuel', 'paint_color', 'car_type', 'private_parking_available',
                               'has_gps', 'has_air_conditioning', 'automatic_car',
                               'has_getaround_connect', 'has_speed_regulator', 'winter_tires'],
                               vehicle)) for vehicle in data['input']]
    
    input_data = pd.DataFrame(vehicle_dicts)

    pricing_model = joblib.load('./pricing_model.joblib')

    prediction = pricing_model.predict(input_data)

    response = {
        "prediction": prediction.tolist()
    }

    return response

result = predict(data)
print(result)

{'prediction': [127.94544219970703, 127.71898651123047]}
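
The function above reloads the model from disk on every call. A possible refinement, sketched here with a stand-in loader so that it runs without the saved file, is to cache the loaded model. `get_model`, `predict_rows` and `load_calls` are hypothetical names; a real version of `get_model` would return `joblib.load(path)`.

```python
from functools import lru_cache

load_calls = 0  # counts how often the (stand-in) loader actually runs


@lru_cache(maxsize=1)
def get_model(path: str = "./pricing_model.joblib"):
    """Load the model once and reuse it on later calls.

    The body is a stand-in: a real version would return joblib.load(path).
    """
    global load_calls
    load_calls += 1
    return lambda rows: [0.0 for _ in rows]  # hypothetical dummy predictor


def predict_rows(rows):
    """Build the response payload using the cached model."""
    model = get_model()  # cached after the first call
    return {"prediction": model(rows)}


predict_rows([[1, 2]])
predict_rows([[3, 4], [5, 6]])
print(load_calls)  # the loader ran only once
```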


In [None]:
!curl -X 'POST' \
  'http://localhost:4000/predict' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "input": [
    {
      "age": 4842,
      "model_key": "Audi",
      "mileage": 195840,
      "engine_power": 160,
      "fuel": "diesel",
      "paint_color": "grey",
      "car_type": "van",
      "private_parking_available": true,
      "has_gps": true,
      "has_air_conditioning": false,
      "automatic_car": false,
      "has_getaround_connect": true,
      "has_speed_regulator": false,
      "winter_tires": true
    }
  ]
}'

In [None]:
import requests

response = requests.post("http://localhost:4000/predict", json={
    "input": [
        {
            "age": 4842,
            "model_key": "Audi",
            "mileage": 195840,
            "engine_power": 160,
            "fuel": "diesel",
            "paint_color": "grey",
            "car_type": "van",
            "private_parking_available": True,
            "has_gps": True,
            "has_air_conditioning": False,
            "automatic_car": False,
            "has_getaround_connect": True,
            "has_speed_regulator": False,
            "winter_tires": True
        }
    ]
})
print(response.json())