![snap](https://lever-client-logos.s3.amazonaws.com/2bd4cdf9-37f2-497f-9096-c2793296a75f-1568844229943.png)

# **GetAround Analysis**

[GetAround](https://www.getaround.com/?wpsrc=Google+Organic+Search) is the Airbnb for cars: you can rent a car from any owner for a few hours to a few days. Founded in 2009, the company has grown rapidly. In 2019, it counted over 5 million users and about 20K available cars worldwide.

## Context

When renting a car, our users have to complete a checkin flow at the beginning of the rental and a checkout flow at the end of the rental in order to:

* Assess the state of the car and notify other parties of pre-existing damages or damages that occurred during the rental.
* Compare fuel levels.
* Measure how many kilometers were driven.

The checkin and checkout of our rentals can be done with three distinct flows:
* **📱 Mobile** rental agreement on native apps: driver and owner meet and both sign the rental agreement on the owner’s smartphone
* **Connect:** the driver doesn’t meet the owner and unlocks the car with their smartphone
* **📝 Paper** contract (negligible)

## Project 🚧

When using Getaround, drivers book cars for a specific time period, from an hour to a few days long. They are supposed to bring back the car on time, but it happens from time to time that drivers are late for the checkout.

Late returns at checkout can generate high friction for the next driver if the car was supposed to be rented again on the same day: Customer Service often reports users who were unsatisfied because they had to wait for the car to come back from the previous rental, or who even had to cancel their rental because the car wasn’t returned on time.

## Goals 🎯

To mitigate those issues, Getaround decided to implement a minimum delay between two rentals. A car won’t be displayed in the search results if the requested checkin or checkout times are too close to an already booked rental.

This solves the late checkout issue but also potentially hurts Getaround's and owners' revenues: the right trade-off has to be found.

The Product Manager still needs to decide:
* **threshold:** how long should the minimum delay be?
* **scope:** should we enable the feature for all cars, or only for Connect cars?

In order to help them make the right decision, they are asking for some data insights. Here are the first analyses they could think of, to kickstart the discussion.

* Which share of our owner’s revenue would potentially be affected by the feature?
* How many rentals would be affected by the feature depending on the threshold and scope we choose?
* How often are drivers late for the next check-in? How does it impact the next driver?
* How many problematic cases will it solve depending on the chosen threshold and scope?
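
The threshold and scope questions above can be framed in code roughly as follows. This is only a sketch: the helper `affected_rentals` and the toy rows are hypothetical, and it assumes a rental counts as affected when its gap to the previous rental is shorter than the threshold. The column names match the delay-analysis dataset used later in this notebook.

```python
import pandas as pd

def affected_rentals(df: pd.DataFrame, threshold_min: int, scope: str = "all") -> int:
    """Count rentals whose gap to the previous rental is below the threshold.

    scope: "all" for every car, or "connect" for Connect check-ins only.
    """
    if scope == "connect":
        df = df[df["checkin_type"] == "connect"]
    gap = df["time_delta_with_previous_rental_in_minutes"]
    # NaN gaps (no recorded previous rental) compare as False and are not counted
    return int((gap < threshold_min).sum())

# Toy rows for illustration only (not taken from the real dataset)
toy = pd.DataFrame({
    "checkin_type": ["mobile", "connect", "connect", "mobile"],
    "time_delta_with_previous_rental_in_minutes": [30.0, 90.0, None, 200.0],
})

print(affected_rentals(toy, 60))              # all cars, 60-minute threshold
print(affected_rentals(toy, 120, "connect"))  # Connect cars only
```

Running the same helper over a grid of thresholds and both scopes gives the kind of table the Product Manager can compare against the number of late-return conflicts avoided.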

## **This project is divided into three parts:**

### Part 1: Data Analysis and Web Dashboarding

Perform a data analysis and build a dashboard with Streamlit that helps the Product Management team answer the questions above.

### Part 2: Machine Learning

The Data Science team is working on *pricing optimization*. They have gathered some data to suggest optimum prices for car owners using Machine Learning.

#### 1) Train and manage a model using an MLflow server

A linear regression model will be trained and tracked with an MLflow server, which manages the different experiments and stores the model artifacts. Predictions can then be made easily by loading and calling the logged model.

#### 2) Make predictions directly from the API - `/predict` endpoint

Predictions will be served by an API hosted online on Hugging Face, through the `/predict` endpoint. The full URL would look like this: `https://your-url.com/predict`. This endpoint accepts the POST method with JSON input data and returns the predictions.
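
As a rough illustration of such a call, the sketch below builds the POST request with the standard library only. The helper `build_predict_request`, the `"input"` key and the placeholder feature values are assumptions made for the example; the real payload schema is whatever the deployed API defines.

```python
import json
import urllib.request

def build_predict_request(base_url: str, rows: list) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /predict endpoint."""
    # Assumption: the API expects a JSON object with an "input" key holding
    # one list of feature values per car; adapt this to the real schema.
    body = json.dumps({"input": rows}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Placeholder values, not real car features
request = build_predict_request("https://your-url.com", [[1, 2, 3]])
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request).read()
```

Building the request separately from sending it keeps the example runnable offline and makes the payload easy to inspect.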

### Part 3: API documentation page

Documentation for the API will be provided to users at the `/docs` path of the website, i.e. `https://your-url.com/docs`. This short documentation includes:

- A title and a description of the API
- A prediction endpoint `/predict`
- A description of every endpoint the user can call, with the required input and the expected output.

### Share the code

The code will be shared in a [GitHub](https://github.com/) repository with a [`README.md`](https://guides.github.com/features/mastering-markdown/) file containing a quick description of the project, how to set it up locally, and the online URL.

## Deliverable 📬

- The **whole code** stored in a Github repository with the repository's URL.
- A **dashboard** in production accessible via a web page.
- A **trained model** stored in an MLflow server.
- A **documented online API** on a Hugging Face server containing one `/predict` endpoint that respects the technical description above.

## Data

* [Delay Analysis](https://full-stack-assets.s3.eu-west-3.amazonaws.com/Deployment/get_around_delay_analysis.xlsx) 👈 Data Analysis
* [Pricing Optimization](https://full-stack-assets.s3.eu-west-3.amazonaws.com/Deployment/get_around_pricing_project.csv) 👈 Machine Learning

In [1]:
!python3 --version

Python 3.11.11


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Import libraries for data manipulation and analysis
import pandas as pd  # For working with DataFrames
import numpy as np  # For numerical operations and arrays

# Import libraries for data visualization
import matplotlib.pyplot as plt  # For creating static plots and visualizations
import seaborn as sns   # For statistical plots built on top of matplotlib
import plotly.express as px  # For creating interactive plots
import plotly.graph_objects as go  # For creating more customized plots with Plotly
from plotly.subplots import make_subplots  # For creating subplots within a figure
import plotly.io as pio  # For configuring Plotly output

In [None]:
file_path = "/content/drive/MyDrive/Colab Notebooks/Getaround_project/get_around_delay_analysis.xlsx"

# Load file
data = pd.read_excel(file_path)

# Display the first few rows of the DataFrame
data.head()

Unnamed: 0,rental_id,car_id,checkin_type,state,delay_at_checkout_in_minutes,previous_ended_rental_id,time_delta_with_previous_rental_in_minutes
0,505000,363965,mobile,canceled,,,
1,507750,269550,mobile,ended,-81.0,,
2,508131,359049,connect,ended,70.0,,
3,508865,299063,connect,canceled,,,
4,511440,313932,mobile,ended,,,


### **1. EDA: Delay analysis**

In [None]:
# Display the shape of the dataframe (rows, columns)
data.shape

(21310, 7)

In [None]:
# Display information about the DataFrame
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21310 entries, 0 to 21309
Data columns (total 7 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   rental_id                                   21310 non-null  int64  
 1   car_id                                      21310 non-null  int64  
 2   checkin_type                                21310 non-null  object 
 3   state                                       21310 non-null  object 
 4   delay_at_checkout_in_minutes                16346 non-null  float64
 5   previous_ended_rental_id                    1841 non-null   float64
 6   time_delta_with_previous_rental_in_minutes  1841 non-null   float64
dtypes: float64(3), int64(2), object(2)
memory usage: 1.1+ MB


In [None]:
# Display missing values
data.isnull().sum()

Unnamed: 0,0
rental_id,0
car_id,0
checkin_type,0
state,0
delay_at_checkout_in_minutes,4964
previous_ended_rental_id,19469
time_delta_with_previous_rental_in_minutes,19469


In [None]:
# Check duplicates
data.duplicated().sum()

np.int64(0)

In [None]:
# Display summary statistics of the DataFrame
data.describe(include="all")

Unnamed: 0,rental_id,car_id,checkin_type,state,delay_at_checkout_in_minutes,previous_ended_rental_id,time_delta_with_previous_rental_in_minutes
count,21310.0,21310.0,21310,21310,16346.0,1841.0,1841.0
unique,,,2,2,,,
top,,,mobile,ended,,,
freq,,,17003,18045,,,
mean,549712.880338,350030.603426,,,59.701517,550127.411733,279.28843
std,13863.446964,58206.249765,,,1002.561635,13184.023111,254.594486
min,504806.0,159250.0,,,-22433.0,505628.0,0.0
25%,540613.25,317639.0,,,-36.0,540896.0,60.0
50%,550350.0,368717.0,,,9.0,550567.0,180.0
75%,560468.5,394928.0,,,67.0,560823.0,540.0


In [None]:
# Get all unique previous_ended_rental_id values (excluding NaN)
previous_rental_ids = data['previous_ended_rental_id'].dropna().unique()

# Get all unique rental_id values
rental_ids = data['rental_id'].unique()

# Check if all previous_ended_rental_id values are in rental_id values
all_previous_in_rental = np.all(np.isin(previous_rental_ids, rental_ids))

# Print the result
if all_previous_in_rental:
    print("All previous_ended_rental_id values correspond to a rental_id.")
else:
    print("Not all previous_ended_rental_id values correspond to a rental_id.")

    # Find the previous_ended_rental_id values that are not in rental_id values
    missing_ids = previous_rental_ids[~np.isin(previous_rental_ids, rental_ids)]
    print(f"Missing rental_ids: {missing_ids}")

All previous_ended_rental_id values correspond to a rental_id.


#### **1.1. Car ID**

In [None]:
# Get unique car_id count
data["car_id"].nunique()

8143

In [None]:
# Count rentals by car ID
rentals_by_car_id = data.groupby('car_id')['rental_id'].count().reset_index()
rentals_by_car_id.rename(columns={'rental_id': 'rental_count'}, inplace=True)

# Group by rental_count and count occurrences
rental_count_distribution = rentals_by_car_id.groupby('rental_count')['car_id'].count().reset_index()
rental_count_distribution.rename(columns={'car_id': 'number_of_cars'}, inplace=True)

fig = px.bar(
    rental_count_distribution,
    x='rental_count',
    y='number_of_cars',
    title='Distribution of Rental Counts',
    labels={'rental_count': 'Number of Rentals', 'number_of_cars': 'Number of Cars'}
)
fig.show()

In [None]:
# Calculate total number of cars
total_cars = rental_count_distribution['number_of_cars'].sum()

# Filter for rental_count = 1
rentals_count_1 = rental_count_distribution[rental_count_distribution['rental_count'] == 1]

# Calculate percentage
percentage_rentals_count_1 = (rentals_count_1['number_of_cars'].sum() / total_cars) * 100

# Print the result (the share is computed over cars, not rentals)
print(f"Percentage of cars with rental_count = 1: {percentage_rentals_count_1:.2f}%")

Percentage of cars with rental_count = 1: 44.93%


- 45% of the cars in the dataset were rented only once.

In [None]:
# Number of distinct check-in types used by each car, broadcast to every rental row
n_types_per_car = data.groupby('car_id')['checkin_type'].transform('nunique')

# Cars with a single check-in type keep it ('mobile' or 'connect');
# cars that were rented through both flows are labelled 'both'
data['checkin_type_category'] = data['checkin_type'].where(n_types_per_car == 1, 'both')

# Group by car_id and count the occurrences of checkin types
checkin_type_counts = data.groupby('car_id')['checkin_type_category'].first().value_counts()

# Print the overall distribution of checkin types
print("\nOverall distribution of checkin types:")
print(checkin_type_counts)

# Create a pie chart using Plotly Express
fig = px.pie(
    names=checkin_type_counts.index,
    values=checkin_type_counts.values,
    title='Distribution of Check-in Types for cars'
)

fig.show()


Overall distribution of checkin types:
checkin_type_category
mobile     7429
connect     618
both         96
Name: count, dtype: int64


In [None]:
# Filter for cars with both check-in types
cars_with_both_types = data[data.groupby('car_id')['checkin_type'].transform('nunique') > 1]['car_id'].unique()
filtered_data = data[data['car_id'].isin(cars_with_both_types)]

# Calculate check-in type percentages for filtered data
checkin_type_percentages = filtered_data['checkin_type'].value_counts(normalize=True) * 100

# Print the percentages
print("Check-in Type Percentages for Cars with Both Types:")
print(checkin_type_percentages)

# Create a pie chart using Plotly Express
fig = px.pie(
    names=checkin_type_percentages.index,
    values=checkin_type_percentages.values,
    title='Check-in Type Percentages for Cars with Both Types'
)

fig.show()

Check-in Type Percentages for Cars with Both Types:
checkin_type
connect    73.06338
mobile     26.93662
Name: proportion, dtype: float64


#### **1.2. Check-in type**

In [None]:
# Count occurrences of 'mobile' and 'connect'
checkin_type_counts = data['checkin_type'].value_counts()
checkin_type_counts

Unnamed: 0_level_0,count
checkin_type,Unnamed: 1_level_1
mobile,17003
connect,4307


In [None]:
# Create a pie chart using Plotly Express
fig = px.pie(
    names=checkin_type_counts.index,
    values=checkin_type_counts.values,
    title='Distribution of Check-in Types',
)

fig.show()

#### **1.3. State**

In [None]:
# Count occurrences of 'ended' and 'canceled'
state_type_counts = data['state'].value_counts()
state_type_counts

Unnamed: 0_level_0,count
state,Unnamed: 1_level_1
ended,18045
canceled,3265


In [None]:
# Create a pie chart using Plotly Express
fig = px.pie(
    names=state_type_counts.index,
    values=state_type_counts.values,
    title='Distribution of state Types',
)

fig.show()

##### **1.3.1. Canceled**

In [None]:
# Filter the original DataFrame to select rows with 'state' as 'canceled'
canceled_rentals_df = data[data['state'] == 'canceled']

# Display the new DataFrame
print(canceled_rentals_df.head())

    rental_id  car_id checkin_type     state  delay_at_checkout_in_minutes  \
0      505000  363965       mobile  canceled                           NaN   
3      508865  299063      connect  canceled                           NaN   
8      512475  322502       mobile  canceled                           NaN   
10     513743  330658       mobile  canceled                           NaN   
11     514161  366037      connect  canceled                           NaN   

    previous_ended_rental_id  time_delta_with_previous_rental_in_minutes  \
0                        NaN                                         NaN   
3                        NaN                                         NaN   
8                        NaN                                         NaN   
10                       NaN                                         NaN   
11                       NaN                                         NaN   

   checkin_type_category  
0                 mobile  
3                con

In [None]:
# Display summary statistics of the DataFrame
canceled_rentals_df.describe()

Unnamed: 0,rental_id,car_id,delay_at_checkout_in_minutes,previous_ended_rental_id,time_delta_with_previous_rental_in_minutes
count,3265.0,3265.0,1.0,229.0,229.0
mean,548637.68392,350585.309954,-17468.0,550913.327511,294.89083
std,14907.810897,57254.052866,,11955.3976,250.591601
min,504871.0,159533.0,-17468.0,509972.0,0.0
25%,539183.0,317572.0,-17468.0,543706.0,60.0
50%,549700.0,368593.0,-17468.0,550970.0,210.0
75%,560563.0,394869.0,-17468.0,560395.0,570.0
max,576195.0,416935.0,-17468.0,574540.0,720.0


In [None]:
# Filter the DataFrame to select rows where 'delay_at_checkout_in_minutes' is not null
not_null_delays_df = canceled_rentals_df[canceled_rentals_df['delay_at_checkout_in_minutes'].notnull()]

# Print the filtered DataFrame
print(not_null_delays_df)

       rental_id  car_id checkin_type     state  delay_at_checkout_in_minutes  \
21002     559126  379544       mobile  canceled                      -17468.0   

       previous_ended_rental_id  time_delta_with_previous_rental_in_minutes  \
21002                       NaN                                         NaN   

      checkin_type_category  
21002                mobile  


- One canceled rental has a checkout delay recorded (-17,468 minutes) even though it never started. This inconsistent row is dropped.

In [None]:
# Drop row 21002
data = data.drop(index=21002)

In [None]:
# Check if row 21002 is still present
if 21002 in data.index:
    print("Row 21002 is still present in the DataFrame.")
else:
    print("Row 21002 has been dropped from the DataFrame.")

Row 21002 has been dropped from the DataFrame.


##### **1.3.2. Ended**

In [None]:
# Filter the DataFrame to exclude canceled rentals
#data_filtered = data[data['state'] != 'canceled']
#data_filtered.head()

In [None]:
# Display the shape of the dataframe (rows, columns)
#data_filtered.shape

In [None]:
# Display missing values
#data_filtered.isnull().sum()

#### **1.4. delay at checkout in minutes**

In [None]:
# Create a histogram using Plotly Express
fig = px.histogram(data, x="delay_at_checkout_in_minutes")
fig.show()

In [None]:
# Create a box chart using Plotly Express
fig = px.box(data, y="delay_at_checkout_in_minutes")
fig.show()

In [None]:
# Calculate quantiles
Q1 = data['delay_at_checkout_in_minutes'].quantile(0.25)
Q3 = data['delay_at_checkout_in_minutes'].quantile(0.75)
IQR = Q3 - Q1
# Define upper and lower bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Filter data to remove outliers
data_IQR = data[(data['delay_at_checkout_in_minutes'] >= lower_bound) & (data['delay_at_checkout_in_minutes'] <= upper_bound)]

In [None]:
# Display the shape of the dataframe (rows, columns)
data_IQR.shape

(13916, 8)
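
To make the IQR rule concrete, the arithmetic can be replayed with the quartiles printed by `data.describe()` earlier. This is an approximate sketch: those quartiles were computed before row 21002 was dropped, so the bounds actually used by the cell above may differ slightly.

```python
# Quartiles as reported by data.describe() earlier in the notebook
q1, q3 = -36.0, 67.0

iqr = q3 - q1                 # interquartile range
lower_bound = q1 - 1.5 * iqr  # delays below this are treated as outliers
upper_bound = q3 + 1.5 * iqr  # delays above this are treated as outliers

print(f"IQR = {iqr} min, kept range = [{lower_bound}, {upper_bound}] min")
```

Under these values, only checkouts between roughly 3 hours early and 3 hours 40 minutes late are kept.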

In [None]:
# Create a histogram using Plotly Express
fig = px.histogram(data_IQR, x="delay_at_checkout_in_minutes")
fig.show()

In [None]:
# Create a box chart using Plotly Express
fig = px.box(data_IQR, y="delay_at_checkout_in_minutes")
fig.show()

In [None]:
# Calculate Z-scores
data['zscore'] = (data['delay_at_checkout_in_minutes'] - data['delay_at_checkout_in_minutes'].mean()) / data['delay_at_checkout_in_minutes'].std()

# Define threshold (in standard deviations)
threshold = 0.5  # Much stricter than the more common 2 or 3 standard deviations

# Filter data to remove outliers
data_Zscore = data[(data['zscore'] >= -threshold) & (data['zscore'] <= threshold)]
# Remove 'zscore' column if not needed
data = data.drop(columns=['zscore'])

In [None]:
# Display the shape of the dataframe (rows, columns)
data_Zscore.shape

(15310, 9)
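
The range of delays kept by this z-score filter can be estimated from the mean and standard deviation printed by `data.describe()`. The figures below are approximate, since those statistics were computed before row 21002 was dropped.

```python
mean_delay = 59.701517   # mean from data.describe()
std_delay = 1002.561635  # standard deviation from data.describe()
threshold = 0.5          # same value as in the z-score cell

# A row is kept when |delay - mean| <= threshold * std
lower = mean_delay - threshold * std_delay
upper = mean_delay + threshold * std_delay
print(f"Delays kept: {lower:.1f} to {upper:.1f} minutes")
```

Because a few extreme values inflate the standard deviation, even half a standard deviation spans several hours on each side of the mean.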

In [None]:
# Create a histogram using Plotly Express
fig = px.histogram(data_Zscore, x="delay_at_checkout_in_minutes")
fig.show()

In [None]:
# Create a box chart using Plotly Express
fig = px.box(data_Zscore, y="delay_at_checkout_in_minutes")
fig.show()

In [None]:
# Filter out rows with missing 'delay_at_checkout_in_minutes'
# (.copy() gives an independent DataFrame, so adding a column raises no SettingWithCopyWarning)
valid_delays_df = data[data['delay_at_checkout_in_minutes'].notna()].copy()

# Create the new column 'delay_at_checkout_type'
delay = valid_delays_df['delay_at_checkout_in_minutes']
valid_delays_df['delay_at_checkout_type'] = np.select(
    [delay > 0, delay < 0],
    ['late', 'in advance'],
    default='no_delay'  # only zero delays remain, since missing values were filtered out
)



In [None]:
# Count occurrences of type of checkout
checkout_type_counts = valid_delays_df['delay_at_checkout_type'].value_counts()
checkout_type_counts

Unnamed: 0_level_0,count
delay_at_checkout_type,Unnamed: 1_level_1
late,9404
in advance,6819
no_delay,122


In [None]:
# Create a pie chart using Plotly Express
fig = px.pie(
    names=checkout_type_counts.index,
    values=checkout_type_counts.values,
    title='Distribution of delay at checkout Types',
)

fig.show()

In [None]:
# Filter for late checkouts (positive delays)
late_checkouts_df = data[data['delay_at_checkout_in_minutes'] > 0]

# Define delay intervals
delay_intervals = [
    (1, 60),      # 1 to 60 minutes (1 hour)
    (61, 120),    # 61 to 120 minutes (2 hours)
    (121, 180),   # 121 to 180 minutes (3 hours)
    (181, 240),   # 181 to 240 minutes (4 hours)
    (241, 300),   # 241 to 300 minutes (5 hours)
    (301, float('inf'))  # More than 300 minutes
]

# Create a list to store the percentages
delay_percentages = []

# Calculate percentage for each interval
for interval in delay_intervals:
    lower_bound, upper_bound = interval
    # Count delays within the interval
    delay_count = late_checkouts_df[
        (late_checkouts_df['delay_at_checkout_in_minutes'] >= lower_bound) &
        (late_checkouts_df['delay_at_checkout_in_minutes'] <= upper_bound)
    ].shape[0]
    # Calculate percentage
    percentage = (delay_count / late_checkouts_df.shape[0]) * 100
    delay_percentages.append(percentage)

# Create a DataFrame to display the results
delay_distribution_only_df = pd.DataFrame({
    'Delay Interval (minutes)': [f"{lower}-{upper}" for lower, upper in delay_intervals],
    'Percentage of Delays': delay_percentages
})

# Print the distribution
print("Delay Distribution (Late Checkouts Only):")
print(delay_distribution_only_df)

Delay Distribution (Late Checkouts Only):
  Delay Interval (minutes)  Percentage of Delays
0                     1-60             53.360272
1                   61-120             19.491706
2                  121-180              8.475117
3                  181-240              4.487452
4                  241-300              3.030625
5                  301-inf             11.154828


In [None]:
# Create a pie chart using Plotly Express
fig = px.pie(
    delay_distribution_only_df,
    names='Delay Interval (minutes)',
    values='Percentage of Delays',
    title='Delay Distribution (Late Checkouts Only)'
)

fig.show()

In [None]:
# Create a bar chart
fig = px.bar(
    delay_distribution_only_df,
    x='Delay Interval (minutes)',
    y='Percentage of Delays',
    title='Delay Distribution (Late Checkouts Only)',
    text='Percentage of Delays'  # Add text to display percentage values
)

# Update layout to display text on top of bars
fig.update_traces(texttemplate='%{text:.1f}%', textposition='outside')

fig.show()

- With a minimum gap of 1 hour between two rentals, the buffer would absorb 53.4% of late checkouts (those late by at most 60 minutes).
- With a minimum gap of 2 hours, it would absorb 72.9% of late checkouts.
- With a minimum gap of 3 hours, it would absorb 81.3% of late checkouts.
- With a minimum gap of 4 hours, it would absorb 85.8% of late checkouts.
- With a minimum gap of 5 hours, it would absorb 88.8% of late checkouts.
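
These cumulative shares are running totals of the per-interval percentages printed above. A minimal sketch of the arithmetic, summing the unrounded values so that rounding errors do not accumulate:

```python
from itertools import accumulate

# Share of late checkouts per delay interval, copied from the printed table
late_share_by_interval = {
    "1-60": 53.360272,
    "61-120": 19.491706,
    "121-180": 8.475117,
    "181-240": 4.487452,
    "241-300": 3.030625,
}

# Running total: share of late checkouts no longer than each upper bound
cumulative = list(accumulate(late_share_by_interval.values()))
for hours, share in enumerate(cumulative, start=1):
    print(f"buffer of {hours} h covers {share:.1f}% of late checkouts")
```

Summing already-rounded percentages instead can shift the last digit by 0.1.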

In [None]:
# Filter out rows with missing 'delay_at_checkout_in_minutes'
valid_delays_df = data[data['delay_at_checkout_in_minutes'].notna()]

# Define delay intervals
delay_intervals = [
    (1, 60),      # 1 to 60 minutes (1 hour)
    (61, 120),    # 61 to 120 minutes (2 hours)
    (121, 180),   # 121 to 180 minutes (3 hours)
    (181, 240),   # 181 to 240 minutes (4 hours)
    (241, 300),   # 241 to 300 minutes (5 hours)
    (301, float('inf'))  # More than 300 minutes
]

# Create a list to store the percentages
delay_percentages = []

# Calculate percentage for each interval
for interval in delay_intervals:
    lower_bound, upper_bound = interval
    # Count delays within the interval using valid_delays_df
    delay_count = valid_delays_df[
        (valid_delays_df['delay_at_checkout_in_minutes'] >= lower_bound) &
        (valid_delays_df['delay_at_checkout_in_minutes'] <= upper_bound)
    ].shape[0]
    # Calculate percentage using valid_delays_df.shape[0]
    percentage = (delay_count / valid_delays_df.shape[0]) * 100
    delay_percentages.append(percentage)

# Create a DataFrame to display the results
delay_distribution_df = pd.DataFrame({
    'Delay Interval (minutes)': [f"{lower}-{upper}" for lower, upper in delay_intervals],
    'Percentage of Delays': delay_percentages
})

# Print the distribution
print("Delay Distribution:")
print(delay_distribution_df)

Delay Distribution:
  Delay Interval (minutes)  Percentage of Delays
0                     1-60             30.700520
1                   61-120             11.214439
2                  121-180              4.876109
3                  181-240              2.581829
4                  241-300              1.743652
5                  301-inf              6.417865


In [None]:
# Create a bar chart
fig = px.bar(
    delay_distribution_df,
    x='Delay Interval (minutes)',
    y='Percentage of Delays',
    title='Delay Distribution',
    text='Percentage of Delays'  # Add text to display percentage values
)

# Update layout to display text on top of bars
fig.update_traces(texttemplate='%{text:.1f}%', textposition='outside')

fig.show()

- With a minimum gap of 1 hour between two rentals, the buffer would absorb 53.4% of late checkouts, leaving 26.8% of all checkouts (with a recorded delay) late beyond the buffer.
- With a minimum gap of 2 hours, it would absorb 72.9% of late checkouts, leaving 15.6%.
- With a minimum gap of 3 hours, it would absorb 81.3% of late checkouts, leaving 10.7%.
- With a minimum gap of 4 hours, it would absorb 85.8% of late checkouts, leaving 8.2%.
- With a minimum gap of 5 hours, it would absorb 88.8% of late checkouts, leaving 6.4%.
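
The two views of the data (share of late checkouts absorbed, share of all checkouts still late) are linked by the overall late rate. The sketch below reproduces that link from the counts in the `delay_at_checkout_type` table; the `covered` values are running totals of the per-interval percentages printed for late checkouts only.

```python
# Counts from the delay_at_checkout_type table
late, early, on_time = 9404, 6819, 122
late_share = 100 * late / (late + early + on_time)  # % of checkouts that are late

# Cumulative % of late checkouts no longer than 1..5 hours (unrounded sums)
covered = [53.360272, 72.851978, 81.327095, 85.814547, 88.845172]

# Late checkouts not absorbed by the buffer, as a share of all checkouts
remaining = [late_share * (1 - pct / 100) for pct in covered]
for hours, pct_left in enumerate(remaining, start=1):
    print(f"{hours} h buffer: {pct_left:.1f}% of checkouts still late beyond the buffer")
```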

#### **1.5. Time delta with previous rental in minutes**

In [None]:
# Create a box chart using Plotly Express
fig = px.box(data, y="time_delta_with_previous_rental_in_minutes")
fig.show()

In [None]:
# Filter out rows with missing time_delta_with_previous_rental_in_minutes
valid_time_delta_df = data[data['time_delta_with_previous_rental_in_minutes'].notna()]

# Define time delta intervals
time_delta_intervals = [
    (0, 0),       # 0 minutes
    (1, 60),      # 1 to 60 minutes (1 hour)
    (61, 120),    # 61 to 120 minutes (2 hours)
    (121, 180),   # 121 to 180 minutes (3 hours)
    (181, 240),   # 181 to 240 minutes (4 hours)
    (241, 300),   # 241 to 300 minutes (5 hours)
    (301, float('inf'))  # More than 300 minutes
]

# Create a list to store the percentages
time_delta_percentages = []

# Calculate percentage for each interval
for interval in time_delta_intervals:
    lower_bound, upper_bound = interval
    # Count time deltas within the interval
    time_delta_count = valid_time_delta_df[
        (valid_time_delta_df['time_delta_with_previous_rental_in_minutes'] >= lower_bound) &
        (valid_time_delta_df['time_delta_with_previous_rental_in_minutes'] <= upper_bound)
    ].shape[0]
    # Calculate percentage using valid_time_delta_df.shape[0]
    percentage = (time_delta_count / valid_time_delta_df.shape[0]) * 100
    time_delta_percentages.append(percentage)

# Create a DataFrame to display the results
time_delta_distribution_df = pd.DataFrame({
    'Time Delta Interval (minutes)': [f"{lower}-{upper}" for lower, upper in time_delta_intervals],
    'Percentage of Rentals': time_delta_percentages
})

# Print the distribution
print("Time Delta Distribution:")
print(time_delta_distribution_df)

Time Delta Distribution:
  Time Delta Interval (minutes)  Percentage of Rentals
0                           0-0              15.154807
1                          1-60              16.567083
2                        61-120              11.895709
3                       121-180               8.147746
4                       181-240               6.246605
5                       241-300               4.128191
6                       301-inf              37.859859
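
The interval loop above can also be written with `pandas.cut`. The sketch below uses toy values rather than the real column, and is only equivalent to the loop when gaps are whole minutes: a fractional value such as 60.5 falls in no interval of the loop, but in the `61-120` bin here.

```python
import pandas as pd

# Toy gaps in minutes, for illustration only
gaps = pd.Series([0, 30, 60, 61, 150, 240, 500], name="gap_min")

# Right-closed bins reproduce the 0, 1-60, 61-120, ... intervals for whole minutes
edges = [-1, 0, 60, 120, 180, 240, 300, float("inf")]
labels = ["0", "1-60", "61-120", "121-180", "181-240", "241-300", "301+"]

binned = pd.cut(gaps, bins=edges, labels=labels)
share = binned.value_counts(normalize=True, sort=False) * 100
print(share.round(1))
```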


In [None]:
# Create a bar chart
fig = px.bar(
    time_delta_distribution_df,
    x='Time Delta Interval (minutes)',
    y='Percentage of Rentals',
    title='Time Delta Distribution',
    text='Percentage of Rentals'  # Add text to display percentage values
)

# Update layout to display text on top of bars
fig.update_traces(texttemplate='%{text:.1f}%', textposition='outside')

fig.show()

- 15.2% of rentals are not spaced at all
- 16.6% of rentals are spaced by less than 1 hour
- 11.9% of rentals are spaced by 1 to 2 hours
- 8.1% of rentals are spaced by 2 to 3 hours
- 6.2% of rentals are spaced by 3 to 4 hours
- 4.1% of rentals are spaced by 4 to 5 hours
- 37.9% of rentals are spaced by more than 5 hours

- With a minimum gap of 1 hour, GetAround would lose up to 31.7% of consecutive rentals, i.e. rentals with a recorded previous rental (1,841 of the 21,310 rentals in the dataset).
- With a minimum gap of 2 hours, GetAround would lose up to 43.6% of them.
- With a minimum gap of 3 hours, GetAround would lose up to 51.8%.
- With a minimum gap of 4 hours, GetAround would lose up to 58.0%.
- With a minimum gap of 5 hours, GetAround would lose up to 62.1%.


- With a minimum gap of 1 hour, GetAround would absorb 53.4% of late checkouts and leave 26.8% of checkouts late beyond the buffer, but would lose up to 31.7% of consecutive rentals.
- With a minimum gap of 2 hours, it would absorb 72.9% of late checkouts and leave 15.6%, but would lose up to 43.6% of consecutive rentals.
- With a minimum gap of 3 hours, it would absorb 81.3% of late checkouts and leave 10.7%, but would lose up to 51.8%.
- With a minimum gap of 4 hours, it would absorb 85.8% of late checkouts and leave 8.2%, but would lose up to 58.0%.
- With a minimum gap of 5 hours, it would absorb 88.8% of late checkouts and leave 6.4%, but would lose up to 62.1%.

#### **1.6. Checkin data**

In [None]:
# Merge the DataFrame with itself to link each rental with the rental that follows it.
# Left side (suffix '_current'): the earlier rental, matched on its rental_id.
# Right side (suffix '_previous'): the following rental, which points back to it
# through previous_ended_rental_id.
merged_data_checkin = pd.merge(data, data, left_on='rental_id', right_on='previous_ended_rental_id', how='inner', suffixes=('_current', '_previous'))

# The inner join already guarantees rental_id_current == previous_ended_rental_id_previous,
# so no extra filter is needed; copy so new columns can be added without a SettingWithCopyWarning
filtered_data = merged_data_checkin.copy()

# Time left before the next check-in: planned gap minus the earlier rental's checkout delay
# (a negative value means the car came back after the next rental was due to start)
filtered_data['checkin_delay_in_minutes'] = filtered_data['time_delta_with_previous_rental_in_minutes_previous'] - filtered_data['delay_at_checkout_in_minutes_current']

# Display the linked rentals with the relevant columns
print(filtered_data[['rental_id_current', 'previous_ended_rental_id_previous', 'delay_at_checkout_in_minutes_current', 'time_delta_with_previous_rental_in_minutes_previous', 'checkin_delay_in_minutes']].head())

   rental_id_current  previous_ended_rental_id_previous  \
0             531158                           531158.0   
1             533303                           533303.0   
2             533380                           533380.0   
3             534820                           534820.0   
4             535313                           535313.0   

   delay_at_checkout_in_minutes_current  \
0                                  29.0   
1                                -340.0   
2                                -167.0   
3                                -576.0   
4                                  23.0   

   time_delta_with_previous_rental_in_minutes_previous  \
0                                               90.0     
1                                              600.0     
2                                              690.0     
3                                              150.0     
4                                              720.0     

   checkin_delay_in_minutes  
0      
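
The subtraction can be checked by hand on the five rows printed above. The snippet below simply re-enters those printed values; it does not read the original dataset.

```python
import pandas as pd

# Values copied from the five rows printed above
rows = pd.DataFrame({
    "delay_at_checkout_in_minutes_current": [29.0, -340.0, -167.0, -576.0, 23.0],
    "time_delta_with_previous_rental_in_minutes_previous": [90.0, 600.0, 690.0, 150.0, 720.0],
})

# Planned gap minus checkout delay of the earlier rental
rows["checkin_delay_in_minutes"] = (
    rows["time_delta_with_previous_rental_in_minutes_previous"]
    - rows["delay_at_checkout_in_minutes_current"]
)

print(rows["checkin_delay_in_minutes"].tolist())  # [61.0, 940.0, 857.0, 726.0, 697.0]
```

An early return (negative checkout delay) increases the time left, which is why the second row ends up with 940 minutes.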

In [None]:
# Display the shape of the dataframe (rows, columns)
filtered_data.shape

(1841, 17)

In [None]:
# Create a histogram of the time left before the next check-in
fig = px.histogram(
    filtered_data,
    x="checkin_delay_in_minutes",
    title="Distribution of Check-in Delay in Minutes",
    nbins=50  # Adjust the number of bins as needed for better visualization
)

fig.show()

In [None]:
# Count NaN values
nan_count = filtered_data['checkin_delay_in_minutes'].isnull().sum()
# Count no time left values
no_time_left_count = (filtered_data['checkin_delay_in_minutes'] == 0).sum()
# Count positive time left values
positive_time_left_count = (filtered_data['checkin_delay_in_minutes'] > 0).sum()
# Count negative time left values
negative_time_left_count = (filtered_data['checkin_delay_in_minutes'] < 0).sum()
# Print the counts
print(f"NaN Count: {nan_count}")
print(f"No time left Count: {no_time_left_count}")
print(f"Positive time left Count: {positive_time_left_count}")
print(f"Negative time left Count: {negative_time_left_count}")

NaN Count: 112
No time left Count: 5
Positive time left Count: 1506
Negative time left Count: 218
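
As a quick consistency check, the four counts above can be turned into shares. The numbers are copied from the printed output; nothing else is assumed.

```python
# Counts copied from the output above
counts = {"nan": 112, "zero": 5, "positive": 1506, "negative": 218}

total = sum(counts.values())    # should match the 1841 rows of filtered_data
known = total - counts["nan"]   # rows where the time left could be computed

share_negative = counts["negative"] / known
share_positive_or_zero = (counts["positive"] + counts["zero"]) / known

print(f"Rows in total: {total}")
print(f"Negative time left: {share_negative:.1%} of rows with a known value")
print(f"Zero or positive time left: {share_positive_or_zero:.1%} of rows with a known value")
```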


In [None]:
# Count occurrences of 'state' as 'canceled' in filtered_data
canceled_count = filtered_data[filtered_data['state_previous'] == 'canceled'].shape[0]

# Print the count
print(f"Number of canceled rentals in filtered_data: {canceled_count}")

Number of canceled rentals in filtered_data: 229


In [None]:
# Create categories based on 'checkin_delay_in_minutes' values (excluding NaN)
filtered_data['checkin_delay_type'] = pd.cut(
    filtered_data['checkin_delay_in_minutes'],
    bins=[-float('inf'), 0, float('inf')],
    labels=['Negative time left', 'Positive time left'],
    include_lowest=True,
    right=False
)

# Filter out rows with NaN values in 'checkin_delay_type' (created from NaN in 'checkin_delay_in_minutes')
filtered_data_no_nan = filtered_data[filtered_data['checkin_delay_type'].notna()]

# Count occurrences of each category (excluding NaN)
checkin_delay_counts = filtered_data_no_nan['checkin_delay_type'].value_counts()

# Create a pie chart using Plotly Express
fig = px.pie(
    names=checkin_delay_counts.index,
    values=checkin_delay_counts.values,
    title='Distribution of Check-in Delay Types (excluding NaN)',
)

fig.show()
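
A small sketch of how `pd.cut` treats the bin edges used above: with `right=False` the intervals are closed on the left, so a value of exactly 0 is labelled 'Positive time left', and NaN stays NaN. The sample values are invented.

```python
import pandas as pd

sample = pd.Series([-30.0, 0.0, 45.0, float("nan")])  # invented values

labels = pd.cut(
    sample,
    bins=[-float("inf"), 0, float("inf")],
    labels=["Negative time left", "Positive time left"],
    right=False,  # intervals are [-inf, 0) and [0, inf)
)

print(labels.tolist())
```

This means the five rentals with exactly zero minutes left are counted in the positive slice of the pie chart.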

In [None]:
# Filter for negative time left
negative_time_left_df = filtered_data[filtered_data['checkin_delay_type'] == 'Negative time left']

# Count occurrences of checkin_type within negative time left
# Access the original 'checkin_type' column from the 'filtered_data' DataFrame
checkin_type_counts = filtered_data.loc[negative_time_left_df.index, 'checkin_type_previous'].value_counts()

# Print the counts
print("Check-in Type Counts for Negative Time Left:")
print(checkin_type_counts)

# Access individual counts if needed
mobile_count = checkin_type_counts.get('mobile', 0)  # Get count for 'mobile', default to 0 if not found
connect_count = checkin_type_counts.get('connect', 0) # Get count for 'connect', default to 0 if not found

Check-in Type Counts for Negative Time Left:
checkin_type_previous
mobile     149
connect     69
Name: count, dtype: int64


In [None]:
# Create the pie chart (plotly.express is imported here so the cell also runs in a fresh kernel)
import plotly.express as px

fig = px.pie(
    names=checkin_type_counts.index,  # Check-in types (e.g., 'mobile', 'connect')
    values=checkin_type_counts.values,  # Counts for each check-in type
    title='Check-in Type Distribution for Negative Time Left',
)

fig.show()

In [None]:
# Count occurrences of state within negative time left
# Access the original 'state' column from the 'filtered_data' DataFrame
state_counts = filtered_data.loc[negative_time_left_df.index, 'state_previous'].value_counts()

# Print the counts
print("State Counts for Negative Time Left:")
print(state_counts)

# Access individual counts if needed
ended_count = state_counts.get('ended', 0)  # Get count for 'ended', default to 0 if not found
canceled_count = state_counts.get('canceled', 0) # Get count for 'canceled', default to 0 if not found

State Counts for Negative Time Left:
state_previous
ended       181
canceled     37
Name: count, dtype: int64


In [None]:
# Group by 'state_previous' and 'checkin_type_previous' and get counts
grouped_counts = negative_time_left_df.groupby(['state_previous', 'checkin_type_previous']).size().reset_index(name='count')

# Print the grouped counts
print("Grouped Counts for Negative Time Left:")
print(grouped_counts)

Grouped Counts for Negative Time Left:
  state_previous checkin_type_previous  count
0       canceled               connect     19
1       canceled                mobile     18
2          ended               connect     50
3          ended                mobile    131
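
To compare the two check-in types despite their different sizes, the grouped counts above can be converted into shares within each check-in type. The table is re-entered from the printed output; the column name `share_within_checkin_type` is introduced here for illustration.

```python
import pandas as pd

# Counts copied from the grouped output above
grouped = pd.DataFrame({
    "state_previous": ["canceled", "canceled", "ended", "ended"],
    "checkin_type_previous": ["connect", "mobile", "connect", "mobile"],
    "count": [19, 18, 50, 131],
})

# Divide each count by the total of its check-in type
totals = grouped.groupby("checkin_type_previous")["count"].transform("sum")
grouped["share_within_checkin_type"] = grouped["count"] / totals

print(grouped.round(3))
```

On these counts, roughly 28% of the Connect rows are cancellations against roughly 12% of the Mobile rows.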


In [None]:
# Create pie chart for 'mobile' check-in type
mobile_data = grouped_counts[grouped_counts['checkin_type_previous'] == 'mobile']
fig_mobile = px.pie(
    mobile_data,
    names='state_previous',
    values='count',
    title='Distribution of Rental State for Mobile Check-in',
)
fig_mobile.show()

In [None]:
# Create pie chart for 'connect' check-in type
connect_data = grouped_counts[grouped_counts['checkin_type_previous'] == 'connect']
fig_connect = px.pie(
    connect_data,
    names='state_previous',
    values='count',
    title='Distribution of Rental State for Connect Check-in',
)
fig_connect.show()

### **2. Machine learning: Pricing prediction**

In [None]:
# Load file
pricing_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Getaround_project/get_around_pricing_project.csv')

In [None]:
# Import libraries
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_squared_error, r2_score
import warnings

warnings.filterwarnings(
    "ignore", category=DeprecationWarning
)  # to avoid deprecation warnings

In [None]:
# Display the first few rows of the DataFrame
pricing_df.head()

Unnamed: 0.1,Unnamed: 0,model_key,mileage,engine_power,fuel,paint_color,car_type,private_parking_available,has_gps,has_air_conditioning,automatic_car,has_getaround_connect,has_speed_regulator,winter_tires,rental_price_per_day
0,0,Citroën,140411,100,diesel,black,convertible,True,True,False,False,True,True,True,106
1,1,Citroën,13929,317,petrol,grey,convertible,True,True,False,False,False,True,True,264
2,2,Citroën,183297,120,diesel,white,convertible,False,False,False,False,True,False,True,101
3,3,Citroën,128035,135,diesel,red,convertible,True,True,False,False,True,True,True,158
4,4,Citroën,97097,160,diesel,silver,convertible,True,True,False,False,False,True,True,183


In [None]:
# Drop the unnamed first column from pricing_df
pricing_df = pricing_df.drop(pricing_df.columns[0], axis=1)

In [None]:
# Display the shape of the dataframe (rows, columns)
pricing_df.shape

(4843, 14)

In [None]:
# Display information about the DataFrame
pricing_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4843 entries, 0 to 4842
Data columns (total 14 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   model_key                  4843 non-null   object
 1   mileage                    4843 non-null   int64 
 2   engine_power               4843 non-null   int64 
 3   fuel                       4843 non-null   object
 4   paint_color                4843 non-null   object
 5   car_type                   4843 non-null   object
 6   private_parking_available  4843 non-null   bool  
 7   has_gps                    4843 non-null   bool  
 8   has_air_conditioning       4843 non-null   bool  
 9   automatic_car              4843 non-null   bool  
 10  has_getaround_connect      4843 non-null   bool  
 11  has_speed_regulator        4843 non-null   bool  
 12  winter_tires               4843 non-null   bool  
 13  rental_price_per_day       4843 non-null   int64 
dtypes: bool(7), int64(3), object(4)

In [None]:
# List the distinct frequencies with which 'engine_power' values occur (how many cars share each power value)
pricing_df['engine_power'].value_counts().unique()

array([882, 785, 631, 570, 451, 319, 166, 153, 142, 120,  99,  62,  49,
        47,  43,  40,  32,  30,  25,  21,  19,  16,  14,  11,   9,   7,
         6,   5,   4,   3,   2,   1])

In [None]:
# Check missing values
pricing_df.isnull().sum().any()

np.False_

In [None]:
# Separate target variable Y from features X
target_name = 'rental_price_per_day'

print("Separating labels from features...")
Y = pricing_df.loc[:, target_name]
X = pricing_df.drop(target_name, axis=1)  # Drop target
print("...Done.")
print(Y.head())
print()
print(X.head())
print()

Separating labels from features...
...Done.
0    106
1    264
2    101
3    158
4    183
Name: rental_price_per_day, dtype: int64

  model_key  mileage  engine_power    fuel paint_color     car_type  \
0   Citroën   140411           100  diesel       black  convertible   
1   Citroën    13929           317  petrol        grey  convertible   
2   Citroën   183297           120  diesel       white  convertible   
3   Citroën   128035           135  diesel         red  convertible   
4   Citroën    97097           160  diesel      silver  convertible   

   private_parking_available  has_gps  has_air_conditioning  automatic_car  \
0                       True     True                 False          False   
1                       True     True                 False          False   
2                      False    False                 False          False   
3                       True     True                 False          False   
4                       True     True               

In [None]:
# Divide dataset into train set & test set
print("Dividing into train and test sets...")
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
# test_size is the proportion of rows from X and Y that go into the test set;
# the correspondence between the rows of X and Y is maintained

# random_state is available in all functions that have a pseudo-random behaviour:
# if it is not set, the function gives a different random result every time the cell
# runs; if it is given a value, the results are the same every time the cell runs
print("...Done.")
print()

Dividing into train and test sets...
...Done.
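
The two properties described in the comments above (row correspondence and reproducibility with a fixed `random_state`) can be verified on a toy array. The names `X_demo` and `y_demo` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data built so that row i of X_demo starts with 2 * y_demo[i]
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr_a, X_te_a, y_tr_a, y_te_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
X_tr_b, X_te_b, y_tr_b, y_te_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

same_split = np.array_equal(X_te_a, X_te_b) and np.array_equal(y_te_a, y_te_b)
rows_still_aligned = bool(np.all(X_te_a[:, 0] == 2 * y_te_a))

print("Same split on both calls:", same_split)
print("Features and targets still aligned:", rows_still_aligned)
print("Test set size:", len(y_te_a))
```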



In [None]:
# Create pipeline for numeric features
numeric_features = ['mileage', 'engine_power']  # Names of numeric columns in X_train/X_test
numeric_transformer = Pipeline(
    steps=[
        ('scaler', StandardScaler())
    ]
)

In [None]:
# Filter categorical features
all_features = pricing_df.columns.tolist()
features_to_exclude = numeric_features + [target_name]  # Combine numeric and target
categorical_features = [feature for feature in all_features if feature not in features_to_exclude] # Names of categorical columns in X_train/X_test

print(categorical_features)

['model_key', 'fuel', 'paint_color', 'car_type', 'private_parking_available', 'has_gps', 'has_air_conditioning', 'automatic_car', 'has_getaround_connect', 'has_speed_regulator', 'winter_tires']


In [None]:
# Create pipeline for categorical features
categorical_transformer = Pipeline(
    steps=[
        (
            'encoder',
            OneHotEncoder(drop="first", handle_unknown='ignore'),  # Unknown categories are encoded as all zeros during transform
        ),  # The first category of each feature is dropped to avoid perfectly collinear dummy columns
    ]
)

In [None]:
# Use ColumnTransformer to make a preprocessor object that describes all the treatments to be done
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features),
    ]
)

In [None]:
# Preprocessings on train set
print("Performing preprocessings on train set...")
print(X_train.head())
X_train = preprocessor.fit_transform(X_train)
print("...Done.")
print(X_train[0:5])  # Positional slicing: X_train is now a NumPy array or SciPy sparse matrix, no longer a pandas DataFrame
print()

# Preprocessings on test set
print("Performing preprocessings on test set...")
print(X_test.head())
X_test = preprocessor.transform(X_test)  # Do not fit again: the test set is used to validate decisions
# made on the training set, so only transformations whose parameters were learned on the training set may be applied.
# Fitting on the test set would leak information from it and bias all the results.
print("...Done.")
print(
    X_test[0:5, :]
)  # Same positional slicing as above: X_test is no longer a pandas DataFrame
print()
print()

Performing preprocessings on train set...
     model_key  mileage  engine_power    fuel paint_color car_type  \
4550       BMW   132485           135  diesel       white      suv   
1237   Citroën   131121           135  diesel       black   estate   
3158   Renault   209216           135  diesel        grey    sedan   
900    Peugeot   148986           100  diesel       black   estate   
933    Citroën   170500           135  diesel       black   estate   

      private_parking_available  has_gps  has_air_conditioning  automatic_car  \
4550                       True     True                 False          False   
1237                      False     True                 False          False   
3158                       True     True                 False          False   
900                        True     True                 False          False   
933                        True     True                 False          False   

      has_getaround_connect  has_speed_regulator  


Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros



In [None]:
# Train model
model = LinearRegression()

print("Training model...")
model.fit(X_train, Y_train)
print("...Done.")

Training model...
...Done.


In [None]:
# Predictions on training set
print("Predictions on training set...")
Y_train_pred = model.predict(X_train)
print("...Done.")
print(Y_train_pred[0:5])
print()

Predictions on training set...
...Done.
[132.11472904 118.97972318 118.29209294 101.0444328  104.69670365]



In [None]:
# Predictions on test set
print("Predictions on test set...")
Y_test_pred = model.predict(X_test)
print("...Done.")
print(Y_test_pred[0:5])
print()

Predictions on test set...
...Done.
[ 91.36911628 157.56519238 105.17967036  72.2206298  100.58722405]



In [None]:
# Print scores using appropriate metrics for regression
print("R-squared on training set : ", round(r2_score(Y_train, Y_train_pred), 2))
print("R-squared on test set : ", round(r2_score(Y_test, Y_test_pred), 2))

mse_train = mean_squared_error(Y_train, Y_train_pred)
mse_test = mean_squared_error(Y_test, Y_test_pred)

print("Mean Squared Error on training set : ", round(mse_train, 2))
print("Root Mean Squared Error on training set : ", round(np.sqrt(mse_train), 2)) # Calculate and print RMSE for training set

print("Mean Squared Error on test set : ", round(mse_test, 2))
print("Root Mean Squared Error on test set : ", round(np.sqrt(mse_test), 2)) # Calculate and print RMSE for test set

R-squared on training set :  0.72
R-squared on test set :  0.69
Mean Squared Error on training set :  323.21
Root Mean Squared Error on training set :  17.98
Mean Squared Error on test set :  337.42
Root Mean Squared Error on test set :  18.37
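
For readers who want to see what these metrics measure, here is a small worked example on three invented prices. It compares `r2_score` with the formula 1 - SS_res / SS_tot and derives the RMSE from the MSE.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([100.0, 120.0, 140.0])  # invented prices
y_pred = np.array([110.0, 120.0, 130.0])  # invented predictions

mse = mean_squared_error(y_true, y_pred)  # mean of squared errors: (100 + 0 + 100) / 3
rmse = np.sqrt(mse)                       # same unit as the target (price per day)

ss_res = ((y_true - y_pred) ** 2).sum()         # 200
ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # 800
r2_by_hand = 1 - ss_res / ss_tot

print(round(mse, 2), round(rmse, 2), r2_by_hand, r2_score(y_true, y_pred))
```

Read this way, the test RMSE of 18.37 above is a typical prediction error expressed in the unit of `rental_price_per_day`.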


In [None]:
# Get feature names after preprocessing
feature_names = numeric_features + list(preprocessor.named_transformers_['cat']['encoder'].get_feature_names_out(categorical_features))

# Get coefficients
coefficients = model.coef_

# Create a DataFrame to display coefficients with feature names
coefficients_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})

# Sort coefficients by absolute value to see most impactful features
coefficients_df['Absolute_Coefficient'] = coefficients_df['Coefficient'].abs()
coefficients_df = coefficients_df.sort_values(by='Absolute_Coefficient', ascending=False)

# Display the coefficients
print(coefficients_df)

                           Feature  Coefficient  Absolute_Coefficient
8                  model_key_Honda   -63.341485             63.341485
23                model_key_Suzuki    40.058790             40.058790
28              fuel_hybrid_petrol    38.533721             38.533721
12              model_key_Maserati    36.695858             36.695858
26                model_key_Yamaha    36.526594             36.526594
21                  model_key_SEAT    34.294238             34.294238
6                   model_key_Fiat   -32.964875             32.964875
24                model_key_Toyota    28.045712             28.045712
45                    car_type_van   -26.019922             26.019922
16                  model_key_Opel    24.431297             24.431297
25            model_key_Volkswagen    24.335118             24.335118
33               paint_color_green   -22.492860             22.492860
13              model_key_Mercedes    21.879169             21.879169
19               mod

In [None]:
# Create a bar plot
fig = px.bar(
    coefficients_df,
    x="Feature",
    y="Coefficient",
    title="Feature Coefficients of Linear Regression Model",
)

# Update x-axis layout
fig.update_xaxes(
    tickangle=45,  # Rotate x-axis labels for readability
    categoryorder='total ascending'  # Order the features by coefficient value, from most negative to most positive
)

fig.show()

In [None]:
# Read the dropped category of each feature directly from the fitted encoder:
# categories_ lists the categories seen during fit, drop_idx_ gives the index of the dropped one.
# (Parsing the encoded names with split('_') is unreliable because feature names such as 'model_key' contain underscores.)
encoder = preprocessor.named_transformers_['cat']['encoder']

dropped_features = {}
for feature, categories, drop_idx in zip(categorical_features, encoder.categories_, encoder.drop_idx_):
    # drop_idx is None when no category was dropped for this feature
    dropped_features[feature] = categories[drop_idx] if drop_idx is not None else None

# Print the dropped features
print("Dropped Features:")
for feature, category in dropped_features.items():
    print(f"{feature}: {category}")


In [None]:
# Get the intercept
intercept = model.intercept_

# Create the equation string (the target is rental_price_per_day; numeric features are standardized)
equation = f"rental_price_per_day = {intercept:.2f}"  # Start with the intercept

# Add terms for each feature with non-zero coefficients
for index, row in coefficients_df[coefficients_df['Coefficient'] != 0].iterrows():
    feature_name = row['Feature']
    coefficient = row['Coefficient']
    equation += f" + {coefficient:.2f} * {feature_name}"

# Print the equation
print("Linear Regression Equation:")
print(equation)

Linear Regression Equation:
rental_price_per_day = 105.18 + -63.34 * model_key_Honda + 40.06 * model_key_Suzuki + 38.53 * fuel_hybrid_petrol + 36.70 * model_key_Maserati + 36.53 * model_key_Yamaha + 34.29 * model_key_SEAT + -32.96 * model_key_Fiat + 28.05 * model_key_Toyota + -26.02 * car_type_van + 24.43 * model_key_Opel + 24.34 * model_key_Volkswagen + -22.49 * paint_color_green + 21.88 * model_key_Mercedes + 21.34 * model_key_Porsche + 20.79 * model_key_Subaru + -19.49 * fuel_petrol + 18.92 * model_key_Mitsubishi + 15.54 * model_key_Ferrari + 13.91 * model_key_Renault + 13.84 * engine_power + -13.39 * mileage + -13.27 * model_key_Ford + 12.31 * has_gps_True + 9.72 * model_key_Audi + -9.67 * car_type_estate + 8.98 * fuel_electro + 8.24 * model_key_KIA Motors + -8.22 * car_type_hatchback + -8.16 * car_type_subcompact + 7.81 * model_key_Lexus + 7.51 * model_key_Peugeot + -6.10 * model_key_PGO + 6.01 * model_key_BMW + 5.95 * has_getaround_connect_True + 5.50 * car_type_coupe + 4.61 * automatic_
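
Because `mileage` and `engine_power` pass through `StandardScaler`, their coefficients in the equation are expressed per standard deviation, not per kilometre or per unit of power. The sketch below, on invented data with a known slope, shows one way to convert such a coefficient back to original units by dividing by the scaler's `scale_`. The names `toy_scaler` and `toy_model` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Invented data: the price drops by exactly 0.0004 per kilometre
mileage = np.array([[20000.0], [60000.0], [100000.0], [140000.0], [180000.0]])
price = 150 - 0.0004 * mileage.ravel()

toy_scaler = StandardScaler()
toy_model = LinearRegression().fit(toy_scaler.fit_transform(mileage), price)

coef_per_std = toy_model.coef_[0]                # effect of one standard deviation of mileage
coef_per_km = coef_per_std / toy_scaler.scale_[0]  # effect of one kilometre

print(f"Per standard deviation: {coef_per_std:.2f}")
print(f"Per kilometre: {coef_per_km:.6f}")
```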

In [None]:
import requests

url = "https://farabouna-getaroudapispace.hf.space/predict"  # Replace with the correct FastAPI URL

# Input data: send a list of InputData objects directly, without wrapping it in an 'input' key
# (named 'payload' so that it does not overwrite the 'data' DataFrame used earlier)
payload = [
    {
        "model_key": "Honda",
        "mileage": 20000,
        "engine_power": 120,
        "fuel": "gasoline",  # Note: the fuel values shown in the dataset above are diesel, petrol, hybrid_petrol and electro
        "paint_color": "red",
        "car_type": "sedan",
        "private_parking_available": True,
        "has_gps": True,
        "has_air_conditioning": True,
        "automatic_car": False,
        "has_getaround_connect": True,
        "has_speed_regulator": False,
        "winter_tires": True
    }
]

# Send POST request with the list of InputData objects
response = requests.post(url, json=payload, timeout=30)
response.raise_for_status()  # Raise an error on a non-2xx HTTP status

# Print the response from the API
print(response.json())


{'predictions': [83.44397291029853]}
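
When several cars need to be scored at once, the request body can be built from DataFrame rows instead of being typed by hand. This is a sketch that assumes the endpoint accepts a list of records with the same keys as above; the two rows are copied from the `pricing_df.head()` output, and no request is sent.

```python
import json

import pandas as pd

# First two rows of the pricing dataset, copied from the head() output above
cars = pd.DataFrame({
    "model_key": ["Citroën", "Citroën"],
    "mileage": [140411, 13929],
    "engine_power": [100, 317],
    "fuel": ["diesel", "petrol"],
    "paint_color": ["black", "grey"],
    "car_type": ["convertible", "convertible"],
    "private_parking_available": [True, True],
    "has_gps": [True, True],
    "has_air_conditioning": [False, False],
    "automatic_car": [False, False],
    "has_getaround_connect": [True, False],
    "has_speed_regulator": [True, True],
    "winter_tires": [True, True],
})

# Round-trip through to_json so that every value is a plain JSON type (no NumPy scalars)
records = json.loads(cars.to_json(orient="records"))

print(json.dumps(records[0], ensure_ascii=False, indent=2))
```

The resulting `records` list could then be passed as the `json=` argument of `requests.post`.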
