# Predicting Airbnb Listing Prices in Paris  
*Using machine learning to understand key pricing drivers*

## 1. Business Understanding

<div style="text-align: center;">
  <img src="../illustration-images/images.png" width="610" style="display:inline-block; margin-right: 20px;">
  <img src="../illustration-images/paris.jpg" width="340" style="display:inline-block;">
</div>

### 1.1. Business Objectives

Predicting Airbnb listing prices and understanding the most influential features can be highly beneficial to stakeholders.
Renters want the prices proposed for accommodations to be attractive, while hosts, who seek a profit, want pricing guidance that does not leave them feeling shortchanged. Ultimately, both parties aim for a fair price based on the apartment's characteristics and location.

The goal of this project is to develop a *supervised machine learning* model to:

* **Understand** the impact of individual features typically found in listings.
* **Predict** accurate prices based on the characteristics of the apartment.

To carry out this project, we will use Airbnb listing price data for the city of **Paris** from the past 12 months. Using this dataset, we will train a machine learning model, analyze feature importance, and ultimately deploy the model into production as a web app.

<div style='text-align: center;'>
    <img src='../illustration-images/diagram-pipeline.png' width="600">
</div>

This project will interest and benefit the following stakeholders:

* **Renters**: The model will help renters assess whether the listed price is fair by comparing it to the predicted price. For price-sensitive renters, it also offers insight into which features to prioritize or compromise on to stay within their budget.
* **Hosts**: Setting a fair and competitive price can be challenging, especially for new hosts. Our model will assist them by offering data-driven price suggestions based on similar apartments. It can also help hosts identify which features to highlight or improve to increase the value of their listing.

### 1.2. Business Success Criteria

In the scope of this project, we aim to achieve the following:

* **Low RMSE**: We will use the Root Mean Square Error (RMSE) to evaluate the model's performance. Our target is an RMSE below €50, indicating that price estimates are reasonably close to actual values.
* **Identified key features**: At the end of the project, we aim to clearly identify the most important features that influence pricing.
* **Lightweight model**: The final model should be as lightweight as possible to enable fast and efficient deployment.
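As a reference for the first criterion, RMSE is the square root of the mean squared difference between actual and predicted prices. The snippet below is a minimal sketch of that calculation; the function name `rmse` and the example prices are illustrative assumptions, not values taken from the dataset or from the final model.

```python
import numpy as np


def rmse(actual_prices, predicted_prices):
    """Root Mean Square Error between two equal-length sequences of prices."""
    actual = np.asarray(actual_prices, dtype=float)
    predicted = np.asarray(predicted_prices, dtype=float)
    if actual.shape != predicted.shape:
        raise ValueError("actual and predicted must have the same shape")
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Illustrative nightly prices in euros (made up for this example).
example_actual = [100.0, 150.0, 80.0, 220.0]
example_predicted = [110.0, 140.0, 95.0, 200.0]

example_rmse = rmse(example_actual, example_predicted)
meets_target = example_rmse < 50  # success criterion: RMSE below 50 euros
print(f"RMSE: {example_rmse:.2f} EUR, meets target: {meets_target}")
```

Because the errors are squared before averaging, a few badly mispriced listings weigh more heavily on RMSE than many small errors, which is worth keeping in mind when interpreting the €50 target.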




## 2. Data Understanding

### 2.1. Collecting Initial Data

The dataset used in this project was sourced from [Inside Airbnb](https://insideairbnb.com/fr/get-the-data/), a platform that provides publicly available Airbnb data for various cities.

For this analysis, we downloaded the Paris listings dataset dated March 3, 2025.

According to Inside Airbnb, the following information is important to note:

- The data is collected from publicly available information on the Airbnb website.

- It has been verified, cleansed, and aggregated by Inside Airbnb.

- The reported location of each listing is obfuscated for privacy reasons, with coordinates randomly displaced within a radius of approximately 150 meters (450 feet) from the actual address.

- Listings within the same building are anonymized individually, which can make them appear scattered on the map.

For further details and assumptions regarding the data collection process, refer to the official [Inside Airbnb Data Assumptions](https://insideairbnb.com/fr/data-assumptions/) page.
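To get a feel for what a displacement of about 150 meters means when working with latitude and longitude values, the sketch below converts that distance into approximate degree offsets. The constants (about 111,320 meters per degree of latitude, and 48.86° as the latitude of central Paris) are rounded assumptions, and the helper name is hypothetical.

```python
import math

METERS_PER_DEGREE_LAT = 111_320  # approximate length of one degree of latitude
PARIS_LATITUDE_DEG = 48.86       # approximate latitude of central Paris (assumption)
MAX_DISPLACEMENT_M = 150         # obfuscation radius reported by Inside Airbnb


def displacement_in_degrees(meters, latitude_deg):
    """Convert a ground distance into approximate (latitude, longitude) offsets in degrees."""
    lat_offset = meters / METERS_PER_DEGREE_LAT
    # One degree of longitude shrinks with the cosine of the latitude.
    lon_offset = meters / (METERS_PER_DEGREE_LAT * math.cos(math.radians(latitude_deg)))
    return lat_offset, lon_offset


lat_offset, lon_offset = displacement_in_degrees(MAX_DISPLACEMENT_M, PARIS_LATITUDE_DEG)
print(f"150 m is roughly {lat_offset:.4f} deg of latitude and {lon_offset:.4f} deg of longitude")
```

In practice this suggests that coordinates should not be treated as more precise than a few thousandths of a degree when they are used as location features.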

```python
# Import the libraries used throughout the notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline

# Load the Paris listings file downloaded from Inside Airbnb
raw_data = pd.read_csv('../data/listings.csv')
```

_Which attributes (columns) from the database seem most promising?_

According to the Inside Airbnb website, the Paris dataset contains 86,064 listings with 79 attributes.
Information about the listing attributes is available [here](https://docs.google.com/spreadsheets/d/1iWCNJcSutYqpULSQHlNyGInUvHg2BoUGoNRIGa6Szc4/edit?gid=1322284596#gid=1322284596), and the following seem the most promising:

| Attribute | Description |
|-----------|-------------|
| neighbourhood_cleansed | The neighbourhood as geocoded using the latitude and longitude against neighbourhoods as defined by open or public digital shapefiles. |
| host_is_superhost | Whether the host has Superhost status. |
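The reported figures can be checked against the loaded data. The following is a minimal sketch, assuming a hypothetical helper `check_listings`; the small `demo` frame is a made-up stand-in so the example runs on its own, and in the notebook `raw_data` would be passed instead.

```python
import pandas as pd

EXPECTED_ROWS = 86_064  # listings reported by Inside Airbnb for Paris
EXPECTED_COLS = 79      # attributes reported by Inside Airbnb
CANDIDATE_COLUMNS = ["neighbourhood_cleansed", "host_is_superhost"]


def check_listings(df, expected_rows, expected_cols, candidate_columns):
    """Compare a DataFrame against the documented shape and candidate columns."""
    return {
        "rows_match": df.shape[0] == expected_rows,
        "cols_match": df.shape[1] == expected_cols,
        "missing_candidates": [c for c in candidate_columns if c not in df.columns],
    }


# Made-up stand-in data for illustration only.
demo = pd.DataFrame({
    "neighbourhood_cleansed": ["Louvre", "Temple"],
    "price": [120.0, 95.0],
})

report = check_listings(demo, EXPECTED_ROWS, EXPECTED_COLS, CANDIDATE_COLUMNS)
print(report)
```

A mismatch in row or column counts is not necessarily an error, since the figures on the website may refer to a different snapshot than the downloaded file.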

- _Which attributes seem irrelevant and can be excluded?_
- _Is there enough data to draw generalizable conclusions or make accurate predictions?_
- _Are there too many attributes for your modeling method of choice?_
- _Are you merging various data sources? If so, are there areas that might pose a problem when merging?_
- _Have you considered how missing values are handled in each of your data sources?_
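For the question on missing values, one possible starting point is to rank columns by their share of missing entries. The sketch below assumes a hypothetical helper `missing_share` and uses a small made-up frame; in the notebook it would be applied to `raw_data`.

```python
import pandas as pd


def missing_share(df):
    """Share of missing values per column, sorted from most to least incomplete."""
    return df.isna().mean().sort_values(ascending=False)


# Made-up stand-in data for illustration only.
demo = pd.DataFrame({
    "neighbourhood_cleansed": ["Louvre", "Temple", "Louvre", "Temple"],
    "host_is_superhost": ["t", None, "f", None],
    "price": [120.0, None, 95.0, 80.0],
})

shares = missing_share(demo)
print(shares)
```

Columns with a high share of missing values are candidates for exclusion or imputation, which also helps answer the question about irrelevant attributes.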


### 2.2. Describing the Data

### 2.3. Exploring Data

### 2.4. Verifying Data Quality

### 2.5. Summary