# Predicting Airbnb Listing Prices in Paris  
*Using machine learning to understand key pricing drivers*

## 3. Data Preparation

In this third step of our project, we prepare the dataset for modeling and further analysis. This involves several key tasks:

> * **Data splitting**: Dividing the dataset into training and test sets to avoid data leakage
> * **Cleaning the data**: Handling missing values and correcting inconsistent entries
> * **Outlier treatment**: Detecting and addressing extreme values
> * **Feature transformation**: Applying appropriate scaling, encoding, or normalization
> * **Feature engineering**: Creating new relevant features to enhance model performance
> * **Feature selection**: Identifying the most informative variables using only the training set
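
The order of these tasks matters: splitting comes first so that every statistic used later (imputation values, scaling parameters, selected columns) is learned from the training rows alone. The sketch below illustrates that idea on a small toy table. The column values, the median strategy, and the use of scikit-learn's `SimpleImputer` are assumptions made for illustration, not choices taken from this project.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy data standing in for the listings table (hypothetical values).
toy = pd.DataFrame({
    "accommodates": [2, 4, np.nan, 3, 6, 2, np.nan, 5],
    "bedrooms": [1, 2, 1, np.nan, 3, 1, 2, 2],
    "price": [80, 150, 95, 110, 260, 70, 130, 180],
})
features = toy.drop(columns="price")
target = toy["price"]

# 1. Split first, so the test rows stay unseen.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.25, random_state=0
)

# 2. Learn the imputation statistics on the training rows only.
imputer = SimpleImputer(strategy="median")
X_tr_imp = pd.DataFrame(
    imputer.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index
)

# 3. Reuse those statistics, unchanged, on the test rows.
X_te_imp = pd.DataFrame(
    imputer.transform(X_te), columns=X_te.columns, index=X_te.index
)
```

Calling `fit_transform` on the full table before splitting would let test-set values influence the medians, which is the kind of leakage the first bullet warns about.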



**Import libraries**

In [17]:
import sys
import numpy as np
import matplotlib.pyplot as plt

import pandas as pd
import seaborn as sns
%matplotlib inline

In [3]:
from sklearn.model_selection import train_test_split

In [18]:
sys.path.append('../src')

In [19]:

from data_understanding_utils import infos_missing

**Load the data**

In [2]:
raw_data = pd.read_csv("../data/listings.csv")

### 3.1. Splitting the dataset into training and test sets

As we saw earlier, during the data understanding phase, roughly a third of the target values (`price`) are missing.
We therefore separate the labeled rows from the unlabeled ones and continue working with the labeled data only.

In [11]:
data_labeled = raw_data[raw_data.price.notna()]
print(data_labeled.shape)
data_unlabeled = raw_data[raw_data.price.isna()]
print(data_unlabeled.shape)
print(f"unlabeled data represent {100 * data_unlabeled.shape[0]/raw_data.shape[0]:2.0f}% of the total data")

(55655, 79)
(30409, 79)
unlabeled data represent 35% of the total data


Next, we separate the features from the target.

In [12]:
X = data_labeled.copy()
# Pop the target from X itself so that 'price' is removed from the features;
# popping it from data_labeled would leave the target inside X.
y = X.pop('price')

In [20]:
y.isnull().mean()

0.0

In [28]:
import importlib
import data_understanding_utils

importlib.reload(data_understanding_utils)

<module 'data_understanding_utils' from 'c:\\Users\\cheic\\Documents\\GitHub\\Predicting-Airbnb-Listing-Prices-in-Paris\\notebooks\\../src\\data_understanding_utils.py'>

In [29]:
data_understanding_utils.infos_missing(X)

There are 10% of missing data.


In [33]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)
print(f'Shape of the training set: {X_train.shape}')
print(f'Shape of the test set: {X_test.shape}')

Shape of the training set: (38958, 78)
Shape of the test set: (16697, 78)
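
A few cheap assertions can confirm that a split behaves as intended: the target is absent from the features, no row lands in both sets, and features stay aligned with their labels. The following is a minimal sketch on toy data. The frame `toy` and its values are hypothetical, and only the `test_size` and `random_state` arguments mirror the call above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labeled data (hypothetical values); "price" is the target.
toy = pd.DataFrame({
    "id": range(10),
    "accommodates": [2, 4, 3, 2, 6, 1, 2, 5, 4, 3],
    "price": [80, 150, 95, 70, 260, 55, 90, 180, 140, 100],
})
X_toy = toy.drop(columns="price")
y_toy = toy["price"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=10
)

# The target must not remain among the features.
assert "price" not in X_tr.columns
# No row may appear in both sets.
assert X_tr.index.intersection(X_te.index).empty
# Features and target stay aligned row by row.
assert X_tr.index.equals(y_tr.index) and X_te.index.equals(y_te.index)
# Together the two sets cover every labeled row.
assert len(X_tr) + len(X_te) == len(toy)
```

The same four checks could be run on `X_train`, `X_test`, `y_train`, and `y_test` right after the split.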


In [35]:
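# Display every row of the summary instead of a truncated view
pd.options.display.max_rows = None
# Share of missing values per column, computed on the training set only
X_train.isnull().mean()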

id                                              0.000000
listing_url                                     0.000000
scrape_id                                       0.000000
last_scraped                                    0.000000
source                                          0.000000
name                                            0.000000
description                                     0.030751
neighborhood_overview                           0.505416
picture_url                                     0.000026
host_id                                         0.000000
host_url                                        0.000000
host_name                                       0.000513
host_since                                      0.000513
host_location                                   0.214949
host_about                                      0.535474
host_response_time                              0.219031
host_response_rate                              0.219031
host_acceptance_rate           
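
Missing-value rates like these can inform which columns to keep, provided the decision is made on the training set and then applied unchanged to the test set. The sketch below shows one possible way to do that. The 0.5 threshold and the toy frames are assumptions for illustration; only the column names are borrowed from the listing above.

```python
import pandas as pd

# Toy training and test frames (hypothetical values).
X_tr = pd.DataFrame({
    "description": ["a", "b", None, "d", "e"],
    "neighborhood_overview": [None, None, "x", None, "y"],
    "host_about": [None, None, None, "z", None],
    "host_id": [1, 2, 3, 4, 5],
})
X_te = pd.DataFrame({
    "description": ["f", None],
    "neighborhood_overview": ["u", "v"],
    "host_about": [None, "w"],
    "host_id": [6, 7],
})

MAX_MISSING_RATE = 0.5  # assumed threshold, not a value from the notebook

# Missing-value rate per column, computed on the training set only.
missing_rate = X_tr.isnull().mean()
cols_to_drop = missing_rate[missing_rate > MAX_MISSING_RATE].index.tolist()

# Apply the same column decision to both sets.
X_tr_reduced = X_tr.drop(columns=cols_to_drop)
X_te_reduced = X_te.drop(columns=cols_to_drop)
```

Note that the test set's own missing rates play no part in the decision: `neighborhood_overview` is complete in the toy test frame, yet it is dropped because the training set says so.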

### 3.2. Data Cleaning 

### 3.3. Data Preprocessing

### 3.4. Constructing New Data

### 3.5. Integrating Data

### 3.6. Formatting Data

### Summary