# AirbnbPricePrediction

## Table of Contents
1. Project Preparation
   - 1.1 Defining the problem and project goals
   - 1.2 Hypothesis

2. Data Cleaning
   - 2.1 Imports
      - 2.1.1 Import libraries
      - 2.1.2 Import dataset
   - 2.2 Variable Identification
   - 2.3 Remove duplicates
   - 2.4 Remove value errors
   - 2.5 Outliers Treatment
   - 2.6 Handle Missing Values
   - 2.7 Drop Unnecessary Columns

3. Exploratory Data Analysis
   - 3.1 Initial Exploration
   - 3.2 Univariate Analysis
   - 3.3 Bivariate Analysis
      - 3.3.1 Numerical-Numerical Variable
      - 3.3.2 Categorical-Numerical Variable

4. Data Preprocessing
   - 4.1 Transformation of Distributions
   - 4.2 Feature Engineering
      - 4.2.1 Creating New Features
      - 4.2.2 Feature Scaling
      - 4.2.3 Encoding Categorical Variables
         - 4.2.3.1 Label Encoding
         - 4.2.3.2 One Hot Encoding
   - 4.3 Data Splitting (Train-Test-Validation)

5. The model
   - 5.1 Model Building
   - 5.2 Model Training
   - 5.3 Model Evaluation
      - 5.3.1 K-Fold Cross Validation
      - 5.3.2 Hyperparameter Tuning
      - 5.3.3 Re-train with optimal hyperparameters for predictions
      - 5.3.4 Feature Importance
      - 5.3.5 Learning Curves
   - 5.4 Test the model on Test Set

6. Conclusion
   - 6.1 Results of the project / Validating hypothesis
   - 6.2 Improvements
   - 6.3 Conclusion on the project / course

## 1. Project Preparation
...
### 1.1 Defining the problem and project goals
...
### 1.2 Hypothesis
...


## 2. Data Cleaning
...
### 2.1 Imports
#### 2.1.1 Import libraries

In [202]:
import pandas as pd
import numpy as np

In [203]:
pd.set_option('display.max_columns', None)

#### 2.1.2 Import dataset

In [204]:
dataset = pd.read_csv('./data/airbnb-listings.csv', sep=";")



In [205]:
dataset.sample(5)

Unnamed: 0,ID,Listing Url,Scrape ID,Last Scraped,Name,Summary,Space,Description,Experiences Offered,Neighborhood Overview,Notes,Transit,Access,Interaction,House Rules,Thumbnail Url,Medium Url,Picture Url,XL Picture Url,Host ID,Host URL,Host Name,Host Since,Host Location,Host About,Host Response Time,Host Response Rate,Host Acceptance Rate,Host Thumbnail Url,Host Picture Url,Host Neighbourhood,Host Listings Count,Host Total Listings Count,Host Verifications,Street,Neighbourhood,Neighbourhood Cleansed,Neighbourhood Group Cleansed,City,State,Zipcode,Market,Smart Location,Country Code,Country,Latitude,Longitude,Property Type,Room Type,Accommodates,Bathrooms,Bedrooms,Beds,Bed Type,Amenities,Square Feet,Price,Weekly Price,Monthly Price,Security Deposit,Cleaning Fee,Guests Included,Extra People,Minimum Nights,Maximum Nights,Calendar Updated,Has Availability,Availability 30,Availability 60,Availability 90,Availability 365,Calendar last Scraped,Number of Reviews,First Review,Last Review,Review Scores Rating,Review Scores Accuracy,Review Scores Cleanliness,Review Scores Checkin,Review Scores Communication,Review Scores Location,Review Scores Value,License,Jurisdiction Names,Cancellation Policy,Calculated host listings count,Reviews per Month,Geolocation,Features
54263,6105366,https://www.airbnb.com/rooms/6105366,20170602102612,2017-06-02,Bayou St. John Garden Apt - Jazz Fest 2 blocks,Garden apartment in a lovely walkable neighbor...,An apartment in a beautiful private home in th...,Garden apartment in a lovely walkable neighbor...,none,"According to Fodors: ""With its tree-lined stre...",,On a major bus line that runs to the French Qu...,Guests will have access to apartment and entra...,"Guests are greeted by the local host, who will...",1. We ask guests to be respectful of the fully...,https://a0.muscache.com/im/pictures/83634086/c...,https://a0.muscache.com/im/pictures/83634086/c...,https://public.opendatasoft.com/api/explore/v2...,https://a0.muscache.com/im/pictures/83634086/c...,4887507,https://www.airbnb.com/users/show/4887507,Roxanne And Tim,2013-01-28,"New Orleans, Louisiana, United States","Originally from the Midwest, retired to New Or...",within a day,100.0,,https://a0.muscache.com/im/users/4887507/profi...,https://a0.muscache.com/im/users/4887507/profi...,Bayou St. John,2.0,2.0,"email,phone,reviews,kba","Bayou St. John, New Orleans, LA 70119, United ...",Bayou St. John,Bayou St. John,,New Orleans,LA,70119,New Orleans,"New Orleans, LA",US,United States,29.97846,-90.082995,Apartment,Entire home/apt,4.0,1.0,1.0,2.0,Real Bed,"Cable TV,Internet,Wireless Internet,Air condit...",,140.0,,,500.0,65.0,2.0,25.0,2.0,1125.0,7 weeks ago,,30.0,60.0,90.0,365.0,2017-06-02,11.0,2015-10-07,2017-02-26,98.0,10.0,10.0,10.0,10.0,10.0,9.0,,"Louisiana State, New Orleans, LA",strict,2.0,0.55,"29.97846046336117, -90.08299533436112","Host Is Superhost,Host Has Profile Pic,Host Id..."
382178,16265358,https://www.airbnb.com/rooms/16265358,20170304065726,2017-03-05,Studio Flat Clerkenwell Farringdon Holborn (303),"My place is good for solo adventurers, and bus...",The apartment is a self-contained single room ...,"My place is good for solo adventurers, and bus...",none,Perfect central London location - Sainsbury’s ...,,Very convenient transport links! Closest Tube ...,"You will have full access to the common room, ...",I will be able to offer help 24/7 with any req...,,https://a0.muscache.com/im/pictures/428a12f6-a...,https://a0.muscache.com/im/pictures/428a12f6-a...,https://public.opendatasoft.com/api/explore/v2...,https://a0.muscache.com/im/pictures/428a12f6-a...,62368024,https://www.airbnb.com/users/show/62368024,Saem,2016-03-10,"London, England, United Kingdom",,within an hour,99.0,,https://a0.muscache.com/im/pictures/75bfadc2-7...,https://a0.muscache.com/im/pictures/75bfadc2-7...,Clerkenwell,56.0,56.0,"email,phone,reviews,jumio,government_id","Herbal Hill, London, England EC1R, United Kingdom",Clerkenwell,Islington,,London,England,EC1R,London,"London, United Kingdom",GB,United Kingdom,51.523568,-0.108069,Apartment,Entire home/apt,1.0,1.0,0.0,1.0,Real Bed,"TV,Cable TV,Internet,Wireless Internet,Kitchen...",,63.0,,,,10.0,1.0,0.0,1.0,1125.0,a week ago,,0.0,0.0,0.0,40.0,2017-03-05,0.0,,,,,,,,,,,,flexible,53.0,,"51.52356843048189, -0.10806882760753102","Host Has Profile Pic,Host Identity Verified,Is..."
292507,15843080,https://www.airbnb.com/rooms/15843080,20170502172350,2017-05-03,Private Koreatown Master Bedroom,Welcome Home! Our spacious room is well lit wi...,The furniture is brand new and thoughtfully pl...,Welcome Home! Our spacious room is well lit wi...,none,,,Public transit is 3 blocks away if you want to...,,There are a lot of good restaurants around the...,We hope you are a dog lover. We have two large...,,,https://public.opendatasoft.com/api/explore/v2...,,25208150,https://www.airbnb.com/users/show/25208150,Sharlette,2014-12-26,"Los Angeles, California, United States","As your host, I hope you feel at home. I welco...",within an hour,100.0,,https://a0.muscache.com/im/users/25208150/prof...,https://a0.muscache.com/im/users/25208150/prof...,Mid-Wilshire,2.0,2.0,"email,phone,google,reviews,kba","Mid-Wilshire, Los Angeles, CA 90006, United St...",Mid-Wilshire,Koreatown,,Los Angeles,CA,90006,Los Angeles,"Los Angeles, CA",US,United States,34.054438,-118.297751,Apartment,Private room,2.0,1.0,1.0,1.0,Real Bed,"TV,Internet,Wireless Internet,Air conditioning...",,75.0,,,100.0,35.0,1.0,20.0,2.0,1125.0,5 days ago,,22.0,52.0,82.0,82.0,2017-05-03,3.0,2016-11-14,2016-12-31,100.0,9.0,10.0,10.0,10.0,9.0,10.0,,"City of Los Angeles, CA",strict,2.0,0.53,"34.0544381604634, -118.2977510163419","Host Is Superhost,Host Has Profile Pic,Host Id..."
35265,14540198,https://www.airbnb.com/rooms/14540198,20170304065726,2017-03-05,"Nottinghill, high ceilings and big windows",If you want a quintessential 'notting hill' ex...,"Think 'Quirky, Homely, Art Gallery, idosyncrat...",If you want a quintessential 'notting hill' ex...,none,I've done a bespoke guide for all my guests,No discounts,Easy transport Bus stop one min away I use not...,you get access to it all!,A friend of mine will be available to meet you...,,,,https://public.opendatasoft.com/api/explore/v2...,,11799093,https://www.airbnb.com/users/show/11799093,Anna,2014-01-30,GB,,within a day,80.0,,https://a0.muscache.com/im/users/11799093/prof...,https://a0.muscache.com/im/users/11799093/prof...,Notting Hill,1.0,1.0,"email,phone,reviews,jumio","Elgin Crescent, London, England W11 2JU, Unite...",Notting Hill,Kensington and Chelsea,,London,England,W11 2JU,London,"London, United Kingdom",GB,United Kingdom,51.51375,-0.207703,Apartment,Entire home/apt,5.0,1.0,1.0,1.0,Real Bed,"TV,Internet,Wireless Internet,Kitchen,Breakfas...",,175.0,,,250.0,63.0,1.0,30.0,2.0,1125.0,2 months ago,,29.0,57.0,83.0,338.0,2017-03-05,5.0,2016-09-23,2016-11-23,100.0,10.0,10.0,10.0,10.0,10.0,10.0,,,strict,1.0,0.91,"51.51375018089924, -0.20770325475419465","Host Is Superhost,Host Has Profile Pic,Host Id..."
81906,9881149,https://www.airbnb.com/rooms/9881149,20170218121908,2017-02-18,Overlooking Jameson Distillery,A beautiful loft apartment with a balcony over...,,A beautiful loft apartment with a balcony over...,none,,,,,,,https://a0.muscache.com/im/pictures/5c1bd0bd-3...,https://a0.muscache.com/im/pictures/5c1bd0bd-3...,https://public.opendatasoft.com/api/explore/v2...,https://a0.muscache.com/im/pictures/5c1bd0bd-3...,16296283,https://www.airbnb.com/users/show/16296283,Sam,2014-06-02,"Pennsylvania, United States",I am a 23 year old American student. I love fi...,,,,https://a0.muscache.com/im/pictures/1b00d661-4...,https://a0.muscache.com/im/pictures/1b00d661-4...,,1.0,1.0,"email,phone,reviews,jumio,government_id","Bow Street, Dublin, Dublin, Ireland",,Dublin City,,Dublin,Dublin,,Dublin,"Dublin, Ireland",IE,Ireland,53.34965,-6.277691,Loft,Private room,2.0,1.0,1.0,1.0,Real Bed,"Internet,Wireless Internet,Air conditioning,Sm...",,56.0,,,,,1.0,0.0,1.0,3.0,6 months ago,,0.0,0.0,0.0,0.0,2017-02-18,24.0,2015-12-22,2016-07-19,93.0,9.0,9.0,9.0,10.0,10.0,8.0,,,flexible,1.0,1.69,"53.34964996125459, -6.277691028289764","Host Has Profile Pic,Host Identity Verified,In..."


### 2.2 Variable Identification

Let's take a quick look at our data.

What are our variables?

In [206]:
print(dataset.columns)

Index(['ID', 'Listing Url', 'Scrape ID', 'Last Scraped', 'Name', 'Summary',
       'Space', 'Description', 'Experiences Offered', 'Neighborhood Overview',
       'Notes', 'Transit', 'Access', 'Interaction', 'House Rules',
       'Thumbnail Url', 'Medium Url', 'Picture Url', 'XL Picture Url',
       'Host ID', 'Host URL', 'Host Name', 'Host Since', 'Host Location',
       'Host About', 'Host Response Time', 'Host Response Rate',
       'Host Acceptance Rate', 'Host Thumbnail Url', 'Host Picture Url',
       'Host Neighbourhood', 'Host Listings Count',
       'Host Total Listings Count', 'Host Verifications', 'Street',
       'Neighbourhood', 'Neighbourhood Cleansed',
       'Neighbourhood Group Cleansed', 'City', 'State', 'Zipcode', 'Market',
       'Smart Location', 'Country Code', 'Country', 'Latitude', 'Longitude',
       'Property Type', 'Room Type', 'Accommodates', 'Bathrooms', 'Bedrooms',
       'Beds', 'Bed Type', 'Amenities', 'Square Feet', 'Price', 'Weekly Price',
       'Month

In [207]:
print(len(dataset.columns))

89


This list includes the ID of each listing, which we can use as the index of our dataframe.

In [208]:
dataset.set_index('ID', inplace=True)
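Before relying on ``ID`` as an index, it may be worth confirming that it really is unique. The sketch below uses a small toy dataframe (the name `toy` and its values are illustrative, not the real dataset) and shows two ways pandas can check this.

```python
import pandas as pd

# Toy frame standing in for the real listings; names and values are illustrative.
toy = pd.DataFrame({
    "ID": [6105366, 16265358, 15843080],
    "Price": [140.0, 63.0, 75.0],
})

# True when no listing ID appears more than once.
ids_are_unique = toy["ID"].is_unique

# verify_integrity=True makes set_index raise a ValueError on duplicate IDs.
indexed = toy.set_index("ID", verify_integrity=True)

print(ids_are_unique)
print(indexed.index.name)
```

If the check fails on the real data, the duplicated IDs would need to be inspected before choosing an index.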

We have a lot of features, but is all this data useful for predicting our dependent variables?

Clearly not. Which data is not useful to us and will therefore be removed?
- Scraping data, which provides no relevant information about the accommodation.
- Various fields with URLs, which also do not contribute to our analysis.
- Attributes containing long text (such as the description), since we will not be performing NLP here.
- Features irrelevant to our goal.
- Redundant information.

We begin by removing features related to web scraping. They describe the scraping run rather than the dwelling, so keeping them would add noise rather than clarity to our dataset.

In [209]:
dataset.drop(['Scrape ID', 'Last Scraped', 'Calendar last Scraped'], axis=1, inplace=True)

Next, let's remove features that are in the form of URLs, whether for images or links to other web pages.

In [210]:
dataset.drop(['Listing Url', 'Thumbnail Url', 'Medium Url', 'Picture Url', 'XL Picture Url', 'Host URL', 'Host Thumbnail Url', 'Host Picture Url'], axis=1, inplace=True)
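As an alternative to listing every URL column by hand, the columns could be selected by name. This is only a sketch on a toy, empty dataframe (`toy` is a hypothetical stand-in) and it assumes that every URL column name ends with "Url" or "URL", which holds for the eight columns dropped above.

```python
import pandas as pd

# Empty toy frame that only carries a few of the real column names.
toy = pd.DataFrame(columns=["ID", "Listing Url", "Host URL", "XL Picture Url", "Price"])

# Case-insensitive match on the column name suffix.
url_columns = [col for col in toy.columns if col.lower().endswith("url")]

trimmed = toy.drop(columns=url_columns)

print(url_columns)
print(list(trimmed.columns))
```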

The decision to remove the following attributes was made to simplify our dataset. These attributes contain long and detailed textual information for which more advanced processing, such as Natural Language Processing (NLP), would be necessary.

If our model does not perform optimally, it might be possible to leverage these attributes to add additional features to our dataset. However, since the messages do not follow a predefined structure, extracting specific information for one property may be feasible, but it does not guarantee finding the same information in the same field for another property. So, the data retrieved from these fields could lead to the creation of attributes with a significant amount of missing values.

// TODO: Use 'House Rules' to extract information about smoking policy.

In [211]:
dataset.drop(['Name', 'Summary', 'Space', 'Description', 'Neighborhood Overview', 'Notes', 'Transit', 'Access', 'Interaction', 'House Rules'], axis=1, inplace=True)
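The TODO about ``House Rules`` could be approached with a simple keyword match, applied before the column is dropped. The sketch below is an assumption about how this might look, not something done in this notebook: the rule texts are mostly invented examples, and the function name `smoking_policy` and its three labels are hypothetical.

```python
import re
import pandas as pd

# Example rule texts; invented for illustration, except the dog lover sentence.
rules = pd.Series([
    "No smoking inside the apartment.",
    "Smoking is allowed on the balcony.",
    None,
    "We hope you are a dog lover.",
])

# Phrases that suggest smoking is forbidden (assumed wording).
NO_SMOKING = re.compile(
    r"\b(no[\s-]+smoking|non[\s-]?smoking|smoke[\s-]?free)\b", re.IGNORECASE
)
# Any mention of smoking at all.
MENTIONS_SMOKING = re.compile(r"\bsmok", re.IGNORECASE)


def smoking_policy(text):
    """Return 'forbidden', 'mentioned' or 'unknown' for one House Rules text."""
    if not isinstance(text, str) or not MENTIONS_SMOKING.search(text):
        return "unknown"
    return "forbidden" if NO_SMOKING.search(text) else "mentioned"


policy = rules.map(smoking_policy)
print(policy.tolist())
```

As noted above, free text has no fixed structure, so a large share of listings would probably end up as "unknown".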

Then, I decided to remove attributes that I consider irrelevant to address our problem. These attributes are mainly categorical, and preserving them would have required specific encodings, resulting in the creation of a significant number of features and substantially increasing the dimensionality of our dataset, without providing meaningful information to predict our price values. Indeed, almost all these attributes have unique values.

It's important to note that the ``Features`` attribute corresponds to a consolidation of several attributes related to the host that I have chosen to retain. Consequently, this attribute has become unnecessary.

In [212]:
dataset.drop(['Host ID', 'Host Name', 'Host About', 'Host Neighbourhood', 'Neighbourhood', 'Neighbourhood Cleansed', 'Neighbourhood Group Cleansed', 'First Review', 'Last Review', 'License', 'Jurisdiction Names', 'Features'], axis=1, inplace=True)
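The claim that these attributes have almost only unique values can be checked by comparing the number of distinct values to the number of rows. A minimal sketch on toy data (the dataframe `toy` is illustrative, with values taken from the sample shown earlier):

```python
import pandas as pd

toy = pd.DataFrame({
    "Host Name": ["Roxanne And Tim", "Saem", "Sharlette", "Anna"],
    "Room Type": ["Entire home/apt", "Entire home/apt", "Private room", "Entire home/apt"],
})

# Share of distinct values per column; 1.0 means every row has its own value.
unique_ratio = toy.nunique() / len(toy)

print(unique_ratio)
```

A ratio close to 1.0 on a categorical column means that encoding it would create almost one feature per row.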

I also decided to remove all information regarding the location of the dwelling, except for ``Latitude`` and ``Longitude``. Using these two coordinates, it is possible to retrieve all the information that I am going to drop just below, such as the country.

If I had kept this information, it would have created redundancy in the data, because everything dropped here can be recovered from these two values. Additionally, it would have been necessary to encode this information with a One Hot Encoder, significantly increasing the dimensionality of our dataset. Note that the ``Geolocation`` attribute is simply a combination of the ``Latitude`` and ``Longitude`` attributes.

At this stage, I have chosen not to delete the ``City`` feature, as it will be used later to calculate a new feature. It will be removed afterwards.

In [213]:
dataset.drop(['Street', 'State', 'Zipcode', 'Market', 'Smart Location', 'Country Code', 'Country', 'Geolocation'], axis= 1, inplace=True)
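The redundancy between ``Geolocation`` and the two coordinate columns can be verified before dropping it. The sketch below reuses two rows from the sample shown earlier in a toy dataframe; the tolerance `atol=1e-5` is an assumption chosen because ``Latitude`` and ``Longitude`` appear rounded to about six decimals.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Latitude": [29.97846, 51.523568],
    "Longitude": [-90.082995, -0.108069],
    "Geolocation": [
        "29.97846046336117, -90.08299533436112",
        "51.52356843048189, -0.10806882760753102",
    ],
})

# Split "lat, lon" into two text columns, then convert each to float.
parts = toy["Geolocation"].str.split(",", expand=True)
parsed_lat = parts[0].str.strip().astype(float)
parsed_lon = parts[1].str.strip().astype(float)

# True for rows where Geolocation agrees with Latitude and Longitude.
matches = (
    np.isclose(parsed_lat, toy["Latitude"], atol=1e-5)
    & np.isclose(parsed_lon, toy["Longitude"], atol=1e-5)
)

print(matches.all())
```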

It finally leaves us with the following attributes:

In [214]:
print(dataset.columns)

Index(['Experiences Offered', 'Host Since', 'Host Location',
       'Host Response Time', 'Host Response Rate', 'Host Acceptance Rate',
       'Host Listings Count', 'Host Total Listings Count',
       'Host Verifications', 'City', 'Latitude', 'Longitude', 'Property Type',
       'Room Type', 'Accommodates', 'Bathrooms', 'Bedrooms', 'Beds',
       'Bed Type', 'Amenities', 'Square Feet', 'Price', 'Weekly Price',
       'Monthly Price', 'Security Deposit', 'Cleaning Fee', 'Guests Included',
       'Extra People', 'Minimum Nights', 'Maximum Nights', 'Calendar Updated',
       'Has Availability', 'Availability 30', 'Availability 60',
       'Availability 90', 'Availability 365', 'Number of Reviews',
       'Review Scores Rating', 'Review Scores Accuracy',
       'Review Scores Cleanliness', 'Review Scores Checkin',
       'Review Scores Communication', 'Review Scores Location',
       'Review Scores Value', 'Cancellation Policy',
       'Calculated host listings count', 'Reviews per Month'

We went from 89 columns (including ``ID``, which is now the index) to 47.

In [215]:
len(dataset.columns)

47

Now that we've completed an initial quick cleaning, we can start by identifying which variables are dependent and independent among all the variables available.

To address our problem, we aim to predict the following variables (dependent variables):
- ``Price`` 
- ``Weekly Price``  
- ``Monthly Price`` 

All other columns are independent variables that can be used to predict our dependent variables.

In [216]:
dependant_variables = ['Price', 'Weekly Price', 'Monthly Price']
independant_variables = [var for var in dataset.columns if var not in dependant_variables]

In [217]:
print(f'Dependant variables : {dependant_variables}')
print(f'Independant variables : {independant_variables}')

Dependant variables : ['Price', 'Weekly Price', 'Monthly Price']
Independant variables : ['Experiences Offered', 'Host Since', 'Host Location', 'Host Response Time', 'Host Response Rate', 'Host Acceptance Rate', 'Host Listings Count', 'Host Total Listings Count', 'Host Verifications', 'City', 'Latitude', 'Longitude', 'Property Type', 'Room Type', 'Accommodates', 'Bathrooms', 'Bedrooms', 'Beds', 'Bed Type', 'Amenities', 'Square Feet', 'Security Deposit', 'Cleaning Fee', 'Guests Included', 'Extra People', 'Minimum Nights', 'Maximum Nights', 'Calendar Updated', 'Has Availability', 'Availability 30', 'Availability 60', 'Availability 90', 'Availability 365', 'Number of Reviews', 'Review Scores Rating', 'Review Scores Accuracy', 'Review Scores Cleanliness', 'Review Scores Checkin', 'Review Scores Communication', 'Review Scores Location', 'Review Scores Value', 'Cancellation Policy', 'Calculated host listings count', 'Reviews per Month']


So, we have 3 values to predict, and we will utilize 44 features for the prediction.

In [218]:
len(dependant_variables)

3

In [219]:
len(independant_variables)

44

Now, we can examine the types of variables, whether they are numerical, categorical, or if they have null values, to guide our preprocessing and analysis decisions.

In [220]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 494954 entries, 6017649 to 10562264
Data columns (total 47 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   Experiences Offered             494954 non-null  object 
 1   Host Since                      494449 non-null  object 
 2   Host Location                   492691 non-null  object 
 3   Host Response Time              379885 non-null  object 
 4   Host Response Rate              379884 non-null  float64
 5   Host Acceptance Rate            42258 non-null   object 
 6   Host Listings Count             494449 non-null  float64
 7   Host Total Listings Count       494448 non-null  float64
 8   Host Verifications              494114 non-null  object 
 9   City                            494500 non-null  object 
 10  Latitude                        494953 non-null  float64
 11  Longitude                       494953 non-null  float64
 12  Property Type

For each variable, we can determine whether it is numerical or categorical. This information is crucial in deciding which preprocessing and analysis methods are appropriate for each variable.

In [221]:
qualitative_columns = dataset.select_dtypes(include='object').columns
quantitative_columns = dataset.select_dtypes(include='number').columns

print("Qualitative (categorical)")
for feat in qualitative_columns:
    print(' - ', feat)
    
print("Quantitative (numerical)")
for feat in quantitative_columns:
    print(' - ', feat)

Qualitative (categorical)
 -  Experiences Offered
 -  Host Since
 -  Host Location
 -  Host Response Time
 -  Host Acceptance Rate
 -  Host Verifications
 -  City
 -  Property Type
 -  Room Type
 -  Bed Type
 -  Amenities
 -  Calendar Updated
 -  Has Availability
 -  Cancellation Policy
Quantitative (numerical)
 -  Host Response Rate
 -  Host Listings Count
 -  Host Total Listings Count
 -  Latitude
 -  Longitude
 -  Accommodates
 -  Bathrooms
 -  Bedrooms
 -  Beds
 -  Square Feet
 -  Price
 -  Weekly Price
 -  Monthly Price
 -  Security Deposit
 -  Cleaning Fee
 -  Guests Included
 -  Extra People
 -  Minimum Nights
 -  Maximum Nights
 -  Availability 30
 -  Availability 60
 -  Availability 90
 -  Availability 365
 -  Number of Reviews
 -  Review Scores Rating
 -  Review Scores Accuracy
 -  Review Scores Cleanliness
 -  Review Scores Checkin
 -  Review Scores Communication
 -  Review Scores Location
 -  Review Scores Value
 -  Calculated host listings count
 -  Reviews per Month


Here, some fields that we expected to be numerical were loaded as categorical (``object``) columns. We will need to investigate why, and some conversions will be required.

For instance, the ``Host Acceptance Rate`` attribute should be numerical rather than categorical.
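One possible reason is that the rate is stored as text, for example with a percent sign. The values of this column are not shown in this notebook, so the sketch below is only an assumption about their format ("95%" style strings); the series `raw` is invented for illustration.

```python
import pandas as pd

# Hypothetical raw values; the real format still has to be inspected.
raw = pd.Series(["95%", "100%", None, "N/A"])

# Strip a trailing percent sign, then convert; unparsable entries become NaN.
numeric = pd.to_numeric(raw.str.rstrip("%"), errors="coerce")

print(numeric.tolist())
```

With `errors="coerce"`, anything that is not a number is turned into a missing value rather than raising an error, so the count of missing values should be compared before and after the conversion.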

Additionally, I would like to transform certain categorical attributes into numerical ones. This is particularly the case for:
- ``Host Response Time``: I would like to represent it, for example, as a number of days or hours.
- ``Host Since``: I aim to represent it also as a number of days.
- ``Calendar Updated``: I would also like to represent it as a number of days.
- ``Has Availability``: I want this to be a boolean attribute, taking the value True (1) if available and False (0) if not available.
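The first three conversions listed above could be sketched as follows, on a toy dataframe built from values visible in the earlier sample. Several things here are assumptions: the hour mapping covers only the two labels seen in the sample, a month is approximated as 30 days, the labels "today" and "yesterday" are guessed, and the reference date is taken from one ``Last Scraped`` value. ``Has Availability`` is left out because the sample only shows missing values for it.

```python
import re
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Host Response Time": ["within a day", "within an hour", None],
    "Host Since": ["2013-01-28", "2016-03-10", None],
    "Calendar Updated": ["7 weeks ago", "a week ago", "5 days ago"],
})

# Assumed mapping; labels that are not listed become NaN.
RESPONSE_HOURS = {"within an hour": 1, "within a day": 24}
toy["Host Response Hours"] = toy["Host Response Time"].map(RESPONSE_HOURS)

# Assumed reference date (one of the scrape dates in the sample).
REFERENCE_DATE = pd.Timestamp("2017-06-02")
toy["Host Since Days"] = (REFERENCE_DATE - pd.to_datetime(toy["Host Since"])).dt.days

DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30}  # month length is approximate


def calendar_updated_to_days(text):
    """Convert labels such as '7 weeks ago' to a number of days."""
    if not isinstance(text, str):
        return np.nan
    text = text.strip().lower()
    if text == "today":
        return 0.0
    if text == "yesterday":
        return 1.0
    match = re.fullmatch(r"(\d+|an?)\s+(day|week|month)s?\s+ago", text)
    if match is None:
        return np.nan
    count = 1 if match.group(1) in ("a", "an") else int(match.group(1))
    return float(count * DAYS_PER_UNIT[match.group(2)])


toy["Calendar Updated Days"] = toy["Calendar Updated"].map(calendar_updated_to_days)
print(toy[["Host Response Hours", "Host Since Days", "Calendar Updated Days"]])
```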

Transforming these categorical attributes into numerical ones means they no longer need to be encoded, which reduces the number of features in our encoded dataset and therefore its dimensionality and computational cost. A single numerical feature also keeps the natural ordering of these values (a longer response time, an older calendar update), which one-hot columns would lose.

So, it will only be necessary to encode the following attributes:
 -  ``Experiences Offered``
 -  ``Host Verifications``
 -  ``Property Type``
 -  ``Room Type``
 -  ``Bed Type``
 -  ``Amenities``
 -  ``Cancellation Policy``
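Among these attributes, ``Host Verifications`` and ``Amenities`` hold several comma-separated values per row, so they cannot be encoded like a single-valued category. One possible approach, sketched here on toy data and not necessarily the one used later in this notebook, is the pandas `Series.str.get_dummies` method, which creates one indicator column per distinct item.

```python
import pandas as pd

# Two values copied from the sample shown earlier.
verifications = pd.Series([
    "email,phone,reviews,kba",
    "email,phone,reviews,jumio",
])

# One 0/1 column per distinct item found between the separators.
indicators = verifications.str.get_dummies(sep=",")

print(indicators)
```

On the real data the number of distinct amenities may be large, so it could be worth keeping only the most frequent ones.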

### 2.3 Remove duplicates
Now that we have quickly sorted through the columns that may be useful, it's also important to consider whether there are redundant rows to be removed. Indeed, since these data were obtained through scraping, it's possible that the scraper collected multiple instances of the same data.

In [222]:
number_duplicated = dataset.duplicated().sum()
print("Total number of duplicates :", number_duplicated)

Total number of duplicates : 2


Here, we have two duplicate rows, so we need to delete them.

In [223]:
dataset.drop_duplicates(inplace=True)
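One detail worth keeping in mind: since ``ID`` is now the index, `duplicated()` and `drop_duplicates()` compare column values only and ignore the index. Two listings with different IDs but identical columns are therefore counted as duplicates. The toy example below (illustrative values, hypothetical IDs) shows the difference between value duplicates and index duplicates.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Room Type": ["Private room", "Private room", "Loft"],
        "Price": [75.0, 75.0, 56.0],
    },
    index=pd.Index([101, 102, 103], name="ID"),
)

# Rows whose column values repeat an earlier row (the index is ignored).
value_duplicates = int(toy.duplicated().sum())

# Rows whose ID repeats an earlier ID.
id_duplicates = int(toy.index.duplicated().sum())

# Keeps the first occurrence of each repeated row.
deduplicated = toy.drop_duplicates()

print(value_duplicates, id_duplicates, list(deduplicated.index))
```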

### 2.4 Remove value errors

### 2.5 Outliers Treatment

### 2.6 Handle Missing Values

### 2.7 Drop Unnecessary Columns

## 3. Exploratory Data Analysis

### 3.1 Initial Exploration

### 3.2 Univariate Analysis

### 3.3 Bivariate Analysis
#### 3.3.1 Numerical-Numerical Variable

#### 3.3.2 Categorical-Numerical Variable

## 4. Data Preprocessing
### 4.1 Transformation of Distributions

### 4.2 Feature Engineering
#### 4.2.1 Creating New Features

In [224]:
dataset.drop(['Host Location'], axis=1, inplace=True)
dataset.drop(['City'], axis=1, inplace=True)

#### 4.2.2 Feature Scaling

#### 4.2.3 Encoding Categorical Variables
##### 4.2.3.1 Label Encoding

##### 4.2.3.2 One Hot Encoding

### 4.3 Data Splitting (Train-Test-Validation)

## 5. The model
### 5.1 Model Building

### 5.2 Model Training

### 5.3 Model Evaluation
#### 5.3.1 K-Fold Cross Validation

#### 5.3.2 Hyperparameter Tuning

#### 5.3.3 Re-train with optimal hyperparameters for predictions

#### 5.3.4 Feature Importance

#### 5.3.5 Learning Curves

### 5.4 Test the model on Test Set

## 6. Conclusion
### 6.1 Results of the project / Validating hypothesis
...
### 6.2 Improvements
...
### 6.3 Conclusion on the project / course
...