# AirbnbPricePrediction

## Table of Contents
1. Project Preparation
   - 1.1 Defining the problem and project goals
   - 1.2 Hypothesis

2. Data Cleaning
   - 2.1 Imports
      - 2.1.1 Import libraries
      - 2.1.2 Import dataset
   - 2.2 Variable Identification
   - 2.3 Remove duplicates
   - 2.4 Remove value errors
   - 2.5 Outliers Treatment
   - 2.6 Handle Missing Values

3. Exploratory Data Analysis
   - 3.1 Initial Exploration
   - 3.2 Univariate Analysis
   - 3.3 Bivariate Analysis
      - 3.3.1 Numerical-Numerical Variable
      - 3.3.2 Categorical-Numerical Variable

4. Data Preprocessing
   - 4.1 Transformation of Distributions
   - 4.2 Feature Engineering
      - 4.2.1 Creating New Features
      - 4.2.2 Feature Scaling
      - 4.2.3 Encoding Categorical Variables
         - 4.2.3.1 Label Encoding
         - 4.2.3.2 One Hot Encoding
   - 4.3 Data Splitting (Train-Test-Validation)

5. The model
   - 5.1 Model Building
   - 5.2 Model Training
   - 5.3 Model Evaluation
      - 5.3.1 K-Fold Cross Validation
      - 5.3.2 Hyperparameter Tuning
      - 5.3.3 Re-train with optimal hyperparameters for predictions
      - 5.3.4 Feature Importance
      - 5.3.5 Learning Curves
   - 5.4 Test the model on Test Set

6. Conclusion
   - 6.1 Results of the project / Validating hypothesis
   - 6.2 Improvements
   - 6.3 Conclusion on the project / course

## 1. Project Preparation
...
### 1.1 Defining the problem and project goals
...
### 1.2 Hypothesis
...


## 2. Data Cleaning
...
### 2.1 Imports
#### 2.1.1 Import libraries

In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
pd.set_option('display.max_columns', None)

#### 2.1.2 Import dataset

In [3]:
def extract_infos_from_filename(filename):
    """
    Extracts information (city and period) from a CSV filename.

    The filename follows the format 'city_period.csv', and this function aims to extract the city name
    and the period information (weekends, weekdays) from the CSV file.

    Examples:
    - For 'london_weekends.csv', the function returns ('london', 'weekends').
    - For 'lisbon_weekdays.csv', the function returns ('lisbon', 'weekdays').

    Parameters:
    - filename (str): The filename to parse, in the format 'city_period.csv'.

    Returns:
    - tuple: A tuple containing two elements:
        - city (str): The first part of the filename before the underscore.
        - period (str): The second part of the filename before the CSV extension.
    """
    parts = filename.split('_')
    city = parts[0]
    period = parts[1]
    period = period.split('.')[0]
    return city, period
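
The parsing rule described in the docstring can also be written with `os.path.splitext` and `str.rpartition`. The sketch below is only an illustration: `parse_city_period` is a hypothetical name, not part of the notebook, and it assumes the period is always the text after the last underscore. Under that assumption it also tolerates a directory prefix and a city name that itself contains an underscore.

```python
import os


def parse_city_period(filename):
    """Hypothetical variant of extract_infos_from_filename (illustration only)."""
    # Drop any directory part, then remove only the final extension.
    stem, _ext = os.path.splitext(os.path.basename(filename))
    # Split on the LAST underscore so a city such as 'new_york' stays intact.
    city, sep, period = stem.rpartition('_')
    if not sep or not city or not period:
        raise ValueError(f"Expected 'city_period.csv', got {filename!r}")
    return city, period


print(parse_city_period('london_weekends.csv'))
print(parse_city_period('./data/lisbon_weekdays.csv'))
```

Raising a `ValueError` on an unexpected name makes a stray file in the data directory fail loudly instead of producing a wrong `city` or `period` label.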

In [4]:
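if os.path.isdir('./data'):
    file_names = os.listdir('./data')
    # Keep only the CSV files, sorted so the loading order is reproducible.
    # A list (unlike a filter object) can be iterated more than once.
    file_names_csv = sorted(name for name in file_names if name.endswith('.csv'))
else:
    # Fall back to an empty list so later cells do not fail with a NameError.
    file_names_csv = []
    print("The directory 'data' does not exist.")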

In [5]:
dataset = pd.DataFrame()

In [6]:
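frames = []
for filename in file_names_csv:
    df = pd.read_csv(os.path.join('./data', filename))

    # Tag every row with the city and period encoded in the filename.
    city, period = extract_infos_from_filename(filename)
    df['city'] = city
    df['period'] = period
    frames.append(df)

# Concatenate once instead of growing the DataFrame inside the loop.
dataset = pd.concat(frames) if frames else pd.DataFrame()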

In [7]:
dataset

| Unnamed: 0.1 | Unnamed: 0 | realSum | room_type | room_shared | room_private | person_capacity | host_is_superhost | multi | biz | cleanliness_rating | guest_satisfaction_overall | bedrooms | dist | metro_dist | attr_index | attr_index_norm | rest_index | rest_index_norm | lng | lat | city | period |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 194.033698 | Private room | False | True | 2.0 | False | 1 | 0 | 10.0 | 93.0 | 1 | 5.022964 | 2.539380 | 78.690379 | 4.166708 | 98.253896 | 6.846473 | 4.90569 | 52.41772 | amsterdam | weekdays |
| 1 | 1 | 344.245776 | Private room | False | True | 4.0 | False | 0 | 0 | 8.0 | 85.0 | 1 | 0.488389 | 0.239404 | 631.176378 | 33.421209 | 837.280757 | 58.342928 | 4.90005 | 52.37432 | amsterdam | weekdays |
| 2 | 2 | 264.101422 | Private room | False | True | 2.0 | False | 0 | 1 | 9.0 | 87.0 | 1 | 5.748312 | 3.651621 | 75.275877 | 3.985908 | 95.386955 | 6.646700 | 4.97512 | 52.36103 | amsterdam | weekdays |
| 3 | 3 | 433.529398 | Private room | False | True | 4.0 | False | 0 | 1 | 9.0 | 90.0 | 2 | 0.384862 | 0.439876 | 493.272534 | 26.119108 | 875.033098 | 60.973565 | 4.89417 | 52.37663 | amsterdam | weekdays |
| 4 | 4 | 485.552926 | Private room | False | True | 2.0 | True | 0 | 0 | 10.0 | 98.0 | 1 | 0.544738 | 0.318693 | 552.830324 | 29.272733 | 815.305740 | 56.811677 | 4.90051 | 52.37508 | amsterdam | weekdays |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1794 | 1794 | 715.938574 | Entire home/apt | False | False | 6.0 | False | 0 | 1 | 10.0 | 100.0 | 3 | 0.530181 | 0.135447 | 219.402478 | 15.712158 | 438.756874 | 10.604584 | 16.37940 | 48.21136 | vienna | weekends |
| 1795 | 1795 | 304.793960 | Entire home/apt | False | False | 2.0 | False | 0 | 0 | 8.0 | 86.0 | 1 | 0.810205 | 0.100839 | 204.970121 | 14.678608 | 342.182813 | 8.270427 | 16.38070 | 48.20296 | vienna | weekends |
| 1796 | 1796 | 637.168969 | Entire home/apt | False | False | 2.0 | False | 0 | 0 | 10.0 | 93.0 | 1 | 0.994051 | 0.202539 | 169.073402 | 12.107921 | 282.296424 | 6.822996 | 16.38568 | 48.20460 | vienna | weekends |
| 1797 | 1797 | 301.054157 | Private room | False | True | 2.0 | False | 0 | 0 | 10.0 | 87.0 | 1 | 3.044100 | 0.287435 | 109.236574 | 7.822803 | 158.563398 | 3.832416 | 16.34100 | 48.19200 | vienna | weekends |
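
The displayed rows suggest two side effects of concatenating the per-city files: the row labels restart with each file (the last label is 1797 even though several files were combined), and `Unnamed: 0` appears to repeat that per-file row number. A minimal sketch of one way to tidy this, using two small stand-in frames with made-up values rather than the real data:

```python
import pandas as pd

# Two tiny stand-in frames (made-up values) mimicking two per-city files.
a = pd.DataFrame({'Unnamed: 0': [0, 1], 'realSum': [100.0, 150.0], 'city': 'amsterdam'})
b = pd.DataFrame({'Unnamed: 0': [0, 1], 'realSum': [200.0, 250.0], 'city': 'vienna'})

combined = pd.concat([a, b])
print(combined.index.tolist())  # row labels repeat across files

tidy = (
    combined
    .drop(columns=['Unnamed: 0'])  # assumed to be a redundant per-file row counter
    .reset_index(drop=True)        # give every row a unique label
)
print(tidy.index.tolist())
```

Whether `Unnamed: 0` can really be dropped is an assumption based on the sample output above; it should be checked against the full dataset first.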


### 2.2 Variable Identification

### 2.3 Remove duplicates
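
This step is not yet filled in. As a hedged sketch of what it could look like, the example below counts and drops fully identical rows with pandas. The `listings` frame is a made-up stand-in that reuses a few column names from the dataset; it is not the project's data.

```python
import pandas as pd

# Stand-in frame: the first two rows are exact duplicates.
listings = pd.DataFrame({
    'realSum': [194.03, 194.03, 344.25],
    'lng': [4.90569, 4.90569, 4.90005],
    'lat': [52.41772, 52.41772, 52.37432],
    'city': ['amsterdam', 'amsterdam', 'amsterdam'],
    'period': ['weekdays', 'weekdays', 'weekdays'],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(listings.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
deduplicated = listings.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduplicated))
```

A per-file counter column such as `Unnamed: 0` would make otherwise identical rows look different, so it would need to be dropped first or left out through the `subset` argument of `drop_duplicates`.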

### 2.4 Remove value errors

### 2.5 Outliers Treatment
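
No outlier method has been chosen in this notebook yet. One common option is the 1.5 × IQR rule, sketched below on a made-up `realSum` series; both the rule and the 1.5 multiplier are assumptions for illustration, not the project's decision.

```python
import pandas as pd

# Made-up prices with one obviously extreme value.
prices = pd.Series([100.0, 120.0, 130.0, 150.0, 160.0, 5000.0], name='realSum')

# Interquartile range: spread of the middle 50% of the values.
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are treated as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
filtered = prices[prices.between(lower, upper)]

print(lower, upper)
print(filtered.tolist())
```

Because prices differ widely between cities, it may be more appropriate to compute the bounds per city instead of over the whole dataset.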

### 2.6 Handle Missing Values

## 3. Exploratory Data Analysis
### 3.1 Initial Exploration

### 3.2 Univariate Analysis

### 3.3 Bivariate Analysis
#### 3.3.1 Numerical-Numerical Variable

#### 3.3.2 Categorical-Numerical Variable

## 4. Data Preprocessing
### 4.1 Transformation of Distributions

### 4.2 Feature Engineering
#### 4.2.1 Creating New Features

#### 4.2.2 Feature Scaling

#### 4.2.3 Encoding Categorical Variables
##### 4.2.3.1 Label Encoding

##### 4.2.3.2 One Hot Encoding

### 4.3 Data Splitting (Train-Test-Validation)
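
The split itself is still to be written. The sketch below shows one possible three-way split using only NumPy and pandas. The helper name `split_train_val_test`, the 70/15/15 proportions, and the seed are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def split_train_val_test(df, val_frac=0.15, test_frac=0.15, seed=42):
    """Hypothetical helper: shuffle row positions, then cut them into three parts."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(df))

    n_test = int(round(len(df) * test_frac))
    n_val = int(round(len(df) * val_frac))

    test_idx = shuffled[:n_test]
    val_idx = shuffled[n_test:n_test + n_val]
    train_idx = shuffled[n_test + n_val:]  # everything that is left

    return df.iloc[train_idx], df.iloc[val_idx], df.iloc[test_idx]


# Toy frame standing in for the preprocessed dataset.
toy = pd.DataFrame({'realSum': np.arange(20, dtype=float)})
train, val, test = split_train_val_test(toy)
print(len(train), len(val), len(test))
```

Fixing the seed keeps the three sets identical between runs, which matters when comparing models. scikit-learn's `train_test_split`, applied twice, is a common alternative.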

## 5. The model
### 5.1 Model Building

### 5.2 Model Training

### 5.3 Model Evaluation
#### 5.3.1 K-Fold Cross Validation

#### 5.3.2 Hyperparameter Tuning

#### 5.3.3 Re-train with optimal hyperparameters for predictions

#### 5.3.4 Feature Importance

#### 5.3.5 Learning Curves

### 5.4 Test the model on Test Set

## 6. Conclusion
### 6.1 Results of the project / Validating hypothesis
...
### 6.2 Improvements
...
### 6.3 Conclusion on the project / course
...