written by: Jana Vihs, vihsjana@student.hu-berlin.de, 604930
# Dear Jupyter Notebook Reader

Fancy seeing you here. If you're planning a nice holiday in London soon, you've come to the right place. 
Before we guide you through how to use our Airbnb Price Predictor, we would like to explain a few details of the experimental design, but don't worry, that won't take long! After this, you only have to type one command into your terminal and you'll get the predicted price of your desired Airbnb. 

If you would like to know how this works in detail, stay tuned and read this Jupyter Notebook to the end.

Let's start!


## A Word on the Experimental Design 

This Jupyter Notebook forms the core of the project and is intended to provide a common thread through data analysis, feature engineering and selection, as well as model selection and evaluation of the final method. Nevertheless, the main goal is to build a pipeline using DVC in order to receive price predictions for an Airbnb in London, United Kingdom [^1].

### Tools

As we are using a lot of new tools to develop our pipeline let us look at them a little more closely.

#### Docker

The whole project comes with a Dockerfile, which defines our project environment. You can use it, but you don't have to. If you're a Docker newbie, check out https://docs.docker.com/get-started/ and make sure Docker is installed. 
Then run the following commands in your terminal to build a Docker image and run the application inside a container.


1. *docker build . -t airbnb* 
2. *docker run -it --name airbnb -v $(pwd):/root/ airbnb bash*

Then you should be inside the docker container. To make sure all necessary dependencies are installed please run *pip install -r requirements.txt* inside the terminal of your docker container.

If you don't want to use Docker that is totally fine, just run *pip install -r requirements.txt* to have all the packages available.

#### DVC
DVC (https://dvc.org/) is an amazing tool for developing data-driven pipelines for ML projects. The pipeline is defined in the dvc.yaml file. Usually it includes all stages of a machine learning pipeline, such as data preprocessing and feature engineering, but to demonstrate how DVC works we only developed a small pipeline. If you would like to test it, please run 
*dvc init --no-scm* and then *dvc repro -f* in your terminal and you should see the price prediction for the listing 0FEMC4VA5U. 

### Project Architecture

Here is an overview of the structure of this project:

    ├── Dockerfile: Defines the environment.
    ├── README.md: Top level documentation for developers.
    ├── data
    │ ├── external: Data from third party sources. 
    │ ├── interim: Intermediate data that has been transformed.
    │ ├── canonical: Final data sets for modeling.
    │ └── raw: The original, immutable data dump.
    │
    ├── dvc.yaml: Defines the data pipelines.
    ├── models: Trained and serialized models, model predictions, or model summaries.
    │
    ├── notebooks: Jupyter notebook.
    ├── requirements.txt: Requirements file python dependencies. 
    └── src: Source code for use in this project.
        ├── __init__.py: Makes `src` a python module
        ├── features: Scripts to turn raw data into features for modeling.
        └── ingest: Crawler to download, generate or add additional data. 




[^1]: <small> Packaging (installation of the package via pip) of the module is omitted, but is theoretically possible </small>.

As mentioned before, if you want to know how everything works in detail, I hope you'll enjoy the following notebook, titled:

# The Airbnb Price Predictor - London Area

### Table of Contents
- Introduction
    - Meta Information
- Data Preprocessing
    - Missing Values        
- Explorative Data Analysis
- Feature Engineering 
    - One-Hot Encoding
    - Distance to City Center
    - Host History
    - Cancellation Policy
    - Images
    - Amenities
    - Reviews
    - Yelp Data
- Benchmark Models
    - Linear Regression
    - Random Forest Regression
    - XGBoost Regression
    - LGBM
    - Neural Network
    - Performance
- Final Method
    - Hyperparameter Tuning
- Conclusion and Outlook
- References 

## Introduction

Airbnb is a globally known peer-to-peer platform for short-term rental of housing accommodation.
In this term paper we try to forecast the price of an Airbnb based on specific features.


In [None]:
# import all necessary packages 
# Standards 
import pandas as pd 
import numpy as np
import os 
import math
import sys
import random


# NLP
import spacy
from collections import Counter


# Visualizations
import seaborn as sns
import folium
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from wordcloud import WordCloud
%matplotlib inline
sns.set_theme(context='notebook', style='darkgrid', palette='deep', font='sans-serif', font_scale=1, color_codes=True, rc=None)
sns.set_palette(palette=sns.color_palette("deep"))
from tqdm import tqdm_notebook as tqdm
from pprint import pprint

import datetime
import warnings
warnings.filterwarnings('ignore')

# Translator
from google_trans_new import google_translator
#Clustering
from sklearn.cluster import KMeans

# Stats
import statsmodels.api as sm
import scipy
from scipy.stats import shapiro

# ML - Machine Learning - Model Testing
import xgboost
import lightgbm as lgb
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from xgboost import XGBRegressor

# Evaluation
from sklearn.metrics import mean_squared_error 
from sklearn.metrics import mean_absolute_error 
from sklearn.metrics import r2_score 
# DL - Deep Learning 
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Grid Search 
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputRegressor

# save
import pickle




In [None]:
# extend the Python path so that the self-written modules in src can be imported
sys.path.append(os.path.dirname('../src'))
from src.features.preprocess.Processor import Processor
from src.features.preprocess.Textprocessor import Textprocessor
from src.features.preprocess.Evaluator import Evaluator

In [None]:
processor = Processor()
# load data set 
train = pd.read_csv('../data/raw/train.csv', index_col='listing_id')
test = pd.read_csv('../data/raw/test.csv', index_col='listing_id')
reviews = pd.read_csv('../data/raw/reviews.csv', index_col='listing_id')

## Meta Information 

The designated data set is available on Kaggle https://www.kaggle.com/c/adams2021/data and contains the tabular as well as text features which describe each Airbnb listing.
The main goal is to predict the price for a future Airbnb listing based on those properties.

In [None]:
print(f'Our train data consists of {train.shape[0]} rows and {train.shape[1]} columns, '
      f'while our test data contains {test.shape[0]} rows and {test.shape[1]} columns.')
print(f'The additional data set reviews consists of {reviews.shape[0]} rows and {reviews.shape[1]} columns.')

## Data Preprocessing

We start our data preprocessing by instantiating our Processor.
First, we change the data types of our data frames to reduce memory usage.

In [None]:
# call processor
processor = Processor()

In [None]:
#  change data types because of memory reasons
train = processor.change_data_types(train)
test = processor.change_data_types(test)
reviews = processor.change_data_types(reviews)

We are going to drop *neighbourhood* right away, as we already have *neighbourhood_cleansed*. We also drop *space* and *summary*, as *description* is a mixture of both. The *picture_url* will be used in the Image Crawler to crawl the additional image data, so we won't need it here.

In [None]:
train = processor.drop_features(train)
test = processor.drop_features(test)

### Missing values

As the following table shows we have a few missing values in our data. 

In [None]:
# Missing values in test data
test.isnull().sum()

In [None]:
# Missing values  in train data 
train.isnull().sum()

For some features we simply fill the missing values with a one, because there has to be a bathroom, at least one bedroom and one bed, even if the bedroom is a living room and the bed is a couch. Why else would you need an Airbnb, right? For the zipcodes we use geopy to look up the missing values from latitude and longitude, using our Processor in Processor.py. While filling in the missing zipcodes this way, we found that exactly 25 zipcodes in the training set and 10 in the test set could not be found. Hence, we drop those rows in the train data set.
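
The fill-with-one rule can be sketched on a toy frame as follows. This is only an illustration, not the code in Processor.py: the column names `bathrooms`, `bedrooms` and `beds` are assumed, and the helper `fill_room_counts` is hypothetical. The geopy lookup is left out because it needs network access.

```python
import numpy as np
import pandas as pd

def fill_room_counts(df, columns=("bathrooms", "bedrooms", "beds")):
    """Return a copy of df in which missing room counts are set to 1."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].fillna(1)
    return out

# Toy data with one missing value per column (hypothetical values).
toy = pd.DataFrame({
    "bathrooms": [1.0, np.nan, 2.0],
    "bedrooms": [np.nan, 1.0, 3.0],
    "beds": [1.0, 2.0, np.nan],
})
filled = fill_room_counts(toy)
print(filled.isnull().sum())
```

Working on a copy keeps the raw frame untouched, so the number of imputed values can still be checked afterwards.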


## Explorative Data Analysis

To get a better grip on our data we will explore each feature during our explorative data analysis. We will start with the target variable *price* and then take a look at the relationships between the additional features.

In [None]:
print(f'The mean price of an Airbnb is {round(train.price.mean(), 2)} £.')
print(f'The median price of an Airbnb is {round(train.price.median(), 2)} £.')

In [None]:
# Distribution and Box Plot
f, axes = plt.subplots(1, 2, figsize=(20,10))
sns.distplot(train.price, rug=True, ax=axes[0]).set_title('Distribution of Price Data')
print("Skewness: %f" % train['price'].skew())
print("Kurtosis: %f" % train['price'].kurt())
# pass the data by keyword and set the title on the box plot's own axes
sns.boxplot(x=train.price, ax=axes[1]).set_title('Box Plot of Price Data')
plt.show()


Our target variable seems to be right-skewed, so we are going to apply a log transformation to the price data. The box plot shows a few outliers in the 100 range, which would explain the high mean price. 

In [None]:
# Distribution of Price Data after Log Trans.
train['log_price'] = train['price'].apply(lambda x: processor.price_log_transformation(x))
print("Skewness: %f" % train['log_price'].skew())
print("Kurtosis: %f" % train['log_price'].kurt())
plt.figure(figsize=(15,15))
plt.title("Price Distribution after Log Transformation")
sns.distplot(train.log_price, rug=True, fit=scipy.stats.norm)
plt.show() 

The distribution plot of the price data after the log transformation suggests that the log-transformed prices might follow a normal distribution, whose density is given by
$$ 
f(x) = \frac{1}{\sigma\sqrt{2\pi}} 
  \exp\left( -\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right).
$$

We are going to perform a Shapiro-Wilk test in order to test this hypothesis. Since the test was originally designed for samples of size $3 \leq n \leq 50$, we draw a random sample of 20 observations from our target variable.

In [None]:
# Shapiro-Wilk Test
# random_state makes the pandas sample reproducible (random.seed does not affect it)
stat, p = shapiro(train.log_price.sample(n=20, random_state=123))
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')

Now we want to cluster the price data using the K-Means algorithm. We want to divide our data set into three different price clusters: affordable, middle and expensive.

In [None]:
# KMEANS
price = np.array(train.price).reshape(-1, 1)
# random_state (not random.seed) makes the scikit-learn clustering reproducible
kmeans = KMeans(n_clusters=3, random_state=123).fit(price)
price_labels = kmeans.predict(price)
centers = kmeans.cluster_centers_
train['price_cluster'] = price_labels

In [None]:
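# Price Cluster
# K-Means labels are arbitrary, so name the clusters by ranking their center prices
order = np.argsort(centers.ravel())
cluster_names = {int(label): name for label, name in zip(order, ['affordable', 'middle', 'expensive'])}
train['price_cluster'] = train['price_cluster'].map(cluster_names)
affordable = train[train['price_cluster'] == 'affordable']
middle = train[train['price_cluster'] == 'middle']
expensive = train[train['price_cluster'] == 'expensive']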


As we can see in the following box plot, we can identify three different price clusters.  

In [None]:
# Box Plots Price Cluster
random.seed(123)
plt.title('Box Plots Price Cluster')
ax = sns.boxplot(x="price_cluster", y="price", data=train)
plt.show()


The following scatter plot displays the location of each Airbnb by longitude and latitude, colored by price cluster. It reveals that there seems to be no difference in price depending on whether or not the place is in the city.  

In [None]:
# Distribution of Airbnbs and their price clusters according to longitude and latitude
plt.figure(figsize=(12,10))
sns.scatterplot(x='longitude', y='latitude', hue='price_cluster', data=train, palette='deep', size=train.price_cluster,sizes=(200, 20), legend="full")
plt.legend()
plt.ioff()

The correlation plot reveals that the different properties of each Airbnb have either a strong positive correlation or a rather small negative correlation.
Especially the features *accommodates, bathrooms, beds, bedrooms and guests_included* have a strong correlation with the price data. This makes sense: the price rises with the size of the Airbnb.


In [None]:
# Correlation plot
plt.figure(figsize=(15,15))
corr = train.select_dtypes(['int32', 'float32', 'int64']).corr()
heatmap = sns.heatmap(corr, vmin=-1, vmax=1, annot=True)
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':10}, pad=12);

In [None]:
plt.figure(figsize=(15,15))
# recent seaborn versions require x and y to be passed as keywords
sns.scatterplot(x=train.longitude, y=train.latitude, hue=train.neighbourhood_cleansed, size=train.price_cluster,sizes=(200, 20), legend="full")
plt.ioff()

In [None]:
# Create map
#lonlat = list(zip(train.longitude, train.latitude))
#mapit = folium.Map( location=[52.667989, -1.464582], zoom_start=6 )
#for coord in lonlat:
#    folium.Marker( location=[ coord[0], coord[1] ], fill_color='#43d9de', radius=8 ).add_to( mapit )

#mapit.save( 'map.html')

## Feature Engineering

We applied a lot of feature engineering to our data set. This section summarizes the features we derived.

### One-Hot Encoding 

We will use one-hot encoding for the following features: bed_type, property_type and room_type.
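
As a minimal sketch of what one-hot encoding does, the toy example below uses `pandas.get_dummies`. The category values are invented for illustration and the project's Processor may encode the features differently.

```python
import pandas as pd

# Toy listings with two categorical features (hypothetical values).
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Private room"],
    "bed_type": ["Real Bed", "Real Bed", "Futon"],
})

# Each category becomes its own 0/1 indicator column.
encoded = pd.get_dummies(toy, columns=["room_type", "bed_type"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Every original column is replaced by one indicator column per category, so the number of columns grows with the number of distinct values.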


### Distance to City Center

We have chosen Piccadilly Circus as the city center, but other coordinates would also be possible. As Airbnb is mainly used for holidays, we picked a rather touristic place.

If you are analysing a bigger city that has multiple locations that are considered desirable, you can also run the distance computation as many times as needed with different geographical points. (Don't forget to change the column names so you don't overwrite the previous point!)
For example, there is a financial district close to the Amsterdam Zuid station that could be equally (or even more) relevant to working tenants than living close to the city center. Measuring these various scenarios matters more for methods such as multiple linear regression than for machine-learning algorithms, because the latter are inherently better at recognising non-linear relationships and clusters. For this reason, we do not include additional reference points in this analysis, but it is an interesting factor to weigh in depending on the statistical method being used.


![Piccadilly Circus](../data/external/londonpiccadillycircus.jpg) 
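
One way to compute such a distance is the haversine formula, sketched below. The coordinates used for Piccadilly Circus are approximate and the function names are hypothetical; the project's own implementation may use a different method or library.

```python
import math

# Approximate coordinates of Piccadilly Circus (assumed reference point).
PICCADILLY_CIRCUS = (51.5101, -0.1340)
EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def distance_to_center(lat, lon, center=PICCADILLY_CIRCUS):
    """Distance in kilometres from a listing to the chosen city center."""
    return haversine_km(lat, lon, center[0], center[1])

# Example: a made-up listing a short walk away from the reference point.
print(round(distance_to_center(51.5014, -0.1419), 2))
```

Passing a different `center` tuple yields the distance to any other point of interest, which is how additional reference points could be added as separate columns.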

### Host History

Since we have the feature *host_since* in our data frame, we can calculate how long the host has actually been registered on the platform, which might also be a great additional feature for the price prediction. For a user, a host who has gained some experience offering Airbnbs may increase trust in the offer. The *host_response_rate* was converted into probabilities. 
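
The two transformations can be sketched as follows. The date format, the percent-string format of the response rate, the reference date and the new column name `host_days_active` are all assumptions made for this illustration.

```python
import pandas as pd

# Toy host data (hypothetical values and formats).
toy = pd.DataFrame({
    "host_since": ["2015-03-01", "2019-11-20"],
    "host_response_rate": ["95%", "100%"],
})

# Assumed reference date; in practice this could be the date the data was collected.
reference_date = pd.Timestamp("2020-01-01")

# Number of days the host has been registered on the platform.
toy["host_days_active"] = (reference_date - pd.to_datetime(toy["host_since"])).dt.days

# Convert a percent string such as "95%" into a probability such as 0.95.
toy["host_response_rate"] = toy["host_response_rate"].str.rstrip("%").astype(float) / 100

print(toy)
```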

### Cancellation Policy

More information on the cancellation policy can be found at https://www.airbnb.de/home/cancellation_policies#long-term. The types of cancellation policies were grouped into three categories: light, moderate and strict. 
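
Such a grouping can be expressed as a simple lookup table. The raw policy labels and their assignment to the three groups below are assumptions for illustration; the labels in the actual data set and the grouping used in the project may differ.

```python
# Hypothetical mapping from raw policy labels to three coarse groups.
POLICY_GROUPS = {
    "flexible": "light",
    "moderate": "moderate",
    "strict": "strict",
    "strict_14_with_grace_period": "strict",
    "super_strict_30": "strict",
    "super_strict_60": "strict",
}

def group_policy(policy):
    """Map a raw cancellation policy label to light, moderate or strict."""
    return POLICY_GROUPS.get(policy.strip().lower(), "unknown")

print(group_policy("flexible"))
print(group_policy("super_strict_30"))
```

Returning "unknown" for unseen labels makes it easy to spot categories that the mapping does not cover yet.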


### Images

Thanks to the feature *picture_url* we are able to crawl all the images, so we can actually see what each Airbnb looks like.
The code can be found in **/root/src/ingest/ImageCrawler.py** and the images will be stored in **/root/data/images/**, in separate directories for the test and train data. The images themselves will be processed using our **ImageProcessor**. To keep our scope narrow, we will pull four values from each image that may be of value: the brightness and the RGB values, i.e. the amount of red, green and blue in the pixels. 

**Why could this be important?**

Taking a step back, we can think of a few things in the image that may impact the price:

- Are the floors carpet or hardwood?
- Are the walls painted or wallpapered? Paintings or posters?
- Does the host have plants? Are they alive and healthy?
- Is the picture well-lit and inviting?
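
The four image values could be derived roughly as sketched below. This is not the project's ImageProcessor: the RGB values are approximated here by mean channel intensities, the brightness by a common weighted sum of the channels, and a synthetic array stands in for a crawled photo. The function name `image_features` is hypothetical.

```python
import numpy as np

def image_features(pixels):
    """Mean brightness and mean R, G, B values of an H x W x 3 image array."""
    pixels = np.asarray(pixels, dtype=np.float64)
    mean_r, mean_g, mean_b = pixels.reshape(-1, 3).mean(axis=0)
    # Perceived brightness as a weighted sum of the channel means.
    brightness = 0.299 * mean_r + 0.587 * mean_g + 0.114 * mean_b
    return {"brightness": brightness, "red": mean_r, "green": mean_g, "blue": mean_b}

# A synthetic 2 x 2 image: red, white, black and blue pixels.
toy_image = np.array([[[255, 0, 0], [255, 255, 255]],
                      [[0, 0, 0], [0, 0, 255]]], dtype=np.uint8)
features = image_features(toy_image)
print(features)
```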


### Amenities

The different amenities were cleaned, recoded, then encoded and added to our data.
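
The clean-and-encode step might look like the sketch below. It assumes the amenities arrive as a single string of the form `{TV,Wifi,"Air conditioning"}`, which may not match the raw data, and both helper functions are hypothetical. Amenity names that themselves contain commas are not handled.

```python
def parse_amenities(raw):
    """Split a string such as '{TV,Wifi,"Air conditioning"}' into clean names."""
    items = raw.strip("{}").split(",")
    return sorted({item.strip().strip('"').lower() for item in items if item.strip()})

def encode_amenities(rows):
    """Turn a list of raw amenity strings into 0/1 indicator rows."""
    parsed = [parse_amenities(raw) for raw in rows]
    vocabulary = sorted({name for names in parsed for name in names})
    matrix = [[int(name in names) for name in vocabulary] for names in parsed]
    return vocabulary, matrix

vocab, matrix = encode_amenities(['{TV,Wifi,"Air conditioning"}', "{Wifi,Kitchen}"])
print(vocab)
print(matrix)
```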

### Reviews

In order to analyze our text data we will use **spacy** https://spacy.io/. 
The reviews and the text data in general contain valuable information, so we dedicated a whole chapter to them.


### Yelp Data

As we are going on holiday to London, we of course want to find out about all the yummy restaurants. We also want to stay close to the busiest places to actually see London, so being close to restaurants, markets and shops might have an impact on the price of an Airbnb.
In order to find that out, we scraped data using the Yelp API https://www.yelp.com/developers.
Sadly, we could not collect enough data for our scope, but it would definitely be worth a try for future research.

In [None]:
yelp = pd.read_csv('../data/external/yelp.csv', index_col='index')

## Sentiment Analysis of Review Data

Our data contains reviews in different languages such as Spanish, French, German and several Asian languages. We also have reviews that contain only a single character or use a thumbs-up emoji to express a positive opinion. Usually we would remove such content in our text cleaning process, but here we would lose valuable information if we removed the emojis. A better approach is to convert each emoji to its word form so that the information it carries is preserved.

The following cleaning steps will be performed on our review data:

1. Remove unnecessary characters such as \r and \n, URLs and other noise that might reduce predictive power.
2. Translate non-English reviews using the google translator Python package.
3. Convert emojis into words.
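
Steps 1 and 3 can be sketched with the standard library alone. The emoji lexicon below is a tiny hypothetical stand-in, the function `clean_review` is not the project's Textprocessor, and the translation step is omitted because it requires network access.

```python
import re

# Tiny hypothetical emoji lexicon; a real one would cover far more symbols.
EMOJI_WORDS = {
    "\U0001F44D": " thumbs_up ",
    "\U0001F600": " grinning_face ",
}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def clean_review(text):
    """Strip URLs and line-break characters, then spell out known emojis."""
    text = URL_PATTERN.sub(" ", text)
    text = text.replace("\r", " ").replace("\n", " ")
    for emoji, word in EMOJI_WORDS.items():
        text = text.replace(emoji, word)
    # Collapse the extra whitespace introduced by the replacements above.
    return re.sub(r"\s+", " ", text).strip()

print(clean_review("Great flat!\r\nSee https://example.com \U0001F44D"))
```

With this approach a review consisting only of a thumbs-up emoji still yields a token that later sentiment steps can use.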


In [None]:
# read in processed review df
reviews = pd.read_csv('../data/interim/reviews/csv/reviews.csv', index_col='listing_id')

In [None]:
# Initialize our Textprocessor
textprocessor = Textprocessor()

In [None]:
textprocessor.clean(reviews.comments[5])

In [None]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(reviews.comments[5])
tokens = [token for token in doc ]

Now, we will remove any stopwords, punctuations and white spaces that are still left in our data.

In [None]:
# remove stopwords
filtered = [token for token in tokens if not token.is_stop]
# remove punctuation
filtered = [token for token in filtered if not token.is_punct]
# remove white spaces 
filtered = [token for token in filtered if not token.is_space ]
# lemmatize and turn it to lowercase
lemmas = [token.lemma_.strip().lower() for token in filtered]

In [None]:
word_freq = Counter(lemmas)
# 5 most common words with their frequencies
common_words = word_freq.most_common(5)
print(common_words)


In [None]:
# Unique words (lemmas that occur exactly once)
unique_words = [lemma for (lemma, freq) in word_freq.items() if freq == 1]
#print (unique_words)

In [None]:
lemmas

In [None]:
plt.figure(figsize=(15,15))
wordcloud = WordCloud().generate(reviews.lemmas[0])
# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

The following pie chart gives a quick overview of the proportion of languages in our review data. English has by far the majority.

In [None]:
# Overview languages
plt.figure(figsize=(15,15))
lan = reviews['language'].value_counts()[:5]
labels = lan.keys()
lan.plot.pie(autopct="%.1f%%", explode=[0.05]*5, labels=labels, pctdistance=0.5)
plt.title("Languages", fontsize=14);

Then we detected whether a review was positive or negative. The following distribution plot displays the polarity of the reviews of each Airbnb listing and the subjectivity of each review. It makes sense that both are distributed similarly, as an opinion about something is always subjective. A polarity value below zero indicates a negative review, while a value above zero indicates a positive review. All in all, most of the Airbnbs that received reviews at all are rated very positively. 

In [None]:
# Distribution of Subjectivity and Polarity in review data
plt.figure(figsize=(15,15))
plt.title("Distribution of Polarity and Subjectivity")
sns.distplot(reviews['polarity'])
sns.distplot(reviews['subjectivity'])
plt.legend(labels=['Polarity', 'Subjectivity'])
plt.show() 


## Feature Preparation

Now we start with the feature preparation for our models.

In [None]:
# cleaned data frames 
train = pd.read_csv('../data/canonical/train/train.csv', index_col='listing_id')
test = pd.read_csv('../data/canonical/test/test.csv', index_col='listing_id')

In [None]:
print('Our merged train data consists of {}'.format(train.shape[0]) + ' rows and {}'.format(train.shape[1]) + ' columns, while our merged test data contains {}'.format(test.shape[0]) + ' rows and {}'.format(test.shape[1]) + ' columns.')

In [None]:
target = train.price
target_log = train.log_price
train = train.drop(['log_price', 'price'], axis=1)

In [None]:
print(train.shape)
print(test.shape)
print(target_log.shape)

In [None]:
# create validation set 
x_train, x_test, y_train, y_test = processor.create_train_validation_frames(train, target_log, test)

In [None]:
# Check shapes again
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

## Model Benchmarking

We will try out different models: Linear Regression, Random Forest Regression, XGBoost for Regression, LGBM and a simple Neural Network.

We will evaluate each model using the following metrics:

$R^{2}$, the MSLE, the MSE, the RMSE and the SMAPE. As our target feature is already log-transformed, the MSE of the model output corresponds to the MSLE; applying the inverse function $\exp(x)$ to targets and predictions first yields the usual MSE and RMSE of the price data.

We also check the feature importances of each model in order to find out which properties of each airbnb have the most influence on the price data.
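
The relationship between the metrics can be illustrated on toy numbers. The SMAPE below follows one common definition and is only a stand-in for the project's Evaluator, whose exact formula is not shown here; the prices are invented.

```python
import math

def smape(actual, predicted):
    """Symmetric mean absolute percentage error in percent (one common definition)."""
    terms = [
        0.0 if a == p == 0 else 2 * abs(p - a) / (abs(a) + abs(p))
        for a, p in zip(actual, predicted)
    ]
    return 100 * sum(terms) / len(terms)

def mse(actual, predicted):
    """Mean squared error of two equally long sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Toy log-prices and log-predictions (hypothetical values).
log_true = [math.log(50), math.log(100), math.log(200)]
log_pred = [math.log(55), math.log(90), math.log(210)]

# Back-transform to pounds before computing price-scale metrics.
price_true = [math.exp(v) for v in log_true]
price_pred = [math.exp(v) for v in log_pred]

print("MSE on log scale :", mse(log_true, log_pred))
print("RMSE in pounds   :", math.sqrt(mse(price_true, price_pred)))
print("SMAPE in percent :", smape(price_true, price_pred))
```

Computing the error on the log scale penalizes relative deviations, while the back-transformed RMSE is dominated by the expensive listings.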

In [None]:
# Call the evaluator
evaluate = Evaluator()

### Linear Regression

In [None]:
# the `normalize` argument was removed in recent scikit-learn releases; the remaining values are the defaults
reg = LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None)
reg = reg.fit(x_train, y_train)

In [None]:
# use  Model to predict values
y_pred = reg.predict(x_train)

# Calculate the Mean Squared Error using the mean_squared_error function.
print("Training Data")
print("R^2 value using score fn: %.3f" % reg.score(x_train,y_train))
print("Mean Squared Log Error : %0.3f" % mean_squared_error(y_train,y_pred))
print("Mean Squared Error : %0.3f" % mean_squared_error(np.exp(y_train),np.exp(y_pred)))
print("Root Mean Squared Error : %0.3f" % mean_squared_error(np.exp(y_train),np.exp(y_pred))**0.5)
print("SMAPE : %0.3f " % evaluate.symmetric_mean_absolute_percentage_error(np.exp(y_train),np.exp(y_pred)))

#predictions = pd.Series(predictions, index=validate_.index, name='price')
#predictions = predictions.apply(lambda x : np.exp(x))


In [None]:
# use  Model to predict values
y_pred = reg.predict(x_test)
r2_reg = reg.score(x_test,y_test)
msle_reg = mean_squared_error(y_test,y_pred)
mse_reg = mean_squared_error(np.exp(y_test),np.exp(y_pred))
rmse_reg = mean_squared_error(np.exp(y_test),np.exp(y_pred))**0.5
smape_reg = evaluate.symmetric_mean_absolute_percentage_error(np.exp(y_test),np.exp(y_pred))
# Calculate the Mean Squared Error using the mean_squared_error function.
print("Test Data")
print("R^2 value using score fn: %.3f" % r2_reg)
print("Mean Squared Log Error : %0.3f" % msle_reg)
print("Mean Squared Error : %0.3f" % mse_reg)
print("Root Mean Squared Error : %0.3f" % rmse_reg)
print("SMAPE : %0.3f " % smape_reg )



In [None]:
# Plot actual vs. predicted prices with a fitted regression line
fig = plt.figure(figsize=(10,3))

sns.regplot(x=np.exp(y_test), y=np.exp(y_pred), line_kws={"color": "red"})
plt.title("Actual vs. Predicted Prices for Linear Regression")


In [None]:
lin_reg_coef = pd.DataFrame(list(zip(train.columns.tolist(),(reg.coef_))),columns=['Feature','Coefficient'])
lin_reg_coef.sort_values(by='Coefficient',ascending=False).iloc[:50]

### Random Forest Regressor

In [None]:
clf = RandomForestRegressor(max_depth=10, n_estimators=100)

# Train the regressor
clf.fit(x_train, y_train)

#Plot variable importances for the top 10 predictors
importances = clf.feature_importances_
feat_names = train.columns.tolist()
tree_result = pd.DataFrame({'feature': feat_names, 'importance': importances})
tree_result.sort_values(by='importance',ascending=False)[:10].plot(x='feature', y='importance', kind='bar',color='blue')

In [None]:
# Use the model to predict values
y_pred = clf.predict(x_train)

# Calculate the Mean Squared Error using the mean_squared_error function.
print("Training Data")
print("R^2 value using score fn: %.3f" % clf.score(x_train,y_train))
print("Mean Squared Log Error : %0.3f" % mean_squared_error(y_train,y_pred))
print("Mean Squared Error : %0.3f" % mean_squared_error(np.exp(y_train),np.exp(y_pred)))
print("Root Mean Squared Error : %0.3f" % mean_squared_error(np.exp(y_train),np.exp(y_pred))**0.5)
print("SMAPE : %0.3f " % evaluate.symmetric_mean_absolute_percentage_error(np.exp(y_train),np.exp(y_pred)))


In [None]:
#Use the model to predict values
y_pred = clf.predict(x_test)

r2_rf = clf.score(x_test,y_test)
msle_rf = mean_squared_error(y_test,y_pred)
mse_rf = mean_squared_error(np.exp(y_test),np.exp(y_pred))
rmse_rf = mean_squared_error(np.exp(y_test),np.exp(y_pred))**0.5
smape_rf = evaluate.symmetric_mean_absolute_percentage_error(np.exp(y_test),np.exp(y_pred))
# Calculate the Mean Squared Error using the mean_squared_error function.
print("Test Data")
print("R^2 value using score fn: %.3f" % r2_rf)
print("Mean Squared Log Error : %0.3f" % msle_rf)
print("Mean Squared Error : %0.3f" % mse_rf)
print("Root Mean Squared Error : %0.3f" % rmse_rf)
print("SMAPE : %0.3f " % smape_rf)


In [None]:
# Plot actual vs. predicted prices with a fitted regression line
fig = plt.figure(figsize=(10,3))

sns.regplot(x=np.exp(y_test), y=np.exp(y_pred), line_kws={"color": "red"})
plt.title("Actual vs. Predicted Prices for Random Forest")

### XGBoost 

In [None]:
xlf = XGBRegressor()
xlf.fit(x_train, y_train)

In [None]:
# use  Model to predict values
y_pred = xlf.predict(x_train)

# Calculate the Mean Squared Error using the mean_squared_error function.
print("Training Data")
print("R^2 value using score fn: %.3f" % xlf.score(x_train,y_train))
print("Mean Squared Log Error : %0.3f" % mean_squared_error(y_train,y_pred))
print("Mean Squared Error : %0.3f" % mean_squared_error(np.exp(y_train),np.exp(y_pred)))
print("Root Mean Squared Error : %0.3f" % mean_squared_error(np.exp(y_train),np.exp(y_pred))**0.5)
print("SMAPE : %0.3f " % evaluate.symmetric_mean_absolute_percentage_error(np.exp(y_train),np.exp(y_pred)))

In [None]:
#Use the model to predict values
y_pred = xlf.predict(x_test)

r2_xgb = xlf.score(x_test,y_test)
msle_xgb = mean_squared_error(y_test,y_pred)
mse_xgb = mean_squared_error(np.exp(y_test),np.exp(y_pred))
rmse_xgb = mean_squared_error(np.exp(y_test),np.exp(y_pred))**0.5
smape_xgb = evaluate.symmetric_mean_absolute_percentage_error(np.exp(y_test),np.exp(y_pred))

# Calculate the Mean Squared Error using the mean_squared_error function.
print("Test Data")
print("R^2 value using score fn: %.3f" % r2_xgb)
print("Mean Squared Log Error : %0.3f" % msle_xgb)
print("Mean Squared Error : %0.3f" % mse_xgb)
print("Root Mean Squared Error : %0.3f" % rmse_xgb)
print("SMAPE : %0.3f " % smape_xgb)

### LGBM

In [None]:
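#Create dataset for lightgbm
# xtrain20 and xtest20 are not created in this notebook; they are assumed to be
# feature subsets of x_train and x_test that were prepared elsewhere
lgb_train = lgb.Dataset(xtrain20, y_train)
lgb_eval = lgb.Dataset(xtest20, y_test, reference=lgb_train)
feat_names = xtrain20.columns.tolist()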
#Config the LGBM model parameters
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2', 'l1'},
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}

print('Starting training...')
# train

evals_result = {}

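gbm = lgb.train(params,
                lgb_train,
                num_boost_round=300,
                valid_sets=[lgb_eval],
                feature_name=feat_names,
                # callbacks replace the early_stopping_rounds / evals_result
                # arguments, which newer LightGBM versions no longer accept
                callbacks=[lgb.early_stopping(stopping_rounds=5),
                           lgb.record_evaluation(evals_result)])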

print('Saving model...')
#Save the fit model to a file
gbm.save_model('../data/models/model.txt')

In [None]:
# Use the model to predict values
y_pred = gbm.predict(xtrain20, num_iteration=gbm.best_iteration)

# Calculate the Mean Squared Error using the mean_squared_error function.
print("Training Data")

print("Mean Squared Log Error : %0.3f" % mean_squared_error(y_train,y_pred))
print("Mean Squared Error : %0.3f" % mean_squared_error(np.exp(y_train),np.exp(y_pred)))
# back-transform to prices first, consistent with the other models
print("Root Mean Squared Error : %0.3f" % mean_squared_error(np.exp(y_train),np.exp(y_pred))**0.5)
print("SMAPE : %0.3f " % evaluate.symmetric_mean_absolute_percentage_error(np.exp(y_train),np.exp(y_pred)))

In [None]:
# Use the model to predict values
y_pred = gbm.predict(xtest20, num_iteration=gbm.best_iteration)

msle_gbm = mean_squared_error(y_test,y_pred)
mse_gbm = mean_squared_error(np.exp(y_test),np.exp(y_pred))
rmse_gbm = mean_squared_error(np.exp(y_test),np.exp(y_pred))**0.5
smape_gbm = evaluate.symmetric_mean_absolute_percentage_error(np.exp(y_test),np.exp(y_pred))

# Calculate the Mean Squared Error using the mean_squared_error function.
print("Test Data")
print("Mean Squared Log Error : %0.3f" % msle_gbm)
print("Mean Squared Error : %0.3f" % mse_gbm)
print("Root Mean Squared Error : %0.3f" % rmse_gbm)
print("SMAPE : %0.3f " % smape_gbm)

In [None]:
print('Plotting metrics recorded during training...')
ax = lgb.plot_metric(evals_result, metric='l1')
plt.show()

print('Plotting feature importances...')
ax = lgb.plot_importance(gbm, max_num_features=20)
plt.show()

In [None]:
# Plot actual vs. predicted log prices with a fitted regression line
fig = plt.figure(figsize=(10,3))

sns.regplot(x=y_test, y=y_pred, line_kws={'color': 'red'})
plt.title("Actual vs. Predicted Log Prices for LGBM")

### Neural Network



In [None]:
# Change all to float32
x_train=np.asarray(x_train).astype(np.float32)
y_train=np.asarray(y_train).astype(np.float32)
x_test=np.asarray(x_test).astype(np.float32)
y_test=np.asarray(y_test).astype(np.float32)

In [None]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
#Step1. Define the model
model = Sequential()
model.add(
    Dense(16, activation = 'relu', kernel_initializer = 'he_normal', input_shape = (x_train.shape[1],)))
model.add(Dense(8, activation = 'relu', kernel_initializer = 'he_normal'))
model.add(Dense(4, activation = 'relu', kernel_initializer = 'he_normal'))
model.add(Dense(2, activation = 'relu', kernel_initializer = 'he_normal'))
model.add(Dense(1))
#Step2. Compile the model ('rmse' is not a built-in Keras loss name, so train on MSE and track RMSE as a metric)
model.compile(optimizer = 'adam', loss = 'mse', metrics = [tf.keras.metrics.RootMeanSquaredError()])
#Step3. Fit the model
history = model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=10,batch_size=64, verbose=0)
#Step4.1 Evaluate the model
loss, rmse = model.evaluate(x_test, y_test)
#Step4.2 Plot the learning curve
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='val')
plt.show()

In [None]:
model.summary()

In [None]:
lr_reducer = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', patience=2, factor=0.2)
early_stopper = tf.keras.callbacks.EarlyStopping(patience=5)
callbacks = [lr_reducer, early_stopper]

In [None]:
history = model.fit(x_train ,y_train, validation_data=(x_test, y_test), callbacks=callbacks, epochs=10, batch_size=64, verbose=2)

In [None]:
# Use the model to predict values
y_pred = model.predict(x_train)

# Calculate the Mean Squared Error using the mean_squared_error function.
print("Training Data")
print("Mean Squared Log Error : %0.3f" % mean_squared_error(y_train,y_pred))
print("Mean Squared Error : %0.3f" % mean_squared_error(np.exp(y_train),np.exp(y_pred)))
print("Root Mean Squared Error : %0.3f" % mean_squared_error((y_train),(y_pred))**0.5)
#print("SMAPE : %0.3f " % evaluate.symmetric_mean_absolute_percentage_error((y_train),(y_pred)))

In [None]:
# Use the model to predict values
y_pred = model.predict(x_test)

msle_nn = mean_squared_error(y_test,y_pred)
mse_nn = mean_squared_error(np.exp(y_test),np.exp(y_pred))
rmse_nn = mean_squared_error((y_test),(y_pred))**0.5
# Calculate the Mean Squared Error using the mean_squared_error function.
print("Test Data")
print("Mean Squared Log Error : %0.3f" % msle_nn)
print("Mean Squared Error : %0.3f" %  mse_nn)
print("Root Mean Squared Error : %0.3f" % rmse_nn)
#print("SMAPE : %0.3f " % evaluate.symmetric_mean_absolute_percentage_error((y_test),(y_pred)))
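
The SMAPE line above is commented out, which is why the neural network has no SMAPE entry later on. The helper `evaluate.symmetric_mean_absolute_percentage_error` belongs to this project and its exact definition is not shown here, so the following is only a minimal stand-in. It assumes the common definition (mean of |prediction - truth| divided by the average of their absolute values, in percent), and the function name `smape` is hypothetical.

```python
import numpy as np

def smape(y_true, y_pred):
    """SMAPE in percent (0 to 200), assuming the common symmetric definition."""
    # ravel() flattens Keras predictions of shape (n, 1) to shape (n,)
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    # Where truth and prediction are both 0, count the term as 0 instead of dividing by zero
    ratio = np.divide(np.abs(y_pred - y_true), denom,
                      out=np.zeros_like(denom), where=denom != 0)
    return 100.0 * ratio.mean()
```

It could be called as `smape(y_test, y_pred)` right after the test predictions are computed. Whether the values are comparable to the other models depends on how the project helper is defined.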

In [None]:
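# Predicted vs. actual values for the neural network
fig = plt.figure(figsize=(10,3))

# model.predict returns shape (n, 1); ravel() flattens it to the 1-D input regplot expects
sns.regplot(x=np.ravel(y_test), y=np.ravel(y_pred), line_kws={'color': 'red'})
plt.title("Predicted vs. Actual Values for the Neural Network")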

### Performance 

The following table summarizes the evaluation metrics of each model we tried.

In [None]:
# Only for test data
models_used = ['Linear Regression', 'Random Forest Regressor', 'XGB Regressor', 'LGBM Regressor', 'Simple Neural Network']

# '-' marks metrics that were not computed for a model
performance_metrics = {
    'R^2': [r2_reg, r2_rf, r2_xgb, '-', '-'],
    'MSE': [mse_reg, mse_rf, mse_xgb, mse_gbm, mse_nn],
    'RMSE': [rmse_reg, rmse_rf, rmse_xgb, rmse_gbm, rmse_nn],
    'MSLE': [msle_reg, msle_rf, msle_xgb, msle_gbm, msle_nn],
    'SMAPE': [smape_reg, smape_rf, smape_xgb, smape_gbm, '-'],
}

pd.DataFrame(performance_metrics, index=models_used)

Deep learning models are known to perform rather poorly on tabular data. An alternative would have been to separate the text features and feed them into a neural network, while the tabular features are fed to our final method. In the end, both predictions could be combined into a final result.
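
Combining the two predictions was not implemented in this notebook. As a rough sketch of what it could look like, the snippet below blends two sets of log-price predictions with a weighted average and transforms the result back to the price scale. The function name `blend_log_predictions` and the default weight of 0.7 are hypothetical; in practice the weight would have to be tuned on validation data.

```python
import numpy as np

def blend_log_predictions(log_pred_tabular, log_pred_text, weight_tabular=0.7):
    """Weighted average of two log-price predictions, returned on the price scale."""
    a = np.asarray(log_pred_tabular, dtype=float).ravel()
    b = np.asarray(log_pred_text, dtype=float).ravel()
    if a.shape != b.shape:
        raise ValueError("both models must predict the same number of listings")
    if not 0.0 <= weight_tabular <= 1.0:
        raise ValueError("weight_tabular must lie between 0 and 1")
    # Average in log space, because the models in this notebook are trained on log prices
    blended_log = weight_tabular * a + (1.0 - weight_tabular) * b
    return np.exp(blended_log)
```

Averaging in log space corresponds to a weighted geometric mean of the two price predictions, which is less sensitive to a single very high prediction than an arithmetic mean on the price scale.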

## Final Method

In [None]:
# Grid Search 
gsc = GridSearchCV(
            estimator=xlf,
            param_grid={"learning_rate": (0.05, 0.10, 0.15),
                        "max_depth": [ 3, 4, 5, 6, 8],
                        "min_child_weight": [ 1, 3, 5, 7],
                        "gamma":[ 0.0, 0.1, 0.2],
                        "colsample_bytree":[ 0.3, 0.4],},
            cv=3, scoring='neg_mean_squared_error', verbose=0, n_jobs=-1)

grid_result = MultiOutputRegressor(gsc).fit(x_train,np.asarray(y_train).reshape(-1,1))

best_params = grid_result.estimators_[0].best_params_  # for the first y_target estimator


In [None]:
train_xgb = xgboost.DMatrix(x_train, y_train)
test_xgb = xgboost.DMatrix(x_test, y_test)


In [None]:
best_params = {
    "colsample_bytree":0.4, 
    "gamma":0.2,
    "learning_rate":0.1,
    "max_depth":8, 
    "min_child_weight":3
}

In [None]:
# training
model = XGBRegressor(
    colsample_bytree=0.4,
    gamma=0.2, 
    learning_rate=0.1, 
    max_depth=8,
    min_child_weight=3
)
model.fit(x_train, y_train, 
          eval_set=[(x_train, y_train), (x_test, y_test)], 
          early_stopping_rounds=20)

In [None]:
predictions = model.predict(test, ntree_limit=model.best_ntree_limit)
predictions = pd.Series(predictions.ravel(), index=test.index, name='price')
# The model predicts log prices, so transform the predictions back to the price scale
predictions = np.exp(predictions)


In [None]:
predictions

In [None]:
# Distribution of price predictions
plt.title('Distribution of Price Predictions')
sns.distplot(predictions)

In [None]:
# save preds
predictions.to_csv('../predictions/sample_submission_604930.csv')
# save model (the context manager closes the file handle after writing)
with open("../data/models/xgb_reg.pkl", "wb") as f:
    pickle.dump(model, f)


## Conclusion and Outlook

Within the scope of this term paper, many approaches came to mind, and the project still offers plenty of room for improvement. We included the review data and the image data in our model and hoped for a great improvement in predictive power, which was sadly not the case. We therefore refrained from adding any further text features, which should be approached differently in the future.
The following list gives an overview of approaches that could be used to improve our Airbnb-Price-Predictor:

    1. Use BERT on the description data of each Airbnb to get word embeddings, and possibly add a regression layer on top to predict the price.
    2. Use an object detection CNN like YOLO, or a CNN for image processing, on the actual crawled pictures of each Airbnb to add additional features.
    3. Use the Yelp data to get additional features, as explained in the section *Feature Engineering*.
    4. Try to cluster the numeric features to get more information about each price cluster. It might be worth a try to transform our current regression problem into a classification problem to detect different price clusters and then predict a price for each segment.
    5. Get even more information on each Airbnb listing. Listings usually state whether one has to pay a cleaning fee, which might be an important feature as well.
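
To make the fourth idea a little more concrete, here is a minimal sketch of how listings could be assigned to price segments before training one model per segment. It uses simple quantile binning of the prices rather than a real clustering algorithm, which is a simplifying assumption. The function name `assign_price_segments` and the choice of three segments are hypothetical.

```python
import numpy as np

def assign_price_segments(prices, n_segments=3):
    """Assign each price to one of n_segments quantile-based segments (0 = cheapest)."""
    prices = np.asarray(prices, dtype=float).ravel()
    # Interior quantile edges, e.g. the 1/3 and 2/3 quantiles for three segments
    quantiles = np.linspace(0.0, 1.0, n_segments + 1)[1:-1]
    edges = np.quantile(prices, quantiles)
    # right=True puts a price that equals an edge into the lower segment
    labels = np.digitize(prices, edges, right=True)
    return labels, edges

labels, edges = assign_price_segments([10, 20, 30, 40, 50, 60])
print(labels)
```

The resulting labels could serve as the target of a classifier, with a separate regressor trained on the listings of each segment.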

There is always so much more you can do!

## References

1. Kincl, Tomáš & Novák, Michal & Pribil, Jiri. (2016). Sentiment Classification in Multiple Languages: Fifty Shades of Customer Opinions. 10.1007/978-3-319-22593-7_19.

2. Patel, Miral & Darji, Mittal & janki,. (2018). Stock Price Prediction Using Clustering and Regression: A Review.

3. Rezazadeh, Pouya & Nikolenko, Liubov & Rezaei, Hoormazd. (2019). Airbnb Price Prediction Using Machine Learning and Sentiment Analysis.

4. Trivedi, Shubhendu & Pardos, Zachary & Heffernan, Neil. (2015). The Utility of Clustering in Prediction Tasks.

5. Xing, Yazhou & Qian, Zian & Chen, Qifeng. (2021). Invertible Image Signal Processing.