# Milan Price Prediction
N.B. To see the output of the commands, open my Colab notebook: https://colab.research.google.com/drive/1SNDJqy8hIXT6z2BjiBQqc3YifkhVX1_9?usp=sharing

This summer I am going to work in Milan, so I thought it would be interesting to predict the price at which an Airbnb apartment is rented. That way I can estimate whether the place I am renting is overpriced or underpriced.





## Context 
The dataset I am using is [this one](https://www.kaggle.com/antoniokaggle/milan-airbnb-open-data-only-entire-apartments), which contains the entire-apartment listings located in Milan. The data was originally taken from the Airbnb site.

## Problem statement
Given a set of variables describing an apartment, we would like to predict the price at which it should be rented.

## Setup
Import the necessary libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import sklearn
%matplotlib inline

Load the data

In [None]:
apartments = pd.read_csv("../input/milan-airbnb-open-data-only-entire-apartments/Airbnb_Milan.csv")

Let's start by taking a look at the data

In [None]:
apartments.info()

We can see that we have many fields, some of which may not be that helpful. We also note that there are no null values

In [None]:
apartments.head()

## Data cleaning

Let's look more closely at the data in order to clean it

In [None]:
apartments.head()

As we can see, we can safely remove the Unnamed column (which is just a record counter), the id of the apartment, the host id and the zip code.

In [None]:
apartments.drop(apartments.columns[[0, 1, 2]],axis=1,inplace=True)
apartments.drop(columns=["zipcode"], inplace=True)

We are going to assume a stay of 7 days. Since the guest has to pay both the cleaning fee and the daily price, it's convenient to put them together in a new column. The cleaning fee is paid only once.

In [None]:
weekly_price = apartments.cleaning_fee + apartments.daily_price * 7
apartments["weekly_price"] = weekly_price.values
apartments.drop(columns=["cleaning_fee", "daily_price"], inplace=True)
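
As a quick sanity check of the formula above, here is a minimal sketch on a toy table. The two listings and their prices are invented for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Toy listings; the values are invented for illustration only
toy = pd.DataFrame({"daily_price": [50, 120], "cleaning_fee": [30, 0]})

stay_nights = 7  # same assumption as above
# The cleaning fee is charged once, the daily price once per night
toy["weekly_price"] = toy["cleaning_fee"] + toy["daily_price"] * stay_nights

print(toy)
```

The first listing comes out at 30 + 50 * 7 = 380, which is what we would pay for a week.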


We are also going to remove some columns that are not useful for our analysis. For example, the room type is always the same:

In [None]:
apartments.room_type.value_counts()

We can remove it

In [None]:
apartments.drop(columns=["room_type"], inplace=True)

The dataset is overall pretty clean and the remaining columns may be useful for our predictions, so for now we are going to keep them.

## Exploratory analysis




### Visualizing the data
One interesting thing we can do is visualize the apartments on a map. For the map to be relevant, we should plot the points given their latitude, longitude, weekly price, and the "municipio" in which they are. 

In [None]:
import urllib.request  # the request submodule must be imported explicitly

cmap = cm.jet
m = cm.ScalarMappable(cmap=cmap)
quartiere_colors = m.to_rgba(apartments.neighbourhood_cleansed)
quant_minimum = apartments.weekly_price.quantile(0.1)
quant_maximum = apartments.weekly_price.quantile(0.9)
price = ((apartments.weekly_price - quant_minimum) / (quant_maximum - quant_minimum)) * 30

#initializing the figure size
plt.figure(figsize=(20,20))
#loading the png image of Milan (taken from OpenStreetMap) from its hosted URL
i=urllib.request.urlopen('https://i.ibb.co/s1Jf5k7/map-2.png')
mil_img=plt.imread(i)
#scaling the image based on the latitude and longitude max and mins for proper output
plt.imshow(mil_img, zorder=0, extent=[
                                      apartments.longitude.min(), 
                                      apartments.longitude.max(), 
                                      apartments.latitude.min(), 
                                      apartments.latitude.max(),
                                    ]
           )

ax=plt.gca()
# plot the points
apartments.plot(kind='scatter', x='longitude', y='latitude', label='price', c=quartiere_colors, s=price, ax=ax, zorder=5, edgecolors='black')

patch = []
for a in range(1, 9):
  patch.append(mpatches.Patch(color=m.to_rgba(a), label='Neighborhood ' + str(a)))

plt.legend(handles=patch)

plt.show()

We can notice that the majority of the Airbnb apartments are in the center (neighbourhood = 1, the one in dark blue), and that prices drop the further we move from the center

In [None]:
apartments.neighbourhood_cleansed.value_counts()

### Visualizing specific types of apartments

Let's now take a look at parts of our data on the map. Let's first create a function to draw the graph


In [None]:
def draw_plot(apartments):
    cmap = cm.jet
    m = cm.ScalarMappable(cmap=cmap)
    quartiere_colors = m.to_rgba(apartments.neighbourhood_cleansed)
    quant_minimum = apartments.weekly_price.quantile(0.1)
    quant_maximum = apartments.weekly_price.quantile(0.9)
    price = ((apartments.weekly_price - quant_minimum) / (quant_maximum - quant_minimum)) * 30

    #initializing the figure size
    plt.figure(figsize=(20,20))
    #loading the png image of Milan (taken from OpenStreetMap) from its hosted URL
    i=urllib.request.urlopen('https://i.ibb.co/s1Jf5k7/map-2.png')
    mil_img=plt.imread(i)
    #scaling the image based on the latitude and longitude max and mins for proper output
    plt.imshow(mil_img, zorder=0, extent=[
                                          apartments.longitude.min(), 
                                          apartments.longitude.max(), 
                                          apartments.latitude.min(), 
                                          apartments.latitude.max(),
                                        ]
              )

    ax=plt.gca()
    # plot the points
    apartments.plot(kind='scatter', x='longitude', y='latitude', label='price', c=quartiere_colors, s=price, ax=ax, zorder=5, edgecolors='black')
    patch = []
    for a in range(1, 9):
      patch.append(mpatches.Patch(color=m.to_rgba(a), label='Neighborhood ' + str(a)))

    plt.legend(handles=patch)
    plt.show()
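
One detail of the size formula used in `draw_plot`: listings priced below the 10th percentile get a negative size, and listings above the 90th percentile get a size larger than 30. A possible variant, sketched below, clips the sizes to a fixed range. The helper name `marker_sizes`, the `min_size` parameter and the toy prices are my own assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

def marker_sizes(prices, low_q=0.1, high_q=0.9, max_size=30, min_size=1):
    """Scale prices to scatter marker sizes, clipped to [min_size, max_size]."""
    prices = pd.Series(prices, dtype=float)
    q_low, q_high = prices.quantile(low_q), prices.quantile(high_q)
    # Same quantile-based scaling as in draw_plot
    scaled = (prices - q_low) / (q_high - q_low) * max_size
    # Keep every marker visible and bounded
    return scaled.clip(lower=min_size, upper=max_size)

toy_prices = [100, 300, 400, 500, 600, 700, 5000]  # invented values
sizes = marker_sizes(toy_prices)
print(sizes)
```

With clipping, the cheapest and the most expensive listings are still drawn, just at the smallest and largest marker size.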

#### Distribution by price
Now, let's visualize the apartments that cost more than 800€ per week


In [None]:
draw_plot(apartments[apartments.weekly_price > 800])

As I was saying before, we can clearly see that the very expensive apartments (above the 75th percentile) are all concentrated in the center. In contrast, looking at inexpensive apartments

In [None]:
draw_plot(apartments[apartments.weekly_price < 350])

We can see how none of them are in the center.

#### Distribution by number of reviews
It may also be interesting to see which apartments are visited the most. Let's take a look using the number of reviews

In [None]:
apartments.number_of_reviews.describe()

In [None]:
draw_plot(apartments[apartments.number_of_reviews > 44])

In [None]:
draw_plot(apartments[apartments.number_of_reviews <= 4])

I expected the apartments in the center to be visited more often, but judging by our graphs that's not the case. One thing to notice is that very expensive apartments have very few reviews.

#### Distribution by number of bedrooms
It may also be interesting to look at where bigger houses are located. Let's consider them

In [None]:
draw_plot(apartments[apartments.bedrooms > 2])

In [None]:
draw_plot(apartments[apartments.bedrooms == 1])

We can notice that most of the apartments have only one bedroom. Contrary to what I believed, big houses can be found in all the neighborhoods, and they are not that expensive.

### Visualizing correlations

Let's now explore some possible correlations in the data to see which parameters may be useful for predicting the price. Let's start with how the price varies by neighbourhood

In [None]:
apartments.boxplot(by="neighbourhood_cleansed", column="weekly_price", figsize=(10,10), showmeans=True)

We can see that the price distribution varies a lot. On average, prices in the center of Milan are slightly higher, which was expected. What stands out most is the number of outliers. We will have to deal with them before we start predicting.
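
To put a number on "outliers", here is a minimal sketch of the rule a box plot uses by default: points further than 1.5 times the interquartile range from the box are drawn as outliers. The helper name `iqr_outliers` and the toy prices are assumptions made for illustration.

```python
import pandas as pd

def iqr_outliers(values, whisker=1.5):
    """Return the values lying outside the box-plot whiskers."""
    values = pd.Series(values, dtype=float)
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return values[(values < lower) | (values > upper)]

toy_prices = [300, 350, 400, 450, 500, 550, 600, 5000]  # invented values
outliers = iqr_outliers(toy_prices)
print(outliers)
```

On the real data, something like `iqr_outliers(apartments.weekly_price)` would list the flagged listings, assuming the column is numeric.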

#### Matrix of correlations
Let's now look at the Pearson correlation between each pair of variables.

In [None]:
# taken from https://stackoverflow.com/questions/29432629/plot-correlation-matrix-using-pandas
f = plt.figure(figsize=(25, 20))
plt.matshow(apartments.corr(method='pearson'), fignum=f.number)
plt.xticks(range(apartments.shape[1]), apartments.columns, fontsize=14, rotation=90)
plt.yticks(range(apartments.shape[1]), apartments.columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)

This matrix is super interesting. I want to highlight a few things that are worth mentioning.

Regarding the weekly price:
*   It seems that **the weekly price is strongly correlated with the number of beds, bedrooms and how many people the apartment can accommodate**, which I didn't notice on the map of Milan. It is, however, expected.
*   It also seems that **there is a correlation with the number of guests and extra people in the house**. Since this correlation is positive, it wasn't quite expected, as having more guests in the house should usually lower its price.
*   There also seems to be a **negative correlation with the number of reviews**, meaning that the fewer the reviews, the higher the price. That makes sense: people don't like to spend money, so they will prefer a cheaper place over a more expensive one.
*   Finally, it seems that **the more listings a host has, the higher their price is.**

Regarding other interesting correlations:
*   The host response rate and the host response time are correlated. Makes sense.
*   If there are guests included, the number of bedrooms and beds is higher. This is interesting because the latter refers only to the beds available to the client, not the total.
*   All the review scores are strongly correlated with each other. Indeed, a listing that gets a positive review on one aspect is likely to get a positive review on the others.
*   What's more, a superhost receives a higher number of reviews with a more positive rating. This suggests that once you are a superhost, you are more likely to get more positive reviews.
*   In contrast, a host with a higher number of listings receives fewer reviews, with a more negative rating. This may be because with a lot of listings it's harder to pay enough attention to all the apartments.
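
Reading exact values off a colour matrix is hard, so a numeric ranking can back up the observations above. The sketch below sorts the features by the strength of their Pearson correlation with the price. The toy table is invented; only the column names are taken from the dataset.

```python
import pandas as pd

# Invented toy data with the same kind of columns as the dataset
toy = pd.DataFrame({
    "bedrooms":          [1, 1, 2, 2, 3, 4],
    "number_of_reviews": [80, 60, 40, 30, 10, 5],
    "weekly_price":      [300, 350, 500, 550, 800, 1000],
})

# Pearson correlation of every feature with the target, strongest first
corr_with_price = (
    toy.corr(method="pearson")["weekly_price"]
    .drop("weekly_price")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_price)
```

Sorting by absolute value keeps strong negative correlations, such as the one with the number of reviews, near the top.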


## Feature Engineering

We want to express the latitude and longitude in a more meaningful way. Let's add a new column, `dist_from_center`, which represents the distance of every apartment from the center of Milan

In [None]:
from geopy.distance import great_circle

def distance_to_mid(lat, lon):
    milan_centre = (45.464664, 9.188540)
    accommodation = (lat, lon)
    return great_circle(milan_centre, accommodation).km

apartments["dist_from_center"] = apartments.apply(lambda x: distance_to_mid(x.latitude, x.longitude), axis=1)
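
For readers without `geopy`, the great-circle distance can be sketched in plain Python with the haversine formula. This is an illustration on a spherical Earth model, not necessarily the exact computation `geopy` performs. The function name `haversine_km` and the radius constant are my own choices.

```python
import math

EARTH_RADIUS_KM = 6371.009  # mean Earth radius; assumed spherical model
MILAN_CENTRE = (45.464664, 9.188540)  # same coordinates as above

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# A point 0.01 degrees of latitude north of the centre is roughly 1.1 km away
d = haversine_km(*MILAN_CENTRE, MILAN_CENTRE[0] + 0.01, MILAN_CENTRE[1])
print(d)
```

At the scale of a city, the spherical approximation should be more than accurate enough for a feature like `dist_from_center`.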

## Feature Selection

Before starting to predict, we want to remove the columns that have very low variability: they do not affect the prediction and only make it slower

In [None]:
from sklearn.feature_selection import VarianceThreshold

def remove_almost_constant_columns(threshold=0):
    qconstant_filter = VarianceThreshold(threshold=threshold)
    qconstant_filter.fit(apartments)
    constant_columns = [column for column in apartments.columns
                        if column not in apartments.columns[qconstant_filter.get_support()]]
    print(constant_columns)
    apartments.drop(labels=constant_columns, axis=1, inplace=True)

remove_almost_constant_columns(threshold=0.03) # we remove data that is 97% of the time the same
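
The "97% of the time the same" reading of the 0.03 threshold holds for 0/1 columns, whose variance is p * (1 - p) when a share p of the rows is 1. For other numeric columns the variance also depends on the scale of the values. A small sketch with invented columns:

```python
import numpy as np

def binary_variance(p):
    """Population variance of a 0/1 column whose share of ones is p."""
    return p * (1 - p)

# A column that is 1 in 97% of rows falls just under the 0.03 threshold
col_97 = np.array([1] * 97 + [0] * 3)
print(col_97.var(), binary_variance(0.97))

# A column that is 1 in 90% of rows is above the threshold and would be kept
col_90 = np.array([1] * 90 + [0] * 10)
print(col_90.var() > 0.03)
```

This is why amenity flags that almost every host offers are the first candidates to be dropped by the filter.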

## Price prediction

### Note before starting
Before starting to predict, recall that the weekly price presents a lot of outliers. For a regression model to work correctly, we would ideally like the target to follow a [Gaussian curve rather than a skewed distribution](https://towardsdatascience.com/skewed-data-a-problem-to-your-statistical-model-9a6b5bb74e37). Therefore, we must first transform our weekly price into something that more closely resembles that.

As suggested by [a notebook about the New York data](https://www.kaggle.com/duygut/airbnb-nyc-price-prediction), we are going to try a logarithm.

First, let's take a look at the weekly price distribution now

In [None]:
import seaborn as sns
from scipy.stats import norm

def plot_price_distribution(prices):
  plt.figure(figsize=(10,10))
  sns.distplot(prices, fit=norm)
  plt.title("Price Distribution Plot", size=15, weight='bold')

plot_price_distribution(apartments.weekly_price)

As stated before, the distribution is highly skewed. Let's try to use a log

In [None]:
log_weekly_price = np.log2(apartments.weekly_price)
plot_price_distribution(log_weekly_price)

Much better, let's add it to the dataframe

In [None]:
apartments["log_weekly_price"] = log_weekly_price.values

Let's get started and try to make some predictions. First, let's split the dataset into X and y, and then into a train set and a test set.

In [None]:
from sklearn.model_selection import train_test_split

X = apartments.drop(columns=["log_weekly_price", "weekly_price"])
y = apartments.log_weekly_price

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

We will also define a 5-fold cross validation that we are going to use whenever possible

In [None]:
from sklearn.model_selection import KFold
kf = KFold(5, shuffle=True, random_state=42)

Let's now create a function that will help us in training our models and assessing their scores

In [None]:
from sklearn.model_selection import cross_validate

def train_and_validate(model, X, y, cv):
    cv_result = cross_validate(model, X, y, cv=cv, return_train_score=True)  # use the cv argument, not the global kf
    return pd.DataFrame(cv_result)
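
The table returned by this helper has one row per fold, so it helps to summarise it. Below is a minimal sketch on synthetic data (invented for illustration) that averages the train and test scores across folds. For regressors, the default score reported by `cross_validate` is R².

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression data, invented for illustration
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

cv = KFold(5, shuffle=True, random_state=42)
scores = pd.DataFrame(
    cross_validate(LinearRegression(), X_toy, y_toy, cv=cv, return_train_score=True)
)

# One row per fold; aggregate into mean and standard deviation
summary = scores[["train_score", "test_score"]].agg(["mean", "std"])
print(summary)
```

A large gap between the mean train score and the mean test score is a quick hint of overfitting.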

Finally, let's also define a function for the confidence interval of our predictions

In [None]:
from statsmodels.stats.proportion import proportion_confint

def confidence_interval(n_elements, R2_score, confidence):    
    return proportion_confint(n_elements * R2_score, n_elements, 1-confidence/100, method='wilson')

def print_confidence_interval(n_elements, R2_score):
    lower, upper = confidence_interval(n_elements, R2_score, 95)
    print(f"Interval of confidence: {lower:.3f}, {upper:.3f}")
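
To make it clearer what `method='wilson'` computes, here is a hand-written sketch of the Wilson score interval using only the standard library. The function name `wilson_interval` is my own, and I have not compared its output digit by digit against statsmodels, so treat it as an illustration of the formula.

```python
import math
from statistics import NormalDist

def wilson_interval(p_hat, n, confidence=0.95):
    """Wilson score interval for a proportion p_hat observed on n samples."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width

lower, upper = wilson_interval(0.5, 100)
print(f"Interval of confidence: {lower:.3f}, {upper:.3f}")
```

The interval narrows as `n` grows, so a larger test set gives a tighter range around the same score.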


### A simple LinearRegression

In [None]:
from sklearn.linear_model import LinearRegression

result = train_and_validate(LinearRegression(), X, y, kf)
print(result)

The results are promising, especially if we compare them with the ones obtained without the log



In [None]:
print(train_and_validate(LinearRegression(), X, apartments.weekly_price, kf))

### Normalization of data

Let's now try to use a StandardScaler to see if we can perform better



In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

model = Pipeline([
    ("scale",  StandardScaler()),   # <- added
    ("linreg", LinearRegression())
])
print(train_and_validate(model, X, y, kf))


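There is no difference at all, which is expected: rescaling the features does not change the predictions of an ordinary least-squares fit.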

### Regularization

Let's now try to swap the LinearRegression with an ElasticNet to see if L1 and L2 regularization can improve our scores

In [None]:
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

model = Pipeline([
    ("scale",  StandardScaler()),   # <- added
    ("regr", ElasticNet())
])

grid = {
    "regr__l1_ratio": np.linspace(0, 1, 10),      # <- mix between the L2 (0) and L1 (1) penalties
    "regr__alpha":  [0.1, 1, 10] # <- regularization strength
}
gs = GridSearchCV(model, grid, cv=kf)
gs.fit(X_train, y_train);

display(pd.DataFrame(gs.cv_results_).sort_values("mean_test_score", ascending=False))
print_confidence_interval(len(X_test), gs.score(X_test, y_test))


However, we don't get a strong improvement, even for `alpha = 0.1` and `l1_ratio = 0`. What we should do in this case is try a non-linear regression
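
To see what the two grid parameters control, here is a sketch of the penalty that scikit-learn's ElasticNet adds to the squared-error loss, as described in its documentation: `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`. The coefficient vector below is invented for illustration.

```python
import numpy as np

def elastic_net_penalty(w, alpha, l1_ratio):
    """Penalty term added to the squared-error loss by ElasticNet."""
    w = np.asarray(w, dtype=float)
    l1 = np.abs(w).sum()
    l2_squared = (w ** 2).sum()
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1 - l1_ratio) * l2_squared

w = [1.0, -2.0]  # invented coefficients
pure_l2 = elastic_net_penalty(w, alpha=0.1, l1_ratio=0.0)  # ridge-like penalty
pure_l1 = elastic_net_penalty(w, alpha=0.1, l1_ratio=1.0)  # lasso penalty
mixed = elastic_net_penalty(w, alpha=0.1, l1_ratio=0.5)    # even mix
print(pure_l2, pure_l1, mixed)
```

So `alpha` sets how strongly the coefficients are penalised overall, while `l1_ratio` decides how much of that penalty pushes coefficients to exactly zero.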

### Non Linear Regression





Let's now try a non-linear regression to see if we can improve our score.

In [None]:
from sklearn.kernel_ridge import KernelRidge
# best param alpha = 50, coef0=4, degree = 3
model = Pipeline([
    ("scale", StandardScaler()),
    ("regr",  KernelRidge(alpha=20, kernel="poly", degree=3, coef0=2))
])
grid = {
    "regr__alpha":  np.linspace(50, 200, 3), # <- regularization strength
    "regr__coef0": [4,5,6,7,3],
}
gs = GridSearchCV(model, grid, cv=kf)
gs.fit(X_train, y_train);
display(pd.DataFrame(gs.cv_results_).sort_values("mean_test_score", ascending=False))
print_confidence_interval(len(X_test), gs.score(X_test, y_test))

With KernelRidge we obtain a better result, with a score of around 49%. However, if we want to further improve accuracy, we need a new model
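
For intuition about the `degree` and `coef0` parameters tuned above, here is a sketch of the polynomial kernel, `(gamma * <x, y> + coef0) ** degree`. The default `gamma = 1 / n_features` is my assumption about what scikit-learn uses when no `gamma` is given, and the two vectors are invented.

```python
import numpy as np

def poly_kernel(x, y, degree=3, coef0=1.0, gamma=None):
    """Polynomial kernel (gamma * <x, y> + coef0) ** degree."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if gamma is None:
        gamma = 1.0 / x.shape[0]  # assumed default: 1 / number of features
    return (gamma * x.dot(y) + coef0) ** degree

# Two invented, already-scaled feature vectors
a, b = [1.0, 2.0], [3.0, 4.0]
k = poly_kernel(a, b, degree=3, coef0=4)
print(k)
```

A larger `coef0` gives more weight to the lower-order terms of the polynomial, which is why it is worth tuning together with `alpha`.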

## Why are our regressions performing badly?

The main problem that makes our regressions perform badly seems to be the outliers. With the logarithm we greatly improved accuracy, but maybe that's not enough. Let's create a new box plot, this time showing the log of the price

In [None]:
red_square = dict(markerfacecolor='r', markeredgecolor='r', marker='.')
apartments.log_weekly_price.plot(kind='box', xlim=(6, 15), vert=False, flierprops=red_square, figsize=(20,2));

As we can see from this graph, a lot of the values are outliers, which makes our regression struggle. For this reason, let's try to exclude these outliers and see how our model performs

In [None]:
apartments.drop(apartments[(apartments.log_weekly_price > 13) | (apartments.log_weekly_price < 7) ].index, axis=0, inplace=True)
apartments.log_weekly_price.plot(kind='box', xlim=(7, 13), vert=False, flierprops=red_square, figsize=(20,2));

In [None]:
X = apartments.drop(columns=["log_weekly_price", "weekly_price"])
y = apartments.log_weekly_price

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

Then, we can try models that perform relatively well even with outliers

### Random Forest Regressor

A well-known model that performs well with outliers is the Random Forest regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

model = RandomForestRegressor()

grid = { 
            "n_estimators"      : [10,20,30,50,100],
            "max_features"      : [1.0, "sqrt", "log2"],  # 1.0 (all features) replaces "auto", which recent scikit-learn versions no longer accept
            "min_samples_split" : [2,4,8],
            "bootstrap": [True, False],
            "max_depth": [1,20,100, None],
            "min_samples_leaf": [1, 5, 10]
}
gs = GridSearchCV(model, grid, cv=kf)
gs.fit(X_train, y_train);
display(pd.DataFrame(gs.cv_results_).sort_values("mean_test_score", ascending=False))
print_confidence_interval(len(X_test), gs.score(X_test, y_test))

As we can see, with a RandomForestRegressor we reach about 52% accuracy. Another model that performs well with outliers is XGBoost

### XGBoost

To further improve our accuracy, we decided to use XGBoost, which is the state-of-the-art regression algorithm for most situations. First, we want to search for the best hyperparameters for XGBoost

In [None]:
from sklearn.model_selection import GridSearchCV
import xgboost as xgb


booster = xgb.XGBRegressor()
# create Grid
param_grid = {'n_estimators': [100, 150, 200],
              'learning_rate': [0.01, 0.05, 0.1], 
              'max_depth': [3, 4, 5, 6, 7],
              'colsample_bytree': [0.6, 0.7, 1],
              'gamma': [0.0, 0.1, 0.2],
              'alpha': [0.0, 0.5, 1, 2]}

# set up the grid search over the XGBoost hyperparameters
booster_grid_search = GridSearchCV(booster, param_grid, cv=3, n_jobs=-1)

# run the grid search on the training set
booster_grid_search.fit(X_train, y_train)

# print best estimator parameters found during the grid search
print(booster_grid_search.best_params_)

Then, we want to compare it to a standard LinearRegression

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
import xgboost as xgb
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test  = sc.transform(X_test)

data_dmatrix = xgb.DMatrix(data=X,label=y)
booster = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.6, learning_rate = 0.05,
                max_depth = 6, gamma = 0, alpha=1, n_estimators = 300)

booster.fit(X_train,y_train)

preds = booster.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))
print(f"R2 score: {r2_score(y_test, preds)}")
print_confidence_interval(len(X_test), r2_score(y_test, preds))

lin = LinearRegression()
lin.fit(X_train, y_train)
preds = lin.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))
print(f"R2 score: {r2_score(y_test, preds)}")
print_confidence_interval(len(X_test), r2_score(y_test, preds))



We can see that we get a score almost 14% better than a plain linear regression. Let's now use cross-validation to see how our XGBoost model performs

In [None]:
xg_train = xgb.DMatrix(data=X_train, label=y_train)
params = {'colsample_bytree': 0.6, 'gamma': 0, 'alpha': 1,  'learning_rate': 0.05, 'max_depth': 6}

cv_results = xgb.cv(dtrain=xg_train, params=params, nfold=4,
                    num_boost_round=400, early_stopping_rounds=10, 
                    metrics="rmse", as_pandas=True)

In [None]:
cv_results.tail()

In [None]:
train_and_validate(booster, X, y, kf)

We can see that it reaches 60% accuracy on the test folds and almost 80% on the training folds. The RMSE it obtains is about 0.49, which means that for any prediction we must allow for an error of ±0.49 on the log price.

For example, for an estimated log price of 9, which is 2**9 = 512€ per week, the confidence range of the price is 364€ - 719€.
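
The arithmetic behind this range can be sketched as follows. Because the model predicts the base-2 log of the price, an error of ±0.49 on the log becomes a multiplicative factor of about 2**0.49 ≈ 1.4 on the price, so the range is not symmetric around 512. The helper name `price_range` is my own.

```python
def price_range(log2_price, rmse):
    """Convert a log2 prediction +/- rmse back to a price interval."""
    low = 2 ** (log2_price - rmse)
    mid = 2 ** log2_price
    high = 2 ** (log2_price + rmse)
    return low, mid, high

low, mid, high = price_range(9, 0.49)
print(f"low={low:.1f}, estimate={mid}, high={high:.1f}")
```

The same helper can be reused for any other predicted log price, for example `price_range(10, 0.49)` for an apartment estimated at 1024 per week.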


### Performance with all outliers
Let's now see how xgboost performs with all the outliers 


In [None]:
# we put back all the outliers by re-executing code above
train_and_validate(booster, X, y, kf)

We lose on average 2.5% accuracy. 

### Performance without Airbnb review data
Let's say that, as a new Airbnb host, we want to predict the price at which we should rent out our house. A new listing has no reviews yet, so we can't use the review data for the prediction. Let's then remove all this data and see how our algorithm performs

In [None]:
X_no_airbnb_data = X.drop(columns=[
                                   "number_of_reviews", 
                                   "review_scores_rating", 
                                   "review_scores_accuracy", 
                                   "review_scores_cleanliness", 
                                   "review_scores_checkin", 
                                   "review_scores_communication", 
                                   "review_scores_location", 
                                   "review_scores_value",
                                  ])
train_and_validate(booster, X_no_airbnb_data, y, kf)

In this case, we lose about 3% in accuracy on average.




## Comparing our best model against a random model

Let's see how our XGBoost model performs against a random model that draws its predictions from the distribution of the log price

In [None]:
y_train.plot.hist(bins=40, figsize=(12, 4));

In [None]:
np.random.seed(42)
random_preds = np.random.normal(
    y_train.mean(),   # center (mean)
    y_train.std(),    # scale (standard deviation)
    len(y)        # number of samples
)
plt.figure(figsize=(12, 4))
plt.hist(random_preds, bins=40);

In [None]:
scores = []
for i in range(1, 1000):
  np.random.seed(i)
  random_preds = np.random.normal(
      y_train.mean(),   # center (mean)
      y_train.std(),    # scale (standard deviation)
      len(y)        # number of samples
  )
  scores.append(r2_score(y, random_preds))

np.mean(scores)

So, a random model is definitely worse than our XGBoost algorithm
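
A short sketch helps interpret the score of such a baseline. Predictions drawn independently from the same distribution as the target have roughly twice the target's variance as squared error, so their R² lands near -1, while always predicting the mean gives exactly 0. The synthetic log prices below are invented, and `r2` is a hand-written version of the usual formula.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

rng = np.random.default_rng(0)
y_toy = rng.normal(9.0, 1.0, size=10_000)  # invented log prices
random_preds = rng.normal(y_toy.mean(), y_toy.std(), size=y_toy.size)
mean_preds = np.full_like(y_toy, y_toy.mean())

r2_random = r2(y_toy, random_preds)  # close to -1
r2_mean = r2(y_toy, mean_preds)      # 0: the constant-mean baseline
print(r2_random, r2_mean)
```

Any model with a positive R² therefore already beats both the random baseline and the constant-mean baseline.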

## Conclusions


### Visualizing the most important features
Let's visualize the features that are the most important for our XGBoost model using [SHAP](https://github.com/slundberg/shap)



In [None]:
import xgboost
import shap
# download shap with pip3 install https://github.com/slundberg/shap/archive/master.zip
# load JS visualization code to notebook
shap.initjs()

# explain the model's predictions using SHAP
# (same syntax works for LightGBM, CatBoost, scikit-learn and spark models)
explainer = shap.TreeExplainer(booster)
shap_values = explainer.shap_values(X)

# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value, shap_values[0,:], X.iloc[0,:])

The explanation above concerns the first prediction and shows how each feature contributes to pushing the model output from the base value, `9.299`, to the final prediction. Features pushing the prediction higher are shown in red, those pushing it lower are in blue
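
The key property behind this plot is additivity: the base value plus the sum of the per-feature contributions equals the model output. The sketch below illustrates it without the `shap` library, on an invented linear model with independent features, where the contribution of feature i reduces to `weight_i * (x_i - mean(x_i))`. It is an illustration of the idea, not of what `TreeExplainer` does internally.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))     # invented features
weights = np.array([0.8, -0.5, 0.1])  # invented linear model
intercept = 9.0

def predict(X):
    return X @ weights + intercept

# Base value: the average prediction over the data
base_value = predict(X_toy).mean()
# Per-feature contributions for a linear model with independent features
contributions = weights * (X_toy - X_toy.mean(axis=0))

# Additivity: base value + sum of contributions = model output
reconstructed = base_value + contributions.sum(axis=1)
print(np.allclose(reconstructed, predict(X_toy)))
```

Positive contributions correspond to the red arrows in the force plot and negative ones to the blue arrows.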

Let's now take a more general look at the most important features.

The plot below sorts features by the sum of SHAP value magnitudes over all samples, and uses SHAP values to show the distribution of the impacts each feature has on the model output. The color represents the feature value (red high, blue low). 

For example, a high value of `dist_from_center` reduces the prediction by 0.50 from the base value, whereas a high value for the number of extra people an apartment can accommodate boosts the prediction by more than 1.

In [None]:
shap.summary_plot(shap_values, X)



Taking a look at all the work done, we can make quite a few observations:


*   Our accuracy is not enough to provide a good estimate of the weekly price, but it's good enough to give a range within which you should rent out your house.
*   To improve our accuracy, we need more data. When people judge an Airbnb apartment, they also take into account the pictures the host has published, the text of the reviews, the description the host gives, and also the size. I believe that with all this extra data and some text mining we could get to 80-90% accuracy.
*   The features that are the most important in estimating the price are the ones we expected. The only surprise is that `number_of_reviews` is inversely correlated with the price. I expected the opposite (the more reviews I have on Airbnb, the higher the price I can set), but what probably happens is that houses priced too high are not booked and therefore collect fewer reviews.
*   Features such as Kitchen, Heating, Washer and Wi-Fi, which we expected to be important, turned out not to be that interesting because almost all of the hosts have them.

