## Introduction

The names of the housing unit types are specific to the Italian market, so here is a short glossary with their English translations:
- Monolocale: studio flat (one-room apartment)
- Bilocale: two-room apartment
- Trilocale: three-room apartment
- Quadrilocale: four-room apartment
- Appartamento: flat/apartment (four or more rooms)
- Attico: penthouse
- Villa: house/villa
- Palazzo: building/palace
- Mansarda: attic apartment (under a mansard roof)
- Loft: loft apartment
- Terratetto: a typical Italian building of the early 20th century. The name means "from the ground to the roof"
- Open space: open-space apartment
- Casale: farmhouse

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import math
import time

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score, mean_absolute_percentage_error, mean_squared_error

from xgboost import XGBRegressor 

from lightgbm import LGBMRegressor

import warnings
warnings.filterwarnings('ignore')

In [None]:
df_italia = pd.read_csv('D002_CLEANED_DATASET_RENT_ITALY.csv')

df_italia

## Remove outliers and suspicious values

Outliers are removed using the IQR, and some suspicious values are deleted. 
By suspicious values I mean houses/flats with:
- Rent price < 250 €, because they are probably daily or weekly rents in B&Bs, while I am focusing on monthly rents
- Surface < 20 m², because they are more likely to be scams, since they are too small to be considered livable
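
As a minimal sketch of the IQR rule mentioned above, the snippet below computes the usual 1.5 × IQR fences and keeps only the values inside them. The helper name `iqr_bounds` and the rent values are made up for illustration and do not come from the dataset.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical monthly rents in euros; the last one is an obvious outlier
prices = np.array([500, 600, 650, 700, 750, 800, 900, 5000])

lower, upper = iqr_bounds(prices)
kept = prices[(prices >= lower) & (prices <= upper)]
print(lower, upper, kept)
```

Because the fences are computed from quartiles, a single extreme rent barely moves them, which is why this rule is a common first pass on skewed price data.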

In [None]:
plt.boxplot(df_italia['Price'])
plt.title('Price');

In [None]:
plt.hist(df_italia['Price'], bins = 90)
plt.title('Price distribution before outliers removal');

In [None]:
plt.boxplot(df_italia['Surface'])
plt.title('Surface');

In [None]:
plt.hist(df_italia['Surface'], bins = 30)
plt.title('Surface distribution before outliers removal');

In [None]:
len(df_italia[df_italia['Price']<250]), len(df_italia[df_italia['Surface']<20])

In [None]:
# Remove suspicious values that could be scam listings or daily/weekly rents in B&Bs
df_italia = df_italia[(df_italia['Surface'] >= 20) & (df_italia['Price'] >= 250)]

In [None]:
df_italia.shape

In [None]:
# def remove_outliers_iqr(df,columns):
#     for col in columns:
#         q1 = df[col].quantile(0.25)
#         q3 = df[col].quantile(0.75)
#         iqr = q3 - q1
#         lower_bound = q1 - 1.5 * iqr
#         upper_bound = q3 + 1.5 * iqr
#         df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
#     return df


# columns_to_check = ['Price','Surface']
# df_cleaned = remove_outliers_iqr(df_italia, columns_to_check)

# df_cleaned.head()

In [None]:
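# The IQR filter above is commented out, so df_cleaned keeps every row that passed the threshold filter
df_cleaned = df_italia.copy()
df_cleaned.shape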

In [None]:
plt.hist(df_cleaned['Price'], bins = 30)
plt.title('Price distribution after outliers removal');

In [None]:
plt.hist(df_cleaned['Surface'], bins = 30)
plt.title('Surface distribution after outliers removal');

# Model Building

## Data preparation

The machine learning algorithms used for model building are Decision Tree, Random Forest, XGBoost and LGBM; the last two are gradient-boosting algorithms. Each of them provides a regressor, which was used to predict the rent price.

In particular, LGBM has a specific data preparation step that requires converting the numerical variables to float.
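
The float conversion can be sketched as follows on a toy frame. The column names mirror the ones used in this notebook, while the rows and city names are invented for illustration; this is only one way to do the cast.

```python
import pandas as pd

# Toy rows, not taken from the real dataset
toy = pd.DataFrame({
    "Surface": [45, 80, 120],
    "floor": [1, 3, 0],
    "num_rooms": [2, 3, 4],
    "City": ["Roma", "Milano", "Torino"],
})

numeric_cols = ["Surface", "floor", "num_rooms"]

# Cast only the numeric columns to float and leave the others untouched
toy = toy.astype({col: float for col in numeric_cols})

# Text columns can be stored as pandas 'category' dtype
toy["City"] = toy["City"].astype("category")

print(toy.dtypes)
```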

In [None]:
df_cleaned.info()


In [None]:
df_cleaned.select_dtypes(include='object').columns

In [None]:
df_cleaned.select_dtypes(include=['float64','int64']).columns

In [None]:
categorical_cols = ['City', 
                    'Housing_unit', 
                    'city_size', 
                    'macroregion', 
                    'rent_bracket',
                    'surface_bracket']

numeric_cols = ['Surface', 
                'floor', 
                'num_rooms']

features = categorical_cols + numeric_cols

target = ['Price']

In [None]:
df_cleaned[categorical_cols] = df_cleaned[categorical_cols].astype('category')
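
scikit-learn's tree models expect numeric input, so if the categorical columns still hold text labels, one possible approach is to replace each category with its integer code. The sketch below assumes a toy frame with invented values; note that integer codes impose an arbitrary order on the categories, which tree models usually tolerate but which is still a simplification.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "City": ["Roma", "Milano", "Roma"],
    "Surface": [45.0, 80.0, 60.0],
})
toy["City"] = toy["City"].astype("category")

# Work on a copy so the labelled version stays available for inspection
encoded = toy.copy()
encoded["City"] = encoded["City"].cat.codes  # categories are sorted, then numbered from 0

print(dict(enumerate(toy["City"].cat.categories)))
print(encoded)
```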

In [None]:
X = df_cleaned.drop(columns='Price')
Y = df_cleaned['Price']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.9, random_state = 42)
X_train.shape, X_test.shape, X.shape

In [None]:
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, train_size=0.9, random_state = 42)
X_train.shape, X_val.shape, X.shape
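
Applying a 90% split twice leaves roughly 81% of the rows for training, 9% for validation and 10% for testing. The arithmetic can be checked with a small sketch; the helper `nested_split_sizes` is hypothetical, and the exact row counts returned by `train_test_split` may differ by one because of rounding.

```python
def nested_split_sizes(n_rows, first_train=0.9, second_train=0.9):
    """Approximate sizes after two consecutive train/hold-out splits."""
    train_val = int(n_rows * first_train)   # rows kept after the first split
    test = n_rows - train_val               # held-out test rows
    train = int(train_val * second_train)   # rows kept after the second split
    val = train_val - train                 # validation rows
    return train, val, test

train, val, test = nested_split_sizes(10_000)
print(train, val, test)
```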

## Decision Tree

In [None]:
dt = DecisionTreeRegressor()

start_time = time.time()


param_grid ={
    'max_depth':[2,4,6,8],
    'min_samples_split': [2,4,6,8],
    'min_samples_leaf': [1,2,3,4],
    'max_features': [None, 'sqrt', 'log2'],  # 'auto' (all features) was removed in scikit-learn 1.3; None is equivalent
    'random_state': [0,42]
}

grid_search = GridSearchCV(dt, param_grid, cv=5, scoring = 'neg_mean_squared_error') 

grid_search.fit(X_train , Y_train )

print(grid_search.best_params_)

end_time = time.time()
duration = end_time - start_time
print(duration/60)

In [None]:
# max_features=None uses all features, which is what the removed 'auto' option meant for regressors
dt = DecisionTreeRegressor(max_depth=8, max_features=None, min_samples_leaf=4, min_samples_split=2, random_state=0)
dt.fit(X_train , Y_train )

y_train_pred = dt.predict(X_train)

mae = round(mean_absolute_error(Y_train,y_train_pred), 2) 
mse = round(mean_squared_error(Y_train,y_train_pred), 2)
mape = round(mean_absolute_percentage_error(Y_train,y_train_pred)*100,2)
r2 =round(r2_score(Y_train,y_train_pred)*100,2)
rmse = round( math.sqrt(mse), 2)


print(f"MAE is: {mae}")
print(f"MSE is: {mse}")
print(f"MAPE is: {mape}")
print(f"R2 SCORE is: {r2}")
print(f"RMSE is: {rmse}")

# DataFrame.append was removed in pandas 2.0, so the metrics table is built from a list of dicts
df_metrics = pd.DataFrame([{'Algorithm':'Decision Tree Regressor TRAIN', 'MAE €':mae, 'MSE €': mse,  'RMSE €': rmse, 'MAPE %': mape, 'R2 Score %': r2}])

In [None]:
y_pred = dt.predict(X_val)

mae = round(mean_absolute_error(Y_val,y_pred),2)
mse = round(mean_squared_error(Y_val,y_pred),2)
mape = round(mean_absolute_percentage_error(Y_val,y_pred)*100,2)
r2 = round(r2_score(Y_val,y_pred)*100,2)
rmse = round(math.sqrt(mse),2)


print(f"MAE is: {mae}")
print(f"MSE is: {mse}")
print(f"MAPE is: {mape}")
print(f"R2 SCORE is: {r2}")
print(f"RMSE is: {rmse}")


df_metrics = pd.concat([df_metrics, pd.DataFrame([{'Algorithm':'Decision Tree Regressor TEST', 'MAE €':mae, 'MSE €': mse,  'RMSE €': rmse, 'MAPE %': mape, 'R2 Score %': r2}])], ignore_index=True)

## Random Forest

In [None]:
RANDOM_STATE = 42

rf = RandomForestRegressor(random_state=RANDOM_STATE)

start_time = time.time()

param_grid = {
    'n_estimators': [50, 100, 150],
    'min_samples_split': [2, 4, 6, 8],
    'min_samples_leaf': [2, 4, 6, 8],
    # A float max_features is a fraction of the features and must lie in (0, 1]
    'max_features': [1.0, 'sqrt', 'log2'],
    'max_depth': [5, 10, 20],
}

grid_search = GridSearchCV(rf, param_grid, cv=5, scoring = 'neg_mean_squared_error') 

grid_search.fit(X_train , Y_train )

print(grid_search.best_params_)

end_time = time.time()
duration = end_time - start_time
print(duration/60)

In [None]:
RANDOM_STATE = 42

rf = RandomForestRegressor(n_estimators= 150, min_samples_split= 2 , min_samples_leaf= 4, max_features= 1.0, max_depth = 10, random_state=RANDOM_STATE)
rf.fit(X_train , Y_train )

y_train_pred = rf.predict(X_train)

mae = round(mean_absolute_error(Y_train,y_train_pred), 2) 
mse = round(mean_squared_error(Y_train,y_train_pred) , 2) 
mape = round(mean_absolute_percentage_error(Y_train,y_train_pred)*100,2)
r2 = round(r2_score(Y_train,y_train_pred)*100,2)
rmse = round(math.sqrt(mse) , 2) 


print(f"MAE is: {mae}")
print(f"MSE is: {mse}")
print(f"MAPE is: {mape}")
print(f"R2 SCORE is: {r2}")
print(f"RMSE is: {rmse}")

df_metrics = pd.concat([df_metrics, pd.DataFrame([{'Algorithm':'Random Forest Regressor TRAIN', 'MAE €':mae, 'MSE €': mse,  'RMSE €': rmse, 'MAPE %': mape, 'R2 Score %': r2}])], ignore_index=True)

In [None]:
y_pred = rf.predict(X_val)

mae = round( mean_absolute_error(Y_val,y_pred), 2) 
mse = round(mean_squared_error(Y_val,y_pred), 2) 
mape = round(mean_absolute_percentage_error(Y_val,y_pred)*100,2)
r2 = round(r2_score(Y_val,y_pred)*100, 2)
rmse = round(math.sqrt(mse), 2) 


print(f"MAE is: {mae}")
print(f"MSE is: {mse}")
print(f"MAPE is: {mape}")
print(f"R2 SCORE is: {r2}")
print(f"RMSE is: {rmse}")


df_metrics = pd.concat([df_metrics, pd.DataFrame([{'Algorithm':'Random Forest Regressor TEST', 'MAE €':mae, 'MSE €': mse,  'RMSE €': rmse, 'MAPE %': mape, 'R2 Score %': r2}])], ignore_index=True)

## XGBoost

In [None]:
xgb = XGBRegressor()

start_time = time.time()


param_grid ={
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.01, 0.1, 1]
}

grid_search = GridSearchCV(xgb, param_grid, cv=5, scoring = 'neg_mean_squared_error') 

grid_search.fit(X_train , Y_train )

print(grid_search.best_params_)

end_time = time.time()
duration = end_time - start_time
print(duration/60)

In [None]:
# learning_rate is the parameter name used in the grid search above (eta is its alias)
xgb = XGBRegressor(n_estimators=150, learning_rate=0.1)
xgb.fit(X_train , Y_train )

y_train_pred = xgb.predict(X_train)

mae = round( mean_absolute_error(Y_train,y_train_pred), 2)
mse = round(mean_squared_error(Y_train,y_train_pred), 2)
mape = round(mean_absolute_percentage_error(Y_train,y_train_pred)*100,2)
r2 = round(r2_score(Y_train,y_train_pred)*100,2)
rmse = round(math.sqrt(mse), 2)


print(f"MAE is: {mae}")
print(f"MSE is: {mse}")
print(f"MAPE is: {mape}")
print(f"R2 SCORE is: {r2}")
print(f"RMSE is: {rmse}")

df_metrics = pd.concat([df_metrics, pd.DataFrame([{'Algorithm':'XGB Regressor TRAIN', 'MAE €':mae, 'MSE €': mse,  'RMSE €': rmse, 'MAPE %': mape, 'R2 Score %': r2}])], ignore_index=True)

In [None]:
# Predict on the validation split so that the predictions line up with Y_val
y_pred = xgb.predict(X_val)

mae = round(mean_absolute_error(Y_val,y_pred), 2)
mse = round(mean_squared_error(Y_val,y_pred), 2)
mape = round(mean_absolute_percentage_error(Y_val,y_pred)*100,2)
r2 = round(r2_score(Y_val,y_pred)*100, 2)
rmse = round(math.sqrt(mse) , 2)


print(f"MAE is: {mae}")
print(f"MSE is: {mse}")
print(f"MAPE is: {mape}")
print(f"R2 SCORE is: {r2}")
print(f"RMSE is: {rmse}")

df_metrics = pd.concat([df_metrics, pd.DataFrame([{'Algorithm':'XGB Regressor TEST', 'MAE €':mae, 'MSE €': mse,  'RMSE €': rmse, 'MAPE %': mape, 'R2 Score %': r2}])], ignore_index=True)

## LGBM

In [None]:
lgbm = LGBMRegressor()

param_grid ={
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.01, 0.1, 1]
}


grid_search = GridSearchCV(lgbm, param_grid, cv=5, scoring = 'neg_mean_squared_error') 

grid_search.fit(X_train , Y_train )

print(grid_search.best_params_)

In [None]:
X_train.select_dtypes(include='object').columns

In [None]:
# With the LGBM algorithm, numerical variables need to be transformed into float variables
# X_train = X_train.astype(float)
# X_val = X_val.astype(float)
# Y_train = Y_train.astype(float)
# Y_val = Y_val.astype(float)


lgbm = LGBMRegressor(n_estimators= 150 ,learning_rate=0.1)
lgbm.fit(X_train , Y_train )

y_train_pred = lgbm.predict(X_train)

mae = round( mean_absolute_error(Y_train,y_train_pred), 2)
mse = round(mean_squared_error(Y_train,y_train_pred), 2)
mape = round(mean_absolute_percentage_error(Y_train,y_train_pred)*100,2)
r2 = round(r2_score(Y_train,y_train_pred)*100,2)
rmse = round(math.sqrt(mse), 2)


print(f"MAE is: {mae}")
print(f"MSE is: {mse}")
print(f"MAPE is: {mape}")
print(f"R2 SCORE is: {r2}")
print(f"RMSE is: {rmse}")

df_metrics = pd.concat([df_metrics, pd.DataFrame([{'Algorithm':'LGBM Regressor TRAIN', 'MAE €':mae, 'MSE €': mse,  'RMSE €': rmse, 'MAPE %': mape, 'R2 Score %': r2}])], ignore_index=True)

In [None]:
# Predict on the validation split so that the predictions line up with Y_val
y_pred = lgbm.predict(X_val)

mae = round(mean_absolute_error(Y_val,y_pred), 2)
mse = round(mean_squared_error(Y_val,y_pred), 2)
mape = round(mean_absolute_percentage_error(Y_val,y_pred)*100,2)
r2 = round(r2_score(Y_val,y_pred)*100, 2)
rmse =round( math.sqrt(mse), 2)


print(f"MAE is: {mae}")
print(f"MSE is: {mse}")
print(f"MAPE is: {mape}")
print(f"R2 SCORE is: {r2}")
print(f"RMSE is: {rmse}")

df_metrics = pd.concat([df_metrics, pd.DataFrame([{'Algorithm':'LGBM Regressor TEST', 'MAE €':mae, 'MSE €': mse,  'RMSE €': rmse, 'MAPE %': mape, 'R2 Score %': r2}])], ignore_index=True)

In [None]:
df_metrics

##### Observation:

We can see that the XGB Regressor model performs best among the models tried: it obtained the lowest MAE (277.89 €), MSE (153864.50 €²) and RMSE (392.26 €), and the highest R2 score (62%).

Nevertheless, the absolute errors are high because of the high variance of the Price and Surface variables. Other possible reasons are that the 20 selected cities represent 20 different rental markets and that several housing unit types with distinctive characteristics are considered together. All of these may increase the Price variance.
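
To make the metrics compared above concrete, here is a small sketch that computes them by hand with NumPy. The function name `regression_report` and the three rents are invented for illustration; the notebook itself relies on the scikit-learn metric functions.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Compute MAE, MSE, RMSE, MAPE (%) and R2 from two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))                  # average absolute error, in euros
    mse = np.mean(errors ** 2)                     # average squared error, in euros squared
    rmse = np.sqrt(mse)                            # back to euros
    mape = np.mean(np.abs(errors / y_true)) * 100  # average relative error, in percent
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot                       # share of variance explained

    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape, "R2": r2}

# Hypothetical true and predicted monthly rents in euros
report = regression_report([500, 1000, 1500], [600, 900, 1500])
print(report)
```

Since MSE squares each error, a few badly predicted expensive listings can dominate it, which is one reason to read it alongside MAE and MAPE.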