# Modelling using supervised learning

The end goal is to predict the price per night of an Airbnb listing from several features.

We will use the dataset that was cleaned and preprocessed for modelling (scaling and feature selection with RFE).

Models to build and compare:
- Linear Regression
- KNN
- RandomForest

**Steps:**
- Build train/test sample data 
- Build models
- Get evaluation metrics for each model
- Compare them
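
The steps above repeat the same fit, predict and score pattern for every model, so they could be wrapped in one helper. The following is only a sketch: `evaluate_regressor` is a hypothetical name that is not used elsewhere in this notebook, and the data is a synthetic stand-in rather than the Airbnb dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate_regressor(model, X_train, X_test, y_train, y_test):
    """Fit `model` and return train/test R^2 plus test MSE and RMSE."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    return {
        "r2_train": model.score(X_train, y_train),
        "r2_test": r2_score(y_test, y_pred),
        "mse": mse,
        "rmse": float(np.sqrt(mse)),
    }


# Synthetic stand-in data (assumption: NOT the Airbnb dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=8)

demo_scores = evaluate_regressor(LinearRegression(), X_tr, X_te, y_tr, y_te)
print(demo_scores)
```

Returning a dictionary makes it easy to collect the scores of several models in one `pd.DataFrame` and compare them side by side.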

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

%matplotlib inline
sns.set()

In [None]:
df = pd.read_csv('../data/airbnb_paris_clean_RFE.csv')
print(df.shape)
df.head()

__________________
### Models set-up & train/test sampling

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score 
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_squared_log_error
from sklearn.metrics import explained_variance_score
from sklearn import preprocessing

In [None]:
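def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    # MAPE divides by the true values, so it is undefined if any of them is 0
    if (y_true == 0).any():
        raise ValueError("MAPE is undefined when y_true contains zeros")
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100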
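
As a quick sanity check of the metric, here is a small worked example with made-up numbers. It assumes the usual definition of MAPE, where each absolute error is divided by the true value; `mape_demo` is a hypothetical standalone helper, not the function used in the rest of the notebook.

```python
import numpy as np


def mape_demo(y_true, y_pred):
    """Mean absolute percentage error, in percent (hypothetical helper name)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if np.any(y_true == 0):
        raise ValueError("MAPE is undefined when y_true contains zeros")
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)


# Errors of 10 on 100, 20 on 200 and 40 on 400 are each 10 % of the true value,
# so the result should be about 10
demo_mape = mape_demo([100, 200, 400], [110, 180, 440])
print(demo_mape)
```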

In [None]:
X = df.drop('price', axis=1)
y = df.price

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=8)

# Creating several samples to test overfitting of models
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X, y, test_size=0.3, random_state=15)
X_train_3, X_test_3, y_train_3, y_test_3 = train_test_split(X, y, test_size=0.3, random_state=42)

print(X_train.shape, y_train.shape)
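
Drawing several random splits by hand is one way to check how stable a model is. K-fold cross-validation is a common alternative that does the same thing more systematically. The sketch below uses synthetic data as a stand-in (an assumption, not the Airbnb dataset) and scikit-learn's `KFold` and `cross_val_score`.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption: NOT the Airbnb dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Five shuffled folds: each sample is used for testing exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=8)
cv_scores = cross_val_score(
    KNeighborsRegressor(n_neighbors=3), X_demo, y_demo, cv=cv, scoring="r2"
)

print("R^2 per fold:", cv_scores)
print("Mean and standard deviation:", cv_scores.mean(), cv_scores.std())
```

A large spread between folds would suggest that the score depends strongly on which rows end up in the test sample.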

_________________
### Linear Regression

In [None]:
# building the model
model_lin = LinearRegression()
model_lin.fit(X_train, y_train)
print('R^2 score for X_train:', model_lin.score(X_train, y_train),'\n')

# predicting values
y_pred_lin = model_lin.predict(X_test)

# checking evaluation metrics
print('R^2 score for X_test:',r2_score(y_test,y_pred_lin))
print('MSE:', mean_squared_error(y_test,y_pred_lin))
print('RMSE:', mean_squared_error(y_test,y_pred_lin, squared=False))
print('MAPE:',mape(y_test,y_pred_lin))

# Can't use RMSLE because of negative predicted values
# print('RMSLE:',(mean_squared_log_error(y_test,abs(y_pred_lin))**0.5),'\n')
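
One possible way around negative predictions, sketched here under the assumption that the target is strictly positive, is to fit the linear model on the logarithm of the target and map predictions back with the exponential. This uses scikit-learn's `TransformedTargetRegressor`; the data and the names `log_lin`, `y_pred_log_lin` and `rmsle_demo` are hypothetical and not part of this notebook.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import train_test_split

# Synthetic, strictly positive and right-skewed target (assumption: NOT the Airbnb data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = np.exp(0.5 * X_demo[:, 0] - 0.3 * X_demo[:, 1] + rng.normal(scale=0.2, size=300))

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=8)

# The regressor is trained on log(y); predictions are exp(...) and therefore positive
log_lin = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)
log_lin.fit(X_tr, y_tr)
y_pred_log_lin = log_lin.predict(X_te)

rmsle_demo = float(np.sqrt(mean_squared_log_error(y_te, y_pred_log_lin)))
print("Negative predictions:", int((y_pred_log_lin < 0).sum()))
print("RMSLE:", rmsle_demo)
```

Because the exponential is always positive, RMSLE can be computed without taking absolute values of the predictions. Whether this fits the scaled price column used here would need to be checked.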


In [None]:
unscaled_y = y_test*100
unscaled_y_pred = y_pred_lin*100

# checking evaluation metrics
print('R^2 score for X_test:',r2_score(unscaled_y,unscaled_y_pred))
print('MSE:', mean_squared_error(unscaled_y,unscaled_y_pred))
print('RMSE:', mean_squared_error(unscaled_y,unscaled_y_pred, squared=False))
print('MAPE:',mape(unscaled_y,unscaled_y_pred))

In [None]:
y_pred_lin[y_pred_lin<0]

In [None]:
# Checking predictions
plt.scatter(y_test,y_pred_lin)
plt.title('Prediction Linearity')
plt.show()

# Checking residuals
resid=y_test-y_pred_lin

sns.histplot(resid, kde=True)  # distplot is deprecated in recent seaborn versions
plt.title('Distribution of residuals')
plt.show()

___________________
### KNN

In [None]:
# building model
model_knn = KNeighborsRegressor(n_neighbors=3)
model_knn.fit(X_train, y_train)
print("Sample 1: X_train, X_test")
print('R^2 score for X_train:', model_knn.score(X_train, y_train),'\n')

# predicting values
y_pred_knn = model_knn.predict(X_test)

# checking evaluation metrics
print('R^2 score for X_test:', model_knn.score(X_test, y_test))
print('MSE:', mean_squared_error(y_test,y_pred_knn))
print('RMSE:', mean_squared_error(y_test,y_pred_knn, squared=False))
print('MAPE:',mape(y_test,y_pred_knn))
print('RMSLE:',(mean_squared_log_error(y_test, y_pred_knn)**0.5),'\n')


## Testing with other samples

# building model
model_knn_2 = KNeighborsRegressor(n_neighbors=3)
model_knn_2.fit(X_train_2, y_train_2)
print("Sample 2: X_train_2, X_test_2")
print('R^2 score for X_train:', model_knn_2.score(X_train_2, y_train_2),'\n')

# predicting values
y_pred_knn_2 = model_knn_2.predict(X_test_2)

# checking evaluation metrics
print('R^2 score for X_test:', model_knn_2.score(X_test_2, y_test_2))
print('MSE:', mean_squared_error(y_test_2,y_pred_knn_2))
print('RMSE:', mean_squared_error(y_test_2,y_pred_knn_2, squared=False))
print('MAPE:',mape(y_test_2,y_pred_knn_2))
print('RMSLE:',(mean_squared_log_error(y_test_2, y_pred_knn_2)**0.5),'\n')


## Testing with other samples

# building model
model_knn_3 = KNeighborsRegressor(n_neighbors=3)
model_knn_3.fit(X_train_3, y_train_3)
print("Sample 3: X_train_3, X_test_3")
print('R^2 score for X_train:', model_knn_3.score(X_train_3, y_train_3),'\n')

# predicting values
y_pred_knn_3 = model_knn_3.predict(X_test_3)

# checking evaluation metrics
print('R^2 score for X_test:', model_knn_3.score(X_test_3, y_test_3))
print('MSE:', mean_squared_error(y_test_3,y_pred_knn_3))
print('RMSE:', mean_squared_error(y_test_3,y_pred_knn_3, squared=False))
print('MAPE:',mape(y_test_3,y_pred_knn_3))
print('RMSLE:',(mean_squared_log_error(y_test_3, y_pred_knn_3)**0.5),'\n')

In [None]:
# Creating a for loop to find the best n_neighbors [WARNING BEFORE RUNNING - IT TAKES TIME]

r_score_train = [] # Storing results in list
r_score_test = []
rmse_test = []

for i in range(3,21,2):
    # building model
    model_knn = KNeighborsRegressor(n_neighbors=i)
    model_knn.fit(X_train, y_train)
    print('Number of neighbors:',i)
    print('R^2 score for X_train:', model_knn.score(X_train, y_train),'\n')
    r_score_train.append(model_knn.score(X_train, y_train))
    
    # predicting values
    y_pred_knn = model_knn.predict(X_test)

    # checking evaluation metrics
    r_score_test.append(model_knn.score(X_test, y_test))
    print('R^2 score for X_test:', model_knn.score(X_test, y_test))
    print('MSE:', mean_squared_error(y_test,y_pred_knn))
    rmse_test.append(mean_squared_error(y_test,y_pred_knn, squared=False))
    print('RMSE:', mean_squared_error(y_test,y_pred_knn, squared=False))
    print('MAPE:',mape(y_test,y_pred_knn))
    print('RMSLE:',(mean_squared_log_error(y_test, y_pred_knn)**0.5),'\n')
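
The loop above scores each candidate on the test sample, so the selected `n_neighbors` may look slightly better than it really is. A cross-validated grid search is one alternative that only uses the data it is given for the selection. The sketch below runs on synthetic stand-in data (an assumption) and also tries both weighting schemes.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption: NOT the Airbnb dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Same candidate values as the manual loop: 3, 5, ..., 19
param_grid = {
    "n_neighbors": list(range(3, 21, 2)),
    "weights": ["uniform", "distance"],
}

search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximises, hence the sign
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Cross-validated RMSE:", -search.best_score_)
```

In this notebook the search would be fitted on `X_train` and `y_train`, keeping `X_test` for a final check of the selected model.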

In [None]:
# Drawing a graph to show model performance depending on the number of neighbors

nb_neighbors = list(range(3,21,2))

fig, ax1 = plt.subplots(figsize=(14,5))

color = 'tab:red'
ax1.plot(nb_neighbors, r_score_test, linestyle='-', marker='o', color=color)
ax1.set_xlabel('Number of neighbors')
ax1.set_ylabel('R^2 Score', color=color)
ax1.tick_params(axis='y', labelcolor=color)

#y0, y1 = ax1.get_ylim()
#ax1.vlines(x=23,ymin=y0,ymax=y1, linestyle='dashed', label='Best Shape = [20,23]')
#ax1.vlines(x=20,ymin=y0,ymax=y1, linestyle='dashed')
#ax1.legend()

ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis

color = 'tab:blue'
ax2.set_ylabel('RMSE', color=color)
ax2.plot(nb_neighbors, rmse_test, linestyle='-', marker='o', color=color)
ax2.tick_params(axis='y', labelcolor=color)

fig.suptitle('KNN Model performance depending on neighbors', fontsize=14)
plt.savefig('../img/KNN_model_performance.png')
plt.show()

In [None]:
# Creating another visualization to show the overfitting effect 

plt.plot(nb_neighbors, r_score_train, color='tab:red', label='r_score X_train')
plt.plot(nb_neighbors, r_score_test, color='tab:blue', label='r_score X_test')
plt.xlabel('Number of neighbors')
plt.ylabel('R^2 score')
plt.title('Comparison of R^2 score for Train and Test samples', fontsize=14)
plt.legend()
plt.savefig('../img/comparison_rscore_KNN.png')
plt.show()

In [None]:
# Rebuilding the best model

model_knn = KNeighborsRegressor(n_neighbors=19)
model_knn.fit(X_train, y_train)
print('R^2 score for X_train:', model_knn.score(X_train, y_train),'\n')

# predicting values
y_pred_knn = model_knn.predict(X_test)

# checking evaluation metrics
print('R^2 score for X_test:', model_knn.score(X_test, y_test))
print('MSE:', mean_squared_error(y_test,y_pred_knn))
print('RMSE:', mean_squared_error(y_test,y_pred_knn, squared=False))
print('MAPE:',mape(y_test,y_pred_knn))
print('RMSLE:',(mean_squared_log_error(y_test, y_pred_knn)**0.5),'\n')


# Testing with weights on distance
    
model_knn_w = KNeighborsRegressor(n_neighbors=19, weights='distance')
model_knn_w.fit(X_train, y_train)
print('R^2 score for X_train with weighted distance:', model_knn_w.score(X_train, y_train),'\n')

# predicting values
y_pred_knn_w = model_knn_w.predict(X_test)

# checking evaluation metrics
print('R^2 score for X_test with weighted distance:', model_knn_w.score(X_test, y_test))
print('MSE:', mean_squared_error(y_test,y_pred_knn_w))
print('RMSE:', mean_squared_error(y_test,y_pred_knn_w, squared=False))
print('MAPE:',mape(y_test,y_pred_knn_w))
print('RMSLE:',(mean_squared_log_error(y_test, y_pred_knn_w)**0.5),'\n')

In [None]:
y_pred_knn[y_pred_knn<0]

In [None]:
# Checking predictions
plt.scatter(y_test,y_pred_knn)
plt.title('Prediction Linearity')
plt.show()

# Checking residuals
resid=y_test-y_pred_knn

sns.histplot(resid, kde=True)  # distplot is deprecated in recent seaborn versions
plt.title('Distribution of residuals')
plt.show()

____________________
### RandomForest

In [None]:
# Testing Random Forest

# building the model with the default number of estimators
model_rf = RandomForestRegressor()
model_rf.fit(X_train, y_train)
print("R^2 score for X_train:", model_rf.score(X_train, y_train),'\n')

# predicting values 
y_pred_rf = model_rf.predict(X_test)

# checking evaluation metrics
print('R^2 score for X_test:', model_rf.score(X_test, y_test))
print('MSE:', mean_squared_error(y_test,y_pred_rf))
print('RMSE:', mean_squared_error(y_test,y_pred_rf, squared=False))
print('MAPE:',mape(y_test,y_pred_rf))
print('RMSLE:',(mean_squared_log_error(y_test, y_pred_rf)**0.5),'\n')


# building the model with n_estimators = 1000
model_rf_2 = RandomForestRegressor(n_estimators=1000)
model_rf_2.fit(X_train, y_train)
print("R^2 score for X_train:", model_rf_2.score(X_train, y_train),'\n')

# predicting values 
y_pred_rf_2 = model_rf_2.predict(X_test)

# checking evaluation metrics
print('R^2 score for X_test:', model_rf_2.score(X_test, y_test))
print('MSE:', mean_squared_error(y_test,y_pred_rf_2))
print('RMSE:', mean_squared_error(y_test,y_pred_rf_2, squared=False))
print('MAPE:',mape(y_test,y_pred_rf_2))
print('RMSLE:',(mean_squared_log_error(y_test, y_pred_rf_2)**0.5),'\n')

In [None]:
plt.scatter(y_test,y_pred_rf)
plt.title('Prediction Linearity')
plt.show()

# Checking residuals

resid=y_test-y_pred_rf

sns.histplot(resid, kde=True)  # distplot is deprecated in recent seaborn versions
plt.title('Distribution of residuals')
plt.show()

In [None]:
df_original = pd.read_csv('../data/airbnb_paris_clean_wo_dummies_feat.csv')
print(df_original.shape)
df_original.head()

In [None]:
# Building samples
X_orig = df_original.drop('price',axis=1)
y_orig = df_original.price

X_train_orig, X_test_orig, y_train_orig, y_test_orig = train_test_split(X_orig, y_orig, test_size=0.3, random_state=8)

# building the model 
model_rf = RandomForestRegressor()
model_rf.fit(X_train_orig, y_train_orig)
print("R^2 score for X_train:", model_rf.score(X_train_orig, y_train_orig),'\n')

# predicting values 
y_pred_rf = model_rf.predict(X_test_orig)

# checking evaluation metrics
print('R^2 score for X_test:', model_rf.score(X_test_orig, y_test_orig))
print('MSE:', mean_squared_error(y_test_orig,y_pred_rf))
print('RMSE:', mean_squared_error(y_test_orig,y_pred_rf, squared=False))
print('MAPE:',mape(y_test_orig,y_pred_rf))
print('RMSLE:',(mean_squared_log_error(y_test_orig, y_pred_rf)**0.5),'\n')



### Comments

- At first sight, Linear Regression and KNN look like the best models because their results are almost the same. Linear Regression is not overfitted (from the beginning), but not all of its assumptions are met.
- However, Linear Regression predicted negative values (is that wrong?)
- The KNN and Random Forest models seem to be overfitted on the training sample, even though I managed to reduce the overfitting in KNN by increasing the number of neighbors
- The RMSE is about double the MSE. Since RMSE is the square root of MSE, this mainly reflects the scaled target (an MSE below 1); outliers still in the dataset likely inflate both

**Possible improvements:**
- Try Random Forest with dataset without dummies
- Try other preprocessing methods (Sequential Selection, PCA and other scaling) to improve the results
- Try embedded feature selection methods (Lasso)
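
As an illustration of the Lasso idea in the list above, here is a minimal sketch of embedded feature selection. It assumes standardised features and uses synthetic data, so the number of features kept says nothing about the Airbnb dataset. `LassoCV` picks the regularisation strength by cross-validation and `SelectFromModel` keeps the features whose coefficients were not shrunk to (almost) zero.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

# Synthetic stand-in data: 20 features, only 5 of them informative (assumption)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=20, n_informative=5, noise=5.0, random_state=0
)

# Fit a Lasso whose alpha is chosen by 5-fold cross-validation
lasso = LassoCV(cv=5, random_state=0).fit(X_demo, y_demo)

# Keep only the features with a non-negligible coefficient
selector = SelectFromModel(lasso, prefit=True)
kept = selector.get_support()
X_selected = selector.transform(X_demo)

print(int(kept.sum()), "features kept out of", X_demo.shape[1])
```

The reduced matrix could then be passed to the same train/test split and models as the RFE-selected dataset for comparison.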

**Next Steps:**
- ~~Check why there are negative values in prediction values~~
- ~~Change scaling method for Price~~
- ~~Confirm the overfitting of models~~
- Fix Random Forest without dummies
- ~~Try Random Forest with an estimator = 1000~~
- ~~Build a for loop to find the best n_neighbors for KNN (between 3 and 21)~~
- Check assumptions of each model