# Modelling using supervised learning

The end goal is to predict the nightly price of an Airbnb listing from several features.

We use the dataset that was already cleaned and preprocessed for modelling (scaling and RFE feature selection).

Models to build and compare:
- Linear Regression
- KNN
- RandomForest

**Steps:**
- Build the train/test samples
- Build the models
- Compute evaluation metrics for each model
- Compare them
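
The steps above can be sketched as a single helper that fits each model and collects its metrics in one table. This is only an illustration under stated assumptions: the data comes from `make_regression` as a stand-in for the Airbnb features, and the helper name `evaluate` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the real dataset (assumption, not the Airbnb data)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=8)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=8)

def evaluate(model, X_tr, X_te, y_tr, y_te):
    """Fit the model, then return train/test R^2 and test RMSE (hypothetical helper)."""
    model.fit(X_tr, y_tr)
    y_hat = model.predict(X_te)
    return {
        'r2_train': model.score(X_tr, y_tr),
        'r2_test': r2_score(y_te, y_hat),
        'rmse_test': np.sqrt(mean_squared_error(y_te, y_hat)),
    }

models = {
    'LinearRegression': LinearRegression(),
    'KNN': KNeighborsRegressor(n_neighbors=5),
    'RandomForest': RandomForestRegressor(random_state=8),
}

# One row per model, one column per metric
results = pd.DataFrame(
    {name: evaluate(model, X_tr, X_te, y_tr, y_te) for name, model in models.items()}
).T
print(results)
```

Keeping the train and test R^2 side by side makes a large gap between them, a common sign of overfitting, easy to spot.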

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

%matplotlib inline
sns.set()

In [None]:
df = pd.read_csv('../data/airbnb_paris_clean_feat.csv')
print(df.shape)
df.head()

__________________
### Models set-up & train/test sampling

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score 
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_squared_log_error
from sklearn.metrics import explained_variance_score
from sklearn import preprocessing

In [None]:
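def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    # MAPE divides by the actual values, so it is undefined if any of them is 0
    if (y_true == 0).any():
        raise ValueError("MAPE is undefined when y_true contains zeros")
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100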
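
MAPE is conventionally defined with the actual value in the denominator: the mean of |y_true - y_pred| / |y_true|. As a cross-check for a hand-written version, scikit-learn (0.24 and later) ships `mean_absolute_percentage_error`, which returns a fraction rather than a percentage. The small arrays below are made-up values for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Made-up actual and predicted prices (illustrative values only)
y_true_demo = np.array([100.0, 50.0, 200.0])
y_pred_demo = np.array([110.0, 45.0, 180.0])

# Manual definition: errors are scaled by the actual values
manual_mape = np.mean(np.abs((y_true_demo - y_pred_demo) / y_true_demo)) * 100

# scikit-learn returns a fraction, so multiply by 100 to get percent
sklearn_mape = mean_absolute_percentage_error(y_true_demo, y_pred_demo) * 100

print(manual_mape, sklearn_mape)  # both are approximately 10 (percent)
```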

In [None]:
X = df.drop('price', axis=1)
y = df.price

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=8)

print(X_train.shape, y_train.shape)

_________________
### Linear Regression

In [None]:
# building the model
model_lin = LinearRegression()
model_lin.fit(X_train, y_train)
print('R^2 score for X_train:', model_lin.score(X_train, y_train),'\n')

# predicting values
y_pred_lin = model_lin.predict(X_test)

# checking evaluation metrics
print('R^2 score for X_test:',r2_score(y_test,y_pred_lin))
print('MSE:', mean_squared_error(y_test,y_pred_lin))
print('RMSE:', np.sqrt(mean_squared_error(y_test,y_pred_lin)))
print('MAPE:',mape(y_test,y_pred_lin))
# RMSLE needs non-negative inputs: negative predictions are clipped to 0
print('RMSLE:',(mean_squared_log_error(y_test,np.clip(y_pred_lin, 0, None))**0.5),'\n')


In [None]:
y_pred_lin[y_pred_lin<0]
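
A plain linear model has no constraint keeping its output positive, which is why negative prices can appear. One possible remedy, sketched here on synthetic data, is to fit the regression on the log of the target and map predictions back with the exponential, so every prediction is strictly positive. This assumes all prices are strictly positive; the data and the name `log_lin` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(8)

# Synthetic features and strictly positive "prices" (assumption, not the real data)
X_demo = rng.normal(size=(200, 3))
y_demo = np.exp(4 + 0.5 * X_demo[:, 0] + rng.normal(scale=0.3, size=200))

# The regressor is fitted on log(y); predictions are mapped back through exp
log_lin = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,
    inverse_func=np.exp,
)
log_lin.fit(X_demo, y_demo)
pred_demo = log_lin.predict(X_demo)

# exp() is always positive, so no prediction can be negative
print((pred_demo < 0).sum())
```

Simply clipping negative predictions to a minimum price is an alternative, but it only hides the symptom rather than changing what the model learns.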

In [None]:
# Checking predictions
plt.scatter(y_test,y_pred_lin)
plt.title('Prediction Linearity')
plt.show()

# Checking residuals
resid=y_test-y_pred_lin

print(resid.mean())
plt.plot(resid)
plt.title('Residuals Variance')
plt.show()

# histplot replaces the deprecated distplot
sns.histplot(resid, kde=True)
plt.title('Distribution of residuals')
plt.show()

___________________
### KNN

In [None]:
# building the model with k=3
model_knn = KNeighborsRegressor(n_neighbors=3)
model_knn.fit(X_train, y_train)
print('R^2 score for X_train:', model_knn.score(X_train, y_train),'\n')

# predicting values
y_pred_knn = model_knn.predict(X_test)

# checking evaluation metrics
print('R^2 score for X_test:', model_knn.score(X_test, y_test))
print('MSE:', mean_squared_error(y_test,y_pred_knn))
print('RMSE:', np.sqrt(mean_squared_error(y_test,y_pred_knn)))
print('MAPE:',mape(y_test,y_pred_knn))
print('RMSLE:',(mean_squared_log_error(y_test, y_pred_knn)**0.5),'\n')

In [None]:
# building the model with k=5
model_knn = KNeighborsRegressor(n_neighbors=5)
model_knn.fit(X_train, y_train)
print('R^2 score for X_train:', model_knn.score(X_train, y_train),'\n')

# predicting values
y_pred_knn = model_knn.predict(X_test)

# checking evaluation metrics
print('R^2 score for X_test:', model_knn.score(X_test, y_test))
print('MSE:', mean_squared_error(y_test,y_pred_knn))
print('RMSE:', np.sqrt(mean_squared_error(y_test,y_pred_knn)))
print('MAPE:',mape(y_test,y_pred_knn))
print('RMSLE:',(mean_squared_log_error(y_test, y_pred_knn)**0.5),'\n')
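
Rather than refitting by hand for each value of `n_neighbors`, the comparison can be automated with a cross-validated grid search. The sketch below uses synthetic data from `make_regression` as a stand-in for the real features, and the candidate values of k are arbitrary choices for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the real dataset (assumption)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=8)

# Try several neighbourhood sizes, scoring each with 5-fold cross-validated R^2
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={'n_neighbors': [3, 5, 7, 9, 11]},
    scoring='r2',
    cv=5,
)
search.fit(X_demo, y_demo)

print('best k:', search.best_params_['n_neighbors'])
print('mean CV R^2:', search.best_score_)
```

On the real data the search would be fitted on `X_train` and `y_train` only, keeping the test sample untouched for the final evaluation.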

In [None]:
# Checking predictions
plt.scatter(y_test,y_pred_knn)
plt.title('Prediction Linearity')
plt.show()

# Checking residuals
resid=y_test-y_pred_knn

print(resid.mean())
plt.plot(resid)
plt.title('Residuals Variance')
plt.show()

# histplot replaces the deprecated distplot
sns.histplot(resid, kde=True)
plt.title('Distribution of residuals')
plt.show()

____________________
### RandomForest

In [None]:
# building the model 
model_rf = RandomForestRegressor()
model_rf.fit(X_train, y_train)
print("R^2 score for X_train:", model_rf.score(X_train, y_train),'\n')

# predicting values 
y_pred_rf = model_rf.predict(X_test)

# checking evaluation metrics
print('R^2 score for X_test:', model_rf.score(X_test, y_test))
print('MSE:', mean_squared_error(y_test,y_pred_rf))
print('RMSE:', np.sqrt(mean_squared_error(y_test,y_pred_rf)))
print('MAPE:',mape(y_test,y_pred_rf))
print('RMSLE:',(mean_squared_log_error(y_test, y_pred_rf)**0.5),'\n')

In [None]:
plt.scatter(y_test,y_pred_rf)
plt.title('Prediction Linearity')
plt.show()

# Checking residuals

resid=y_test-y_pred_rf

print(resid.mean())
plt.plot(resid)
plt.title('Residuals Variance')
plt.show()

# histplot replaces the deprecated distplot
sns.histplot(resid, kde=True)
plt.title('Distribution of residuals')
plt.show()

### Comments

- At first sight, Linear Regression seems to be the best model: it has the highest test R^2, it is not overfitted, its errors (MAPE and MSE) are the smallest, and its predicted-versus-actual plot looks the most linear
- However, Linear Regression predicted 3 negative prices, which cannot be correct
- The KNN and Random Forest models seem to overfit the training sample
- The RMSE did not increase much, so we may assume that no outliers are strongly affecting the results

**Possible improvements:**
- Change the model parameters to see how they affect the results
- Try other preprocessing methods (sequential selection, PCA and other scaling methods) to improve the results
- Try embedded feature selection methods (Lasso)
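
As an illustration of the embedded approach, Lasso shrinks the coefficients of uninformative features to exactly zero while fitting, so the surviving non-zero coefficients act as the selected features. The sketch below is a minimal example on synthetic data (an assumption), with the penalty strength chosen by cross-validation through `LassoCV`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only 5 of which carry signal (assumption)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=20, n_informative=5, noise=10.0, random_state=8
)

# Scaling first puts all coefficients on a comparable footing for the L1 penalty
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=8))
lasso.fit(X_demo, y_demo)

coefs = lasso.named_steps['lassocv'].coef_
kept = np.flatnonzero(coefs)  # indices of features with non-zero coefficients
print(len(kept), 'of', X_demo.shape[1], 'features kept:', kept)
```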

**Next Steps:**
- Fix the negative values predicted by Linear Regression
- Fix the overfitting of the KNN and Random Forest models
- Check the model assumptions
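
For the overfitting step, one way to quantify the problem is to compare the mean train and validation R^2 across cross-validation folds, before and after constraining the model. The sketch below does this for a Random Forest on synthetic data; the data and the hyperparameter values (`max_depth=5`, `min_samples_leaf=5`) are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# Synthetic, fairly noisy stand-in for the real dataset (assumption)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=30.0, random_state=8)

candidates = {
    'default': RandomForestRegressor(n_estimators=100, random_state=8),
    # Shallower trees with larger leaves are less able to memorise the training set
    'constrained': RandomForestRegressor(
        n_estimators=100, max_depth=5, min_samples_leaf=5, random_state=8
    ),
}

gaps = {}
for name, model in candidates.items():
    cv = cross_validate(model, X_demo, y_demo, cv=5, scoring='r2', return_train_score=True)
    train_r2 = cv['train_score'].mean()
    valid_r2 = cv['test_score'].mean()
    gaps[name] = train_r2 - valid_r2  # a large gap suggests overfitting
    print(name, 'train R^2:', round(train_r2, 3), 'validation R^2:', round(valid_r2, 3))
```

A smaller gap is only an improvement if the validation score does not drop along with it, so both numbers should be read together.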