# House Prices Kaggle Competition


## Authors 
David Moreno Maldonado 100441714    
Inés Fernández Campos 100443936    

## Assignment

We start by importing the libraries we need and defining the parameters used throughout the assignment.

In [2]:
import pandas as pd
import numpy as np
import sys
import time
import math
import statistics as st
from sklearn import preprocessing, model_selection, tree, neighbors, metrics
from scipy.stats import uniform, randint as sp_randint
from skopt import BayesSearchCV
from skopt.space import Integer, Real, Categorical
import optuna

In [3]:
#MAIN PARAMETERS FOR THE ASSIGNMENT
budget = 100
random_state = 0
verbose = 0

#PARAMETERS FOR THE HYPER-PARAMETER TUNING
#Upper bounds are exclusive for sp_randint but inclusive for skopt and optuna
min_max_depth = 2
max_max_depth = 20
min_n_neigbors = 1
max_n_neigbors = 16

We also create a dataframe that will contain all information regarding the studied models for each different configuration applied.

In [4]:
#Dataframes with all the information of each model
summary = {
    'tree': pd.DataFrame(columns=['Time (sec)', 'Score (RMSE)', 'Min. samples split', 'Criterion', 'Max. depth']),
    'knn': pd.DataFrame(columns=['Time (sec)', 'Score (RMSE)', 'N. neighbors', 'Weights', 'P'])
}
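
`DataFrame.append`, which the cells below rely on, was deprecated in pandas 1.4 and removed in pandas 2.0. On a recent pandas version, one possible alternative is to collect each result in a plain dictionary and build the table once at the end. The sketch below assumes that approach; the `rows` dictionary and the placeholder values are illustrative and not part of the original notebook.

```python
import pandas as pd

# Hypothetical accumulator: one entry per tuning method (placeholder values).
rows = {}
rows['default'] = {'Time (sec)': 0.0, 'Score (RMSE)': 0.25, 'Criterion': 'mse'}
rows['random_search'] = {'Time (sec)': 1.2, 'Score (RMSE)': 0.22, 'Criterion': 'friedman_mse'}

# Build the summary table once; the dictionary keys become the row index.
tree_summary = pd.DataFrame.from_dict(rows, orient='index')

# Row label of the lowest RMSE, as used later to pick the best model.
best_method = tree_summary['Score (RMSE)'].idxmin()
print(tree_summary)
print(best_method)
```

Building the frame in one step also keeps the score column numeric, which `idxmin` needs.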


### Loading data

In the next cells we load the data and split it into four matrices: two for training (*x*, *y*) and two for the competition (*x_comp*, *y_comp*). We then standardize the input attributes, split the training matrices into train and test sets, and define the 2-fold cross-validation scheme used for the inner evaluation throughout the exercise.

In [5]:
#Loading data
data = pd.read_csv("kaggleCompetition.csv")
data = data.values

#Splitting data in the one used for training and the one used for the competition
x = data[0:1460, :-1]
y = data[0:1460, -1] 
x_comp = data[1460:,:-1] 
y_comp = data[1460:,-1]

In [6]:
#Standardize input attributes.
scaler = preprocessing.StandardScaler().fit(x) 
x = scaler.transform(x)
x_comp = scaler.transform(x_comp)

In [7]:
#Split in train/test sets using holdout 3/4 for training, 1/4 for testing
x_train, x_test, y_train, y_test = model_selection.train_test_split(x, y, train_size=0.75, random_state=0)

#Hyperparams evaluated by 2-fold CV (inner evaluation)
cv_grid = model_selection.KFold(n_splits=2, shuffle=True, random_state=random_state)
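
Because the scaler above is fitted on all 1460 rows before the split, the test rows and the validation folds contribute to the standardization statistics. The effect is often small, but one way to avoid it is to place the scaler inside a pipeline so that it is refitted on each training fold. A minimal sketch on synthetic data (the data and the names `x_demo`, `y_demo`, `pipe` and `cv_demo` are assumptions, not the competition data):

```python
import numpy as np
from sklearn import model_selection, neighbors, preprocessing
from sklearn.pipeline import make_pipeline

# Synthetic regression data standing in for the real attributes.
rng = np.random.RandomState(0)
x_demo = rng.normal(size=(200, 5))
y_demo = x_demo[:, 0] + 0.1 * rng.normal(size=200)

# The scaler is refitted on the training part of every fold,
# so the validation part never influences the mean and variance.
pipe = make_pipeline(preprocessing.StandardScaler(),
                     neighbors.KNeighborsRegressor())
cv_demo = model_selection.KFold(n_splits=2, shuffle=True, random_state=0)
scores = -model_selection.cross_val_score(
    pipe, x_demo, y_demo,
    scoring='neg_root_mean_squared_error', cv=cv_demo)
print(scores.mean())
```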

### Training models using **default parameters**

In this section we evaluate the performance of regression with decision trees and KNN using their default parameters.

We start with the decision tree and its default parameters: min_samples_split=2, max_depth=None and criterion='mse'.  
We evaluate it with 2-fold CV on the training split (the inner evaluation) and save the result to our summary dataframe.

In [24]:
#3.1.1 Decision Tree
np.random.seed(random_state)
tree_default = tree.DecisionTreeRegressor(random_state=random_state)

scores = -model_selection.cross_val_score(tree_default, x_train, y_train, scoring='neg_root_mean_squared_error', cv=cv_grid)

summary['tree'] = summary['tree'].append(pd.Series({
    'Time (sec)': 0, 
    'Score (RMSE)': scores.mean(),
    'Min. samples split': 2, 
    'Criterion': 'mse', 
    'Max. depth': 'None'
    },
    name='default'))

We also evaluate a KNN regressor with its default parameters: n_neighbors=5, weights='uniform', p=2, metric='minkowski'.  
As with the decision tree, we perform its inner evaluation with 2-fold CV on the training split and save the result to our summary dataframe.

In [9]:
#3.1.2 K Nearest neighbours
np.random.seed(random_state)
knn_default = neighbors.KNeighborsRegressor()
scores = -model_selection.cross_val_score(knn_default, x_train, y_train, scoring='neg_root_mean_squared_error', cv=cv_grid) 

summary['knn'] = summary['knn'].append(pd.Series({
    'Time (sec)': 0, 
    'Score (RMSE)': scores.mean(), 
    'N. neighbors': 5, 
    'Weights': 'uniform', 
    'P': 2
    }, 
    name='default'))

### Training models using **Random Search** tuning

In this section we evaluate the performance of regression with decision trees and KNN when random search is used to tune the hyper-parameters.

Random search requires a hyper-parameter search space, here called *param_grid*.  
For decision trees this grid holds three hyper-parameters to tune: min_samples_split, a real number between 0.0 and 1.0 (interpreted as a fraction of the training samples); criterion, either mse or friedman_mse; and max_depth, an integer between 2 and 19 (`sp_randint(2, 20)` excludes its upper bound).

Once the search space is defined, we set up the two-step method (hyper-parameter search followed by training with the best configuration), applied to the decision tree and evaluated using 2-fold CV over *budget* (100) iterations, with the negative root MSE as the performance measure.  
The two-step method is then trained and timed, and the best hyper-parameter values, together with the elapsed time, are saved in the corresponding summary dataframe.
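
One detail worth checking when defining such a search space is that `scipy.stats.randint(low, high)` excludes `high`, so a distribution built as `randint(2, 20)` never proposes a depth of 20. A quick check, using an arbitrary sample size and seed:

```python
from scipy.stats import randint, uniform

# Draw many candidate depths from the same kind of distribution used in the grid.
depths = randint(2, 20).rvs(size=10000, random_state=0)
print(depths.min(), depths.max())

# uniform(loc, scale) covers the interval [loc, loc + scale], here [0, 1].
fractions = uniform(0, 1).rvs(size=10000, random_state=0)
print(fractions.min() >= 0.0, fractions.max() <= 1.0)
```

The sampled depths stay within 2 and 19, so a wider range needs the upper argument raised by one.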

In [10]:
#3.2 Random search for decision tree hyper-parameter tuning
np.random.seed(random_state)
param_grid = {
    'min_samples_split': uniform(0, 1),
    'criterion': ['mse','friedman_mse'], 
    'max_depth': sp_randint(min_max_depth, max_max_depth)
}
tree_random_search = model_selection.RandomizedSearchCV(
    tree.DecisionTreeRegressor(random_state=random_state), 
    param_grid,
    scoring='neg_root_mean_squared_error',
    cv=cv_grid, 
    verbose=verbose,
    n_iter=budget
    )
start_time = time.time()
tree_random_search.fit(X=x_train, y=y_train)
end_time = time.time()

summary['tree'] = summary['tree'].append(pd.Series({
    'Time (sec)': '{:.4f}'.format(end_time - start_time), 
    'Score (RMSE)': -tree_random_search.best_score_,
    'Min. samples split': tree_random_search.best_params_['min_samples_split'], 
    'Criterion': tree_random_search.best_params_['criterion'], 
    'Max. depth': tree_random_search.best_params_['max_depth']
    },
    name='random_search'))

For KNN the procedure is the same; only the hyper-parameters to tune change. In this case they are: the number of neighbors, a random integer between 1 and 15 (`sp_randint(1, 16)` excludes its upper bound); the weights, either uniform or distance-based; and p, the exponent of the Minkowski distance (1 or 2).

In [11]:
#3.3 Random search for K Nearest Neighbours hyper-parameter tuning
np.random.seed(random_state)
param_grid = {
    'n_neighbors': sp_randint(min_n_neigbors, max_n_neigbors),
    'weights': ['uniform', 'distance'],
    'p': [1, 2]
}

knn_random_search = model_selection.RandomizedSearchCV(
    neighbors.KNeighborsRegressor(), 
    param_grid,
    scoring='neg_root_mean_squared_error',
    cv=cv_grid, 
    verbose=verbose,
    n_iter=budget
    )
start_time = time.time()
knn_random_search.fit(X=x_train, y=y_train)
end_time = time.time()

summary['knn'] = summary['knn'].append(pd.Series({
    'Time (sec)': '{:.4f}'.format(end_time - start_time), 
    'Score (RMSE)': -knn_random_search.best_score_, 
    'N. neighbors': knn_random_search.best_params_['n_neighbors'], 
    'Weights': knn_random_search.best_params_['weights'], 
    'P': knn_random_search.best_params_['p']
    }, 
    name='random_search'))

### Training models using **SKOPT - Bayesian Optimization** Hyper-Parameter tuning

The following section implements hyper-parameter tuning by means of model-based optimization (Bayesian optimization) with skopt.

Compared with random search, the implementation is very similar: we define a parameter grid with the hyper-parameters to explore, evaluate with 2-fold CV and measure performance through the negative root MSE.  
The difference lies in how the search space is defined, in this case using skopt classes (Real, Integer, Categorical) that establish each hyper-parameter's type.

In [12]:
#3.4.1 Decision trees
np.random.seed(random_state)
param_grid = {
    'min_samples_split': Real(0+sys.float_info.min, 1),
    'criterion': Categorical(['mse','friedman_mse']), 
    'max_depth': Integer(min_max_depth, max_max_depth)
}
tree_skopt = BayesSearchCV(tree.DecisionTreeRegressor(random_state=random_state), 
    param_grid,
    cv=cv_grid,    
    verbose=verbose,
    scoring='neg_root_mean_squared_error',
    n_iter=budget
    )
start_time = time.time()
tree_skopt.fit(X=x_train, y=y_train)
end_time = time.time()

summary['tree'] = summary['tree'].append(pd.Series({
    'Time (sec)': '{:.4f}'.format(end_time - start_time), 
    'Score (RMSE)': -tree_skopt.best_score_,
    'Min. samples split': tree_skopt.best_params_['min_samples_split'], 
    'Criterion': tree_skopt.best_params_['criterion'], 
    'Max. depth': tree_skopt.best_params_['max_depth']
    },
    name='skopt'))


The objective has been evaluated at this point before.



In [13]:
#3.4.2 K Nearest neighbours
np.random.seed(random_state)
param_grid = {
    'n_neighbors': Integer(min_n_neigbors, max_n_neigbors),
    'weights': Categorical(['uniform', 'distance']),
    'p': Categorical([1, 2])
}
knn_skopt = BayesSearchCV(neighbors.KNeighborsRegressor(), 
    param_grid,
    cv=cv_grid,    
    verbose=verbose,
    scoring='neg_root_mean_squared_error',
    n_iter=budget
    )
start_time = time.time()
knn_skopt.fit(X=x_train, y=y_train)
end_time = time.time()

summary['knn'] = summary['knn'].append(pd.Series({
    'Time (sec)': '{:.4f}'.format(end_time - start_time), 
    'Score (RMSE)': -knn_skopt.best_score_, 
    'N. neighbors': knn_skopt.best_params_['n_neighbors'], 
    'Weights': knn_skopt.best_params_['weights'], 
    'P': knn_skopt.best_params_['p']
    }, 
    name='skopt'))


The objective has been evaluated at this point before.



### Training Optuna models - Bayesian Optimization

In this section we tune the hyper-parameters with Optuna, which also applies Bayesian optimization.

To use Optuna, we define an objective function (*tree_objective* and *knn_objective*) that suggests values for the hyper-parameters through a trial object. Each trial represents a new point in the hyper-parameter space, that is, a suggestion of hyper-parameters to evaluate. The model is then created with those hyper-parameters and a score is produced through CV.

Through Optuna we create a *study*, an optimization session with a direction. In our case we minimize the value returned by the objective function, which is the mean RMSE over the CV folds (the scorer returns the negative root MSE, so the objective flips its sign).

In [14]:
#3.5.1 Decision trees
np.random.seed(random_state)
def tree_objective(trial):
    min_samples_split = trial.suggest_uniform('min_samples_split', 0+sys.float_info.min, 1)
    criterion = trial.suggest_categorical('criterion', ['mse','friedman_mse'])
    max_depth = trial.suggest_int('max_depth', min_max_depth, max_max_depth)

    clf = tree.DecisionTreeRegressor(
        random_state=random_state,
        min_samples_split=min_samples_split,
        criterion=criterion,
        max_depth=max_depth)

    scores = -model_selection.cross_val_score(clf, x_train, y_train,
        cv=cv_grid,
        verbose=verbose,
        scoring='neg_root_mean_squared_error')

    return scores.mean()

tree_optuna = optuna.create_study(direction='minimize')
start_time = time.time()
tree_optuna.optimize(tree_objective, n_trials=budget)
end_time = time.time()

summary['tree'] = summary['tree'].append(pd.Series({
    'Time (sec)': '{:.4f}'.format(end_time - start_time), 
    'Score (RMSE)': tree_optuna.best_value,
    'Min. samples split': tree_optuna.best_params['min_samples_split'], 
    'Criterion': tree_optuna.best_params['criterion'], 
    'Max. depth': tree_optuna.best_params['max_depth']
    },
    name='optuna'))

[I 2020-12-21 18:56:54,495] A new study created in memory with name: no-name-9d73be2c-e63a-4ca1-a3ad-3efe05ba658f
[I 2020-12-21 18:56:54,505] Trial 0 finished with value: 0.2984974304504965 and parameters: {'min_samples_split': 0.6861249154215259, 'criterion': 'mse', 'max_depth': 16}. Best is trial 0 with value: 0.2984974304504965.
[I 2020-12-21 18:56:54,522] Trial 1 finished with value: 0.23452873176177735 and parameters: {'min_samples_split': 0.1907816462795069, 'criterion': 'friedman_mse', 'max_depth': 7}. Best is trial 1 with value: 0.23452873176177735.
[I 2020-12-21 18:56:54,537] Trial 2 finished with value: 0.2827141767132929 and parameters: {'min_samples_split': 0.4378314630460862, 'criterion': 'friedman_mse', 'max_depth': 9}. Best is trial 1 with value: 0.23452873176177735.
[I 2020-12-21 18:56:54,555] Trial 3 finished with value: 0.22403157570587573 and parameters: {'min_samples_split': 0.11408441057280105, 'criterion

In [15]:
#3.5.2 K Nearest Neighbours
np.random.seed(random_state)
def knn_objective(trial):
    n_neighbors = trial.suggest_int('n_neighbors', min_n_neigbors, max_n_neigbors)
    weights = trial.suggest_categorical('weights', ['uniform','distance'])
    p = trial.suggest_categorical('p', [1, 2])

    clf = neighbors.KNeighborsRegressor(
        n_neighbors=n_neighbors,
        weights=weights,
        p=p)

    scores = -model_selection.cross_val_score(clf, x_train, y_train,
        cv=cv_grid,
        verbose=verbose,
        scoring='neg_root_mean_squared_error')

    return scores.mean()

knn_optuna = optuna.create_study(direction='minimize')
start_time = time.time()
knn_optuna.optimize(knn_objective, n_trials=budget)
end_time = time.time()

summary['knn'] = summary['knn'].append(pd.Series({
    'Time (sec)': '{:.4f}'.format(end_time - start_time), 
    'Score (RMSE)': knn_optuna.best_value, 
    'N. neighbors': knn_optuna.best_params['n_neighbors'], 
    'Weights': knn_optuna.best_params['weights'], 
    'P': knn_optuna.best_params['p']
    }, 
    name='optuna'))

[I 2020-12-21 19:09:53,606] A new study created in memory with name: no-name-884d50f3-199d-4e22-b220-c22de1e9a4b7
[I 2020-12-21 19:09:53,718] Trial 0 finished with value: 0.18251608100710465 and parameters: {'n_neighbors': 10, 'weights': 'uniform', 'p': 1}. Best is trial 0 with value: 0.18251608100710465.
[I 2020-12-21 19:09:53,815] Trial 1 finished with value: 0.19095741162861635 and parameters: {'n_neighbors': 12, 'weights': 'distance', 'p': 2}. Best is trial 0 with value: 0.18251608100710465.
[I 2020-12-21 19:09:53,898] Trial 2 finished with value: 0.1794445718169178 and parameters: {'n_neighbors': 3, 'weights': 'uniform', 'p': 1}. Best is trial 2 with value: 0.1794445718169178.
[I 2020-12-21 19:09:53,982] Trial 3 finished with value: 0.17823769926145944 and parameters: {'n_neighbors': 5, 'weights': 'distance', 'p': 1}. Best is trial 3 with value: 0.17823769926145944.
[I 2020-12-21 19:09:54,065] Trial 4 finish

## Summary

Now, we will try to sum up our results. 

The next cell outputs the information collected for each model: the best value found for each hyper-parameter, the time needed for the search, and the RMSE score for each tuning method.

From this dataframe we can draw some interesting conclusions. Hyper-parameter tuning improves performance for both models: RMSE drops by at least 0.014385 for KNN (0.191550 to 0.177165) and by at least 0.021536 for decision trees (0.236676 to 0.215140).  
Judging by the scores, the best tuning method is skopt for decision trees and optuna for KNN, although optuna is within 0.0002 of skopt for decision trees and the three methods are nearly tied for KNN.  
However, the fastest tuning method (excluding the default hyper-parameters, which involve no tuning) is random search for decision trees and optuna for KNN. So for KNN optuna yields the best score in the least time, while skopt takes several minutes in both cases for little or no gain.
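
The improvement figures can be recomputed directly from the scores printed in the summary table. The snippet below hard-codes those scores (copied from the table; `improvement_range` is a helper introduced only for this illustration) and reports, for each model, the smallest and largest RMSE reduction over the default configuration:

```python
# Scores (RMSE) copied from the summary table.
tree_scores = {'default': 0.236676, 'random_search': 0.215140,
               'skopt': 0.211516, 'optuna': 0.211674}
knn_scores = {'default': 0.191550, 'random_search': 0.177165,
              'skopt': 0.177087, 'optuna': 0.177086}

def improvement_range(scores):
    """Return (smallest, largest) RMSE reduction over the default model."""
    tuned = [v for k, v in scores.items() if k != 'default']
    return (round(scores['default'] - max(tuned), 6),
            round(scores['default'] - min(tuned), 6))

print(improvement_range(tree_scores))  # (0.021536, 0.02516)
print(improvement_range(knn_scores))   # (0.014385, 0.014464)
```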


In [16]:
print("\nSUMMARY FOR DECISION TREE MODELS")
print(summary['tree'])
print("\nSUMMARY FOR K NEAREST NEIGHBORS MODELS")
print(summary['knn'])


SUMMARY FOR DECISION TREE MODELS
              Time (sec)  Score (RMSE) Min. samples split     Criterion Max. depth
default                0      0.236676                  2           mse       None
random_search     1.1164      0.215140          0.0580292  friedman_mse          9
skopt           399.1219      0.211516          0.0529852           mse         19
optuna            2.1498      0.211674          0.0528331           mse         15

SUMMARY FOR K NEAREST NEIGHBORS MODELS
              Time (sec)  Score (RMSE) N. neighbors   Weights  P
default                0      0.191550            5   uniform  2
random_search    13.7176      0.177165            6  distance  1
skopt           425.5685      0.177087            7  distance  1
optuna            9.2219      0.177086            7  distance  1



The following cell locates, within our summary dataframes, the configuration with the best RMSE score for decision trees and for KNN, and keeps the better of the two.

The output tells us that **the best model according to the inner evaluation is KNN tuned by means of optuna**. Contrasting this with the previous cell's output, we can verify that this model has a score of 0.177086, a clear improvement over the best decision tree (tuned by skopt), which scores 0.211516.

In [20]:
###3.6 Determine the best model from its inner evaluation
best_tree_model = summary['tree']['Score (RMSE)'].idxmin()
best_knn_model = summary['knn']['Score (RMSE)'].idxmin()

if summary['tree'].loc[best_tree_model]['Score (RMSE)'] < summary['knn'].loc[best_knn_model]['Score (RMSE)']:
    print('\n--> The best model is Decision Tree Regressor with {}'.format(best_tree_model))
    best_model = tree.DecisionTreeRegressor(
        random_state=random_state,
        min_samples_split=summary['tree'].loc[best_tree_model]['Min. samples split'] ,
        criterion=summary['tree'].loc[best_tree_model]['Criterion'],
        max_depth=summary['tree'].loc[best_tree_model]['Max. depth'])
else:
    print('\nThe best model is K Nearest Neighbors Regressor with {}'.format(best_knn_model))
    best_model = neighbors.KNeighborsRegressor(
        n_neighbors=summary['knn'].loc[best_knn_model]['N. neighbors'] ,
        weights=summary['knn'].loc[best_knn_model]['Weights'],
        p=summary['knn'].loc[best_knn_model]['P'])



The best model is K Nearest Neighbors Regressor with optuna


Now, taking this best model, **we estimate the performance we would get at the competition**. To do this, we train the KNN optuna model on the training split and evaluate it on the held-out test matrices, *x_test* and *y_test*.

In [18]:
###3.7 Performance estimation
best_model.fit(x_train, y_train)
best_model_predict = best_model.predict(x_test)
print('\nBest Model performance at competition:')
print('RMSE: {:.4f} (should be lower than the trivial predictor using the mean: RMSE {:.4f})'.format(
    math.sqrt(metrics.mean_squared_error(y_test, best_model_predict)),
    math.sqrt(metrics.mean_squared_error(y_test, [y_test.mean() for i in range(len(y_test))]))))
print('R square: {:.4f} (should be higher than the trivial predictor using the mean: R square {:.4f})'.format(
    metrics.r2_score(y_test, best_model_predict),
    metrics.r2_score(y_test, [y_test.mean() for i in range(len(y_test))])))



Best Model performance at competition:
RMSE: 0.1620 (should be lower than the trivial predictor using the mean: RMSE 0.3856)
R square: 0.8234 (should be higher than the trivial predictor using the mean: R square 0.0000)
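
The baseline in this output is a predictor that always returns the mean of the target. By construction its R square is 0 and its RMSE equals the (population) standard deviation of the target, which is why the best model should beat both. A small standard-library sketch on made-up numbers (the values in `y_true` and `y_pred` are purely illustrative):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [11.8, 12.1, 12.4, 12.0, 12.7]   # illustrative targets
y_pred = [11.9, 12.0, 12.5, 12.1, 12.5]   # illustrative model output
mean_pred = [sum(y_true) / len(y_true)] * len(y_true)  # trivial predictor

print(rmse(y_true, y_pred), r2(y_true, y_pred))
print(rmse(y_true, mean_pred), r2(y_true, mean_pred))
```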


Lastly, we train the best model (KNN with optuna) on the whole available dataset (x, y) and make predictions on the competition matrix *x_comp*. 

In [19]:
#3.8 Final model train
best_model.fit(x, y)
y_comp = [math.exp(i) for i in best_model.predict(x_comp)]

submission = pd.DataFrame(columns=['Id', 'SalePrice'])
submission['Id'] = pd.Series(range(1461, 2920))
submission['SalePrice'] = pd.Series(y_comp)
submission.to_csv('submission.csv', index=False)
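
The `math.exp` call in this cell suggests that the target column stores the natural logarithm of the sale price; that is an inference from the code, not something stated elsewhere in the notebook. Under that assumption, the back-transformation can be applied to the whole prediction vector at once with NumPy, as sketched here with illustrative values:

```python
import numpy as np

# Illustrative predictions on the log scale (not real model output).
log_predictions = np.array([11.5, 12.0, 12.3])

# Vectorized equivalent of [math.exp(i) for i in log_predictions].
prices = np.exp(log_predictions)
print(prices.round(2))
```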