# Ames Housing Dataset - AdaBoost with Random Forest estimator

> Gianmaria Pizzo - 872966@stud.unive.it

These notebooks represent the project submission for the course [Data and Web Mining](https://www.unive.it/data/course/337525) by Professor [Claudio Lucchese](https://www.unive.it/data/people/5590426) at [Ca' Foscari University of Venice](https://www.unive.it).

---

## Structure of this notebook

This notebook covers the following points:
* The idea
* Hyperparameter tuning for R.F. (cross-validated search)
* Model validation
    * For different counts of estimators
    * For different learning rates
* Results

---

### Before running this notebook

To avoid issues, it is best to do the following before running this notebook:
* Clean previous cell outputs
* Restart the kernel

---

## The idea - Combining AdaBoost and Random Forest

As we know, different predictors have different flaws and strengths. This means we can train multiple models in order to exploit what they learnt and obtain a more accurate result.

[Khulna University of Engineering and Technology](https://www.researchgate.net/institution/Khulna_University_of_Engineering_and_Technology) published a research article about the use of Random Forest as base estimators for AdaBoost in order to create a model for breast cancer detection, which can be found [here](https://www.researchgate.net/publication/339978251_A_Precise_Breast_Cancer_Detection_Approach_Using_Ensemble_of_Random_Forest_with_AdaBoost). 
Although used as a binary classifier, the results were impressive: "*The structure provided accuracy of 98.5714% along with sensitivity and specificity of 100% and 96.296% respectively in the testing phase*" states the abstract.

It seemed a good idea to use an ensemble of ensembles, in order to get a more accurate result.

Since we are using random forests, we expect some level of overfitting when testing on the dataset from which the outliers and most of the noise were removed. Moreover, as the dataset contains relatively few instances, this kind of model might be better suited to a larger dataset.
However, there should be some improvement, given that the boosting algorithm tries to lower the bias and that the random forest handles the categorical variables present here well.
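
As a minimal sketch of the idea, the snippet below wraps a small `RandomForestRegressor` inside scikit-learn's `AdaBoostRegressor`. The synthetic data from `make_regression` and all parameter values are assumptions for illustration only, not the settings used later in this notebook. The forest is passed positionally because the keyword for the base estimator was renamed across scikit-learn versions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (assumption for illustration)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Each boosting round fits a whole (small) forest instead of a single tree
base_forest = RandomForestRegressor(n_estimators=10, max_depth=4, random_state=0)
boosted_forest = AdaBoostRegressor(base_forest, n_estimators=5, learning_rate=1.0, random_state=0)
boosted_forest.fit(X_tr, y_tr)

# R^2 on the held-out split
demo_r2 = boosted_forest.score(X_te, y_te)
print("Held-out R2: {:.3f}".format(demo_r2))
```

Boosting can stop before the requested number of rounds, so `len(boosted_forest.estimators_)` may be smaller than `n_estimators`.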

---

### Environment

In [2]:
%matplotlib notebook

import os
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import warnings
import IPython

sns.set()
plt.style.use('ggplot')
sns.set_style("darkgrid")
warnings.filterwarnings('ignore') 

# Working folder
WORKING_DIR = os.getcwd()
# Resources folder
RESOURCES_DIR = os.path.join(os.getcwd(), 'resources')

# Original
IN_LABEL_ORIG = 'ames_housing_out_2_orig.csv'
# Modified
IN_LABEL_MOD = 'ames_housing_out_2.csv'

In [17]:
df = pd.read_csv(os.path.join(RESOURCES_DIR, IN_LABEL_MOD))
df.drop(columns='Unnamed: 0', inplace=True)

df_or = pd.read_csv(os.path.join(RESOURCES_DIR, IN_LABEL_ORIG))
df_or.drop(columns='Unnamed: 0', inplace=True)

In [4]:
def sort_alphabetically(dataset, last_label = None):
    """
    Sorts the columns of the dataset alphabetically

    :param dataset: a pd.DataFrame
    :param last_label: a str containing an existing column label in the dataset
    :returns: pd.DataFrame
    """
    # Sort
    dataset = dataset.reindex(sorted(dataset.columns), axis=1)
    # Move target column to last index
    if last_label is not None:
        col = dataset.pop(last_label)
        dataset.insert(dataset.shape[1], last_label, col)
    return dataset

In [5]:
df=sort_alphabetically(df, 'Sale_Price')

In [6]:
# Needed by get_train_test below
from sklearn.model_selection import train_test_split

def get_X_y(dataset, label, ignore=None):
    """
    Returns X and y and ignores labels in ignore
    :param dataset: a pd.DataFrame
    :param label: a str containing an existing target column label in the dataset
    :param ignore: a str containing an existing column label in the dataset to ignore
    :returns: tuple of pd.DataFrame X, y
    """
    if ignore is not None:
        # Drop the labels
        return dataset.drop(columns=[label, ignore]), dataset.loc[:,[label]]
    return dataset.drop(columns=[label]), dataset.loc[:,[label]]

def get_train_test(X, y, size = 0.2, state = 33):
    """
    Returns X_train_[size], X_test, y_train_[size], y_test
    :param X: a pd.DataFrame without the target column
    :param y: a pd.DataFrame with one column, the target
    :param size: a float representing the fraction for the test size
    :param state: an integer representing the random state for the test
    :returns: 4 pd.DataFrame usually called "X_train_[size], X_test, y_train_[size], y_test"
    """
    return train_test_split(X, y, test_size=size, random_state = state)

def get_train_val_test(X, y, size_t=0.2, size_v=0.25, state_v = 42):
    """
    Returns X_train, X_valid, X_test, y_train, y_valid, y_test
    :param X: a pd.DataFrame without the target column
    :param y: a pd.DataFrame with one column, the target
    :param size_t: a float representing the fraction for the test size
    :param size_v: a float representing the fraction for the validation
    :param state_v: an integer representing the random state for the validation
    :returns: 6 pd.DataFrame usually called X_train, X_valid, X_test, y_train, y_valid, y_test
    """
    X_train_s, X_test, y_train_s, y_test = get_train_test(X, y, size = size_t)
    X_train, X_valid, y_train, y_valid = get_train_test(X_train_s, y_train_s, size = size_v, state = state_v)
    return X_train, X_valid, X_test, y_train, y_valid, y_test
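
The two-step split performed by `get_train_val_test` can be checked with a quick sketch. With the default fractions, 20% is held out for testing and then 25% of the remaining 80% is held out for validation, i.e. 0.25 × 0.80 = 0.20 of the full data, which gives a 60/20/20 split. The 1000-row toy frame is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the real dataset (assumption for illustration)
toy = pd.DataFrame({"feature": np.arange(1000), "target": np.arange(1000) * 2.0})
X_toy, y_toy = toy[["feature"]], toy[["target"]]

# Step 1: hold out 20% of all rows for testing
X_rest, X_te, y_rest, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=33)
# Step 2: hold out 25% of the remaining 80% for validation
X_tr, X_va, y_tr, y_va = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

# Sizes of the train, validation and test parts
split_sizes = (len(X_tr), len(X_va), len(X_te))
print(split_sizes)
```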

## Data Preparation

As explained in [this article](https://towardsdatascience.com/one-hot-encoding-is-making-your-tree-based-ensembles-worse-heres-why-d64b282b5769), one-hot encoding can make tree-based ensembles worse.

Because one-hot encoding generates a new variable for every category, it is prone to producing too many predictors when the original column has a large number of unique values. Another disadvantage is that it introduces multicollinearity among the resulting variables, which can lower the model's accuracy.
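
The column growth described above is easy to see on a toy example. In the sketch below the column name and its 25 distinct values are hypothetical. It compares one-hot encoding with plain integer codes, and also checks that the dummy columns of each row sum to 1, which is where the multicollinearity comes from.

```python
import pandas as pd

# Hypothetical categorical column with 25 distinct values
toy = pd.DataFrame({"Neighborhood": ["N{}".format(i % 25) for i in range(100)]})

# One-hot encoding: one new column per distinct category
one_hot = pd.get_dummies(toy, columns=["Neighborhood"])

# Integer codes: a single column, whatever the number of categories
codes = toy["Neighborhood"].astype("category").cat.codes

# Every row has exactly one active dummy, so the columns are linearly dependent
rows_sum_to_one = bool((one_hot.sum(axis=1) == 1).all())

print(one_hot.shape[1], "one-hot columns vs 1 integer-coded column")
```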


---

## Hyperparameters Tuning for R.F. estimator

Since the R.F. is largely customizable when it comes to its parameters, a cross-validated search over a parameter grid seems the way to go. The code below uses `RandomizedSearchCV`, which samples a fixed number of combinations from the grid instead of trying them all as `GridSearchCV` would.

We are going to do this for **the two different targets** `Sale_Price` and `Log1p_Sale_Price`, on both the original and the engineered data. We are not going to exclude the outliers.

The best parameters are selected and a score is returned for the resulting model. These parameters can then be fed to the estimator before it is used by the AdaBoost algorithm.

In [31]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_squared_log_error, mean_absolute_error, r2_score, max_error 

scoring = {
    "MSE":"neg_mean_squared_error",
    "MSLE":"neg_mean_squared_log_error",
    "MAE": "neg_mean_absolute_error",
    "R2": "r2",
    "MAXERROR": "max_error"
}
    
random_grid = {
    'criterion':('squared_error', 'absolute_error', 'friedman_mse', 'poisson'),
    'max_depth':[8, 10, 12],
    'min_samples_split': [5,10, 8, 12, 15],
    'min_samples_leaf': [10, 12, 15, 17, 20],
    'max_features': ['sqrt'],
    'max_leaf_nodes': [10, 15, 20, 25],
    'random_state':[2324, 159857, 21412],
    'max_samples':[800, 900, 1000, 1050, 1100]
}

# Base estimator
rf = RandomForestRegressor()

# Randomized search with 5-fold cross-validation
tuned_model = RandomizedSearchCV(
    estimator = rf, param_distributions = random_grid, refit='MSE',
    scoring = scoring, n_iter=50,
    cv=5, verbose=0, n_jobs=-1)
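
With a dictionary of scorers, `refit='MSE'` decides which metric defines `best_params_` and `best_score_`, while the other metrics remain available in `cv_results_` under `mean_test_<name>` keys. The sketch below shows this on synthetic data. The data, the reduced grid and `n_iter=4` are assumptions chosen to keep it fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data (assumption for illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)

demo_scoring = {"MSE": "neg_mean_squared_error", "R2": "r2"}
demo_grid = {"max_depth": [3, 5], "min_samples_leaf": [2, 5]}

demo_search = RandomizedSearchCV(
    estimator=RandomForestRegressor(n_estimators=10, random_state=0),
    param_distributions=demo_grid, n_iter=4,
    scoring=demo_scoring, refit="MSE", cv=3, random_state=0)
demo_search.fit(X_demo, y_demo)

best = demo_search.best_index_
# best_score_ is the mean cross-validated value of the refit metric
print("Best neg-MSE:", demo_search.cv_results_["mean_test_MSE"][best])
print("R2 of the same candidate:", demo_search.cv_results_["mean_test_R2"][best])
```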

In [32]:
df_or['Log1p_Sale_Price'] = np.log1p(df_or['Sale_Price'])
X_or, y_or = get_X_y(df_or.select_dtypes(exclude=['object']), label = 'Sale_Price', ignore = 'Log1p_Sale_Price')
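
The `Log1p_Sale_Price` target is `log(1 + Sale_Price)`, so predictions made on that scale need `np.expm1` to be read as prices again. It also means that a score computed on this target is on a log scale: the MSE measured on the log1p values equals the MSLE of the corresponding raw values. The prices below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Made-up prices and predictions (assumption for illustration)
price_true = np.array([120000.0, 185000.0, 250000.0])
price_pred = np.array([130000.0, 170000.0, 260000.0])

# Forward transform and its inverse
log_true = np.log1p(price_true)
recovered = np.expm1(log_true)

# MSE on the log1p scale equals MSLE on the raw scale
mse_on_log = mean_squared_error(log_true, np.log1p(price_pred))
msle_on_raw = mean_squared_log_error(price_true, price_pred)
print(mse_on_log, msle_on_raw)
```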

In [33]:
# 1 - Original Data, Sale_Price

tuned_model.fit(X_or, y_or)

print ("Best Score: {:.3f}".format(tuned_model.best_score_) )
print ("Best Params: ", tuned_model.best_params_)


Best Score: -1212147335.083
Best Params:  {'random_state': 159857, 'min_samples_split': 10, 'min_samples_leaf': 12, 'max_samples': 1050, 'max_leaf_nodes': 25, 'max_features': 'sqrt', 'max_depth': 10, 'criterion': 'poisson'}


In [34]:
X_or_log, y_or_log = get_X_y(df_or.select_dtypes(exclude=['object']), label = 'Log1p_Sale_Price', ignore = 'Sale_Price')

In [35]:
# 2 - Original Data, Log1p Sale_Price

tuned_model.fit(X_or_log, y_or_log)

print ("Best Score: {:.3f}".format(tuned_model.best_score_) )
print ("Best Params: ", tuned_model.best_params_)


Best Score: -0.033
Best Params:  {'random_state': 2324, 'min_samples_split': 12, 'min_samples_leaf': 12, 'max_samples': 1050, 'max_leaf_nodes': 25, 'max_features': 'sqrt', 'max_depth': 10, 'criterion': 'squared_error'}


In [36]:
X_mod, y_mod = get_X_y(df.select_dtypes(exclude=['object']), label = 'Sale_Price', ignore = 'Log1p_Sale_Price')

In [37]:
# 3 - Engineered Data, Sale_Price

tuned_model.fit(X_mod, y_mod)

print ("Best Score: {:.3f}".format(tuned_model.best_score_) )
print ("Best Params: ", tuned_model.best_params_)


Best Score: -887434029.521
Best Params:  {'random_state': 159857, 'min_samples_split': 8, 'min_samples_leaf': 12, 'max_samples': 1050, 'max_leaf_nodes': 25, 'max_features': 'sqrt', 'max_depth': 10, 'criterion': 'poisson'}


In [38]:
X_mod_log, y_mod_log = get_X_y(df.select_dtypes(exclude=['object']), label ='Log1p_Sale_Price', ignore = 'Sale_Price')

In [39]:
# 4 - Engineered Data, Log1p Sale_Price

tuned_model.fit(X_mod_log, y_mod_log)

print ("Best Score: {:.3f}".format(tuned_model.best_score_) )
print ("Best Params: ", tuned_model.best_params_)


Best Score: -0.022
Best Params:  {'random_state': 159857, 'min_samples_split': 12, 'min_samples_leaf': 12, 'max_samples': 900, 'max_leaf_nodes': 25, 'max_features': 'sqrt', 'max_depth': 8, 'criterion': 'friedman_mse'}
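
The best scores above are negated mean squared errors, because `refit='MSE'` maps to the `neg_mean_squared_error` scorer. Taking the square root of the negated score gives an RMSE in the units of each target, which is easier to read. The sketch below only reuses the four printed values. The RMSE on the log target is on the log1p scale and is not directly comparable to the one on the raw price.

```python
import math

# Best scores copied from the four runs above (negated MSE)
best_scores = {
    "original, Sale_Price": -1212147335.083,
    "original, Log1p_Sale_Price": -0.033,
    "engineered, Sale_Price": -887434029.521,
    "engineered, Log1p_Sale_Price": -0.022,
}

# RMSE = sqrt(-score), in the units of the corresponding target
rmse = {name: math.sqrt(-score) for name, score in best_scores.items()}
for name, value in rmse.items():
    print("{:32s} RMSE: {:.3f}".format(name, value))
```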


In [None]:
# The final hyperparameters

# Sale_Price
rf_sp_param = {'random_state': 159857, 
               'min_samples_split': 10, 
               'min_samples_leaf': 12, 
               'max_samples': 800, 
               'max_leaf_nodes': 25, 
               'max_features': 'sqrt', 
               'max_depth': 10, 
               'criterion': 'poisson'}

# Log1p Sale Price

rf_logsp_param = {'random_state': 159857, 
               'min_samples_split': 10, 
               'min_samples_leaf': 12, 
               'max_samples': 800, 
               'max_leaf_nodes': 25, 
               'max_features': 'sqrt', 
               'max_depth': 10, 
               'criterion': 'poisson'}

Now that we have some good hyperparameters, we shall see how the accuracy changes with regard to the number of estimators. We need to limit this value because we are combining two ensemble methods, which could lead to overfitting and high time complexity.

In [None]:
n_estimators = [10, 20, 40, 50, 80, 100, 120, 200]
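
One inexpensive way to see how the error evolves with the number of estimators is `staged_predict`, which yields the boosted ensemble's predictions after each round from a single fit. The sketch below reads "number of estimators" as the number of boosting rounds, which is an assumption. The synthetic data, the deliberately small forests and the choice of MSE as the metric are assumptions for illustration too.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumption for illustration)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.25, random_state=1)

max_rounds = 8
booster = AdaBoostRegressor(
    RandomForestRegressor(n_estimators=5, max_depth=3, random_state=1),
    n_estimators=max_rounds, random_state=1)
booster.fit(X_tr, y_tr)

# One validation MSE per boosting round; boosting may stop before max_rounds
staged_mse = [mean_squared_error(y_va, pred) for pred in booster.staged_predict(X_va)]
best_round = staged_mse.index(min(staged_mse)) + 1
print("Rounds fitted:", len(staged_mse), "- best round:", best_round)
```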

---

## AdaBoost with R.F.

Now that we have some good hyperparameters for the estimator, we can closely analyze how accurate the model is.

To shortcut the choice of the number of estimators (keeping in mind that the estimators are ensembles too), we leave the decision to GridSearchCV once again. This is because we want to focus on another aspect: the learning rate.

In [None]:
# GridSearchCV grid for n_estimators and learning_rate

adab_parameters = {
    'n_estimators':[5, 10, 25, 50, 100],
    'learning_rate':[0.01,0.1, 1.0, 2.0, 5.0, 10, 100]
}
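
A grid like `adab_parameters` can be handed to `GridSearchCV` wrapped around an `AdaBoostRegressor` whose base estimator is a random forest. The following is a minimal sketch under stated assumptions: synthetic data, a shrunken grid, tiny forests and 3-fold cross-validation, none of which come from this notebook's actual runs.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data (assumption for illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=2)

# Reduced version of the grid above to keep the sketch fast
demo_adab_grid = {"n_estimators": [3, 6], "learning_rate": [0.1, 1.0]}

# The forest is passed positionally as the base estimator
demo_booster = AdaBoostRegressor(
    RandomForestRegressor(n_estimators=5, max_depth=3, random_state=2),
    random_state=2)

adab_search = GridSearchCV(demo_booster, demo_adab_grid,
                           scoring="neg_mean_squared_error", cv=3)
adab_search.fit(X_demo, y_demo)
print("Best Params: ", adab_search.best_params_)
```

Every candidate fits several forests per fold, so the cost grows quickly with the grid size, which is why the sketch keeps both ensembles small.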

In [None]:
# Cycle to see how the learning rate affects prediction on log1p Sale_Price and Sale_Price

    # skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
    # lst_accu_stratified = []  
    # for train_index, test_index in skf.split(x, y):

---

## Train-Val-Test 1: Regression on `Sale_Price`

### Train and Parameters Tuning

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

#Split 60/20/20
# 80% Train, 20% Test
X_train_80, X_test, y_train_80, y_test = train_test_split(df_train, df_target,
                                                          test_size = 0.20, random_state = 33)
# 80% Train -> 60% Train, 20% Validate (0.25 * 0.80 = 0.20 of all data)
X_train, X_valid, y_train, y_valid  = train_test_split(X_train_80, y_train_80, 
                                                       test_size=0.25, random_state=42)

accuracies = []

for c in [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]:
    # train and predict
    model = SVC(C=c, kernel='poly')
    model.fit(X_train, y_train)

    # compute Accuracy
    train_acc = accuracy_score(y_true = y_train, 
                               y_pred = model.predict(X_train))
    valid_acc = accuracy_score(y_true = y_valid, 
                               y_pred = model.predict(X_valid))
    print ("C: {:8.3f} - Train Accuracy: {:.3f} - Validation Accuracy: {:.3f}"
           .format( c, train_acc, valid_acc) )
    
    accuracies += [ [valid_acc, c] ]

best_accuracy, best_c = max(accuracies)
print ( "Best C:", best_c )

# here we are using both training and validation,
# to exploit the most data
model = SVC(C=best_c, kernel='poly')
model.fit(X_train_80,y_train_80)

test_acc = accuracy_score(y_true = y_test, 
                          y_pred = model.predict(X_test) )
print ("Test Accuracy: {:.3f}".format(test_acc) )

### Validation

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

X_train_80, X_test, y_train_80, y_test = train_test_split(df_train, df_target,
                                                          test_size = 0.20, random_state = 42)

model = SVC()
parameters = { 'C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0],
                'kernel': ['linear', 'poly', 'rbf', 'sigmoid']}
        
tuned_model = GridSearchCV(model, parameters, cv=5, verbose=0)
tuned_model.fit(X_train_80, y_train_80)

print ("Best Score: {:.3f}".format(tuned_model.best_score_) )
print ("Best Params: ", tuned_model.best_params_)

### Test

In [None]:
tuned_model.cv_results_

In [None]:
pd.DataFrame( tuned_model.cv_results_ )

In [None]:
test_acc = accuracy_score(y_true = y_test, 
                          y_pred = tuned_model.predict(X_test) )
print ("Test Accuracy: {:.3f}".format(test_acc) )

### Diagnostics and Evaluation

### Investigating Instances

---

## Train-Val-Test 2: Regression on `Log1p_Sale_Price`

### Train and Parameters Tuning

### Validation

### Test

### Diagnostics and Evaluation

### Investigating Instances