# Ames Housing Dataset - XGBoost Regressor

> Gianmaria Pizzo - 872966@stud.unive.it

These notebooks represent the project submission for the course [Data and Web Mining](https://www.unive.it/data/course/337525) by Professor [Claudio Lucchese](https://www.unive.it/data/people/5590426) at [Ca' Foscari University of Venice](https://www.unive.it).

---

## Structure of this notebook

This notebook covers the following points:
* The idea
* Tuning:
    * Automatic: GridSearchCV hyperparameter tuning for XGBoost
    * Manual
* Model validation
* Results
* Analysis of worst and best predictions

---

### Before running this notebook

To avoid issues, before running this notebook it is best to:
* Clean previous cell outputs
* Restart the kernel

---

## The idea

Different predictors have different strengths and weaknesses. This means we can train multiple models and combine what they have learnt to obtain a more accurate result.

As we are using a boosting method, we expect to find some level of overfitting when testing it on the dataset where the outliers and most of the noise were removed.

In addition, since the dataset contains relatively few instances, this kind of model might be better suited to a larger dataset.

However, there should be some level of improvement, given that the boosting algorithm tries to lower the bias.
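To make the bias-reduction argument concrete, here is a minimal sketch of gradient boosting for squared error, written in plain Python with depth-1 trees (stumps). It is not XGBoost's implementation: the toy data and the helper names `fit_stump` and `boost` are assumptions made only for illustration. Each round fits a stump to the residuals of the current ensemble, so the training error shrinks round after round, which is also why overfitting becomes a risk on small datasets.

```python
# Minimal gradient-boosting sketch (squared error, depth-1 trees).
# Toy data and helper names are illustrative, not part of the notebook.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate thresholds; right side never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]


def boost(xs, ys, n_rounds, learning_rate):
    """Fit stumps sequentially on the residuals; return the training MSE per round."""
    base = sum(ys) / len(ys)          # initial prediction: the mean target
    preds = [base] * len(ys)
    mse_history = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        preds = [p + learning_rate * (lv if x <= t else rv)
                 for x, p in zip(xs, preds)]
        mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return mse_history


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.5, 1.2, 3.8, 4.1, 4.0, 7.9, 8.2]
history = boost(xs, ys, n_rounds=10, learning_rate=0.5)
print([round(m, 3) for m in history])
```

The printed training MSE never increases from one round to the next. A low training error says nothing about the error on unseen houses, which is what the cross-validation and Leave One Out steps later in the notebook measure.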

---

### Environment, Globals and Imports

In [None]:
!pip install mlxtend
!pip install xgboost

In [1]:
# Interactive
%matplotlib notebook
# Static
# %matplotlib inline

# Environment for this notebook
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
import warnings
import sklearn 
import IPython
import xgboost
from scipy import stats
from xgboost import XGBRegressor
from sklearn.model_selection import RepeatedKFold

# Set the style for the plots
sns.set()
plt.style.use('ggplot')
sns.set_style("darkgrid")
# Ignore warnings
warnings.filterwarnings('ignore') 

# Working folder
WORKING_DIR = os.getcwd()
# Resources folder
RESOURCES_DIR = os.path.join(os.getcwd(), 'resources')
# Name of file
IN_LABEL = 'ames_housing_out_2.csv'
ORIG_LABEL = 'ames_housing_out_2_orig.csv'

In [2]:
# Utils Module

def sort_alphabetically(dataset, last_label = None):
    """
    Sorts the dataset alphabetically 

    :param dataset: a pd.DataFrame
    :param last_label: a str containing an existing column label in the dataset
    :returns: pd.DataFrame
    """
    # Sort
    dataset = dataset.reindex(sorted(dataset.columns), axis=1)
    # Move target column to last index
    if last_label is not None:
        col = dataset.pop(last_label)
        dataset.insert(dataset.shape[1], last_label, col)
    return dataset

In [3]:
from sklearn.model_selection import train_test_split

# Module for train test split

def get_X_y(dataset, label, ignore=None):
    """
    Returns X and y and ignores labels in ignore
    :param dataset: a pd.DataFrame
    :param label: a str containing an existing target column label in the dataset
    :param ignore: a list of str containing an existing column label in the dataset to ignore
    :returns: tuple of pd.DataFrame
    """
    if ignore is not None:
        # Drop the labels 
        all_columns = list(dataset.columns)
        # Include only columns that are existing 
        to_drop = [i for i in all_columns if i in ignore] +[label]
        return dataset.drop(columns=to_drop), dataset[[label]]
    return dataset.drop(columns=[label]), dataset[[label]]

def get_train_test(X, y, size = 0.2, state = 33):
    """
    Returns X_train_[size], X_test, y_train_[size], y_test
    :param X: a pd.DataFrame without the target column
    :param y: a pd.DataFrame with one column, the target
    :param size: a float representing the fraction for the test size
    :param state: an integer representing the random state for the test
    :returns: 4 pd.DataFrame usually called "X_train_[size], X_test, y_train_[size], y_test"
    """
    return train_test_split(X, y, test_size=size, random_state = state)

def get_train_val_test(X, y, size_t=0.2, size_v=0.25, state_v = 42):
    """
    Returns X_train, X_valid, X_test, y_train, y_valid, y_test
    :param X: a pd.DataFrame without the target column
    :param y: a pd.DataFrame with one column, the target
    :param size_t: a float representing the fraction for the test size
    :param size_v: a float representing the fraction for the validation
    :param state_v: an integer representing the random state for the validation
    :returns: 6 pd.DataFrame usually called X_train, X_valid, X_test, y_train, y_valid, y_test
    """
    X_train_s, X_test, y_train_s, y_test = get_train_test(X, y, size = size_t)
    X_train, X_valid, y_train, y_valid = get_train_test(X_train_s, y_train_s, size = size_v, state = state_v)
    return X_train, X_valid, X_test, y_train, y_valid, y_test
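As a quick check of what the default arguments of `get_train_val_test` imply, the sketch below computes the effective share of the data in each part when two hold-out splits are chained. The helper `split_fractions` is hypothetical and only does the arithmetic; it does not call `train_test_split`.

```python
# Effective fractions produced by two chained hold-out splits.
# `split_fractions` is an illustrative helper, not part of the notebook.

def split_fractions(size_t=0.2, size_v=0.25):
    test = size_t                      # first split: test share of the whole data
    valid = (1 - size_t) * size_v      # second split acts on the remaining data
    train = 1 - test - valid
    return train, valid, test


train, valid, test = split_fractions()
print(f"train={train:.2f}, valid={valid:.2f}, test={test:.2f}")
# train=0.60, valid=0.20, test=0.20
```

With the defaults this is a 60/20/20 split. The actual row counts can differ by one from these fractions because the splitter rounds to whole instances.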

In [26]:
from sklearn.model_selection import LeaveOneOut, GridSearchCV
from sklearn.metrics import mean_squared_error, mean_squared_log_error, mean_absolute_error, r2_score, max_error 
from mlxtend.evaluate import bias_variance_decomp

# Module for training and testing
def get_regression_metrics(y_test, y_pred):
    metrics = {
            "RMSE": mean_squared_error(y_true=y_test, y_pred=y_pred, squared=False),
            "MSE": mean_squared_error(y_true=y_test, y_pred=y_pred),
            "MSLE": mean_squared_log_error(y_true=y_test, y_pred=y_pred),
            "MAE": mean_absolute_error(y_true=y_test, y_pred=y_pred),
            "R2": r2_score(y_true=y_test, y_pred=y_pred),
            "MAX_Err": max_error(y_true=y_test, y_pred=y_pred)}
    return metrics


def get_bias_variance_decomp(dataset, model, label, split_size, ignore, 
                             num_rounds=50, random_state=230324945):
    # Get split
    # Pass the caller's ignore list through (it was previously discarded)
    X, y = get_X_y(dataset, label=label, ignore=ignore)
    X_train, X_test, y_train, y_test = get_train_test(X, y, size = split_size, 
                                                      state = random_state)
    # Only accepts np.arrays
    mse, bias, var = bias_variance_decomp(estimator=model, 
                                          X_train=X_train.values, 
                                          y_train=y_train.values, 
                                          X_test=X_test.values, 
                                          y_test=y_test.values, 
                                          loss='mse', num_rounds=num_rounds, 
                                          random_seed=random_state)
    print('Avg Expected RMSE: %.3f' % np.sqrt(mse))
    print('Avg Expected MSE: %.3f' % mse)
    print('Avg Bias: %.3f' % bias)
    print('Avg Variance: %.3f' % var)
    pass


def LOO_estimator_eval(dataset, target, estimator, params, ignore=None):
    """
    Function used to evaluate estimators, based on Leave One Out process. It adds a 
    column 'Predicted' to the given dataset, and returns the metrics used to evaluate the 
    performances
    
    :param dataset: a pd.DataFrame with the target column
    :param target: a str representing the target
    :param estimator: instance of some estimator (e.g. XGBRegressor())
    :param params: a dictionary containing the parameters for the estimator
    :param ignore: a list of strings representing the features to ignore
    :returns: a dict with the regression metrics
    """
    # Splitter
    splitter = LeaveOneOut()
    
    # Add predicted
    dataset['Predicted'] = 0.0
    
    # Ignore
    if ignore is not None:
        ignore = ignore + ['Predicted']
    else:
        ignore = ['Predicted']
    
    # Split X, y
    X, y = get_X_y(dataset, label=target, ignore=ignore)
    
    # For each fold and tuple train, test indices
    for i, (train_index, test_index) in enumerate(splitter.split(X)):
        # Re-Assign
        model = estimator
        
        # Base model initialized with some parameters
        if params is not None: 
            print(params)
            # set_params expects keyword arguments, so unpack the dict
            model.set_params(**params)
        
        # Get train part
        train = dataset.loc[train_index.tolist()]
        X_train, y_train = get_X_y(train, label=target, ignore=ignore)
       
        # Train 
        model.fit(X_train, y_train)

        # Get test part 
        test = dataset.loc[test_index.tolist()]
        X_test, y_test = get_X_y(test, label=target, ignore=ignore)
        
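        # Add prediction to dataset (a single .loc call avoids chained assignment,
        # which may write to a temporary copy instead of the dataset)
        y_pred = model.predict(X_test)
        dataset.loc[test_index[0], 'Predicted'] = y_pred[0]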
    return get_regression_metrics(dataset[[target]], dataset[['Predicted']])


def GridSearch_CV_Tuning(dataset, target, estimator, params, ignore=None, n_repeats=4, n_splits=4, 
                random_state=33412):
    """
    Function used to evaluate estimators, based on GridSearchCV process. It evaluates the
    performances through a Repeated K Fold, and returns the results
    
    :param dataset: a pd.DataFrame with the target column
    :param target: a str representing the target
    :param estimator: instance of some estimator (e.g. XGBRegressor())
    :param params: a dictionary containing the parameter grid for the estimator
    :param ignore: a list of strings representing the features to ignore
    :param n_repeats: an integer, the number of repetitions of the K-Fold
    :param n_splits: an integer, the number of folds per repetition
    :returns: the pd.DataFrame containing the results
    """
    # Ignore
    if ignore is not None:
        ignore = ignore + ['Predicted']
    else:
        ignore = ['Predicted']
        
    # RepeatedKFold splitter
    splitter = RepeatedKFold(n_repeats=n_repeats, n_splits=n_splits, random_state=random_state)
    
    # GridSearchCV
    clf = GridSearchCV(estimator=estimator, cv=splitter,
                       param_grid=params, return_train_score = True,
                       scoring =['neg_mean_squared_error', 'neg_root_mean_squared_error', 'r2'],
                       refit=False, n_jobs=-1, verbose=3)
    # X, y ('Predicted' was already appended to ignore above)
    X, y = get_X_y(dataset, label=target, ignore=ignore)
    # Train, Test split
    X_train, X_test, y_train, y_test = get_train_test(X, y)
    # Fit
    clf.fit(X_train, y_train)
    
    return pd.DataFrame(clf.cv_results_)



## Dataset Overview

The datasets we are going to consider are the following:
* The modified dataset
* The original dataset

In [6]:
df = pd.read_csv(os.path.join(RESOURCES_DIR, IN_LABEL))
df_orig = pd.read_csv(os.path.join(RESOURCES_DIR, ORIG_LABEL))

df.drop(columns=['Unnamed: 0', 'Latitude', 'Longitude'], inplace=True)
df_orig.drop(columns=['Unnamed: 0', 'Latitude', 'Longitude'], inplace=True)

df = sort_alphabetically(df, 'Sale_Price')
df_orig = sort_alphabetically(df_orig, 'Sale_Price')

---

## Hyperparameters Tuning

First of all, let us use a grid search with cross-validation to find the best parameters.

### Automatic Parameter Tuning: Grid Search with Repeated K-Fold

By defining the repetitions, the splits and the parameter grid, we repeatedly train and test the model on every combination of parameters. For each candidate we obtain three scores (MSE, RMSE and R2), which we can use to identify the best parameters.

In [7]:
xgb_params = {
    'n_estimators': [5, 7, 9, 11],
    'max_depth': [5, 7, 9, 11, 13, 15],
    'max_leaves': [8, 10, 12, 14],
    'learning_rate': [0.5, 0.25, 1],
    'booster' : ['gbtree'],
    'importance_type': ['weight', 'gain'],
}

In [8]:
results = GridSearch_CV_Tuning(dataset=df, target='Sale_Price', estimator=XGBRegressor(), params=xgb_params)

Fitting 16 folds for each of 576 candidates, totalling 9216 fits
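
The numbers in this message follow directly from the grid and the splitter settings. The sketch below reproduces them with the standard library only; the grid is copied from the cell above, and the fold count assumes the defaults `n_splits=4` and `n_repeats=4` of `GridSearch_CV_Tuning`.

```python
from itertools import product

# Same grid as in the notebook cell above
xgb_params = {
    'n_estimators': [5, 7, 9, 11],
    'max_depth': [5, 7, 9, 11, 13, 15],
    'max_leaves': [8, 10, 12, 14],
    'learning_rate': [0.5, 0.25, 1],
    'booster': ['gbtree'],
    'importance_type': ['weight', 'gain'],
}

# An exhaustive grid search tries every combination of the listed values
n_candidates = len(list(product(*xgb_params.values())))

# RepeatedKFold yields n_splits folds for each of the n_repeats repetitions
n_folds = 4 * 4

print(n_candidates, n_folds, n_candidates * n_folds)
# 576 16 9216
```

Half of these candidates differ only in `importance_type`, which changes how feature importances are reported and not how the trees are fitted. This would explain why the `weight` and `gain` variants obtain identical scores in the tables below.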


From this dataframe we want to obtain the top-ranked models (rank 1) for each metric we used.

In [9]:
results.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 576 entries, 0 to 575
Columns: 122 entries, mean_fit_time to std_train_r2
dtypes: float64(112), int32(3), object(7)
memory usage: 542.4+ KB


In [10]:
best_r2 = list(results[['rank_test_r2','mean_train_r2', 'mean_test_r2']][results['rank_test_r2']==1].index)
best_mse = list(results[['rank_test_neg_mean_squared_error','mean_train_neg_mean_squared_error', 'mean_test_neg_mean_squared_error']][results['rank_test_neg_mean_squared_error']==1].index)
best_rmse = list(results[['rank_test_neg_root_mean_squared_error','mean_train_neg_root_mean_squared_error', 'mean_test_neg_root_mean_squared_error']][results['rank_test_neg_root_mean_squared_error']==1].index)

best = list(set(best_r2) | set(best_mse) | set(best_rmse))
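
The selection above keeps only the rows ranked first for each metric. If a wider shortlist is wanted, one possible generalisation is to take the `k` best rows per rank column and merge them. The sketch below assumes a hypothetical helper `top_k` and a tiny invented stand-in for `cv_results_`; only the rank column names match the real ones.

```python
import pandas as pd

# Invented miniature of a cv_results_ table, for illustration only
toy = pd.DataFrame({
    "params": ["a", "b", "c", "d"],
    "rank_test_r2": [2, 1, 4, 3],
    "rank_test_neg_root_mean_squared_error": [1, 2, 3, 4],
})


def top_k(results, rank_columns, k):
    """Rows that are among the k best for at least one of the rank columns."""
    keep = set()
    for col in rank_columns:
        # Lower rank is better, so the k smallest ranks are the k best candidates
        keep |= set(results.nsmallest(k, col).index)
    return results.loc[sorted(keep)]


best_toy = top_k(
    toy, ["rank_test_r2", "rank_test_neg_root_mean_squared_error"], k=1
)
print(best_toy["params"].tolist())
# ['a', 'b']
```

By default `nsmallest` returns exactly `k` rows, so tied candidates beyond the first `k` are dropped; passing `keep="all"` would retain them, which is closer to the `== 1` filter used in the notebook.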

In [11]:
best_df = results[['mean_fit_time', 'mean_test_neg_mean_squared_error', 'mean_test_neg_root_mean_squared_error', 'mean_test_r2', 'params',]].loc[best].sort_values(by=['mean_fit_time'])

In [12]:
best_df

Unnamed: 0,mean_fit_time,mean_test_neg_mean_squared_error,mean_test_neg_root_mean_squared_error,mean_test_r2,params
11,0.05425,-681956100.0,-26002.622275,0.886337,"{'booster': 'gbtree', 'importance_type': 'weig..."
299,0.054687,-681956100.0,-26002.622275,0.886337,"{'booster': 'gbtree', 'importance_type': 'gain..."
7,0.05475,-681956100.0,-26002.622275,0.886337,"{'booster': 'gbtree', 'importance_type': 'weig..."
295,0.055,-681956100.0,-26002.622275,0.886337,"{'booster': 'gbtree', 'importance_type': 'gain..."
303,0.05525,-681956100.0,-26002.622275,0.886337,"{'booster': 'gbtree', 'importance_type': 'gain..."
3,0.055625,-681956100.0,-26002.622275,0.886337,"{'booster': 'gbtree', 'importance_type': 'weig..."
15,0.055625,-681956100.0,-26002.622275,0.886337,"{'booster': 'gbtree', 'importance_type': 'weig..."
291,0.055749,-681956100.0,-26002.622275,0.886337,"{'booster': 'gbtree', 'importance_type': 'gain..."


In [13]:
pd.DataFrame(list(best_df.params))

Unnamed: 0,booster,importance_type,learning_rate,max_depth,max_leaves,n_estimators
0,gbtree,weight,0.5,5,12,11
1,gbtree,gain,0.5,5,12,11
2,gbtree,weight,0.5,5,10,11
3,gbtree,gain,0.5,5,10,11
4,gbtree,gain,0.5,5,14,11
5,gbtree,weight,0.5,5,8,11
6,gbtree,weight,0.5,5,14,11
7,gbtree,gain,0.5,5,8,11


To check that these results are consistent, we repeat the same search on the original dataset.

In [14]:
results_orig = GridSearch_CV_Tuning(dataset=df_orig, target='Sale_Price', estimator=XGBRegressor(), params=xgb_params)

Fitting 16 folds for each of 576 candidates, totalling 9216 fits


In [15]:
results_orig

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_booster,param_importance_type,param_learning_rate,param_max_depth,param_max_leaves,param_n_estimators,...,split8_train_r2,split9_train_r2,split10_train_r2,split11_train_r2,split12_train_r2,split13_train_r2,split14_train_r2,split15_train_r2,mean_train_r2,std_train_r2
0,0.032561,0.001457,0.006750,0.000560,gbtree,weight,0.5,5,8,5,...,0.940117,0.940010,0.938449,0.940638,0.943239,0.942178,0.938509,0.938723,0.940707,0.002149
1,0.041937,0.003211,0.006688,0.000846,gbtree,weight,0.5,5,8,7,...,0.960758,0.958896,0.958548,0.960516,0.962637,0.961705,0.961151,0.959022,0.960583,0.001934
2,0.052874,0.003140,0.006813,0.000634,gbtree,weight,0.5,5,8,9,...,0.968322,0.963765,0.964407,0.967358,0.968628,0.968248,0.966524,0.964350,0.967188,0.002025
3,0.060875,0.002690,0.006125,0.000696,gbtree,weight,0.5,5,8,11,...,0.971262,0.969459,0.969855,0.970985,0.971937,0.972881,0.972361,0.968931,0.971334,0.001449
4,0.032749,0.001521,0.006313,0.000464,gbtree,weight,0.5,5,10,5,...,0.940117,0.940010,0.938449,0.940638,0.943239,0.942178,0.938509,0.938723,0.940707,0.002149
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
571,0.192188,0.006876,0.007313,0.000916,gbtree,gain,1,15,12,11,...,0.999999,0.999999,0.999998,0.999997,0.999999,0.999994,0.999999,0.999997,0.999998,0.000001
572,0.085062,0.003268,0.006687,0.000682,gbtree,gain,1,15,14,5,...,0.999128,0.998967,0.998390,0.997821,0.999580,0.998539,0.997493,0.997802,0.998584,0.000679
573,0.119250,0.005081,0.006875,0.000484,gbtree,gain,1,15,14,7,...,0.999924,0.999933,0.999872,0.999579,0.999946,0.999810,0.999629,0.999855,0.999816,0.000111
574,0.158000,0.010217,0.007000,0.000790,gbtree,gain,1,15,14,9,...,0.999990,0.999993,0.999982,0.999956,0.999988,0.999975,0.999979,0.999989,0.999984,0.000009


In [16]:
best_r22 = list(results_orig[['rank_test_r2','mean_train_r2', 'mean_test_r2']][results_orig['rank_test_r2']==1].index)
best_mse2 = list(results_orig[['rank_test_neg_mean_squared_error','mean_train_neg_mean_squared_error', 'mean_test_neg_mean_squared_error']][results_orig['rank_test_neg_mean_squared_error']==1].index)
best_rmse2 = list(results_orig[['rank_test_neg_root_mean_squared_error','mean_train_neg_root_mean_squared_error', 'mean_test_neg_root_mean_squared_error']][results_orig['rank_test_neg_root_mean_squared_error']==1].index)

best2 = list(set(best_r22) | set(best_mse2) | set(best_rmse2))

best_df2 = results_orig[['mean_fit_time', 'mean_test_neg_mean_squared_error', 'mean_test_neg_root_mean_squared_error', 'mean_test_r2', 'params',]].loc[best2].sort_values(by=['mean_fit_time'])

In [17]:
best_df2 

Unnamed: 0,mean_fit_time,mean_test_neg_mean_squared_error,mean_test_neg_root_mean_squared_error,mean_test_r2,params
299,0.059624,-741644300.0,-27110.407772,0.878302,"{'booster': 'gbtree', 'importance_type': 'gain..."
291,0.060624,-741644300.0,-27110.407772,0.878302,"{'booster': 'gbtree', 'importance_type': 'gain..."
7,0.06075,-741644300.0,-27110.407772,0.878302,"{'booster': 'gbtree', 'importance_type': 'weig..."
295,0.060875,-741644300.0,-27110.407772,0.878302,"{'booster': 'gbtree', 'importance_type': 'gain..."
3,0.060875,-741644300.0,-27110.407772,0.878302,"{'booster': 'gbtree', 'importance_type': 'weig..."
11,0.06175,-741644300.0,-27110.407772,0.878302,"{'booster': 'gbtree', 'importance_type': 'weig..."
15,0.062313,-741644300.0,-27110.407772,0.878302,"{'booster': 'gbtree', 'importance_type': 'weig..."
303,0.06275,-741644300.0,-27110.407772,0.878302,"{'booster': 'gbtree', 'importance_type': 'gain..."


In [18]:
pd.DataFrame(list(best_df2.params))

Unnamed: 0,booster,importance_type,learning_rate,max_depth,max_leaves,n_estimators
0,gbtree,gain,0.5,5,12,11
1,gbtree,gain,0.5,5,8,11
2,gbtree,weight,0.5,5,10,11
3,gbtree,gain,0.5,5,10,11
4,gbtree,weight,0.5,5,8,11
5,gbtree,weight,0.5,5,12,11
6,gbtree,weight,0.5,5,14,11
7,gbtree,gain,0.5,5,14,11


---

## XGBoost Evaluation - Leave One Out Metrics Computing

Now that we have some good knowledge about hyperparameters for the estimator, we can closely analyze how accurate the model is.

To get the most reliable estimate of the test error, we use the Leave One Out approach: the model is trained once per instance, each time on all the other instances, and the prediction for the held-out instance is written in place to a `Predicted` column of the dataset.
This allows us to find the best and worst predictions.
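
The procedure can be sketched without any library. The example below swaps in a deliberately simple model (a one-feature least-squares line) in place of XGBoost, and uses invented toy data; `fit_line` and `leave_one_out_predictions` are illustrative names. The point is only the loop structure: every instance receives a prediction from a model that never saw it.

```python
# Leave-one-out sketch with a simple stand-in model (1-D least squares).
# Names and data are illustrative; the notebook uses XGBRegressor instead.

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


def leave_one_out_predictions(xs, ys):
    """Predict each instance with a model trained on all the others."""
    preds = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]   # drop instance i from the training set
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        preds.append(slope * xs[i] + intercept)
    return preds


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
preds = leave_one_out_predictions(xs, ys)
errors = [abs(y - p) for y, p in zip(ys, preds)]

# Index of the worst out-of-sample prediction
worst = max(range(len(errors)), key=errors.__getitem__)
print(len(preds), worst)
```

The cost is one model fit per instance, which is affordable here only because the chosen trees are small and few.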

In [28]:
LOO_estimator_eval(dataset=df, target='Sale_Price', 
                   estimator=XGBRegressor(
                       booster='gbtree', importance_type='weight', learning_rate = 0.5,
                       max_depth=5, max_leaves=8, n_estimators = 10), 
                   params=None, ignore=None)

{'Scatter_Index': Sale_Price    13.550746
 dtype: float64,
 'RMSE': 24894.6694607419,
 'MSE': 619744567.5595955,
 'MSLE': 0.014360546363033506,
 'MAE': 16224.271640979576,
 'R2': 0.8987033747428737,
 'MAX_Err': 415166.1875}
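
The dictionaries printed in this section contain a `Scatter_Index` entry that `get_regression_metrics`, as listed earlier, does not compute. A plausible reading, stated here as an assumption, is the common definition of the scatter index: the RMSE expressed as a percentage of the mean observed target. The printed values are consistent with it, since 100 × 24894.67 / 13.55 is about 183,700, a believable mean sale price. A minimal sketch under that assumption:

```python
import math


def scatter_index(y_true, y_pred):
    """RMSE as a percentage of the mean observed value (assumed definition)."""
    n = len(y_true)
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    return 100.0 * rmse / (sum(y_true) / n)


# Toy check: every error is 10, the mean target is 200
print(round(scatter_index([100.0, 200.0, 300.0], [110.0, 190.0, 310.0]), 2))
# 5.0
```

Being relative to the mean price, this figure is easier to compare across the two datasets than the raw RMSE.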

In [29]:
LOO_estimator_eval(dataset=df, target='Sale_Price', 
                   estimator=XGBRegressor(
                       booster='gbtree', importance_type='gain', learning_rate = 0.5,
                       max_depth=5, max_leaves=12, n_estimators = 8), 
                   params=None, ignore=None)

{'Scatter_Index': Sale_Price    13.747743
 dtype: float64,
 'RMSE': 25256.58003303498,
 'MSE': 637894834.9651012,
 'MSLE': 0.014704208549910953,
 'MAE': 16478.32322445868,
 'R2': 0.8957367318194319,
 'MAX_Err': 408974.0625}

In [30]:
LOO_estimator_eval(dataset=df_orig, target='Sale_Price', 
                   estimator=XGBRegressor(
                       booster='gbtree', importance_type='weight', learning_rate = 0.5,
                       max_depth=5, max_leaves=8, n_estimators = 11), 
                   params=None, ignore=None)

{'Scatter_Index': Sale_Price    15.003242
 dtype: float64,
 'RMSE': 27125.269828808487,
 'MSE': 735780263.2856679,
 'MSLE': 0.02035329314297205,
 'MAE': 16817.206779276876,
 'R2': 0.8846686162129825,
 'MAX_Err': 331820.3125}

In [31]:
LOO_estimator_eval(dataset=df_orig, target='Sale_Price', 
                   estimator=XGBRegressor(
                       booster='gbtree', importance_type='weight', learning_rate = 0.5,
                       max_depth=5, max_leaves=12, n_estimators = 8), 
                   params=None, ignore=None)

{'Scatter_Index': Sale_Price    15.184065
 dtype: float64,
 'RMSE': 27452.191702856682,
 'MSE': 753622829.2903934,
 'MSLE': 0.02086520457268114,
 'MAE': 17147.833393771332,
 'R2': 0.8818718466741438,
 'MAX_Err': 334046.34375}

## Worst Predictions and Best Predictions

In [33]:
df['Prediction_Error'] = np.abs(df['Sale_Price']-df['Predicted'])
df_orig['Prediction_Error'] = np.abs(df_orig['Sale_Price']-df_orig['Predicted'])

### Most wrong on df

In [45]:
df.sort_values(by=['Prediction_Error', 'Sale_Price'], ascending=False).head(30)

Unnamed: 0,Age,Alley_Access,Baths,Bedroom_AbvGr,Bedroom_Liv_Area_Ratio,Bsmt,Bsmt_Eval,Bsmt_Unf_SF,Central_Air,Electrical_SBrkr,...,TotBath_LivArea_Ratio,TotRms_AbvGrd,Total_Bsmt_SF,Total_SF,Year_Sold,neighborhoods_1,neighborhoods_5,Sale_Price,Predicted,Prediction_Error
1985,0.0,0.0,4.5,3.0,1558.6666,1.0,13022.7,878.0,0.0,1.0,...,1336.0,11.0,3138.0,7814.0,2007.0,1.0,0.0,184750.0,593724.0625,408974.0625
381,1.0,0.0,3.0,2.0,1201.0,1.0,12840.101,788.0,0.0,1.0,...,1201.0,10.0,3094.0,5496.0,2009.0,0.0,1.0,555000.0,309243.84375,245756.15625
1532,1.0,0.0,2.5,3.0,881.0,1.0,7481.675,2153.0,0.0,1.0,...,1057.2,9.0,2153.0,4796.0,2007.0,0.0,1.0,380000.0,558245.3125,178245.3125
1611,1.0,0.0,2.0,2.0,897.5,1.0,7090.25,1795.0,0.0,1.0,...,897.5,7.0,1795.0,3590.0,2007.0,0.0,0.0,147000.0,316311.5625,169311.5625
41,1.0,0.0,3.5,2.0,1182.0,1.0,9669.5,142.0,0.0,1.0,...,945.6,11.0,2330.0,4694.0,2010.0,0.0,1.0,611657.0,447620.46875,164036.53125
1598,13.0,0.0,4.0,4.0,1079.0,1.0,10142.601,989.0,0.0,1.0,...,1233.1428,10.0,2444.0,6760.0,2007.0,0.0,1.0,755000.0,591762.0,163238.0
1477,0.0,0.0,3.0,2.0,709.5,1.0,4931.025,474.0,0.0,1.0,...,709.5,7.0,1419.0,2838.0,2007.0,0.0,1.0,392000.0,253797.84375,138202.15625
960,5.0,0.0,3.5,1.0,2470.0,1.0,10520.25,278.0,0.0,1.0,...,1646.6666,7.0,2535.0,5005.0,2008.0,0.0,1.0,615000.0,485213.21875,129786.78125
1478,1.0,0.0,2.5,1.0,2234.0,1.0,9934.5,662.0,0.0,1.0,...,1489.3334,7.0,2220.0,4454.0,2007.0,0.0,1.0,441929.0,314628.875,127300.125
1602,15.0,0.0,4.0,4.0,807.0,1.0,9839.999,1969.0,0.0,1.0,...,1076.0,10.0,3200.0,6428.0,2007.0,0.0,1.0,430000.0,319086.40625,110913.59375


### Best on df

In [44]:
df.sort_values(by=['Prediction_Error', 'Sale_Price']).head(10)

Unnamed: 0,Age,Alley_Access,Baths,Bedroom_AbvGr,Bedroom_Liv_Area_Ratio,Bsmt,Bsmt_Eval,Bsmt_Unf_SF,Central_Air,Electrical_SBrkr,...,TotBath_LivArea_Ratio,TotRms_AbvGrd,Total_Bsmt_SF,Total_SF,Year_Sold,neighborhoods_1,neighborhoods_5,Sale_Price,Predicted,Prediction_Error
2567,58.0,0.0,1.0,3.0,312.0,1.0,1450.8,624.0,0.0,1.0,...,936.0,5.0,624.0,1560.0,2006.0,1.0,0.0,97900.0,97882.5625,17.4375
909,16.0,0.0,2.0,2.0,640.0,1.0,4192.0,1280.0,0.0,1.0,...,640.0,5.0,1280.0,2560.0,2008.0,0.0,1.0,180000.0,180024.234375,24.234375
2033,68.0,0.0,1.0,3.0,374.33334,1.0,1903.2,732.0,0.0,1.0,...,1123.0,4.0,732.0,1855.0,2007.0,1.0,0.0,100000.0,100035.789062,35.789062
1640,7.0,0.0,2.0,3.0,449.66666,1.0,4148.175,1349.0,0.0,1.0,...,674.5,6.0,1349.0,2698.0,2007.0,1.0,0.0,179000.0,178946.546875,53.453125
2507,8.0,0.0,3.0,3.0,475.0,1.0,4666.875,579.0,0.0,1.0,...,712.5,5.0,1425.0,2850.0,2006.0,0.0,0.0,193000.0,192939.484375,60.515625
796,13.0,0.0,2.5,2.0,858.0,1.0,2705.9998,880.0,0.0,1.0,...,686.4,7.0,880.0,2596.0,2009.0,0.0,0.0,191000.0,190927.015625,72.984375
1240,60.0,0.0,2.0,4.0,317.75,1.0,1871.9999,720.0,0.0,1.0,...,635.5,7.0,720.0,1991.0,2008.0,1.0,0.0,135000.0,135077.53125,77.53125
226,2.0,0.0,2.0,3.0,564.6667,1.0,5886.65,1694.0,0.0,1.0,...,847.0,7.0,1694.0,3388.0,2010.0,0.0,0.0,245350.0,245266.78125,83.21875
452,11.0,0.0,3.5,4.0,653.0,1.0,3766.8748,371.0,0.0,1.0,...,1044.8,8.0,1225.0,3837.0,2009.0,0.0,1.0,336000.0,336095.25,95.25
2207,3.0,0.0,2.0,2.0,728.0,1.0,4678.275,1273.0,0.0,1.0,...,728.0,7.0,1273.0,2729.0,2006.0,0.0,0.0,215000.0,215098.578125,98.578125


### Most wrong on df_orig

In [46]:
df_orig.sort_values(by=['Prediction_Error', 'Sale_Price'], ascending=False).head(20)

Unnamed: 0,Age,Alley_Access,Baths,Bedroom_AbvGr,Bedroom_Liv_Area_Ratio,Bsmt,Bsmt_Eval,Bsmt_Unf_SF,Central_Air,Electrical_SBrkr,...,TotBath_LivArea_Ratio,TotRms_AbvGrd,Total_Bsmt_SF,Total_SF,Year_Sold,neighborhoods_1,neighborhoods_5,Sale_Price,Predicted,Prediction_Error
423,1.0,0.0,3.0,2.0,1201.0,1.0,12840.101,788.0,0.0,1.0,...,1201.0,10.0,3094.0,5496.0,2009.0,0.0,1.0,555000.0,220953.65625,334046.34375
1498,0.0,0.0,4.5,3.0,1880.6666,1.0,25356.5,466.0,0.0,1.0,...,2256.8,12.0,6110.0,11752.0,2008.0,1.0,0.0,160000.0,409873.0,249873.0
2666,114.0,0.0,2.5,4.0,902.0,1.0,3099.5999,1107.0,0.0,1.0,...,1443.2,12.0,1107.0,4715.0,2006.0,1.0,0.0,475000.0,227868.125,247131.875
1182,31.0,0.0,3.0,3.0,981.3333,1.0,3056.5498,584.0,0.0,1.0,...,981.3333,9.0,994.0,3938.0,2008.0,0.0,0.0,150000.0,375495.125,225495.125
1637,1.0,0.0,3.5,4.0,584.5,1.0,10507.0,1559.0,0.0,1.0,...,935.2,8.0,2660.0,4998.0,2007.0,0.0,1.0,591587.0,385116.46875,206470.53125
2181,0.0,0.0,4.5,3.0,1558.6666,1.0,13022.7,878.0,0.0,1.0,...,1336.0,11.0,3138.0,7814.0,2007.0,1.0,0.0,184750.0,389454.3125,204704.3125
2737,71.0,0.0,3.5,5.0,734.4,1.0,4773.5996,1411.0,0.0,1.0,...,1049.1428,7.0,1836.0,5508.0,2006.0,1.0,0.0,415000.0,231412.203125,183587.796875
2570,88.0,0.0,3.5,4.0,778.0,1.0,3535.9998,140.0,0.0,1.0,...,1556.0,8.0,1360.0,4472.0,2006.0,1.0,0.0,235000.0,416931.8125,181931.8125
1063,5.0,0.0,3.5,1.0,2470.0,1.0,10520.25,278.0,0.0,1.0,...,1646.6666,7.0,2535.0,5005.0,2008.0,0.0,1.0,615000.0,445403.5625,169596.4375
433,1.0,0.0,3.5,4.0,705.5,1.0,7196.1,1734.0,0.0,1.0,...,806.2857,12.0,1734.0,4556.0,2009.0,0.0,1.0,582933.0,417415.09375,165517.90625


### Best on df_orig

In [47]:
df_orig.sort_values(by=['Prediction_Error', 'Sale_Price']).head(20)

Unnamed: 0,Age,Alley_Access,Baths,Bedroom_AbvGr,Bedroom_Liv_Area_Ratio,Bsmt,Bsmt_Eval,Bsmt_Unf_SF,Central_Air,Electrical_SBrkr,...,TotBath_LivArea_Ratio,TotRms_AbvGrd,Total_Bsmt_SF,Total_SF,Year_Sold,neighborhoods_1,neighborhoods_5,Sale_Price,Predicted,Prediction_Error
2464,1.0,0.0,2.5,3.0,517.6667,1.0,2324.7,756.0,0.0,1.0,...,621.2,6.0,756.0,2309.0,2006.0,0.0,0.0,186500.0,186499.171875,0.828125
803,16.0,0.0,3.5,3.0,591.6667,1.0,2275.4998,227.0,0.0,1.0,...,710.0,7.0,740.0,2515.0,2009.0,1.0,0.0,213000.0,212992.625,7.375
2050,43.0,0.0,2.0,3.0,335.0,1.0,2613.0,348.0,0.0,1.0,...,1005.0,5.0,1005.0,2010.0,2007.0,1.0,0.0,115400.0,115421.4375,21.4375
2018,78.0,0.0,1.0,2.0,427.0,1.0,2163.2,832.0,0.0,0.0,...,854.0,5.0,832.0,1686.0,2007.0,1.0,0.0,132000.0,132021.625,21.625
327,37.0,0.0,2.5,2.0,472.5,1.0,3472.875,30.0,0.0,1.0,...,945.0,5.0,945.0,1890.0,2010.0,0.0,0.0,119500.0,119532.078125,32.078125
1497,49.0,0.0,3.5,5.0,764.0,0.0,0.0,0.0,0.0,1.0,...,1091.4286,11.0,0.0,3820.0,2008.0,1.0,0.0,284700.0,284664.875,35.125
2057,36.0,0.0,2.0,2.0,434.0,1.0,2668.7998,20.0,0.0,1.0,...,868.0,6.0,768.0,1636.0,2007.0,1.0,0.0,119900.0,119847.09375,52.90625
2072,23.0,0.0,2.0,3.0,507.33334,1.0,3659.2498,0.0,0.0,1.0,...,761.0,7.0,1190.0,2712.0,2007.0,0.0,0.0,182000.0,182053.0625,53.0625
2720,49.0,0.0,3.0,3.0,601.0,1.0,2948.4,284.0,0.0,1.0,...,901.5,8.0,1134.0,2937.0,2006.0,1.0,0.0,155000.0,154946.109375,53.890625
1623,11.0,0.0,2.5,3.0,517.6667,1.0,2124.825,277.0,0.0,1.0,...,621.2,6.0,691.0,2244.0,2007.0,0.0,0.0,178750.0,178693.5625,56.4375



## Investigating Instances

---

### Final Comment