# House Prices: Advanced Regression Techniques - Part III

In this notebook XGBRegressor(), ExtraTreesRegressor(), RandomForestRegressor(), GradientBoostingRegressor(), DecisionTreeRegressor(), AdaBoostRegressor() and LGBMRegressor() are fitted and compared.

In [3]:
from load_modules_files_functions_clean import *

No. features: 79
No. numerical features: 33
No. ordinal features: 21
No. (possible) categorical features: 25 

num_cols: ['LotArea', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold', 'LotFrontage'] 

ord_cols: ['OverallQual', 'OverallCond', 'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'HeatingQC', 'KitchenQual', 'GarageQual', 'GarageCond', 'Utilities', 'Functional', 'GarageFinish', 'PavedDrive', 'Alley', 'Fence', 'FireplaceQu', 'PoolQC'] 

cat_cols: ['MSSubClass', 'MSZoning', 'Street', 'LotShape', 'LandContour', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condit

Define functions introduced in this notebook.

In [4]:
def print_cv_val_score(my_s, print_best_est = True): # This function does not work properly when it is imported from load_modules_files_functions_clean; X[cols] is not updated correctly.   
    best_est = my_s.best_estimator_
    best_est.fit(X_train, y_train)
    y_pred = best_est.predict(X_val)
    val_score = rmsle(y_val, y_pred)
    best_CV_score = my_s.best_score_        
    print('Best CV score:', round(-best_CV_score, 5))
    print('Validation score:', round(val_score, 5))
    if print_best_est:
        print(best_est)
        
def get_sub_csv(my_s, cols, name_csv): # There is a similar problem for this function as well.
    print(name_csv)
    best_est = my_s.best_estimator_    
    best_est.fit(X[cols], y)
    X_test = test[cols]
    y_pred = best_est.predict(X_test)    
    test_submission = pd.DataFrame({'Id':test['Id'], 'SalePrice':y_pred})
    test_submission.to_csv(name_csv, index=False)

def load_run_save_GSCV(key, param_grid, save_s = True):
    global results
    filename = key + '.joblib'
    if os.path.isfile(filename):
        my_s = joblib.load(filename)
    else:  
        my_s = GridSearchCV(ttr, param_grid = param_grid, cv = 5, scoring = rmsle_scorer, n_jobs = -1, verbose = 10, error_score = 'raise')
        my_s = my_s.fit(X_train, y_train)
        if save_s:
            joblib.dump(my_s, filename)
    best_est = my_s.best_estimator_
    best_est.fit(X_train, y_train)   
    y_pred = best_est.predict(X_val)
    val_score = rmsle(y_val, y_pred)
    best_CV_score = my_s.best_score_    
    results_model = pd.Series({'Best CV score': -best_CV_score, 'Val score':val_score})
    results_model.name = key
    results = pd.concat([results, results_model.to_frame().T]) # DataFrame.append() was removed in pandas 2.0.
    return my_s

def min_imp_filter(cols, feat_imps, min_imp):
    feats_keep = list(feat_imps[feat_imps > min_imp].index)
    cols_keep = []
    for col in cols:
        if col in feats_keep:
            cols_keep.append(col)
    return cols_keep
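
The helpers rmsle() and rmsle_scorer are imported from load_modules_files_functions_clean and are not shown in this notebook. The sketch below is only an assumption about what they compute: the standard root mean squared logarithmic error, negated for scoring. The negation would explain why the functions above print -best_CV_score. The names rmsle_sketch and neg_rmsle_sketch are hypothetical.

```python
import math

def rmsle_sketch(y_true, y_pred):
    """Root mean squared logarithmic error; a hypothetical stand-in for rmsle()."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError('y_true and y_pred must be non-empty and of equal length')
    squared_log_errors = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

def neg_rmsle_sketch(y_true, y_pred):
    """Negated error, so that a larger score means a better model."""
    return -rmsle_sketch(y_true, y_pred)

# Made-up prices, only to show the sign convention.
y_true = [100000.0, 200000.0, 300000.0]
y_pred = [110000.0, 190000.0, 300000.0]
score = neg_rmsle_sketch(y_true, y_pred)
print('Scorer value (negative):', round(score, 5))
print('Reported error (sign flipped back):', round(-score, 5))
```

If rmsle_scorer was built this way, best_score_ is the largest (least negative) value, and flipping the sign recovers the error that is written to the log.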

## Models

Define the dataframe where the results are stored.

In [None]:
results = pd.DataFrame({'Best CV score':[], 'Val score':[]}) 

Build the general pipeline for the models.

In [None]:
imputer = ColumnTransformer([
    ('imputer_num_cols', 'passthrough', slice(0, len(num_cols))),
    ('imputer_ord_cols', 'passthrough', slice(len(num_cols), len(num_cols + ord_cols))),
    ('imputer_cat_cols', 'passthrough' , slice(len(num_cols + ord_cols), len(num_cols + ord_cols + cat_cols))) 
])

preprocessor = ColumnTransformer([
            ('scaler_num_cols', 'passthrough', slice(0, len(num_cols))),
            ('scaler_ord_cols', 'passthrough', slice(len(num_cols), len(num_cols + ord_cols)))
            #('category_encoder_cat_cols', None, slice(len(num_cols + ord_cols), len(num_cols + ord_cols + cat_cols)))
], remainder = 'drop') # Temporary solution so that the model can be built without categorical features.

steps = [
    ('imputer', imputer),
    ('preprocessor', preprocessor),         
    ('model', None)
]

pipeline = Pipeline(steps)

ttr = TransformedTargetRegressor(regressor = pipeline, func = np.log1p, inverse_func = np.expm1)
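
The TransformedTargetRegressor above trains the pipeline on log1p(SalePrice) and maps predictions back with expm1. The plain-Python sketch below (made-up prices, hypothetical helper names) illustrates two properties this relies on: the two functions are inverses, and in log space a 10 % overestimate costs almost the same for a cheap house as for an expensive one.

```python
import math

def to_log_space(prices):
    """Forward transform applied to the target before fitting."""
    return [math.log1p(p) for p in prices]

def to_price_space(log_prices):
    """Inverse transform applied to the predictions."""
    return [math.expm1(v) for v in log_prices]

prices = [50000.0, 150000.0, 500000.0]  # Made-up values.
restored = to_price_space(to_log_space(prices))
print('Round trip:', [round(p, 2) for p in restored])

# Error in log space caused by a 10 % overestimate.
cheap_err = math.log1p(1.1 * 50000.0) - math.log1p(50000.0)
expensive_err = math.log1p(1.1 * 500000.0) - math.log1p(500000.0)
print('Log error, cheap house:', round(cheap_err, 5))
print('Log error, expensive house:', round(expensive_err, 5))
```

Minimizing squared error on the transformed target therefore optimizes a quantity close to the RMSLE used for scoring, assuming rmsle() follows the standard definition.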

Use the default XGBRegressor() and define a special param_grid in which skipping imputation is one of the options (XGBoost can handle missing values natively).

In [None]:
xgbr = XGBRegressor(objective = 'reg:squarederror', random_state = 1)
key = 'xgbr_default'

param_grid = [
    {
    'regressor__imputer__imputer_num_cols': ['passthrough'],
    'regressor__imputer__imputer_ord_cols': ['passthrough'],
    'regressor__imputer__imputer_cat_cols': ['passthrough'],
    'regressor__model': [xgbr],       
    },    
    {
    'regressor__imputer__imputer_num_cols': [SimpleImputer(fill_value = -999)],
    'regressor__imputer__imputer_num_cols__strategy': ['mean', 'median', 'most_frequent', 'constant'],
    'regressor__imputer__imputer_ord_cols': [SimpleImputer(fill_value = -999)],
    'regressor__imputer__imputer_ord_cols__strategy': ['mean', 'median', 'most_frequent', 'constant'],
    'regressor__imputer__imputer_cat_cols': [SimpleImputer(strategy = 'most_frequent')],
    'regressor__model': [xgbr],    
    }
]

#my_s = load_run_save_GSCV(key, param_grid)

Define the param_grid used for all other models.

In [None]:
param_grid = {
    'regressor__imputer__imputer_num_cols': [SimpleImputer(fill_value = -999)],
    'regressor__imputer__imputer_num_cols__strategy': ['mean', 'median', 'most_frequent', 'constant'],
    'regressor__imputer__imputer_ord_cols': [SimpleImputer(fill_value = -999)],
    'regressor__imputer__imputer_ord_cols__strategy': ['mean', 'median', 'most_frequent', 'constant'],
    'regressor__imputer__imputer_cat_cols': [SimpleImputer(strategy = 'most_frequent')],
    'regressor__model': [None],
}

Use the default ExtraTreesRegressor().

In [None]:
etr = ExtraTreesRegressor(random_state = 1)
key = 'etr_default'
param_grid['regressor__model'] = [etr]
#my_s = load_run_save_GSCV(key, param_grid)

Use the default RandomForestRegressor().

In [None]:
rfr = RandomForestRegressor(random_state = 1)
key = 'rfr_default'
param_grid['regressor__model'] = [rfr]
#my_s = load_run_save_GSCV(key, param_grid)

Use the default GradientBoostingRegressor().

In [None]:
gbr = GradientBoostingRegressor(random_state = 1)
key = 'gbr_default'
param_grid['regressor__model'] = [gbr]
my_s = load_run_save_GSCV(key, param_grid)

Use the default DecisionTreeRegressor().

In [None]:
dtr = DecisionTreeRegressor(random_state = 1)
key = 'dtr_default'
param_grid['regressor__model'] = [dtr]
#my_s = load_run_save_GSCV(key, param_grid)

Use the default AdaBoostRegressor().

In [None]:
abr = AdaBoostRegressor(random_state = 1)
key = 'abr_default'
param_grid['regressor__model'] = [abr]
#my_s = load_run_save_GSCV(key, param_grid)

Use the default LGBMRegressor().

In [None]:
lgbmr = LGBMRegressor(random_state = 1)
key = 'lgbmr_default'
param_grid['regressor__model'] = [lgbmr]
my_s = load_run_save_GSCV(key, param_grid)

In [None]:
results

The GradientBoostingRegressor() has both the best CV score and the best validation score.
Let's submit this model and explore it further.

In [None]:
key = 'gbr_default'
my_s = load_run_save_GSCV(key, param_grid)
print_cv_val_score(my_s, print_best_est = False)

In [None]:
name = 'gbr_sub_num_ord.csv'
#get_sub_csv(my_s, num_cols + ord_cols + cat_cols, name)

Plot the feature importances.

In [None]:
model = my_s.best_estimator_.regressor_.named_steps.model
feat_imps = pd.Series(model.feature_importances_, index=X[num_cols + ord_cols].columns)
plt.figure(figsize = (17.5, 5))
feat_imps.sort_values(ascending = False).plot(kind='bar', rot = 60)

Apply a minimum importance filter.

In [None]:
#min_imp = 0
#min_imp = 0.01
min_imp = 0.001 # The best one found so far.
#min_imp = 0.0001
#min_imp = 10**(-5)

min_imp = -1 # Keep all features.

print('No. features before min importance filter:', len(num_cols + ord_cols))
num_cols = min_imp_filter(num_cols, feat_imps, min_imp)
ord_cols = min_imp_filter(ord_cols, feat_imps, min_imp)
print('No. features after min importance filter', len(num_cols + ord_cols))
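
The filter keeps a feature only if its importance is strictly greater than min_imp. The sketch below is a dictionary-based analogue of min_imp_filter() with made-up importances; the name min_imp_filter_dict and all values are hypothetical. It shows why min_imp = 0 drops features that the model never used, while min_imp = -1 keeps everything.

```python
def min_imp_filter_dict(cols, feat_imps, min_imp):
    """Keep the columns whose importance is strictly greater than min_imp."""
    return [col for col in cols if col in feat_imps and feat_imps[col] > min_imp]

# Made-up importances; the real ones come from model.feature_importances_.
feat_imps = {
    'OverallQual': 0.55,
    'GrLivArea': 0.30,
    'LotArea': 0.1495,
    'MiscVal': 0.0005,
    'PoolArea': 0.0,
}
cols = ['LotArea', 'GrLivArea', 'PoolArea', 'MiscVal']

keep_all = min_imp_filter_dict(cols, feat_imps, -1)     # Nothing is removed.
drop_unused = min_imp_filter_dict(cols, feat_imps, 0)   # Zero-importance features are removed.
strict = min_imp_filter_dict(cols, feat_imps, 0.001)    # Weak features are removed as well.

print(keep_all)
print(drop_unused)
print(strict)
```

The original column order is preserved, which matters because the pipeline selects columns by position.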

We must get new train and validation sets since num_cols and ord_cols have changed and the pipeline selects the columns by position.

In [None]:
X_train, X_val, y_train, y_val = get_train_val_sets(X, y, num_cols + ord_cols + cat_cols) 
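
The ColumnTransformer steps in this notebook pick their columns with positional slices, so the data must be laid out as num_cols + ord_cols + cat_cols with the current lists. The sketch below uses plain lists and a hypothetical helper, block_slices(), to show what goes wrong when the lists are filtered but the data layout is not refreshed.

```python
def block_slices(num_cols, ord_cols, cat_cols):
    """Positional slices for the three column blocks, in the order num, ord, cat."""
    n_num, n_ord, n_cat = len(num_cols), len(ord_cols), len(cat_cols)
    return {
        'num': slice(0, n_num),
        'ord': slice(n_num, n_num + n_ord),
        'cat': slice(n_num + n_ord, n_num + n_ord + n_cat),
    }

num_cols = ['LotArea', 'GrLivArea', 'YearBuilt']
ord_cols = ['OverallQual', 'KitchenQual']
cat_cols = ['MSZoning']
old_layout = num_cols + ord_cols + cat_cols

# Suppose the importance filter keeps only one numerical feature.
filtered_num = ['GrLivArea']
new_slices = block_slices(filtered_num, ord_cols, cat_cols)

# Stale layout: the 'ord' slice now points at numerical columns.
stale_ord = old_layout[new_slices['ord']]

# Fresh layout, rebuilt from the filtered lists: the slice is correct again.
new_layout = filtered_num + ord_cols + cat_cols
fresh_ord = new_layout[new_slices['ord']]

print('Stale:', stale_ord)
print('Fresh:', fresh_ord)
```

This is why both the train/validation sets and the pipeline are rebuilt every time one of the column lists changes.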

Rebuild the general pipeline with the new num_cols and ord_cols.

In [None]:
imputer = ColumnTransformer([
    ('imputer_num_cols', 'passthrough', slice(0, len(num_cols))),
    ('imputer_ord_cols', 'passthrough', slice(len(num_cols), len(num_cols + ord_cols))),
    ('imputer_cat_cols', 'passthrough' , slice(len(num_cols + ord_cols), len(num_cols + ord_cols + cat_cols))) 
])

preprocessor = ColumnTransformer([
    ('scaler_num_cols', 'passthrough', slice(0, len(num_cols))),
    ('scaler_ord_cols', 'passthrough', slice(len(num_cols), len(num_cols + ord_cols)))
    #('category_encoder_cat_cols', None, slice(len(num_cols + ord_cols), len(num_cols + ord_cols + cat_cols)))
], remainder = 'drop') # Temporary solution so that the model can be built without categorical features.

steps = [
    ('imputer', imputer),
    ('preprocessor', preprocessor),         
    ('model', None)
]

pipeline = Pipeline(steps)

ttr = TransformedTargetRegressor(regressor = pipeline, func = np.log1p, inverse_func = np.expm1)

Fit the same model with fewer features.

In [None]:
gbr = GradientBoostingRegressor(random_state = 1)
key = 'gbr_default_min_imp' + str(min_imp)
param_grid['regressor__model'] = [gbr]
my_s = load_run_save_GSCV(key, param_grid)

In [None]:
print_cv_val_score(my_s, print_best_est = False)

Submit if the model gets a better CV or Val score.

In [None]:
name = 'gbr_sub_num_ord_min_imp' + str(min_imp) + '.csv'
#get_sub_csv(my_s, num_cols + ord_cols + cat_cols, name)

Plot the feature importances again.

In [None]:
model = my_s.best_estimator_.regressor_.named_steps.model
feat_imps = pd.Series(model.feature_importances_, index=X[num_cols + ord_cols].columns)
plt.figure(figsize = (17.5, 5))
feat_imps.sort_values(ascending = False).plot(kind='bar', rot = 60)

**Comment:** The order of the importances of different features has changed.

Consider potential categorical features after EDA.

In [None]:
#cat_cols = ['MSSubClass']
#cat_cols = ['LandContour']
#cat_cols = ['MSSubClass', 'LandContour']
#cat_cols = ['MSSubClass', 'LandContour', 'MSZoning']
#cat_cols = ['LandContour', 'MSZoning']
#cat_cols = ['LandContour', 'MSZoning', 'LotShape', 'LotConfig']
#cat_cols = ['MSSubClass', 'LandContour', 'MSZoning', 'LotShape', 'LotConfig']

**Comment:** Adding the categorical features one at a time is quite inefficient.
Instead, we will add all possible categorical features, apply one-hot encoding (OHE) and see what the results are.


We must get new train and validation sets that take into account cat_cols. 

In [None]:
X_train, X_val, y_train, y_val = get_train_val_sets(X, y, num_cols + ord_cols + cat_cols)

Build the pipeline with OneHotEncoder() for the categorical features.

In [None]:
imputer = ColumnTransformer([
    ('imputer_num_cols', 'passthrough', slice(0, len(num_cols))),
    ('imputer_ord_cols', 'passthrough', slice(len(num_cols), len(num_cols + ord_cols))),
    ('imputer_cat_cols', 'passthrough' , slice(len(num_cols + ord_cols), len(num_cols + ord_cols + cat_cols))) 
])

preprocessor = ColumnTransformer([
            ('scaler_num_cols', 'passthrough', slice(0, len(num_cols))),
            ('scaler_ord_cols', 'passthrough', slice(len(num_cols), len(num_cols + ord_cols))),
            ('category_encoder_cat_cols', OneHotEncoder(handle_unknown = 'ignore'), slice(len(num_cols + ord_cols), len(num_cols + ord_cols + cat_cols)))
]) 

steps = [
    ('imputer', imputer),
    ('preprocessor', preprocessor),         
    ('model', None)
]

pipeline = Pipeline(steps)

ttr = TransformedTargetRegressor(regressor = pipeline, func = np.log1p, inverse_func = np.expm1)

Define a param_grid where different imputation strategies for the categorical features are considered.

In [None]:
param_grid = {
    'regressor__imputer__imputer_num_cols': [SimpleImputer(fill_value = -999)],
    'regressor__imputer__imputer_num_cols__strategy': ['mean', 'median', 'most_frequent', 'constant'],
    'regressor__imputer__imputer_ord_cols': [SimpleImputer(fill_value = -999)],
    'regressor__imputer__imputer_ord_cols__strategy': ['mean', 'median', 'most_frequent', 'constant'],
    'regressor__imputer__imputer_cat_cols': [SimpleImputer(fill_value = 'MISS')],
    'regressor__imputer__imputer_cat_cols__strategy': ['most_frequent', 'constant'],
    'regressor__model': [None],
}

Fit the model with categorical features.

In [None]:
gbr = GradientBoostingRegressor(random_state = 1)
key = 'gbr_default_min_imp' + str(min_imp) + '_cat'
param_grid['regressor__model'] = [gbr]
my_s = load_run_save_GSCV(key, param_grid, save_s = True)

In [None]:
print_cv_val_score(my_s, print_best_est = False)

Extract the fitted OneHotEncoder() from the best estimator.

In [None]:
ohe = my_s.best_estimator_.regressor_.named_steps.preprocessor.named_transformers_.category_encoder_cat_cols

Plot feature importances (including OHE features).

In [None]:
model = my_s.best_estimator_.regressor_.named_steps.model
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there.
feat_imps = pd.Series(model.feature_importances_, index = list(X[num_cols + ord_cols].columns) + list(ohe.get_feature_names(cat_cols)))
plt.figure(figsize = (105, 10))
feat_imps.sort_values(ascending = False).plot(kind='bar', rot = 60)

In [None]:
print('No. features after OHE:', len(model.feature_importances_))
print('No. features after OHE larger than 0:', len(model.feature_importances_[model.feature_importances_ > 0]))

**Comment:** Feature selection could (and probably should) be performed.
It is worth noting that some levels of the OHE features are more important than others.

In [None]:
name = 'gbr_sub_num_ord_cat_min_imp' + str(min_imp) + '.csv'
#get_sub_csv(my_s, num_cols + ord_cols + cat_cols, name)

Extract the best imputation strategies found in the previous param_grid.

In [None]:
my_imputer = my_s.best_estimator_.regressor_.named_steps.imputer.named_transformers_

Define the param_grid used for HPO of GradientBoostingRegressor().

In [None]:
param_grid = {
    'regressor__imputer__imputer_num_cols': [SimpleImputer(fill_value = -999)],
    'regressor__imputer__imputer_num_cols__strategy': [my_imputer.imputer_num_cols.strategy],
    'regressor__imputer__imputer_ord_cols': [SimpleImputer(fill_value = -999)],
    'regressor__imputer__imputer_ord_cols__strategy': [my_imputer.imputer_ord_cols.strategy],
    'regressor__imputer__imputer_cat_cols': [SimpleImputer(fill_value = 'MISS')], # String fill value, as in the previous grid; -999 would mix types in string columns.
    'regressor__imputer__imputer_cat_cols__strategy': [my_imputer.imputer_cat_cols.strategy],
    'regressor__model': [None],    
    'regressor__model__loss': ['ls', 'lad', 'huber', 'quantile'], # 'ls' and 'lad' were renamed to 'squared_error' and 'absolute_error' in scikit-learn 1.0.
    'regressor__model__learning_rate': [0.01, 0.02, 0.05, 0.1, 1],
    'regressor__model__n_estimators': [100, 200, 300, 400, 500],
    'regressor__model__max_depth': [1, 2, 3, 4, 5, 6],
    'regressor__model__max_features': [None, 'sqrt'],   
    'regressor__model__min_samples_leaf': [1, 3, 5],
    'regressor__model__min_samples_split': [2, 4, 8],
    'regressor__model__ccp_alpha': [0, 0.1, 1, 10]
}

# param_grid obtained after HPO on personal computer.
param_grid = {
    'regressor__imputer__imputer_num_cols': [SimpleImputer(fill_value = -999)],
    'regressor__imputer__imputer_num_cols__strategy': [my_imputer.imputer_num_cols.strategy],
    'regressor__imputer__imputer_ord_cols': [SimpleImputer(fill_value = -999)],
    'regressor__imputer__imputer_ord_cols__strategy': [my_imputer.imputer_ord_cols.strategy],
    'regressor__imputer__imputer_cat_cols': [SimpleImputer(fill_value = 'MISS')], # String fill value, as in the previous grid; -999 would mix types in string columns.
    'regressor__imputer__imputer_cat_cols__strategy': [my_imputer.imputer_cat_cols.strategy],
    'regressor__model': [None],    
    'regressor__model__loss': ['huber'],
    'regressor__model__learning_rate': [0.05],
    'regressor__model__n_estimators': [500],
    'regressor__model__max_depth': [3],
    'regressor__model__max_features': ['sqrt'],
    'regressor__model__min_samples_leaf': [1],
    'regressor__model__min_samples_split': [8],
    'regressor__model__ccp_alpha': [0],
}
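
The size of a grid is the product of the number of values per hyperparameter, which is why the full HPO grid is so expensive. The sketch below only does the arithmetic for the first param_grid above, with and without the imputation strategies of the earlier grid; the dictionary keys are shortened labels, not pipeline parameter names.

```python
from math import prod

# Number of values per hyperparameter in the full HPO grid.
model_grid = {
    'loss': 4,
    'learning_rate': 5,
    'n_estimators': 5,
    'max_depth': 6,
    'max_features': 2,
    'min_samples_leaf': 3,
    'min_samples_split': 3,
    'ccp_alpha': 4,
}
# Number of imputation strategies tried in the earlier grid (num, ord, cat).
imputer_grid = {'num_strategy': 4, 'ord_strategy': 4, 'cat_strategy': 2}
cv_folds = 5

model_candidates = prod(model_grid.values())
model_fits = model_candidates * cv_folds
joint_candidates = model_candidates * prod(imputer_grid.values())

print('Candidates with fixed imputation:', model_candidates)
print('Fits with fixed imputation (5-fold CV):', model_fits)
print('Candidates when imputation is searched too:', joint_candidates)
```

Searching the imputation strategies jointly multiplies the work by 32, which is the same order of magnitude as the jump from roughly 29 hours to roughly 45 days noted in the log. A randomized search over the same ranges would be one way to cap the number of fits.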

In [None]:
gbr = GradientBoostingRegressor(random_state = 1)
key = 'gbr_default_min_imp' + str(min_imp) + '_cat_HPO'
param_grid['regressor__model'] = [gbr]
my_s = load_run_save_GSCV(key, param_grid, save_s = False)

In [None]:
print_cv_val_score(my_s, print_best_est = False) 

In [None]:
name = 'gbr_sub_num_ord_cat_min_imp' + str(min_imp) + '_HPO_GBR.csv'
get_sub_csv(my_s, num_cols + ord_cols + cat_cols, name)

Plot feature importances of the model where HPO of the GBR has been performed.

In [None]:
ohe = my_s.best_estimator_.regressor_.named_steps.preprocessor.named_transformers_.category_encoder_cat_cols
model = my_s.best_estimator_.regressor_.named_steps.model
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there.
feat_imps = pd.Series(model.feature_importances_, index = list(X[num_cols + ord_cols].columns) + list(ohe.get_feature_names(cat_cols)))
plt.figure(figsize = (105, 10))
feat_imps.sort_values(ascending = False).plot(kind='bar', rot = 60)

In [None]:
print('No. features after OHE:', len(model.feature_importances_))
print('No. features after OHE larger than 0:', len(model.feature_importances_[model.feature_importances_ > 0]))

**Comment:** After the HPO, the two most important features remain the same, but their relative importance has decreased; the model is now much better at using other features.
It is also important to note that the number of features with importance > 0 has increased massively, from 112 out of 238 features to 207 out of 240.
It is, however, strange that len(feature_importances) = 240 now, compared to 238 previously.
The order of the importances after the two most important features is also quite different now.
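
One possible explanation for the change from 238 to 240 columns (an assumption, not verified against the data): get_sub_csv() refits my_s.best_estimator_ in place on the full X, whereas load_run_save_GSCV() fits it on X_train only. A one-hot encoder creates one column per level seen during fitting, so fitting on more rows can add columns. The plain-Python sketch below uses made-up rows and a hypothetical helper, one_hot_width(), to show the effect; it does not call OneHotEncoder() itself.

```python
def one_hot_width(rows, cat_cols):
    """Number of one-hot columns: one per distinct level seen in the fitted rows."""
    return sum(len({row[col] for row in rows}) for col in cat_cols)

# Made-up data; the levels 'FV' and 'Grvl' only occur outside the training rows.
all_rows = [
    {'MSZoning': 'RL', 'Street': 'Pave'},
    {'MSZoning': 'RM', 'Street': 'Pave'},
    {'MSZoning': 'FV', 'Street': 'Grvl'},
    {'MSZoning': 'RL', 'Street': 'Pave'},
]
train_rows = all_rows[:2]  # Hypothetical training split.
cat_cols = ['MSZoning', 'Street']

width_train = one_hot_width(train_rows, cat_cols)
width_all = one_hot_width(all_rows, cat_cols)

print('One-hot columns when fitted on the training rows:', width_train)
print('One-hot columns when fitted on all rows:', width_all)
```

If this is the cause, extracting the feature importances before calling get_sub_csv(), or refitting a copy of the estimator for the submission, would keep the two counts comparable.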

Try out some feature engineering ideas.

In [None]:
#X['DiffYearRemodAddBuilt'] = X['YearRemodAdd'] - X['YearBuilt']
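X['DivGrLivAreaFullBath'] = X['GrLivArea'] / X['FullBath'].replace(0, np.nan) # Avoid inf for houses without a full bath.
X['DivGrLivAreaBedroomAbvGr'] = X['GrLivArea'] / X['BedroomAbvGr'].replace(0, np.nan) # Avoid inf for houses without a bedroom above grade.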

#num_cols.append('DiffYearRemodAddBuilt')

#num_cols.remove('YearRemodAdd')

#num_cols.remove('YearBuilt')

#num_cols.remove('YearRemodAdd')
#num_cols.remove('YearBuilt')

In [None]:
#plot_cols = ['YearBuilt', 'YearRemodAdd', 'DiffYearRemodAddBuilt']
plot_cols = ['GrLivArea', 'FullBath', 'BedroomAbvGr', 'DivGrLivAreaFullBath', 'DivGrLivAreaBedroomAbvGr']
fig, axes = plt.subplots(1, len(plot_cols), figsize = (50, 10))
for i, col in enumerate(plot_cols):
    #sns.scatterplot(data = pd.concat([X[plot_cols], y], axis = 1), x = col, y = 'SalePrice', ax = axes.flat[i], alpha = 0.2)
    sns.regplot(data = pd.concat([X[plot_cols], y], axis = 1), x = col, y = 'SalePrice', ax = axes.flat[i],  line_kws = {"color": "red"}) 

We must get new train and validation sets that take into account the updated cols.

In [None]:
X_train, X_val, y_train, y_val = get_train_val_sets(X, y, num_cols + ord_cols + cat_cols)

Rebuild the pipeline. 

In [None]:
imputer = ColumnTransformer([
    ('imputer_num_cols', 'passthrough', slice(0, len(num_cols))),
    ('imputer_ord_cols', 'passthrough', slice(len(num_cols), len(num_cols + ord_cols))),
    ('imputer_cat_cols', 'passthrough' , slice(len(num_cols + ord_cols), len(num_cols + ord_cols + cat_cols))) 
])

preprocessor = ColumnTransformer([
            ('scaler_num_cols', 'passthrough', slice(0, len(num_cols))),
            ('scaler_ord_cols', 'passthrough', slice(len(num_cols), len(num_cols + ord_cols))),
            ('category_encoder_cat_cols', OneHotEncoder(handle_unknown = 'ignore'), slice(len(num_cols + ord_cols), len(num_cols + ord_cols + cat_cols)))
]) 

steps = [
    ('imputer', imputer),
    ('preprocessor', preprocessor),         
    ('model', None)
]

pipeline = Pipeline(steps)

ttr = TransformedTargetRegressor(regressor = pipeline, func = np.log1p, inverse_func = np.expm1)

In [None]:
# param_grid obtained after HPO on personal computer.
param_grid = {
    'regressor__imputer__imputer_num_cols': [SimpleImputer(fill_value = -999)],
    'regressor__imputer__imputer_num_cols__strategy': [my_imputer.imputer_num_cols.strategy],
    'regressor__imputer__imputer_ord_cols': [SimpleImputer(fill_value = -999)],
    'regressor__imputer__imputer_ord_cols__strategy': [my_imputer.imputer_ord_cols.strategy],
    'regressor__imputer__imputer_cat_cols': [SimpleImputer(fill_value = 'MISS')], # String fill value, as in the previous grid; -999 would mix types in string columns.
    'regressor__imputer__imputer_cat_cols__strategy': [my_imputer.imputer_cat_cols.strategy],
    'regressor__model': [None],    
    'regressor__model__loss': ['huber'],
    'regressor__model__learning_rate': [0.05],
    'regressor__model__n_estimators': [500],
    'regressor__model__max_depth': [3],
    'regressor__model__max_features': ['sqrt'],
    'regressor__model__min_samples_leaf': [1],
    'regressor__model__min_samples_split': [8],
    'regressor__model__ccp_alpha': [0],
}

In [None]:
gbr = GradientBoostingRegressor(random_state = 1)
key = 'gbr_default_min_imp' + str(min_imp) + '_cat_HPO_FE'
param_grid['regressor__model'] = [gbr]
my_s = load_run_save_GSCV(key, param_grid, save_s = False)

In [None]:
print_cv_val_score(my_s, print_best_est = False)

## Log

Default GradientBoostingRegressor() (54 features) ---> CV: 0.13070, Val: 0.13013, Test: 0.13677 (worse than the best XGBRegressor() but still really good)
**Note:** We will use this model as the source of feature importances that will be used for our filter.

min_imp = 0 (51 features) ---> CV: 0.13122, Val: 0.13039 (slightly worse)

min_imp = 0.01 (14 features) ---> CV: 0.16076, Val: 0.17116 (much worse)

min_imp = 0.001 (28 features) ---> CV: 0.15107, Val: 0.15470 (worse)

min_imp = 0.0001 (45 features) ---> CV: 0.13063, Val: 0.12999, Test: 0.13692 (better CV and Val score, but slightly worse Test score)

min_imp = 10^(-5) (50 features) ---> CV: 0.13110, Val: 0.13530 (worse)

**UPDATE 2020-05-27**
Found a major error in the code with respect to how num_cols and ord_cols are used to extract data from X in the pipeline. All the min_imp results must be redone.

min_imp = 0 (51 features) ---> CV: 0.13073, Val: 0.13092, Test: 0.13637 (better)

min_imp = 0.01 (14 features) ---> CV: 0.13305, Val: 0.13488 (worse)

min_imp = 0.001 (29 features) ---> CV: 0.13075, Val: 0.13140, Test: 0.13526 (better)

min_imp = 0.0001 (45 features) ---> CV: 0.13092, Val: 0.13089, Test: 0.13692 (worse)

min_imp = 10^(-5) (50 features) ---> CV: 0.13090, Val: 0.13425 (worse)

**Note:** Could try 0.0001 < min_imp < 0.001

**We will keep min_imp = 0.001.**

**We will now consider potential categorical features.**

MSSubClass (only) (15 levels) ---> CV: 0.13035, Val: 0.13169

LandContour (only) (4 levels) ---> CV: 0.13042, Val: 0.13171

MSSubClass + LandContour ---> CV: 0.13123, Val: 0.13058, Test: 0.13424 (better, even than xgboost)

MSSubClass + LandContour + MSZoning ---> CV: 0.13058, Val: 0.12901, Test: 0.13831 (worse)

LandContour + MSZoning ---> CV: 0.12852, Val: 0.12516, Test: 0.13730 (worse)
**Note:** It seems as if MSSubClass does contain useful information.

LandContour + MSZoning + LotShape ---> CV: 0.12973, Val: 0.12885, Test: 0.13507 (worse)

LandContour + MSZoning + LotShape + LotConfig ---> CV: 0.12954, Val: 0.12718

MSSubClass + LandContour + MSZoning + LotShape + LotConfig ---> CV: 0.12988, Val: 0.12566, Test: 0.13651

All cat_cols ---> CV: 0.12848, Val: 0.12391, Test: 0.13358 (better)

All num_cols + ord_cols + cat_cols (min_imp = -1) ---> CV: 0.12817, Val: 0.12800, Test: 0.13248 (better)

**HPO of the GradientBoostingRegressor()**

Keep the previously used param_grid, where we try different imputation techniques, and add hyperparameters of the GradientBoostingRegressor() ---> GridSearchCV() will take too much time to finish (45 days).

Extract the best imputation strategies found with the previous param_grid, and add hyperparameters of the GradientBoostingRegressor() ---> CV: 0.12016, Val: 0.11978, Test: 0.12588 (better)
**Note:** 
* GridSearchCV() would take too much time to finish on Kaggle (29h), but it was possible to run on my personal computer.
* n_estimators = 500 was selected, which was the max. We should consider larger values.
* max_features = 'sqrt' was selected, probably due to the high dimensional input matrix.
* ccp_alpha = 0 was selected, which means that no cost-complexity pruning was applied. We might consider setting ccp_alpha > 0 manually to see if the Val score improves.

**Feature Engineering** (after HPO of the GradientBoostingRegressor())

Add DiffYearRemodAddBuilt ---> CV: 0.12074, Val: 0.12628 (worse)

Add DiffYearRemodAddBuilt, remove YearRemodAdd ---> CV: 0.12268, Val: 0.12000

Add DiffYearRemodAddBuilt, remove YearBuilt ---> CV: 0.12198, Val: 0.12288

Add DiffYearRemodAddBuilt, remove YearRemodAdd and YearBuilt ---> CV: 0.12437, Val: 0.12753

**Comment:** It probably does not make sense to do FE after the HPO since we have such a highly optimized model. Better to work with one of the previous models that are fairly "raw". ---> Create new notebook for this.

## Various notes

For the XGBRegressor(), when HPO of the pipeline was performed, the best model found did indeed perform better when some kind of missing value imputation was applied.

The models, XGBRegressor() and GradientBoostingRegressor() in particular, seem to be able to handle many features as input and select the important ones (they can find the signal).

The top 5 features chosen by GradientBoostingRegressor() are: OverallQual, GrLivArea, TotalBsmtSF, KitchenQual and ExterQual.
As expected, we see clearly distinct SalePrice ranges for each level of the ordinal features, or a high correlation with SalePrice if the feature is continuous.

The top 5 OHE features chosen by GradientBoostingRegressor() are: MSZoning_RM, CentralAir_N, MSZoning_RL, Neighborhood_Crawfor, CentralAir_Y.
It is not quite as obvious why these OHE features are valuable.