# **Modelling and Evaluation**

## Objectives

**Perform Business requirement 2 user story tasks: model selection, pipeline creation, hyperparameter tuning, model evaluation.**
* Create initial data cleaning and engineering pipeline using information from previous notebooks.
* Create initial modelling and evaluation pipeline using information from previous notebooks.
* Find best model candidate.
* Optimise chosen model through tuning and feature selection using feature importance. 
* Evaluate the model performance using performance metrics.
* Successfully achieve $R^2 \ge 0.75$ for the final model, to satisfy the client's success criteria, and thereby satisfy business requirement 2.


## Inputs
* house prices dataset: outputs/datasets/collection/house_prices.csv.
* Information regarding the steps to include in the various pipelines, as indicated in the conclusion sections of the data cleaning and feature engineering notebooks.
* Outlier indices list: outputs/ml/outlier_indices.pkl

## Outputs

---

## Change working directory

Working directory changed to its parent folder.

In [None]:
import os
current_dir = os.getcwd()
current_dir

In [None]:
os.chdir(os.path.dirname(current_dir))
os.getcwd()

---

## Load house price dataset

In [None]:
import pandas as pd

house_prices_df = pd.read_csv(filepath_or_buffer='outputs/datasets/collection/house_prices.csv')

---

## Removing known outliers from the whole dataset

Loading outlier indices list

In [None]:
import joblib
outlier_indices = joblib.load('outputs/ml/outlier_indices.pkl')
outlier_indices

Removing the instances

In [None]:
house_prices_df.drop(labels=outlier_indices, inplace=True)

---

## Create data cleaning and feature engineering pipeline

In [None]:
import numpy as np
import src.ml.transformers_and_functions as tf
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, MinMaxScaler
from feature_engine.selection import SmartCorrelatedSelection, DropFeatures
from sklearn.tree import DecisionTreeRegressor

In [None]:
def data_cleaning_and_feature_engineering():
    """
    Constructs and returns data cleaning and feature engineering pipeline.
    """
    # variables for defining pipeline
    estimator = DecisionTreeRegressor(min_samples_split=10, min_samples_leaf=5, random_state=30)
    # Originally intended to include the categories parameter used previously for OrdinalEncoder in the
    # feature engineering notebook. However, this causes problems when not all category options are
    # present in the train or test set. Will use the 'auto' option instead.

    pipeline = Pipeline([
                        # Data cleaning:
                        # Missing value imputation:
                        ('IndependentKNNImputer', tf.IndependentKNNImputer()),
                        ('EqualFrequencyImputer', tf.EqualFrequencyImputer()),
                        #feature engineering:
                        # encoding:
                        ('OrdinalEncoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1, dtype='int64')),
                        # feature number reduction
                        ('CompositeSelectKBest', tf.CompositeSelectKBest()),
                        ('SmartCorrelatedSelection', SmartCorrelatedSelection(method='spearman',
                                                                              threshold=0.8, selection_method='model_performance',
                                                                              estimator=estimator, scoring='r2', cv=5)),
                        # #feature scaling:
                        ('CompositeNormaliser', tf.CompositeNormaliser())
                        ])
    return pipeline

As noted in the comments of the function above, the original intention was to specify the ordinal encoding manually using ordered category arrays, as was done in the feature engineering notebook. However, the train and test sets might not contain every category of every feature; for example, in the train set the feature 'KitchenQual' never takes the value 'Po'. The encoding is therefore determined automatically, and consequently the natural ranking of a feature's values may not be preserved.
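
One way the natural ranking could be kept is sketched below on toy data, not on the project's pipeline. sklearn's `OrdinalEncoder` accepts an explicit `categories` list, and the categories listed there do not all have to appear in the data passed to `fit`. The `quality_order` list and the two toy frames are illustrative assumptions; whether this works inside the full pipeline depends on the custom transformers upstream, which is not tested here.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Assumed quality ranking, worst to best (illustrative only)
quality_order = ['Po', 'Fa', 'TA', 'Gd', 'Ex']

# Toy train set without 'Po', toy test set containing it
train = pd.DataFrame({'KitchenQual': ['TA', 'Gd', 'Ex', 'Fa']})
test = pd.DataFrame({'KitchenQual': ['Po', 'Gd']})

# The explicit categories list fixes the code of every option up front
encoder = OrdinalEncoder(categories=[quality_order], dtype='int64')
encoder.fit(train)

train_codes = encoder.transform(train).ravel().tolist()
test_codes = encoder.transform(test).ravel().tolist()
print(train_codes, test_codes)
```

In this sketch 'Po' is encoded as 0 even though it was absent during fitting, because its position comes from the supplied list rather than from the data.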

In [None]:
data_cleaning_and_feature_engineering_pipeline = data_cleaning_and_feature_engineering()
data_cleaning_and_feature_engineering_pipeline.set_output(transform='pandas')

## Split dataset

In [None]:
from sklearn.model_selection import train_test_split

(train_set_df, test_set_df) = train_test_split(house_prices_df, test_size=0.25, random_state=30)

Splitting the train and test sets into their features and target.

In [None]:
x_train = train_set_df.drop('SalePrice', axis=1)
y_train = train_set_df['SalePrice']
x_test = test_set_df.drop('SalePrice', axis=1)
y_test = test_set_df['SalePrice']

---

## Create Scale target function

In [None]:
def scale_target(y_train, y_test=None):
    """
    Scales target for the train and or test set.

    Args:
        y_train: train target values.
        y_test: test target values.

    Returns a tuple of the scaled train and or test target series, as well as the inverse transform.
    """
    y_train = pd.DataFrame(data=y_train)
    min_max_scaler = MinMaxScaler()
    min_max_scaler.set_output(transform='pandas')
    min_max_scaler.fit(y_train)
    y_train = min_max_scaler.transform(y_train)
    inverse_transform = min_max_scaler.inverse_transform


    if y_test is not None:
        y_test = pd.DataFrame(data=y_test)
        y_test = min_max_scaler.transform(y_test)
        return (y_train.iloc[:, 0], y_test.iloc[:, 0], inverse_transform)

    return (y_train.iloc[:, 0], inverse_transform)
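
The scaling behaviour can be illustrated with a small standalone sketch. It uses toy prices (an assumption, not project data) and calls `MinMaxScaler` directly rather than `scale_target`. The scaler is fitted on the train target only, so test values outside the train range map outside [0, 1], and the inverse transform recovers the original units.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy targets (illustrative values only)
y_train_toy = pd.DataFrame({'SalePrice': [100000.0, 150000.0, 200000.0]})
y_test_toy = pd.DataFrame({'SalePrice': [125000.0, 250000.0]})

# Fit on the train target only, then apply the same scaling to both sets
scaler = MinMaxScaler().fit(y_train_toy)
y_train_scaled = scaler.transform(y_train_toy).ravel()
y_test_scaled = scaler.transform(y_test_toy).ravel()

# Map scaled values back to the original units
y_test_restored = scaler.inverse_transform(y_test_scaled.reshape(-1, 1)).ravel()
print(y_train_scaled, y_test_scaled, y_test_restored)
```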


---

## Model Grid Search CV

Initially a search will be done to find the most suitable algorithm using sklearn's 'GridSearchCV', using only the default hyperparameters for each algorithm.
Hyperparameter tuning will then be performed for the best candidate algorithm, again using 'GridSearchCV', but with multiple hyperparameter value combinations.

### Best algorithm search

Creating a search class to handle the searches.

In [None]:
from sklearn.model_selection import GridSearchCV

# taken from code-Institute-Solutions/churnometer (https://github.com/Code-Institute-Solutions/churnometer)
class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")
            model = self.models[key]

            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs,
                              verbose=verbose, scoring=scoring)
            gs.fit(X, y)
            self.grid_searches[key] = gs

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]
                scores.append(r.reshape(len(params), 1))

            all_scores = np.hstack(scores)
            for p, s in zip(params, all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)

        columns = ['estimator', 'min_score',
                   'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns], self.grid_searches

Preparing parameters for conducting search.

Creating a dictionary of candidate models.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, ExtraTreeRegressor
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor, AdaBoostRegressor

models = {'LinearRegression': LinearRegression(),
          'DecisionTreeRegressor': DecisionTreeRegressor(random_state=30),
          'RandomForestRegressor': RandomForestRegressor(random_state=30),
          'ExtraTreeRegressor': ExtraTreeRegressor(random_state=30),
          'AdaBoostRegressor': AdaBoostRegressor(random_state=30),
          'BaggingRegressor': BaggingRegressor(random_state=30)}

Defining the model parameters for each model; in this case no parameters are specified, so only the default parameters will be used, as intended.

In [None]:
default_model_params = {'LinearRegression': {},
                        'DecisionTreeRegressor': {},
                        'RandomForestRegressor': {},
                        'ExtraTreeRegressor': {},
                        'AdaBoostRegressor': {},
                        'BaggingRegressor': {}}
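
As a minimal sketch of what an empty parameter dictionary does, the example below runs `GridSearchCV` with `param_grid={}` on synthetic data from `make_regression` (an assumption for illustration). The grid then contains a single candidate, the estimator with its default hyperparameters, which is scored once per fold.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (illustrative only)
X_toy, y_toy = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=30)

# An empty grid yields exactly one candidate: the default hyperparameters
gs = GridSearchCV(LinearRegression(), param_grid={}, cv=5, scoring='r2')
gs.fit(X_toy, y_toy)

n_candidates = len(gs.cv_results_['params'])
fold_scores = [gs.cv_results_[f'split{i}_test_score'][0] for i in range(5)]
print(n_candidates, gs.best_params_, fold_scores)
```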

Applying the data cleaning and engineering pipeline to a copy of the train set features, and scaling a copy of the train target.

In [None]:
x_train_copy = x_train.copy(deep=True)
y_train_copy = y_train.copy(deep=True)

In [None]:
y_train_copy = scale_target(y_train.copy(deep=True))[0]

In [None]:
x_train_copy = data_cleaning_and_feature_engineering_pipeline.fit(x_train_copy, y_train_copy).transform(x_train_copy)

Performing the search with the created parameters.

In [None]:
search = HyperparameterOptimizationSearch(models, default_model_params)
search.fit(x_train_copy, y_train_copy, scoring='r2', cv=5, n_jobs=-1)

In [None]:
grid_search_results_summary, grid_search_pipelines = search.score_summary()

In [None]:
grid_search_results_summary

All estimators have small or fairly small (<0.1*mean) standard deviations. The best max and mean score was achieved by the 'RandomForestRegressor', closely followed by the 'BaggingRegressor'. The top three estimators all achieved min scores better than the desired minimum of $R^2=0.75$.

All things considered the 'RandomForestRegressor' seems to be the best model candidate. Hyperparameter tuning will now be performed for this model using GridSearchCV with multiple
hyperparameter combinations, aided by the use of the HyperparameterOptimizationSearch class.


### Chosen best model candidate hyperparameter tuning

Updating the models parameter

In [None]:
models = {'RandomForestRegressor': RandomForestRegressor(random_state=30)}

Choosing the model hyperparameter combinations

There are 7 hyperparameters that will be tuned:

* max_depth
* max_leaf_nodes
* min_samples_split
* min_samples_leaf
* n_estimators
* max_features
* max_samples

The ultimate goal is to avoid both under-fitting the train set (high bias, low variance) and over-fitting it (low bias, high variance).
The other factor to consider is computation cost/time and complexity: more complex models, or higher values of hyperparameters such as max_depth, take much longer to compute for the same computing power.

Ultimately there is a trade-off that must be decided subject to constraints and the success metric criteria.


Of the 7 parameters selected for tuning, some counteract or limit the effects of others for certain values. For example, increasing max_depth, with everything else held constant, leads to more nodes/layers in the tree, provided min_samples_split, max_leaf_nodes and min_samples_leaf are not preventing more nodes/layers from being formed.

What's more, many of the parameters possess threshold values, either side of which the model performance on the train set, the test set, or both increases, decreases or plateaus.
Time permitting, a large range of values (with a small step size) would ideally be trialled for each parameter, and validation curves plotted for a single parameter or a group of parameters, to discover the optimum values.

For this project, 3 values will be selected for each parameter and the impact of the various combinations assessed through using GridSearchCV via the HyperparameterOptimizationSearch class.
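
A validation curve of the kind mentioned above could be produced with sklearn's `validation_curve`. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_regression`, a small forest of 20 trees to keep it quick, and an arbitrary list of depths, none of which come from this project.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import validation_curve

# Synthetic data standing in for the transformed train set (illustrative only)
X_toy, y_toy = make_regression(n_samples=200, n_features=10, noise=20.0, random_state=30)

depths = [2, 5, 10, 20]  # assumed range of max_depth values to trial
train_scores, cv_scores = validation_curve(
    RandomForestRegressor(n_estimators=20, random_state=30),
    X_toy, y_toy,
    param_name='max_depth', param_range=depths,
    cv=5, scoring='r2')

# Each array has one row per parameter value and one column per fold
for depth, train_mean, cv_mean in zip(depths, train_scores.mean(axis=1), cv_scores.mean(axis=1)):
    print(f'max_depth={depth}: train R2={train_mean:.3f}, validation R2={cv_mean:.3f}')
```

Plotting the two mean score series against `depths` would show where the validation score plateaus while the train score keeps rising.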

**max_depth**: the maximum number of layers

* As a starting point, the number of features will be used, corresponding to a possible tree structure where each feature is used once (assuming max_features matches) to split a node. Will then also take half this value and twice this value. Hopefully this will indicate the rough location of a threshold value.
* So values [5,10,20].

**max_leaf_nodes**: the maximum number of terminal nodes. This will influence the number of nodes that can be split.

* If every node in every layer is split, then counting the root as layer 1, layer $n$ contains $2^{n-1}$ nodes. A fully split tree that ends at layer $n$ therefore has $2^{n-1}$ leaves.
* The higher the number of leaf nodes, the more nodes can be split, and the greater the complexity/computation. For $n=10$ the described structure would have $2^{9}=512$ leaves. (sklearn counts the root as depth 0, so max_depth=10 itself permits up to $2^{10}=1024$ leaves.) More leaves may also lead to over-fitting.
* Will cap the leaves at ~25% of this value.
* So considering the depth values chosen, will take values [32,130,250].
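
The leaf-count arithmetic above can be checked with a few lines. This is a sketch assuming the root is counted as layer 1, as in the bullets; `leaves_in_full_tree` is a hypothetical helper written for this illustration.

```python
def leaves_in_full_tree(layers):
    """Leaves of a fully split binary tree, counting the root as layer 1."""
    return 2 ** (layers - 1)

full_leaves = leaves_in_full_tree(10)  # structure described for n=10
capped_leaves = round(0.25 * full_leaves)  # ~25% cap on the number of leaves

# sklearn counts the root as depth 0, so max_depth=10 allows one more layer
sklearn_max_leaves = 2 ** 10

print(full_leaves, capped_leaves, sklearn_max_leaves)
```

The 25% cap of 128 leaves is close to the middle value of 130 chosen for the grid.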

**min_samples_split**: minimum number of samples at a node to split. This will counteract the number of leaves and the tree depth.
* 1460 samples. Taking the scenario where on average each split divides the samples equally, it would take $n=\log_2(1460)\approx 10.5$, so roughly 11, levels to give pure (single-sample) leaves, if every node is split. Again, all pure leaves likely leads to over-fitting.
* So in this scenario let's say leaves have 10% max of the 1460 samples, giving a min_samples_split of 8. Will use this as a starting point.
* Will choose values [4,8,64].
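
The estimate of roughly 11 levels can be reproduced as follows, a quick check assuming every split halves the samples exactly:

```python
import math

n_samples = 1460

# Number of halvings needed to go from n_samples down to a single sample
levels = math.log2(n_samples)
levels_needed = math.ceil(levels)

print(round(levels, 2), levels_needed)
```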

**min_samples_leaf**: the minimum samples needed for a node to be a leaf.
* At worst would want a leaf to have no more than 5% of the samples, and probably not 1 sample either.
* Will use values [5,35,65].

**n_estimators**: number of trees in the forest. The more the better up to a point where the performance plateaus. However more trees equals greater computation time.
* The default value is 100, so will use this as a guide.
* Will use values [50,150,250].

**max_features**: the maximum number of features in the random subset considered when searching for the best split at each node. Performance tends to increase with this value up to a point, then tails off and decreases.
* Apparently a good value for this can be obtained using sqrt(no. of features).
* Also, according to the sklearn documentation, values close to 100% of the features give good empirical results for regression.
* Will use values ['sqrt',0.66,1.0].

**max_samples**: the number (or, for a float, the fraction) of samples drawn from the train set to fit each tree.
* Apparently a larger sample size increases the performance, but this saturates quickly, and generally only a small fraction of the samples is needed to reach saturation.
* Thus will try the values [0.15,0.33,0.5].



**Setting the model parameters using the chosen values.**

In [None]:

model_params = {'RandomForestRegressor': {
    'max_depth': [5,10,20],
    'max_samples': [0.15,0.33,0.5],
    'max_features': ['sqrt',0.66,1.0],
    'n_estimators': [50,150,250],
    'min_samples_leaf': [5,35,65],
    'min_samples_split': [4,8,64],
    'max_leaf_nodes': [32,130,250]
}}

Originally attempted to use 5 values for each parameter, but the computation time was far too long.
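
To give a sense of the cost, the number of fits implied by the grid can be counted with sklearn's `ParameterGrid`. The sketch below copies the grid defined above and assumes 5-fold cross-validation, as used in the searches in this notebook.

```python
from sklearn.model_selection import ParameterGrid

grid = {
    'max_depth': [5, 10, 20],
    'max_samples': [0.15, 0.33, 0.5],
    'max_features': ['sqrt', 0.66, 1.0],
    'n_estimators': [50, 150, 250],
    'min_samples_leaf': [5, 35, 65],
    'min_samples_split': [4, 8, 64],
    'max_leaf_nodes': [32, 130, 250],
}

n_folds = 5
n_combinations = len(ParameterGrid(grid))  # 3 values for each of 7 parameters
n_fits = n_combinations * n_folds

# Same count if 5 values had been used for each parameter
n_fits_five_values = 5 ** 7 * n_folds

print(n_combinations, n_fits, n_fits_five_values)
```

With 3 values per parameter there are 2,187 combinations and 10,935 cross-validation fits; with 5 values per parameter this would rise to 390,625 fits.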

**Performing the search**

In [None]:
search = HyperparameterOptimizationSearch(models, model_params)
search.fit(x_train_copy, y_train_copy, scoring='r2', cv=5, n_jobs=-1)

In [None]:
grid_search_results_summary, grid_search_pipelines = search.score_summary()

Displaying the top 10 estimators

In [None]:
grid_search_results_summary.head(10)

The best estimator has slightly worse mean/max/min cross-validation scores on the train set than the default parameters achieved. However, this may not be a bad thing, since a model that fits the train set too well (low bias) may have higher variance.

**Retrieving the best hyperparameter combination.**

In [None]:
best_model = grid_search_results_summary.iloc[0, 0]
best_model_params = grid_search_pipelines[best_model].best_params_
print('Best model:', best_model)
print('Best hyperparameter combination:', best_model_params)

In [None]:
best_regressor = grid_search_pipelines[best_model].best_estimator_
best_regressor

**Extracting the feature importances**.

In [None]:
pipeline_features_out = data_cleaning_and_feature_engineering_pipeline['SmartCorrelatedSelection'].get_feature_names_out()
regressor_feature_importances = best_regressor.feature_importances_
feature_importances_df = pd.DataFrame(data=regressor_feature_importances, index=pipeline_features_out, columns=['importance']).sort_values(by='importance', ascending=False)
feature_importances_df

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(tight_layout=True, figsize=(13,5))
ax.set_title('Feature Importance')
sns.barplot(data=feature_importances_df, x=feature_importances_df.index, y=feature_importances_df['importance'], ax=ax)

It can be seen that half of the features are much less important than the rest, and that the feature 'OverallQual' strongly dominates.
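
For reference, the sketch below shows the same importance-ranking step on synthetic data (an assumption for illustration; the column names are made up). The impurity-based `feature_importances_` of a fitted `RandomForestRegressor` are normalised to sum to 1, so each value can be read as a share of the total importance.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data in which only 2 of the 6 features carry signal
X_toy, y_toy = make_regression(n_samples=200, n_features=6, n_informative=2, random_state=30)
columns = [f'feature_{i}' for i in range(6)]  # hypothetical column names

forest = RandomForestRegressor(n_estimators=50, random_state=30).fit(X_toy, y_toy)

# Rank features by their share of the total importance
importances = pd.Series(forest.feature_importances_, index=columns).sort_values(ascending=False)
top_two = importances.index[:2].tolist()
print(importances)
```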

### Create model pipeline

In [None]:
model_pipeline = Pipeline([
    ('RandomForestRegressor', best_regressor)
])

model_pipeline

---

## Evaluating model on train and test sets

Creating functions to evaluate model performance

In [None]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

def model_evaluation(x_train, y_train, x_test, y_test, pipelines):
    """
    Calculates predicted values for train and test sets. Prints statistics and plots assessing prediction accuracy.

    Args:
        x_train: train set feature data.
        x_test: test set feature data.
        y_train: actual train target values.
        y_test: actual test target values.
        pipelines: dictionary containing the data cleaning and engineering pipeline and the model pipeline.

    Returns a tuple of the unscaled (train, test) predicted values.
    """
    # transform test and train set features
    x_train = pipelines['data_cleaning_and_feature_engineering'].fit_transform(x_train, y_train)
    x_test = pipelines['data_cleaning_and_feature_engineering'].transform(x_test)

    # transform train and test set target
    y_train, y_test, inverse_transform = scale_target(y_train, y_test)

    # get target predictions
    predictions_train = pipelines['model'].fit(x_train, y_train).predict(x_train)
    predictions_test = pipelines['model'].predict(x_test)
    predictions = (predictions_train, predictions_test)

    # unscaling predictions
    y_train = pd.DataFrame(inverse_transform(pd.DataFrame(data=y_train)))
    y_test = pd.DataFrame(data=inverse_transform(pd.DataFrame(data=y_test)))
    predictions_train = pd.DataFrame(inverse_transform(pd.DataFrame(data=predictions[0])))
    predictions_test = pd.DataFrame(inverse_transform(pd.DataFrame(data=predictions[1])))

    predictions = (predictions_train, predictions_test)


    # # print summary performance statistics
    model_evaluation_statistics(y_train, predictions_train)
    model_evaluation_statistics(y_test, predictions_test)

    # # print prediction vs actual plots
    model_evaluation_plots(y_train, y_test, predictions)
    
    return predictions

def model_evaluation_statistics(y, prediction):
    """
    Prints statistics assessing prediction accuracy.

    Args:
        y: actual values array-like.
        prediction: predicted values array-like.
    """
    # built-in round() is used because the sklearn metrics may return plain Python floats
    print('R2 Score:', round(r2_score(y, prediction), 3))
    print('Mean Absolute Error:', round(mean_absolute_error(y, prediction), 3))
    print('Mean Squared Error:', round(mean_squared_error(y, prediction), 3))
    print('Root Mean Squared Error:', round(
        np.sqrt(mean_squared_error(y, prediction)), 3))
    print("\n")


def model_evaluation_plots(y_train, y_test, predictions, alpha_scatter=0.5):
    """
    Plots scatterplots including a line of perfect fit, for test and train actual values vs predicted values.

    Args:
        y_train: actual train target values.
        y_test: actual test target values.
        predictions: tuple of (predicted-train-target, predicted-test-target).
    """
    # plotting scatterplots with a perfect fit line
    prediction_train = predictions[0]
    prediction_test = predictions[1]
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(9, 8), tight_layout=True)
    sns.scatterplot(x=y_train.iloc[:, 0], y=prediction_train.iloc[:, 0], alpha=alpha_scatter, ax=axes[0])
    sns.lineplot(x=y_train.iloc[:, 0], y=y_train.iloc[:, 0], color='red', ax=axes[0])
    axes[0].set_xlabel("Actual")
    axes[0].set_ylabel("Predictions")
    axes[0].set_title("Train Set")

    sns.scatterplot(x=y_test.iloc[:, 0], y=prediction_test.iloc[:, 0], alpha=alpha_scatter, ax=axes[1])
    sns.lineplot(x=y_test.iloc[:, 0], y=y_test.iloc[:, 0], color='red', ax=axes[1])
    axes[1].set_xlabel("Actual")
    axes[1].set_ylabel("Predictions")
    axes[1].set_title("Test Set")

    plt.show()
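
As a quick worked example of the statistics printed by `model_evaluation_statistics`, the sketch below computes them for four toy values (illustrative only) where a single prediction is off by 1.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

actual = np.array([1.0, 2.0, 3.0, 4.0])
predicted = np.array([1.0, 2.0, 3.0, 5.0])  # only the last prediction is wrong

r2 = r2_score(actual, predicted)  # 1 - SS_res / SS_tot = 1 - 1/5
mae = mean_absolute_error(actual, predicted)  # mean of |errors| = 1/4
mse = mean_squared_error(actual, predicted)  # mean of errors**2 = 1/4
rmse = np.sqrt(mse)

print(r2, mae, mse, rmse)
```

Here the residual sum of squares is 1 and the total sum of squares about the mean of 2.5 is 5, giving $R^2 = 0.8$.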

**Evaluating the model on the train and test sets**:

In [None]:
predictions = model_evaluation(x_train, y_train, x_test, y_test, {'data_cleaning_and_feature_engineering': data_cleaning_and_feature_engineering_pipeline,
                                                                  'model': model_pipeline})

It can be seen from the scatter plots and the summary statistics for the train and test sets that the model is over-fitted: the $R^2$ value for the test set is less than half that of the train set. Additionally, the plots show that the model consistently over-predicts the sale price for the test set, but closely matches the train set. Overall for the current model, the test $R^2<0.75$, and so the success criterion is not met.

---

## **Refitting the model with fewer features**

Clearly the model is over-fitted to the train set and generalises poorly to this test set (high variance). Will try to refit the model with fewer features, with the hope of reducing the likelihood of over-fitting to the train set.

Selecting the four features with the highest feature importance

In [None]:
best_four_features = feature_importances_df.sort_values(by='importance', ascending=False).iloc[0:4,:].index.values.tolist()
features_to_drop = x_train.drop(best_four_features, axis=1).columns.tolist()
print('Best four features:', best_four_features)
print('features to drop:', features_to_drop)

Editing the data cleaning and feature engineering pipeline

In [None]:
from feature_engine.selection import DropFeatures
import src.ml.transformers_and_functions as tf


def data_cleaning_and_feature_engineering_refined():
    """
    Constructs and returns data cleaning and feature engineering pipeline.
    """
    # variables for defining pipeline
    estimator = DecisionTreeRegressor(min_samples_split=10, min_samples_leaf=5, random_state=30)
    # Originally intended to include the categories parameter used previously for OrdinalEncoder in the
    # feature engineering notebook. However, this causes problems when not all category options are
    # present in the train or test set. Will use the 'auto' option instead.

    pipeline = Pipeline([
                        # Data cleaning:
                        # Missing value imputation:
                        ('IndependentKNNImputer', tf.IndependentKNNImputer()),
                        ('EqualFrequencyImputer', tf.EqualFrequencyImputer()),
                        #feature engineering:
                        # encoding:
                        ('OrdinalEncoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1, dtype='int64')),
                        # feature number reduction
                        ('DropFeatures', DropFeatures(features_to_drop=features_to_drop)),
                        # feature scaling:
                        ('CompositeNormaliser', tf.CompositeNormaliser())
                        ])
    return pipeline

Fitting the modified data cleaning and engineering pipeline

In [None]:
data_cleaning_and_feature_engineering_pipeline_refined = data_cleaning_and_feature_engineering_refined()
data_cleaning_and_feature_engineering_pipeline_refined.set_output(transform='pandas')

Creating copies of the train set

In [None]:
x_train_copy = x_train.copy(deep=True)
y_train_copy = y_train.copy(deep=True)

Scaling the train target

In [None]:
y_train_copy = scale_target(y_train.copy(deep=True))[0]

Transforming the train set features

In [None]:
x_train_copy = data_cleaning_and_feature_engineering_pipeline_refined.fit(x_train_copy, y_train_copy).transform(x_train_copy)

Redo tuning with best four features only

In [None]:
search_new = HyperparameterOptimizationSearch(models, model_params)
search_new.fit(x_train_copy, y_train_copy, scoring='r2', cv=5, n_jobs=-1)

In [None]:
grid_search_results_summary_best_four, grid_search_pipelines_best_four = search_new.score_summary()


In [None]:
grid_search_results_summary_best_four.head(10)

Compared to the last search, the scores are lower for this smaller group of more important features. This however is desired to avoid over-fitting and bias towards the train set.

**Retrieving the best hyperparameter combination.**

In [None]:
best_model = grid_search_results_summary_best_four.iloc[0, 0]
best_model_params = grid_search_pipelines_best_four[best_model].best_params_
print('Best model:', best_model)
print('Best hyperparameter combination:', best_model_params)

In [None]:
best_regressor_best_four = grid_search_pipelines_best_four[best_model].best_estimator_
best_regressor_best_four

**Extracting the feature importance**

In [None]:
pipeline_features_out = data_cleaning_and_feature_engineering_pipeline_refined['DropFeatures'].get_feature_names_out()

In [None]:
regressor_feature_importances_best_four = best_regressor_best_four.feature_importances_
feature_importances_best_four_df = pd.DataFrame(data=regressor_feature_importances_best_four,
                                                index=pipeline_features_out, columns=['importance']).sort_values(by='importance', ascending=False)
feature_importances_best_four_df

Plotting feature importance

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(tight_layout=True, figsize=(13,5))
ax.set_title('Feature Importance')
sns.barplot(data=feature_importances_best_four_df, x=feature_importances_best_four_df.index, y=feature_importances_best_four_df['importance'], ax=ax)

The 'OverallQual' feature now dominates even more than before, with the remaining features having equal importance and a combined importance of about 40% of that of 'OverallQual'. This does seem to suggest that 'OverallQual' almost single-handedly determines the sale price prediction.

Updating the model pipeline

In [None]:
model_pipeline = Pipeline([
    ('RandomForestRegressor', best_regressor_best_four)
])

model_pipeline

Evaluating model on the train and test sets

In [None]:
predictions = model_evaluation(x_train, y_train, x_test, y_test, {'data_cleaning_and_feature_engineering': data_cleaning_and_feature_engineering_pipeline_refined,
                                                                  'model': model_pipeline})

It can be seen from the plots and the $R^2$ values that the model has significantly improved in predicting the test set target values, going from $R^2\approx0.4$ to $R^2=0.72$.
At the same time the train set performance has declined a little to $R^2=0.76$, but as mentioned before this is a consequence of reducing the degree of over-fitting. It seems that when fitting with more features, the feature importance of 'OverallQual' declines a little, and this leads to poorer performance on the test set.

Despite the improvement in performance on the test set, the model performance is marginally below the success criterion of $R^2 \ge 0.75$.

**Possible next steps**:
* Try an even smaller group of features, or a different combination of the most important features.
* Perform hyperparameter tuning with a new range of hyperparameter values.
* Make adjustments to the data cleaning and feature engineering pipeline.
* Assess the similarity of the test and train set distributions to check that a bad split, where one of the sets is not representative of the parent distribution, has not occurred. Different random_state values for splitting the dataset can also be tried to test this.
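
The last suggestion could be sketched as follows. This is a minimal illustration on synthetic data (an assumption; the seeds and the small forest are arbitrary), not a run on the house prices dataset. The same model is refitted on several train/test splits, and the spread of the test $R^2$ indicates how sensitive the result is to the split.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the prepared dataset (illustrative only)
X_toy, y_toy = make_regression(n_samples=300, n_features=8, noise=25.0, random_state=30)

test_scores = {}
for seed in [0, 10, 20, 30, 40]:  # arbitrary split seeds
    X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=seed)
    model = RandomForestRegressor(n_estimators=50, random_state=30).fit(X_tr, y_tr)
    test_scores[seed] = r2_score(y_te, model.predict(X_te))

# A large spread would suggest the score depends heavily on the particular split
spread = max(test_scores.values()) - min(test_scores.values())
print(test_scores, spread)
```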

---

## Refitting with 3 most important features

In [None]:
best_three_features = feature_importances_df.sort_values(by='importance', ascending=False).iloc[0:3,:].index.values.tolist()
features_to_drop = x_train.drop(best_three_features, axis=1).columns.tolist()
print('Best three features:', best_three_features)
print('features to drop:', features_to_drop)

In [None]:
from feature_engine.selection import DropFeatures
import src.ml.transformers_and_functions as tf


def data_cleaning_and_feature_engineering_refined():
    """
    Constructs and returns data cleaning and feature engineering pipeline.
    """
    # variables for defining pipeline
    estimator = DecisionTreeRegressor(min_samples_split=10, min_samples_leaf=5, random_state=30)
    # Originally intended to include the categories parameter used previously for OrdinalEncoder in the
    # feature engineering notebook. However, this causes problems when not all category options are
    # present in the train or test set. Will use the 'auto' option instead.

    pipeline = Pipeline([
                        # Data cleaning:
                        # Missing value imputation:
                        ('IndependentKNNImputer', tf.IndependentKNNImputer()),
                        ('EqualFrequencyImputer', tf.EqualFrequencyImputer()),
                        #feature engineering:
                        # encoding:
                        ('OrdinalEncoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1, dtype='int64')),
                        # feature number reduction
                        ('DropFeatures', DropFeatures(features_to_drop=features_to_drop)),
                        # feature scaling:
                        ('CompositeNormaliser', tf.CompositeNormaliser())
                        ])
    return pipeline

In [None]:
data_cleaning_and_feature_engineering_pipeline_refined = data_cleaning_and_feature_engineering_refined()
data_cleaning_and_feature_engineering_pipeline_refined.set_output(transform='pandas')

In [None]:
x_train_copy = x_train.copy(deep=True)
y_train_copy = y_train.copy(deep=True)

In [None]:
y_train_copy = scale_target(y_train.copy(deep=True))[0]

In [None]:
x_train_copy = data_cleaning_and_feature_engineering_pipeline_refined.fit(x_train_copy, y_train_copy).transform(x_train_copy)

In [None]:
search_new = HyperparameterOptimizationSearch(models, model_params)
search_new.fit(x_train_copy, y_train_copy, scoring='r2', cv=5, n_jobs=-1)

In [None]:
grid_search_results_summary_best_three, grid_search_pipelines_best_three = search_new.score_summary()

In [None]:
grid_search_results_summary_best_three.head(10)

In [None]:
best_model = grid_search_results_summary_best_three.iloc[0, 0]
best_model_params = grid_search_pipelines_best_three[best_model].best_params_
print('Best model:', best_model)
print('Best hyperparameter combination:', best_model_params)

In [None]:
best_regressor_best_three = grid_search_pipelines_best_three[best_model].best_estimator_
best_regressor_best_three

In [None]:
pipeline_features_out = data_cleaning_and_feature_engineering_pipeline_refined['DropFeatures'].get_feature_names_out()
regressor_feature_importances_best_three = best_regressor_best_three.feature_importances_
feature_importances_best_three_df = pd.DataFrame(data=regressor_feature_importances_best_three,
                                                index=pipeline_features_out, columns=['importance']).sort_values(by='importance', ascending=False)
feature_importances_best_three_df

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(tight_layout=True, figsize=(13,5))
ax.set_title('Feature Importance')
sns.barplot(data=feature_importances_best_three_df, x=feature_importances_best_three_df.index, y=feature_importances_best_three_df['importance'], ax=ax)

In [None]:
model_pipeline = Pipeline([
    ('RandomForestRegressor', best_regressor_best_three)
])

model_pipeline

Evaluating model on the train and test sets


In [None]:
predictions = model_evaluation(x_train, y_train, x_test, y_test, {'data_cleaning_and_feature_engineering': data_cleaning_and_feature_engineering_pipeline_refined,
                                                                  'model': model_pipeline})

---