# **Modelling and Evaluation**

## Objectives

**Perform the user story tasks for business requirement 2: model selection, pipeline creation, hyperparameter tuning and model evaluation.**
* Create initial data cleaning and engineering pipeline using information from previous notebooks.
* Create initial modelling and evaluation pipeline using information from previous notebooks.
* Find best model candidate.
* Optimise chosen model through tuning and feature selection using feature importance. 
* Evaluate the model performance using performance metrics.
* Successfully achieve $R^2 \ge 0.75$ for the final model, to satisfy the client's success criteria, and thereby satisfy business requirement 2.


## Inputs
* House prices dataset: outputs/datasets/collection/house_prices.csv.
* Information regarding the steps to include in the various pipelines, as indicated in the conclusion sections of the data cleaning and feature engineering notebooks.
* Outlier indices list: outputs/ml/outlier_indices.pkl

## Outputs

---

## Change working directory

Change the working directory from the current folder to its parent folder.

In [None]:
import os
current_dir = os.getcwd()
current_dir

In [None]:
os.chdir(os.path.dirname(current_dir))
os.getcwd()

---

## Load house price dataset

In [None]:
import pandas as pd

house_prices_df = pd.read_csv(filepath_or_buffer='outputs/datasets/collection/house_prices.csv')

---

## Removing known outliers from the whole dataset

Loading outlier indices list

In [None]:
import joblib
outlier_indices = joblib.load('outputs/ml/outlier_indices.pkl')
outlier_indices

Removing the instances

In [None]:
house_prices_df.drop(labels=outlier_indices, inplace=True)

---

## Create data cleaning and feature engineering pipeline

In [None]:
import numpy as np
import src.transformers_and_functions as tf
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, MinMaxScaler
from feature_engine.selection import SmartCorrelatedSelection, DropFeatures
from sklearn.tree import DecisionTreeRegressor

In [None]:
def data_cleaning_and_feature_engineering():
    """
    Constructs and returns data cleaning and feature engineering pipeline.
    """
    # variables for defining pipeline
    estimator = DecisionTreeRegressor(min_samples_split=10, min_samples_leaf=5, random_state=30)
    # Originally intended to pass the 'categories' parameter to OrdinalEncoder, as in the
    # feature engineering notebook. However, this causes problems when not all category
    # options are present in the train or test set, so the default 'auto' option is used instead.

    pipeline = Pipeline([
                        # Data cleaning:
                        # Missing value imputation:
                        ('IndependentKNNImputer', tf.IndependentKNNImputer()),
                        ('EqualFrequencyImputer', tf.EqualFrequencyImputer()),
                        #feature engineering:
                        # encoding:
                        ('OrdinalEncoder', OrdinalEncoder(dtype='int64')),
                        # feature number reduction
                        ('CompositeSelectKBest', tf.CompositeSelectKBest()),
                        ('SmartCorrelatedSelection', SmartCorrelatedSelection(method='spearman',
                                                                              threshold=0.8, selection_method='model_performance',
                                                                              estimator=estimator, scoring='r2', cv=5)),
                        # feature scaling (currently disabled):
                        # ('CompositeNormaliser', tf.CompositeNormaliser())
                        ])
    return pipeline

As noted in the comment inside the function above, the original intention was to specify the ordinal encoding mapping manually using ordered arrays, as was done in the feature engineering notebook. However, the train and test sets may not contain every possible value of every feature; for example, 'KitchenQual' has no 'Po' values in the train set. The encoding is therefore done automatically, so the natural ranking of a feature's values may not be preserved.
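
As a point of comparison, the sketch below shows one option that might be worth checking if the natural ordering needs to be kept: passing an explicit ordered list through `categories`. In scikit-learn, levels listed in `categories` do not all have to appear in the data being fitted; with the default `handle_unknown='error'`, an error is raised only for values that are missing from the list. The toy column and the ordering `Po < Fa < TA < Gd < Ex` are assumptions for illustration, and this has not been run against the project's pipeline.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Assumed quality ordering, worst to best (illustrative only).
quality_order = ['Po', 'Fa', 'TA', 'Gd', 'Ex']

# Toy training column that, like the train set, contains no 'Po' values.
toy_train = pd.DataFrame({'KitchenQual': ['TA', 'Gd', 'Ex', 'Fa', 'TA']})

encoder = OrdinalEncoder(categories=[quality_order], dtype='int64')
encoded = encoder.fit_transform(toy_train)

# Each level maps to its position in quality_order, so the rank is preserved.
print(encoded.ravel().tolist())
```

Whether this resolves the problems seen earlier depends on their cause, for example values in the data that are absent from the supplied lists.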

In [None]:
data_cleaning_and_feature_engineering_pipeline = data_cleaning_and_feature_engineering()
data_cleaning_and_feature_engineering_pipeline.set_output(transform='pandas')

## Split dataset

In [None]:
from sklearn.model_selection import train_test_split

(train_set_df, test_set_df) = train_test_split(house_prices_df, test_size=0.25, random_state=30)

Splitting the train and test sets into their features and target.

In [None]:
x_train = train_set_df.drop('SalePrice', axis=1)
y_train = train_set_df['SalePrice']
x_test = test_set_df.drop('SalePrice', axis=1)
y_test = test_set_df['SalePrice']

---

## Create Scale target function

In [None]:
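def scale_target(y_train, y_test=None):
    """
    Min-max scales the target using a scaler fitted on the train target only.

    Returns the scaled train target as a series; if y_test is given,
    returns a tuple of the scaled train and test targets instead.
    """
    y_train = pd.DataFrame(data=y_train)
    min_max_scaler = MinMaxScaler()
    min_max_scaler.set_output(transform='pandas')
    min_max_scaler.fit(y_train)

    y_train = min_max_scaler.transform(y_train).iloc[:, 0]

    # 'if y_test:' raises a ValueError for a pandas Series, so compare against None
    if y_test is not None:
        y_test = min_max_scaler.transform(pd.DataFrame(data=y_test)).iloc[:, 0]
        return y_train, y_test

    return y_train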
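
A minimal sketch of the idea behind target scaling, using made-up prices: the scaler is fitted on the train target only, the same fitted scaler then transforms the test target, and `inverse_transform` converts scaled values back to prices. The variable names and values are illustrative assumptions, not part of the notebook's data.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up sale prices, for illustration only.
toy_y_train = pd.DataFrame({'SalePrice': [100000, 150000, 200000]})
toy_y_test = pd.DataFrame({'SalePrice': [125000, 250000]})

scaler = MinMaxScaler()
scaler.fit(toy_y_train)  # learn the min and max from the train target only

scaled_train = scaler.transform(toy_y_train)
scaled_test = scaler.transform(toy_y_test)  # test values may fall outside [0, 1]

# Scaled values (for example model predictions) can be mapped back to price units.
restored_test = scaler.inverse_transform(scaled_test)
```

Fitting on the train target alone keeps information about the test set out of the scaling step.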


---

## Model Grid Search CV

Initially, a search is run with sklearn's 'GridSearchCV' to find the most suitable algorithm, using only the default hyperparameters of each algorithm.
Hyperparameter tuning is then performed for the best candidate algorithm, again using 'GridSearchCV', but with multiple hyperparameter value combinations.

### Best algorithm search

Creating a search class to handle the searches.

In [None]:
from sklearn.model_selection import GridSearchCV

# taken from code-Institute-Solutions/churnometer (https://github.com/Code-Institute-Solutions/churnometer)
class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")
            model = self.models[key]

            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs,
                              verbose=verbose, scoring=scoring)
            gs.fit(X, y)
            self.grid_searches[key] = gs

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]
                scores.append(r.reshape(len(params), 1))

            all_scores = np.hstack(scores)
            for p, s in zip(params, all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)

        columns = ['estimator', 'min_score',
                   'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns], self.grid_searches

Preparing parameters for conducting search.

Creating a dictionary of candidate models.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, ExtraTreeRegressor
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor, AdaBoostRegressor

models = {'LinearRegression': LinearRegression(),
          'DecisionTreeRegressor': DecisionTreeRegressor(random_state=30),
          'RandomForestRegressor': RandomForestRegressor(random_state=30),
          'ExtraTreeRegressor': ExtraTreeRegressor(random_state=30),
          'AdaBoostRegressor': AdaBoostRegressor(random_state=30),
          'BaggingRegressor': BaggingRegressor(random_state=0)}

Defining the parameters for each model; the dictionaries are intentionally left empty, so only the default hyperparameters are used.

In [None]:
default_model_params = {'LinearRegression': {},
                        'DecisionTreeRegressor': {},
                        'RandomForestRegressor': {},
                        'ExtraTreeRegressor': {},
                        'AdaBoostRegressor': {},
                        'BaggingRegressor': {}}
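
The sketch below illustrates why an empty parameter dictionary is enough for this first pass: `GridSearchCV` treats `{}` as a grid with a single candidate, namely the estimator with its existing settings. The synthetic data from `make_regression` and the `toy_` names are assumptions used only to make the example runnable.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, used only to make the sketch runnable.
X_toy, y_toy = make_regression(n_samples=60, n_features=4, noise=5.0, random_state=30)

# An empty grid yields one candidate: the estimator with its current settings.
toy_search = GridSearchCV(LinearRegression(), param_grid={}, cv=5, scoring='r2')
toy_search.fit(X_toy, y_toy)

print(toy_search.cv_results_['params'])           # a single, empty parameter set
print(toy_search.cv_results_['mean_test_score'])  # one cross-validated mean score
```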

Applying the data cleaning and feature engineering pipeline to a copy of the train set features, and scaling a copy of the train target.

In [None]:
x_train_copy = x_train.copy(deep=True)
y_train_copy = y_train.copy(deep=True)

In [None]:
x_train_copy = data_cleaning_and_feature_engineering_pipeline.fit(x_train_copy, y_train_copy).transform(x_train_copy)

In [None]:
y_train_copy = scale_target(y_train.copy(deep=True))

Performing the search with the created parameters.

In [None]:
search = HyperparameterOptimizationSearch(models, default_model_params)
search.fit(x_train_copy, y_train_copy, scoring='r2', cv=5, n_jobs=-1)

In [None]:
grid_search_results_summary, grid_search_pipelines = search.score_summary()

In [None]:
grid_search_results_summary

All estimators have small or fairly small standard deviations (less than 0.1 × the mean score). The best max and mean scores were achieved by the 'RandomForestRegressor', closely followed by the 'BaggingRegressor'. The top four estimators all achieved min scores above the desired minimum of $R^2=0.75$.
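
The checks described above can be sketched as a filter over a summary table. The scores below are placeholders chosen for illustration and are not the notebook's results; the column names follow those produced by `score_summary`, while `toy_summary`, `relative_std` and `shortlist` are hypothetical names.

```python
import pandas as pd

# Placeholder scores for illustration only; these are NOT the notebook's results.
toy_summary = pd.DataFrame({
    'estimator': ['ModelA', 'ModelB', 'ModelC'],
    'min_score': [0.78, 0.70, 0.80],
    'mean_score': [0.84, 0.79, 0.82],
    'std_score': [0.03, 0.09, 0.01],
})

# Relative spread of the fold scores: standard deviation as a fraction of the mean.
toy_summary['relative_std'] = toy_summary['std_score'] / toy_summary['mean_score']

# Keep estimators whose worst fold still meets the R^2 >= 0.75 criterion
# and whose fold-to-fold spread is below 10% of the mean score.
meets_criteria = (toy_summary['min_score'] >= 0.75) & (toy_summary['relative_std'] < 0.1)
shortlist = toy_summary[meets_criteria].sort_values('mean_score', ascending=False)

print(shortlist[['estimator', 'min_score', 'mean_score', 'relative_std']])
```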

All things considered, the 'RandomForestRegressor' seems to be the best model candidate. Hyperparameter tuning will now be performed for this model using GridSearchCV with multiple hyperparameter combinations, aided by the HyperparameterOptimizationSearch class.
