# **Modelling and Evaluation**

## Objectives

**Perform business requirement 2 user story tasks: model selection, pipeline creation, hyperparameter tuning, model evaluation.**
* Create initial data cleaning and engineering pipeline using information from previous notebooks.
* Create initial modelling and evaluation pipeline using information from previous notebooks.
* Find best model candidate.
* Optimise chosen model through tuning and feature selection using feature importance. 
* Evaluate the model performance using performance metrics.
* Successfully achieve $R^2 \ge 0.75$ for the final model, to satisfy the client's success criteria, and thereby satisfy business requirement 2.
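
As a minimal sketch of how the success criterion in the last objective could be checked, the snippet below computes $R^2$ with sklearn's `r2_score` and compares it with the 0.75 threshold. The arrays `y_true` and `y_pred` and the helper `meets_success_criterion` are hypothetical and exist only for illustration; they are not values or functions from this project.

```python
import numpy as np
from sklearn.metrics import r2_score

R2_THRESHOLD = 0.75  # client's success criterion stated in the objectives


def meets_success_criterion(y_true, y_pred, threshold=R2_THRESHOLD):
    """Return the R^2 score and whether it reaches the threshold (hypothetical helper)."""
    score = r2_score(y_true, y_pred)
    return score, bool(score >= threshold)


# Made-up values, for illustration only (not project data)
y_true = np.array([100.0, 150.0, 200.0, 250.0, 300.0])
y_pred = np.array([110.0, 140.0, 195.0, 260.0, 290.0])

score, passed = meets_success_criterion(y_true, y_pred)
print(f"R^2 = {score:.3f}, criterion met: {passed}")
```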


## Inputs
* House prices dataset: outputs/datasets/collection/house_prices.csv.
* Information regarding the steps to include in the various pipelines, as indicated in the conclusion sections of the data cleaning and feature engineering notebooks.
* Outlier indices list: outputs/ml/outlier_indices.pkl.

## Outputs

---

## Change working directory

The working directory is changed from the current folder to its parent folder.

In [None]:
import os
current_dir = os.getcwd()
current_dir

In [None]:
os.chdir(os.path.dirname(current_dir))
os.getcwd()

---

## Load house price dataset

In [None]:
import pandas as pd

house_prices_df = pd.read_csv(filepath_or_buffer='outputs/datasets/collection/house_prices.csv')

---

## Removing known outliers from the whole dataset

Loading outlier indices list

In [None]:
import joblib
outlier_indices = joblib.load('outputs/ml/outlier_indices.pkl')
outlier_indices

Removing the outlier instances

In [None]:
house_prices_df.drop(labels=outlier_indices, inplace=True)

---

## Create data cleaning and feature engineering pipeline

In [None]:
import numpy as np
import src.transformers_and_functions as tf
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, MinMaxScaler
from feature_engine.selection import SmartCorrelatedSelection, DropFeatures
from sklearn.tree import DecisionTreeRegressor

In [None]:
def data_cleaning_and_feature_engineering():
    """
    Constructs and returns data cleaning and feature engineering pipeline.
    """
    # variables for defining pipeline
    bsmt_fin_type1_cat = np.array(list(reversed(['GLQ', 'ALQ', 'BLQ', 'Rec', 'LwQ', 'Unf', 'None'])))
    bsmt_exposure_cat = np.array(['None', 'No', 'Mn', 'Av', 'Gd'])
    garage_finish_cat = np.array(['None', 'Unf', 'RFn', 'Fin'])
    kitchen_quality_cat = np.array(['Po', 'Fa', 'TA', 'Gd', 'Ex'])
    categories = [bsmt_exposure_cat, bsmt_fin_type1_cat, garage_finish_cat, kitchen_quality_cat]

    estimator = DecisionTreeRegressor(min_samples_split=10, min_samples_leaf=5, random_state=30)
    smart_variables = ['GrLivArea', 'GarageYrBlt', 'TotalBsmtSF', 'GarageArea', 'BsmtFinSF1',
                       'EnclosedPorchSF', 'GarageFinish', 'KitchenQual', 'YearBuilt', 'YearRemodAdd',
                       'OverallQual', '1stFlrSF', '2ndFlrSF']

    pipeline = Pipeline([
                        # Data cleaning:
                        # Missing value imputation:
                        ('IndependentKNNImputer', tf.IndependentKNNImputer()),
                        ('EqualFrequencyImputer', tf.EqualFrequencyImputer()),
                        # Feature engineering:
                        # encoding:
                        ('OrdinalEncoder', OrdinalEncoder(categories=categories, dtype='int64')),
                        # feature number reduction
                        ('CompositeSelectKBest', tf.CompositeSelectKBest()),
                        ('SmartCorrelatedSelection', SmartCorrelatedSelection(variables=smart_variables, method='spearman',
                                                                              threshold=0.8, selection_method='model_performance',
                                                                              estimator=estimator, scoring='r2', cv=5)),
                        ('DropFeatures', DropFeatures('EnclosedPorchSF')),
                        # Feature scaling:
                        ('CompositeNormaliser', tf.CompositeNormaliser())])
    return pipeline

In [None]:
data_cleaning_and_feature_engineering_pipeline = data_cleaning_and_feature_engineering()

## Split dataset

In [None]:
from sklearn.model_selection import train_test_split

(train_set_df, test_set_df) = train_test_split(house_prices_df, test_size=0.25, random_state=30)

---

## Scale target

In [None]:
min_max_scaler = MinMaxScaler()
min_max_scaler.set_output(transform='pandas')
min_max_scaler.fit(train_set_df[['SalePrice']])

Transform train set target

In [None]:
transformed_df = min_max_scaler.transform(train_set_df[['SalePrice']])
train_set_df[['SalePrice']] = transformed_df

Transform test set target

In [None]:
transformed_df = min_max_scaler.transform(test_set_df[['SalePrice']])
test_set_df[['SalePrice']] = transformed_df

---

## Model Grid Search CV

Initially, a search is performed to find the most suitable algorithm using sklearn's `GridSearchCV`, with only the default hyperparameters for each algorithm.
Hyperparameter tuning is then performed for the best candidate algorithm, again using `GridSearchCV`, but with multiple hyperparameter value combinations.
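
The two-stage approach described above can be sketched as follows. This is an illustration on synthetic data, not the search performed in this notebook: the candidate algorithms, the tuning grids and the names `X_toy`, `y_toy`, `candidate_models` and `tuning_grids` are all assumptions. An empty parameter grid makes `GridSearchCV` evaluate a single candidate that uses the estimator's default hyperparameters.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, for illustration only (not the house prices dataset)
X_toy, y_toy = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=30)

# Stage 1: an empty grid means each algorithm is scored with its defaults
candidate_models = {
    'LinearRegression': LinearRegression(),
    'DecisionTreeRegressor': DecisionTreeRegressor(random_state=30),
}
default_scores = {}
for name, model in candidate_models.items():
    search = GridSearchCV(model, param_grid={}, cv=5, scoring='r2')
    search.fit(X_toy, y_toy)
    default_scores[name] = search.best_score_

best_name = max(default_scores, key=default_scores.get)

# Stage 2: tune only the best candidate over several value combinations
# (hypothetical grids; each one includes the estimator's default values)
tuning_grids = {
    'LinearRegression': {'fit_intercept': [True, False]},
    'DecisionTreeRegressor': {'max_depth': [3, 5, None], 'min_samples_leaf': [1, 5]},
}
tuned_search = GridSearchCV(candidate_models[best_name], tuning_grids[best_name],
                            cv=5, scoring='r2')
tuned_search.fit(X_toy, y_toy)
print(best_name, tuned_search.best_params_)
```

Running the cheap default-only search first keeps the expensive multi-combination search limited to a single algorithm.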

### Best algorithm search

Creating a search class to handle the searches.

In [15]:
from sklearn.model_selection import GridSearchCV

# Taken from Code-Institute-Solutions/churnometer (https://github.com/Code-Institute-Solutions/churnometer)
class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

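    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        # Note: `refit` is not forwarded to GridSearchCV, which therefore
        # uses its own default (refit=True).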
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")
            model = self.models[key]

            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs,
                              verbose=verbose, scoring=scoring)
            gs.fit(X, y)
            self.grid_searches[key] = gs

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]
                scores.append(r.reshape(len(params), 1))

            all_scores = np.hstack(scores)
            for p, s in zip(params, all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)

        columns = ['estimator', 'min_score',
                   'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns], self.grid_searches
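
To make the stacking logic in `score_summary` easier to follow, the sketch below reproduces it for a single `GridSearchCV` run: the per-split test scores in `cv_results_` are reshaped into columns and stacked so that each row holds the scores of one parameter combination. The synthetic data, the estimator, the grid and the names `X_toy`, `y_toy`, `n_splits` and `gs` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, for illustration only (not the house prices dataset)
X_toy, y_toy = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=30)

n_splits = 3
gs = GridSearchCV(DecisionTreeRegressor(random_state=30),
                  {'max_depth': [2, 4]}, cv=n_splits, scoring='r2')
gs.fit(X_toy, y_toy)

params = gs.cv_results_['params']
# One column per CV split, one row per parameter combination
split_scores = [gs.cv_results_[f"split{i}_test_score"].reshape(len(params), 1)
                for i in range(n_splits)]
all_scores = np.hstack(split_scores)

# Each row yields the min, mean and max that score_summary reports
for p, s in zip(params, all_scores):
    print(p, f"min={s.min():.3f} mean={s.mean():.3f} max={s.max():.3f}")
```

Note that `score_summary` reads the number of splits from the `cv` attribute of each search, so this pattern assumes `cv` was passed as an integer.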