# **Part 5 - Modelling & Evaluation (Predict SalePrice)**

## Objectives

* Fit and evaluate a Regression Model to predict sale price for our data

## Inputs

* X_train (Features)
* y_train (Target)
* X_test (Features)
* y_test (Target)
* Selected transformers for data cleaning and feature engineering (details in respective notebooks)

## Outputs

* Feature Engineering Pipeline
* Modelling & Evaluation Pipeline
* Plots for feature importance



---

# Change working directory

* As the notebooks are stored in a subfolder, when running the notebook in the editor, the working directory will need to be adjusted.

The working directory will be changed from its current folder to its parent folder
* access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/Project5-PredictiveAnalytics-HeritageHousing/jupyter_notebooks'

The parent of the current directory needs to be made the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


* Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/Project5-PredictiveAnalytics-HeritageHousing'

---

# Load Data

In [4]:
import numpy as np
import pandas as pd 
X_train = pd.read_csv('./outputs/datasets/clean/X_train.csv')
y_train = pd.read_csv('./outputs/datasets/clean/y_train.csv')

In [5]:
print(f'Features Train Set: {X_train.shape}')
print(f'Target Train Set: {y_train.shape}')


Features Train Set: (1168, 23)
Target Train Set: (1168, 1)


### Features - Train Set

In [6]:
X_train.head()

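| | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | EnclosedPorch | GarageArea | GarageFinish | ... | LotArea | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1222 | 698.0 | 4.0 | No | 0 | Unf | 570 | 0.0 | 487 | RFn | ... | 10192 | 102.0 | 143.0 | 98 | 6 | 7 | 570 | 0.0 | 1968 | 1992 |
| 1 | 1165 | 0.0 | 4.0 | No | 0 | Unf | 1141 | 0.0 | 420 | Fin | ... | 12090 | 93.0 | 650.0 | 123 | 5 | 8 | 1141 | 144.0 | 1998 | 1998 |
| 2 | 698 | 430.0 | 2.0 | No | 0 | Unf | 698 | 0.0 | 528 | RFn | ... | 7500 | 60.0 | 0.0 | 0 | 4 | 4 | 698 | 0.0 | 1920 | 1950 |
| 3 | 1844 | 0.0 | 2.0 | No | 976 | GLQ | 868 | 0.0 | 620 | Fin | ... | 10994 | 88.0 | 366.0 | 44 | 5 | 8 | 1844 | 0.0 | 2005 | 2006 |
| 4 | 1419 | 0.0 | 2.0 | Av | 945 | Unf | 474 | 0.0 | 567 | RFn | ... | 8089 | 60.0 | 0.0 | 0 | 6 | 8 | 1419 | 0.0 | 2007 | 2007 |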


### Target - Train Set

In [7]:
y_train.head()

| | SalePrice |
|---|---|
| 0 | 170000 |
| 1 | 258000 |
| 2 | 68400 |
| 3 | 257000 |
| 4 | 392000 |


---

# ML Pipeline

### Initial Pipeline Creation

First we will import the necessary packages, including the feature engineering transformers, scalers, and ML algorithms. We do not need to include the data cleaning steps within our pipeline, as our data has already been cleaned and split into train and test sets. However, it is worth noting that including those cleaning steps within the pipeline, along with the feature engineering steps, and feeding it the raw data is also a valid approach.

In [8]:
from sklearn.pipeline import Pipeline

# Feature Engineering 
from feature_engine.encoding import OrdinalEncoder
from feature_engine import transformation as vt 
from feature_engine.outliers import Winsorizer, OutlierTrimmer
from feature_engine.selection import SmartCorrelatedSelection

# Feature Preprocessing (Scaling & Selection)
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel

# Machine Learning Algorithms
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor
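
As an aside, the alternative approach mentioned above (cleaning inside the pipeline) could look roughly like the sketch below. It is not what this notebook does. The synthetic data, the column names `area` and `quality`, and the choice of a median imputer are all assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical raw data: two numeric features and a target that depends on them
rng = np.random.default_rng(23)
raw = pd.DataFrame({
    "area": rng.uniform(500, 2500, size=100),
    "quality": rng.integers(1, 10, size=100).astype(float),
})
target = 50 * raw["area"] + 10_000 * raw["quality"]

# Simulate missing values in the raw features (the target stays complete)
raw.loc[::10, "area"] = np.nan

# Split the raw, uncleaned data
X_tr, X_te, y_tr, y_te = train_test_split(
    raw, target, test_size=0.2, random_state=23
)

# The cleaning step lives inside the pipeline, so the imputation median
# is learned from the training rows only
full_pipeline = Pipeline([
    ("Median Imputer", SimpleImputer(strategy="median")),
    ("Model", LinearRegression()),
])
full_pipeline.fit(X_tr, y_tr)
predictions = full_pipeline.predict(X_te)
```

With this layout the same fitted object can be applied to new raw data without a separate cleaning script.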



Next we create our pipeline using the transformers and parameters decided upon in our feature engineering notebook. We won't go into detail again here, so please refer back to that notebook for more information.

In [9]:
def PipelineReg(model):
    pipeline = Pipeline([
        ('Ordinal Categorical Encoder', OrdinalEncoder(encoding_method='ordered', variables=['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual'])),
        ('Log "e" Transformer', vt.LogTransformer(variables=['GrLivArea', 'LotArea'])),
        ('Power Transformer', vt.PowerTransformer(variables=['BsmtFinSF1', 'BsmtUnfSF', 'MasVnrArea', 'OpenPorchSF'])),
        ('Box Cox Transformer', vt.BoxCoxTransformer(variables=['1stFlrSF'])),
        ('Yeo Johnson Transformer', vt.YeoJohnsonTransformer(variables=['EnclosedPorch', 'LotFrontage', 'TotalBsmtSF', 'WoodDeckSF'])),
        ('Winsorizer', Winsorizer(capping_method='gaussian', tail='right', variables=['LotFrontage', 'TotalBsmtSF'])),
        # ('Outlier Trimmer', OutlierTrimmer(capping_method='gaussian', tail='right', fold=1.5, variables=['GrLivArea'])),
        ("SmartCorrelatedSelection", SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.8, selection_method="variance")),
        # The scaler standardizes features by removing the mean and scaling to unit variance
        ('Feature Scaling', StandardScaler()),
        # Feature selection uses the meta-transformer to select the best features to use within the model based on their importance weights.
        ('Feature Selection', SelectFromModel(model)),
        ('Model', model)
    ])

    return pipeline
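
To make the `Feature Selection` step more concrete, here is a small standalone sketch of how `SelectFromModel` behaves. It assumes synthetic data in which only the first two of five features drive the target, and it uses a small random forest as the estimator. None of this comes from the housing dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic data: only columns 0 and 1 influence the target
rng = np.random.default_rng(23)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + 3 * X[:, 1] + rng.normal(scale=0.1, size=200)

# With a tree-based estimator and no explicit threshold, features whose
# importance is at or above the mean importance are kept
selector = SelectFromModel(
    RandomForestRegressor(n_estimators=50, random_state=23)
)
selector.fit(X, y)

mask = selector.get_support()        # boolean mask of kept features
X_selected = selector.transform(X)   # only the kept columns remain
```

In the pipeline above the same idea applies, except that the estimator passed to `SelectFromModel` is the model under evaluation.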

### Hyperparameter Optimization

Next we will explore hyperparameter optimization, selecting the best algorithm and its best hyperparameters to use within our pipeline.

In [10]:
from sklearn.model_selection import GridSearchCV

We will again use some custom code from CodeInstitute to create a class that searches for the optimal hyperparameters, using the scikit-learn class GridSearchCV we have just imported.

In [11]:
class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searchs = {}

    def fit(self, features, target, cross_validation, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\n **Running GridSearchCV for {key}** \n")
            import warnings
            warnings.simplefilter(action='ignore', category=FutureWarning)
            model = PipelineReg(self.models[key])
            params = self.params[key]
            gridsearch = GridSearchCV(model, params, cv=cross_validation, n_jobs=n_jobs, verbose=verbose, scoring=scoring)
            gridsearch.fit(features, target)
            self.grid_searchs[key] = gridsearch
    
    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})
        
        rows = []
        for k in self.grid_searchs:
            params = self.grid_searchs[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searchs[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searchs[k].cv_results_[key]
                scores.append(r.reshape(len(params), 1))
            
            all_scores = np.hstack(scores)
            for p, s in zip(params, all_scores):
                rows.append(row(k, s, p))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)

        columns = [
            'estimator',
            'min_score',
            'mean_score',
            'max_score',
            'std_score'
            ]
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns], self.grid_searchs
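
The core of `score_summary` is the stacking of the per-split test scores stored in `cv_results_`. The sketch below isolates that step, assuming a hypothetical `Ridge` grid on synthetic data. Only the `split{i}_test_score` keys and the `cv` attribute mirror the class above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem, for illustration only
rng = np.random.default_rng(23)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=60)

grid = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=3, scoring="r2")
grid.fit(X, y)

params = grid.cv_results_["params"]

# One column per cross-validation split, one row per parameter candidate
per_split = [
    grid.cv_results_[f"split{i}_test_score"].reshape(len(params), 1)
    for i in range(grid.cv)
]
all_scores = np.hstack(per_split)

# Row-wise statistics correspond to the summary columns of the class above
mean_scores = all_scores.mean(axis=1)
```

Note that `range(grid.cv)` only works when `cv` is passed as an integer, which is how the class is used in this notebook.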

First we will perform a grid search with the default hyperparameters to find the most appropriate algorithm for our model.
* *Again, for reproducibility we will use the random state attribute*

In [12]:
quick_search_models = {
    'LinearRegression': LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=23),
    "RandomForestRegressor": RandomForestRegressor(random_state=23),
    "ExtraTreesRegressor": ExtraTreesRegressor(random_state=23),
    "AdaBoostRegressor": AdaBoostRegressor(random_state=23),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=23),
    "XGBRegressor": XGBRegressor(random_state=23),
}

quick_search_params = {
    'LinearRegression': {},
    "DecisionTreeRegressor": {},
    "RandomForestRegressor": {},
    "ExtraTreesRegressor": {},
    "AdaBoostRegressor": {},
    "GradientBoostingRegressor": {},
    "XGBRegressor": {},
}

*To keep our notebook a little tidier we have made some minor changes to suppress the warnings that would otherwise be raised, but always remember it's good practice to keep note of all warnings and handle them accordingly.*
* <code>FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.</code>
    * To fix this warning we change n_jobs from -1 to 1.
* <code>DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().</code>
    * To fix this warning we flatten the values of y_train with the NumPy function <code>ravel()</code> and convert the result to a pandas Series, as a NumPy array is not an accepted input here.
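
The shape change behind the second fix can be seen in a minimal sketch. It uses a three-row stand-in for y_train (values taken from the head shown earlier). The variable names are illustrative only.

```python
import pandas as pd

# Stand-in for y_train: a one-column DataFrame, as read from the CSV
y_frame = pd.DataFrame({"SalePrice": [170000, 258000, 68400]})

column_vector = y_frame.values      # 2D array with one column
flat = column_vector.ravel()        # flattened to 1D
y_series = pd.Series(flat)          # wrapped as a pandas Series

print(column_vector.shape, flat.shape)
```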

In [13]:
search = HyperparameterOptimizationSearch(models=quick_search_models, params=quick_search_params)
search.fit(features=X_train, target=pd.Series(y_train.values.ravel()), scoring='r2', n_jobs=1, cross_validation=5)


 **Running GridSearchCV for LinearRegression** 

Fitting 5 folds for each of 1 candidates, totalling 5 fits

 **Running GridSearchCV for DecisionTreeRegressor** 

Fitting 5 folds for each of 1 candidates, totalling 5 fits

 **Running GridSearchCV for RandomForestRegressor** 

Fitting 5 folds for each of 1 candidates, totalling 5 fits

 **Running GridSearchCV for ExtraTreesRegressor** 

Fitting 5 folds for each of 1 candidates, totalling 5 fits

 **Running GridSearchCV for AdaBoostRegressor** 

Fitting 5 folds for each of 1 candidates, totalling 5 fits

 **Running GridSearchCV for GradientBoostingRegressor** 

Fitting 5 folds for each of 1 candidates, totalling 5 fits

 **Running GridSearchCV for XGBRegressor** 

Fitting 5 folds for each of 1 candidates, totalling 5 fits


Now we can check the results

In [14]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

| | estimator | min_score | mean_score | max_score | std_score |
|---|---|---|---|---|---|
| 3 | ExtraTreesRegressor | 0.715738 | 0.801543 | 0.872317 | 0.064537 |
| 0 | LinearRegression | 0.728023 | 0.794907 | 0.84446 | 0.046269 |
| 2 | RandomForestRegressor | 0.725459 | 0.784715 | 0.828253 | 0.044948 |
| 5 | GradientBoostingRegressor | 0.695142 | 0.775957 | 0.85106 | 0.057544 |
| 4 | AdaBoostRegressor | 0.71867 | 0.758176 | 0.790711 | 0.026214 |
| 6 | XGBRegressor | 0.514374 | 0.652946 | 0.760647 | 0.080995 |
| 1 | DecisionTreeRegressor | 0.488662 | 0.628737 | 0.715119 | 0.081611 |


We will also import a metrics function to pass to GridSearchCV as our scoring strategy. With more time and resources, we could investigate how different scoring strategies affect the grid search cross-validation.
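
One possible way to plug a metrics function into GridSearchCV is scikit-learn's `make_scorer`. The sketch below is an assumption about how that could look, not the scorer used in this project: it wraps `mean_absolute_error` and runs a tiny hypothetical grid on synthetic data.

```python
import numpy as np
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, for illustration only
rng = np.random.default_rng(23)
X = rng.normal(size=(90, 4))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=90)

# greater_is_better=False tells scikit-learn that lower errors are better;
# the scorer therefore reports the negated error
mae_scorer = make_scorer(mean_absolute_error, greater_is_better=False)

grid = GridSearchCV(
    DecisionTreeRegressor(random_state=23),
    {"max_depth": [2, 4]},
    cv=3,
    scoring=mae_scorer,
)
grid.fit(X, y)

best_depth = grid.best_params_["max_depth"]
best_negated_mae = grid.best_score_
```

Because the error is negated, the best candidate is still the one with the highest score, so the sorting logic in `score_summary` would not need to change.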

---


# Push files to Repo


In [15]:
import os
try:
    # create your folder here
    # os.makedirs(name='')
    pass  # placeholder so the try block is valid until a folder name is supplied
except Exception as e:
    print(e)