# **5. House Price Predictor Notebook**

## Objectives

* Develop a working model for the House Price Predictor based on the cleaned and feature-engineered dataset

## Inputs

* The cleaned train set (TrainSetCleaned), evaluated against the cleaned test set (TestSetClean). Path: /workspace/milestone-project-housing-issues/outputs/datasets/cleaned/TrainSetCleaned.csv

## Outputs

* A working model for the House Price predictor

## Additional Comments

* As per the business case, the model must reach an R2 score of at least 0.75 on both the train set and the test set
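
The requirement above can be written as a simple check. The sketch below computes R2 from its definition, 1 - SS_res / SS_tot, in plain Python and tests a pair of scores against the 0.75 threshold. The function names and the toy values are illustrative assumptions, not part of the project code.

```python
def r2_score_plain(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def meets_requirement(r2_train, r2_test, threshold=0.75):
    """True only if both the train and the test R2 reach the threshold."""
    return r2_train >= threshold and r2_test >= threshold


# Toy values (hypothetical), for illustration only
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 210.0, 240.0]
r2_example = r2_score_plain(y_true, y_pred)
print(round(r2_example, 3))
print(meets_requirement(r2_train=0.80, r2_test=0.76))
```

Checking both sets together guards against a model that fits the train set well but generalises poorly.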


---

# Change working directory

We need to change the working directory from its current folder to its parent folder.

* We access the current directory with `os.getcwd()` and change it with `os.chdir()`

In [13]:
import os

# Get the current directory
current_dir = os.getcwd()
print("Current Directory:", current_dir)

# Change the directory to the new path
os.chdir('/workspace/milestone-project-housing-issues')

# Get the updated current directory
current_dir = os.getcwd()
print("New Current Directory:", current_dir)

Current Directory: /workspace/milestone-project-housing-issues
New Current Directory: /workspace/milestone-project-housing-issues
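
The cell above sets the directory with a hard-coded path. A minimal sketch of the relative alternative follows, assuming the notebook sits one level below the project root (that layout is an assumption here). The `os.chdir` call is left commented out because running it twice would move up two levels.

```python
import os

current_dir = os.getcwd()
# Parent folder of the current working directory
parent_dir = os.path.dirname(current_dir)
print("Parent Directory:", parent_dir)

# Uncomment to switch to the parent folder
# os.chdir(parent_dir)
```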


In [28]:
# Load the raw house price records, dropping SalePrice, EnclosedPorch and WoodDeckSF

import pandas as pd
df_houseprices_allvars = pd.read_csv(
    "/workspace/milestone-project-housing-issues/inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv"
).drop(labels=['SalePrice', 'EnclosedPorch', 'WoodDeckSF'], axis=1)
print(df_houseprices_allvars.shape)
df_houseprices_allvars.head()

(1460, 21)


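| | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | GarageArea | GarageFinish | GarageYrBlt | ... | KitchenQual | LotArea | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | YearBuilt | YearRemodAdd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856 | 854.0 | 3.0 | No | 706 | GLQ | 150 | 548 | RFn | 2003.0 | ... | Gd | 8450 | 65.0 | 196.0 | 61 | 5 | 7 | 856 | 2003 | 2003 |
| 1 | 1262 | 0.0 | 3.0 | Gd | 978 | ALQ | 284 | 460 | RFn | 1976.0 | ... | TA | 9600 | 80.0 | 0.0 | 0 | 8 | 6 | 1262 | 1976 | 1976 |
| 2 | 920 | 866.0 | 3.0 | Mn | 486 | GLQ | 434 | 608 | RFn | 2001.0 | ... | Gd | 11250 | 68.0 | 162.0 | 42 | 5 | 7 | 920 | 2001 | 2002 |
| 3 | 961 | NaN | NaN | No | 216 | ALQ | 540 | 642 | Unf | 1998.0 | ... | Gd | 9550 | 60.0 | 0.0 | 35 | 5 | 7 | 756 | 1915 | 1970 |
| 4 | 1145 | NaN | 4.0 | Av | 655 | GLQ | 490 | 836 | RFn | 2000.0 | ... | Gd | 14260 | 84.0 | 350.0 | 84 | 5 | 8 | 1145 | 2000 | 2000 |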


In [29]:
df_houseprices_allvars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 21 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   1stFlrSF      1460 non-null   int64  
 1   2ndFlrSF      1374 non-null   float64
 2   BedroomAbvGr  1361 non-null   float64
 3   BsmtExposure  1460 non-null   object 
 4   BsmtFinSF1    1460 non-null   int64  
 5   BsmtFinType1  1346 non-null   object 
 6   BsmtUnfSF     1460 non-null   int64  
 7   GarageArea    1460 non-null   int64  
 8   GarageFinish  1298 non-null   object 
 9   GarageYrBlt   1379 non-null   float64
 10  GrLivArea     1460 non-null   int64  
 11  KitchenQual   1460 non-null   object 
 12  LotArea       1460 non-null   int64  
 13  LotFrontage   1201 non-null   float64
 14  MasVnrArea    1452 non-null   float64
 15  OpenPorchSF   1460 non-null   int64  
 16  OverallCond   1460 non-null   int64  
 17  OverallQual   1460 non-null   int64  
 18  TotalBsmtSF   1460 non-null 
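
A more compact view of the same information can help when deciding which columns need imputation or encoding. The sketch below uses a small hypothetical DataFrame (not the project data) to list the missing-value counts and the non-numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same kinds of columns as the dataset
df_sample = pd.DataFrame({
    "1stFlrSF": [856, 1262, 920],
    "2ndFlrSF": [854.0, 0.0, np.nan],
    "BsmtExposure": ["No", "Gd", "Mn"],
    "GarageFinish": ["RFn", None, "Unf"],
})

# Count missing values per column, keeping only columns that have any
missing_counts = df_sample.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)

# Non-numeric columns, i.e. candidates for categorical encoding
categorical_cols = df_sample.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)
```

Applied to the real data, the same two calls would single out the text columns that an encoder should receive and the columns that still need imputation.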

---

# Prepare Modelling Stage with Original Dataset

## ML Pipeline with combined data from TrainSetCleaned and TestSetCleaned

* The binary variables created in the DataCleaning notebook have been combined with the original variables
* The amended target variable (SalePriceBand) has also been retained

In [30]:
from sklearn.pipeline import Pipeline

# Feature Engineering
from feature_engine.selection import SmartCorrelatedSelection
from feature_engine.encoding import OrdinalEncoder


def PipelineDataCleaningAndFeatureEngineering():
    pipeline_base = Pipeline([
        ("OrdinalCategoricalEncoder", OrdinalEncoder(encoding_method='arbitrary',
                                                     variables=['1stFlrSF', '2ndFlrSF', 'BedroomAbvGr',
                                                                'BsmtExposure', 'BsmtFinSF1', 'BsmtFinType1',
                                                                'BsmtUnfSF', 'GarageArea', 'GarageFinish',
                                                                'GarageYrBlt', 'GrLivArea',  'KitchenQual',
                                                                'LotArea', 'LotFrontage', 'MasVnrArea',
                                                                'OpenPorchSF', 'OverallCond', 'OverallQual',
                                                                'TotalBsmtSF', 'YearBuilt', 'YearRemodAdd'])),

        ("SmartCorrelatedSelection", SmartCorrelatedSelection(variables=None,
         method="spearman", threshold=0.6, selection_method="variance")),

    ])

    return pipeline_base


PipelineDataCleaningAndFeatureEngineering()

Pipeline(steps=[('OrdinalCategoricalEncoder',
                 OrdinalEncoder(encoding_method='arbitrary',
                                variables=['1stFlrSF', '2ndFlrSF',
                                           'BedroomAbvGr', 'BsmtExposure',
                                           'BsmtFinSF1', 'BsmtFinType1',
                                           'BsmtUnfSF', 'GarageArea',
                                           'GarageFinish', 'GarageYrBlt',
                                           'GrLivArea', 'KitchenQual',
                                           'LotArea', 'LotFrontage',
                                           'MasVnrArea', 'OpenPorchSF',
                                           'OverallCond', 'OverallQual',
                                           'TotalBsmtSF', 'YearBuilt',
                                           'YearRemodAdd'])),
                ('SmartCorrelatedSelection',
                 SmartCorrelatedSelection(method='spearman',
                                          selection_metho
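
Python joins adjacent string literals at compile time, so a missing comma in a long list of column names raises no error and silently produces one merged name. A small illustration, independent of the project data, with a guard that flags names not found among the known columns (the `known_columns` set is a stand-in for `df.columns`):

```python
# Missing comma between the two literals: Python merges them into one string
without_comma = ['1stFlrSF', '2ndFlrSF' 'BedroomAbvGr']
# Comma present: three separate names
with_comma = ['1stFlrSF', '2ndFlrSF', 'BedroomAbvGr']

print(without_comma)
print(with_comma)

# Quick guard: every listed name should be a real column
known_columns = {'1stFlrSF', '2ndFlrSF', 'BedroomAbvGr'}
unknown = [name for name in without_comma if name not in known_columns]
print(unknown)
```

Running such a guard before building the pipeline surfaces the typo at definition time rather than at fit time.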

## ML Pipeline for Modelling and Hyperparameter Optimisation

In [31]:
# Feat Scaling
from sklearn.preprocessing import StandardScaler

# Feat Selection
from sklearn.feature_selection import SelectFromModel

# ML algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier


def PipelineClf(model):
    pipeline_base = Pipeline([
        ("scaler", StandardScaler()),
        ("feat_selection", SelectFromModel(model)),
        ("model", model),
    ])

    return pipeline_base



#### Custom Class for Hyperparameter Optimisation

In [32]:
import numpy as np  # used by score_summary for mean, std and hstack
from sklearn.model_selection import GridSearchCV


class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")

            model = PipelineClf(self.models[key])
            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs,
                              verbose=verbose, scoring=scoring)
            gs.fit(X, y)
            self.grid_searches[key] = gs

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]
                scores.append(r.reshape(len(params), 1))

            all_scores = np.hstack(scores)
            for p, s in zip(params, all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)
        columns = ['estimator', 'min_score',
                   'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]
        return df[columns], self.grid_searches
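
`score_summary` condenses the per-fold test scores of each parameter combination into a minimum, mean, maximum and standard deviation, then sorts by the chosen column. The standalone sketch below reproduces that aggregation on hypothetical fold scores using only the standard library. The dictionary layout is an assumption made for illustration and is not the structure of `cv_results_`.

```python
from statistics import mean, pstdev

# Hypothetical per-fold scores for two parameter combinations
fold_scores = {
    "n_estimators=50": [0.70, 0.80, 0.75],
    "n_estimators=100": [0.78, 0.82, 0.80],
}

summary = []
for params, scores in fold_scores.items():
    summary.append({
        "params": params,
        "min_score": min(scores),
        "mean_score": mean(scores),
        "max_score": max(scores),
        # Population standard deviation, matching np.std's default (ddof=0)
        "std_score": pstdev(scores),
    })

# Best mean score first, as score_summary does with ascending=False
summary.sort(key=lambda row: row["mean_score"], reverse=True)
for row in summary:
    print(row["params"], round(row["mean_score"], 3))
```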

## Split Train and Test Set
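
This section has no code yet. Purely as an illustration, a hold-out split can be sketched with pandas alone. The 80/20 ratio, the seed and the toy DataFrame are assumptions, and scikit-learn's `train_test_split` is the more common choice in practice.

```python
import pandas as pd

# Toy data (hypothetical), standing in for the features and the target
df_toy = pd.DataFrame({
    "GrLivArea": list(range(1000, 2000, 100)),
    "SalePriceBand": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})

# Hold out 20% of the rows for testing; random_state makes the split repeatable
test_set = df_toy.sample(frac=0.2, random_state=0)
train_set = df_toy.drop(test_set.index)

print(train_set.shape, test_set.shape)
```

Dropping the sampled index from the original frame guarantees that no row appears in both sets.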

---


# Push files to Repo

* If you do not need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
    # create your folder here
    # os.makedirs(name='')
    pass  # keeps the try block valid until a folder name is set
except Exception as e:
    print(e)
