# **(Predict House Price Notebook)**

## Objectives

* Develop and assess a predictive model for estimating the sale values of inherited properties.

## Inputs

* outputs/datasets/cleaned/HousePricesCleaned.csv

## Outputs

* Train set (features and target)
* Test set (features and target)
* ML pipeline to predict house prices
* Feature Importance Plot
* Model performance plot

## Additional Comments

* At the beginning of the project we stated a hypothesis. After the steps taken, we can conclude that the hypothesis was correct: the size, the quality, and the year the house was built all affect the price. Credit goes to Code Institute and https://github.com/Amareteklay/, whose work I followed.


---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/housepricepred2/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/housepricepred2'
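
Re-running the `os.chdir` cell moves up one more level each time, which can silently break the relative paths used later. A minimal sketch of a guard follows. It assumes the notebooks live in a folder named `jupyter_notebooks`, as the path printed earlier suggests; the helper name `move_to_project_root` is hypothetical and not part of the original notebook.

```python
import os
import tempfile
from pathlib import Path


def move_to_project_root(notebook_dir_name="jupyter_notebooks"):
    """Change to the parent directory only if we are still in the notebook folder.

    Hypothetical helper: the folder name is an assumption taken from the
    path printed by os.getcwd() in this notebook.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        os.chdir(cwd.parent)
    return Path.cwd()


# Demonstration in a throwaway directory tree so nothing real is touched
original_cwd = Path.cwd()
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp).resolve() / "project"
    (project / "jupyter_notebooks").mkdir(parents=True)
    os.chdir(project / "jupyter_notebooks")

    first = move_to_project_root()   # moves up to .../project
    second = move_to_project_root()  # no-op: already at the project root
    os.chdir(original_cwd)           # restore before the temp dir is removed
```

With this guard, running the cell twice leaves the working directory at the project root instead of climbing one level higher.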

## Load Data

Start by loading data

In [4]:
import numpy as np
import pandas as pd
df = pd.read_csv("outputs/datasets/collection/HousePrices.csv") 

print(df.shape)
df.head(3)

(1460, 24)


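| Unnamed: 0 | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | EnclosedPorch | GarageArea | GarageFinish | ... | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856 | 854.0 | 3.0 | No | 706 | GLQ | 150 | 0.0 | 548 | RFn | ... | 65.0 | 196.0 | 61 | 5 | 7 | 856 | 0.0 | 2003 | 2003 | 208500 |
| 1 | 1262 | 0.0 | 3.0 | Gd | 978 | ALQ | 284 | NaN | 460 | RFn | ... | 80.0 | 0.0 | 0 | 8 | 6 | 1262 | NaN | 1976 | 1976 | 181500 |
| 2 | 920 | 866.0 | 3.0 | Mn | 486 | GLQ | 434 | 0.0 | 608 | RFn | ... | 68.0 | 162.0 | 42 | 5 | 7 | 920 | NaN | 2001 | 2002 | 223500 |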


---

## ML Pipeline: Regressor

* Taken from Code Institute

In [5]:
from sklearn.pipeline import Pipeline

# Data Cleaning
from feature_engine.imputation import MeanMedianImputer
from feature_engine.selection import DropFeatures
from feature_engine.imputation import CategoricalImputer

# Feature Engineering
from feature_engine.encoding import OrdinalEncoder
from feature_engine.selection import SmartCorrelatedSelection
from feature_engine import transformation as vt
from feature_engine.outliers import Winsorizer

# Feature Scaling
from sklearn.preprocessing import StandardScaler

# Feature Selection
from sklearn.feature_selection import SelectFromModel

# Models
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import ExtraTreesRegressor

def OptimizeModelPipeline(model):
    pipeline = Pipeline([
        ('impute_mean', MeanMedianImputer(imputation_method='mean', variables=['LotFrontage', 'BedroomAbvGr'])),
        ('impute_median', MeanMedianImputer(imputation_method='median', variables=['2ndFlrSF', 'MasVnrArea'])),
        ('impute_categorical', CategoricalImputer(imputation_method='frequent', variables=['GarageFinish', 'BsmtFinType1', 'BsmtExposure'])),
        ('drop_features', DropFeatures(features_to_drop=['EnclosedPorch', 'GarageYrBlt', 'WoodDeckSF'])), 
        ('encoder', OrdinalEncoder(encoding_method='arbitrary', variables=['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual'])),
        ('log_transformer', vt.LogTransformer(variables=['GrLivArea', 'LotArea', 'LotFrontage'])),
        ('power_transformer', vt.PowerTransformer(variables=['GarageArea', 'MasVnrArea', 'OpenPorchSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF'])),
        ('outlier_handler', Winsorizer(capping_method='iqr', tail='both', fold=1.5, variables=['GarageArea', 'LotArea', 'LotFrontage', 'MasVnrArea', 'OpenPorchSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF'])),  
        ('smart_corr_sel', SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.6, selection_method="variance")),
        ('feat_scaling', StandardScaler()),
        ('feat_selection', SelectFromModel(model)),
        ('model', model)
    ])

    return pipeline


In [6]:
from sklearn.model_selection import GridSearchCV
class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.models.keys():
            try:
                print(f"\nRunning GridSearchCV for {key}\n")
                model = self.models[key]
                params = self.params[key]
                gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs, verbose=verbose, scoring=scoring, error_score='raise')
                gs.fit(X, y)
                self.grid_searches[key] = gs
            except Exception as e:
                print(f"Error encountered for model {key}: {e}")
                continue

        return self.grid_searches
 

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                 'estimator': key,
                 'min_score': min(scores),
                 'max_score': max(scores),
                 'mean_score': np.mean(scores),
                 'std_score': np.std(scores),
            }
            return pd.Series({**params,**d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]        
                scores.append(r.reshape(len(params),1))

            all_scores = np.hstack(scores)
            for p, s in zip(params,all_scores):
                rows.append((row(k, s, p)))
        
        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)

        columns = ['estimator', 'min_score', 'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns], self.grid_searches
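
The `score_summary` method gathers the per-fold test scores of every hyperparameter candidate and reduces them to a minimum, maximum, mean, and standard deviation. The sketch below reproduces that aggregation with plain Python so the logic can be followed without a fitted search. The scores and candidates are made up for illustration.

```python
from statistics import mean, pstdev

# Illustrative per-fold scores in the layout of cv_results_:
# one list per fold, one entry per hyperparameter candidate.
# The numbers are made up for the example.
split_scores = {
    "split0_test_score": [0.70, 0.80],
    "split1_test_score": [0.60, 0.90],
    "split2_test_score": [0.80, 0.70],
}
candidates = [{"max_depth": 4}, {"max_depth": 15}]

# Transpose so that each tuple holds all fold scores of one candidate
per_candidate = list(zip(*split_scores.values()))

summary = []
for params, scores in zip(candidates, per_candidate):
    summary.append({
        **params,
        "min_score": min(scores),
        "max_score": max(scores),
        "mean_score": mean(scores),
        "std_score": pstdev(scores),  # population std, like np.std's default
    })

# Best candidate first, as score_summary does with ascending=False
summary.sort(key=lambda row: row["mean_score"], reverse=True)
```

The transpose step plays the role of `reshape` plus `np.hstack` in the class: both turn "one array per fold" into "one row of fold scores per candidate".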

## Split Train and Test Set

In [7]:
from sklearn.model_selection import train_test_split

# Splitting the dataset into features (X) and target (y)
features = df.drop(columns=['SalePrice'])
target = df['SalePrice']

# Dividing the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    features, 
    target, 
    test_size=0.2, 
    random_state=0
)

# Displaying the dimensions of the splits
print("* Dimensions of Training Data:", X_train.shape, y_train.shape)
print("* Dimensions of Testing Data:", X_test.shape, y_test.shape)

* Dimensions of Training Data: (1168, 23) (1168,)
* Dimensions of Testing Data: (292, 23) (292,)
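
The printed shapes can be checked by hand. The sketch below assumes, to my understanding of scikit-learn's behaviour, that a fractional `test_size` is rounded up to a whole number of rows and the training set receives the remainder.

```python
import math

n_samples = 1460   # rows in the loaded DataFrame
n_columns = 24     # columns in the loaded DataFrame
test_size = 0.2    # fraction passed to train_test_split

n_test = math.ceil(test_size * n_samples)  # test rows are rounded up
n_train = n_samples - n_test               # the remainder is used for training
n_features = n_columns - 1                 # every column except SalePrice

print(n_train, n_test, n_features)
```

These values agree with the shapes reported above: 1168 training rows, 292 test rows, and 23 feature columns.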


## Grid Search CV - Sklearn

In [14]:
# Set up a dictionary of various regression models with default settings
initial_models = {
    "Linear_Reg": LinearRegression(),
    "Decision_Tree": DecisionTreeRegressor(random_state=0),
    "Random_Forest": RandomForestRegressor(random_state=0),
    "Extra_Trees": ExtraTreesRegressor(random_state=0),
    "AdaBoost": AdaBoostRegressor(random_state=0),
    "Gradient_Boosting": GradientBoostingRegressor(random_state=0),
    "XGBoost": XGBRegressor(random_state=0),
}

# Define hyperparameters for a quick comparison of models
model_hyperparams = {
    "Linear_Reg": {},

    "Decision_Tree": {
        'max_depth': [None, 4, 15],
        'min_samples_split': [2, 50],
        'min_samples_leaf': [1, 50],
        'max_leaf_nodes': [None, 50],
    },

    "Random_Forest": {
        'n_estimators': [100, 50, 140],
        'max_depth': [None, 4, 15],
        'min_samples_split': [2, 50],
        'min_samples_leaf': [1, 50],
        'max_leaf_nodes': [None, 50],
    },

    "Extra_Trees": {
        'n_estimators': [100, 50, 150],
        'max_depth': [None, 3, 15],
        'min_samples_split': [2, 50],
        'min_samples_leaf': [1, 50],
    },

    "AdaBoost": {
        'n_estimators': [50, 25, 80, 150],
        'learning_rate': [1, 0.1, 2],
        'loss': ['linear', 'square', 'exponential'],
    },

    "Gradient_Boosting": {
        'n_estimators': [100, 50, 140],
        'learning_rate': [0.1, 0.01, 0.001],
        'max_depth': [3, 15, None],
        'min_samples_split': [2, 50],
        'min_samples_leaf': [1, 50],
        'max_leaf_nodes': [None, 50],
    },

    "XGBoost": {
        'n_estimators': [30, 80, 200],
        'max_depth': [None, 3, 15],
        'learning_rate': [0.01, 0.1, 0.001],
        'gamma': [0, 0.1],
    },
}
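
`GridSearchCV` evaluates every combination in a grid, so the number of candidates is the product of the list lengths, and the number of fits is that product multiplied by the number of folds. The sketch below counts both for a few of the grids, copied here so the snippet runs on its own. The function name `count_candidates` is hypothetical, and 5 folds are assumed to match the `cv=5` used in this notebook.

```python
from math import prod

# Grids copied from model_hyperparams so the snippet is self-contained
grids = {
    "Linear_Reg": {},
    "Decision_Tree": {
        'max_depth': [None, 4, 15],
        'min_samples_split': [2, 50],
        'min_samples_leaf': [1, 50],
        'max_leaf_nodes': [None, 50],
    },
    "AdaBoost": {
        'n_estimators': [50, 25, 80, 150],
        'learning_rate': [1, 0.1, 2],
        'loss': ['linear', 'square', 'exponential'],
    },
    "Gradient_Boosting": {
        'n_estimators': [100, 50, 140],
        'learning_rate': [0.1, 0.01, 0.001],
        'max_depth': [3, 15, None],
        'min_samples_split': [2, 50],
        'min_samples_leaf': [1, 50],
        'max_leaf_nodes': [None, 50],
    },
}


def count_candidates(grid):
    """Number of hyperparameter combinations in one grid (1 for an empty grid)."""
    return prod(len(values) for values in grid.values())


cv = 5  # number of cross-validation folds
for name, grid in grids.items():
    n_candidates = count_candidates(grid)
    print(f"{name}: {n_candidates} candidates, {n_candidates * cv} fits")
```

Counting before launching the search is a cheap way to spot grids, such as `Gradient_Boosting` here, that dominate the total run time.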

In [15]:
search = HyperparameterOptimizationSearch(models=initial_models, params=model_hyperparams)
search.fit(X_train, y_train, cv=5, n_jobs=-1, verbose=1, scoring='r2')



Running GridSearchCV for Linear_Reg

Fitting 5 folds for each of 1 candidates, totalling 5 fits
Error encountered for model Linear_Reg: could not convert string to float: 'Mn'

Running GridSearchCV for Decision_Tree

Fitting 5 folds for each of 24 candidates, totalling 120 fits
Error encountered for model Decision_Tree: could not convert string to float: 'Av'

Running GridSearchCV for Random_Forest

Fitting 5 folds for each of 72 candidates, totalling 360 fits
Error encountered for model Random_Forest: could not convert string to float: 'Mn'

Running GridSearchCV for Extra_Trees

Fitting 5 folds for each of 36 candidates, totalling 180 fits
Error encountered for model Extra_Trees: could not convert string to float: 'Mn'

Running GridSearchCV for AdaBoost

Fitting 5 folds for each of 36 candidates, totalling 180 fits
Error encountered for model AdaBoost: Input contains NaN

Running GridSearchCV for Gradient_Boosting

Fitting 5 folds for each of 216 candidates, totalling 1080 fits
Error

Error encountered for model XGBoost: DataFrame.dtypes for data must be int, float or bool.
                Did not expect the data types in fields BsmtExposure, BsmtFinType1, GarageFinish, KitchenQual



{}

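Next we would run a summary and check the results. Every search above raised an error, so the returned dictionary of grid searches is empty (`{}`) and there is nothing to summarise yet.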
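
The errors above (`could not convert string to float`, `Input contains NaN`) suggest that the estimators received the raw DataFrame: `fit` passes `self.models[key]` straight to `GridSearchCV`, and `OptimizeModelPipeline` is never called. One possible fix, sketched here under assumptions rather than tested against this dataset, is to wrap each model in the pipeline and address its hyperparameters with the `model__` prefix, which is how scikit-learn routes parameters to a named pipeline step. The helper `prefix_param_grid` is hypothetical.

```python
def prefix_param_grid(grid, step="model"):
    """Return a copy of `grid` whose keys target a named pipeline step.

    scikit-learn addresses a step's parameters as '<step>__<parameter>'.
    The step name 'model' matches the last step of OptimizeModelPipeline.
    """
    return {f"{step}__{name}": values for name, values in grid.items()}


# Part of the Decision_Tree grid from model_hyperparams
decision_tree_grid = {
    'max_depth': [None, 4, 15],
    'min_samples_split': [2, 50],
}
pipeline_grid = prefix_param_grid(decision_tree_grid)
print(pipeline_grid)

# Inside HyperparameterOptimizationSearch.fit, the two assignments could
# then become (an untested assumption, not the notebook's current code):
#   model = OptimizeModelPipeline(self.models[key])
#   params = prefix_param_grid(self.params[key])
```

Whether the full pipeline then runs cleanly still depends on the data, for example on every column named in the pipeline steps actually being present in the DataFrame.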

---


# Push files to Repo

* If you do not need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the otherwise empty try block is valid Python
except Exception as e:
  print(e)
