# **(Predict House Price Notebook)**

## Objectives

* Develop and assess a predictive model for estimating the sale values of inherited properties.

## Inputs

* outputs/datasets/cleaned/HousePricesCleaned.csv

## Outputs

* Train set (features and target)
* Test set (features and target)
* ML pipeline to predict house prices
* Feature Importance Plot
* Model performance plot

## Additional Comments

* At the beginning of the project we formed a hypothesis. After the steps taken, we conclude that it holds: size, quality and the year the house was built all affect the price.
* Credit goes to Code Institute and https://github.com/Amareteklay/, whose work I followed.


---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/housepricepred2/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/housepricepred2'

## Load Data

Start by loading data

In [4]:
import numpy as np
import pandas as pd
df = pd.read_csv("outputs/datasets/collection/HousePrices.csv") 

print(df.shape)
df.head()

(1460, 24)


Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,0.0,548,RFn,...,65.0,196.0,61,5,7,856,0.0,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,,460,RFn,...,80.0,0.0,0,8,6,1262,,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,0.0,608,RFn,...,68.0,162.0,42,5,7,920,,2001,2002,223500
3,961,,,No,216,ALQ,540,,642,Unf,...,60.0,0.0,35,5,7,756,,1915,1970,140000
4,1145,,4.0,Av,655,GLQ,490,0.0,836,RFn,...,84.0,350.0,84,5,8,1145,,2000,2000,250000


---

## ML Pipeline: Regressor

In [5]:
from sklearn.pipeline import Pipeline

# Data Cleaning
from feature_engine.imputation import MeanMedianImputer
from feature_engine.selection import DropFeatures
from feature_engine.imputation import CategoricalImputer

# Feature Engineering
from feature_engine.encoding import OrdinalEncoder
from feature_engine.selection import SmartCorrelatedSelection
from feature_engine import transformation as vt
from feature_engine.outliers import Winsorizer


def PipelineDataCleaningAndFeatureEngineering():
    pipeline = Pipeline([
        ('impute_mean', MeanMedianImputer(imputation_method='mean', variables=['LotFrontage', 'BedroomAbvGr'])),
        ('impute_median', MeanMedianImputer(imputation_method='median', variables=['2ndFlrSF', 'MasVnrArea'])),
        ('impute_categorical', CategoricalImputer(imputation_method='frequent', variables=['GarageFinish', 'BsmtFinType1', 'BsmtExposure'])),
        ('drop_features', DropFeatures(features_to_drop=['EnclosedPorch', 'GarageYrBlt', 'WoodDeckSF'])), 
        ('encoder', OrdinalEncoder(encoding_method='arbitrary', variables=['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual'])),
        ('log_transformer', vt.LogTransformer(variables=['GrLivArea', 'LotArea', 'LotFrontage'])),
        ('power_transformer', vt.PowerTransformer(variables=['GarageArea', 'MasVnrArea', 'OpenPorchSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF'])),
        ('outlier_handler', Winsorizer(capping_method='iqr', tail='both', fold=1.5, variables=['GarageArea', 'LotArea', 'LotFrontage', 'MasVnrArea', 'OpenPorchSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF'])),  
        ('smart_corr_sel', SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.6, selection_method="variance")),
    ])

    return pipeline
    


PipelineDataCleaningAndFeatureEngineering()

Pipeline(steps=[('impute_mean',
                 MeanMedianImputer(imputation_method='mean',
                                   variables=['LotFrontage', 'BedroomAbvGr'])),
                ('impute_median',
                 MeanMedianImputer(variables=['2ndFlrSF', 'MasVnrArea'])),
                ('impute_categorical',
                 CategoricalImputer(imputation_method='frequent',
                                    variables=['GarageFinish', 'BsmtFinType1',
                                               'BsmtExposure'])),
                ('drop_features',
                 DropFeatures(f...
                 PowerTransformer(variables=['GarageArea', 'MasVnrArea',
                                             'OpenPorchSF', 'TotalBsmtSF',
                                             '1stFlrSF', '2ndFlrSF'])),
                ('outlier_handler',
                 Winsorizer(capping_method='iqr', fold=1.5, tail='both',
                            variables=['GarageArea', 'LotArea', 'LotFr

In [6]:
from sklearn.pipeline import Pipeline

# Feature Scaling
from sklearn.preprocessing import StandardScaler

# Feature Selection
from sklearn.feature_selection import SelectFromModel

# Models
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import ExtraTreesRegressor

def OptimizeModelPipeline(model):
    pipeline = Pipeline([
        ('feat_scaling', StandardScaler()),
        ('feat_selection', SelectFromModel(model)),
        ('model', model)
    ])

    return pipeline



* The class below is adapted from Code Institute.

In [7]:
from sklearn.model_selection import GridSearchCV
class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=True):
        for key in self.models.keys():
            try:
                print(f"\nRunning GridSearchCV for {key}\n")
                model = self.models[key]
                params = self.params[key]
                # refit=True (the GridSearchCV default) keeps best_estimator_ available
                gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs, verbose=verbose, scoring=scoring, refit=refit, error_score='raise')
                gs.fit(X, y)
                self.grid_searches[key] = gs
            except Exception as e:
                print(f"Error encountered for model {key}: {e}")
                continue

        return self.grid_searches
 

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                 'estimator': key,
                 'min_score': min(scores),
                 'max_score': max(scores),
                 'mean_score': np.mean(scores),
                 'std_score': np.std(scores),
            }
            return pd.Series({**params,**d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]        
                scores.append(r.reshape(len(params),1))

            all_scores = np.hstack(scores)
            for p, s in zip(params,all_scores):
                rows.append((row(k, s, p)))
        
        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)

        columns = ['estimator', 'min_score', 'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns], self.grid_searches

## Split Train and Test Set

In [8]:


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['SalePrice'], axis=1),
    df['SalePrice'],
    test_size=0.2,
    random_state=0
)

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

* Train set: (1168, 23) (1168,) 
* Test set: (292, 23) (292,)


In [9]:
pipeline_data_cleaning_feat_eng = PipelineDataCleaningAndFeatureEngineering()
X_train = pipeline_data_cleaning_feat_eng.fit_transform(X_train)
X_test = pipeline_data_cleaning_feat_eng.transform(X_test)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(1168, 17) (1168,) (292, 17) (292,)




## Grid Search CV - Sklearn

In [10]:
# Set up a dictionary of various regression models with default settings
initial_models = {
    "Linear_Reg": LinearRegression(),
    "Decision_Tree": DecisionTreeRegressor(random_state=0),
    "Random_Forest": RandomForestRegressor(random_state=0),
    "Extra_Trees": ExtraTreesRegressor(random_state=0),
    "AdaBoost": AdaBoostRegressor(random_state=0),
    "Gradient_Boosting": GradientBoostingRegressor(random_state=0),
    "XGBoost": XGBRegressor(random_state=0),
}

# Define hyperparameters for a quick comparison of models
model_hyperparams = {
    "Linear_Reg": {},

    "Decision_Tree": {
        'max_depth': [None, 4, 15],
        'min_samples_split': [2, 50],
        'min_samples_leaf': [1, 50],
        'max_leaf_nodes': [None, 50],
    },

    "Random_Forest": {
        'n_estimators': [100, 50, 140],
        'max_depth': [None, 4, 15],
        'min_samples_split': [2, 50],
        'min_samples_leaf': [1, 50],
        'max_leaf_nodes': [None, 50],
    },

    "Extra_Trees": {
        'n_estimators': [100, 50, 150],
        'max_depth': [None, 3, 15],
        'min_samples_split': [2, 50],
        'min_samples_leaf': [1, 50],
    },

    "AdaBoost": {
        'n_estimators': [50, 25, 80, 150],
        'learning_rate': [1, 0.1, 2],
        'loss': ['linear', 'square', 'exponential'],
    },

    "Gradient_Boosting": {
        'n_estimators': [100, 50, 140],
        'learning_rate': [0.1, 0.01, 0.001],
        'max_depth': [3, 15, None],
        'min_samples_split': [2, 50],
        'min_samples_leaf': [1, 50],
        'max_leaf_nodes': [None, 50],
    },

    "XGBoost": {
        'n_estimators': [30, 80, 200],
        'max_depth': [None, 3, 15],
        'learning_rate': [0.01, 0.1, 0.001],
        'gamma': [0, 0.1],
    },
}

In [11]:
search = HyperparameterOptimizationSearch(models=initial_models, params=model_hyperparams)
search.fit(X_train, y_train, cv=5, n_jobs=-1, verbose=1, scoring='r2')



Running GridSearchCV for Linear_Reg

Fitting 5 folds for each of 1 candidates, totalling 5 fits

Running GridSearchCV for Decision_Tree

Fitting 5 folds for each of 24 candidates, totalling 120 fits

Running GridSearchCV for Random_Forest

Fitting 5 folds for each of 72 candidates, totalling 360 fits

Running GridSearchCV for Extra_Trees

Fitting 5 folds for each of 36 candidates, totalling 180 fits

Running GridSearchCV for AdaBoost

Fitting 5 folds for each of 36 candidates, totalling 180 fits

Running GridSearchCV for Gradient_Boosting

Fitting 5 folds for each of 216 candidates, totalling 1080 fits

Running GridSearchCV for XGBoost

Fitting 5 folds for each of 54 candidates, totalling 270 fits



{'Linear_Reg': GridSearchCV(cv=5, error_score='raise', estimator=LinearRegression(), n_jobs=-1,
              param_grid={}, scoring='r2', verbose=1),
 'Decision_Tree': GridSearchCV(cv=5, error_score='raise',
              estimator=DecisionTreeRegressor(random_state=0), n_jobs=-1,
              param_grid={'max_depth': [None, 4, 15],
                          'max_leaf_nodes': [None, 50],
                          'min_samples_leaf': [1, 50],
                          'min_samples_split': [2, 50]},
              scoring='r2', verbose=1),
 'Random_Forest': GridSearchCV(cv=5, error_score='raise',
              estimator=RandomForestRegressor(random_state=0), n_jobs=-1,
              param_grid={'max_depth': [None, 4, 15],
                          'max_leaf_nodes': [None, 50],
                          'min_samples_leaf': [1, 50],
                          'min_samples_split': [2, 50],
                          'n_estimators': [100, 50, 140]},
              scoring='r2', verbose=1),
 'E

We summarise the scores and check the results, sorted by the mean R2 score across the five folds.

In [12]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

Unnamed: 0,estimator,min_score,mean_score,max_score,std_score,max_depth,max_leaf_nodes,min_samples_leaf,min_samples_split,n_estimators,learning_rate,loss,gamma
99,Extra_Trees,0.714201,0.818785,0.871247,0.055429,,,1,2,150,,,
123,Extra_Trees,0.719246,0.81637,0.865197,0.052675,15,,1,2,150,,,
121,Extra_Trees,0.715323,0.815421,0.865929,0.053533,15,,1,2,100,,,
97,Extra_Trees,0.703878,0.815397,0.870162,0.05894,,,1,2,100,,,
192,Gradient_Boosting,0.721547,0.81457,0.874405,0.055771,3,50,50,50,140,0.1,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
433,XGBoost,-6.297266,-5.096869,-4.399388,0.639843,3,,,,30,0.001,,0.1
430,XGBoost,-6.3037,-5.098871,-4.399887,0.642175,,,,,30,0.001,,0.1
403,XGBoost,-6.3037,-5.098871,-4.399887,0.642175,,,,,30,0.001,,0
409,XGBoost,-6.304674,-5.099095,-4.399837,0.642522,15,,,,30,0.001,,0
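
The `min_score`, `mean_score`, `max_score` and `std_score` columns come from the per-fold test scores that `GridSearchCV` stores in `cv_results_`. The following is a minimal sketch of that aggregation, assuming synthetic data from `make_regression` and a small illustrative grid rather than the house price data and grids used in this notebook:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the house price data (assumption, not the real dataset)
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

n_folds = 5
gs = GridSearchCV(
    ExtraTreesRegressor(random_state=0),
    param_grid={"n_estimators": [10, 20], "min_samples_leaf": [1, 20]},
    cv=n_folds,
    scoring="r2",
)
gs.fit(X, y)

# One row per hyperparameter candidate, one column per fold
fold_scores = np.column_stack(
    [gs.cv_results_[f"split{i}_test_score"] for i in range(n_folds)]
)

# Attach the aggregated scores to the parameter combinations
summary = pd.DataFrame(gs.cv_results_["params"])
summary["min_score"] = fold_scores.min(axis=1)
summary["mean_score"] = fold_scores.mean(axis=1)
summary["max_score"] = fold_scores.max(axis=1)
summary["std_score"] = fold_scores.std(axis=1)
summary = summary.sort_values("mean_score", ascending=False)
print(summary)
```

A large gap between `min_score` and `max_score` for a candidate suggests its performance depends heavily on which fold is held out.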


## Extensive search on the most suitable model to find the best hyperparameter configuration


* First, we define the most suitable model and its hyperparameter grid

In [17]:
initial_models = {
    "Extra_Trees": ExtraTreesRegressor(random_state=0),
}
model_hyperparams = {
    "Extra_Trees":{'n_estimators': [50,100,150],
        'max_depth': [None, 3, 15],
        'min_samples_split': [2, 50],
        'min_samples_leaf': [1,50],
        },
}

* Then, as before, we run GridSearchCV. The grid here is the same as the Extra_Trees grid in the quick comparison, so the scores match those above.

In [18]:
search = HyperparameterOptimizationSearch(models=initial_models, params= model_hyperparams)
search.fit(X_train, y_train, scoring = 'r2', n_jobs=-1, cv=5)


Running GridSearchCV for Extra_Trees

Fitting 5 folds for each of 36 candidates, totalling 180 fits


{'Extra_Trees': GridSearchCV(cv=5, error_score='raise',
              estimator=ExtraTreesRegressor(random_state=0), n_jobs=-1,
              param_grid={'max_depth': [None, 3, 15],
                          'min_samples_leaf': [1, 50],
                          'min_samples_split': [2, 50],
                          'n_estimators': [50, 100, 150]},
              scoring='r2', verbose=1)}

In [19]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

Unnamed: 0,estimator,min_score,mean_score,max_score,std_score,max_depth,min_samples_leaf,min_samples_split,n_estimators
2,Extra_Trees,0.714201,0.818785,0.871247,0.055429,,1,2,150
26,Extra_Trees,0.719246,0.81637,0.865197,0.052675,15.0,1,2,150
25,Extra_Trees,0.715323,0.815421,0.865929,0.053533,15.0,1,2,100
1,Extra_Trees,0.703878,0.815397,0.870162,0.05894,,1,2,100
0,Extra_Trees,0.697427,0.811536,0.872221,0.061066,,1,2,50
24,Extra_Trees,0.717306,0.811155,0.859755,0.052548,15.0,1,2,50
5,Extra_Trees,0.723733,0.779456,0.819367,0.041943,,1,50,150
29,Extra_Trees,0.723502,0.779419,0.819453,0.042099,15.0,1,50,150
28,Extra_Trees,0.718758,0.778994,0.819847,0.043359,15.0,1,50,100
4,Extra_Trees,0.719065,0.778945,0.819745,0.0432,,1,50,100
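
A more extensive search would widen the grid beyond the values already tried. The following is a minimal sketch, assuming synthetic data and illustrative values: `extended_grid` and its entries (including `max_features`) are hypothetical choices, not settings taken from this project.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the transformed train set (assumption)
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Hypothetical wider grid: adds max_features and intermediate leaf/depth values
extended_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 10],
    "min_samples_leaf": [1, 5],
    "max_features": [1.0, 0.5, "sqrt"],
}

gs = GridSearchCV(
    ExtraTreesRegressor(random_state=0),
    param_grid=extended_grid,
    cv=3,
    scoring="r2",
)
gs.fit(X, y)

# Number of candidates is the product of the list lengths: 2 * 2 * 2 * 3
print(len(gs.cv_results_["params"]), "candidates evaluated")
print("Best parameters:", gs.best_params_)
```

Each extra value multiplies the number of fits, so it helps to widen only the parameters that showed the most effect in the quick comparison.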


* Identify the best-performing model

In [25]:
optimal_model = grid_search_summary.iloc[0]['estimator']
optimal_model

'Extra_Trees'

* Extract the best parameters for the top-performing model

In [21]:
optimal_parameters = grid_search_pipelines[optimal_model].best_params_
optimal_parameters

{'max_depth': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'n_estimators': 150}

* Assign the most effective regression model from the grid search results

In [22]:
optimal_regression_pipeline = grid_search_pipelines[optimal_model].best_estimator_
optimal_regression_pipeline

ExtraTreesRegressor(n_estimators=150, random_state=0)
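
To assess the selected model, its predictions can be scored on both the train and the test set. The following is a minimal sketch, assuming synthetic data in place of the house price sets; `regression_report` is a hypothetical helper name, and R2 and mean absolute error are illustrative metric choices.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the transformed features and SalePrice (assumption)
X, y = make_regression(n_samples=500, n_features=10, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Same settings as the best estimator reported by the grid search
model = ExtraTreesRegressor(n_estimators=150, random_state=0)
model.fit(X_tr, y_tr)


def regression_report(fitted_model, features, target):
    """Return R2 and mean absolute error for a fitted regressor (hypothetical helper)."""
    predictions = fitted_model.predict(features)
    return {
        "r2": r2_score(target, predictions),
        "mae": mean_absolute_error(target, predictions),
    }


train_metrics = regression_report(model, X_tr, y_tr)
test_metrics = regression_report(model, X_te, y_te)
print("Train:", train_metrics)
print("Test: ", test_metrics)
```

A train score far above the test score would point to overfitting, which fully grown trees are prone to.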

### Assess feature importance

In [26]:
X_train.head(3)

Unnamed: 0,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,GarageArea,GarageFinish,KitchenQual,LotArea,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,TotalBsmtSF,YearBuilt,YearRemodAdd
618,0.0,2.883272,0,48,0,1774,27.820855,0,0,9.366831,4.49981,21.260292,10.392305,5,42.684892,2007,2007
870,0.0,2.0,1,0,0,894,17.549929,0,1,8.794825,4.094345,0.0,0.0,5,29.899833,1962,1962
92,0.0,2.0,1,713,1,163,20.78461,0,1,9.50002,4.382027,0.0,0.0,7,29.597297,1921,2006


In [23]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Setting the style for the plot
sns.set_theme(style="whitegrid")

# The grid search was fitted on the bare estimator rather than on a Pipeline,
# so best_estimator_ is the ExtraTreesRegressor itself: it has no `steps` and
# no 'feat_selection' step. The model was trained on every column of the
# transformed X_train, so those columns are the feature names.
feature_names = X_train.columns

# Creating a DataFrame for the importance of each feature
df_feature_importance = pd.DataFrame({
    'Feature': feature_names,
    'Importance': optimal_regression_pipeline.feature_importances_
}).sort_values(by='Importance', ascending=False)

# Displaying the features in order of importance
print(f"The model uses these {len(feature_names)} features, listed in order of importance: \n{df_feature_importance['Feature'].to_list()}")

# Plotting the feature importance
df_feature_importance.set_index('Feature').plot(kind='bar')
plt.title("Feature Importance in the Model")
plt.ylabel("Importance")
plt.show()
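
Because the grid search was fitted on bare estimators, `best_estimator_` is an `ExtraTreesRegressor` rather than a `Pipeline`, so it has no `steps` or `feat_selection` to query. One alternative is to search over a pipeline so that scaling and feature selection are part of the fitted result. The following is a minimal sketch, assuming synthetic data with hypothetical column names (`feat_0` ... `feat_5`) and a small illustrative grid; `build_pipeline` mirrors the structure of `OptimizeModelPipeline`, and pipeline parameters are addressed with the `model__` prefix.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the transformed train set (assumption)
X_arr, y = make_regression(
    n_samples=300, n_features=6, n_informative=3, noise=5.0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(6)])


def build_pipeline(model):
    """Scaling, model-based feature selection, then the regressor."""
    return Pipeline([
        ("feat_scaling", StandardScaler()),
        ("feat_selection", SelectFromModel(model)),
        ("model", model),
    ])


# Parameters of the final step are addressed as '<step name>__<parameter>'
param_grid = {
    "model__n_estimators": [20, 40],
    "model__min_samples_leaf": [1, 10],
}

gs = GridSearchCV(
    build_pipeline(ExtraTreesRegressor(random_state=0)),
    param_grid=param_grid,
    cv=3,
    scoring="r2",
)
gs.fit(X, y)

# best_estimator_ is now a fitted Pipeline, so its steps can be inspected
best_pipeline = gs.best_estimator_
selected = X.columns[best_pipeline["feat_selection"].get_support()]

importance = pd.DataFrame({
    "Feature": selected,
    "Importance": best_pipeline["model"].feature_importances_,
}).sort_values(by="Importance", ascending=False)
print(importance)
```

With this layout only the features kept by `SelectFromModel` reach the final regressor, so the importance table has one row per selected feature.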

---

NOTE

* You may add as many sections as you want, as long as they support your project workflow.
* All notebook cells should run top-down: avoid a workflow in which you must go back to a previous cell to execute a task, such as refreshing a variable's content.

---

# Push files to Repo

* If you do not need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid until a folder name is set
except Exception as e:
  print(e)
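
The outputs listed at the top of this notebook (train set, test set, ML pipeline) could be written to the folder created in this step. The following is a minimal sketch, assuming toy data and a hypothetical versioned folder name (`outputs/ml_pipeline/v1`) with hypothetical file names; a temporary directory stands in for the repository so the sketch runs anywhere.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

# Temporary base directory standing in for the repo root (assumption)
base_dir = tempfile.mkdtemp()
file_path = os.path.join(base_dir, "outputs", "ml_pipeline", "v1")  # hypothetical folder

# exist_ok=True avoids an error when the folder already exists
os.makedirs(file_path, exist_ok=True)

# Toy stand-ins for the real train set and fitted model (assumption)
X_train = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [0.0, 1.0, 0.0, 1.0]})
y_train = pd.Series([10.0, 20.0, 30.0, 40.0], name="SalePrice")
model = ExtraTreesRegressor(n_estimators=10, random_state=0).fit(X_train, y_train)

# Save the data as CSV and the fitted model as a pickle file
X_train.to_csv(os.path.join(file_path, "X_train.csv"), index=False)
y_train.to_csv(os.path.join(file_path, "y_train.csv"), index=False)
joblib.dump(model, os.path.join(file_path, "regressor.pkl"))

# Reload the model to confirm the round trip works
reloaded = joblib.load(os.path.join(file_path, "regressor.pkl"))
print(sorted(os.listdir(file_path)))
```

The test set would be saved the same way, and a version number in the folder name keeps earlier pipelines available for comparison.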
