# **Predicting House Sales Price**

## Objectives

* The client wants to predict the sale price of her 4 inherited houses, and of any other house in Ames, Iowa.
  * We need a way to check the inherited houses against the selected variables and reliably predict an outcome.
  * We will likely use a conventional ML model to map the relationship between the features and the target.
  * Because conventional ML models are used, we will likely need hyperparameter optimization.

## Inputs

* outputs/datasets/cleaned/TestSetCleaned.csv
* outputs/datasets/cleaned/TrainSetCleaned.csv
* /workspace/PP5-ML/inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv

## Outputs

* /workspace/PP5-ML/outputs/datasets/collection



---

# Change working directory

* The notebooks are stored in a subfolder, so when running a notebook in the editor the working directory must be changed to the project root.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/PP5-ML/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/PP5-ML'
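
The same directory change can also be written with `pathlib`. The following is a minimal sketch rather than part of the original notebook; the helper name `move_to_parent` is hypothetical.

```python
import os
from pathlib import Path


def move_to_parent(start=None):
    """Make the parent of `start` (default: the current directory) the working directory."""
    base = Path(start) if start is not None else Path.cwd()
    parent = base.resolve().parent
    os.chdir(parent)
    return parent
```

Called once from `jupyter_notebooks`, this would leave the session in the project root. As with the cells above, running it a second time moves up one more level, so it should be run only once per session.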

# Section 1 - Import the transformed data

In [4]:
import numpy as np
import pandas as pd
# The working directory is now the project root, so a relative path is enough
df = pd.read_csv("outputs/datasets/collection/Housing_prices_transformed.csv")

print(df.shape)
df.head(3)


(1460, 24)


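| | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | EnclosedPorch | GarageArea | GarageFinish | ... | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856 | 854.0 | 3.0 | 2.0 | 706 | 7 | 150 | 0.0 | 548 | 3 | ... | 65.0 | 196.0 | 61 | 5 | 7 | 856 | 0.0 | 2003 | 2003 | 208500 |
| 1 | 1262 | 0.0 | 3.0 | 5.0 | 978 | 6 | 284 | NaN | 460 | 3 | ... | 80.0 | 0.0 | 0 | 8 | 6 | 1262 | NaN | 1976 | 1976 | 181500 |
| 2 | 920 | 866.0 | 3.0 | 3.0 | 486 | 7 | 434 | 0.0 | 608 | 3 | ... | 68.0 | 162.0 | 42 | 5 | 7 | 920 | NaN | 2001 | 2002 | 223500 |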


## Create an ML pipeline

In [5]:
from sklearn.pipeline import Pipeline

# Feature Engineering
from feature_engine.encoding import OrdinalEncoder
from feature_engine.selection import SmartCorrelatedSelection

# Feat Scaling
from sklearn.preprocessing import StandardScaler

# Feat Selection
from sklearn.feature_selection import SelectFromModel

# ML algorithms
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import ExtraTreesRegressor


def PipelineOptimization(model):
    pipeline_base = Pipeline([

        ("OrdinalCategoricalEncoder", OrdinalEncoder(encoding_method='arbitrary',
                                                     variables=['GarageArea', 'GrLivArea', 'KitchenQual', 'OverallQual',
                                                                '1stFlrSF', 'TotalBsmtSF', 'YearBuilt',])),


        ("SmartCorrelatedSelection", SmartCorrelatedSelection(variables=None,
         method="spearman", threshold=0.6, selection_method="variance")),

        ("feat_scaling", StandardScaler()),

        ("feat_selection",  SelectFromModel(model)),

        ("model", model),

    ])

    return pipeline_base

In [6]:
from sklearn.model_selection import GridSearchCV


class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")
            model = PipelineOptimization(self.models[key])

            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs,
                              verbose=verbose, scoring=scoring)
            gs.fit(X, y)
            self.grid_searches[key] = gs

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]
                scores.append(r.reshape(len(params), 1))

            all_scores = np.hstack(scores)
            for p, s in zip(params, all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)

        columns = ['estimator', 'min_score',
                   'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns], self.grid_searches
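
To make the `score_summary` logic easier to follow, here is a small sketch of how per-fold test scores collapse into the min, mean, max and std columns. The `cv_results` dictionary is a hypothetical stand-in for `GridSearchCV.cv_results_`, with made-up scores for 2 candidates and 3 folds.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for GridSearchCV.cv_results_ (2 candidates, 3 folds).
cv_results = {
    "params": [{"model__max_depth": 3}, {"model__max_depth": 10}],
    "split0_test_score": np.array([0.70, 0.80]),
    "split1_test_score": np.array([0.74, 0.86]),
    "split2_test_score": np.array([0.72, 0.83]),
}
n_splits = 3

# Stack fold scores column-wise: one row per candidate, one column per fold.
per_fold = np.column_stack(
    [cv_results[f"split{i}_test_score"] for i in range(n_splits)]
)

# Summarise each candidate across its folds and sort by the mean score.
summary = pd.DataFrame(cv_results["params"]).assign(
    min_score=per_fold.min(axis=1),
    mean_score=per_fold.mean(axis=1),
    max_score=per_fold.max(axis=1),
    std_score=per_fold.std(axis=1),
).sort_values("mean_score", ascending=False)

print(summary)
```

The class above does the same thing with `reshape` and `hstack`, building one `pd.Series` per candidate; the vectorised form here is only an alternative way to read it.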

## Split for Test and Train Set

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['SalePrice'], axis=1),
    df['SalePrice'],
    test_size=0.2,
    random_state=0
)

print("* Train set:", X_train.shape, y_train.shape,
      "\n* Test set:",  X_test.shape, y_test.shape)


* Train set: (1168, 23) (1168,) 
* Test set: (292, 23) (292,)


Check for missing data

In [8]:
null_counts = X_train.isnull().sum() 
print(null_counts)

1stFlrSF            0
2ndFlrSF           60
BedroomAbvGr       80
BsmtExposure        0
BsmtFinSF1          0
BsmtFinType1        0
BsmtUnfSF           0
EnclosedPorch    1056
GarageArea          0
GarageFinish        0
GarageYrBlt        58
GrLivArea           0
KitchenQual         0
LotArea             0
LotFrontage       212
MasVnrArea          6
OpenPorchSF         0
OverallCond         0
OverallQual         0
TotalBsmtSF         0
WoodDeckSF       1034
YearBuilt           0
YearRemodAdd        0
dtype: int64


In [9]:
df_method = X_train.drop(columns=['EnclosedPorch', 'WoodDeckSF'])
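
The two dropped columns are the ones that are mostly empty. The sketch below turns the null counts printed above into shares of the 1168 training rows. The 50% cut-off is an assumption chosen for illustration; the notebook does not state the rule it used.

```python
# Non-zero null counts reported for X_train above (1168 rows).
n_rows = 1168
null_counts = {
    "2ndFlrSF": 60,
    "BedroomAbvGr": 80,
    "EnclosedPorch": 1056,
    "GarageYrBlt": 58,
    "LotFrontage": 212,
    "MasVnrArea": 6,
    "WoodDeckSF": 1034,
}

# Hypothetical cut-off: drop a column when more than half of its values are missing.
DROP_THRESHOLD = 0.5

missing_share = {col: count / n_rows for col, count in null_counts.items()}
to_drop = sorted(col for col, share in missing_share.items() if share > DROP_THRESHOLD)

# Print the columns from most to least missing.
for col, share in sorted(missing_share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:<14}{share:6.1%}")
print("Columns to drop:", to_drop)
```

EnclosedPorch and WoodDeckSF are missing in about 90% and 89% of the training rows respectively, while every other column is below 20%, which is consistent with dropping those two and imputing the rest.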

In [10]:
from sklearn.impute import SimpleImputer

fillable_columns = ['2ndFlrSF', 'BedroomAbvGr', 'GarageYrBlt', 'LotFrontage', 'MasVnrArea']

imputer_num = SimpleImputer(strategy='mean')
df_method[fillable_columns] = imputer_num.fit_transform(df_method[fillable_columns])

imputer_cat = SimpleImputer(strategy='most_frequent')

categorical_cols = ['GarageFinish']
df_method[categorical_cols] = imputer_cat.fit_transform(df_method[categorical_cols])


In [11]:
null_counts = df_method.isnull().sum() 
print(null_counts)

1stFlrSF        0
2ndFlrSF        0
BedroomAbvGr    0
BsmtExposure    0
BsmtFinSF1      0
BsmtFinType1    0
BsmtUnfSF       0
GarageArea      0
GarageFinish    0
GarageYrBlt     0
GrLivArea       0
KitchenQual     0
LotArea         0
LotFrontage     0
MasVnrArea      0
OpenPorchSF     0
OverallCond     0
OverallQual     0
TotalBsmtSF     0
YearBuilt       0
YearRemodAdd    0
dtype: int64


In [12]:
Cleaned_data = df.drop(columns=['EnclosedPorch', 'WoodDeckSF'])


In [13]:
from sklearn.impute import SimpleImputer

fillable_columns = ['2ndFlrSF', 'BedroomAbvGr', 'GarageYrBlt', 'LotFrontage', 'MasVnrArea']

imputer_num = SimpleImputer(strategy='mean')
Cleaned_data[fillable_columns] = imputer_num.fit_transform(Cleaned_data[fillable_columns])


imputer_cat = SimpleImputer(strategy='most_frequent')

categorical_cols = ['GarageFinish']
Cleaned_data[categorical_cols] = imputer_cat.fit_transform(Cleaned_data[categorical_cols])


In [14]:
null_counts_Cleaned_data = Cleaned_data.isnull().sum() 
print(null_counts_Cleaned_data)
print(f'Below is the Cleaned_data')

1stFlrSF        0
2ndFlrSF        0
BedroomAbvGr    0
BsmtExposure    0
BsmtFinSF1      0
BsmtFinType1    0
BsmtUnfSF       0
GarageArea      0
GarageFinish    0
GarageYrBlt     0
GrLivArea       0
KitchenQual     0
LotArea         0
LotFrontage     0
MasVnrArea      0
OpenPorchSF     0
OverallCond     0
OverallQual     0
TotalBsmtSF     0
YearBuilt       0
YearRemodAdd    0
SalePrice       0
dtype: int64
Below is the Cleaned_data


## Split the data

In [15]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    Cleaned_data.drop(['SalePrice'], axis=1),
    Cleaned_data['SalePrice'],
    test_size=0.2,
    random_state=0
)

print("* Train set:", X_train.shape, y_train.shape,
      "\n* Test set:",  X_test.shape, y_test.shape)

* Train set: (1168, 21) (1168,) 
* Test set: (292, 21) (292,)
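
Note that `Cleaned_data` was imputed on all 1460 rows before this split, so the column means were computed with the test rows included. A common alternative is to split first and fit the imputer on the training rows only. The sketch below shows that order on a tiny made-up frame; only the column names echo the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Tiny illustrative frame; the values are invented for this sketch.
toy = pd.DataFrame({
    "LotFrontage": [65.0, 80.0, np.nan, 60.0, 84.0, np.nan, 75.0, 51.0],
    "MasVnrArea": [196.0, 0.0, 162.0, np.nan, 350.0, 0.0, 186.0, 240.0],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000],
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy.drop(columns="SalePrice"), toy["SalePrice"], test_size=0.25, random_state=0
)

imputer = SimpleImputer(strategy="mean")
# Learn the column means from the training rows only...
X_tr_filled = pd.DataFrame(
    imputer.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index
)
# ...and reuse those same means for the held-out rows.
X_te_filled = pd.DataFrame(
    imputer.transform(X_te), columns=X_te.columns, index=X_te.index
)

print(X_tr_filled.isnull().sum().sum(), X_te_filled.isnull().sum().sum())
```

Placing the imputer inside a `Pipeline` achieves the same effect during cross-validation, because the imputer is then refitted on each training fold.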


## GridSearch CV - SKLearn

* We will start with the default hyperparameters of each model

In [16]:
models_quick_search = {
    'LinearRegression': LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "RandomForestRegressor": RandomForestRegressor(random_state=0),
    "ExtraTreesRegressor": ExtraTreesRegressor(random_state=0),
    "AdaBoostRegressor": AdaBoostRegressor(random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
    "XGBRegressor": XGBRegressor(random_state=0),
}

params_quick_search = {
    'LinearRegression': {},
    "DecisionTreeRegressor": {},
    "RandomForestRegressor": {},
    "ExtraTreesRegressor": {},
    "AdaBoostRegressor": {},
    "GradientBoostingRegressor": {},
    "XGBRegressor": {},
}

In [17]:
from sklearn.model_selection import GridSearchCV

# Define a function to perform grid search for each model
def perform_grid_search(model, params, X_train, y_train):
    grid_search = GridSearchCV(estimator=model, param_grid=params, scoring='r2', n_jobs=-1, cv=5)
    grid_search.fit(X_train, y_train)
    return grid_search.best_estimator_, grid_search.best_score_

# Loop through models and perform grid search
best_models = {}
best_scores = {}

for name, model in models_quick_search.items():
    print(f"Running GridSearchCV for {name}...")
    best_model, best_score = perform_grid_search(model, params_quick_search[name], X_train, y_train)
    best_models[name] = best_model
    best_scores[name] = best_score

# Display the best scores
for name, score in best_scores.items():
    print(f"{name}: Best R^2 Score = {score:.4f}")


Running GridSearchCV for LinearRegression...


Running GridSearchCV for DecisionTreeRegressor...
Running GridSearchCV for RandomForestRegressor...
Running GridSearchCV for ExtraTreesRegressor...
Running GridSearchCV for AdaBoostRegressor...
Running GridSearchCV for GradientBoostingRegressor...
Running GridSearchCV for XGBRegressor...



LinearRegression: Best R^2 Score = 0.8368
DecisionTreeRegressor: Best R^2 Score = 0.6973
RandomForestRegressor: Best R^2 Score = 0.8577
ExtraTreesRegressor: Best R^2 Score = 0.8586
AdaBoostRegressor: Best R^2 Score = 0.7952
GradientBoostingRegressor: Best R^2 Score = 0.8447
XGBRegressor: Best R^2 Score = 0.8405


## Summary of the R² scores

Sorted from best to worst, using the scores printed above:

* ExtraTreesRegressor: 0.8586
* RandomForestRegressor: 0.8577
* GradientBoostingRegressor: 0.8447
* XGBRegressor: 0.8405
* LinearRegression: 0.8368
* AdaBoostRegressor: 0.7952
* DecisionTreeRegressor: 0.6973

ExtraTreesRegressor and RandomForestRegressor are effectively tied (a gap of about 0.001), so we will start with the RandomForestRegressor and investigate its hyperparameters.
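
Ranking the models in code rather than by hand avoids transcription slips. This is a minimal sketch that uses the scores from the printed GridSearchCV output; in the notebook itself the `best_scores` dictionary built in In [17] could be sorted directly.

```python
# Cross-validated R² scores copied from the printed GridSearchCV output.
best_scores = {
    "LinearRegression": 0.8368,
    "DecisionTreeRegressor": 0.6973,
    "RandomForestRegressor": 0.8577,
    "ExtraTreesRegressor": 0.8586,
    "AdaBoostRegressor": 0.7952,
    "GradientBoostingRegressor": 0.8447,
    "XGBRegressor": 0.8405,
}

# Sort by score, best first.
ranking = sorted(best_scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name:<26}{score:.4f}")

# Gap between the two best models.
top_name, top_score = ranking[0]
runner_up_name, runner_up_score = ranking[1]
gap = top_score - runner_up_score
print(f"{top_name} leads {runner_up_name} by {gap:.4f}")
```

A gap this small is well within what a different random seed or fold assignment could produce, so either of the top two models is a reasonable starting point.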

---

# Section 2 - RandomForestRegressor 

In [18]:
pipeline = Pipeline(steps=[ 
    ('imputer', SimpleImputer(strategy='mean')), 
    ('model', RandomForestRegressor(random_state=0)) 
])

In [19]:
models_search = {
    "RandomForestRegressor": RandomForestRegressor(random_state=0),
}

params_search = {
    "RandomForestRegressor": {
        'model__n_estimators': [100, 300],
        'model__max_depth': [3, 10, None],
        'model__min_samples_split': [2, 10],
        'model__min_samples_leaf': [1, 4],
        # 1.0 (use all features) replaces 'auto', which newer scikit-learn releases no longer accept here
        'model__max_features': [1.0, 'sqrt', 'log2']
    }
}
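
Before launching the search it helps to know how much work the grid implies. The sketch below counts the candidates from the number of values listed per hyperparameter in `params_search`; the dictionary of counts is written out by hand for illustration.

```python
from math import prod

# Number of values listed per hyperparameter in params_search above.
options_per_param = {
    "model__n_estimators": 2,
    "model__max_depth": 3,
    "model__min_samples_split": 2,
    "model__min_samples_leaf": 2,
    "model__max_features": 3,
}
cv_folds = 5  # cv=5 is what the notebook passes to search.fit

# Every combination is one candidate; each candidate is fitted once per fold.
n_candidates = prod(options_per_param.values())
n_fits = n_candidates * cv_folds
print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} model fits")
```

That is 72 candidates and 360 fits, plus one final refit on the full training set because `GridSearchCV` refits the best candidate by default.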


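In [20]:
# Re-derive the target from the rows that are actually in X_train.
# Assigning the full df['SalePrice'] here would give 1460 targets for 1168 training rows.
y_train = Cleaned_data.loc[X_train.index, 'SalePrice']

In [22]:
print(f"Length of X_train: {len(X_train)}")
print(f"Length of y_train: {len(y_train)}")

Length of X_train: 1168
Length of y_train: 1168


In [76]:
search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train, scoring='r2', n_jobs=-1, cv=5)


Note: as originally written, this cell printed `Running GridSearchCV for RandomForestRegressor` and then failed with `ValueError: Found input variables with inconsistent numbers of samples: [1168, 1460]`. The cause was that y_train had been reassigned to the full df['SalePrice'] column (1460 rows) while the features held only the 1168 training rows. The call also referenced TrainingSet, a name that is not defined anywhere in this notebook. The cells above now take both the features and the target from the split in In [15]. The search itself has not been re-run, so no scores are reported here.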
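
The following self-contained sketch shows a pipeline grid search in which the features and the target come from the same split, with an explicit check on their lengths before fitting. It uses synthetic data from `make_regression` and a deliberately tiny grid, both of which are assumptions made so the example runs in seconds; it is not the notebook's actual search.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the housing data (assumption for this sketch).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Guard against a sample-count mismatch: one target per training row.
assert len(X_train) == len(y_train), (len(X_train), len(y_train))

pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("model", RandomForestRegressor(random_state=0)),
])

# Deliberately tiny grid; the "model__" prefix targets the pipeline step named "model".
param_grid = {
    "model__n_estimators": [10, 30],
    "model__max_depth": [3, None],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="r2", n_jobs=1)
search.fit(X_train, y_train)

print(search.best_params_)
print(f"CV R^2: {search.best_score_:.3f}  "
      f"Test R^2: {search.score(X_test, y_test):.3f}")
```

The length assertion is cheap and fails with both row counts in the message, which makes a mismatch much quicker to diagnose than the error raised from inside scikit-learn.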

---


# Push files to Repo

* If you do not need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid Python until a folder name is set
except Exception as e:
  print(e)
