# **Refining the model to predict the House Sale Price**

## Objectives

* The client is interested in predicting the sale prices of her 4 inherited houses, and of any other house in Ames, Iowa.
  * We need a way of checking the inherited houses against the selected variables and reliably predicting an outcome.
  * We will likely use a conventional ML model to map the relationship between features and the target.
  * We will likely need hyperparameter optimization due to the conventional ML models used.

## Inputs

* outputs/datasets/cleaned/CleanedDataset.csv
* outputs/datasets/cleaned/TestSetCleaned.csv
* outputs/datasets/cleaned/TrainSetCleaned.csv
* /workspace/PP5-ML/inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv

## Outputs

* /workspace/PP5-ML/outputs/datasets/collection

## Additional Comments

* In case you have any additional comments that don't fit in the previous bullets, please state them here. 


---

# Change working directory

* The notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [46]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/PP5-ML'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory
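
As a minimal sketch of the same step, the snippet below wraps the directory change in a helper and tries it on a temporary folder, so it can be run without touching the real project. The helper name `move_to_parent` and the `project/notebooks` layout are assumptions made for illustration only.

```python
import os
import tempfile


def move_to_parent(path):
    """Make the parent of `path` the working directory and return it."""
    parent = os.path.dirname(path)
    os.chdir(parent)
    return os.getcwd()


start = os.getcwd()
with tempfile.TemporaryDirectory() as project:
    # Hypothetical layout: <project>/notebooks
    notebooks = os.path.join(project, "notebooks")
    os.makedirs(notebooks)
    new_dir = move_to_parent(notebooks)
    same = os.path.samefile(new_dir, project)
    os.chdir(start)  # go back before the temporary folder is removed

print(same)
```

Running the real cell twice moves one level higher each time, so it is worth checking the printed directory before loading any relative paths.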

In [47]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [48]:
current_dir = os.getcwd()
current_dir

'/workspace'

# Section 1 - Import the data

Load the cleaned data from section 3.

In [49]:
import pandas as pd
df_cleaned = (pd.read_csv('/workspace/PP5-ML/outputs/datasets/cleaned/CleanedDataset.csv'))
df_cleaned.head(3)

Unnamed: 0,1stFlrSF,2ndFlrSF,BsmtFinSF1,GarageArea,GrLivArea,LotArea,OverallCond,OverallQual,TotalBsmtSF,YearBuilt,SalePrice
0,856,854.0,706,548,1710,8450,5,7,856,2003,208500
1,1262,0.0,978,460,1262,9600,8,6,1262,1976,181500
2,920,866.0,486,608,1786,11250,5,7,920,2001,223500


## Make the pipeline

In [50]:
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectFromModel
from feature_engine.selection import SmartCorrelatedSelection
import pandas as pd
import numpy as np

# Additional imports for evaluation and plotting
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')



In [51]:
# Define numerical and categorical features
num_features = ['OverallCond', 'GrLivArea', 'GarageArea', 'TotalBsmtSF', 'OverallQual', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'LotArea']

# Preprocessor for the pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_features),
    ]
)
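
A small sketch of what this preprocessor does, on toy values rather than the real dataset. By default `ColumnTransformer` uses `remainder='drop'`, so any column that is not listed (here `BsmtFinSF1`, which is also absent from `num_features` above) does not reach the next pipeline step. The toy frame and the name `toy_preprocessor` are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786, 1717],
    "LotArea": [8450, 9600, 11250, 9550],
    "BsmtFinSF1": [706, 978, 486, 216],  # not listed below, so it is dropped
})

toy_preprocessor = ColumnTransformer(
    transformers=[("num", StandardScaler(), ["GrLivArea", "LotArea"])]
)
scaled = toy_preprocessor.fit_transform(toy)

print(scaled.shape)                    # two columns survive
print(scaled.mean(axis=0).round(6))    # each scaled column is centred on zero
```

If the unlisted columns should be kept unchanged, `remainder='passthrough'` is the documented alternative.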


In [52]:
models_quick_search = { 
    'LinearRegression': LinearRegression(), 
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0), 
    "RandomForestRegressor": RandomForestRegressor(random_state=0), 
    "ExtraTreesRegressor": ExtraTreesRegressor(random_state=0), 
    "AdaBoostRegressor": AdaBoostRegressor(random_state=0), 
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0), 
    "XGBRegressor": XGBRegressor(random_state=0)
}

params_quick_search = {
    'LinearRegression': {},
    "DecisionTreeRegressor": {},
    "RandomForestRegressor": {},
    "ExtraTreesRegressor": {},
    "AdaBoostRegressor": {},
    "GradientBoostingRegressor": {},
    "XGBRegressor": {},
}
lasso_params = {'alpha': 2} 


In [53]:
# Define a pipeline: scaling, Lasso-based feature selection, then the given model
def final_pipeline(lasso_params, model):
    the_pipeline_base = Pipeline([
        ('preprocessor', preprocessor),
        ('feature_selection', SelectFromModel(Lasso(**lasso_params, random_state=0))),
        ('model', model)
    ])
    return the_pipeline_base

# Create an example pipeline with a single estimator (not the whole models dict)
pipeline = final_pipeline(lasso_params, models_quick_search['LinearRegression'])

pipeline
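
To illustrate the `feature_selection` step in isolation, here is a hedged sketch on synthetic data (not the house prices): `SelectFromModel` fits the Lasso and keeps only the features whose coefficient is not shrunk to zero. The feature names and the value `alpha=0.5` are assumptions chosen for the toy data.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
# Only the first two columns drive the target; the last two are pure noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

selector = SelectFromModel(Lasso(alpha=0.5, random_state=0)).fit(X, y)
mask = selector.get_support()  # boolean mask, one entry per input feature

feature_names = ["f0", "f1", "noise_a", "noise_b"]
selected = [name for name, keep in zip(feature_names, mask) if keep]
print(selected)
```

How many features a given `alpha` removes depends on the scale of the target, so with `SalePrice` in dollars a small value such as `alpha=2` may well keep every feature.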


## HyperParameter Optimization

In [54]:
from sklearn.model_selection import GridSearchCV


class HyperparameterOptimizationSearch:

    def __init__(self, models, params, lasso_params):
        self.models = models
        self.params = params
        self.lasso_params = lasso_params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")
            model = final_pipeline(self.lasso_params, self.models[key])

            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs,
                              verbose=verbose, scoring=scoring)
            gs.fit(X, y)
            self.grid_searches[key] = gs

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]
                scores.append(r.reshape(len(params), 1))

            all_scores = np.hstack(scores)
            for p, s in zip(params, all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)

        columns = ['estimator', 'min_score',
                   'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns], self.grid_searches
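
The core of `score_summary` is the stacking of the per-fold scores stored in `cv_results_`. The sketch below reproduces just that part on a small synthetic problem; `Ridge` and the two `alpha` values are stand-ins chosen for speed, not models used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=60, n_features=3, noise=5.0, random_state=0)
gs = GridSearchCV(Ridge(), {"alpha": [0.1, 10.0]}, cv=3, scoring="r2").fit(X, y)

n_candidates = len(gs.cv_results_["params"])
# One column per fold, one row per parameter combination
per_split = [
    gs.cv_results_[f"split{i}_test_score"].reshape(n_candidates, 1)
    for i in range(gs.cv)
]
all_scores = np.hstack(per_split)

print(all_scores.shape)
print(np.allclose(all_scores.mean(axis=1), gs.cv_results_["mean_test_score"]))
```

Note that `range(gs.cv)` only works when `cv` is passed as an integer, which is how the class is used throughout this notebook.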


## Split The Data

In [55]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df_cleaned.drop(['SalePrice'], axis=1),
    df_cleaned['SalePrice'],
    test_size=0.2,
    random_state=0
)

print("* Train set:", X_train.shape, y_train.shape,
      "\n* Test set:",  X_test.shape, y_test.shape)


* Train set: (1168, 10) (1168,) 
* Test set: (292, 10) (292,)
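
The shapes above follow directly from `test_size=0.2` applied to 1460 rows. A quick sketch on row indices only (no house data needed) shows the sizes and that a fixed `random_state` gives the same partition on every run:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(1460)  # 1168 + 292 rows, as in the shapes printed above
train_rows, test_rows = train_test_split(rows, test_size=0.2, random_state=0)
print(len(train_rows), len(test_rows))

# Same seed, same split: the partition is reproducible
_, test_again = train_test_split(rows, test_size=0.2, random_state=0)
print((test_rows == test_again).all())
```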


In [56]:
search = HyperparameterOptimizationSearch(models=models_quick_search, params=params_quick_search, lasso_params=lasso_params)
search.fit(X_train, y_train, scoring='r2', n_jobs=-1, cv=5)


Running GridSearchCV for LinearRegression 

Fitting 5 folds for each of 1 candidates, totalling 5 fits

Running GridSearchCV for DecisionTreeRegressor 

Fitting 5 folds for each of 1 candidates, totalling 5 fits

Running GridSearchCV for RandomForestRegressor 

Fitting 5 folds for each of 1 candidates, totalling 5 fits

Running GridSearchCV for ExtraTreesRegressor 

Fitting 5 folds for each of 1 candidates, totalling 5 fits

Running GridSearchCV for AdaBoostRegressor 

Fitting 5 folds for each of 1 candidates, totalling 5 fits

Running GridSearchCV for GradientBoostingRegressor 

Fitting 5 folds for each of 1 candidates, totalling 5 fits

Running GridSearchCV for XGBRegressor 

Fitting 5 folds for each of 1 candidates, totalling 5 fits




In [57]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

Unnamed: 0,estimator,min_score,mean_score,max_score,std_score
2,RandomForestRegressor,0.750737,0.845073,0.883747,0.050575
3,ExtraTreesRegressor,0.712865,0.841069,0.895328,0.067535
5,GradientBoostingRegressor,0.680357,0.824365,0.890755,0.07699
0,LinearRegression,0.733786,0.810433,0.84494,0.040062
6,XGBRegressor,0.672389,0.784432,0.866785,0.090982
4,AdaBoostRegressor,0.683114,0.773687,0.821324,0.047313
1,DecisionTreeRegressor,0.563141,0.711782,0.780324,0.079017


In [58]:
models_search = {
    "RandomForestRegressor": RandomForestRegressor(random_state=0), 
}


params_search = {
    "RandomForestRegressor": { 
        'model__n_estimators': [100, 300], 
        'model__max_depth': [10, 20, None], 
        'model__min_samples_split': [2, 5], 
        'model__min_samples_leaf': [1, 2] 
    }, 
}
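
The size of this grid can be checked before launching the search. The sketch below counts the combinations with scikit-learn's `ParameterGrid`; the `model__` prefix routes each parameter to the pipeline step named `model`. The variable names `rf_grid` and `n_folds` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

rf_grid = {
    "model__n_estimators": [100, 300],
    "model__max_depth": [10, 20, None],
    "model__min_samples_split": [2, 5],
    "model__min_samples_leaf": [1, 2],
}

n_candidates = len(ParameterGrid(rf_grid))  # 2 * 3 * 2 * 2 combinations
n_folds = 5
print(n_candidates, n_candidates * n_folds)
```

With 5-fold cross-validation every extra value in one list multiplies the number of fits, which is worth keeping in mind for slow estimators such as random forests.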


In [59]:
search = HyperparameterOptimizationSearch(models=models_search, params=params_search, lasso_params=lasso_params)
search.fit(X_train, y_train, scoring = 'r2', n_jobs=-1, cv=5)


Running GridSearchCV for RandomForestRegressor 

Fitting 5 folds for each of 24 candidates, totalling 120 fits


KeyboardInterrupt: 

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

In [None]:
best_model = grid_search_summary.iloc[0, 0]
best_model

In [None]:
grid_search_pipelines[best_model].best_params_

In [18]:
best_model = search.grid_searches['RandomForestRegressor'].best_estimator_

In [19]:
# Extract the pipeline components 
preprocessor = best_model.named_steps['preprocessor'] 
feature_selector = best_model.named_steps['feature_selection'] 
final_model = best_model.named_steps['model'] 

# Get the original feature names 
all_features = num_features

In [None]:
selected_mask = feature_selector.get_support() 
selected_features = [feature for feature, selected in zip(all_features, selected_mask) if selected] 

print("Selected Features:", selected_features)

In [None]:
# Visualize Feature Importances (for tree-based models) 
if hasattr(final_model, 'feature_importances_'): 
    df_feature_importance = pd.DataFrame(data={ 
        'Feature': selected_features, 
        'Importance': final_model.feature_importances_ 
    }).sort_values(by='Importance', ascending=False) 

    print(f"* These are the most important features in descending order. The model was trained on them: \n{df_feature_importance['Feature'].to_list()}") 

    df_feature_importance.plot(kind='bar', x='Feature', y='Importance') 
    plt.title('Feature Importance') 
    plt.xlabel('Features') 
    plt.ylabel('Importance') 
    plt.show() 
else: 
    print("Selected Features:", selected_features)

## Look at that. Square one again.

In [None]:
df_cleaned

---

In [None]:
df_cpy = df_cleaned.copy()
df_cpy.head()

In [24]:
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_features),
    ]
)

In [25]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor

def final_pipeline2(model):
    the_pipeline_base = Pipeline([
        ("SmartCorrelatedSelection", SmartCorrelatedSelection(variables=None,
         method="spearman", threshold=0.6, selection_method="variance")),

        ("feat_scaling", StandardScaler()),

        ("feat_selection",  SelectFromModel(model)),

        ("model", model),
    ])
    return the_pipeline_base
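
The idea behind the first step can be sketched with pandas alone: group features whose absolute Spearman correlation exceeds the threshold and keep the one with the largest variance from each group. This is a hand-written approximation for illustration, not the feature_engine implementation, and `drop_correlated` is a hypothetical helper name.

```python
import pandas as pd


def drop_correlated(df, threshold=0.6):
    """Keep one feature (the highest-variance one) per correlated group."""
    corr = df.corr(method="spearman").abs()
    kept, seen = [], set()
    for col in df.columns:
        if col in seen:
            continue
        # Features not yet assigned that are strongly correlated with `col`
        group = [c for c in df.columns
                 if c not in seen and corr.loc[col, c] > threshold]
        seen.update(group)
        kept.append(max(group, key=lambda c: df[c].var()))
    return kept


toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [100, 200, 300, 400, 500, 600],  # same ranking as "a", larger variance
    "c": [2, 6, 1, 5, 3, 4],              # weakly correlated with both
})
kept = drop_correlated(toy)
print(kept)
```

Because the variance criterion is applied before scaling, features measured in large units (square feet, for example) tend to win within a group.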


In [26]:
class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")
            model = final_pipeline2(self.models[key])

            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs, verbose=verbose, scoring=scoring)
            gs.fit(X, y)
            self.grid_searches[key] = gs

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]
                scores.append(r.reshape(len(params), 1))

            all_scores = np.hstack(scores)
            for p, s in zip(params, all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)

        columns = ['estimator', 'min_score', 'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns], self.grid_searches


In [None]:
df_cpy.head()

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df_cpy.drop(['SalePrice'], axis=1),
    df_cpy['SalePrice'],
    test_size=0.2,
    random_state=0
)

print("* Train set:", X_train.shape, y_train.shape,
      "\n* Test set:",  X_test.shape, y_test.shape)


In [None]:
X_train.head()

In [30]:
models_quick_search = {
    "AdaBoostRegressor": AdaBoostRegressor(random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
    "XGBRegressor": XGBRegressor(random_state=0)
}

params_quick_search = {
    "AdaBoostRegressor": {},
    "GradientBoostingRegressor": {},
    "XGBRegressor": {}
}

In [None]:
from sklearn.model_selection import GridSearchCV

def perform_grid_search(model, params, X_train, y_train):
    grid_search = GridSearchCV(estimator=model, param_grid=params, scoring='r2', n_jobs=-1, cv=5)
    grid_search.fit(X_train, y_train)
    return grid_search.best_estimator_, grid_search.best_score_

best_models = {}
best_scores = {}

for name, model in models_quick_search.items():
    print(f"Running GridSearchCV for {name}...")
    best_model, best_score = perform_grid_search(model, params_quick_search[name], X_train, y_train)
    best_models[name] = best_model
    best_scores[name] = best_score

for name, score in best_scores.items():
    print(f"{name}: Best R^2 Score = {score:.4f}")


In [32]:
models_search = {
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
}

params_search = {
    "GradientBoostingRegressor": {
        'model__n_estimators': [100,300],
        'model__learning_rate': [1e-1,1e-2,1e-3], 
        'model__max_depth': [3,10,None],
    }
}

In [None]:
search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train, scoring = 'r2', n_jobs=-1, cv=5)

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary.head(5)

In [None]:
best_model = grid_search_summary.iloc[0, 0]
best_model

In [None]:
grid_search_pipelines[best_model].best_params_

In [None]:
best_regressor_pipeline = grid_search_pipelines[best_model].best_estimator_
best_regressor_pipeline

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

# after data cleaning and feature engineering, the features may have changed
# how many data cleaning and feature engineering steps does your pipeline have?
data_cleaning_feat_eng_steps = 1
columns_after_data_cleaning_feat_eng = (Pipeline(best_regressor_pipeline.steps[:data_cleaning_feat_eng_steps])
                                        .transform(X_train)
                                        .columns)

best_features = columns_after_data_cleaning_feat_eng[best_regressor_pipeline['feat_selection'].get_support(
)].to_list()

# create DataFrame to display feature importance
df_feature_importance = (pd.DataFrame(data={
    'Feature': columns_after_data_cleaning_feat_eng[best_regressor_pipeline['feat_selection'].get_support()],
    'Importance': best_regressor_pipeline['model'].feature_importances_})
    .sort_values(by='Importance', ascending=False)
)

# Most important features statement and plot
print(f"* These are the {len(best_features)} most important features in descending order. "
      f"The model was trained on them: \n{df_feature_importance['Feature'].to_list()}")

df_feature_importance.plot(kind='bar', x='Feature', y='Importance')
plt.show()


In [39]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import numpy as np


def regression_performance(X_train, y_train, X_test, y_test, pipeline):
    print("Model Evaluation \n")
    print("* Train Set")
    regression_evaluation(X_train, y_train, pipeline)
    print("* Test Set")
    regression_evaluation(X_test, y_test, pipeline)


def regression_evaluation(X, y, pipeline):
    prediction = pipeline.predict(X)
    print('R2 Score:', r2_score(y, prediction).round(3))
    print('Mean Absolute Error:', mean_absolute_error(y, prediction).round(3))
    print('Mean Squared Error:', mean_squared_error(y, prediction).round(3))
    print('Root Mean Squared Error:', np.sqrt(
        mean_squared_error(y, prediction)).round(3))
    print("\n")


def regression_evaluation_plots(X_train, y_train, X_test, y_test, pipeline, alpha_scatter=0.5):
    pred_train = pipeline.predict(X_train)
    pred_test = pipeline.predict(X_test)

    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 6))
    sns.scatterplot(x=y_train, y=pred_train, alpha=alpha_scatter, ax=axes[0])
    sns.lineplot(x=y_train, y=y_train, color='red', ax=axes[0])
    axes[0].set_xlabel("Actual")
    axes[0].set_ylabel("Predictions")
    axes[0].set_title("Train Set")

    sns.scatterplot(x=y_test, y=pred_test, alpha=alpha_scatter, ax=axes[1])
    sns.lineplot(x=y_test, y=y_test, color='red', ax=axes[1])
    axes[1].set_xlabel("Actual")
    axes[1].set_ylabel("Predictions")
    axes[1].set_title("Test Set")

    plt.show()
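
For reference, the four metrics printed by `regression_evaluation` can be computed by hand. The sketch below uses four made-up values and checks the manual formulas against scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))          # average absolute error
mse = np.mean(errors ** 2)             # average squared error
rmse = np.sqrt(mse)                    # back in the units of the target
# R2: share of the target's variance explained by the predictions
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mse, round(rmse, 4), round(r2, 4))
```

MAE and RMSE are expressed in the units of the target (dollars for `SalePrice`), while R2 is unitless, which makes it the easier one to compare between train and test sets.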


In [None]:
regression_performance(X_train, y_train, X_test, y_test, best_regressor_pipeline)
regression_evaluation_plots(X_train, y_train, X_test, y_test, best_regressor_pipeline)

In [None]:
best_regressor_pipeline

## Let's save this as well

In [None]:
import joblib

version = 'v3'
file_path = f'/workspace/PP5-ML/outputs/ml_pipeline/predict_SalePrice/{version}'

try:
  os.makedirs(name=file_path)
except Exception as e:
  print(e)

In [None]:
joblib.dump(value=best_regressor_pipeline, filename=f"{file_path}/best_regressor_pipeline.pkl")
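
A saved pipeline is only useful if it reloads and predicts the same values. The following sketch does a round trip with a toy pipeline in a temporary folder; the folder name `v_demo` and the file name `toy_pipeline.pkl` are placeholders, not paths used by this project.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, "ml_pipeline", "v_demo")
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    path = os.path.join(folder, "toy_pipeline.pkl")
    joblib.dump(value=toy_pipeline, filename=path)
    reloaded = joblib.load(path)

same_predictions = np.allclose(reloaded.predict(X), toy_pipeline.predict(X))
print(same_predictions)
```

Passing `exist_ok=True` to `os.makedirs` is an alternative to the `try`/`except` used above when the version folder may already exist.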

In [None]:
X_train.head()

--------------

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA

# Define numerical and categorical features
num_features = ['1stFlrSF', '2ndFlrSF', 'BsmtFinSF1', 'GarageArea', 'GrLivArea', 'LotArea', 'TotalBsmtSF', 'YearBuilt', 'OverallCond', 'OverallQual']
cat_features = ['KitchenQual_encoded',]

# Update the preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_features),
        ('cat', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), cat_features)
    ]
)

# Create a pipeline with preprocessing and PCA steps
pipeline_pca = Pipeline([
    ('preprocessor', preprocessor),
    ('pca', PCA(n_components=5))
    
])

# Fit and transform the data using the pipeline without dropping 'YearBuilt'
df_pca = pipeline_pca.fit_transform(df_cpy)

print(df_pca.shape, '\n', type(df_pca))


In [None]:
import numpy as np
from sklearn.decomposition import PCA

n_components = 5


def pca_components_analysis(df_pca, n_components):
    pca = PCA(n_components=n_components).fit(df_pca)
    x_PCA = pca.transform(df_pca)  # array with transformed PCA

    ComponentsList = ["Component " + str(number)
                      for number in range(n_components)]
    dfExplVarRatio = pd.DataFrame(
        data=np.round(100 * pca.explained_variance_ratio_, 3),
        index=ComponentsList,
        columns=['Explained Variance Ratio (%)'])

    dfExplVarRatio['Accumulated Variance'] = dfExplVarRatio['Explained Variance Ratio (%)'].cumsum(
    )

    PercentageOfDataExplained = dfExplVarRatio['Explained Variance Ratio (%)'].sum(
    )

    print(
        f"* The {n_components} components explain {round(PercentageOfDataExplained,2)}% of the data \n")
    plt.figure(figsize=(12, 5))
    sns.lineplot(data=dfExplVarRatio,  marker="o")
    plt.xticks(rotation=90)
    plt.yticks(np.arange(0, 110, 10))
    plt.show()


pca_components_analysis(df_pca=df_pca, n_components=n_components)
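
When reading the plot, keep in mind that the percentages are relative to whatever array is passed to `PCA`. The sketch below, on synthetic data, shows that keeping as many components as there are input columns always reports 100%, while keeping fewer reports less. Since `df_pca` here comes out of a pipeline that already ends in `PCA(n_components=5)`, the 5-component figure may describe that reduced array rather than the original features.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Four columns with deliberately different spreads
data = rng.normal(size=(100, 4)) * np.array([5.0, 2.0, 1.0, 0.5])

full = PCA(n_components=4).fit(data).explained_variance_ratio_
partial = PCA(n_components=2).fit(data).explained_variance_ratio_

print(np.round(100 * np.cumsum(full), 1))  # accumulated variance in percent
print(round(100 * partial.sum(), 1))
```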


In [None]:
n_components = 3
pca_components_analysis(df_pca=df_pca, n_components=n_components)

In [87]:
# PCA
from sklearn.decomposition import PCA


def PipelineOptimization(model):
    pipeline_base = Pipeline([

        ("SmartCorrelatedSelection", SmartCorrelatedSelection(variables=None,
         method="spearman", threshold=0.6, selection_method="variance")),


        ("feat_scaling", StandardScaler()),

        # PCA replaces feature selection
        ("PCA", PCA(n_components=3, random_state=0)),

        ("model", model),

    ])

    return pipeline_base
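
One consequence of this design is that the model no longer sees the original columns, only the principal components. A hedged sketch with scikit-learn steps only (the correlation filter is left out, and `LinearRegression` on synthetic data is a stand-in) makes that visible:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 6))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 2.0, 0.0]) + rng.normal(scale=0.1, size=80)

pca_pipeline = Pipeline([
    ("feat_scaling", StandardScaler()),
    ("PCA", PCA(n_components=3, random_state=0)),
    ("model", LinearRegression()),
]).fit(X, y)

# Everything before the final step, i.e. what the model receives as input
model_input = pca_pipeline[:-1].transform(X)
print(model_input.shape)
print(pca_pipeline["model"].coef_.shape)
```

Any coefficients or feature importances read from the final step therefore refer to components, not to named house attributes.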


In [None]:
print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

In [89]:
class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")
            model = PipelineOptimization(self.models[key])

            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs, verbose=verbose, scoring=scoring)
            gs.fit(X, y)
            self.grid_searches[key] = gs

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]
                scores.append(r.reshape(len(params), 1))

            all_scores = np.hstack(scores)
            for p, s in zip(params, all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)

        columns = ['estimator', 'min_score', 'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns], self.grid_searches


## Use the default hyperparameters

In [90]:
models_quick_search = {
    'LinearRegression': LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "RandomForestRegressor": RandomForestRegressor(random_state=0),
    "ExtraTreesRegressor": ExtraTreesRegressor(random_state=0),
    "AdaBoostRegressor": AdaBoostRegressor(random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
    "XGBRegressor": XGBRegressor(random_state=0),
}

params_quick_search = {
    'LinearRegression': {},
    "DecisionTreeRegressor": {},
    "RandomForestRegressor": {},
    "ExtraTreesRegressor": {},
    "AdaBoostRegressor": {},
    "GradientBoostingRegressor": {},
    "XGBRegressor": {},
}


In [None]:
quick_search = HyperparameterOptimizationSearch(models=models_quick_search, params=params_quick_search)
quick_search.fit(X_train, y_train, scoring='r2', n_jobs=-1, cv=5)

In [None]:
regression_performance(X_train, y_train, X_test, y_test,best_regressor_pipeline)
regression_evaluation_plots(X_train, y_train, X_test, y_test,
                            best_regressor_pipeline)

Convert Regression to Classification?

In [None]:
import pandas as pd
df_cleaned = (pd.read_csv('/workspace/PP5-ML/outputs/datasets/cleaned/CleanedDataset.csv'))
df_cleaned.head(3)

In [None]:
df_cpy = df_cleaned.copy()
df_cpy.head(3)

In [None]:
# Create the new column 'HealthCheck' as the sum of 'OverallCond' and 'OverallQual'
df_cpy['HealthCheck'] = df_cpy['OverallCond'] + df_cpy['OverallQual'] 

# Drop 'OverallCond' and 'OverallQual' (now combined in 'HealthCheck') and the other columns not used for classification
df_cpy.drop(columns=['OverallCond', 'OverallQual', '1stFlrSF', '2ndFlrSF', 'KitchenQual_encoded', 'BsmtFinSF1'], inplace=True)

# Verify the updated DataFrame
print(df_cpy.head())

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from feature_engine.discretisation import EqualFrequencyDiscretiser

disc = EqualFrequencyDiscretiser(q=5, variables=['GrLivArea', 'LotArea', 'HealthCheck', 'TotalBsmtSF', 'GarageArea', 'YearBuilt', 'SalePrice'])
df_clf = disc.fit_transform(df_cpy)

print(f"* The classes represent the following ranges: \n{disc.binner_dict_} \n")

print(df_clf.head())
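
Equal-frequency discretisation puts roughly the same number of rows in each class. As a sketch of the idea, `pandas.qcut` is used below as a stand-in for the discretiser on ten made-up prices, split into 5 classes:

```python
import pandas as pd

prices = pd.Series([100, 120, 150, 180, 200, 230, 260, 300, 400, 750])

# labels=False returns the class index (0 = lowest fifth, 4 = highest fifth)
bins, edges = pd.qcut(prices, q=5, labels=False, retbins=True)

print(bins.tolist())
print(edges)  # 6 edges delimit the 5 classes
```

The classes have equal counts but unequal widths: the top class spans a much wider price range than the bottom one, which matters when a predicted class is translated back into a price.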


In [None]:
# Create subplots to show multiple graphs
fig, axes = plt.subplots(nrows=4, ncols=3, figsize=(18, 12))

# List of variables to plot
variables = ['GrLivArea', 'YearBuilt', 'TotalBsmtSF', 'LotArea', 'GarageArea', 'HealthCheck', 'SalePrice', 'totalFlrSF']

# Plot each variable in a subplot
for i, variable in enumerate(variables):
    row = i // 3
    col = i % 3
    sns.countplot(data=df_clf, x=variable, ax=axes[row, col])
    axes[row, col].set_title(variable)

# Adjust layout for better spacing
plt.tight_layout()
plt.show()


In [None]:
df_clf

In [None]:
print(df_clf.columns)

## Re-Write the pipeline again, again, again, again

In [358]:
def final_pipeline3(model):
    the_pipeline_base = Pipeline([
        ("SmartCorrelatedSelection", SmartCorrelatedSelection(variables=None,
         method="spearman", threshold=0.6, selection_method="variance")),

        ("feat_scaling", StandardScaler()),

        ("feat_selection",  SelectFromModel(model)),

        ("model", model),
    ])
    return the_pipeline_base


In [359]:
from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier 
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
from feature_engine.selection import SmartCorrelatedSelection
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
import numpy as np
import pandas as pd

In [360]:
class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")
            model = final_pipeline3(self.models[key])

            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs, verbose=verbose, scoring=scoring)
            gs.fit(X, y)
            self.grid_searches[key] = gs

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]
                scores.append(r.reshape(len(params), 1))

            all_scores = np.hstack(scores)
            for p, s in zip(params, all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)

        columns = ['estimator', 'min_score', 'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns], self.grid_searches

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df_clf.drop(['SalePrice'], axis=1),
    df_clf['SalePrice'],
    test_size=0.2,
    random_state=0
)

print("* Train set:", X_train.shape, y_train.shape,
      "\n* Test set:",  X_test.shape, y_test.shape)


In [362]:
models_quick_search = {
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
    "ExtraTreesClassifier": ExtraTreesClassifier(random_state=0),
    "AdaBoostClassifier": AdaBoostClassifier(random_state=0),
}

params_quick_search = {
    "DecisionTreeClassifier":{},
    "RandomForestClassifier":{},
    "GradientBoostingClassifier":{},
    "ExtraTreesClassifier":{},
    "AdaBoostClassifier":{},
}

In [None]:
from sklearn.metrics import make_scorer, recall_score


quick_search = HyperparameterOptimizationSearch(models=models_quick_search, params=params_quick_search)
quick_search.fit(X_train, y_train,
                 scoring=make_scorer(recall_score, labels=[0], average=None),
                 n_jobs=-1, cv=3)

df_summary, grid_searches = quick_search.score_summary()
print(df_summary)
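
The scorer used above measures recall on class 0 only, that is, the share of houses truly in the lowest price class that the model also places there. A small sketch with made-up labels shows what `recall_score` returns with these arguments:

```python
import numpy as np
from sklearn.metrics import recall_score

y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2])

# Three samples are truly class 0 and two of them are predicted as class 0
recall_class_0 = recall_score(y_true, y_pred, labels=[0], average=None)
print(recall_class_0)
```

Errors on the other classes do not affect this score, so a model can rank first here while still confusing the higher price classes.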

---

In [None]:
grid_search_summary, grid_search_pipelines = quick_search.score_summary(sort_by='mean_score')
grid_search_summary

In [370]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Define your models
base_estimator = DecisionTreeClassifier(random_state=0)
models_search = {
    "AdaBoostClassifier": AdaBoostClassifier(base_estimator=base_estimator, random_state=0),
}

# Define your parameter grid
params_search = {
    "AdaBoostClassifier": {
        "model__n_estimators": [50, 100],  
        "model__learning_rate": [0.01, 0.1],  
        "model__algorithm": ["SAMME.R"], 
        "model__base_estimator__max_depth": [None, 5],  
        "model__base_estimator__min_samples_split": [2, 10],  
        "model__base_estimator__min_samples_leaf": [1, 2],  
        "model__base_estimator__max_features": ["sqrt"], 
        "model__base_estimator__min_impurity_decrease": [0.0], 
    }
}



In [None]:
from sklearn.metrics import make_scorer, recall_score

# Ensure HyperparameterOptimizationSearch is updated to use the keys correctly
search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train,
           scoring=make_scorer(recall_score, labels=[0], average=None),
           n_jobs=-1, cv=3)


In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary.head(5)

In [None]:
best_model = grid_search_summary.iloc[0,0]
best_model

In [None]:
best_parameters = grid_search_pipelines[best_model].best_params_
best_parameters

In [None]:
pipeline_clf = grid_search_pipelines[best_model].best_estimator_
pipeline_clf

In [None]:
from sklearn.pipeline import Pipeline

# Number of pipeline steps that run before feature selection
data_cleaning_feat_eng_steps = 1
columns_after_data_cleaning_feat_eng = (Pipeline(pipeline_clf.steps[:data_cleaning_feat_eng_steps])
                                        .transform(X_train)
                                        .columns)

# Boolean mask of the columns kept by the feature selection step
selected_mask = pipeline_clf['feat_selection'].get_support()

# Pair each selected feature with the importance the fitted model assigns to it
df_feature_importance = (pd.DataFrame(data={
    'Feature': columns_after_data_cleaning_feat_eng[selected_mask],
    'Importance': pipeline_clf['model'].feature_importances_})
    .sort_values(by='Importance', ascending=False)
)

best_features = df_feature_importance['Feature'].to_list()

print(f"* These are the {len(best_features)} most important features in descending order. "
      f"The model was trained on them: \n{best_features}")

df_feature_importance.plot(kind='bar', x='Feature', y='Importance')
plt.show()


In [377]:
from sklearn.metrics import classification_report, confusion_matrix


def confusion_matrix_and_report(X, y, pipeline, label_map):

    prediction = pipeline.predict(X)

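    print('---  Confusion Matrix  ---')
    # Predictions are passed as y_true on purpose, so that rows hold
    # predictions and columns hold actual values, matching the labels below
    print(pd.DataFrame(confusion_matrix(y_true=prediction, y_pred=y),
          columns=[["Actual " + sub for sub in label_map]],
          index=[["Prediction " + sub for sub in label_map]]
          ))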
    print("\n")

    print('---  Classification Report  ---')
    print(classification_report(y, prediction, target_names=label_map), "\n")


def clf_performance(X_train, y_train, X_test, y_test, pipeline, label_map):
    print("#### Train Set #### \n")
    confusion_matrix_and_report(X_train, y_train, pipeline, label_map)

    print("#### Test Set ####\n")
    confusion_matrix_and_report(X_test, y_test, pipeline, label_map)
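
`confusion_matrix` places its first argument on the rows and its second on the columns. A small sketch with made-up labels compares the conventional call with the argument order used in `confusion_matrix_and_report`; the demo variable names are illustrative:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels, for illustration only
actual_demo = [0, 0, 1, 1]
predicted_demo = [0, 1, 1, 1]

# Conventional call: rows are actual classes, columns are predicted classes
conventional = confusion_matrix(actual_demo, predicted_demo)

# Argument order used in this notebook: rows are predictions, columns are actuals
notebook_style = confusion_matrix(y_true=predicted_demo, y_pred=actual_demo)

print(conventional)
print(notebook_style)
```

The two matrices are transposes of each other, so the counts are the same and only the reading direction changes.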


In [378]:
label_map = ['Class 0', 'Class 1', 'Class 2', 'Class 3', 'Class 4']

In [None]:
clf_performance(X_train=X_train, y_train=y_train,
                X_test=X_test, y_test=y_test,
                pipeline=pipeline_clf,
                label_map=label_map)

In [380]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


def regression_performance(X_train, y_train, X_test, y_test, pipeline):
    """Print regression metrics for the train and test sets."""
    print("Model Evaluation \n")
    print("* Train Set")
    regression_evaluation(X_train, y_train, pipeline)
    print("* Test Set")
    regression_evaluation(X_test, y_test, pipeline)


def regression_evaluation(X, y, pipeline):
    """Print R2, MAE, MSE and RMSE for the pipeline's predictions on X."""
    prediction = pipeline.predict(X)
    print('R2 Score:', round(r2_score(y, prediction), 3))
    print('Mean Absolute Error:', round(mean_absolute_error(y, prediction), 3))
    print('Mean Squared Error:', round(mean_squared_error(y, prediction), 3))
    print('Root Mean Squared Error:', round(
        np.sqrt(mean_squared_error(y, prediction)), 3))
    print("\n")


def regression_evaluation_plots(X_train, y_train, X_test, y_test, pipeline, alpha_scatter=0.5):
    """Plot actual vs predicted values for the train and test sets."""
    pred_train = pipeline.predict(X_train)
    pred_test = pipeline.predict(X_test)

    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 6))
    sns.scatterplot(x=y_train, y=pred_train, alpha=alpha_scatter, ax=axes[0])
    sns.lineplot(x=y_train, y=y_train, color='red', ax=axes[0])
    axes[0].set_xlabel("Actual")
    axes[0].set_ylabel("Predictions")
    axes[0].set_title("Train Set")

    sns.scatterplot(x=y_test, y=pred_test, alpha=alpha_scatter, ax=axes[1])
    sns.lineplot(x=y_test, y=y_test, color='red', ax=axes[1])
    axes[1].set_xlabel("Actual")
    axes[1].set_ylabel("Predictions")
    axes[1].set_title("Test Set")

    plt.show()


In [None]:
regression_performance(X_train, y_train, X_test, y_test, pipeline_clf)
regression_evaluation_plots(X_train, y_train, X_test, y_test, pipeline_clf)
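
The four numbers printed by `regression_evaluation` are related by simple formulas. A hand-computed sketch on made-up values (not project data) shows how each one is derived; all names here are illustrative:

```python
import math

# Made-up values, for illustration only
actual_demo = [1.0, 2.0, 3.0, 4.0]
predicted_demo = [1.0, 2.0, 3.0, 5.0]

errors = [a - p for a, p in zip(actual_demo, predicted_demo)]

mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error
mse = sum(e ** 2 for e in errors) / len(errors)   # mean squared error
rmse = math.sqrt(mse)                             # root mean squared error

# R2 compares the model's squared error with that of always predicting the mean
mean_actual = sum(actual_demo) / len(actual_demo)
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual_demo)
r2 = 1 - ss_res / ss_tot

print(mae, mse, rmse, r2)
```

If `pipeline_clf` predicts discrete class labels, as the classifier names suggest, these metrics are computed on the label values themselves and should be read with that in mind.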

----------

# We found a model that seems to perform well enough across several evaluation metrics. Time to save it.

In [None]:
pipeline_clf

In [335]:
import joblib

version = 'v2'
file_path = f'outputs/ml_pipeline/predict_SalePrice/{version}'

# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs(file_path, exist_ok=True)

In [None]:
X_train.head()

In [337]:
X_train.to_csv(f"{file_path}/X_train.csv", index=False)

In [None]:
y_train.head()

In [339]:
y_train.to_csv(f"{file_path}/y_train.csv", index=False)

In [None]:
X_test.head()

In [341]:
X_test.to_csv(f"{file_path}/X_test.csv", index=False)

In [None]:
y_test.head()

In [343]:
y_test.to_csv(f"{file_path}/y_test.csv", index=False)

In [None]:
pipeline_clf

In [None]:
joblib.dump(value=pipeline_clf, filename=f"{file_path}/clf_pipeline.pkl")

In [None]:
label_map

In [None]:
joblib.dump(value=label_map, filename=f"{file_path}/label_map.pkl")
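
To check that a saved artefact can be restored, `joblib.load` reads the file back. A self-contained sketch that round-trips a list through a temporary folder rather than touching the project's `outputs/` directory (the file name and demo list are illustrative):

```python
import os
import tempfile

import joblib

# Stand-in for the object being saved
label_map_demo = ['Class 0', 'Class 1', 'Class 2']

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "label_map_demo.pkl")
    joblib.dump(value=label_map_demo, filename=demo_path)

    # Read the file back while the temporary folder still exists
    restored = joblib.load(demo_path)

print(restored == label_map_demo)
```

The same call on `f"{file_path}/clf_pipeline.pkl"` should restore the fitted pipeline, provided the environment has a compatible scikit-learn version installed.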

In [None]:
df_feature_importance.plot(kind='bar', x='Feature', y='Importance')
plt.show()

In [None]:
df_feature_importance.plot(kind='bar', x='Feature', y='Importance')
plt.savefig(f'{file_path}/features_importance.png', bbox_inches='tight')

---

NOTE

* You may add as many sections as you want, as long as they support your project workflow.
* All notebook cells should run top-down: no step should require going back to an earlier cell, for example to refresh a variable's content.

---