# Regression 

## Objectives


* Fit and assess the performance of a regression model aimed at predicting SalePrice levels in a housing market dataset.

**Steps to achieve the objective:**

* **Data Preparation**
* **Feature Engineering**
* **Model Selection**
* **Model Training**
* **Model Evaluation**
* **Fine-Tuning**
* **Prediction**
* **Evaluation**

By following these steps, we aim to develop a robust regression model that accurately predicts SalePrice levels, providing valuable insights for stakeholders in the housing market.

### Inputs

* outputs/datasets/collection/house-price-2021.csv
* Instructions on which variables to use for data cleaning and feature engineering.

### Outputs

* Train set (features and target)
* Test set (features and target)
* ML pipeline to predict SalePrice
* Labels map
* Feature importance plot

---
## Change working directory

We need to change the working directory from its current folder to its parent folder

* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory

* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
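
The directory change above can be tried out in isolation. The following is a minimal sketch that repeats the same os.getcwd() / os.path.dirname() / os.chdir() sequence inside a temporary folder. The sub-folder name jupyter_notebooks is an assumption made for illustration, not a path taken from this project.

```python
import os
import tempfile

original_dir = os.getcwd()

with tempfile.TemporaryDirectory() as project_root:
    # hypothetical sub-folder standing in for the folder that holds the notebook
    notebook_dir = os.path.join(project_root, "jupyter_notebooks")
    os.makedirs(notebook_dir)

    os.chdir(notebook_dir)                   # start inside the sub-folder
    os.chdir(os.path.dirname(os.getcwd()))   # move up to its parent
    new_dir = os.getcwd()

    os.chdir(original_dir)                   # restore the real working directory

# the folder we ended up in is the parent we created above
print(os.path.basename(new_dir) == os.path.basename(project_root))
```

Restoring the original directory at the end keeps the sketch from affecting the cells that follow.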

## Load Data

In [None]:
import pandas as pd 
import seaborn as sns
import numpy as np

df_reg = (pd.read_csv("/workspace/housing-market-analysis/outputs/datasets/collection/house-price-2021.csv")
     .drop(labels=['WoodDeckSF', 'EnclosedPorch'], axis=1))
df_reg['GarageYrBlt'] = df_reg['GarageYrBlt'].astype(int)
print(df_reg.shape)
df_reg.head()

In [None]:
from sklearn.preprocessing import OrdinalEncoder

# Select categorical columns
categorical_columns = ['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual']

# Create an instance of OrdinalEncoder
ordinal_encoder = OrdinalEncoder()

# Apply Ordinal Encoding to the selected categorical columns
df_reg[categorical_columns] = ordinal_encoder.fit_transform(df_reg[categorical_columns])
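
With its default settings, scikit-learn's OrdinalEncoder numbers the categories in sorted (alphabetical) order, which may not match the natural ranking of quality grades. The sketch below, on a toy column, shows how an explicit order could be passed through the categories parameter. The grade order Fa < TA < Gd < Ex and the toy values are assumptions for illustration and should be checked against the real data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# toy column; the grades and their ranking are assumed, not read from the dataset
toy = pd.DataFrame({"KitchenQual": ["TA", "Gd", "Ex", "Fa", "Gd"]})

# default behaviour: categories are sorted alphabetically (Ex, Fa, Gd, TA)
default_codes = OrdinalEncoder().fit_transform(toy[["KitchenQual"]]).ravel()

# explicit ranking: one list of categories per encoded column, lowest grade first
quality_order = [["Fa", "TA", "Gd", "Ex"]]
ordered_codes = (OrdinalEncoder(categories=quality_order)
                 .fit_transform(toy[["KitchenQual"]])
                 .ravel())

print(default_codes.tolist())
print(ordered_codes.tolist())
```

Tree-based models are fairly insensitive to this choice, while linear models interpret the codes as magnitudes, so the ordering can matter there.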


### **Split Train Test Set**

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
                                    df_reg.drop(['SalePrice'], axis=1),
                                    df_reg['SalePrice'],
                                    test_size=0.2,
                                    random_state=101
)

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:", X_test.shape, y_test.shape )

## **Model, Pipeline: Regressor**

**Create a ML pipeline for Data Cleaning and Feature Engineering**

In [None]:
from sklearn.pipeline import Pipeline

# Feature Engineering
from feature_engine.encoding import OrdinalEncoder
from feature_engine.selection import SmartCorrelatedSelection
from feature_engine.imputation import MeanMedianImputer

# Feat Scaling
from sklearn.preprocessing import StandardScaler

# Feat Selection
from sklearn.feature_selection import SelectFromModel

# ML algorithms
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import ExtraTreesRegressor
from xgboost import XGBRegressor


def PipelineOptimization(model):
    pipeline_base = Pipeline([
        ('median', MeanMedianImputer()),

        ("SmartCorrelatedSelection", SmartCorrelatedSelection(variables=None,
         method="spearman", threshold=0.6, selection_method="variance")),

        ("feat_scaling", StandardScaler()),

        ("feat_selection",  SelectFromModel(model)),

        ("model", model),

    ])

    return pipeline_base
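
To see what the feat_selection step does, here is a minimal sketch using scikit-learn only, on synthetic data. It leaves out the feature_engine steps, and the dataset, sample sizes and n_estimators value are assumptions chosen to keep it fast. The same estimator is passed to SelectFromModel and to the final step, as in the function above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data: 10 features, only 3 of which drive the target
X_demo, y_demo = make_regression(n_samples=200, n_features=10, n_informative=3,
                                 noise=5.0, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0)
demo_pipeline = Pipeline([
    ("feat_scaling", StandardScaler()),
    ("feat_selection", SelectFromModel(model)),  # fits its own clone of the model
    ("model", model),                            # trained only on the kept features
])
demo_pipeline.fit(X_demo, y_demo)

# boolean mask over the input features: True means the feature was kept
support = demo_pipeline["feat_selection"].get_support()
print("features kept:", int(support.sum()), "of", support.size)
print("features seen by the final model:", demo_pipeline["model"].n_features_in_)
```

By default SelectFromModel keeps the features whose importance is at least the mean importance, so the final model sees a reduced set of columns.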

**ML Pipeline for Modelling and Hyperparameter Optimisation**

In [None]:
from sklearn.model_selection import GridSearchCV


class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")
            model = PipelineOptimization(self.models[key])

            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs,
                              verbose=verbose, scoring=scoring)
            gs.fit(X, y)
            self.grid_searches[key] = gs

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]
                scores.append(r.reshape(len(params), 1))

            all_scores = np.hstack(scores)
            for p, s in zip(params, all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)

        columns = ['estimator', 'min_score',
                   'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns], self.grid_searches
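
The score_summary method relies on the per-fold entries of cv_results_. A minimal sketch of that structure, using Ridge on synthetic data as a stand-in (neither is part of this notebook), could look like this:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0,
                                 random_state=0)

n_splits = 3
gs = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=n_splits, scoring="r2")
gs.fit(X_demo, y_demo)

# one column per fold, one row per hyperparameter combination
fold_scores = np.column_stack(
    [gs.cv_results_[f"split{i}_test_score"] for i in range(n_splits)]
)

for params, scores in zip(gs.cv_results_["params"], fold_scores):
    print(params,
          "min:", round(float(scores.min()), 3),
          "mean:", round(float(scores.mean()), 3),
          "max:", round(float(scores.max()), 3))
```

The row-wise mean of these fold scores is the same quantity that GridSearchCV reports as mean_test_score.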

### **Grid Search CV - Sklearn** 

**Use default hyperparameters to find most suitable algorithm**

In [None]:
models_search = {
    'LinearRegression': LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=101),
    "RandomForestRegressor": RandomForestRegressor(random_state=101),
    "ExtraTreesRegressor": ExtraTreesRegressor(random_state=101),
    "AdaBoostRegressor": AdaBoostRegressor(random_state=101),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=101),
    "XGBRegressor": XGBRegressor(random_state=101)
}

params_search = {
    'LinearRegression': {},
    "DecisionTreeRegressor": {},
    "RandomForestRegressor": {},
    "ExtraTreesRegressor": {},
    "AdaBoostRegressor": {},
    "GradientBoostingRegressor": {},
    "XGBRegressor": {}
}

Do a hyperparameter optimisation search using default hyperparameters

In [None]:
search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train,
           scoring='r2',
           n_jobs=-1,  # use all available processors
           cv=2)

Check hyperparameter optimisation search results

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

**In which algorithms would you spend time doing an extensive hyperparameter search?**

* The three top-scoring algorithms are worth a closer look.
* I would certainly select ExtraTreesRegressor, and would give a second chance to RandomForestRegressor and AdaBoostRegressor, since their performance was not far from that of ExtraTreesRegressor.
* I wouldn't give a second chance to the remaining algorithms, as they score well below the top three.

**Let's define the new hyperparameters for the extensive search**

In [None]:
models_search = {
    "ExtraTreesRegressor": ExtraTreesRegressor(random_state=101),
    "RandomForestRegressor": RandomForestRegressor(random_state=101),
    "AdaBoostRegressor": AdaBoostRegressor(random_state=101),
}

params_search = {
    "ExtraTreesRegressor": {"model__n_estimators": [100, 150, 200]},
    "RandomForestRegressor": {"model__n_estimators": [100, 150, 170]},
    "AdaBoostRegressor": {"model__n_estimators": [50, 100, 120]},
}
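
The keys in params_search follow scikit-learn's nested parameter naming: the pipeline step name, a double underscore, then the parameter name. A minimal sketch with a two-step pipeline (shorter than the one used in this notebook) shows the convention:

```python
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = Pipeline([
    ("feat_scaling", StandardScaler()),
    ("model", ExtraTreesRegressor(random_state=101)),
])

# "model__n_estimators" means: parameter n_estimators of the step named "model"
demo_pipeline.set_params(model__n_estimators=150)

print(demo_pipeline["model"].n_estimators)
print("model__n_estimators" in demo_pipeline.get_params())
```

GridSearchCV applies each candidate value in the same way, which is why the prefix must match the step name given in the pipeline.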

Now I fit again using the HyperparameterOptimizationSearch class with the updated models_search and params_search.

* The objective is to conduct a more extensive search on the algorithms that performed best with default hyperparameters.

In [None]:
search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train,
           scoring='r2',
           n_jobs=-1,
           cv=2)

**Check the results summary with .score_summary**

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

Grab the best model name by using .iloc[0, 0] to take the first row and first column of the summary DataFrame above.

In [None]:
best_model = grid_search_summary.iloc[0,0]
best_model

Now I'll get the best model parameters

In [None]:
grid_search_pipelines[best_model].best_params_

The variable grid_search_pipelines holds all the trained pipelines. Start by selecting the pipelines from the model that performed the best (with best_model), then use .best_estimator_ to get the pipeline that has the model and hyperparameter settings that work best with the data.

In [None]:
best_pipeline = grid_search_pipelines[best_model].best_estimator_
best_pipeline

### **Data cleaning and feature engineering**

Here we transform the training data with the data cleaning and feature engineering steps of best_pipeline, extract the resulting columns, identify the features kept by the feature selection step, and display their importance in a bar plot.

In [None]:
import matplotlib.pyplot as plt
# The first 2 pipeline steps (median imputer and SmartCorrelatedSelection) decide which columns remain;
# the scaler that follows them does not add or remove columns

data_cleaning_feat_eng_steps = 2
# we get these steps with .steps[] starting from 0 until the value we assigned above
# then we .transform() to the train set and extract the columns
columns_after_data_cleaning_feat_eng = (Pipeline(best_pipeline.steps[:data_cleaning_feat_eng_steps])
                                        .transform(X_train)
                                        .columns)

# we get the boolean list indicating the best features with best_pipeline['feat_selection'].get_support()
# and use this list to subset columns_after_data_cleaning_feat_eng
best_features = columns_after_data_cleaning_feat_eng[best_pipeline['feat_selection'].get_support()].to_list()


# create DataFrame to display feature importance
df_feature_importance = (pd.DataFrame(data={
          'Feature': best_features,
          'Importance': best_pipeline['model'].feature_importances_})
  .sort_values(by='Importance', ascending=False)
  )

# Most important features statement and plot
print(f"* These are the {len(best_features)} most important features in descending order."
      f" The model was trained on them: \n{df_feature_importance['Feature'].to_list()}")


df_feature_importance.plot(kind='bar', x='Feature', y='Importance')
plt.show()

After fitting the pipeline with hyperparameter optimization, we evaluate its performance on unseen data (the test set) using metrics such as **mean absolute error**, **mean squared error**, **root mean squared error** and **R-squared (R2)**, to observe how well the model generalizes. To assist with this evaluation, we define custom functions designed for assessing regression models.

In [None]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error 
import numpy as np

def regression_performance(X_train, y_train, X_test, y_test, pipeline):
    # report the same metrics on the train and test sets
    print("Model Evaluation \n")
    print("* Train Set")
    regression_evaluation(X_train, y_train, pipeline)
    print("* Test Set")
    regression_evaluation(X_test, y_test, pipeline)


def regression_evaluation(X, y, pipeline):
    prediction = pipeline.predict(X)
    mse = mean_squared_error(y, prediction)
    # built-in round() works whether the metric returns a Python float or a NumPy scalar
    print('R2 Score:', round(r2_score(y, prediction), 3))
    print('Mean Absolute Error:', round(mean_absolute_error(y, prediction), 3))
    print('Mean Squared Error:', round(mse, 3))
    print('Root Mean Squared Error:', round(np.sqrt(mse), 3))
    print("\n")

  

def regression_evaluation_plots(X_train, y_train, X_test, y_test, pipeline, alpha_scatter=0.5):
  pred_train = pipeline.predict(X_train)
  pred_test = pipeline.predict(X_test)


  fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12,6))
  sns.scatterplot(x=y_train , y=pred_train, alpha=alpha_scatter, ax=axes[0])
  sns.lineplot(x=y_train , y=y_train, color='red', ax=axes[0])
  axes[0].set_xlabel("Actual")
  axes[0].set_ylabel("Predictions")
  axes[0].set_title("Train Set")

  sns.scatterplot(x=y_test, y=pred_test, alpha=alpha_scatter, ax=axes[1])
  sns.lineplot(x=y_test , y=y_test, color='red', ax=axes[1])
  axes[1].set_xlabel("Actual")
  axes[1].set_ylabel("Predictions")
  axes[1].set_title("Test Set")
  plt.show()
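
The metrics printed by regression_evaluation can be reproduced by hand, which helps when interpreting them. The sketch below uses four made-up values (not taken from the housing data) and compares the manual formulas with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# made-up actual values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

residuals = y_true - y_pred
mae = np.mean(np.abs(residuals))       # average absolute error
mse = np.mean(residuals ** 2)          # average squared error
rmse = np.sqrt(mse)                    # same unit as the target
# share of the target's variance that the predictions account for
r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print("MAE :", mae, "| sklearn:", mean_absolute_error(y_true, y_pred))
print("MSE :", mse, "| sklearn:", mean_squared_error(y_true, y_pred))
print("RMSE:", round(float(rmse), 3))
print("R2  :", r2, "| sklearn:", r2_score(y_true, y_pred))
```

MAE and RMSE are expressed in the unit of the target, so for SalePrice they read directly as a price error, while R2 is unitless.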

The model fits the training set almost perfectly, achieving a perfect R2 score and minimal errors. On the test set the R2 score is still reasonably good at 0.742, but the mean absolute error, mean squared error and root mean squared error are considerably larger. This gap between train and test performance indicates that the model is overfitting to some extent and could benefit from further optimization or regularization to improve its performance on unseen data.

In [None]:
regression_performance(X_train, y_train, X_test, y_test,best_pipeline)
regression_evaluation_plots(X_train, y_train, X_test, y_test, 
                            best_pipeline, alpha_scatter=0.5)

## **Regressor with PCA**

Let's explore potential values for PCA n_components.

In the next cells we are going to:

* Transform the data using PCA and understand how many components to consider
* Visualize the data after the PCA transformation

In [None]:
from sklearn.decomposition import PCA
import pandas as pd 

df_pca = pd.read_csv("/workspace/housing-market-analysis/outputs/datasets/collection/house-price-2021.csv")
df_pca = df_pca.sample(frac=0.8, random_state=101)
df_pca = df_pca.drop(labels=['WoodDeckSF', 'EnclosedPorch'], axis=1) # Drop columns 'WoodDeckSF' and 'EnclosedPorch'
# df_pca['GarageYrBlt'] = df_pca['GarageYrBlt'].astype(int)
print(df_pca.shape)
df_pca.head()

In [None]:
df_SP = df_pca[['SalePrice']]
X = df_pca.drop(['SalePrice'], axis=1)
print(X.shape)
X.head(3)

In [None]:
from sklearn.preprocessing import OrdinalEncoder

# Select categorical columns
categorical_columns = ['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual']

# Create an instance of OrdinalEncoder
ordinal_encoder = OrdinalEncoder()

# Apply Ordinal Encoding to the selected categorical columns
df_pca[categorical_columns] = ordinal_encoder.fit_transform(df_pca[categorical_columns])


### Create pipeline steps

To use PCA, it's important to scale the data first. Hence, we're setting up a pipeline to handle data preparation, feature engineering, and scaling.

In our setup, this pipeline will handle data cleaning by filling in missing values using median imputation, and then it will scale the features.

In [None]:
from sklearn.pipeline import Pipeline
### Data Cleaning
from feature_engine.imputation import MeanMedianImputer
### Feat Scaling
from sklearn.preprocessing import StandardScaler


def PipelineDataCleaningFeatEngFeatScaling():
  pipeline_base = Pipeline([
                            
      ( 'MeanMedianImputer', MeanMedianImputer(imputation_method='median') ),

      ( 'feature_scaling', StandardScaler() ),
  ])

  return pipeline_base

PipelineDataCleaningFeatEngFeatScaling()

We aim to apply PCA to the features only, excluding SalePrice. Above we created two separate DataFrames: X, containing the features, and df_SP, which holds SalePrice. Note that X comprises 21 features. We will use the features for PCA and df_SP at a later stage when visualizing the data.

Apply PCA separately to the scaled data

In [None]:
import numpy as np
from sklearn.decomposition import PCA

n_components = 21 # number of components as all columns in the data


pca = PCA(n_components=n_components).fit(df_pca)  # set PCA object and fit to the data
x_PCA = pca.transform(df_pca) # array with transformed PCA


# the Principal Component Analysis object has .explained_variance_ratio_ attribute, which tells 
# how much variance each component has 
# We store that to a DataFrame relating each component to its variance explanation
ComponentsList = ["Component " + str(number) for number in range(n_components)]
dfExplVarRatio = pd.DataFrame(
    data= np.round(100 * pca.explained_variance_ratio_ ,4),
    index=ComponentsList,
    columns=['Explained Variance Ratio (%)'])

# prints how much of the dataset these components explain
PercentageOfDataExplained = dfExplVarRatio['Explained Variance Ratio (%)'].sum()

print(f"* The {n_components} components explain {round(PercentageOfDataExplained,2)}% of the data \n")
print(dfExplVarRatio)
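
PCA is sensitive to the scale of its inputs, which is why scaling comes first. The following is a minimal sketch on synthetic data (not the housing dataset) that contrasts PCA on raw values with PCA after imputation and scaling. SimpleImputer is used here as a scikit-learn stand-in for MeanMedianImputer, and the array sizes are arbitrary assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 4))   # four independent synthetic columns
features[:, 0] *= 10_000               # one column on a much larger scale

# PCA on raw values: the large-scale column dominates the first component
raw_ratio = PCA(n_components=4).fit(features).explained_variance_ratio_

# PCA after median imputation and scaling: every column starts with unit variance
prep = Pipeline([
    ("median_imputer", SimpleImputer(strategy="median")),  # stand-in imputer
    ("feature_scaling", StandardScaler()),
])
scaled_ratio = (PCA(n_components=4)
                .fit(prep.fit_transform(features))
                .explained_variance_ratio_)

print("raw    (%):", np.round(100 * raw_ratio, 2))
print("scaled (%):", np.round(100 * scaled_ratio, 2))
```

If the first one or two components appear to explain nearly all of the variance, it is worth checking whether the data passed to PCA went through the scaling step.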

In the next cell we reuse the code from the cell above with n_components set to 5.

 - The first 2 components alone account for 99.98% of the variance

This indicates that the first two components explain almost all of the variability in the data, while the remaining components contribute very little.

In [None]:
n_components = 5

pca = PCA(n_components=n_components).fit(df_pca)
x_PCA = pca.transform(df_pca) # array with transformed PCA

ComponentsList = ["Component " + str(number) for number in range(n_components)]
dfExplVarRatio = pd.DataFrame(
    data= np.round(100 * pca.explained_variance_ratio_ , 4),
    index=ComponentsList,
    columns=['Explained Variance Ratio (%)'])

PercentageOfDataExplained = dfExplVarRatio['Explained Variance Ratio (%)'].sum()

print(f"* The {n_components} components explain {round(PercentageOfDataExplained, 4)}% of the data \n")
print(dfExplVarRatio)
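
One way to turn an explained-variance table into a choice of n_components is to look at the cumulative sum and pick the first component count that reaches a target share. The ratios and the 90% target below are hypothetical numbers for illustration, not results from this notebook.

```python
import numpy as np

# hypothetical explained-variance ratios, largest first as PCA returns them
explained_variance_ratio = np.array([0.60, 0.25, 0.10, 0.05])
target = 0.90  # assumed share of variance we want to retain

cumulative = np.cumsum(explained_variance_ratio)

# position of the first component at which the running total reaches the target
n_components_needed = int(np.argmax(cumulative >= target)) + 1

print("cumulative share:", np.round(cumulative, 2))
print("components needed:", n_components_needed)
```

The same array is available in the notebook as pca.explained_variance_ratio_, so the calculation could be applied to it directly.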

The x_PCA.shape attribute returns the shape of the PCA-transformed array as a tuple: the first element is the number of samples (rows) and the second element is the number of components (columns).

In [None]:
print(x_PCA.shape)
x_PCA

### Rewrite ML Pipeline for Modelling

In [None]:
from sklearn.decomposition import PCA
from feature_engine.imputation import MeanMedianImputer


def PipelineOptimization(model):
    pipeline_base = Pipeline([

        ("median_imputer", MeanMedianImputer(imputation_method='median')),

        ("SmartCorrelatedSelection", SmartCorrelatedSelection(variables=None,
         method="spearman", threshold=0.6, selection_method="variance")),


        ("feat_scaling", StandardScaler()),

        # PCA replace Feature Selection
        ("PCA", PCA(n_components=7, random_state=0)),

        ("model", model),

    ])

    return pipeline_base

## Grid Search CV – Sklearn

In [None]:
print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

**Use standard hyperparameters to find the most suitable model**.

In [None]:
models_search = {
    'LinearRegression': LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=101),
    "RandomForestRegressor": RandomForestRegressor(random_state=101),
    "ExtraTreesRegressor": ExtraTreesRegressor(random_state=101),
    "AdaBoostRegressor": AdaBoostRegressor(random_state=101),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=101),
    "XGBRegressor": XGBRegressor(random_state=101),
}

params_search = {
    'LinearRegression': {},
    "DecisionTreeRegressor": {},
    "RandomForestRegressor": {},
    "ExtraTreesRegressor": {},
    "AdaBoostRegressor": {},
    "GradientBoostingRegressor": {},
    "XGBRegressor": {},
}

Optimisation search

In [None]:
quick_search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
quick_search.fit(X_train, y_train, scoring='r2', n_jobs=-1, cv=5)

Observe results

In [None]:
grid_search_summary, grid_search_pipelines = quick_search.score_summary(sort_by='mean_score')
grid_search_summary

**Do an extensive search on the most suitable model to find the best hyperparameter configuration**.

Define model and parameters for extensive search

In [None]:
models_search = {
    "GradientBoostingRegressor":GradientBoostingRegressor(random_state=101),
}


params_search = {
    "GradientBoostingRegressor":{
        'model__n_estimators': [100,150,200],
        'model__learning_rate': [0.1, 0.01, 0.001],  # shrinkage applied to each tree, usually kept at or below 1
        'model__max_depth': [3,5,7], # Maximum depth of each individual tree
    }
}
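
Before launching an extensive search it helps to know how many model fits it implies. The sketch below counts the candidates of a grid with three values for each of three hyperparameters, using scikit-learn's ParameterGrid. The learning-rate values are illustrative, and 5 folds are assumed to match the cv value used in the next cell.

```python
from sklearn.model_selection import ParameterGrid

demo_grid = {
    "model__n_estimators": [100, 150, 200],
    "model__learning_rate": [0.1, 0.01, 0.001],  # illustrative values
    "model__max_depth": [3, 5, 7],
}
cv_folds = 5  # assumed number of cross-validation folds

# every combination of the listed values is one candidate
n_candidates = len(ParameterGrid(demo_grid))
total_fits = n_candidates * cv_folds

print(n_candidates, "candidates x", cv_folds, "folds =", total_fits, "fits")
```

The count grows multiplicatively with each added hyperparameter, so adding a fourth list of three values would triple the number of fits.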

Extensive GridSearch CV

In [None]:
search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train, scoring = 'r2', n_jobs=-1, cv=5)

Check results

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

Check the best model

In [None]:
best_model = grid_search_summary.iloc[0,0]
best_model

Check parameters for best model

In [None]:
grid_search_pipelines[best_model].best_params_

Defining the best regressor

In [None]:
best_regressor_pipeline = grid_search_pipelines[best_model].best_estimator_
best_regressor_pipeline

### **Evaluate Regressor on Train and Test Sets**

In [None]:
regression_performance(X_train, y_train, X_test, y_test,best_regressor_pipeline)
regression_evaluation_plots(X_train, y_train, X_test, y_test,
                            best_regressor_pipeline)

## **Convert Regression to Classification**
### Convert numerical target to bins, and check if it is balanced

In [None]:
from feature_engine.discretisation import EqualFrequencyDiscretiser
disc = EqualFrequencyDiscretiser(q=4, variables=['OverallQual'])  # we will try q as 3, 4, and 7
df_clf = disc.fit_transform(df_pca)

print(f"* The classes represent the following ranges: \n{disc.binner_dict_} \n")
sns.countplot(data=df_clf, x='OverallQual')
plt.show()
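
Equal-frequency discretisation chooses bin edges at quantiles so that each bin holds roughly the same number of rows. As a rough illustration of the idea, the sketch below uses pandas.qcut on synthetic values. It is a stand-in for the concept, not the feature_engine implementation.

```python
import numpy as np
import pandas as pd

values = pd.Series(np.arange(1, 101), name="score")  # 100 distinct synthetic values

# q=4 asks for quartiles; labels=False returns the bin number of each row
bins, edges = pd.qcut(values, q=4, labels=False, retbins=True)

print("bin edges:", edges)
print("rows per bin:", bins.value_counts().sort_index().to_list())
```

With a variable that has only a few distinct values, such as an integer rating, many rows share the same value and the bins can end up uneven, which is why the count plot above is worth checking.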

In [None]:
df_clf.head()

### Rewrite ML Pipeline for Modelling

In [None]:
def PipelineOptimization(model):
    pipeline_base = Pipeline([

        ("OrdinalCategoricalEncoder", OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),


        ("SmartCorrelatedSelection", SmartCorrelatedSelection(variables=None,
         method="spearman", threshold=0.6, selection_method="variance")),

        ("feat_scaling", StandardScaler()),

        ("feat_selection",  SelectFromModel(model)),

        ("model", model),

    ])

    return pipeline_base

## Load algorithms for classification

In [None]:
from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier 
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier

### Split Train Test Sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    df_clf.drop(['OverallQual'], axis=1),
    df_clf['OverallQual'],
    test_size=0.2,
    random_state=0
)

print("* Train set:", X_train.shape, y_train.shape,
      "\n* Test set:",  X_test.shape, y_test.shape)

## Grid Search CV – Sklearn
### Use standard hyperparameters to find most suitable model

In [None]:
models_search = {
    "XGBClassifier": XGBClassifier(random_state=101),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=101),
    "RandomForestClassifier": RandomForestClassifier(random_state=101),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=101),
    "ExtraTreesClassifier": ExtraTreesClassifier(random_state=101),
    "AdaBoostClassifier": AdaBoostClassifier(random_state=101, algorithm='SAMME'),
}

params_search = {
    "XGBClassifier":{},
    "DecisionTreeClassifier":{},
    "RandomForestClassifier":{},
    "GradientBoostingClassifier":{},
    "ExtraTreesClassifier":{},
    "AdaBoostClassifier":{},
}

GridSearch CV

In [None]:
from sklearn.metrics import make_scorer, recall_score
quick_search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
quick_search.fit(X_train, y_train,
                 scoring = make_scorer(recall_score, labels=[0], average=None),
                 n_jobs=-1,
                 cv=6)
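
The scorer above ranks models by recall on class 0 only. A minimal sketch with made-up labels shows what recall_score returns with labels=[0] and average=None, and how the same number can be computed by hand:

```python
import numpy as np
from sklearn.metrics import recall_score

# made-up labels for three classes
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 0, 2])

# average=None returns one recall per requested label; labels=[0] requests class 0 only
class0_recall = recall_score(y_true, y_pred, labels=[0], average=None)

# by hand: correctly predicted class-0 rows divided by all actual class-0 rows
manual = np.sum((y_true == 0) & (y_pred == 0)) / np.sum(y_true == 0)

print(class0_recall, manual)
```

Because only class 0 is scored, a model can rank highly here while performing poorly on the other classes, so the full classification report remains worth reading.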

Check results

In [None]:
grid_search_summary, grid_search_pipelines = quick_search.score_summary(sort_by='mean_score')
grid_search_summary

### Do an extensive search on the most suitable model to find the best hyperparameter configuration.

Define models and parameters

In [None]:
models_search = {
    "AdaBoostClassifier": AdaBoostClassifier(random_state=101, algorithm='SAMME'),
}

# documentation to help on hyperparameter list:
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html
params_search = {
    "AdaBoostClassifier": {
        'model__n_estimators': [50, 100, 150],
        'model__learning_rate': [0.1, 0.01, 0.001],
    }
}

Extensive GridSearch CV

In [None]:
from sklearn.metrics import make_scorer,  recall_score
search = HyperparameterOptimizationSearch(
    models=models_search, params=params_search)
search.fit(X_train, y_train,
           scoring=make_scorer(recall_score, labels=[0], average=None),
           n_jobs=-1, cv=5)

Check results

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

Check the best model

In [None]:
best_model = grid_search_summary.iloc[0,0]
best_model

Parameters for best model

 - We are saving this content for later

In [None]:
best_parameters = grid_search_pipelines[best_model].best_params_
best_parameters

Define the best clf pipeline

In [None]:
pipeline_clf = grid_search_pipelines[best_model].best_estimator_
pipeline_clf

Observe feature importance

We can assess feature importance for this model with .feature_importances_

In [None]:
data_cleaning_feat_eng_steps = 2
columns_after_data_cleaning_feat_eng = (Pipeline(pipeline_clf.steps[:data_cleaning_feat_eng_steps])
                                        .transform(X_train)
                                        .columns)

# best_features = columns_after_data_cleaning_feat_eng
best_features = columns_after_data_cleaning_feat_eng[pipeline_clf['feat_selection'].get_support(
)].to_list()

# create DataFrame to display feature importance
df_feature_importance = (pd.DataFrame(data={
    'Feature': columns_after_data_cleaning_feat_eng[pipeline_clf['feat_selection'].get_support()],
    'Importance': pipeline_clf['model'].feature_importances_})
    .sort_values(by='Importance', ascending=False)
)

# reassign best features in order
best_features = df_feature_importance['Feature'].to_list()

# Most important features statement and plot
print(f"* These are the {len(best_features)} most important features in descending order. "
      f"The model was trained on them: \n{best_features}")

df_feature_importance.plot(kind='bar', x='Feature', y='Importance')
plt.show()

### Evaluate Classifier on Train and Test Sets

Custom Function

In [None]:
from sklearn.metrics import classification_report, confusion_matrix


def confusion_matrix_and_report(X, y, pipeline, label_map):

    prediction = pipeline.predict(X)

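    print('---  Confusion Matrix  ---')
    # the arguments are swapped so that rows are predictions and columns are actual values,
    # matching the "Prediction" / "Actual" labels below
    print(pd.DataFrame(confusion_matrix(y_true=prediction, y_pred=y),
          columns=[["Actual " + sub for sub in label_map]],
          index=[["Prediction " + sub for sub in label_map]]
          ))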
    print("\n")

    print('---  Classification Report  ---')
    print(classification_report(y, prediction, target_names=label_map, zero_division=0), "\n")


def clf_performance(X_train, y_train, X_test, y_test, pipeline, label_map):
    print("#### Train Set #### \n")
    confusion_matrix_and_report(X_train, y_train, pipeline, label_map)

    print("#### Test Set ####\n")
    confusion_matrix_and_report(X_test, y_test, pipeline, label_map)
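
The orientation of a confusion matrix is easy to misread. In scikit-learn, rows correspond to the first argument (y_true) and columns to the second (y_pred), so swapping the two arguments transposes the table. The sketch below checks this on made-up labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# made-up actual and predicted labels
actual = np.array([0, 0, 0, 1, 1])
predicted = np.array([0, 1, 1, 1, 1])

# standard call: rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true=actual, y_pred=predicted)

# swapped call: rows are predicted classes, columns are actual classes
cm_swapped = confusion_matrix(y_true=predicted, y_pred=actual)

print(cm)
print(cm_swapped)
```

Either orientation is usable as long as the row and column labels in the printed table agree with it.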

List that relates the classes and OverallQual interval

In [None]:
disc.binner_dict_['OverallQual']

We turn these intervals into readable class labels, which are used below to label the confusion matrix and the classification report.

In [None]:
label_map = ['<5', '5 to 6','7', '+7']
label_map

In [None]:
clf_performance(X_train=X_train, y_train=y_train,
                        X_test=X_test, y_test=y_test,
                        pipeline=pipeline_clf,
                        label_map= label_map)

In [None]:
pipeline_clf

## Push files to the repo
We will generate the following files

- Train set
- Test set
- Modeling pipeline
- label map
- features importance plot

In [None]:
import joblib
import os

version = 'v1'
file_path = f'outputs/ml_pipeline/predict_OverallQual/{version}'

# create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs(name=file_path, exist_ok=True)

### **Train Set: features and target**

In [None]:
X_train.head()

In [None]:
X_train.to_csv(f"{file_path}/X_train.csv", index=False)

In [None]:
y_train

In [None]:
y_train.to_csv(f"{file_path}/y_train.csv", index=False)

### Test Set: features and target

In [None]:
X_test.head()

In [None]:
X_test.to_csv(f"{file_path}/X_test.csv", index=False)

In [None]:
y_test

In [None]:

y_test.to_csv(f"{file_path}/y_test.csv", index=False)

### List mapping target levels to ranges
Map for converting numerical variable to categorical variable

In [None]:
label_map

In [None]:
joblib.dump(value=label_map, filename=f"{file_path}/label_map.pkl")
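
A fitted pipeline could be saved the same way as the label map. The sketch below round-trips both objects through joblib in a temporary folder. The small LinearRegression pipeline, the data and the file names are hypothetical stand-ins, not artefacts of this notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_label_map = ['<5', '5 to 6', '7', '+7']

# small fitted pipeline standing in for the notebook's pipeline
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = X_demo.sum(axis=1)
demo_pipeline = Pipeline([
    ("feat_scaling", StandardScaler()),
    ("model", LinearRegression()),
]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    # hypothetical file names
    map_path = os.path.join(tmp_dir, "label_map.pkl")
    pipeline_path = os.path.join(tmp_dir, "demo_pipeline.pkl")

    joblib.dump(value=demo_label_map, filename=map_path)
    joblib.dump(value=demo_pipeline, filename=pipeline_path)

    # load both objects back while the files still exist
    reloaded_map = joblib.load(map_path)
    reloaded_pipeline = joblib.load(pipeline_path)

print(reloaded_map == demo_label_map)
print(np.allclose(reloaded_pipeline.predict(X_demo), demo_pipeline.predict(X_demo)))
```

Pickled scikit-learn objects are best reloaded with the same library versions that were used to save them.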

### Feature importance plot

In [None]:
df_feature_importance.plot(kind='bar', x='Feature', y='Importance')
plt.show()

This code snippet generates a bar plot using pandas to visualize the importance of the *features* in a dataset, with *features* represented on the x-axis and their respective importance values on the y-axis. It then saves the plot as an image file named "features_importance.png" in a specified directory.

In [None]:
df_feature_importance.plot(kind='bar', x='Feature', y='Importance')
plt.savefig(f'{file_path}/features_importance.png', bbox_inches='tight')

This code creates a scatter plot to help us understand how two specific features, **'OverallQual'** and **'GrLivArea'**, relate to the target variable **'SalePrice'**. With **'OverallQual'** on the x-axis and **'GrLivArea'** on the y-axis, each data point represents a combination of these two features, and its color indicates the corresponding **'SalePrice'**. This visualization allows us to explore potential patterns or correlations between these features and the target variable.
The larger the ground living area and the higher the overall quality of the house, the higher the SalePrice.

In [None]:
var1, var2 = 'OverallQual' , 'GrLivArea'
sns.scatterplot(x=X[var1], y=X[var2], hue=df_SP['SalePrice'])
sns.despine()
plt.xlabel(var1)
plt.ylabel(var2)
plt.show()

Next we draw a scatter plot of the first two principal components (Component 0 and Component 1) generated by PCA. The x-coordinate represents the values of Component 0, while the y-coordinate represents the values of Component 1.

In [None]:
sns.scatterplot(x=x_PCA[:,0], y=x_PCA[:,1], hue=df_SP['SalePrice'], alpha=0.8)
plt.xlabel('Component-0')
plt.ylabel('Component-1')
sns.despine()
plt.show()

Finally, we use Plotly Express to create a 3D scatter plot. The x, y, and z axes represent the first three principal components obtained from PCA. Each point in the plot corresponds to a data sample, and its color indicates the SalePrice. This visualization helps us explore the distribution of samples in the three-dimensional space defined by the principal components and see whether they separate or cluster according to price.

In [None]:
import plotly.express as px
fig = px.scatter_3d(x=x_PCA[:,0], y=x_PCA[:,1], z= x_PCA[:,2] , color=df_SP['SalePrice'],
                    labels=dict(x="Component 0", y="Component 1", z='Component 2'),
                    color_continuous_scale='RdYlBu',
                    width=750, height=500)
fig.update_traces(marker_size=5)
fig.show()