# Predicting SalePrice

## Objectives

Create and evaluate a model that predicts the SalePrice of a building.

## Inputs
* outputs/datasets/cleaned/test.parquet.gzip
* outputs/datasets/cleaned/train.parquet.gzip
* Conclusions from Feature Engineering jupyter_notebooks/04_Feature_Engineering.ipynb

## Outputs
* Train Set: Features and Target
* Test Set: Features and Target
* Feature Engineering Pipeline
* Modeling Pipeline
* Features Importance Plot

## Change working directory
In this section we get the location of the current directory and move one level up, to the parent folder, so that the app can access the project folder.

* We access the current directory with os.getcwd()

In [None]:
import os

current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("you have set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

## Loading Dataset

In [136]:
import pandas as pd

df = pd.read_csv("outputs/datasets/collection/HousePricesRecords.csv")
df.head()

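| | Unnamed: 0 | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | EnclosedPorch | GarageArea | ... | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 856 | 854.0 | 3.0 | No | 706 | GLQ | 150 | 0.0 | 548 | ... | 65.0 | 196.0 | 61 | 5 | 7 | 856 | 0.0 | 2003 | 2003 | 208500 |
| 1 | 1 | 1262 | 0.0 | 3.0 | Gd | 978 | ALQ | 284 | NaN | 460 | ... | 80.0 | 0.0 | 0 | 8 | 6 | 1262 | NaN | 1976 | 1976 | 181500 |
| 2 | 2 | 920 | 866.0 | 3.0 | Mn | 486 | GLQ | 434 | 0.0 | 608 | ... | 68.0 | 162.0 | 42 | 5 | 7 | 920 | NaN | 2001 | 2002 | 223500 |
| 3 | 3 | 961 | NaN | NaN | No | 216 | ALQ | 540 | NaN | 642 | ... | 60.0 | 0.0 | 35 | 5 | 7 | 756 | NaN | 1915 | 1970 | 140000 |
| 4 | 4 | 1145 | NaN | 4.0 | Av | 655 | GLQ | 490 | 0.0 | 836 | ... | 84.0 | 350.0 | 84 | 5 | 8 | 1145 | NaN | 2000 | 2000 | 250000 |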


## Data Exploration
Before exploring the data and applying transformations, we drop the features we decided earlier to remove:

In [137]:
drop_features = ['Unnamed: 0']
df.drop(columns=drop_features, inplace=True)

### Cleaning Dataset

In [138]:
df.loc[:, 'LotFrontage'] = df['LotFrontage'].fillna(70)

# Lists of columns grouped by their fill values and type conversions
fill_zero_and_convert = ['1stFlrSF', '2ndFlrSF', 'GarageArea', 'GarageYrBlt',
                         'EnclosedPorch', 'MasVnrArea', 'WoodDeckSF', 'BedroomAbvGr']
fill_none = ['BsmtExposure', 'BsmtFinType1', 'GarageFinish']

# Fill missing values with zero and convert to integers for numerical columns
df[fill_zero_and_convert] = df[fill_zero_and_convert].fillna(0).astype(int)

# Fill missing values with 'None' for categorical columns
df[fill_none] = df[fill_none].fillna('None')
df['LotFrontage'] = df['LotFrontage'].round().astype(int)

df.loc[df['2ndFlrSF'] == 0, 'BedroomAbvGr'] = df['BedroomAbvGr'].replace(0, 2)
df.loc[df['2ndFlrSF'] > 0, 'BedroomAbvGr'] = df['BedroomAbvGr'].replace(0, 3)

# Swap values where '2ndFlrSF' is greater than '1stFlrSF'
swap_idx = df['2ndFlrSF'] > df['1stFlrSF']
df.loc[swap_idx, ['1stFlrSF', '2ndFlrSF']] = df.loc[swap_idx, ['2ndFlrSF', '1stFlrSF']].values

# Define features and their 'no presence' values
basement_features = ['BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF']
features_and_values = {"BsmtExposure": "None", "BsmtFinType1": "None", "BsmtFinSF1": 0, "BsmtUnfSF": 0,
                       "TotalBsmtSF": 0}

# Check and update inconsistencies for each feature
for feature in basement_features:
    primary_value = features_and_values[feature]
    df['Consistency'] = df.apply(
        lambda row: all(row[f] == v for f, v in features_and_values.items()) if row[feature] == primary_value else True,
        axis=1
    )
    inconsistent_idx = df[~df['Consistency']].index
    if feature in ['BsmtExposure', 'BsmtFinType1']:
        correction = 'No' if feature == 'BsmtExposure' else 'Unf'
        df.loc[inconsistent_idx, feature] = correction

# Drop the temporary Consistency column created by the loop above
df = df.drop(columns=['Consistency'])

# Correct zero values and adjust inconsistent records using vectorized operations
df.loc[df['BsmtUnfSF'] == 0, 'BsmtUnfSF'] = df['TotalBsmtSF'] - df['BsmtFinSF1']
df.loc[df['BsmtFinSF1'] == 0, 'BsmtFinSF1'] = df['TotalBsmtSF'] - df['BsmtUnfSF']
df.loc[df['TotalBsmtSF'] == 0, 'TotalBsmtSF'] = df['BsmtUnfSF'] + df['BsmtFinSF1']

# Where finished + unfinished area still does not add up to the total, assign one third
# of TotalBsmtSF to BsmtUnfSF and the remainder to BsmtFinSF1 (the ratio 3 is an example value)
mask = df['BsmtFinSF1'] + df['BsmtUnfSF'] != df['TotalBsmtSF']
df.loc[mask, 'BsmtUnfSF'] = (df.loc[mask, 'TotalBsmtSF'] / 3).astype(int)
df.loc[mask, 'BsmtFinSF1'] = df.loc[mask, 'TotalBsmtSF'] - df.loc[mask, 'BsmtUnfSF']

# Define a dictionary for checking consistency based on 'GarageFinish'
features_and_values = {"GarageArea": 0, "GarageFinish": 'None', "GarageYrBlt": 0}


def check_consistency(df, primary_feature):
    primary_value = features_and_values[primary_feature]
    return df.apply(
        lambda row: all(row[feature] == value for feature, value in features_and_values.items())
        if row[primary_feature] == primary_value else True, axis=1
    )


# Apply consistency check and correct 'GarageFinish'
consistency_mask = check_consistency(df, 'GarageFinish')
df.loc[~consistency_mask, 'GarageFinish'] = 'Unf'

# Correct garage years that are earlier than the house build year
df.loc[df['GarageYrBlt'] < df['YearBuilt'], 'GarageYrBlt'] = df['YearBuilt']
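
The row-wise `apply` checks above can become slow on larger frames. A minimal sketch of the same basement consistency rule using boolean masks is shown below. The column names come from the dataset, but the toy values are made up for illustration, and this is an alternative formulation rather than the code used in this notebook.

```python
import pandas as pd

# Toy frame (made-up values) with the basement columns from the dataset
toy = pd.DataFrame({
    "BsmtExposure": ["None", "No", "None"],
    "BsmtFinType1": ["None", "GLQ", "None"],
    "BsmtFinSF1": [0, 500, 300],
    "BsmtUnfSF": [0, 100, 200],
    "TotalBsmtSF": [0, 600, 500],
})

no_basement = {"BsmtExposure": "None", "BsmtFinType1": "None",
               "BsmtFinSF1": 0, "BsmtUnfSF": 0, "TotalBsmtSF": 0}

# True where a column holds its 'no basement' value
is_absent = pd.DataFrame({col: toy[col] == val for col, val in no_basement.items()})

# A row is inconsistent when some, but not all, columns say 'no basement'
inconsistent = is_absent.any(axis=1) & ~is_absent.all(axis=1)

# Same corrections as in the cell above, applied only where the label claims 'None'
toy.loc[inconsistent & is_absent["BsmtExposure"], "BsmtExposure"] = "No"
toy.loc[inconsistent & is_absent["BsmtFinType1"], "BsmtFinType1"] = "Unf"
print(toy)
```

Only the third toy row is corrected: its labels say there is no basement while its areas are non-zero.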

## Splitting into Train and Test Sets

In [139]:
from sklearn.model_selection import train_test_split

X = df.drop(columns='SalePrice')
y = df['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

## Machine Learning

### Pre-Transformations

In [140]:
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np
import pandas as pd
import category_encoders as ce

# Define custom FeatureCreator class


class FeatureCreator(BaseEstimator, TransformerMixin):
    """Custom feature creator for pipeline integration.

    This class extends sklearn's TransformerMixin to allow for custom feature
    creation during preprocessing pipelines. It handles various mathematical
    transformations and feature interactions explicitly detailed within the
    transform method, ensuring all features are appropriately processed and added.

    """

    def fit(self, X, y=None):
        # The fit method is not used for adding features, it's just here for compatibility.
        return self

    def transform(self, X):
        """Apply a series of custom transformations to the dataframe.

        Args:
        X (pd.DataFrame): Input dataframe from which features are derived.

        Returns:
        pd.DataFrame: The dataframe with new features added.

        """
        X = X.copy()  # Work on a copy of the data to prevent changes to the original dataframe
        # Numeric and Boolean feature interactions and transformations
        X['NF_TotalBsmtSF_mul_BsmtExposure'] = X['TotalBsmtSF'] * X['BsmtExposure']
        # Note: despite its name, this feature multiplies TotalBsmtSF by BsmtFinSF1, not by BsmtFinType1
        X['NF_TotalBsmtSF_mul_BsmtFinType1'] = X['TotalBsmtSF'] * X['BsmtFinSF1']
        X['NF_BsmtFinSF1_mul_BsmtFinType1'] = X['BsmtFinType1'] * X['BsmtFinSF1']
        X['NF_GarageFinish_mul_GarageArea'] = X['GarageFinish'] * X['GarageArea']
        X['NF_TotalLivingArea'] = X['GrLivArea'] + X['1stFlrSF'] + X['2ndFlrSF']
        X['NF_TotalLivingArea_mul_OverallQual'] = X['NF_TotalLivingArea'] * X['OverallQual']
        X['NF_TotalLivingArea_mul_OverallCond'] = X['NF_TotalLivingArea'] * X['OverallCond']
        X['NF_1stFlrSF_mul_OverallQual'] = X['1stFlrSF'] * X['OverallQual']
        X['NF_2ndFlrSF_mul_OverallQual'] = X['2ndFlrSF'] * X['OverallQual']
        X['NF_Age_Garage'] = 2010 - X['GarageYrBlt']
        X['NF_Age_Build'] = 2010 - X['YearBuilt']
        X['NF_Age_Remod'] = 2010 - X['YearRemodAdd']
        # Remodel age, set to 0 when the remodel year equals the build year
        X['NF_Remod_TEST'] = X['NF_Age_Remod'].where(X['NF_Age_Build'] != X['NF_Age_Remod'], 0)
        # Presence flags (1 = present, 0 = absent), computed column-wise instead of row by row
        X['NF_Has_2nd_floor'] = (X['2ndFlrSF'] != 0).astype(int)
        X['NF_Has_basement'] = (X['TotalBsmtSF'] != 0).astype(int)
        X['NF_Has_garage'] = (X['GarageArea'] != 0).astype(int)
        X['NF_Has_Masonry_Veneer'] = (X['MasVnrArea'] != 0).astype(int)
        X['NF_Has_Enclosed_Porch'] = (X['EnclosedPorch'] != 0).astype(int)
        X['NF_Has_Open_Porch'] = (X['OpenPorchSF'] != 0).astype(int)
        X['NF_Has_ANY_Porch'] = X['NF_Has_Enclosed_Porch'] | X['NF_Has_Open_Porch']
        X['NF_Has_Wooden_Deck'] = (X['WoodDeckSF'] != 0).astype(int)

        return X


# Mapping and encoder setup
encoding_dict = {
    'BsmtExposure': {'None': 0, 'No': 1, 'Mn': 2, 'Av': 3, 'Gd': 4},
    'BsmtFinType1': {'None': 0, 'Unf': 1, 'LwQ': 2, 'Rec': 3, 'BLQ': 4, 'ALQ': 5, 'GLQ': 6},
    'GarageFinish': {'None': 0, 'Unf': 1, 'RFn': 2, 'Fin': 3},
    'KitchenQual': {'None': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
}

ordinal_encoder = ce.OrdinalEncoder(mapping=[
    {'col': k, 'mapping': v} for k, v in encoding_dict.items()
])

# Pipeline setup
pre_feature_transformations = Pipeline(steps=[
    ('ordinal_encoder', ordinal_encoder),  # Custom categorical encoding
    ('feature_creator', FeatureCreator())  # Custom feature creation
])
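
To make the encoding step concrete, the sketch below applies the `BsmtExposure` mapping with plain pandas `Series.map` on a few made-up rows and then builds one of the interaction features. It illustrates the intended result only and is not a description of how `ce.OrdinalEncoder` works internally.

```python
import pandas as pd

# Same ordering as the BsmtExposure entry of encoding_dict
exposure_map = {'None': 0, 'No': 1, 'Mn': 2, 'Av': 3, 'Gd': 4}

# Made-up rows for illustration
toy = pd.DataFrame({
    'BsmtExposure': ['No', 'Gd', 'None', 'Av'],
    'TotalBsmtSF': [856, 1262, 0, 1145],
})

# Replace each category label with its ordinal rank
toy['BsmtExposure'] = toy['BsmtExposure'].map(exposure_map)

# The interaction feature needs numbers, which is why encoding runs first
toy['NF_TotalBsmtSF_mul_BsmtExposure'] = toy['TotalBsmtSF'] * toy['BsmtExposure']
print(toy)
```

This is also why the encoder comes before `FeatureCreator` in the pipeline: multiplying an area by a string label would not give a numeric product.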

### Features - Columns transformations

In [141]:
from sklearn.pipeline import Pipeline
from feature_engine.transformation import YeoJohnsonTransformer, BoxCoxTransformer, PowerTransformer

# Define the columns for each transformation type
yeo_johnson_features = ['1stFlrSF', '2ndFlrSF', 'BedroomAbvGr', 'BsmtExposure', 'BsmtUnfSF', 'EnclosedPorch',
                        'GarageArea', 'GarageFinish', 'GrLivArea', 'KitchenQual', 'LotArea', 'MasVnrArea',
                        'OpenPorchSF', 'OverallCond', 'TotalBsmtSF', 'WoodDeckSF', 'NF_TotalBsmtSF_mul_BsmtExposure',
                        'NF_BsmtFinSF1_mul_BsmtFinType1', 'NF_GarageFinish_mul_GarageArea', 'NF_TotalLivingArea',
                        'NF_Age_Garage', 'NF_Age_Remod', 'NF_Remod_TEST', 'NF_TotalLivingArea_mul_OverallQual',
                        'NF_TotalLivingArea_mul_OverallCond', 'NF_1stFlrSF_mul_OverallQual',
                        'NF_2ndFlrSF_mul_OverallQual']
power_features = ['GarageYrBlt', 'LotFrontage', 'YearRemodAdd', 'NF_TotalBsmtSF_mul_BsmtFinType1', 'NF_Age_Build']
box_cox_features = []

# Create transformers for each group of features using feature_engine transformers
yeo_johnson_transformer = YeoJohnsonTransformer(variables=yeo_johnson_features)
power_transformer = PowerTransformer(variables=power_features, exp=0.5)
box_cox_transformer = BoxCoxTransformer(variables=box_cox_features)

# Combine the transformers into a single pipeline
# (Box-Cox is left out because no features are assigned to it)
feature_transformer = Pipeline([
    ('yeo_johnson', yeo_johnson_transformer),
    ('power', power_transformer),
])
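
For reference, the two transformations can be written out with NumPy alone. The sketch below implements the Yeo-Johnson formula for a fixed lambda and the square root that `PowerTransformer(exp=0.5)` applies. The function name `yeo_johnson` and the sample values are hypothetical, and as far as we know the feature_engine transformer estimates a separate lambda for each variable during `fit` instead of taking a fixed one.

```python
import numpy as np


def yeo_johnson(x, lmbda):
    """Yeo-Johnson transform of a 1-D array for a fixed lambda."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    # Non-negative values: Box-Cox of (x + 1)
    if lmbda != 0:
        out[pos] = ((x[pos] + 1) ** lmbda - 1) / lmbda
    else:
        out[pos] = np.log1p(x[pos])
    # Negative values: mirrored Box-Cox with exponent (2 - lambda)
    if lmbda != 2:
        out[~pos] = -(((-x[~pos] + 1) ** (2 - lmbda)) - 1) / (2 - lmbda)
    else:
        out[~pos] = -np.log1p(-x[~pos])
    return out


areas = np.array([0.0, 856.0, 1262.0, 5000.0])  # made-up areas
yj = yeo_johnson(areas, 0.0)    # lambda = 0 reduces to log(1 + x)
root = np.power(areas, 0.5)     # square root, as with exp=0.5
print(yj, root)
```

Unlike Box-Cox, Yeo-Johnson accepts zeros and negative values, which makes it usable for area columns that contain many zeros.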


### Features-Columns Post Transformations

In [166]:
from feature_engine.outliers import Winsorizer
from sklearn.preprocessing import StandardScaler

from sklearn.pipeline import Pipeline

# Define the columns for Winsorization
winsorize_features = ['GarageArea', 'LotArea', 'LotFrontage', 'TotalBsmtSF', 'NF_TotalBsmtSF_mul_BsmtExposure',
                      'NF_TotalLivingArea_mul_OverallCond']

# Initialize the Winsorizer transformer
# We apply the Winsorizer to the features from the table in jupyter_notebooks/08_Feature_Engineering_hypothesis_3.ipynb,
# namely the ones that had a high or higher outlier level
winsorize_transformer = Winsorizer(capping_method='iqr', tail='both', fold=1.5, variables=winsorize_features)


# Create the post-feature transformations pipeline
post_feature_transformer = Pipeline([
    ('winsorize', winsorize_transformer),
    ('standard_scaler', StandardScaler())
])
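
The IQR capping rule with `fold=1.5` can be sketched with pandas alone: values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are pulled back to those limits. The sample values below are made up, and the snippet only mirrors the rule; the `Winsorizer` itself learns its limits from the training data.

```python
import pandas as pd

# Made-up values with one clear outlier
values = pd.Series([50, 60, 62, 65, 68, 70, 72, 75, 80, 300], name='LotFrontage')

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap both tails at the computed limits
capped = values.clip(lower=lower, upper=upper)
print(lower, upper)
print(capped.tolist())
```

Capping keeps every row in the dataset, in contrast to dropping outliers, so the train set size does not change.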


### Target Transformations

In [167]:
from sklearn.base import BaseEstimator, TransformerMixin


class LogTransformer(BaseEstimator, TransformerMixin):
    """Applies a log(1 + x) transformation (natural logarithm) to the target variable."""

    def fit(self, X, y=None):
        # This transformer does not need to learn anything from the data
        return self

    def transform(self, X):
        # Apply log(1 + x); np.log1p stays accurate for small values and handles X = 0
        return np.log1p(X)

    def inverse_transform(self, X):
        # Reverse the transformation using exponential; np.expm1 is used for numerical stability
        return np.expm1(X)


# Create a pipeline for transforming the target variable
target_transformation_pipeline = Pipeline([
    ('log_transform', LogTransformer()),  # Log transformation
])


class PassthroughTransformer(BaseEstimator, TransformerMixin):
    """A transformer that passes through the data without changing it."""

    def fit(self, X, y=None):
        # No fitting necessary for passthrough
        return self

    def transform(self, X):
        # Return the data as is
        return X

    def inverse_transform(self, X):
        # Return the data as is
        return X


# Create a pipeline for passthrough transformation
passthrough_transformation_pipeline = Pipeline([
    ('passthrough', PassthroughTransformer())  # Passthrough transformer
])
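
A quick round trip shows what `LogTransformer` does to the target. The sketch uses four SalePrice values from the `df.head()` output and calls the same NumPy functions directly.

```python
import numpy as np

# SalePrice values taken from the df.head() output above
prices = np.array([140000.0, 181500.0, 208500.0, 250000.0])

logged = np.log1p(prices)      # what LogTransformer.transform returns
restored = np.expm1(logged)    # what LogTransformer.inverse_transform returns

print(logged)
print(restored)
```

On the log scale the prices sit much closer together, and `expm1` brings predictions back to the original currency scale.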



### Main Pipeline 

In [168]:
from sklearn.pipeline import Pipeline
from sklearn.compose import TransformedTargetRegressor
from sklearn.feature_selection import SelectFromModel

def create_pipeline(model, target_transformer, use_feature_selection=True, max_iter=1000, eps=1e-3):
    """
    Create a pipeline with preprocessing, feature transformation, selection, and modeling.

    Args:
        model: The model to use in the pipeline.
        target_transformer: The transformer for the target variable.
        use_feature_selection (bool): Whether to use SelectFromModel for feature selection.
        max_iter (int): Maximum number of iterations for iterative models.
        eps (float): Convergence threshold for iterative models.

    Returns:
        Pipeline: The complete pipeline.
    """
    # Set max_iter and tol (taken from eps) for models that expose these parameters
    if hasattr(model, 'max_iter'):
        model.set_params(max_iter=max_iter)
    if hasattr(model, 'tol'):
        model.set_params(tol=eps)

    steps = [
        ('pre_transformations', pre_feature_transformations),  # Preprocessing steps
        ('transformations', feature_transformer),  # Feature transformations
        ('post_transformations', post_feature_transformer),  # Post-transformations
    ]

    # Add the feature selection step when requested by the caller
    if use_feature_selection:
        steps.append(('feat_selection', SelectFromModel(model)))

    steps.append(('model', TransformedTargetRegressor(regressor=model, transformer=target_transformer)))

    return Pipeline(steps)
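
The last pipeline step wraps the model in `TransformedTargetRegressor`, which transforms `y` before fitting and inverts the transformation on predictions. Below is a minimal sketch on synthetic data. It passes `func` and `inverse_func` instead of the `transformer` object used above, and the data and the choice of `LinearRegression` are assumptions made for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic, strictly positive target that grows exponentially with the feature
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = np.exp(3 * X[:, 0] + rng.normal(0, 0.1, size=200))

model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,          # applied to y before fitting
    inverse_func=np.expm1,  # applied to predictions
)
model.fit(X, y)
pred = model.predict(X)     # already back on the original scale
print(pred[:5])
```

Because the inversion happens inside `predict`, scoring with R² in the grid search compares predictions and targets on the original price scale.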


## ML Pipeline for Modeling and Hyperparameters Optimization

This is a custom class for hyperparameter optimization across several models.

In [169]:
import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import GridSearchCV

class grid_cv_search_hp:
    """
    Class to perform hyperparameter optimization across multiple machine learning models.

    Attributes:
        models (dict): Dictionary of models to evaluate.
        params (dict): Dictionary of hyperparameters for the models.
        grid_searches (dict): Dictionary to store the results of GridSearchCV.
    """

    def __init__(self, models, params, target_transformer):
        """
        Initializes the grid_cv_search_hp with models and parameters.

        Args:
            models (dict): A dictionary of model names and instances.
            params (dict): A dictionary of model names and their hyperparameters.
            target_transformer: The transformer for the target variable.
        """
        self.models = models
        self.params = params
        self.grid_searches = {}
        self.target_transformer = target_transformer

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring='r2', refit=False):
        """
        Perform hyperparameter optimization using GridSearchCV for each model.

        Args:
            X (array-like): Training data features.
            y (array-like): Training data target values.
            cv (int): Number of cross-validation folds.
            n_jobs (int): Number of jobs to run in parallel.
            verbose (int): Controls the verbosity of the output.
            scoring (str): Scoring metric for model evaluation.
            refit (bool): Whether to refit the best model on the whole dataset after searching.

        Returns:
            None
        """
        for key in self.models:
            print(f"\nOptimizing hyperparameters for {key}...\n")
            model = self.models[key]
            # Check if the model has the necessary attributes for feature selection
            use_feature_selection = hasattr(model, 'coef_') or hasattr(model, 'feature_importances_')
            pipeline = create_pipeline(model, self.target_transformer, use_feature_selection=use_feature_selection)

            params = self.params[key]

            with warnings.catch_warnings():
                warnings.simplefilter("ignore", category=ConvergenceWarning)
                gs = GridSearchCV(pipeline, params, cv=cv, n_jobs=n_jobs, verbose=verbose, scoring=scoring, refit=refit)
                gs.fit(X, y)
                self.grid_searches[key] = gs

    def score_summary(self, sort_by='mean_score'):
        """
        Summarize the grid search results.

        Args:
            sort_by (str): The column to sort the results by.

        Returns:
            DataFrame: A pandas DataFrame containing the summary of grid search results.
            dict: The grid search results.
        """

        def row(key, scores, params):
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = f"split{i}_test_score"
                r = self.grid_searches[k].cv_results_[key]
                scores.append(r.reshape(len(params), 1))

            all_scores = np.hstack(scores)
            for p, s in zip(params, all_scores):
                rows.append(row(k, s, p))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)
        columns = ['estimator', 'min_score', 'mean_score', 'max_score', 'std_score']
        columns += [c for c in df.columns if c not in columns]
        return df[columns], self.grid_searches
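
The reshaping inside `score_summary` is easier to follow on a small example. The sketch below uses a hypothetical `cv_results` dictionary with two parameter candidates and three folds; the scores are made up.

```python
import numpy as np

# Hypothetical cv_results_ for 2 parameter candidates and 3 folds
cv_results = {
    'params': [{'alpha': 0.1}, {'alpha': 1.0}],
    'split0_test_score': np.array([0.80, 0.70]),
    'split1_test_score': np.array([0.85, 0.75]),
    'split2_test_score': np.array([0.90, 0.65]),
}
n_splits = 3

# Each split holds one score per candidate; turn it into a column
columns = [cv_results[f'split{i}_test_score'].reshape(-1, 1) for i in range(n_splits)]
all_scores = np.hstack(columns)  # shape: (n_candidates, n_splits)

# One row per candidate, as in score_summary
for params, scores in zip(cv_results['params'], all_scores):
    print(params, scores.min(), scores.mean(), scores.max())
```

Each row of `all_scores` then holds the fold scores of one parameter combination, from which the min, mean, max and standard deviation are computed.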


### Grid Search CV

For the time being we use default hyperparameters, just to select the best algorithms.

In [177]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor, GradientBoostingRegressor, HistGradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, BayesianRidge, QuantileRegressor, \
    RANSACRegressor, Lars, OrthogonalMatchingPursuit, HuberRegressor, TheilSenRegressor
from sklearn.svm import SVR, NuSVR, LinearSVR
from sklearn.neighbors import KNeighborsRegressor, RadiusNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
import pandas as pd
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from sklearn.cross_decomposition import PLSRegression

# Initializing regression models with default parameters

# Linear Models
linear_models = {
    'LinearRegression': LinearRegression(),
    'Ridge': Ridge(),
    'Lasso': Lasso(max_iter=100000),  # Increased max_iter
    'ElasticNet': ElasticNet(max_iter=100000),  # Similarly increase max_iter
    'BayesianRidge': BayesianRidge(),
    'QuantileRegressor': QuantileRegressor(),
    'RANSACRegressor': RANSACRegressor(),
    'PLSRegression': PLSRegression(),
    'HuberRegressor': HuberRegressor(),
    'TheilSenRegressor': TheilSenRegressor()
}

# Tree-Based Models
tree_based_models = {
    'DecisionTreeRegressor': DecisionTreeRegressor(),
    'RandomForestRegressor': RandomForestRegressor(),
    'ExtraTreesRegressor': ExtraTreesRegressor(),
    'AdaBoostRegressor': AdaBoostRegressor(),
    'GradientBoostingRegressor': GradientBoostingRegressor(),
    'HistGradientBoostingRegressor': HistGradientBoostingRegressor()
}

# Gradient Boosting Frameworks
gradient_boosting_models = {
    'XGBRegressor': XGBRegressor(),
    'LGBMRegressor': LGBMRegressor(),
    'CatBoostRegressor': CatBoostRegressor()
}

# Support Vector Machines
svm_models = {
    'SVR': SVR(max_iter=100000),
    'NuSVR': NuSVR(max_iter=100000),
    'LinearSVR': LinearSVR(max_iter=100000)
}

# Nearest Neighbors
nearest_neighbors_models = {
    'KNeighborsRegressor': KNeighborsRegressor(),
    #'RadiusNeighborsRegressor': RadiusNeighborsRegressor()
}

# Bayesian Methods
bayesian_models = {
    'GaussianProcessRegressor': GaussianProcessRegressor()
}

# Combining all models into a single dictionary for quick search
models_quick_search = {
    **linear_models,
    **tree_based_models,
    **gradient_boosting_models,
    #**svm_models,
    **nearest_neighbors_models,
    **bayesian_models
}

# Define an empty hyperparameter dictionary as all models will use default parameters initially
params_quick_search = {model_name: {} for model_name in models_quick_search}
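
When the empty grids are later replaced with real ones, parameter names have to follow the nesting of the pipeline: the model sits in the step named `model`, inside a `TransformedTargetRegressor`, so its parameters are prefixed with `model__regressor__`. The sketch below checks this on a reduced pipeline; the grid values are illustrative assumptions, not tuned settings.

```python
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Reduced stand-in for the pipeline returned by create_pipeline
pipe = Pipeline([
    ('standard_scaler', StandardScaler()),
    ('model', TransformedTargetRegressor(regressor=Ridge())),
])

# Hypothetical grid: values are illustrative only
ridge_grid = {'model__regressor__alpha': [0.1, 1.0, 10.0]}

# Every grid key must be a valid parameter name of the pipeline
valid = set(pipe.get_params().keys())
print(all(key in valid for key in ridge_grid))
```

If the `feat_selection` step is present, `SelectFromModel` holds its own copy of the estimator, whose parameters are addressed as `feat_selection__estimator__...`.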


### Running Grid Search CV

In [178]:
initial_search = grid_cv_search_hp(models=models_quick_search, params=params_quick_search,
                                   target_transformer=target_transformation_pipeline)
initial_search.fit(X_train, y_train, cv=5, n_jobs=-1, scoring='r2', refit=False)


Optimizing hyperparameters for LinearRegression...

Fitting 5 folds for each of 1 candidates, totalling 5 fits

Optimizing hyperparameters for Ridge...

Fitting 5 folds for each of 1 candidates, totalling 5 fits

Optimizing hyperparameters for Lasso...

Fitting 5 folds for each of 1 candidates, totalling 5 fits

Optimizing hyperparameters for ElasticNet...

Fitting 5 folds for each of 1 candidates, totalling 5 fits

Optimizing hyperparameters for BayesianRidge...

Fitting 5 folds for each of 1 candidates, totalling 5 fits

Optimizing hyperparameters for QuantileRegressor...

Fitting 5 folds for each of 1 candidates, totalling 5 fits

Optimizing hyperparameters for RANSACRegressor...

Fitting 5 folds for each of 1 candidates, totalling 5 fits

Optimizing hyperparameters for PLSRegression...

Fitting 5 folds for each of 1 candidates, totalling 5 fits

Optimizing hyperparameters for HuberRegressor...

Fitting 5 folds for each of 1 candidates, totalling 5 fits

Optimizing hyperparameters 

In [179]:
import numpy as np

grid_search_summary, grid_search_pipelines = initial_search.score_summary()
grid_search_summary

| | estimator | min_score | mean_score | max_score | std_score |
|---|---|---|---|---|---|
| 4 | BayesianRidge | 0.684275 | 0.860079 | 0.915976 | 0.088297 |
| 18 | CatBoostRegressor | 0.727612 | 0.856161 | 0.914909 | 0.068414 |
| 7 | PLSRegression | 0.706146 | 0.853426 | 0.905449 | 0.074194 |
| 1 | Ridge | 0.644158 | 0.851585 | 0.918532 | 0.104117 |
| 17 | LGBMRegressor | 0.790567 | 0.851544 | 0.898712 | 0.040977 |
| 11 | RandomForestRegressor | 0.758169 | 0.84914 | 0.901743 | 0.057061 |
| 12 | ExtraTreesRegressor | 0.732364 | 0.846524 | 0.89955 | 0.063634 |
| 8 | HuberRegressor | 0.548699 | 0.845978 | 0.936177 | 0.14908 |
| 0 | LinearRegression | 0.598269 | 0.842141 | 0.91853 | 0.12244 |
| 14 | GradientBoostingRegressor | 0.70553 | 0.838658 | 0.89853 | 0.069859 |


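### Summary of Regressors by Category

The value next to each regressor is its minimum cross-validation R² score across the 5 folds (min_score).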

1. Linear Models
* BayesianRidge, 0.684275
* PLSRegression, 0.706146
* Ridge, 0.644158
* Lasso, -0.066369
* ElasticNet, -0.066369
* LinearRegression, 0.598269
* HuberRegressor, 0.548699
* TheilSenRegressor, 0.151906
* QuantileRegressor, -0.105349
* RANSACRegressor, -4915222501423974021332992.0

2. Tree-Based Models
* DecisionTreeRegressor, 0.539976
* RandomForestRegressor, 0.758169
* ExtraTreesRegressor, 0.732364
* AdaBoostRegressor, 0.726775
* GradientBoostingRegressor, 0.705530
* HistGradientBoostingRegressor, 0.712298

3. Gradient Boosting Frameworks
* XGBRegressor, 0.695467
* LGBMRegressor, 0.790567
* CatBoostRegressor, 0.727612


4. Nearest Neighbors
* KNeighborsRegressor, 0.734614

5. Bayesian Methods
* GaussianProcessRegressor, -10.582711

We can see that linear models rank at the top, alongside the gradient boosting regressors.

Let's test the top 2 regressors from each category.
They need further testing: a high score in this quick search does not guarantee that they will perform well on this problem.
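
Picking the top 2 per category can be sketched with pandas. The snippet below uses the three highest min_score entries of the first three categories, copied from the lists above. Ranking by min_score is an assumption made for this illustration, since the selection criterion is not stated.

```python
import pandas as pd

# min_score values copied from the category lists above (three highest per category)
scores = pd.DataFrame(
    [
        ('Linear', 'BayesianRidge', 0.684275),
        ('Linear', 'PLSRegression', 0.706146),
        ('Linear', 'Ridge', 0.644158),
        ('Tree-Based', 'RandomForestRegressor', 0.758169),
        ('Tree-Based', 'ExtraTreesRegressor', 0.732364),
        ('Tree-Based', 'AdaBoostRegressor', 0.726775),
        ('Gradient Boosting', 'XGBRegressor', 0.695467),
        ('Gradient Boosting', 'LGBMRegressor', 0.790567),
        ('Gradient Boosting', 'CatBoostRegressor', 0.727612),
    ],
    columns=['category', 'estimator', 'min_score'],
)

# Sort once, then keep the first two rows of every category
top2 = (
    scores.sort_values('min_score', ascending=False)
          .groupby('category', sort=False)
          .head(2)
)
print(top2)
```

Sorting by mean_score instead would only require swapping the column name, provided that column is included in the frame.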