# **Model and features selection**

## **Table of contents**

* [Our goals](#Our-goals)
* [Loading the data](#Loading-the-data)
* [Setting up the pipeline](#Setting-up-the-pipeline)
    * [Loading, joining the dataframes](#Loading,-joining-the-dataframes)
    * [Pipeline preprocessing blocks](#Pipeline-preprocessing-blocks)
* [Testing different pipeline setups](#Testing-different-pipeline-setups)
    * [First model: manual imputation and encoding](#First-model:-manual-imputation-and-encoding)
        * [Ordinal encoding](#Ordinal-encoding)
        * [Target encoding](#Target-encoding)
    * [Second model: no imputation for categorical variable](#Second-model:-no-imputation-for-categorical-variable)
* [Recursive feature selection](#Recursive-feature-selection)
* [Summary](#Summary)

In [1]:
import sys
import os
from pathlib import Path

project_root = str(Path(os.getcwd()).parent)
if project_root not in sys.path:
    sys.path.append(project_root)

from utilities import (
    features_creation,
    plot_utilities
)

from utilities.plot_utilities import Rstyle_spines
from utilities.features_creation import (
    compute_features_credit_card,
    compute_features_previous,
    compute_features_bureau,
    compute_features_instal,
    compute_features_pos_cash,
)



from typing import Callable
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import gc

from sklearn import set_config

set_config(transform_output="pandas")
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import (
    StandardScaler,
    TargetEncoder,
    OrdinalEncoder,
    PolynomialFeatures,
)

from sklearn.decomposition import PCA
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

from sklearn.feature_selection import RFECV

from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from xgboost import XGBClassifier

import joblib

from IPython.core.magic import register_cell_magic


@register_cell_magic
def skip(line, cell):
    return

## **Our goals**

<div style="background-color: #f8d7dA; padding: 10px; border-radius: 5px;">

In this notebook, we'll aim to:

* **Properly Set Up the Main Pipeline:** This will involve merging all our additional datasets and newly created features with the main dataframe.
* **Test Different Models:** We'll evaluate model performance using 3-fold cross-validation and select the model that performs best according to the ROC-AUC metric.
* **Apply Recursive Feature Selection:** We'll use this technique to reduce our set of features, which will make our model easier to deploy later on.
</div>

## **Loading the data**

<div style="background-color: #e8f4f8; padding: 10px; border-radius: 5px;">
From now on, we'll use the full training set stored in the train_data directory, rather than limiting ourselves to a sample. The training dataset was created in a previous notebook and represents 80% of the full dataset. Additionally, we'll use a stratified method to determine our 3 training folds for cross-validation.
</div>

In [2]:
train = pd.read_parquet("../app_data/application_train.parquet")
X_train = train.drop("TARGET", axis=1)
y_train = train["TARGET"]
skf = StratifiedKFold(n_splits=3, shuffle=True)

## **Setting up the pipeline**

<div style="background-color: #e8f4f8; padding: 10px; border-radius: 5px;">

In this section, we will:
* Create functions to:
    * Efficiently load data using pyarrow to optimize time and memory usage.
    * Handle the joining of all datasets and simplify the imputation of data from additional dataframes.
* Develop custom classes and functions to set up the preprocessing pipeline, including tasks such as imputation, encoding, feature engineering, and more.
</div>

### **Loading, joining the dataframes**

In [3]:
def load_additional_df(
    sk_id_curr: pd.Series, filename: str, dir_loc: str = "../add_data/"
) -> pd.DataFrame:
    tmp_ds = ds.dataset(dir_loc + filename)
    tmp_table = tmp_ds.to_table(filter=(ds.field("SK_ID_CURR").isin(sk_id_curr)))
    df = tmp_table.to_pandas()

    return df

In [4]:
def join_with_app(app_df: pd.DataFrame, additional_df: pd.DataFrame) -> pd.DataFrame:
    sk_id_curr = pd.DataFrame(index=app_df.index)
    pre_joined = sk_id_curr.join(additional_df, how="left").fillna(0)

    joined = app_df.join(pre_joined, how="left")
    return joined

In [5]:
def load_compute_and_join_with_app(
    app_df: pd.DataFrame,
    filename: str,
    compute_func: Callable[[pd.DataFrame], pd.DataFrame],
    dir_loc: str = "../add_data/",
) -> pd.DataFrame:
    sk_id_curr = app_df.index
    # Load previous application data set
    additional_df = load_additional_df(sk_id_curr, filename, dir_loc)
    # Compute the features
    additional_features = compute_func(additional_df)
    # Delete and gets memory back
    del additional_df
    gc.collect()
    # Join the dataset to the main one
    augmented_df = join_with_app(app_df, additional_features)
    # return additional_features
    return augmented_df

In [6]:
def get_full_dataframe(app_df: pd.DataFrame, dir: str = "../add_data/") -> pd.DataFrame:
    full_features = load_compute_and_join_with_app(
        app_df, "credit_card_balance.parquet", compute_features_credit_card
    )
    full_features = load_compute_and_join_with_app(
        full_features, "previous_application.parquet", compute_features_previous
    )

    full_features = load_compute_and_join_with_app(
        full_features, "bureau.parquet", compute_features_bureau
    )

    full_features = load_compute_and_join_with_app(
        full_features, "installments_payments.parquet", compute_features_instal
    )
    full_features = load_compute_and_join_with_app(
        full_features, "POS_CASH_balance.parquet", compute_features_pos_cash
    )
    return full_features.set_index("SK_ID_CURR")

In [7]:
class JoinDataFrame(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_ = X.copy()

        X_ = get_full_dataframe(X_)

        return X_

### **Pipeline preprocessing blocks**

In [8]:
housing_block_cols = [
    "APARTMENTS_AVG",
    "BASEMENTAREA_AVG",
    "YEARS_BEGINEXPLUATATION_AVG",
    "YEARS_BUILD_AVG",
    "COMMONAREA_AVG",
    "ELEVATORS_AVG",
    "ENTRANCES_AVG",
    "FLOORSMAX_AVG",
    "FLOORSMIN_AVG",
    "LANDAREA_AVG",
    "LIVINGAPARTMENTS_AVG",
    "LIVINGAREA_AVG",
    "NONLIVINGAPARTMENTS_AVG",
    "NONLIVINGAREA_AVG",
    "APARTMENTS_MODE",
    "BASEMENTAREA_MODE",
    "YEARS_BEGINEXPLUATATION_MODE",
    "YEARS_BUILD_MODE",
    "COMMONAREA_MODE",
    "ELEVATORS_MODE",
    "ENTRANCES_MODE",
    "FLOORSMAX_MODE",
    "FLOORSMIN_MODE",
    "LANDAREA_MODE",
    "LIVINGAPARTMENTS_MODE",
    "LIVINGAREA_MODE",
    "NONLIVINGAPARTMENTS_MODE",
    "NONLIVINGAREA_MODE",
    "APARTMENTS_MEDI",
    "BASEMENTAREA_MEDI",
    "YEARS_BEGINEXPLUATATION_MEDI",
    "YEARS_BUILD_MEDI",
    "COMMONAREA_MEDI",
    "ELEVATORS_MEDI",
    "ENTRANCES_MEDI",
    "FLOORSMAX_MEDI",
    "FLOORSMIN_MEDI",
    "LANDAREA_MEDI",
    "LIVINGAPARTMENTS_MEDI",
    "LIVINGAREA_MEDI",
    "NONLIVINGAPARTMENTS_MEDI",
    "NONLIVINGAREA_MEDI",
    "TOTALAREA_MODE",
]

low_entropy_features = [
    "FLAG_DOCUMENT_11",
    "FLAG_DOCUMENT_13",
    "FLAG_DOCUMENT_9",
    "FLAG_DOCUMENT_14",
    "FLAG_CONT_MOBILE",
    "FLAG_DOCUMENT_15",
    "FLAG_DOCUMENT_19",
    "FLAG_DOCUMENT_20",
    "FLAG_DOCUMENT_21",
    "FLAG_DOCUMENT_17",
    "FLAG_DOCUMENT_7",
    "FLAG_DOCUMENT_2",
    "FLAG_DOCUMENT_4",
    "FLAG_DOCUMENT_10",
    "FLAG_DOCUMENT_12",
    "FLAG_MOBIL",
]

cat_features = [
    "NAME_CONTRACT_TYPE",
    "CODE_GENDER",
    "FLAG_OWN_CAR",
    "FLAG_OWN_REALTY",
    "NAME_TYPE_SUITE",
    "NAME_INCOME_TYPE",
    "NAME_EDUCATION_TYPE",
    "NAME_FAMILY_STATUS",
    "NAME_HOUSING_TYPE",
    "OCCUPATION_TYPE",
    "WEEKDAY_APPR_PROCESS_START",
    "ORGANIZATION_TYPE",
    "FONDKAPREMONT_MODE",
    "HOUSETYPE_MODE",
    "WALLSMATERIAL_MODE",
    "EMERGENCYSTATE_MODE",
]

numerical_features = [
    "CNT_CHILDREN",
    "AMT_INCOME_TOTAL",
    "AMT_CREDIT",
    "AMT_ANNUITY",
    "AMT_GOODS_PRICE",
    "REGION_POPULATION_RELATIVE",
    "DAYS_BIRTH",
    "DAYS_EMPLOYED",
    "DAYS_REGISTRATION",
    "DAYS_ID_PUBLISH",
    "OWN_CAR_AGE",
    "HOUR_APPR_PROCESS_START",
    "EXT_SOURCE_1",
    "EXT_SOURCE_2",
    "EXT_SOURCE_3",
    "DEF_30_CNT_SOCIAL_CIRCLE",
    "OBS_60_CNT_SOCIAL_CIRCLE",
    "DEF_60_CNT_SOCIAL_CIRCLE",
    "DAYS_LAST_PHONE_CHANGE",
    "AMT_REQ_CREDIT_BUREAU_HOUR",
    "AMT_REQ_CREDIT_BUREAU_DAY",
    "AMT_REQ_CREDIT_BUREAU_WEEK",
    "AMT_REQ_CREDIT_BUREAU_MON",
    "AMT_REQ_CREDIT_BUREAU_QRT",
    "AMT_REQ_CREDIT_BUREAU_YEAR",
]

columns_to_drop = low_entropy_features + ["CNT_FAM_MEMBERS", "OBS_30_CNT_SOCIAL_CIRCLE"]

ext_source = ["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3"]

In [9]:
class RemoveAnomalieRescaleAndDrop(BaseEstimator, TransformerMixin):
    def __init__(self, columns_to_drop):
        self.columns_to_drop = columns_to_drop

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_ = X.copy()
        X_.loc[X_["DAYS_EMPLOYED"] == 365243, "DAYS_EMPLOYED"] = np.nan
        X_["DAYS_EMPLOYED"] = -X_["DAYS_EMPLOYED"] / 365
        X_["DAYS_BIRTH"] = -X_["DAYS_BIRTH"] / 365
        X_["DAYS_REGISTRATION"] = -X_["DAYS_REGISTRATION"] / 365
        X_["DAYS_ID_PUBLISH"] = -X_["DAYS_ID_PUBLISH"] / 365
        X_.drop(self.columns_to_drop, axis=1, inplace=True)
        return X_

In [10]:
simple_imputer = SimpleImputer(strategy="median")

imputer = ColumnTransformer(
    transformers=[
        ("simple_imputer", simple_imputer, numerical_features + housing_block_cols),
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
    n_jobs=-1,
)

In [11]:
scaler = ColumnTransformer(
    transformers=[
        ("standard_scaler", StandardScaler(), housing_block_cols),
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
    n_jobs=-1,
)

In [12]:
encoder = ColumnTransformer(
    transformers=[
        ("encoder", TargetEncoder(target_type="binary"), cat_features),
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
    n_jobs=-1,
)

In [13]:
class PCASelection(BaseEstimator, TransformerMixin):
    def __init__(self, features_to_pca, n_components=0.9):
        self.features_to_pca = features_to_pca
        self.n_components = n_components
        self.pca = PCA(n_components=n_components)

    def fit(self, X, y=None):
        self.pca.fit(X[self.features_to_pca])
        return self

    def transform(self, X):
        X_ = X.copy()
        pca_result = self.pca.transform(X_[self.features_to_pca])
        X_ = X_.drop(columns=self.features_to_pca)
        for i in range(pca_result.shape[1]):
            X_[f"PCA_HOUSING_{i+1}"] = pca_result[f"pca{i}"]

        return X_


housing_reduction = ColumnTransformer(
    transformers=[("pca", PCASelection(housing_block_cols), housing_block_cols)],
    remainder="passthrough",
    verbose_feature_names_out=False,
)

In [14]:
class FeatureEngineering(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_ = X.copy()
        X_["INCOME_CHILDREN_RATIO"] = X_["AMT_INCOME_TOTAL"] / (
            X_["CNT_CHILDREN"] + 0.0001
        )
        X_["CREDIT_INCOME_RATIO"] = X_["AMT_CREDIT"] / (X_["AMT_INCOME_TOTAL"] + 0.0001)
        X_["CREDIT_ANNUITY_RATIO"] = X_["AMT_CREDIT"] / (X_["AMT_ANNUITY"] + 0.0001)
        X_["ANNUITY_INCOME_RATIO"] = X_["AMT_ANNUITY"] / (
            X_["AMT_INCOME_TOTAL"] + 0.0001
        )
        X_["INCOME_ANNUITY_DIFF"] = X_["AMT_INCOME_TOTAL"] - X_["AMT_ANNUITY"]
        X_["CREDIT_GOODS_RATIO"] = X_["AMT_CREDIT"] / (X_["AMT_GOODS_PRICE"] + 0.0001)
        X_["CREDIT_GOODS_DIFF"] = X_["AMT_CREDIT"] - X_["AMT_GOODS_PRICE"] + 0.0001
        X_["GOODS_INCOME_RATIO"] = X_["AMT_GOODS_PRICE"] / (
            X_["AMT_INCOME_TOTAL"] + 0.0001
        )
        X_["AVG_EXT_SOURCE"] = (
            X_["EXT_SOURCE_1"] + X_["EXT_SOURCE_2"] + X_["EXT_SOURCE_3"]
        ) / 3
        X_["HARM_AVG_EXT_SOURCE"] = (
            X_["EXT_SOURCE_1"] * X_["EXT_SOURCE_2"] * X_["EXT_SOURCE_3"]
        ) / (X_["EXT_SOURCE_1"] + X_["EXT_SOURCE_2"] + X_["EXT_SOURCE_3"] + 0.001)
        X_["AVG_60_OBS_DEF"] = (
            X_["OBS_60_CNT_SOCIAL_CIRCLE"] + X_["DEF_60_CNT_SOCIAL_CIRCLE"]
        ) / 2
        X_["RATION_EMPLOYED_AGE"] = X_["DAYS_BIRTH"] / X_["DAYS_EMPLOYED"]

        # Combinining several datasets
        X_["RATIO_TOTAL_CREDIT_INCOME"] = (
            X_["AMT_CREDIT"]
            + X_["PREVIOUS_DIFF_CREDIT_DOWN_PAYMENT_SUM"]
            + X_["BUREAU_ACTIVE_AMT_CREDIT_SUM_SUM"]
        ) / (X_["AMT_INCOME_TOTAL"] + 0.0001)

        return X_

In [15]:
polynomial_transformer = PolynomialFeatures(degree=2, include_bias=False)

poly_transformer = ColumnTransformer(
    transformers=[
        (
            "polynomial_transformer",
            polynomial_transformer,
            [
                "EXT_SOURCE_1",
                "EXT_SOURCE_2",
                "EXT_SOURCE_3",
                "CREDIT_INCOME_RATIO",
                "ANNUITY_INCOME_RATIO",
                "CREDIT_ANNUITY_RATIO",
            ],
        )
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
)

## **Testing different pipeline setups**

### **First model: manual imputation and encoding**

<div style="background-color: #e8f4f8; padding: 10px; border-radius: 5px;">
In this model, we will handle the imputation and encoding processes manually, rather than relying on algorithms to do so.

* **Imputation:** We'll use simple imputation strategies for both numerical and categorical features. For numerical features, we'll use the median, and for categorical features, we'll impute with the most frequent category.
* **Encoding:** Since most categories lack a natural order, one might assume that one-hot encoding would be the best approach. However, I've found that one-hot encoding often performs poorly in practice. Therefore, we'll experiment with two different encoding methods: Ordinal Encoding and Target Encoding.
</div>

#### **Ordinal encoding**

In [16]:
simple_imputer = SimpleImputer(strategy="median")
imputer = ColumnTransformer(
    transformers=[
        ("simple_imputer", simple_imputer, numerical_features + housing_block_cols),
        ("cat_imputer", SimpleImputer(strategy="most_frequent"), cat_features),
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
    n_jobs=-1,
)

In [17]:
encoder = ColumnTransformer(
    transformers=[
        (
            "encoder",
            OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
            cat_features,
        ),
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
    n_jobs=-1,
)

In [18]:
preprocessor = Pipeline(
    [
        ("joiner", JoinDataFrame()),
        ("remover", RemoveAnomalieRescaleAndDrop(columns_to_drop)),
        ("imputer", imputer),
        ("scaler", scaler),
        ("housing_reduction", housing_reduction),
        ("feature_engineering", FeatureEngineering()),
        ("poly_transformer", poly_transformer),
        ("encoder", encoder),
    ]
)

In [19]:
classifiers = {
    "LightGBM": LGBMClassifier(class_weight="balanced", verbose=0),
    "CatBoost": CatBoostClassifier(
        class_weights=[1, len(y_train[y_train == 0]) / len(y_train[y_train == 1])],
        verbose=0,
    ),
    "Dummy": DummyClassifier(strategy="most_frequent"),
}

In [20]:
%%time
classifiers_scores = list(dict())

for key in classifiers.keys():
    pipeline = Pipeline([("preprocessor", preprocessor), (key, classifiers[key])])
    scores = cross_validate(
        pipeline,
        X_train,
        y_train,
        cv=skf,
        scoring="roc_auc",
        return_train_score=True,
        n_jobs=4,
    )
    classifiers_scores.append(
        {
            "classifier": key,
            "train_score": np.mean(scores["train_score"]),
            "test_score": np.mean(scores["test_score"]),
        }
    )

df_scores_first = pd.DataFrame.from_records(classifiers_scores, index="classifier")

CPU times: user 2.77 s, sys: 1.2 s, total: 3.97 s
Wall time: 12min 31s


In [21]:
df_scores_first

Unnamed: 0_level_0,train_score,test_score
classifier,Unnamed: 1_level_1,Unnamed: 2_level_1
LightGBM,0.823859,0.759089
CatBoost,0.916471,0.747255
Dummy,0.5,0.5


<div style="background-color: #e8f4f8; padding: 10px; border-radius: 5px;">
This time, we decided to drop XGBoost, as it was already performing worse than LightGBM and CatBoost. Our first observation is that our models, when augmented with data from additional datasets, are performing better. LightGBM's performance improved from 0.739356 to around 0.758 on the validation folds, while CatBoost improved from 0.737160 to around 0.747. Although this improvement may seem marginal, it could represent millions of dollars for a company like Home Credit.
</div>

#### **Target encoding**

In [22]:
simple_imputer = SimpleImputer(strategy="median")
imputer = ColumnTransformer(
    transformers=[
        ("simple_imputer", simple_imputer, numerical_features + housing_block_cols),
        ("cat_imputer", SimpleImputer(strategy="most_frequent"), cat_features),
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
    n_jobs=-1,
)

In [23]:
target_encoder = ColumnTransformer(
    transformers=[
        ("encoder", TargetEncoder(target_type="binary"), cat_features),
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
    n_jobs=-1,
)

In [24]:
preprocessor = Pipeline(
    [
        ("joiner", JoinDataFrame()),
        ("remover", RemoveAnomalieRescaleAndDrop(columns_to_drop)),
        ("imputer", imputer),
        ("scaler", scaler),
        ("housing_reduction", housing_reduction),
        ("feature_engineering", FeatureEngineering()),
        ("poly_transformer", poly_transformer),
        ("target_encoder", target_encoder),
    ]
)

In [25]:
classifiers = {
    "LightGBM": LGBMClassifier(class_weight="balanced", verbose=0),
    "CatBoost": CatBoostClassifier(
        class_weights=[1, len(y_train[y_train == 0]) / len(y_train[y_train == 1])],
        verbose=0,
    ),
}

In [26]:
classifiers_scores = list(dict())

for key in classifiers.keys():
    pipeline = Pipeline([("preprocessor", preprocessor), (key, classifiers[key])])
    scores = cross_validate(
        pipeline,
        X_train,
        y_train,
        cv=skf,
        scoring="roc_auc",
        return_train_score=True,
        n_jobs=4,
    )
    classifiers_scores.append(
        {
            "classifier": key,
            "train_score": np.mean(scores["train_score"]),
            "test_score": np.mean(scores["test_score"]),
        }
    )

df_scores_second = pd.DataFrame.from_records(classifiers_scores, index="classifier")



In [27]:
df_scores_second

Unnamed: 0_level_0,train_score,test_score
classifier,Unnamed: 1_level_1,Unnamed: 2_level_1
LightGBM,0.821973,0.758076
CatBoost,0.906729,0.747803


<div style="background-color: #e8f4f8; padding: 10px; border-radius: 5px;">
Target encoding seems to have improved the performance of both models. Once again, LightGBM is outperforming CatBoost, so we'll keep LightGBM and set aside CatBoost. Before proceeding with recursive feature selection, we'll try one last preprocessing option: allowing LightGBM to handle categorical variable imputation directly.
</div>

### **Second model: no imputation for categorical variable**

In [28]:
simple_imputer = SimpleImputer(strategy="median")
second_imputer = ColumnTransformer(
    transformers=[
        ("simple_imputer", simple_imputer, numerical_features + housing_block_cols),
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
    n_jobs=-1,
)

In [29]:
target_encoder = ColumnTransformer(
    transformers=[
        ("encoder", TargetEncoder(target_type="binary"), cat_features),
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
    n_jobs=-1,
)

In [30]:
second_pipeline = Pipeline(
    [
        ("joiner", JoinDataFrame()),
        ("remover", RemoveAnomalieRescaleAndDrop(columns_to_drop)),
        ("imputer", imputer),
        ("scaler", scaler),
        ("housing_reduction", housing_reduction),
        ("feature_engineering", FeatureEngineering()),
        ("poly_transformer", poly_transformer),
        ("target_encoder", target_encoder),
        ("lightgbm", LGBMClassifier(class_weight="balanced", verbose=0)),
    ]
)

In [31]:
scores = cross_validate(
    second_pipeline,
    X_train,
    y_train,
    cv=skf,
    scoring="roc_auc",
    return_train_score=True,
    n_jobs=4,
)



In [32]:
train_score = scores["train_score"].mean()
test_score = scores["test_score"].mean()
print(f"train_score = {train_score} \ntest_score = {test_score}")

train_score = 0.8218217017952489 
test_score = 0.7580602892518558


<div style="background-color: #e8f4f8; padding: 10px; border-radius: 5px;">
The performance is very slightly below what we achieved using our own imputation method for categorical variables. However, I’ll keep it this way, as I believe that combining this approach with hyperparameter tuning could lead to better overall performance.
</div>

## **Recursive feature selection**

<div style="background-color: #e8f4f8; padding: 10px; border-radius: 5px;">
Now we will proceed with recursive feature selection. By removing irrelevant features, we aim to improve model interpretability, make the deployed model more manageable, and likely increase its speed and efficiency.
</div>

In [33]:
lgbm = LGBMClassifier(class_weight="balanced", verbose=0)

In [34]:
rfecv = RFECV(estimator=lgbm, step=5, cv=skf, scoring="roc_auc", verbose=1)

In [35]:
feature_selection_pipeline = Pipeline(
    [
        ("joiner", JoinDataFrame()),
        ("remover", RemoveAnomalieRescaleAndDrop(columns_to_drop)),
        ("imputer", imputer),
        ("scaler", scaler),
        ("housing_reduction", housing_reduction),
        ("feature_engineering", FeatureEngineering()),
        ("poly_transformer", poly_transformer),
        ("target_encoder", target_encoder),
        ("rfcev", rfecv),
        ("lightgbm", lgbm),
    ]
)

In [36]:
%%skip
%%time
fsp_fitted = feature_selection_pipeline.fit(X_train, y_train)
joblib.dump(fsp_fitted, '../pkl/fsp_fitted.joblib')

In [40]:
fsp_fitted = joblib.load("../pkl/fsp_fitted.joblib")
rfecv_fitted = fsp_fitted.named_steps["rfcev"]
features_out = fsp_fitted.named_steps["target_encoder"].get_feature_names_out()
relevant_features = features_out[rfecv_fitted.support_]
relevant_features

array(['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR',
       'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS',
       'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE',
       'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE',
       'WALLSMATERIAL_MODE', 'EXT_SOURCE_1', 'EXT_SOURCE_2',
       'EXT_SOURCE_3', 'CREDIT_INCOME_RATIO', 'ANNUITY_INCOME_RATIO',
       'CREDIT_ANNUITY_RATIO', 'EXT_SOURCE_1 EXT_SOURCE_2',
       'EXT_SOURCE_1 EXT_SOURCE_3', 'EXT_SOURCE_1 CREDIT_INCOME_RATIO',
       'EXT_SOURCE_1 ANNUITY_INCOME_RATIO',
       'EXT_SOURCE_1 CREDIT_ANNUITY_RATIO', 'EXT_SOURCE_2 EXT_SOURCE_3',
       'EXT_SOURCE_2 CREDIT_INCOME_RATIO',
       'EXT_SOURCE_2 ANNUITY_INCOME_RATIO',
       'EXT_SOURCE_2 CREDIT_ANNUITY_RATIO',
       'EXT_SOURCE_3 CREDIT_INCOME_RATIO',
       'EXT_SOURCE_3 ANNUITY_INCOME_RATIO',
       'EXT_SOURCE_3 CREDIT_ANNUITY_RATIO',
       'CREDIT_INCOME_RATIO ANNUITY_INCOME_RATIO',
       'CREDIT_INCOME_RATIO CREDIT_ANNUITY_RATIO', 'PCA_HOUSING_1',
  

<div style="background-color: #e8f4f8; padding: 10px; border-radius: 5px;">
We started with over 500 features, and after feature selection, we're down to 103—reducing the number of features by nearly fivefold! The next step will be to design new functions that compute only the necessary features. Following that, we'll train the model and tune its hyperparameters before deploying it.
</div>

## **Summary**

<div style="background-color: #f8d7dA; padding: 10px; border-radius: 5px;">
In this notebook, we have:

* **Created Custom Functions and Classes:** Developed tools to handle data merging, feature engineering, and preprocessing for our classifier.
* **Evaluated Different Boosted Models:** Assessed various models and chose LightGBM for our classification task. We found that incorporating additional datasets improved our model's performance.
* **Used Recursive Feature Selection:** Reduced the number of features significantly without compromising model performance. We now have only a fifth of the total features initially computed. Our next step is to modify the model pipeline to account for this reduction.
</div>