# Modelling

## Introduction

### Preliminary steps

In [1]:
import gc
import pandas as pd
import numpy as np
from hummingbird.ml import convert
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import SGDClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV
from bayes_opt import BayesianOptimization
from sklearn.metrics import roc_auc_score
import warnings
warnings.simplefilter(action='ignore')



In [2]:
train_data = pd.read_csv('../aggregated_data/train_data.csv')
test_data = pd.read_csv('../aggregated_data/test_data.csv')

In [3]:
test_ID = test_data.pop("SK_ID_CURR")

In [4]:
X = train_data.copy()
y = X.pop("TARGET")
train_ID = X.pop("SK_ID_CURR")

### Functions

In [6]:
def sgd_hyper_parameters(sgd_model):
    """
    Performs randomized hyperparameter tuning only for alpha parameters.

    Args:
        sgd_model ([sklearn.linear_model.SGDClassifier]): SGDClassifier

    Returns:
        RandomizedSearchCV.best_estimator_: Best parameters for the model
    """
    hyperparams = {"model__alpha": np.logspace(-4, 2)}
    rscv = RandomizedSearchCV(
        sgd_model,
        hyperparams,
        n_iter=15,
        scoring="roc_auc",
        cv=3,
        verbose=0,
        n_jobs=1,
        random_state=1,
    ).fit(X, y)

    return rscv.best_estimator_

In [7]:
def cross_validation(
    X: pd.DataFrame, y: pd.Series, model_name: str, *args, **kwargs) -> float:
    """
    Stratified three fold cross validation with 5 splits.
    Model_name arguments allows to perform validation on 3 models specified under Args section.

    Args:
        X ([pd.DataFrame]): training dataframe
        y ([pd.Series]): target variable
        model_name (str): 'sgd' - SGDclassifier, 'xgboost' - XGBClassifier, 'lgbm' - LGBMClassfier

    Returns:
        float: ROC AUC score
    """
    stratified_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=33)
    cv_preds = np.zeros(train_data.shape[0])

    for train_indices, cv_indices in stratified_cv.split(X, y):
        x_train = X.iloc[train_indices]
        y_train = y.iloc[train_indices]
        x_cv = X.iloc[cv_indices]
        y_cv = y.iloc[cv_indices]

        if model_name == "sgd":
            sgd_pipe.fit(x_train, y_train)
            cv_preds[cv_indices] = sgd_pipe.predict_proba(x_cv)[:, 1]

        if model_name == "xgboost":
            xgbc = XGBClassifier(**kwargs)
            xgbc.fit(
                x_train,
                y_train,
                eval_set=[(x_cv, y_cv)],
                eval_metric="auc",
                verbose=False,
                early_stopping_rounds=200,
            )
            cv_preds[cv_indices] = xgbc.predict_proba(
                x_cv, ntree_limit=xgbc.get_booster().best_ntree_limit
            )[:, 1]

        if model_name == "lgbm":
            lgbm_clf = LGBMClassifier(**kwargs)
            lgbm_clf.fit(
                x_train,
                y_train,
                eval_set=[(x_cv, y_cv)],
                eval_metric="auc",
                verbose=False,
                early_stopping_rounds=200,
            )
            cv_preds[cv_indices] = lgbm_clf.predict_proba(
                x_cv, num_iteration=lgbm_clf.best_iteration_
            )[:, 1]

    return round(roc_auc_score(y, cv_preds), 4)

In [8]:
def predictions_to_csv(predictions: np.array, model_name: str) -> None:
    """
    Function converts predictions of a certain estimator to a csv file.

    Args:
        predictions (np.array): prediction of the target variable preferably on the test data
        model_name (str): name of the estimator
    """
    predictions = pd.DataFrame(predictions).iloc[:, 1:]
    sub = pd.DataFrame()
    sub["SK_ID_CURR"] = test_ID
    sub["TARGET"] = predictions
    sub.to_csv(f"{model_name}.csv", index=False)

## SGD

Logistic Regression by default uses Gradient Descent and as such it would be better to use SGD Classifier on larger data sets. One another reason you might want to use SGD Classifier is, logistic regression, in its vanilla sklearn form, won’t work if you can’t hold the dataset in RAM but SGD will still work.

The alpha parameter has been found using sgd_hyper_parameters function.

**Note**:

These scalers have been tried: MinMaxScaler, RobustScaler. However, ```StandardScaler``` showed the best CV score.

These imputing options have been tried: IterativeImputer, Median method, Mean method. However, ```zero``` imputing has been giving the best CV score.



In [9]:
sgd_pipe = Pipeline(
    steps=[
        (
            "imputer",
            SimpleImputer(missing_values=np.nan, strategy="constant", fill_value=0),
        ),
        ("scaler", StandardScaler()),
        (
            "model",
            SGDClassifier(
                alpha=0.021209508879201904,
                loss="log",
                penalty="l2",
                class_weight="balanced",
                n_jobs=-1,
                random_state=0,
            ),
        ),
    ]
)


In [10]:
cross_validation(X, y, 'sgd')

0.7783

In [11]:
sgd_model = sgd_pipe.fit(X, y)

### Prediction speed

In [12]:
%%timeit
sgd_clf_predictions = sgd_model.predict_proba(test_data)

741 ms ± 24.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [13]:
# predictions_to_csv(sgd_clf_predictions, 'sgd_clf')

![SGD_KAGGLE](https://i.imgur.com/O1lm65H.png)

## XGB


### Baysian optimization XGB

For both XGBoost and LightGBM, we have too many hyperparameters to tune, and doing GridSearchCV or RandomizedSearchCV can be too expesive on such a big dataset for finding an optimal solution. That is why we will be using the Bayesian Optimization Technique to tune the hyperparameters, which works by looking at the results on previous hyperparameters while assigning new hyperparameters. It tries to model on the Cost Function which is dependent on all the hyperparameters.

In [None]:
def xgb_evaluation(
    max_depth,
    min_child_weight,
    gamma,
    subsample,
    colsample_bytree,
    colsample_bylevel,
    colsample_bynode,
    reg_alpha,
    reg_lambda):
    """
    Objective function for Bayesian Optimization of XGBoost's Hyperparamters. Takes the hyperparameters as input, and
    returns the Cross-Validation AUC as output.

    Inputs: Hyperparamters to be tuned.
        max_depth, min_child_weight, gamma, subsample, colsample_bytree, colsample_bylevel,
        colsample_bynode, reg_alpha, reg_lambda

    Returns:
        CV ROC-AUC Score
    """
    params = {
        "learning_rate": 0.01,
        "n_estimators": 10000,
        "tree_method": "gpu_hist",
        "gpu_id": 0,
        "max_depth": int(round(max_depth)),
        "min_child_weight": int(round(min_child_weight)),
        "subsample": subsample,
        "gamma": gamma,
        "colsample_bytree": colsample_bytree,
        "colsample_bylevel": colsample_bylevel,
        "colsample_bynode": colsample_bynode,
        "reg_alpha": reg_alpha,
        "reg_lambda": reg_lambda,
        "random_state": 1,
    }

    stratified_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
    cv_preds = np.zeros(train_data.shape[0])

    # iterating over each fold, training the model, and making Out of Fold Predictions
    for train_indices, cv_indices in stratified_cv.split(X, y):

        x_train = X.iloc[train_indices]
        y_train = y.iloc[train_indices]
        x_cv = X.iloc[cv_indices]
        y_cv = y.iloc[cv_indices]

        xgbc = XGBClassifier(**params)
        xgbc.fit(
            x_train,
            y_train,
            eval_set=[(x_cv, y_cv)],
            eval_metric="auc",
            verbose=False,
            early_stopping_rounds=200,
        )

        cv_preds[cv_indices] = xgbc.predict_proba(
            x_cv, ntree_limit=xgbc.get_booster().best_ntree_limit
        )[:, 1]
        gc.collect()

    return roc_auc_score(y, cv_preds)


bopt_xgb = BayesianOptimization(
    xgb_evaluation,
    {
        "max_depth": (5, 15),
        "min_child_weight": (5, 80),
        "gamma": (0.2, 1),
        "subsample": (0.5, 1),
        "colsample_bytree": (0.5, 1),
        "colsample_bylevel": (0.3, 1),
        "colsample_bynode": (0.3, 1),
        "reg_alpha": (0.001, 0.3),
        "reg_lambda": (0.001, 0.3),
    },
    random_state=1,
).maximize(n_iter=6, init_points=4)


target_values = []
for result in bopt_xgb.res:
    target_values.append(result["target"])
    if result["target"] == max(target_values):
        best_params = result["params"]

print("Best Hyperparameters for XGBoost are:\n")
print(best_params)


### XGB model

Hyperparameters below have been found using bayesian optimization.

In [14]:
xgb_params = {
    "learning_rate": 0.01,
    "n_estimators": 10000,
    "tree_method": "gpu_hist",
    "gpu_id": 0,
    "max_depth": 5,
    "min_child_weight": 80,
    "subsample": 0.9622896832878278,
    "gamma": 0.794005454765522,
    "colsample_bytree": 0.5741523601432443,
    "colsample_bylevel": 0.3272128085080071,
    "colsample_bynode": 0.9480907366417157,
    "reg_alpha": 0.24018946957919934,
    "reg_lambda": 0.23887141295582165,
    "random_state": 51412,
}


In [15]:
cross_validation(X, y, 'xgboost', xgb_params)

0.7848

In [16]:
xgb = XGBClassifier(**xgb_params)
xgb_model = xgb.fit(X, y)



### Prediction speed

In [17]:
%%timeit
xgb_model.predict_proba(test_data)

4.22 s ± 518 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [18]:
#predictions_to_csv(xgb_predictions, 'xgb')

![XGB_KAGGLE](https://i.imgur.com/51FY7P8.png)

## LGBM

### Baysian optimization LGBM

In [None]:
def lgbm_evaluation(
    num_leaves,
    max_depth,
    min_split_gain,
    min_child_weight,
    min_child_samples,
    subsample,
    colsample_bytree,
    reg_alpha,
    reg_lambda):
    """
    Objective function for Bayesian Optimization of LightGBM's Hyperparamters. Takes the hyperparameters as input, and
    returns the Cross-Validation AUC as output.

    Inputs: Hyperparamters to be tuned.
        num_leaves, max_depth, min_split_gain, min_child_weight,
        min_child_samples, subsample, colsample_bytree, reg_alpha, reg_lambda

    Returns:
        CV ROC-AUC Score
    """

    params = {
        "objective": "binary",
        "boosting_type": "gbdt",
        "learning_rate": 0.005,
        "n_estimators": 10000,
        "n_jobs": -1,
        "num_leaves": int(round(num_leaves)),
        "max_depth": int(round(max_depth)),
        "min_split_gain": min_split_gain,
        "min_child_weight": min_child_weight,
        "min_child_samples": int(round(min_child_samples)),
        "subsample": subsample,
        "subsample_freq": 1,
        "colsample_bytree": colsample_bytree,
        "reg_alpha": reg_alpha,
        "reg_lambda": reg_lambda,
        "verbosity": -1,
        "seed": 266,
    }
    stratified_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=33)

    cv_preds = np.zeros(train_data.shape[0])

    for train_indices, cv_indices in stratified_cv.split(X, y):

        x_tr = X.iloc[train_indices]
        y_tr = y.iloc[train_indices]
        x_cv = X.iloc[cv_indices]
        y_cv = y.iloc[cv_indices]

        lgbm_clf = LGBMClassifier(**params)
        lgbm_clf.fit(
            x_tr,
            y_tr,
            eval_set=[(x_cv, y_cv)],
            eval_metric="auc",
            verbose=False,
            early_stopping_rounds=200,
        )

        cv_preds[cv_indices] = lgbm_clf.predict_proba(
            x_cv, num_iteration=lgbm_clf.best_iteration_
        )[:, 1]

    return roc_auc_score(y, cv_preds)


bopt_lgbm = BayesianOptimization(
    lgbm_evaluation,
    {
        "num_leaves": (25, 50),
        "max_depth": (6, 11),
        "min_split_gain": (0, 0.1),
        "min_child_weight": (5, 80),
        "min_child_samples": (5, 80),
        "subsample": (0.5, 1),
        "colsample_bytree": (0.5, 1),
        "reg_alpha": (0.001, 0.3),
        "reg_lambda": (0.001, 0.3),
    },
    random_state=2,
).maximize(n_iter=6, init_points=4)


target_values = []
for result in bopt_lgbm.res:
    target_values.append(result["target"])
    if result["target"] == max(target_values):
        best_params = result["params"]

print("Best Hyperparameters obtained are:\n")
print(best_params)

### LGBM Model

Hyperparameters below have been found using bayesian optimization.

In [19]:
lgbm_params = {
    "objective": "binary",
    "boosting_type": "gbdt",
    "learning_rate": 0.005,
    "n_estimators": 10000,
    "n_jobs": -1,
    "num_leaves": 39,
    "max_depth": 9,
    "min_split_gain": 0.030820727751758883,
    "min_child_weight": 30.074868967458226,
    "min_child_samples": 31,
    "subsample": 0.7653763123038788,
    "subsample_freq": 1,
    "colsample_bytree": 0.6175714684701181,
    "reg_alpha": 0.15663020002553255,
    "reg_lambda": 0.22503178038757748,
    "verbosity": -1,
    "seed": 266,
}


In [22]:
cross_validation(X, y, 'lgbm', lgbm_params)

0.7901

In [23]:
lgbm = LGBMClassifier(**lgbm_params)
lgbm_model = lgbm.fit(X, y)

### Prediction speed

In [28]:
%%timeit
lgbm_predictions = lgbm_model.predict_proba(test_data)

21.3 s ± 384 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [29]:
# predictions_to_csv(lgbm_predictions, 'lgbm')

![LGBM_KAGGLE](https://i.imgur.com/W0Hcz7L.png)

## Model comparison

In [35]:
pd.DataFrame(
    [
        ["SGD", 0.7783, 0.7611, "0.74s"],
        ["XGB", 0.7848, 0.7859, "4.22s"],
        ["LGBM", 0.7901, 0.7871, "21.3s"],
    ],
    columns=["Model", "CV score", "Kaggle", "Speed"],
    index=list(range(1, 4)),
)

Unnamed: 0,Model,CV score,Kaggle,Speed
1,SGD,0.7783,0.7611,0.74s
2,XGB,0.7848,0.7859,4.22s
3,LGBM,0.7901,0.7871,21.3s


### Conclusions

1. Fastest model - SGDClassifier.
2. Most accurate model - LGBMClassifier. Both CV and Kaggle private score.
3. XGB and LGBM are better than a median score on Kaggle.




### Increase predictions speed


To increase speed of the predictions, I have converted all the models to pytorch back-end with hummingbird-ml library [Github](https://github.com/microsoft/hummingbird).


Following results have been achieved:
1. SGDClassifier - speed has increased by 6 times.
2. XGB - speed has increased by 3 times.
3. LGBM - spped has increased by 3 times.


However, since the converted model must run predictions on GPU rather than CPU, it makes it challenging to show on local machine. Thus, I have decided to leave this out of the project scope, however, I found it worth mentioning in case higher prediction speed is required.