This notebook contains the baseline LightGBM (LGBM) model and a tuned version of it. Hyperparameters are tuned with Optuna.

In [None]:
!pip install optuna

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting optuna
  Downloading optuna-3.0.3-py3-none-any.whl (348 kB)
Collecting cliff
  Downloading cliff-3.10.1-py3-none-any.whl (81 kB)
Collecting colorlog
  Downloading colorlog-6.7.0-py2.py3-none-any.whl (11 kB)
Collecting alembic>=1.5.0
  Downloading alembic-1.8.1-py3-none-any.whl (209 kB)
Collecting cmaes>=0.8.2
  Downloading cmaes-0.9.0-py3-none-any.whl (23 kB)
Collecting Mako
  Downloading Mako-1.2.4-py3-none-any.whl (78 kB)
Collecting cmd2>=1.0.0
  Downloading cmd2-2.4.2-py3-none-any.whl (147 kB)
Collecting pbr!=2.1.0,>=2.0.0
  Downloading pbr-5.11.0-py2.py3-none-any.whl (112

In [None]:
import pandas as pd
import numpy as np
import optuna
import sklearn 
import sklearn.datasets
from sklearn.model_selection import cross_val_score
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
df_train = pd.read_csv('/content/drive/MyDrive/PG32 CS3244/smoteNCTrain.csv')
df_test = pd.read_csv('/content/drive/MyDrive/PG32 CS3244/smoteNCTest.csv')

In [None]:
df_train.head()

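| Unnamed: 0 | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | DAYS_BIRTH | DAYS_EMPLOYED | FLAG_WORK_PHONE | OCCUPATION_TYPE | CNT_FAM_MEMBERS | STATUS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 135000.0 | 4 | 1 | 1 | 4 | -13566 | -1900 | 1 | 6 | 2 | 0 |
| 1 | 1 | 1 | 1 | 0 | 315000.0 | 4 | 3 | 3 | 4 | -10328 | -543 | 0 | 6 | 1 | 0 |
| 2 | 0 | 1 | 1 | 0 | 315000.0 | 0 | 1 | 2 | 4 | -18184 | -3021 | 0 | 8 | 1 | 0 |
| 3 | 0 | 0 | 1 | 1 | 180000.0 | 2 | 1 | 3 | 4 | -13467 | -3850 | 0 | 11 | 2 | 0 |
| 4 | 1 | 1 | 0 | 1 | 247500.0 | 4 | 1 | 1 | 4 | -13086 | -1931 | 0 | 8 | 3 | 0 |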


In [None]:
df_train.shape

(50164, 15)

In [None]:
df_test.shape

(21518, 15)

In [None]:
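X_train = df_train.iloc[:, 1:-1]  # features: skip the saved index column (first) and the STATUS label (last)
y_train = df_train.iloc[:, -1]    # labels: the STATUS column

X_test = df_test.iloc[:, 1:-1]    # same feature columns for the test set
y_test = df_test.iloc[:, -1]      # labels: the STATUS column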

In [None]:
from sklearn.metrics import f1_score, precision_score, recall_score

Here we tune the hyperparameters on the training set, using cross-validation to score each candidate configuration. Five-fold stratified cross-validation is set up with
```
StratifiedKFold(n_splits=5, shuffle=True, random_state = 1)
```
The accuracy is computed on each fold, and the ```objective``` function returns the mean accuracy across the five folds.

In [None]:
from sklearn.model_selection import StratifiedKFold

def objective(trial):
    # Search space: each trial.suggest_* call samples one candidate value.
    param_grid = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 600, step=10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.6, step=0.01),
        "num_leaves": trial.suggest_int("num_leaves", 400, 3000, step=20),
        "max_depth": trial.suggest_int("max_depth", 6, 12),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 20, 250, step=5),
        "lambda_l1": trial.suggest_int("lambda_l1", 0, 50, step=5),
        "lambda_l2": trial.suggest_int("lambda_l2", 0, 50, step=5),
        "min_gain_to_split": trial.suggest_float("min_gain_to_split", 0, 10, step=0.1),
        "max_bin": trial.suggest_int("max_bin", 50, 300, step=10),
        # Note: LightGBM applies bagging_fraction only when bagging_freq > 0,
        # and bagging_freq is not set in this search space.
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.2, 0.9, step=0.1),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.2, 0.8, step=0.1),
        "boosting_type": trial.suggest_categorical("boosting_type", ["goss", "dart", "gbdt"]),
    }
    model = LGBMClassifier(objective="binary", **param_grid, random_state=42)
    strat_k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

    # With no `scoring` argument, cross_val_score uses the classifier's default
    # scorer (accuracy). The mean over the five folds is what Optuna maximizes.
    score = cross_val_score(model, X_train, y_train, cv=strat_k_fold).mean()
    return score
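
The cross-validation step inside ```objective``` can be tried in isolation. The following is a minimal sketch, assuming a synthetic dataset from `make_classification` and a `LogisticRegression` stand-in instead of the notebook's data and LightGBM model; the names `X_toy`, `y_toy`, `fold_pos_rates` and `fold_scores` are illustrative only. It shows that stratified folds keep the class ratio roughly constant and that the returned score is the mean of five per-fold accuracies.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced toy data (assumption: NOT the notebook's dataset).
X_toy, y_toy = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Stratification keeps the share of positives nearly identical in every test fold.
fold_pos_rates = [y_toy[test_idx].mean() for _, test_idx in cv.split(X_toy, y_toy)]
print("positive rate per fold:", [round(float(r), 3) for r in fold_pos_rates])

# One accuracy value per fold; the mean is what an objective would return.
fold_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_toy, y_toy, cv=cv, scoring="accuracy"
)
mean_accuracy = fold_scores.mean()
print("per-fold accuracy:", fold_scores.round(3))
print("mean accuracy:", round(float(mean_accuracy), 3))
```

Passing `scoring="accuracy"` explicitly is equivalent to the default for classifiers; spelling it out makes it easier to swap in another metric later.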





The default sampler used in Optuna is TPESampler. It is based on Bayesian hyperparameter optimization, an efficient method for hyperparameter tuning. It starts off like a random sampler, but it records the hyperparameter values and the corresponding objective values from past trials. It then suggests the hyperparameter values for the next trial based on the sets that produced promising objective values in past trials. Since we are using accuracy in our cross-validation, we set the ```direction``` to "maximize".



In [None]:
study = optuna.create_study(direction = "maximize")

study.optimize(objective, n_trials = 100)
trial = study.best_trial
print("Best Score: ", trial.value)
print("Best Params: ")
for key, value in trial.params.items():
    print("  {}= {}, ".format(key, value))

[I 2022-11-19 07:25:36,182] A new study created in memory with name: no-name-823d0f4f-a040-4ec4-a04c-b5a2b5f7e527
[I 2022-11-19 07:25:37,230] Trial 0 finished with value: 0.831134155214427 and parameters: {'n_estimators': 100, 'learning_rate': 0.55, 'num_leaves': 880, 'max_depth': 11, 'min_data_in_leaf': 220, 'lambda_l1': 10, 'lambda_l2': 25, 'min_gain_to_split': 2.1, 'max_bin': 110, 'bagging_fraction': 0.5, 'feature_fraction': 0.2, 'boosting_type': 'gbdt'}. Best is trial 0 with value: 0.831134155214427.
[I 2022-11-19 07:25:40,571] Trial 1 finished with value: 0.8685910339579547 and parameters: {'n_estimators': 330, 'learning_rate': 0.5700000000000001, 'num_leaves': 1520, 'max_depth': 8, 'min_data_in_leaf': 195, 'lambda_l1': 0, 'lambda_l2': 20, 'min_gain_to_split': 7.800000000000001, 'max_bin': 140, 'bagging_fraction': 0.9, 'feature_fraction': 0.7, 'boosting_type': 'gbdt'}. Best is trial 1 with value: 0.8685910339579547.
[I 2022-11-19 07:25:4

Best Score:  0.9653535954953119
Best Params: 
  n_estimators= 570, 
  learning_rate= 0.56, 
  num_leaves= 480, 
  max_depth= 9, 
  min_data_in_leaf= 250, 
  lambda_l1= 0, 
  lambda_l2= 45, 
  min_gain_to_split= 0.1, 
  max_bin= 280, 
  bagging_fraction= 0.5, 
  feature_fraction= 0.5, 
  boosting_type= goss, 


##### This is the 'better model' proposed by optuna

In [None]:
#This is the 'better model' proposed by optuna
tuned_model = LGBMClassifier(objective="binary",
  boosting_type= 'goss', 
  n_estimators= 570, 
  learning_rate= 0.56, 
  num_leaves= 480, 
  max_depth= 9, 
  min_data_in_leaf= 250,  
  lambda_l1= 0, 
  lambda_l2= 45, 
  min_gain_to_split= 0.1, 
  max_bin=280,
  bagging_fraction= 0.5, 
  feature_fraction= 0.5, 
  random_state=42
  )

In [None]:
tuned_model.fit(X_train, y_train)

LGBMClassifier(bagging_fraction=0.5, boosting_type='goss', feature_fraction=0.5,
               lambda_l1=0, lambda_l2=45, learning_rate=0.56, max_bin=280,
               max_depth=9, min_data_in_leaf=250, min_gain_to_split=0.1,
               n_estimators=570, num_leaves=480, objective='binary',
               random_state=42)

In [None]:
y_predict = tuned_model.predict(X_test)
print(sum(y_predict == y_test))
print('Accuracy Score is {:.5}'.format(accuracy_score(y_test, y_predict)))
print('Recall Score is {:.5}'.format(recall_score(y_test, y_predict)))
print('Precision Score is {:.5}'.format(precision_score(y_test, y_predict)))
print('F1 Score is {:.5}'.format(f1_score(y_test, y_predict)))

17524
Accuracy Score is 0.81439
Recall Score is 0.66326
Precision Score is 0.95058
F1 Score is 0.78134
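
As a quick check on how these four scores relate, the sketch below computes them from confusion-matrix counts. The helper `classification_metrics` and the counts passed to it are hypothetical, chosen only for illustration; they are not taken from this dataset. The last lines check that the F1 score printed above is the harmonic mean of the printed precision and recall.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)   # of the predicted positives, how many are correct
    recall = tp / (tp + fn)      # of the actual positives, how many are found
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts (NOT from this notebook's test set).
toy = classification_metrics(tp=60, fp=5, fn=30, tn=105)
for name, value in toy.items():
    print(f"{name}: {value:.4f}")

# Consistency check on the scores printed above for the tuned model.
reported_precision, reported_recall = 0.95058, 0.66326
f1_from_reported = (
    2 * reported_precision * reported_recall / (reported_precision + reported_recall)
)
print(f"F1 recomputed from reported precision/recall: {f1_from_reported:.5f}")
```

High precision paired with much lower recall, as reported here, means the model rarely raises a false alarm but misses about a third of the positive cases.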


Comparing with default model:
Compared with the default model (evaluated in the next cell), the tuned model shows a 0.025 increase in accuracy, a 0.001 increase in recall, a 0.063 increase in precision and a 0.023 increase in F1 score.

In [None]:
# Baseline LightGBM with default hyperparameters
default_model = LGBMClassifier(random_state=42)
default_model.fit(X_train, y_train)

y_predict = default_model.predict(X_test)
print(sum(y_predict == y_test))  # number of correct predictions

print('Accuracy Score is {:.5}'.format(accuracy_score(y_test, y_predict)))
print('Recall Score is {:.5}'.format(recall_score(y_test, y_predict)))
print('Precision Score is {:.5}'.format(precision_score(y_test, y_predict)))
print('F1 Score is {:.5}'.format(f1_score(y_test, y_predict)))

16984
Accuracy Score is 0.78929
Recall Score is 0.66224
Precision Score is 0.88785
F1 Score is 0.75862
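
The gap between the tuned and baseline models can be recomputed directly from the two printed outputs. This is a small sketch that copies the reported scores into dictionaries (the names `tuned`, `baseline` and `deltas` are illustrative) and subtracts them metric by metric.

```python
# Scores copied from the printed outputs of the tuned and baseline models.
tuned = {"accuracy": 0.81439, "recall": 0.66326, "precision": 0.95058, "f1": 0.78134}
baseline = {"accuracy": 0.78929, "recall": 0.66224, "precision": 0.88785, "f1": 0.75862}

# Positive delta means the tuned model scored higher than the baseline.
deltas = {name: round(tuned[name] - baseline[name], 4) for name in tuned}
for name, delta in deltas.items():
    print(f"{name:>9}: {delta:+.4f}")
```

All four differences are positive, with precision showing the largest gain and recall essentially unchanged.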


To get a better understanding of the importance of hyperparameters, we show the hyperparameter importances here.

In [None]:
optuna.visualization.plot_param_importances(study)

In [None]:
optuna.visualization.plot_slice(study)

We can see that the hyperparameter with the largest effect on model performance is ```min_data_in_leaf```.