This file contains a baseline CatBoost model and a tuned version of the same model. We use Optuna for hyperparameter tuning.


In [None]:
!pip install optuna

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import pandas as pd
import numpy as np
import optuna
import sklearn 
import sklearn.datasets
from sklearn.model_selection import cross_val_score
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
df_train = pd.read_csv('/content/drive/MyDrive/PG32 CS3244/smoteNCTrain.csv')
df_test = pd.read_csv('/content/drive/MyDrive/PG32 CS3244/smoteNCTest.csv')

In [None]:
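X_train = df_train.iloc[:,1:-1] # features: skip the first column (saved row index) and the last column (label)
y_train = df_train.iloc[:,-1] # labels: the last column (STATUS)

X_test = df_test.iloc[:,1:-1] # same feature columns as X_train
y_test = df_test.iloc[:,-1] # labels: the last column (STATUS)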

In [None]:
df_train.head()

Unnamed: 0,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,DAYS_BIRTH,DAYS_EMPLOYED,FLAG_WORK_PHONE,OCCUPATION_TYPE,CNT_FAM_MEMBERS,STATUS
0,0,0,0,0,135000.0,4,1,1,4,-13566,-1900,1,6,2,0
1,1,1,1,0,315000.0,4,3,3,4,-10328,-543,0,6,1,0
2,0,1,1,0,315000.0,0,1,2,4,-18184,-3021,0,8,1,0
3,0,0,1,1,180000.0,2,1,3,4,-13467,-3850,0,11,2,0
4,1,1,0,1,247500.0,4,1,1,4,-13086,-1931,0,8,3,0
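
As a quick check on column positions, the sketch below maps the index list ```[0, 1, 2, 5, 6, 7, 8, 11, 12]```, which the model cells in this file pass as ```cat_features```, back to column names. It assumes the header printed by ```df_train.head()``` and the ```iloc[:,1:-1]``` slice used for ```X_train```; the variable names are ours and purely illustrative.

```python
# Header as printed by df_train.head(), including the saved index column.
columns = [
    "Unnamed: 0", "CODE_GENDER", "FLAG_OWN_CAR", "FLAG_OWN_REALTY",
    "CNT_CHILDREN", "AMT_INCOME_TOTAL", "NAME_INCOME_TYPE",
    "NAME_EDUCATION_TYPE", "NAME_FAMILY_STATUS", "NAME_HOUSING_TYPE",
    "DAYS_BIRTH", "DAYS_EMPLOYED", "FLAG_WORK_PHONE", "OCCUPATION_TYPE",
    "CNT_FAM_MEMBERS", "STATUS",
]

# Same slice as iloc[:, 1:-1]: drop the first and the last column.
feature_names = columns[1:-1]

# Positions passed as cat_features, relative to the sliced feature list.
cat_idx = [0, 1, 2, 5, 6, 7, 8, 11, 12]
cat_names = [feature_names[i] for i in cat_idx]

for i, name in zip(cat_idx, cat_names):
    print(i, name)
```

Under these assumptions the positions resolve to the flag, name, and occupation columns, while the count, amount, and day columns stay numeric.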


In [None]:
!pip install catboost

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from catboost import CatBoostClassifier

In [None]:
# Baseline: CatBoost with default hyperparameters.
# cat_features gives the positions of the categorical columns in X_train.
default_model = CatBoostClassifier(cat_features=[0, 1, 2, 5, 6, 7, 8, 11, 12], random_state=42)
default_model.fit(X_train, y_train)
y_predict = default_model.predict(X_test)
print(sum(y_predict == y_test))  # number of correct predictions on the test set
print('Accuracy Score is {:.5}'.format(accuracy_score(y_test, y_predict)))
print('Recall Score is {:.5}'.format(recall_score(y_test, y_predict)))
print('Precision Score is {:.5}'.format(precision_score(y_test, y_predict)))
print('F1 Score is {:.5}'.format(f1_score(y_test, y_predict)))

Learning rate set to 0.054828
0:	learn: 0.6632735	total: 92.6ms	remaining: 1m 32s
1:	learn: 0.6372803	total: 181ms	remaining: 1m 30s
2:	learn: 0.6062593	total: 276ms	remaining: 1m 31s
3:	learn: 0.5862970	total: 334ms	remaining: 1m 23s
4:	learn: 0.5663227	total: 432ms	remaining: 1m 25s
5:	learn: 0.5490944	total: 507ms	remaining: 1m 24s
6:	learn: 0.5288864	total: 581ms	remaining: 1m 22s
7:	learn: 0.5097762	total: 661ms	remaining: 1m 21s
8:	learn: 0.4893986	total: 756ms	remaining: 1m 23s
9:	learn: 0.4762938	total: 827ms	remaining: 1m 21s
10:	learn: 0.4632788	total: 889ms	remaining: 1m 19s
11:	learn: 0.4517460	total: 956ms	remaining: 1m 18s
12:	learn: 0.4416263	total: 1.05s	remaining: 1m 19s
13:	learn: 0.4338783	total: 1.12s	remaining: 1m 18s
14:	learn: 0.4257609	total: 1.19s	remaining: 1m 17s
15:	learn: 0.4186180	total: 1.28s	remaining: 1m 18s
16:	learn: 0.4093471	total: 1.36s	remaining: 1m 18s
17:	learn: 0.4008898	total: 1.44s	remaining: 1m 18s
18:	learn: 0.3934100	total: 1.53s	remaining

Here we tune hyperparameters on the training dataset, using cross-validation to decide which values to keep. Five-fold cross-validation is set up with
```
StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
```
We then compute the mean accuracy across the five folds, and the ```objective``` function returns this mean.
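
To make the cross-validation step concrete, here is a minimal pure-Python sketch of what a stratified five-fold split and a mean fold accuracy look like. It is an illustration only: the fold assignment is a simple shuffled round-robin rather than scikit-learn's exact procedure, the "model" is a hypothetical majority-class predictor standing in for CatBoost, and the labels are toy data.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, n_splits=5, seed=42):
    """Assign sample indices to folds so each fold keeps the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)  # deal indices out round-robin
    return folds


def mean_cv_accuracy(labels, n_splits=5, seed=42):
    """Hold out each fold once, score on it, and average the fold accuracies."""
    scores = []
    for held_out in stratified_folds(labels, n_splits, seed):
        held = set(held_out)
        train_labels = [y for i, y in enumerate(labels) if i not in held]
        # Stand-in "model": always predict the most common training label.
        prediction = Counter(train_labels).most_common(1)[0][0]
        correct = sum(labels[i] == prediction for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


toy_labels = [0] * 70 + [1] * 30  # toy data, not the notebook's dataset
folds = stratified_folds(toy_labels)
fold_positive_counts = [sum(toy_labels[i] for i in fold) for fold in folds]
cv_score = mean_cv_accuracy(toy_labels)
print(fold_positive_counts)
print(cv_score)
```

Each fold receives the same share of positive labels, which is the point of stratifying; the value returned by ```mean_cv_accuracy``` plays the role of the score returned by ```objective```.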

In [None]:
def objective(trial):
    # Search space: every parameter is drawn from a small discrete set.
    param_grid = {
        "learning_rate": trial.suggest_float("learning_rate", 0.04, 0.1, step=0.02),
        "iterations": trial.suggest_int("iterations", 200, 1000, step=200),
        "loss_function": trial.suggest_categorical("loss_function", ['Logloss', 'CrossEntropy']),
    }
    model = CatBoostClassifier(**param_grid, cat_features=[0, 1, 2, 5, 6, 7, 8, 11, 12], random_state=42, silent=True)
    strat_k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    # With no scoring argument, cross_val_score uses the classifier's own score method (accuracy).
    score = cross_val_score(model, X_train, y_train, cv=strat_k_fold).mean()
    return score
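
Because every ```suggest_*``` call in ```objective``` uses a step or a fixed list, the search space is a finite grid. The short sketch below enumerates it, assuming the bounds and steps shown in the cell; the variable names are ours.

```python
from itertools import product

# Values implied by suggest_float("learning_rate", 0.04, 0.1, step=0.02).
learning_rates = [round(0.04 + 0.02 * i, 2) for i in range(4)]
# Values implied by suggest_int("iterations", 200, 1000, step=200).
iterations = list(range(200, 1001, 200))
# Values passed to suggest_categorical("loss_function", ...).
loss_functions = ["Logloss", "CrossEntropy"]

grid = list(product(learning_rates, iterations, loss_functions))
print(learning_rates)
print(iterations)
print(len(grid))
```

The grid has 40 combinations, so a study of 20 trials can visit at most half of them. In the study log in this file, trials 0 and 3 differ only in ```loss_function``` and report the same value, which suggests the two loss functions may behave identically on these 0/1 labels and that the number of distinct configurations is effectively smaller.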

The default sampler in Optuna is TPESampler. It is based on Bayesian hyperparameter optimization, an efficient approach to hyperparameter tuning. It starts off like a random sampler, but it records the hyperparameter values and the corresponding objective values from past trials. It then suggests the hyperparameter values for the next trial based on the sets that produced promising objective values in past trials. Since we use accuracy in our cross-validation, we set ```direction``` to "maximize".
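
The idea described above can be sketched in a few lines of pure Python. This is a heavily simplified illustration of history-guided sampling on a discrete space, not Optuna's TPESampler implementation: the helper names, the startup count, the "good" fraction, and the toy objective are all assumptions made for the example, and the space covers only two of the three parameters.

```python
import random

# Subset of the search space used in this file (loss_function left out for brevity).
SPACE = {
    "learning_rate": [0.04, 0.06, 0.08, 0.1],
    "iterations": [200, 400, 600, 800, 1000],
}


def toy_objective(params):
    # Hypothetical stand-in for the cross-validated accuracy; largest at (0.1, 1000).
    return params["learning_rate"] * 5 + params["iterations"] / 2000


def suggest(history, rng, n_startup=5, good_fraction=0.25, n_candidates=8):
    """Propose the next parameter set from the (params, value) history."""
    if len(history) < n_startup:
        # Startup phase: behave like a random sampler.
        return {name: rng.choice(choices) for name, choices in SPACE.items()}

    # Split past trials into a "good" group (top scores) and a "bad" group.
    ranked = sorted(history, key=lambda t: t[1], reverse=True)
    n_good = max(1, int(good_fraction * len(ranked)))
    good, bad = ranked[:n_good], ranked[n_good:]

    params = {}
    for name, choices in SPACE.items():
        def density(trials, value):
            # Add-one smoothed frequency of `value` among `trials`.
            hits = sum(1 for p, _ in trials if p[name] == value)
            return (hits + 1) / (len(trials) + len(choices))

        # Draw candidates from the "good" density, keep the best good/bad ratio.
        weights = [density(good, c) for c in choices]
        candidates = rng.choices(choices, weights=weights, k=n_candidates)
        params[name] = max(candidates, key=lambda c: density(good, c) / density(bad, c))
    return params


def run_study(n_trials=20, seed=0):
    rng = random.Random(seed)
    history = []
    for _ in range(n_trials):
        params = suggest(history, rng)
        history.append((params, toy_objective(params)))
    return history


history = run_study()
best_params, best_value = max(history, key=lambda t: t[1])
print(best_params, best_value)
```

After the random startup trials, values that appear more often among high-scoring trials than among low-scoring ones become more likely to be suggested again, which is the behaviour the paragraph describes.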

In [None]:
study = optuna.create_study(direction = "maximize")
study.optimize(objective, n_trials = 20)
trial = study.best_trial
print("Best Score: ", trial.value)
print("Best Params: ")
for key, value in trial.params.items():
    print("  {}= {}, ".format(key, value))

[I 2022-11-19 11:23:56,258] A new study created in memory with name: no-name-441f8bce-c20c-4780-b391-0bd8d3aaf94f
[I 2022-11-19 11:26:51,397] Trial 0 finished with value: 0.9550275379127665 and parameters: {'learning_rate': 0.04, 'iterations': 400, 'loss_function': 'CrossEntropy'}. Best is trial 0 with value: 0.9550275379127665.
[I 2022-11-19 11:34:13,447] Trial 1 finished with value: 0.9734072119422175 and parameters: {'learning_rate': 0.08, 'iterations': 1000, 'loss_function': 'Logloss'}. Best is trial 1 with value: 0.9734072119422175.
[I 2022-11-19 11:38:33,700] Trial 2 finished with value: 0.9710947950710025 and parameters: {'learning_rate': 0.1, 'iterations': 600, 'loss_function': 'Logloss'}. Best is trial 1 with value: 0.9734072119422175.
[I 2022-11-19 11:41:20,924] Trial 3 finished with value: 0.9550275379127665 and parameters: {'learning_rate': 0.04, 'iterations': 400, 'loss_function': 'Logloss'}. Best is trial 1 with

Best Score:  0.9746431731426644
Best Params: 
  learning_rate= 0.1, 
  iterations= 1000, 
  loss_function= CrossEntropy, 


Comparing the tuned model with the default model, we see increases of 0.0026 in accuracy, 0.002 in recall, 0.004 in precision, and 0.002 in F1 score. The gains are small, which suggests that CatBoost already gives good results with its default parameters.
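
For reference, the four metrics compared above can all be derived from the counts of true and false positives and negatives. The sketch below works through them on a small set of toy labels (not the notebook's predictions), assuming the positive class is ```1```, which is the default for the scikit-learn metric functions used in this file.

```python
# Toy labels for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print('Accuracy Score is {:.5}'.format(accuracy))
print('Recall Score is {:.5}'.format(recall))
print('Precision Score is {:.5}'.format(precision))
print('F1 Score is {:.5}'.format(f1))
```

Accuracy counts every correct prediction, while recall and precision look only at the positive class, so the four numbers need not move together when a model changes.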

In [None]:
# Refit on the full training set with the best hyperparameters found by the study.
tuned_model = CatBoostClassifier(random_state=42, learning_rate=0.1, iterations=1000, loss_function='CrossEntropy', cat_features=[0, 1, 2, 5, 6, 7, 8, 11, 12])
tuned_model.fit(X_train, y_train)
y_predict = tuned_model.predict(X_test)
print(sum(y_predict == y_test))  # number of correct predictions on the test set
print('Accuracy Score is {:.5}'.format(accuracy_score(y_test, y_predict)))
print('Recall Score is {:.5}'.format(recall_score(y_test, y_predict)))
print('Precision Score is {:.5}'.format(precision_score(y_test, y_predict)))
print('F1 Score is {:.5}'.format(f1_score(y_test, y_predict)))

0:	learn: 0.6408399	total: 98.8ms	remaining: 1m 38s
1:	learn: 0.6005766	total: 192ms	remaining: 1m 35s
2:	learn: 0.5537398	total: 287ms	remaining: 1m 35s
3:	learn: 0.5184029	total: 359ms	remaining: 1m 29s
4:	learn: 0.4857226	total: 451ms	remaining: 1m 29s
5:	learn: 0.4620748	total: 521ms	remaining: 1m 26s
6:	learn: 0.4421763	total: 613ms	remaining: 1m 26s
7:	learn: 0.4272170	total: 681ms	remaining: 1m 24s
8:	learn: 0.4107920	total: 778ms	remaining: 1m 25s
9:	learn: 0.3950685	total: 847ms	remaining: 1m 23s
10:	learn: 0.3874137	total: 946ms	remaining: 1m 25s
11:	learn: 0.3766477	total: 1.04s	remaining: 1m 25s
12:	learn: 0.3662527	total: 1.13s	remaining: 1m 26s
13:	learn: 0.3595434	total: 1.2s	remaining: 1m 24s
14:	learn: 0.3544676	total: 1.3s	remaining: 1m 25s
15:	learn: 0.3490647	total: 1.4s	remaining: 1m 25s
16:	learn: 0.3405587	total: 1.51s	remaining: 1m 27s
17:	learn: 0.3360015	total: 1.63s	remaining: 1m 28s
18:	learn: 0.3319878	total: 1.73s	remaining: 1m 29s
19:	learn: 0.3269974	tot