
### Import Data

In [1]:
from __future__ import print_function
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
import warnings
warnings.filterwarnings("ignore")

In [5]:
df_train_new = pd.read_csv('/content/cs-training-processed-v1.csv')
df_train_new.head()

Unnamed: 0,seriousdlqin2yrs,age,numberofopencreditlinesandloans,numberrealestateloansorlines,numberofdependents,age_group,prop_3059,prop_6089,prop_90plus,category_pastdue,dependents_groups,rsll_groups,ocll_quantile_groups,sqrt_debtratio,sqrt_monthlyincome,sqrt_revolvingutilizationofunsecuredlines
0,1,45,13,6,2,3,1.0,0.0,0.0,1,2,3,3,0.896093,95.498691,0.875287
1,0,40,4,0,1,3,0.0,0.0,0.0,0,1,0,0,0.349108,50.990195,0.978341
2,0,38,2,0,0,2,0.5,0.0,0.5,1,0,0,0,0.291742,55.154329,0.811283
3,0,30,5,0,0,2,0.0,0.0,0.0,0,0,0,1,0.189868,57.445626,0.483539
4,0,49,7,1,0,3,1.0,0.0,0.0,1,0,1,1,0.157879,137.920267,0.952491


In [3]:
df_test = pd.read_csv('/content/cs-test-processed-v1.csv')
df_test.head()

Unnamed: 0,seriousdlqin2yrs,age,numberofopencreditlinesandloans,numberrealestateloansorlines,numberofdependents,age_group,prop_3059,prop_6089,prop_90plus,category_pastdue,dependents_groups,rsll_groups,ocll_quantile_groups,sqrt_debtratio,sqrt_monthlyincome,sqrt_revolvingutilizationofunsecuredlines
0,,43,4,0,0,3,0.0,0.0,0.0,0,0,0,0,0.421323,75.498344,0.94102
1,,57,15,4,2,4,0.0,0.0,0.0,0,2,3,3,0.726111,95.608577,0.680658
2,,59,12,1,2,4,0.0,0.0,0.0,0,2,1,3,0.829245,71.295161,0.208027
3,,38,7,2,0,2,1.0,0.0,0.0,1,0,2,1,0.962268,56.568542,0.529441
4,,27,4,0,1,1,0.0,0.0,0.0,0,1,0,0,0.141128,62.169124,1.0


In [11]:
df_train_new.isna().sum()

seriousdlqin2yrs                             0
age                                          0
numberofopencreditlinesandloans              0
numberrealestateloansorlines                 0
numberofdependents                           0
age_group                                    0
prop_3059                                    0
prop_6089                                    0
prop_90plus                                  0
category_pastdue                             0
dependents_groups                            0
rsll_groups                                  0
ocll_quantile_groups                         0
sqrt_debtratio                               0
sqrt_monthlyincome                           0
sqrt_revolvingutilizationofunsecuredlines    0
dtype: int64

### Basic Classifier

First, we try some basic classifiers without parameter tuning to learn about their baseline performance and to choose which models are suitable for further tuning.

In [4]:
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from sklearn.metrics import precision_score, recall_score, accuracy_score, roc_auc_score

In [6]:
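# Keep only the feature columns: drop the target and, if present, the CSV index column
X = df_train_new.drop(columns=['Unnamed: 0', 'seriousdlqin2yrs'], errors='ignore')
y = df_train_new['seriousdlqin2yrs']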

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3011)

We evaluate each model with four metrics: precision, recall, accuracy, and ROC-AUC.

- $Precision = \frac{TP}{TP+FP}$

Precision measures the proportion of correctly predicted positive instances out of all instances predicted as positive.  It focuses on the quality of positive predictions.

- $Recall = \frac{TP}{TP+FN}$

Recall measures the proportion of correctly predicted positive instances out of all actual positive instances. It focuses on the ability to find all positive instances.

- $Accuracy = \frac{TP+TN}{TP+TN+FP+FN}$

Accuracy measures the overall correctness of the classifier's predictions. It considers both true positives and true negatives. However, it can be misleading when the classes are imbalanced.

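- $\text{ROC-AUC}$

ROC-AUC is the area under the ROC curve, which plots the true positive rate against the false positive rate. It measures the ability of the model to distinguish between positive and negative instances across different probability thresholds.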
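As a rough illustration of the four definitions above, the sketch below computes them by hand. The confusion-matrix counts and the toy scores are made up for the example, and `roc_auc` is a hypothetical pairwise helper written for clarity, not the scikit-learn implementation.

```python
# Hypothetical confusion-matrix counts for an imbalanced problem
# (100 actual positives, 900 actual negatives).
tp, fp, fn, tn = 20, 15, 80, 885
total = tp + fp + fn + tn

precision = tp / (tp + fp)   # quality of the positive predictions
recall = tp / (tp + fn)      # share of actual positives that were found
accuracy = (tp + tn) / total

# A trivial model that predicts "negative" for everyone is almost as accurate,
# which is why accuracy alone is misleading on imbalanced data.
baseline_accuracy = (tn + fp) / total


def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

print(f'precision: {precision:.4f}, recall: {recall:.4f}')
print(f'accuracy: {accuracy:.4f}, always-negative accuracy: {baseline_accuracy:.4f}')
print(f'toy roc_auc: {toy_auc}')
```

With these counts the model finds only a fifth of the positives, yet its accuracy is barely above that of the always-negative baseline, which is why recall and ROC-AUC are reported alongside accuracy.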

In [7]:
def evaluate_metrics(model, X_test, y_test):
    # Predict on the test set
    y_pred = model.predict(X_test)
    y_pred_prob = model.predict_proba(X_test)[:, 1]

    # Calculate evaluation metrics
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    accuracy = accuracy_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred_prob)

    print(f'precision: {precision}')
    print(f'recall: {recall}')
    print(f'accuracy: {accuracy}')
    print(f'roc_auc: {roc_auc}')

    # Return the evaluation metrics
    return precision, recall, accuracy, roc_auc

#### AdaBoosting

The first baseline model is the AdaBoosting classifier. The basic idea behind AdaBoost is to train a sequence of weak classifiers iteratively, where each subsequent classifier focuses on the instances that were misclassified by the previous classifiers. This iterative process helps to improve the overall predictive performance.

It is capable of handling complex datasets and improving performance compared to individual weak learners. However, it is sensitive to noisy data and outliers.
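The reweighting idea can be sketched with a single round of the classic discrete AdaBoost update. The labels and weak-learner predictions below are hypothetical toy values, and this is a simplified textbook form rather than the exact procedure inside `AdaBoostClassifier`.

```python
import math

# Toy labels in {-1, +1} and the predictions of one weak learner (hypothetical values).
y_true = [+1, +1, -1, -1, +1]
y_weak = [+1, -1, -1, -1, +1]   # the second sample is misclassified

n = len(y_true)
weights = [1.0 / n] * n          # start from uniform sample weights

# Weighted error rate of the weak learner.
err = sum(w for w, yt, yp in zip(weights, y_true, y_weak) if yt != yp)

# Say of this learner in the final weighted vote: larger when the error is smaller.
alpha = 0.5 * math.log((1 - err) / err)

# Increase the weight of misclassified samples, decrease the others, then renormalise.
weights = [w * math.exp(-alpha * yt * yp)
           for w, yt, yp in zip(weights, y_true, y_weak)]
weight_sum = sum(weights)
weights = [w / weight_sum for w in weights]

print(f'error: {err:.2f}, alpha: {alpha:.4f}')
print('updated weights:', [round(w, 3) for w in weights])
```

After the update, the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right. This also hints at why noisy labels and outliers can hurt: they keep accumulating weight.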

The performance of AdaBoosting in the original notebook is as follows.

- precision: 0.5245
- recall: 0.1977
- accuracy: 0.9347
- roc_auc: 0.8599

Then we fit the AdaBoosting model on data processed with the revised feature engineering methods. All four metrics are better than in the original notebook.

In [8]:
ada = AdaBoostClassifier(n_estimators=200, learning_rate=1.0)
ada.fit(X_train, y_train)

In [12]:
precision_ada, recall_ada, accuracy_ada, roc_auc_ada = evaluate_metrics(ada, X_test, y_test)

precision: 0.5551982851018221
recall: 0.2070343725019984
accuracy: 0.9360266666666667
roc_auc: 0.8664693766445213


In [18]:
precision_ada, recall_ada, accuracy_ada, roc_auc_ada = evaluate_metrics(ada, X_train, y_train)

precision: 0.5693693693693693
recall: 0.2099946836788942
accuracy: 0.9365416581480724
roc_auc: 0.8658117081622051


In [9]:
precision_ada, recall_ada, accuracy_ada, roc_auc_ada = evaluate_metrics(ada, X_test, y_test)

precision: 0.5575342465753425
recall: 0.2027902341803687
accuracy: 0.9359
roc_auc: 0.8647959199565712


#### GradientBoosting

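The second baseline model is the GradientBoosting classifier. Both methods build an additive ensemble of weak learners, but they differ in how each new learner is chosen: AdaBoosting reweights the training samples so that the next learner concentrates on earlier mistakes, whereas GradientBoosting fits each new tree to the negative gradient of the loss function (the pseudo-residuals), which amounts to gradient descent in function space.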
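A minimal sketch of this idea for the log loss, assuming a single feature and depth-1 regression stumps, is given below. The data, the `fit_stump` helper, and the learning rate are hypothetical, and the leaf values are plain residual means rather than the refined leaf updates used by `GradientBoostingClassifier`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(y, raw):
    """Mean binary log loss for raw (log-odds) scores."""
    return -sum(
        yi * math.log(sigmoid(r)) + (1 - yi) * math.log(1 - sigmoid(r))
        for yi, r in zip(y, raw)
    ) / len(y)


def fit_stump(x, residuals):
    """Least-squares regression stump: returns (threshold, left_value, right_value)."""
    best = None
    for t in sorted(set(x))[:-1]:          # skip the max so the right side is non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]


# Hypothetical toy data: one feature, binary target.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 1, 0, 1, 1, 1]
learning_rate = 0.5

# Initial model: constant log-odds of the positive rate.
prior = sum(y) / len(y)
raw = [math.log(prior / (1 - prior))] * len(y)
losses = [log_loss(y, raw)]

for _ in range(5):
    # Negative gradient of the log loss w.r.t. the raw score is (y - p).
    residuals = [yi - sigmoid(r) for yi, r in zip(y, raw)]
    t, lv, rv = fit_stump(x, residuals)
    # Take a small step in the direction predicted by the stump.
    raw = [r + learning_rate * (lv if xi <= t else rv) for xi, r in zip(x, raw)]
    losses.append(log_loss(y, raw))

print([round(l, 4) for l in losses])
```

Each round lowers the training log loss a little, which mirrors how `n_estimators` and `learning_rate` trade off in the real model.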

The performance of GradientBoosting in the original notebook is as follows.

- precision: 0.5486
- recall: 0.1832
- accuracy: 0.9357
- roc_auc: 0.8632


Then we fit the GradientBoosting model on data processed with the revised feature engineering methods. All four metrics are better than in the original notebook.

In [9]:
gb = GradientBoostingClassifier(loss='log_loss', learning_rate=0.1, n_estimators=200, subsample=1.0,
                                min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3,
                                init=None, random_state=None, max_features=None, verbose=0)
gb.fit(X_train, y_train)

In [14]:
precision_gb, recall_gb, accuracy_gb, roc_auc_gb = evaluate_metrics(gb, X_test, y_test)

precision: 0.580607476635514
recall: 0.19864108713029577
accuracy: 0.93696
roc_auc: 0.8708901899567266


In [20]:
precision_gb, recall_gb, accuracy_gb, roc_auc_gb = evaluate_metrics(gb, X_train, y_train)

precision: 0.6558704453441295
recall: 0.215311004784689
accuracy: 0.9399639107903182
roc_auc: 0.8710462663004621


In [11]:
precision_gb, recall_gb, accuracy_gb, roc_auc_gb = evaluate_metrics(gb, X_test, y_test)

precision: 0.5959595959595959
recall: 0.17638266068759342
accuracy: 0.9369
roc_auc: 0.8683089432049094


### Hyperparameter Optimization using Grid Search & Randomized Search

This step applies **Grid Search** and **Randomized Cross Validation (CV) Search** to the baseline models. Grid Search evaluates every parameter combination, while Randomized Search evaluates a random subset of them; each candidate is scored with the CV technique. Finally, the parameter combination with the best CV performance is chosen.

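After parameter tuning, the test ROC-AUC of both baseline models increases slightly (AdaBoosting: 0.8665 to 0.8668; GradientBoosting: 0.8709 to 0.8713), while the changes in the other metrics are mixed.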
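To make the cost difference between the two search strategies concrete, the sketch below counts candidates for the GradientBoosting search space used later in this notebook. It only enumerates parameter dictionaries with the standard library; the seed is an arbitrary assumption, and no model is fitted.

```python
import itertools
import random

# Search space copied from the GradientBoosting tuning cell of this notebook.
gb_grid = {
    'n_estimators': [100, 150, 200],
    'max_depth': [5, 10, 15, 20, 30],
    'min_samples_leaf': [1, 2, 3, 4, 5],
    'learning_rate': [0.1, 0.5, 1],
    'max_features': ['sqrt', 'log2'],
}
cv = 5

# Grid search: every combination of every value.
names = list(gb_grid)
all_candidates = [dict(zip(names, values))
                  for values in itertools.product(*gb_grid.values())]
grid_fits = len(all_candidates) * cv

# Randomized search: a fixed budget of combinations drawn without replacement.
n_iter = 10
rng = random.Random(0)   # arbitrary seed, for a repeatable illustration only
random_candidates = rng.sample(all_candidates, n_iter)
random_fits = n_iter * cv

print(f'grid search:       {len(all_candidates)} candidates, {grid_fits} fits')
print(f'randomized search: {n_iter} candidates, {random_fits} fits')
```

A full grid over this space would need 2,250 fits, while the randomized search with `n_iter=10` needs 50, at the price of possibly missing the best combination.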

In [39]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

#### AdaBoosting

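Grid search with n_estimators fixed at 200 selects learning_rate=1 (CV ROC-AUC 0.8596). The randomized search over n_estimators and learning_rate selects n_estimators=150, learning_rate=1 with a nearly identical score (0.8595), and the refitted model below uses these values.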

In [None]:
adaHyperParams = {'n_estimators': [200],
                  'learning_rate': [0.5, 1, 1.5]}

gridSearchAda = GridSearchCV(estimator=ada, param_grid=adaHyperParams,
                                           scoring='roc_auc', cv=5, verbose=2)
gridSearchAda.fit(X_train, y_train)

Fitting 5 folds for each of 3 candidates, totalling 15 fits
[CV] END ................learning_rate=0.5, n_estimators=200; total time=  18.1s
[CV] END ................learning_rate=0.5, n_estimators=200; total time=  17.2s
[CV] END ................learning_rate=0.5, n_estimators=200; total time=  17.0s
[CV] END ................learning_rate=0.5, n_estimators=200; total time=  17.0s
[CV] END ................learning_rate=0.5, n_estimators=200; total time=  18.1s
[CV] END ..................learning_rate=1, n_estimators=200; total time=  17.0s
[CV] END ..................learning_rate=1, n_estimators=200; total time=  17.5s
[CV] END ..................learning_rate=1, n_estimators=200; total time=  18.4s
[CV] END ..................learning_rate=1, n_estimators=200; total time=  16.9s
[CV] END ..................learning_rate=1, n_estimators=200; total time=  17.3s
[CV] END ................learning_rate=1.5, n_estimators=200; total time=  17.9s
[CV] END ................learning_rate=1.5, n_est

In [None]:
gridSearchAda.best_params_, gridSearchAda.best_score_

({'learning_rate': 1, 'n_estimators': 200}, 0.8595553227888884)

In [None]:
adaHyperParams_r = {'n_estimators': [150, 200, 250],
                  'learning_rate': [0.5, 1, 1.5]}
gridSearchAda_r = RandomizedSearchCV(estimator=ada, param_distributions=adaHyperParams_r,
                                   n_iter=5, scoring='roc_auc', cv=5, verbose=2).fit(X_train, y_train)

Fitting 5 folds for each of 5 candidates, totalling 25 fits
[CV] END ..................learning_rate=1, n_estimators=150; total time=  14.4s
[CV] END ..................learning_rate=1, n_estimators=150; total time=  20.7s
[CV] END ..................learning_rate=1, n_estimators=150; total time=  21.3s
[CV] END ..................learning_rate=1, n_estimators=150; total time=  14.6s
[CV] END ..................learning_rate=1, n_estimators=150; total time=  13.9s
[CV] END ................learning_rate=0.5, n_estimators=200; total time=  17.7s
[CV] END ................learning_rate=0.5, n_estimators=200; total time=  18.6s
[CV] END ................learning_rate=0.5, n_estimators=200; total time=  17.2s
[CV] END ................learning_rate=0.5, n_estimators=200; total time=  17.7s
[CV] END ................learning_rate=0.5, n_estimators=200; total time=  18.9s
[CV] END ................learning_rate=0.5, n_estimators=150; total time=  13.3s
[CV] END ................learning_rate=0.5, n_est

In [None]:
print(gridSearchAda_r.best_params_)
print(f'ROC_AUC score: {gridSearchAda_r.best_score_}')

{'n_estimators': 150, 'learning_rate': 1}
ROC_AUC score: 0.8594788155477782


In [23]:
ada = AdaBoostClassifier(n_estimators=150, learning_rate=1)
ada.fit(X_train, y_train)

In [11]:
precision_ada, recall_ada, accuracy_ada, roc_auc_ada = evaluate_metrics(ada, X_test, y_test)

precision: 0.5573212258796821
recall: 0.20624300559552357
accuracy: 0.9360733333333333
roc_auc: 0.8668361326939362


#### GradientBoosting

The best parameters of GradientBoosting are n_estimators=150, learning_rate=0.1, min_samples_leaf=5, max_features='log2', max_depth=5.

In [None]:
gbHyperParams = {'n_estimators': [100, 150, 200],
                 'max_depth': [5, 10, 15, 20, 30],
                 'min_samples_leaf': [1, 2, 3, 4, 5],
                 'learning_rate': [0.1, 0.5, 1],
                 'max_features':['sqrt', 'log2']}

gridSearchGB = RandomizedSearchCV(estimator=gb, param_distributions=gbHyperParams,
                                  n_iter=10, scoring='roc_auc', cv=5, verbose=2).fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] END learning_rate=0.1, max_depth=5, max_features=log2, min_samples_leaf=2, n_estimators=150; total time=  11.4s
[CV] END learning_rate=0.1, max_depth=5, max_features=log2, min_samples_leaf=2, n_estimators=150; total time=  11.6s
[CV] END learning_rate=0.1, max_depth=5, max_features=log2, min_samples_leaf=2, n_estimators=150; total time=  10.5s
[CV] END learning_rate=0.1, max_depth=5, max_features=log2, min_samples_leaf=2, n_estimators=150; total time=  10.8s
[CV] END learning_rate=0.1, max_depth=5, max_features=log2, min_samples_leaf=2, n_estimators=150; total time=  11.7s
[CV] END learning_rate=0.5, max_depth=20, max_features=sqrt, min_samples_leaf=2, n_estimators=150; total time= 1.7min
[CV] END learning_rate=0.5, max_depth=20, max_features=sqrt, min_samples_leaf=2, n_estimators=150; total time= 1.5min
[CV] END learning_rate=0.5, max_depth=20, max_features=sqrt, min_samples_leaf=2, n_estimators=150; total time= 1.5min


In [None]:
print(gridSearchGB.best_params_)
print(f'ROC_AUC score: {gridSearchGB.best_score_}')

{'n_estimators': 150, 'min_samples_leaf': 5, 'max_features': 'log2', 'max_depth': 5, 'learning_rate': 0.1}
ROC_AUC score: 0.863498016794215


In [15]:
gb = GradientBoostingClassifier(loss='log_loss', learning_rate=0.1, n_estimators=150,
                                min_samples_leaf=5, max_depth=5, max_features='log2', verbose=0)
gb.fit(X_train, y_train)

In [12]:
precision_gb, recall_gb, accuracy_gb, roc_auc_gb = evaluate_metrics(gb, X_test, y_test)

precision: 0.5573212258796821
recall: 0.19884652278177458
accuracy: 0.9467466666666666
roc_auc: 0.8713442127034415


### UpSampling & DownSampling

As discovered in the EDA part, this dataset is imbalanced with few positive samples. Since our goal is to identify customers who are likely to become seriously delinquent (`seriousdlqin2yrs = 1`), it is important for us to pay more attention to these positive samples.

To address this problem, in this step we use UpSampling and DownSampling techniques to adjust the imbalanced dataset.

In the training split, the ratio of Positive:Negative is about 1:14 (7,524 to 104,975). We adjust it to 1:2 by downsampling the negatives to 100,000 and upsampling the positives to 50,000. The results show that ROC-AUC is essentially unchanged and accuracy drops from about 0.94 to about 0.89, while recall rises from about 0.2 to about 0.6, which makes our model more capable of detecting potential defaulting customers. The price is lower precision (roughly 0.32 to 0.33, down from 0.56 to 0.58).
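The resampling arithmetic can be checked with a standard-library sketch. Row indices stand in for the actual rows, the class counts are the ones printed for the training split in this notebook, and the seed is an arbitrary assumption.

```python
import random

rng = random.Random(3011)   # arbitrary seed, for a repeatable illustration only

# Row indices stand in for the negative and positive training rows.
negatives = list(range(104975))
positives = list(range(7524))

# Downsample the majority class WITHOUT replacement: every kept row is unique.
down = rng.sample(negatives, 100000)

# Upsample the minority class WITH replacement: rows are repeated.
up = rng.choices(positives, k=50000)

ratio_before = len(negatives) / len(positives)
ratio_after = len(down) / len(up)
avg_copies = len(up) / len(positives)

print(f'negatives per positive before: {ratio_before:.1f}')
print(f'negatives per positive after:  {ratio_after:.1f}')
print(f'average copies of each positive row: {avg_copies:.1f}')
```

Each positive row appears several times after upsampling. When such data is later split into CV folds, copies of the same row can land in both the training and the validation fold, so CV scores computed on the resampled data may look more optimistic than scores on the untouched hold-out set.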

In [17]:
# Restrict to the training rows first, then split them by class
df_train_split = df_train_new.loc[X_train.index]
sample0 = df_train_split[df_train_split['seriousdlqin2yrs'] == 0]
sample1 = df_train_split[df_train_split['seriousdlqin2yrs'] == 1]
print(len(sample0))
print(len(sample1))

104975
7524


In [18]:
# Downsample majority class
df_majority_downsampled = resample(sample0,
                                   replace=False,
                                   n_samples=100000)
# Upsample minority class
df_minority_upsampled = resample(sample1,
                                 replace=True,
                                 n_samples=50000)
# Combine upsampled minority class with downsampled majority class
df_up_down_sampled = pd.concat([df_majority_downsampled, df_minority_upsampled])

In [19]:
# Select exactly the feature columns used for X_train
X_train_sampled = df_up_down_sampled[X_train.columns]
y_train_sampled = df_up_down_sampled['seriousdlqin2yrs']

In [20]:
ada_sampled = AdaBoostClassifier(n_estimators=200, learning_rate=1.0)
ada_sampled.fit(X_train_sampled, y_train_sampled)

In [21]:
precision_ada, recall_ada, accuracy_ada, roc_auc_ada = evaluate_metrics(ada_sampled, X_test, y_test)

precision: 0.3257364007740271
recall: 0.605515587529976
accuracy: 0.8900533333333334
roc_auc: 0.8666823041937899


In [22]:
gb_sampled = GradientBoostingClassifier(loss='log_loss', learning_rate=0.1, n_estimators=150,
                                min_samples_leaf=5, max_depth=5, max_features='log2', verbose=0)
gb_sampled.fit(X_train_sampled, y_train_sampled)

In [23]:
precision_gb, recall_gb, accuracy_gb, roc_auc_gb = evaluate_metrics(gb_sampled, X_test, y_test)

precision: 0.3199834847233691
recall: 0.6195043964828137
accuracy: 0.8867733333333333
roc_auc: 0.8702141093000221


In [40]:
adaHyperParams = {'n_estimators': [150, 200, 250],
                  'learning_rate': [0.5, 1, 1.5]}

gridSearchAda_sampled = GridSearchCV(estimator=ada, param_grid=adaHyperParams,
                                           scoring='roc_auc', cv=5, verbose=2)
gridSearchAda_sampled.fit(X_train_sampled, y_train_sampled)

Fitting 5 folds for each of 9 candidates, totalling 45 fits
[CV] END ................learning_rate=0.5, n_estimators=150; total time=  24.1s
[CV] END ................learning_rate=0.5, n_estimators=150; total time=  30.5s
[CV] END ................learning_rate=0.5, n_estimators=150; total time=  23.6s
[CV] END ................learning_rate=0.5, n_estimators=150; total time=  48.7s
[CV] END ................learning_rate=0.5, n_estimators=150; total time=  33.6s
[CV] END ................learning_rate=0.5, n_estimators=200; total time=  49.3s
[CV] END ................learning_rate=0.5, n_estimators=200; total time=  34.0s
[CV] END ................learning_rate=0.5, n_estimators=200; total time=  32.2s
[CV] END ................learning_rate=0.5, n_estimators=200; total time=  31.4s
[CV] END ................learning_rate=0.5, n_estimators=200; total time=  32.8s
[CV] END ................learning_rate=0.5, n_estimators=250; total time=  39.4s
[CV] END ................learning_rate=0.5, n_est

In [41]:
ada_sampled = gridSearchAda_sampled.best_estimator_.fit(X_train_sampled, y_train_sampled)
precision_ada, recall_ada, accuracy_ada, roc_auc_ada = evaluate_metrics(ada_sampled, X_test, y_test)

precision: 0.32388663967611336
recall: 0.6075139888089528
accuracy: 0.8892
roc_auc: 0.8643984007034045


In [46]:
gridSearchAda_sampled.best_params_

{'learning_rate': 1.5, 'n_estimators': 250}

In [42]:
gbHyperParams = {'n_estimators': [100, 150, 200],
                 'max_depth': [5, 10, 15, 20, 30],
                 'min_samples_leaf': [1, 2, 3, 4, 5],
                 'learning_rate': [0.1, 0.5, 1],
                 'max_features':['sqrt', 'log2']}

gridSearchGB_sampled = RandomizedSearchCV(estimator=gb, param_distributions=gbHyperParams,
                                  n_iter=10, scoring='roc_auc', cv=5, verbose=2).fit(X_train_sampled, y_train_sampled)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] END learning_rate=0.5, max_depth=15, max_features=log2, min_samples_leaf=5, n_estimators=100; total time= 1.2min
[CV] END learning_rate=0.5, max_depth=15, max_features=log2, min_samples_leaf=5, n_estimators=100; total time= 1.2min
[CV] END learning_rate=0.5, max_depth=15, max_features=log2, min_samples_leaf=5, n_estimators=100; total time= 1.2min
[CV] END learning_rate=0.5, max_depth=15, max_features=log2, min_samples_leaf=5, n_estimators=100; total time= 1.3min
[CV] END learning_rate=0.5, max_depth=15, max_features=log2, min_samples_leaf=5, n_estimators=100; total time= 1.2min
[CV] END learning_rate=1, max_depth=20, max_features=log2, min_samples_leaf=5, n_estimators=200; total time= 4.7min
[CV] END learning_rate=1, max_depth=20, max_features=log2, min_samples_leaf=5, n_estimators=200; total time= 4.7min
[CV] END learning_rate=1, max_depth=20, max_features=log2, min_samples_leaf=5, n_estimators=200; total time= 4.7min
[

In [43]:
gb_sampled = gridSearchGB_sampled.best_estimator_.fit(X_train_sampled, y_train_sampled)
precision_gb, recall_gb, accuracy_gb, roc_auc_gb = evaluate_metrics(gb_sampled, X_test, y_test)

precision: 0.441938178780284
recall: 0.2114308553157474
accuracy: 0.9295733333333334
roc_auc: 0.7966084986745161


In [45]:
gridSearchGB_sampled.best_params_

{'n_estimators': 150,
 'min_samples_leaf': 1,
 'max_features': 'sqrt',
 'max_depth': 30,
 'learning_rate': 1}

### Ensemble Model with Bagging

Our baseline models use the Boosting technique, which is helpful in reducing the bias but might increase variance. To address this problem, in this step we use the **Bagging** ensemble technique.

A Bagging classifier is an ensemble meta-estimator that fits each base classifier on a random subset (sampled with **BootStrap**) of the original dataset and then aggregates their individual predictions to form a final prediction. It can reduce the variance by introducing randomization into its construction procedure and then making an ensemble out of it.
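The procedure can be sketched in a few lines of plain Python. The base learner here is a hypothetical one-feature threshold rule (`fit_stump`), the data are made up, and the aggregation is a simple average of hard votes, so this illustrates the idea rather than the behaviour of `BaggingClassifier` in detail.

```python
import random


def fit_stump(x, y):
    """Pick the threshold t that minimises training errors for the rule: x > t -> 1."""
    best_t, best_err = None, None
    for t in sorted(set(x)):
        err = sum((xi > t) != bool(yi) for xi, yi in zip(x, y))
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t


def bagging_fit(x, y, n_estimators=10, seed=0):
    """Fit one stump per bootstrap sample (n rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(x)
    thresholds = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample of row indices
        thresholds.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return thresholds


def bagging_predict_proba(thresholds, x_new):
    """Share of base learners that vote for the positive class."""
    return sum(x_new > t for t in thresholds) / len(thresholds)


# Hypothetical toy data with one noisy label around the class boundary.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

thresholds = bagging_fit(x, y, n_estimators=10)
for value in (0, 5.5, 11):
    print(value, bagging_predict_proba(thresholds, value))
```

Because each stump sees a slightly different sample, the thresholds differ, and averaging their votes gives a smoother, less sample-dependent prediction near the class boundary.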

After bagging, ROC-AUC improves for three of the four models; for AdaBoosting trained on the original data it stays essentially unchanged (0.8662 vs. 0.8668).

In [24]:
from sklearn.ensemble import BaggingClassifier

ada_bag = BaggingClassifier(estimator=ada, n_estimators=10).fit(X_train, y_train)
pre_ada_bag, re_ada_bag, acc_ada_bag, auc_ada_bag = evaluate_metrics(ada_bag, X_test, y_test)

precision: 0.5599128540305011
recall: 0.20543565147881696
accuracy: 0.9362133333333333
roc_auc: 0.8661803913061332


In [25]:
gb_bag = BaggingClassifier(estimator=gb, n_estimators=10).fit(X_train, y_train)
pre_gb_bag, re_gb_bag, acc_gb_bag, auc_gb_bag = evaluate_metrics(gb_bag, X_test, y_test)

precision: 0.5875912408759124
recall: 0.19304556354916066
accuracy: 0.93712
roc_auc: 0.8722885341078528


In [13]:
ada_bag_sampled = BaggingClassifier(estimator=ada_sampled, n_estimators=10).fit(X_train_sampled, y_train_sampled)
pre_ada_bag, re_ada_bag, acc_ada_bag, auc_ada_bag = evaluate_metrics(ada_bag_sampled, X_test, y_test)

precision: 0.3274891774891775
recall: 0.6147162270183853
accuracy: 0.8907733333333333
roc_auc: 0.8690342541442017


In [27]:
gb_bag_sampled = BaggingClassifier(estimator=gb_sampled, n_estimators=10).fit(X_train_sampled, y_train_sampled)
pre_gb_bag, re_gb_bag, acc_gb_bag, auc_gb_bag = evaluate_metrics(gb_bag_sampled, X_test, y_test)

precision: 0.3242131248714256
recall: 0.6298960831334932
accuracy: 0.8877066666666666
roc_auc: 0.8721346998063015


In [29]:
!pip install xgboost
import xgboost as xgb



In [36]:
from xgboost import XGBClassifier
xgb_params = {
    'colsample_bytree': 0.8,
    'max_depth': 16,
    'min_child_weight': 1,
    'n_estimators': 289,
    'reg_alpha': 1.0,
    'reg_lambda': 10.0,
    'scale_pos_weight': 1,
    #'seed': 3011,
    'subsample': 1.0}
# min_child_samples and boosting_type are LightGBM parameters that XGBoost
# does not use, so they are omitted here
xgb_model = XGBClassifier(n_jobs=-1, **xgb_params)
#bagging = BaggingClassifier(estimator=xgb_model, n_estimators=10, random_state=3011)
#bagging_xgb_sampled = bagging.fit(X_train_sampled,y_train_sampled)
xgb_sampled = xgb_model.fit(X_train_sampled, y_train_sampled)

In [37]:
evaluate_metrics(xgb_sampled, X_test, y_test)

precision: 0.40307692307692305
recall: 0.31414868105515587
accuracy: 0.9232
roc_auc: 0.8200171219102208


(0.40307692307692305, 0.31414868105515587, 0.9232, 0.8200171219102208)

### Ensemble Model with Voting

In [None]:
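def voting(models, X_test, prob=False):
    """Combine the predictions of several fitted classifiers.

    prob=True:  soft voting, average the positive-class probabilities and threshold at 0.5.
    prob=False: hard voting, predict 1 when at least half of the models predict 1.
    """
    pred_result = np.zeros(len(X_test))
    for m in models:
        if prob:
            y_pred = m.predict_proba(X_test)[:, 1]
        else:
            y_pred = m.predict(X_test)
        pred_result += y_pred

    if prob:
        pred_result /= len(models)
        pred_result = (pred_result > 0.5).astype(int)
    else:
        # A tie (exactly half of the votes) counts as positive
        pred_result = (pred_result >= len(models) / 2).astype(int)

    return pred_result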
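The two voting modes can disagree on the same sample. The sketch below mirrors the decision rules of the function above for a single sample in plain Python; the `vote` helper and the probabilities are hypothetical, and it assumes each model's hard prediction is 1 exactly when its positive-class probability exceeds 0.5.

```python
def vote(probabilities, soft=False):
    """Combine the positive-class probabilities of several models for ONE sample."""
    if soft:
        # Soft voting: average the probabilities, then threshold at 0.5.
        return int(sum(probabilities) / len(probabilities) > 0.5)
    # Hard voting: count the models that predict 1; a tie counts as positive.
    hard_votes = sum(p > 0.5 for p in probabilities)
    return int(hard_votes >= len(probabilities) / 2)


# One confident model against two mildly negative ones.
confident_minority = [0.9, 0.4, 0.4]
print(vote(confident_minority, soft=False), vote(confident_minority, soft=True))

# Four models split two against two: the hard vote ties, the average is low.
split_vote = [0.6, 0.6, 0.1, 0.1]
print(vote(split_vote, soft=False), vote(split_vote, soft=True))
```

In the first case only soft voting predicts the positive class, because one confident model outweighs two hesitant ones. In the second case only hard voting does, because of the tie rule.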

In [None]:
y_pred = gb_sampled.predict(X_test)
y_pred_prob = gb_sampled.predict_proba(X_test)[:, 1]

In [None]:
voting_pred = voting([ada_bag, gb_bag, ada_bag_sampled, gb_bag_sampled], X_test, prob=True)

In [None]:
precision = precision_score(y_test, voting_pred)
recall = recall_score(y_test, voting_pred)
accuracy = accuracy_score(y_test, voting_pred)

print(f'precision: {precision}')
print(f'recall: {recall}')
print(f'accuracy: {accuracy}')

precision: 0.4517488411293721
recall: 0.4284572342126299
accuracy: 0.9271733333333333


### Prediction on the test set

In [12]:
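# Use the same feature columns as in training; a separate name avoids
# overwriting the hold-out split X_test
X_submit = df_test[X_train.columns]

y_pred_ada = ada_bag.predict(X_submit)
y_pred_gb = gb_bag.predict(X_submit)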

In [15]:
pd.DataFrame({'y_pred':y_pred_ada}).to_csv('prediction_ada.csv', index=False)
pd.DataFrame({'y_pred':y_pred_gb}).to_csv('prediction_gb.csv', index=False)