# Combination Sampling

Implement the SMOTEENN technique with the credit card default data. Then estimate a logistic regression model and report the classification evaluation metrics.

The features are: ln_balance_limit, the log of the maximum balance allowed on the card; sex (1 = female, 0 = male); education (1 = graduate school, 2 = university, 3 = high school, 4 = others); marriage (1 = married, 0 = single); and age. The target, default_next_month, records whether the person defaults in the following month (1 = yes, 0 = no).

In [1]:
import pandas as pd
from pathlib import Path
from collections import Counter

In [2]:
data = Path(r'C:\Users\TribThapa\Desktop\Thapa\ResearchFellow\Courses\FinTech_Bootcamp_MonashUni2021\monu-mel-virt-fin-pt-05-2021-u-c\Activities\Week 11\3\07-Stu_Do_Combination_Sampling\Resources\cc_default.csv')
df = pd.read_csv(data)

In [3]:
x_cols = [i for i in df.columns if i not in ('ID', 'default_next_month')]
X = df[x_cols]
y = df['default_next_month']

In [4]:
x_cols

['ln_balance_limit', 'sex', 'education', 'marriage', 'age']

In [5]:
Counter(y)

Counter({1: 6636, 0: 23364})
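
The counts above show how unbalanced the target is, which is what motivates resampling. A minimal sketch of that arithmetic, using only the counts printed above (the variable names are illustrative and do not appear elsewhere in the notebook):

```python
from collections import Counter

# Class counts copied from the Counter(y) output above
class_counts = Counter({1: 6636, 0: 23364})

total = sum(class_counts.values())
minority_share = class_counts[1] / total             # fraction of defaulters
imbalance_ratio = class_counts[0] / class_counts[1]  # non-defaulters per defaulter

print(f"total rows: {total}")
print(f"minority share: {minority_share:.4f}")
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```

Roughly one account in five defaults, so a model that always predicts 0 would already be right most of the time.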

In [6]:
# Normal train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
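
Because no test_size is passed, train_test_split falls back to its default 25% hold-out, which is consistent with the 7500 test observations in the classification report at the end of the notebook. A small sketch on placeholder data of the same length (the rows list is a stand-in, not the credit card data):

```python
from sklearn.model_selection import train_test_split

# Stand-in for the 30,000 rows of the data set (not the real features)
rows = list(range(30000))

# No test_size given, so scikit-learn's default 25% hold-out applies
train_rows, test_rows = train_test_split(rows, random_state=1)

print(len(train_rows), len(test_rows))
```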

# Combination Sampling with SMOTEENN

In [7]:
# Use the SMOTEENN technique to perform combination sampling on the data
from imblearn.combine import SMOTEENN

sm_teenn = SMOTEENN(random_state=1)

X_smoteenn_resampled, y_smoteenn_resampled = sm_teenn.fit_resample(X_train, y_train)

# Count the resampled classes
Counter(y_smoteenn_resampled)

Counter({0: 7433, 1: 6007})

# Logistic Regression

In [9]:
# Fit a logistic regression model using the SMOTEENN-resampled data
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='lbfgs', random_state=1)

model.fit(X_smoteenn_resampled, y_smoteenn_resampled)

pred = model.predict(X_test)

# Evaluation Metrics

In [10]:
# Display the confusion matrix
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, pred)

cm_df = pd.DataFrame(cm, index=["Actual 0", "Actual 1"], columns=["Pred 0", "Pred 1"])

cm_df

          Pred 0  Pred 1
Actual 0    4027    1805
Actual 1     832     836
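
The per-class metrics reported later can be read directly off this matrix. A minimal sketch for the default class (label 1), using the four cell values above; the variable names are illustrative:

```python
# Cell values copied from the confusion matrix above
tn, fp = 4027, 1805   # actual 0: predicted 0, predicted 1
fn, tp = 832, 836     # actual 1: predicted 0, predicted 1

recall_1 = tp / (tp + fn)        # share of actual defaulters that were flagged
precision_1 = tp / (tp + fp)     # share of flagged accounts that really defaulted
specificity_1 = tn / (tn + fp)   # share of non-defaulters correctly left unflagged
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"recall={recall_1:.2f} precision={precision_1:.2f} "
      f"specificity={specificity_1:.2f} f1={f1_1:.2f}")
```

About half of the actual defaulters are caught, and fewer than a third of the accounts flagged as defaults actually default.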


In [11]:
# Calculate the Balanced Accuracy Score
from sklearn.metrics import balanced_accuracy_score

balanced_accuracy_score(y_test, pred)

0.5958498633192212
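
Balanced accuracy is the unweighted mean of the per-class recalls, so it can be checked by hand against the confusion matrix. The sketch below also computes plain accuracy for comparison; the variable names are illustrative.

```python
# Cell values copied from the confusion matrix
tn, fp = 4027, 1805   # actual 0
fn, tp = 832, 836     # actual 1

recall_0 = tn / (tn + fp)   # recall for the non-default class
recall_1 = tp / (tp + fn)   # recall for the default class

balanced_accuracy = (recall_0 + recall_1) / 2
plain_accuracy = (tn + tp) / (tn + fp + fn + tp)

print(f"balanced accuracy: {balanced_accuracy:.4f}")
print(f"plain accuracy:    {plain_accuracy:.4f}")
```

Plain accuracy is higher because it is dominated by the larger non-default class, while balanced accuracy gives both classes equal weight.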

In [12]:
# Print the imbalanced classification report
from imblearn.metrics import classification_report_imbalanced

print(classification_report_imbalanced(y_test, pred))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.83      0.69      0.50      0.75      0.59      0.35      5832
          1       0.32      0.50      0.69      0.39      0.59      0.34      1668

avg / total       0.71      0.65      0.54      0.67      0.59      0.35      7500
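
The geo and iba columns are less familiar than precision and recall. The sketch below reproduces them from the confusion matrix, assuming geo is the geometric mean of recall and specificity and iba is the squared geometric mean scaled by (1 + alpha * (recall - specificity)) with alpha = 0.1. That alpha is an assumption about the report's default setting; with it, the results round to the values in the table.

```python
import math

# Cell values copied from the confusion matrix
tn, fp = 4027, 1805   # actual 0
fn, tp = 832, 836     # actual 1

alpha = 0.1  # assumed weighting factor for the index balanced accuracy

# For class 0, "recall" is the true-negative rate and "specificity" the true-positive rate
recall_0, specificity_0 = tn / (tn + fp), tp / (tp + fn)
# For class 1 the two roles swap
recall_1, specificity_1 = specificity_0, recall_0

# In a binary problem the geometric mean is the same for both classes
geo = math.sqrt(recall_0 * specificity_0)

iba_0 = (1 + alpha * (recall_0 - specificity_0)) * geo ** 2
iba_1 = (1 + alpha * (recall_1 - specificity_1)) * geo ** 2

print(f"geo={geo:.2f} iba_0={iba_0:.2f} iba_1={iba_1:.2f}")
```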

