# Machine Learning & Synthetic Data

For this notebook, I'll use the adjusted set from feature engineering notebook on Logit, XGBoost, Light GBM, SVM and Neural Network models.

In [25]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
import lightgbm as lgbm
from sklearn import svm
from sklearn import metrics
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD, Adam
from imblearn.over_sampling import SMOTE    #  This is the library I'll use to create synthetic data

## Prepping data

### Loading datasets

In [2]:
# Import merged dataset (train + test)
default_df = pd.read_csv('default_df.csv')

# Separate between features (X) and answer (y)
x = default_df.drop(['Unnamed: 0','Loan Status'], axis=1)
y = default_df['Loan Status']

# Split dataset considering train and test must have default class
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=101, stratify=y)

### Synthetic Minority Oversampling Technique - SMOTE

Our feature engineering showed us that default clients represent 9.23% of train set. This will make our model less precise do identify default clients, which is a huge problem since it's suppose to do that.

Total classes in training set 'Loan Status'
|Code|Number of observations| Label|
|:-:|:------:|:--:|
|0  |  58,209 | Non-Default Clients|
|1  |   5,920 | Default Clients |

In this notebook I'll fix the imbalanced set by generating synthetic data for default clients (minority class). This technique is known as **oversampling**, and is commonly used in cases like this.

**How does it work?**

*[SMOTE](https://www.blog.trainindata.com/smote-in-python-a-guide-to-balanced-datasets/) will analyze the dataset and find the minorty class. After that, will start to calculate the closest neighboors (k-means) start to generate extra observations based on those distance mean.*

In [3]:
# Create SMOTE instance
smote = SMOTE(random_state=101)

# Apply SMOTE on my TRAINING set, already split between x_train and y_train
x_train, y_train = smote.fit_resample(x_train,y_train)

# Check classes
print('Total classes in training set', y_train.value_counts())

Total classes in training set Loan Status
0.0    60985
1.0    60985
Name: count, dtype: int64


### Standardizing data

In [4]:
# Create a standard scaler based on train set
scaler = StandardScaler()

# Use train set as reference
scaler.fit_transform(x_train)

# Transform test without contaminating it
scaler.transform(x_test)

array([[-0.74797086,  1.97373137, -1.17839743, ..., -0.08747114,
        -0.28630688,  1.20135296],
       [-0.2205362 , -1.54737296,  2.82020707, ..., -0.08747114,
        -0.28630688, -0.98837057],
       [ 0.41802934, -0.89354366,  1.60311381, ..., -0.08747114,
        -0.28630688,  1.20135296],
       ...,
       [ 0.39817562,  0.54742868,  0.81968218, ..., -0.08747114,
        -0.28630688, -0.98837057],
       [ 1.47901709,  0.93424444, -0.86987622, ..., -0.08747114,
        -0.28630688,  1.20135296],
       [ 1.85910968,  0.66504332, -0.29539639, ..., -0.08747114,
        -0.28630688, -0.98837057]])

## Logit Model

In [5]:
# Create instance for Logit model
logit = LogisticRegression()

# Fit model in my training set
logit.fit(x_train, y_train)

# Predict y_test
logit_predict = logit.predict(x_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [6]:
# Compare results
logit_matrix = metrics.confusion_matrix(y_test, logit_predict)
print('Confusion Matrix','\n',logit_matrix, '\n')

print(metrics.classification_report(y_test, logit_predict))

Confusion Matrix 
 [[13063 13074]
 [  808   968]] 

              precision    recall  f1-score   support

         0.0       0.94      0.50      0.65     26137
         1.0       0.07      0.55      0.12      1776

    accuracy                           0.50     27913
   macro avg       0.51      0.52      0.39     27913
weighted avg       0.89      0.50      0.62     27913



Logit results are **bad**. The model can predict non-default (`Loan Status` = 0) with an avarage precision (F1) of 0.12, but needing to repeat the operation (recall) about half the times (0.55).

In [7]:
# Extract classification report
class_report = metrics.classification_report(y_test, logit_predict, output_dict=True)
class_report = pd.DataFrame(class_report).round(2).transpose()
class_report['Model'] = 'logit'
class_report

Unnamed: 0,precision,recall,f1-score,support,Model
0.0,0.94,0.5,0.65,26137.0,logit
1.0,0.07,0.55,0.12,1776.0,logit
accuracy,0.5,0.5,0.5,0.5,logit
macro avg,0.51,0.52,0.39,27913.0,logit
weighted avg,0.89,0.5,0.62,27913.0,logit


## XGBoost

In [8]:
# Create XGBoost instance
XGB = xgb.XGBClassifier()

# Fit he model
XGB.fit(x_train, y_train)

# Pedict
xgb_predict = XGB.predict(x_test)

In [9]:
# Analyze performance
print('Confusion matrix', '\n',metrics.confusion_matrix(y_test, xgb_predict), '\n')

print(metrics.classification_report(y_test, xgb_predict))

Confusion matrix 
 [[26085    52]
 [ 1773     3]] 

              precision    recall  f1-score   support

         0.0       0.94      1.00      0.97     26137
         1.0       0.05      0.00      0.00      1776

    accuracy                           0.93     27913
   macro avg       0.50      0.50      0.48     27913
weighted avg       0.88      0.93      0.90     27913



In [10]:
# Create temporary class report 
temp_class_report = metrics.classification_report(y_test, xgb_predict, output_dict=True)
temp_class_report = pd.DataFrame(temp_class_report).round(2).transpose()
temp_class_report['Model'] = 'xgb'

# Concat with main df
class_report = pd.concat([class_report, temp_class_report], axis=0)

# Display final df
# class_report

## Light GBM

In [11]:
# Create instance
lgb = lgbm.LGBMClassifier()

# Fit in train set
lgb.fit(x_train, y_train)

# Predict
lgb_predict = lgb.predict(x_test)

[LightGBM] [Info] Number of positive: 60985, number of negative: 60985
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.018349 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 6749
[LightGBM] [Info] Number of data points in the train set: 121970, number of used features: 35
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000


In [12]:
# Confusion matrix
print('Confusion Matrix:', '\n',metrics.confusion_matrix(y_test,lgb_predict), '\n')

# Classification report
print(metrics.classification_report(y_test, lgb_predict))

Confusion Matrix: 
 [[26105    32]
 [ 1774     2]] 

              precision    recall  f1-score   support

         0.0       0.94      1.00      0.97     26137
         1.0       0.06      0.00      0.00      1776

    accuracy                           0.94     27913
   macro avg       0.50      0.50      0.48     27913
weighted avg       0.88      0.94      0.91     27913



In [13]:
# Create temporary classification report df
temp_class_report = metrics.classification_report(y_test, lgb_predict, output_dict=True)
temp_class_report = pd.DataFrame(temp_class_report).round(2).transpose()
temp_class_report['Model'] = 'Light GBM'

# Concat with main report
class_report = pd.concat([class_report, temp_class_report], axis=0)
# class_report
class_report.to_csv('classification_report.csv')

## SVM

In [14]:
# Instance model
svc = svm.SVC()

# Fit
svc.fit(x_train, y_train)

# Predict
svc_predict = svc.predict(x_test)

In [15]:
# See results
print('Confusion matrix', '\n', metrics.confusion_matrix(y_test, svc_predict))

# Classification report
print(metrics.classification_report(y_test, svc_predict))

Confusion matrix 
 [[ 7951 18186]
 [  480  1296]]
              precision    recall  f1-score   support

         0.0       0.94      0.30      0.46     26137
         1.0       0.07      0.73      0.12      1776

    accuracy                           0.33     27913
   macro avg       0.50      0.52      0.29     27913
weighted avg       0.89      0.33      0.44     27913



In [16]:
# Temporary classification report
temp_class_report = metrics.classification_report(y_test, svc_predict, output_dict=True)
temp_class_report = pd.DataFrame(temp_class_report).round(2).transpose()
temp_class_report['Model'] = 'SVM'

# Merge with report df
class_report = pd.concat([class_report, temp_class_report], axis=0)
# class_report
class_report.to_csv('classification_report.csv')

Unnamed: 0,precision,recall,f1-score,support,Model
0.0,0.94,0.5,0.65,26137.0,logit
1.0,0.07,0.55,0.12,1776.0,logit
accuracy,0.5,0.5,0.5,0.5,logit
macro avg,0.51,0.52,0.39,27913.0,logit
weighted avg,0.89,0.5,0.62,27913.0,logit
0.0,0.94,1.0,0.97,26137.0,xgb
1.0,0.05,0.0,0.0,1776.0,xgb
accuracy,0.93,0.93,0.93,0.93,xgb
macro avg,0.5,0.5,0.48,27913.0,xgb
weighted avg,0.88,0.93,0.9,27913.0,xgb


## Neural Network

In [None]:
# First transform df in arrays. This is the way tensorflow builds its models
x_train_a = x_train.to_numpy()
y_train_a = y_train.to_numpy()
x_test_a = x_test.to_numpy()
y_test_a = y_test.to_numpy()

x_train_a.shape

(121970, 36)

In [59]:
# Build neural netwrok
ann_sgd = tf.keras.models.Sequential([
  tf.keras.layers.Dense(64, input_shape=(36,), activation='tanh'),
  tf.keras.layers.Dense(32, activation='tanh'),
  tf.keras.layers.Dropout(0.20),
  tf.keras.layers.Dense(1, activation='sigmoid')
])

In [60]:
# Compile and fit
opt = tf.keras.optimizers.SGD(learning_rate=0.01)

# Since this is a classification problem, our loss analysis also changes from MSE to binarycrossentropy
ann_sgd.compile(optimizer=opt, 
            loss='binary_crossentropy',
            metrics=['accuracy'])

# Predic
ann_sgd.fit(x_train_a, y_train_a, epochs=50)

Epoch 1/50
[1m3812/3812[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 852us/step - accuracy: 0.5023 - loss: 0.7117
Epoch 2/50
[1m3812/3812[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 865us/step - accuracy: 0.5032 - loss: 0.6947
Epoch 3/50
[1m3812/3812[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 838us/step - accuracy: 0.4991 - loss: 0.6945
Epoch 4/50
[1m3812/3812[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 874us/step - accuracy: 0.4996 - loss: 0.6941
Epoch 5/50
[1m3812/3812[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 871us/step - accuracy: 0.5032 - loss: 0.6937
Epoch 6/50
[1m3812/3812[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 865us/step - accuracy: 0.4991 - loss: 0.6937
Epoch 7/50
[1m3812/3812[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 878us/step - accuracy: 0.4999 - loss: 0.6935
Epoch 8/50
[1m3812/3812[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 874us/step - accuracy: 0.5057 - loss: 0.6934
Epoch 9/

<keras.src.callbacks.history.History at 0x2210ff89550>

In [None]:
# Predict using ANN, and here I'll round rthe predictions so we have 1 (defualt) and 0 (non-default) instead of float numbers
ann_sgd_predict = (ann_sgd.predict(x_test_a) > 0.5).astype(int)

[1m873/873[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 694us/step


In [64]:
# See results
print('Confusion matrix', '\n', metrics.confusion_matrix(y_test_a, ann_sgd_predict))

# Classification report
print(metrics.classification_report(y_test_a, ann_sgd_predict))

Confusion matrix 
 [[  124 26013]
 [    7  1769]]
              precision    recall  f1-score   support

         0.0       0.95      0.00      0.01     26137
         1.0       0.06      1.00      0.12      1776

    accuracy                           0.07     27913
   macro avg       0.51      0.50      0.06     27913
weighted avg       0.89      0.07      0.02     27913



In [65]:
# Temporary classification report
temp_class_report = metrics.classification_report(y_test_a, ann_sgd_predict, output_dict=True)
temp_class_report = pd.DataFrame(temp_class_report).round(2).transpose()
temp_class_report['Model'] = 'ANN SGD'

# Merge with report df
class_report = pd.concat([class_report, temp_class_report], axis=0)

# Save
class_report.to_csv('classification_report.csv')

#### Adam optimizer

In [68]:
# Using similar architecture
ann_adam = tf.keras.models.Sequential([
  tf.keras.layers.Dense(64, input_shape=(36,), activation='relu'),
  tf.keras.layers.Dense(32, activation='tanh'),
  tf.keras.layers.Dropout(0.20),
  tf.keras.layers.Dense(10, activation='tanh'),
  tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile and fit
opt = tf.keras.optimizers.Adam(learning_rate=0.01)

# Since this is a classification problem, our loss analysis also changes from MSE to binarycrossentropy
ann_adam.compile(optimizer=opt, 
            loss='binary_crossentropy',
            metrics=['accuracy'])

# Fit in train sets
ann_adam.fit(x_train_a, y_train_a, epochs=50)

Epoch 1/50
[1m3812/3812[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 1ms/step - accuracy: 0.5003 - loss: 0.6965
Epoch 2/50
[1m3812/3812[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 1ms/step - accuracy: 0.5009 - loss: 0.6953
Epoch 3/50
[1m3812/3812[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 1ms/step - accuracy: 0.5009 - loss: 0.6952
Epoch 4/50
[1m3812/3812[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 1ms/step - accuracy: 0.5022 - loss: 0.6948
Epoch 5/50
[1m3812/3812[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 1ms/step - accuracy: 0.4988 - loss: 0.6955
Epoch 6/50
[1m3812/3812[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 1ms/step - accuracy: 0.5002 - loss: 0.6951
Epoch 7/50
[1m3812/3812[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 1ms/step - accuracy: 0.5018 - loss: 0.6954
Epoch 8/50
[1m3812/3812[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 1ms/step - accuracy: 0.5013 - loss: 0.6954
Epoch 9/50
[1m3812/3812

<keras.src.callbacks.history.History at 0x221086e0550>

In [69]:
# Predict using ANN, and here I'll round rthe predictions so we have 1 (defualt) and 0 (non-default) instead of float numbers
ann_adam_predict = (ann_adam.predict(x_test_a) > 0.5).astype(int)

[1m873/873[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 740us/step


In [70]:
# See results
print('Confusion matrix', '\n', metrics.confusion_matrix(y_test_a, ann_adam_predict))

# Classification report
print(metrics.classification_report(y_test_a, ann_adam_predict))

Confusion matrix 
 [[26137     0]
 [ 1776     0]]
              precision    recall  f1-score   support

         0.0       0.94      1.00      0.97     26137
         1.0       0.00      0.00      0.00      1776

    accuracy                           0.94     27913
   macro avg       0.47      0.50      0.48     27913
weighted avg       0.88      0.94      0.91     27913



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [71]:
# Temporary classification report
temp_class_report = metrics.classification_report(y_test_a, ann_adam_predict, output_dict=True)
temp_class_report = pd.DataFrame(temp_class_report).round(2).transpose()
temp_class_report['Model'] = 'ANN ADAM'

# Merge with report df
class_report = pd.concat([class_report, temp_class_report], axis=0)

# Save
class_report.to_csv('classification_report.csv')

# Display final results
class_report

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Unnamed: 0,precision,recall,f1-score,support,Model
0.0,0.94,0.5,0.65,26137.0,logit
1.0,0.07,0.55,0.12,1776.0,logit
accuracy,0.5,0.5,0.5,0.5,logit
macro avg,0.51,0.52,0.39,27913.0,logit
weighted avg,0.89,0.5,0.62,27913.0,logit
0.0,0.94,1.0,0.97,26137.0,xgb
1.0,0.05,0.0,0.0,1776.0,xgb
accuracy,0.93,0.93,0.93,0.93,xgb
macro avg,0.5,0.5,0.48,27913.0,xgb
weighted avg,0.88,0.93,0.9,27913.0,xgb
