# Machine Learning & Synthetic Data

In this notebook, I'll train Logit, XGBoost, LightGBM, SVM and Neural Network models on the adjusted dataset produced in the feature engineering notebook.

In [16]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
import lightgbm as lgbm
from sklearn import svm
from sklearn import metrics
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD, Adam
from imblearn.over_sampling import SMOTE    #  This is the library I'll use to create synthetic data

## Prepping data

### Loading datasets

In [17]:
# Import merged dataset (train + test)
default_df = pd.read_csv('train_adj.csv')

# Separate between features (X) and answer (y)
x = default_df.drop(['Unnamed: 0','Loan Status'], axis=1)
y = default_df['Loan Status']

# Stratified split, so train and test both keep the same share of default clients
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=101, stratify=y)

### Synthetic Minority Oversampling Technique - SMOTE

Our feature engineering showed that default clients represent only 9.23% of the training set. This imbalance makes the model worse at identifying default clients, which is a serious problem since that is exactly what it is supposed to do.

Total classes in training set 'Loan Status'
|Code|Number of observations| Label|
|:-:|:------:|:--:|
|0  |  58,209 | Non-Default Clients|
|1  |   5,920 | Default Clients |

In this notebook I'll fix the imbalanced set by generating synthetic data for default clients (minority class). This technique is known as **oversampling**, and is commonly used in cases like this.

**How does it work?**

*[SMOTE](https://www.blog.trainindata.com/smote-in-python-a-guide-to-balanced-datasets/) looks at the minority class and, for each minority observation, finds its closest neighbours within that class (k-nearest neighbours). It then generates each synthetic observation at a random point on the line segment between an observation and one of those neighbours.*
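
The interpolation idea can be sketched in a few lines of NumPy. This is only an illustration of the principle, not the `imblearn` implementation: the function `smote_sketch`, its parameters and the toy points are hypothetical, and the brute-force distance matrix would not scale to the real dataset.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic rows from minority-class samples X_min (illustrative only)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)

    # Pairwise Euclidean distances between minority samples (brute force)
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)                  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]    # indices of the k nearest per sample

    base = rng.integers(0, n, size=n_new)                     # pick a minority sample
    pick = rng.integers(0, neighbours.shape[1], size=n_new)   # pick one of its neighbours
    nb = neighbours[base, pick]
    gap = rng.random((n_new, 1))                    # random position along the segment

    # New point = sample + gap * (neighbour - sample)
    return X_min[base] + gap * (X_min[nb] - X_min[base])

# Toy minority class: the four corners of the unit square (assumption, not the loan data)
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(X_min, n_new=6, k=2)
print(synthetic.shape)  # (6, 2)
```

Because every synthetic row is a convex combination of two real minority rows, it always falls between existing observations rather than outside the range the minority class already covers.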

In [18]:
# Create SMOTE instance
smote = SMOTE(random_state=101)

# Apply SMOTE on my TRAINING set, already split between x_train and y_train
x_train, y_train = smote.fit_resample(x_train,y_train)

# Check classes
print('Total classes in training set', y_train.value_counts())

Total classes in training set Loan Status
0    40746
1    40746
Name: count, dtype: int64


### Standardizing data

In [19]:
# # Create a standard scaler based on train set
# scaler = StandardScaler()

# # Use train set as reference
# scaler.fit_transform(x_train)

# # Transform test without contaminating it
# scaler.transform(x_test)

## Logit Model

In [20]:
# Create instance for Logit model
logit = LogisticRegression()

# Fit model in my training set
logit.fit(x_train, y_train)

# Predict y_test
logit_predict = logit.predict(x_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
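
The warning above points to two remedies: raising `max_iter` or scaling the features. A minimal sketch of both, assuming toy data from `make_classification` in place of the loan dataset (the names `X_demo`, `y_demo` and `logit_scaled` are hypothetical, and results on the real data may differ):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data standing in for x_train / y_train (assumption, not the loan data)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=101)
X_demo[:, 0] *= 10_000   # put one feature on a much larger scale than the others

# Scale inside a pipeline, then fit a Logit model with a higher iteration cap
logit_scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logit_scaled.fit(X_demo, y_demo)

# Number of solver iterations actually used
n_iter = logit_scaled.named_steps['logisticregression'].n_iter_[0]
print(n_iter)
```

Keeping the scaler inside the pipeline means it is fitted only on whatever data is passed to `fit`, so the test set is transformed with training statistics.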


In [21]:
# Compare results
logit_matrix = metrics.confusion_matrix(y_test, logit_predict)
print('Confusion Matrix','\n',logit_matrix, '\n')

print(metrics.classification_report(y_test, logit_predict))

print('\n AUC:', round(metrics.roc_auc_score(y_test, logit_predict),4))

Confusion Matrix 
 [[9969 7494]
 [ 995  781]] 

              precision    recall  f1-score   support

           0       0.91      0.57      0.70     17463
           1       0.09      0.44      0.16      1776

    accuracy                           0.56     19239
   macro avg       0.50      0.51      0.43     19239
weighted avg       0.83      0.56      0.65     19239


 AUC: 0.5053
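
Note that the AUC above is computed from hard 0/1 predictions. In that case the ROC curve has a single operating point and the AUC reduces to the balanced accuracy, here (0.44 + 0.57) / 2 ≈ 0.505. A hedged sketch of the difference between scoring labels and scoring probabilities, on toy data (all `*_demo` style names are hypothetical, and this is not a re-run of the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced toy data standing in for the loan dataset (assumption)
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.9, 0.1], random_state=101)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=101, stratify=y_demo)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard labels: a single threshold, equal to balanced accuracy
auc_labels = roc_auc_score(y_te, clf.predict(X_te))
bal_acc = balanced_accuracy_score(y_te, clf.predict(X_te))

# AUC from predicted probabilities of class 1: uses the full ranking of the test rows
auc_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(round(auc_labels, 4), round(bal_acc, 4), round(auc_scores, 4))
```

Scoring probabilities shows how well the model ranks clients regardless of the 0.5 cut-off, which can be more informative when classes are imbalanced.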


In [22]:
# Extract classification report
class_report = metrics.classification_report(y_test, logit_predict, output_dict=True)
class_report = pd.DataFrame(class_report).round(2).transpose()
class_report['Model'] = 'logit'
class_report

| |precision|recall|f1-score|support|Model|
|:-:|:-:|:-:|:-:|:-:|:-:|
|0|0.91|0.57|0.7|17463.0|logit|
|1|0.09|0.44|0.16|1776.0|logit|
|accuracy|0.56|0.56|0.56|0.56|logit|
|macro avg|0.5|0.51|0.43|19239.0|logit|
|weighted avg|0.83|0.56|0.65|19239.0|logit|


In [23]:
# Create AUC report
auc = {'Model': ['Logit SMOT'],
       'AUC': round(metrics.roc_auc_score(y_test, logit_predict),4)}

# Add to the report
auc_report = pd.DataFrame(auc)
auc_report

| |Model|AUC|
|:-:|:-:|:-:|
|0|Logit SMOT|0.5053|
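
The classification report and AUC cells are repeated for every model in this notebook. One possible way to factor them out is a small helper; `report_frames` and the toy labels are hypothetical and not part of the original notebook.

```python
import pandas as pd
from sklearn import metrics

def report_frames(y_true, y_pred, model_name):
    """Return (classification report, AUC) for one model as two small DataFrames."""
    report = metrics.classification_report(y_true, y_pred,
                                           output_dict=True, zero_division=0)
    report = pd.DataFrame(report).round(2).transpose()
    report['Model'] = model_name

    auc = pd.DataFrame({'Model': [model_name],
                        'AUC': [round(metrics.roc_auc_score(y_true, y_pred), 4)]})
    return report, auc

# Toy labels, only to show the call (assumption, not notebook results)
y_true_demo = [0, 0, 0, 1, 1, 0]
y_pred_demo = [0, 1, 0, 1, 0, 0]
demo_report, demo_auc = report_frames(y_true_demo, y_pred_demo, 'demo')

# ignore_index=True gives the combined report a clean 0..n-1 index
all_auc = pd.concat([demo_auc, demo_auc], ignore_index=True)
print(all_auc)
```

Passing `ignore_index=True` to `pd.concat` avoids the repeated `0` index that appears when single-row reports are stacked.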


## XGBoost

In [24]:
# Create XGBoost instance
XGB = xgb.XGBClassifier()

# Fit the model
XGB.fit(x_train, y_train)

# Predict
xgb_predict = XGB.predict(x_test)

In [25]:
# Analyze performance
print('Confusion matrix', '\n',metrics.confusion_matrix(y_test, xgb_predict), '\n')

print(metrics.classification_report(y_test, xgb_predict))

Confusion matrix 
 [[17396    67]
 [ 1768     8]] 

              precision    recall  f1-score   support

           0       0.91      1.00      0.95     17463
           1       0.11      0.00      0.01      1776

    accuracy                           0.90     19239
   macro avg       0.51      0.50      0.48     19239
weighted avg       0.83      0.90      0.86     19239



In [26]:
# Create temporary class report
temp_class_report = metrics.classification_report(y_test, xgb_predict, output_dict=True)
temp_class_report = pd.DataFrame(temp_class_report).round(2).transpose()
temp_class_report['Model'] = 'xgb'

# Concat with main df
class_report = pd.concat([class_report, temp_class_report], axis=0)

# Display final df
# class_report

In [27]:
# Create AUC report
auc = {'Model': ['XGBoost SMOT'],
       'AUC': round(metrics.roc_auc_score(y_test, xgb_predict),4)}

# Add to the report
temp_auc_report = pd.DataFrame(auc)
auc_report = pd.concat([auc_report, temp_auc_report], axis=0)
auc_report

| |Model|AUC|
|:-:|:-:|:-:|
|0|Logit SMOT|0.5053|
|0|XGBoost SMOT|0.5003|


## Light GBM

In [28]:
# Create instance
lgb = lgbm.LGBMClassifier()

# Fit in train set
lgb.fit(x_train, y_train)

# Predict
lgb_predict = lgb.predict(x_test)

[LightGBM] [Info] Number of positive: 40746, number of negative: 40746
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.018394 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 6714
[LightGBM] [Info] Number of data points in the train set: 81492, number of used features: 35
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000


In [29]:
# Confusion matrix
print('Confusion Matrix:', '\n',metrics.confusion_matrix(y_test,lgb_predict), '\n')

# Classification report
print(metrics.classification_report(y_test, lgb_predict))

Confusion Matrix: 
 [[17457     6]
 [ 1774     2]] 

              precision    recall  f1-score   support

           0       0.91      1.00      0.95     17463
           1       0.25      0.00      0.00      1776

    accuracy                           0.91     19239
   macro avg       0.58      0.50      0.48     19239
weighted avg       0.85      0.91      0.86     19239



In [30]:
# Create temporary classification report df
temp_class_report = metrics.classification_report(y_test, lgb_predict, output_dict=True)
temp_class_report = pd.DataFrame(temp_class_report).round(2).transpose()
temp_class_report['Model'] = 'Light GBM'

# Concat with main report
class_report = pd.concat([class_report, temp_class_report], axis=0)
# class_report
class_report.to_csv('classification_report.csv')

In [31]:
# Create AUC report
auc = {'Model': ['LGBT SMOT'],
       'AUC': round(metrics.roc_auc_score(y_test, lgb_predict),4)}

# Add to the report
temp_auc_report = pd.DataFrame(auc)
auc_report = pd.concat([auc_report, temp_auc_report], axis=0)
auc_report

| |Model|AUC|
|:-:|:-:|:-:|
|0|Logit SMOT|0.5053|
|0|XGBoost SMOT|0.5003|
|0|LGBT SMOT|0.5004|


## Standardize Data

In [32]:
# Create a standard scaler based on train set
scaler = StandardScaler()

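# Fit on the train set and use it as the reference
# Note: the result is not assigned, so x_train itself stays unscaled
scaler.fit_transform(x_train)

# Transform test with the train statistics, without contaminating it
# (also not assigned; the array below is only displayed)
scaler.transform(x_test)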

array([[ 2.20015961, -1.2063035 , -0.17770818, ..., -0.08442591,
        -0.29289712, -1.00739589],
       [-0.55341949, -0.8022342 , -0.69380347, ..., -0.08442591,
        -0.29289712,  1.16801014],
       [ 0.31244599,  0.17124059,  1.26297451, ..., -0.08442591,
        -0.29289712,  1.16801014],
       ...,
       [ 0.94831789, -0.57509008,  1.33764398, ..., -0.08442591,
        -0.29289712, -1.00739589],
       [-1.55001685, -1.02963585,  1.06469431, ..., -0.08442591,
        -0.29289712,  1.16801014],
       [ 0.67217634,  2.47757751,  0.58802977, ..., -0.08442591,
        -0.29289712,  1.16801014]])
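
`fit_transform` and `transform` return new arrays and leave their inputs unchanged, so the scaled values only reach a model if they are stored and passed on. A minimal sketch of that pattern on random toy data (the `*_demo` and `*_scaled` names are hypothetical, not variables from this notebook):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy features with a mean far from zero (assumption, not the loan data)
rng = np.random.default_rng(101)
x_train_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
x_test_demo = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

scaler_demo = StandardScaler()

# Keep the returned arrays: the inputs themselves are not modified
x_train_scaled = scaler_demo.fit_transform(x_train_demo)   # fit on train only
x_test_scaled = scaler_demo.transform(x_test_demo)         # reuse train statistics

print(x_train_scaled.mean(axis=0).round(2))   # per-column means close to zero
print(x_train_demo.mean(axis=0).round(2))     # original array is untouched
```

A model would then be fitted on `x_train_scaled` and evaluated on `x_test_scaled`.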

## SVM

In [33]:
# Instance model
svc = svm.SVC()

# Fit
svc.fit(x_train, y_train)

# Predict
svc_predict = svc.predict(x_test)

In [34]:
# See results
print('Confusion matrix', '\n', metrics.confusion_matrix(y_test, svc_predict))

# Classification report
print(metrics.classification_report(y_test, svc_predict))

Confusion matrix 
 [[ 6215 11248]
 [  632  1144]]
              precision    recall  f1-score   support

           0       0.91      0.36      0.51     17463
           1       0.09      0.64      0.16      1776

    accuracy                           0.38     19239
   macro avg       0.50      0.50      0.34     19239
weighted avg       0.83      0.38      0.48     19239



In [35]:
# Temporary classification report
temp_class_report = metrics.classification_report(y_test, svc_predict, output_dict=True)
temp_class_report = pd.DataFrame(temp_class_report).round(2).transpose()
temp_class_report['Model'] = 'SVM'

# Merge with report df
class_report = pd.concat([class_report, temp_class_report], axis=0)
# class_report
class_report.to_csv('classification_report.csv')

In [36]:
# Create AUC report
auc = {'Model': ['SVM SMOT'],
       'AUC': round(metrics.roc_auc_score(y_test, svc_predict),4)}

# Add to the report
temp_auc_report = pd.DataFrame(auc)
auc_report = pd.concat([auc_report, temp_auc_report], axis=0)
auc_report

| |Model|AUC|
|:-:|:-:|:-:|
|0|Logit SMOT|0.5053|
|0|XGBoost SMOT|0.5003|
|0|LGBT SMOT|0.5004|
|0|SVM SMOT|0.5|


## Neural Network

In [37]:
# First convert the DataFrames to NumPy arrays, the input format passed to TensorFlow here
x_train_a = x_train.to_numpy()
y_train_a = y_train.to_numpy()
x_test_a = x_test.to_numpy()
y_test_a = y_test.to_numpy()

x_train_a.shape

(81492, 36)

In [38]:
# Build neural network
ann_sgd = tf.keras.models.Sequential([
  tf.keras.layers.Dense(64, input_shape=(36,), activation='tanh'),
  tf.keras.layers.Dense(32, activation='tanh'),
  tf.keras.layers.Dropout(0.20),
  tf.keras.layers.Dense(1, activation='sigmoid')
])



In [39]:
# Compile and fit
opt = tf.keras.optimizers.SGD(learning_rate=0.01)

# This is a binary classification problem, so the loss is binary cross-entropy rather than MSE
ann_sgd.compile(optimizer=opt,
            loss='binary_crossentropy',
            metrics=['accuracy'])

# Fit on the training set
ann_sgd.fit(x_train_a, y_train_a, epochs=50)

Epoch 1/50
2547/2547 ━━━━━━━━━━━━━━━━━━━━ 6s 2ms/step - accuracy: 0.4982 - loss: 0.7123
Epoch 2/50
2547/2547 ━━━━━━━━━━━━━━━━━━━━ 6s 2ms/step - accuracy: 0.4995 - loss: 0.6954
Epoch 3/50
2547/2547 ━━━━━━━━━━━━━━━━━━━━ 10s 2ms/step - accuracy: 0.4999 - loss: 0.6945
Epoch 4/50
2547/2547 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.4976 - loss: 0.6942
Epoch 5/50
2547/2547 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.5022 - loss: 0.6939
Epoch 6/50
2547/2547 ━━━━━━━━━━━━━━━━━━━━ 6s 2ms/step - accuracy: 0.5007 - loss: 0.6938
Epoch 7/50
2547/2547 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.5057 - loss: 0.6936
Epoch 8/50
2547/2547 ━━━━━━━━━━━━━━━━━━━━ 6s 2ms/step - accuracy: 0.4988 - loss: 0.6935
Epoch 9/50
2547/254

<keras.src.callbacks.history.History at 0x79c6f9a79150>
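
The log above settles at a loss of about 0.693 with accuracy near 0.50. For reference, that is the binary cross-entropy of a prediction of 0.5, which is what a network outputs when it cannot separate the two classes. A short check of the arithmetic (the helper `binary_crossentropy` is written here for illustration and is not the Keras function):

```python
import math

def binary_crossentropy(y_true, p):
    """Binary cross-entropy of a single prediction p for a label y_true in {0, 1}."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# A prediction of 0.5 costs ln(2), whatever the true label is
loss_when_unsure = binary_crossentropy(1, 0.5)
print(round(loss_when_unsure, 4))  # 0.6931
```

A loss that stays at this level over the epochs suggests the network is not learning a useful decision boundary from the inputs as given.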

In [40]:
# Predict with the ANN and threshold the probabilities at 0.5, so we get 1 (default) and 0 (non-default) instead of floats
ann_sgd_predict = (ann_sgd.predict(x_test_a) > 0.5).astype(int)

[1m602/602[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step


In [41]:
# See results
print('Confusion matrix', '\n', metrics.confusion_matrix(y_test_a, ann_sgd_predict))

# Classification report
print(metrics.classification_report(y_test_a, ann_sgd_predict))

Confusion matrix 
 [[17453    10]
 [ 1775     1]]
              precision    recall  f1-score   support

           0       0.91      1.00      0.95     17463
           1       0.09      0.00      0.00      1776

    accuracy                           0.91     19239
   macro avg       0.50      0.50      0.48     19239
weighted avg       0.83      0.91      0.86     19239



In [42]:
# Temporary classification report
temp_class_report = metrics.classification_report(y_test_a, ann_sgd_predict, output_dict=True)
temp_class_report = pd.DataFrame(temp_class_report).round(2).transpose()
temp_class_report['Model'] = 'ANN SGD'

# Merge with report df
class_report = pd.concat([class_report, temp_class_report], axis=0)

# Save
# class_report.to_csv('classification_report.csv')

In [43]:
# Create AUC report
auc = {'Model': ['ANN SGD SMOT'],
       'AUC': round(metrics.roc_auc_score(y_test, ann_sgd_predict),4)}

# Add to the report
temp_auc_report = pd.DataFrame(auc)
auc_report = pd.concat([auc_report, temp_auc_report], axis=0)
auc_report

| |Model|AUC|
|:-:|:-:|:-:|
|0|Logit SMOT|0.5053|
|0|XGBoost SMOT|0.5003|
|0|LGBT SMOT|0.5004|
|0|SVM SMOT|0.5|
|0|ANN SGD SMOT|0.5|


#### Adam optimizer

In [44]:
# Similar architecture, with a ReLU first layer and an extra 10-unit hidden layer
ann_adam = tf.keras.models.Sequential([
  tf.keras.layers.Dense(64, input_shape=(36,), activation='relu'),
  tf.keras.layers.Dense(32, activation='tanh'),
  tf.keras.layers.Dropout(0.20),
  tf.keras.layers.Dense(10, activation='tanh'),
  tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile and fit
opt = tf.keras.optimizers.Adam(learning_rate=0.01)

# This is a binary classification problem, so the loss is binary cross-entropy rather than MSE
ann_adam.compile(optimizer=opt,
            loss='binary_crossentropy',
            metrics=['accuracy'])

# Fit in train sets
ann_adam.fit(x_train_a, y_train_a, epochs=50)

Epoch 1/50




2547/2547 ━━━━━━━━━━━━━━━━━━━━ 8s 3ms/step - accuracy: 0.4959 - loss: 0.6986
Epoch 2/50
2547/2547 ━━━━━━━━━━━━━━━━━━━━ 6s 2ms/step - accuracy: 0.4935 - loss: 0.6958
Epoch 3/50
2547/2547 ━━━━━━━━━━━━━━━━━━━━ 10s 2ms/step - accuracy: 0.4994 - loss: 0.6952
Epoch 4/50
2547/2547 ━━━━━━━━━━━━━━━━━━━━ 7s 3ms/step - accuracy: 0.5008 - loss: 0.6950
Epoch 5/50
2547/2547 ━━━━━━━━━━━━━━━━━━━━ 10s 3ms/step - accuracy: 0.5003 - loss: 0.6958
Epoch 6/50
2547/2547 ━━━━━━━━━━━━━━━━━━━━ 9s 2ms/step - accuracy: 0.4993 - loss: 0.6953
Epoch 7/50
2547/2547 ━━━━━━━━━━━━━━━━━━━━ 10s 2ms/step - accuracy: 0.4996 - loss: 0.6955
Epoch 8/50
2547/2547 ━━━━━━━━━━━━━━━━━━━━ 7s 3ms/step - accuracy: 0.4967 - loss: 0.6960
Epoch 9/50
2547/2547

<keras.src.callbacks.history.History at 0x79c6fa485250>

In [45]:
# Predict with the ANN and threshold the probabilities at 0.5, so we get 1 (default) and 0 (non-default) instead of floats
ann_adam_predict = (ann_adam.predict(x_test_a) > 0.5).astype(int)

[1m602/602[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 2ms/step


In [46]:
# See results
print('Confusion matrix', '\n', metrics.confusion_matrix(y_test_a, ann_adam_predict))

# Classification report
print(metrics.classification_report(y_test_a, ann_adam_predict))

Confusion matrix 
 [[    0 17463]
 [    0  1776]]
              precision    recall  f1-score   support

           0       0.00      0.00      0.00     17463
           1       0.09      1.00      0.17      1776

    accuracy                           0.09     19239
   macro avg       0.05      0.50      0.08     19239
weighted avg       0.01      0.09      0.02     19239



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [47]:
# Temporary classification report
temp_class_report = metrics.classification_report(y_test_a, ann_adam_predict, output_dict=True)
temp_class_report = pd.DataFrame(temp_class_report).round(2).transpose()
temp_class_report['Model'] = 'ANN ADAM'

# Merge with report df
class_report = pd.concat([class_report, temp_class_report], axis=0)

# Save
class_report.to_csv('classification_report.csv')

# Display final results
class_report



| |precision|recall|f1-score|support|Model|
|:-:|:-:|:-:|:-:|:-:|:-:|
|0|0.91|0.57|0.7|17463.0|logit|
|1|0.09|0.44|0.16|1776.0|logit|
|accuracy|0.56|0.56|0.56|0.56|logit|
|macro avg|0.5|0.51|0.43|19239.0|logit|
|weighted avg|0.83|0.56|0.65|19239.0|logit|
|0|0.91|1.0|0.95|17463.0|xgb|
|1|0.11|0.0|0.01|1776.0|xgb|
|accuracy|0.9|0.9|0.9|0.9|xgb|
|macro avg|0.51|0.5|0.48|19239.0|xgb|
|weighted avg|0.83|0.9|0.86|19239.0|xgb|


In [48]:
# Create AUC report
auc = {'Model': ['ANN ADAM SMOT'],
       'AUC': round(metrics.roc_auc_score(y_test, ann_adam_predict),4)}

# Add to the report
temp_auc_report = pd.DataFrame(auc)
auc_report = pd.concat([auc_report, temp_auc_report], axis=0)
auc_report

# Save
auc_report.to_csv('auc_report.csv')