# Comparing Results: Synthetic Data vs Increased Model Penalty

Now that I have tested the same models with two different techniques for handling class imbalance, let's see which one performed better.

- Synthetic data for minority class: `classification_report.csv`
- Increased penalty (IP) for the positive (minority) class: `classification_report2.csv`
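
This write-up does not say how the penalty was increased, so the following is only a minimal sketch of one common approach: weighting each class inversely to its frequency. The label array `y_train` is a made-up toy example, the formula is the one scikit-learn documents for `class_weight='balanced'`, and the ratio at the end is the usual heuristic for XGBoost's `scale_pos_weight`.

```python
import numpy as np

# Toy labels (hypothetical, not the real data): 94 non-default, 6 default.
y_train = np.array([0] * 94 + [1] * 6)

counts = np.bincount(y_train)  # samples per class
n_samples, n_classes = y_train.size, counts.size

# "Balanced" weighting: n_samples / (n_classes * count_per_class).
class_weight = {c: n_samples / (n_classes * counts[c]) for c in range(n_classes)}

# Ratio of negatives to positives, commonly used for scale_pos_weight.
pos_weight = counts[0] / counts[1]

print(class_weight)
print(round(pos_weight, 2))
```

The rarer the class, the larger its weight, so a misclassified default client costs the model more than a misclassified non-default one.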


## Concerns
Since the EDA, I have noticed that the train and test sets aren't similar: their clients and risk-related features behave very differently. For instance, grade A clients have the same interest rate as risky grades, which doesn't make much sense, especially when the loan value is similar.

I have a couple of hypotheses for why this is happening:
1. We are missing other information that matters for risk analysis: age, location, investment account, occupation.
2. Suspect train set data: in my EDA, the test set shows the risk x rate relationship we would expect in this sort of scenario, but the **train set** doesn't seem to reflect it. One explanation for it not matching the real world is that the train set data is synthetic, generated by some algorithm like K-means.
3. Error in the **test set** label (y): as discussed, the test set has its entire label column `null`. This could be a data entry error where **all of them are zero (non-default)**, OR the labels are 1 and 0 but someone forgot to add them.
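
A quick way to check the grade versus rate mismatch is to compare the median interest rate per grade in each set. The sketch below uses made-up toy frames, and the column names `grade` and `int_rate` are assumptions, since the real column names are not shown here.

```python
import pandas as pd

def median_rate_by_grade(df, grade_col="grade", rate_col="int_rate"):
    """Median interest rate for each credit grade (column names are hypothetical)."""
    return df.groupby(grade_col)[rate_col].median().sort_index()

# Toy data for illustration only: a flat "train" set and a risk-sensitive "test" set.
toy_train = pd.DataFrame({"grade": ["A", "A", "D", "D"],
                          "int_rate": [12.0, 13.0, 12.5, 12.5]})
toy_test = pd.DataFrame({"grade": ["A", "A", "D", "D"],
                         "int_rate": [6.0, 7.0, 15.0, 17.0]})

train_rates = median_rate_by_grade(toy_train)
test_rates = median_rate_by_grade(toy_test)

# Gap between a risky grade and grade A; a gap near zero is the suspicious pattern.
train_gap = train_rates["D"] - train_rates["A"]
test_gap = test_rates["D"] - test_rates["A"]
print(train_gap, test_gap)
```

A gap close to zero in one set and a clearly positive gap in the other would be consistent with the second hypothesis.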

Since my dataset is not balanced, accuracy might be an unreliable metric to look at. I'll focus on macro avg, since it gives each class equal weight regardless of its size.
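
To make the difference between the two averages concrete, here is a small worked example. The per-class f1 and support values are copied from the logit (synthetic data) report in the Logits section; the variable names are only illustrative.

```python
# Per-class f1 and support, copied from the logit (synthetic data) report.
f1 = {0: 0.65, 1: 0.12}
support = {0: 26137, 1: 1776}

# Macro average: plain mean, every class counts the same.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: each class is weighted by its share of the samples.
total = sum(support.values())
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

print(f"macro f1: {macro_f1:.3f}")
print(f"weighted f1: {weighted_f1:.3f}")
```

The weighted figure stays close to the majority-class f1, while the macro figure is pulled down by the poor score on default clients, which is why it is the stricter metric here.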

In [1]:
import pandas as pd

In [2]:
# Load results sets
synt = pd.read_csv('classification_report.csv')
ip = pd.read_csv('classification_report2.csv')

# Concat sets into one
results = pd.concat([synt,ip], axis=0)

## Logits

In [9]:
print(results[results['Model'].str.contains('logit')])

     Unnamed: 0  precision  recall  f1-score   support     Model
0           0.0       0.94    0.50      0.65  26137.00     logit
1           1.0       0.07    0.55      0.12   1776.00     logit
2      accuracy       0.50    0.50      0.50      0.50     logit
3     macro avg       0.51    0.52      0.39  27913.00     logit
4  weighted avg       0.89    0.50      0.62  27913.00     logit
0           0.0       0.94    0.48      0.63  26137.00  logit IP
1           1.0       0.07    0.56      0.12   1776.00  logit IP
2      accuracy       0.48    0.48      0.48      0.48  logit IP
3     macro avg       0.50    0.52      0.38  27913.00  logit IP
4  weighted avg       0.89    0.48      0.60  27913.00  logit IP


Considering the **macro avg** **f1-score**, the models are not so different in terms of performance. While Logistic Regression with synthetic data was 0.01 point better, both models are far from identifying default clients (the ideal would be an f1 of 1, or close to it).

## XGBoosts

In [10]:
print(results[results['Model'].str.contains('xgb')])

     Unnamed: 0  precision  recall  f1-score   support   Model
5           0.0       0.94    1.00      0.97  26137.00     xgb
6           1.0       0.05    0.00      0.00   1776.00     xgb
7      accuracy       0.93    0.93      0.93      0.93     xgb
8     macro avg       0.50    0.50      0.48  27913.00     xgb
9  weighted avg       0.88    0.93      0.90  27913.00     xgb
5           0.0       0.94    0.80      0.86  26137.00  xgb IP
6           1.0       0.08    0.24      0.12   1776.00  xgb IP
7      accuracy       0.76    0.76      0.76      0.76  xgb IP
8     macro avg       0.51    0.52      0.49  27913.00  xgb IP
9  weighted avg       0.88    0.76      0.82  27913.00  xgb IP


Between the XGBoost models, increased penalty was 0.01 point better than synthetic data. Compared to **Logit with Synthetic data** (macro avg f1: 0.39), **XGBoost IP** (macro avg f1: 0.49) had better results, and was able to classify both classes (default = 1, and non-default = 0) better than the previous models.

## Light GBM

In [15]:
print(results[results['Model'].str.lower().str.contains('light')])

      Unnamed: 0  precision  recall  f1-score   support         Model
10           0.0       0.94    1.00      0.97  26137.00     Light GBM
11           1.0       0.06    0.00      0.00   1776.00     Light GBM
12      accuracy       0.94    0.94      0.94      0.94     Light GBM
13     macro avg       0.50    0.50      0.48  27913.00     Light GBM
14  weighted avg       0.88    0.94      0.91  27913.00     Light GBM
10           0.0       0.94    0.65      0.77  26137.00  Light GBM IP
11           1.0       0.07    0.41      0.12   1776.00  Light GBM IP
12      accuracy       0.63    0.63      0.63      0.63  Light GBM IP
13     macro avg       0.51    0.53      0.45  27913.00  Light GBM IP
14  weighted avg       0.89    0.63      0.73  27913.00  Light GBM IP


While XGBoost and **Light GBM** are both gradient boosting algorithms, their tree-growing strategies differ, and we see the impact on the results. **Light GBM** grows each tree leaf-wise: at every step it splits the leaf that yields the largest loss reduction, whereas XGBoost by default grows trees level by level. Leaf-wise growth tends to reach a lower error for the same number of leaves, but has a higher chance of overfitting.
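
Since leaf-wise growth overfits easily, the usual countermeasure is to constrain tree complexity. The sketch below shows hypothetical settings (not the ones used for the results here) built from three real Light GBM parameters, plus a small hypothetical helper that checks the rule of thumb of keeping `num_leaves` below `2 ** max_depth`.

```python
# Hypothetical settings, not the ones used for the results in this notebook.
params = {
    "num_leaves": 15,         # cap on leaves per tree (main complexity control)
    "max_depth": 5,           # hard limit on tree depth
    "min_child_samples": 50,  # minimum number of samples required in a leaf
}

def leaves_fit_depth(num_leaves: int, max_depth: int) -> bool:
    """True when num_leaves is below what a full tree of max_depth would have."""
    return num_leaves < 2 ** max_depth

print(leaves_fit_depth(params["num_leaves"], params["max_depth"]))
```

If Light GBM is installed, a dict like this can be unpacked into `lightgbm.LGBMClassifier(**params)`.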

Between the two imbalance techniques, **synthetic data** provided a better result (macro avg f1: 0.48). It's very similar to **XGBoost IP** (macro avg f1: 0.49); however, when we check the other Light GBM metrics, we see it was unable to classify the default (y=1) clients (recall: 0.00).

## SVM

In [16]:
print(results[results['Model'].str.lower().str.contains('svm')])

      Unnamed: 0  precision  recall  f1-score   support   Model
15           0.0       0.94    0.30      0.46  26137.00     SVM
16           1.0       0.07    0.73      0.12   1776.00     SVM
17      accuracy       0.33    0.33      0.33      0.33     SVM
18     macro avg       0.50    0.52      0.29  27913.00     SVM
19  weighted avg       0.89    0.33      0.44  27913.00     SVM
15           0.0       0.94    0.40      0.56  26137.00  SVM IP
16           1.0       0.07    0.65      0.12   1776.00  SVM IP
17      accuracy       0.42    0.42      0.42      0.42  SVM IP
18     macro avg       0.51    0.52      0.34  27913.00  SVM IP
19  weighted avg       0.89    0.42      0.53  27913.00  SVM IP


I was hoping that non-linearity would be the answer for a better model, but it seems I was wrong. Non-linearity doesn't seem to be better than a simple linear model (Logit).

Our winning model is still **XGBoost IP**.

## Artificial Neural Networks

### Stochastic Gradient Descent (SGD)

In [22]:
print(results[results['Model'].str.lower().str.contains('sgd')], '\n')
print(results[results['Model'].str.lower().str.contains('xgb ip')])

      Unnamed: 0  precision  recall  f1-score   support       Model
20           0.0       0.94    0.02      0.04  26137.00     ANN SGD
21           1.0       0.06    0.98      0.12   1776.00     ANN SGD
22      accuracy       0.08    0.08      0.08      0.08     ANN SGD
23     macro avg       0.50    0.50      0.08  27913.00     ANN SGD
24  weighted avg       0.89    0.08      0.04  27913.00     ANN SGD
20           0.0       0.94    0.99      0.96  26137.00  ANN SGD IP
21           1.0       0.05    0.01      0.01   1776.00  ANN SGD IP
22      accuracy       0.93    0.93      0.93      0.93  ANN SGD IP
23     macro avg       0.49    0.50      0.49  27913.00  ANN SGD IP
24  weighted avg       0.88    0.93      0.90  27913.00  ANN SGD IP 

     Unnamed: 0  precision  recall  f1-score   support   Model
5           0.0       0.94    0.80      0.86  26137.00  xgb IP
6           1.0       0.08    0.24      0.12   1776.00  xgb IP
7      accuracy       0.76    0.76      0.76      0.76  xgb IP
8     macro avg       0.51    0.52      0.49  27913.00  xgb IP
9  weighted avg       0.88    0.76      0.82  27913.00  xgb IP

With **ANN SGD IP** (macro avg f1: 0.49) tied with **XGBoost IP** (macro avg f1: 0.49), it's time to check other metrics to break the tie. We move to the weighted avg f1 score, which tells us the model performance with each class weighted by its support, so it is dominated by the majority class.

Now this is tricky: **ANN SGD IP** (weighted avg f1: 0.90) surpasses XGBoost IP (weighted avg f1: 0.82).
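
The tie-break can also be done programmatically. This is a minimal sketch that rebuilds only the four relevant rows by hand, using the same column layout as `results`; the names `tie`, `wide`, and `ranking` are illustrative.

```python
import pandas as pd

# The four rows involved in the tie, copied from the reports above.
tie = pd.DataFrame(
    [
        ("macro avg", 0.49, "xgb IP"),
        ("weighted avg", 0.82, "xgb IP"),
        ("macro avg", 0.49, "ANN SGD IP"),
        ("weighted avg", 0.90, "ANN SGD IP"),
    ],
    columns=["Unnamed: 0", "f1-score", "Model"],
)

# One row per model, one column per average type.
wide = tie.pivot(index="Model", columns="Unnamed: 0", values="f1-score")

# Sort by macro avg first, then use weighted avg to break ties.
ranking = wide.sort_values(["macro avg", "weighted avg"], ascending=False)
print(ranking)
```

The same pivot and sort should work on the full `results` frame once it is filtered to the `macro avg` and `weighted avg` rows.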

### ADAM

In [23]:
print(results[results['Model'].str.lower().str.contains('adam')], '\n')
print(results[results['Model'].str.lower().str.contains('xgb ip')])

      Unnamed: 0  precision  recall  f1-score   support        Model
25           0.0       0.94    1.00      0.97  26137.00     ANN ADAM
26           1.0       0.00    0.00      0.00   1776.00     ANN ADAM
27      accuracy       0.94    0.94      0.94      0.94     ANN ADAM
28     macro avg       0.47    0.50      0.48  27913.00     ANN ADAM
29  weighted avg       0.88    0.94      0.91  27913.00     ANN ADAM
25           0.0       0.00    0.00      0.00  26137.00  ANN ADAM IP
26           1.0       0.06    1.00      0.12   1776.00  ANN ADAM IP
27      accuracy       0.06    0.06      0.06      0.06  ANN ADAM IP
28     macro avg       0.03    0.50      0.06  27913.00  ANN ADAM IP
29  weighted avg       0.00    0.06      0.01  27913.00  ANN ADAM IP 

     Unnamed: 0  precision  recall  f1-score   support   Model
5           0.0       0.94    0.80      0.86  26137.00  xgb IP
6           1.0       0.08    0.24      0.12   1776.00  xgb IP
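7      accuracy       0.76    0.76      0.76      0.76  xgb IP
8     macro avg       0.51    0.52      0.49  27913.00  xgb IP
9  weighted avg       0.88    0.76      0.82  27913.00  xgb IP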

And here we see that changing the optimizer to the classic ADAM didn't help performance.

## Takeaways

Here I tested different models using two different techniques to deal with an imbalanced dataset. In general, increasing the penalty in models seems to perform better than generating synthetic data. This is quite expected, since we preserve the behavior of the unseen variables in our models.

Another interesting thing is the result of the models themselves. While it was quite difficult to find a good model, **ANN SGD IP** (macro avg f1: 0.49) presented the best option to predict default clients, even with all the differences and weird behavior I found in the EDA. It is important to note that even though this model displays the best result among the models, its precision for **default clients** is only 0.05 (with a recall of 0.01), which is really low.
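
To put the default-class numbers for **ANN SGD IP** in perspective, here is a back-of-the-envelope estimate. The inputs are the reported values rounded to two decimals, so the outputs are only rough orders of magnitude, not exact counts.

```python
# Rounded values from the ANN SGD IP report, class 1.0 (default).
precision, recall, support = 0.05, 0.01, 1776

# recall = TP / actual positives, so TP is roughly recall * support.
tp = recall * support

# precision = TP / flagged, so flagged is roughly TP / precision.
flagged = tp / precision

print(f"defaults caught: ~{tp:.0f} of {support}")
print(f"clients flagged as default: ~{flagged:.0f}")
```

Precision describes how many flagged clients really default, while recall describes how many real defaults get flagged; both are low for this model.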

It is also important to note how picking and testing optimizers changes the results of our models. While ADAM is the safe bet for most situations, SGD provided better results with this architecture and dataset.