# Should you use synthetic data for label balancing?

After working on synthetic data generation for the past six months, I have come across many articles claiming that synthetic data is the ultimate solution for nearly every machine learning problem. This perception is likely driven by the commercialization of the industry, where companies promote synthetic data as a universal fix. One example, and the inspiration for this article, is the Synthetic Data Vault (SDV) blog post titled "Can you use synthetic data for label balancing?" (https://sdv.dev/blog/synthetic-label-balancing/); the same applies to Gretel.

The article addresses a well-known issue in classification: imbalanced target labels. It correctly identifies common techniques like Random Oversampling (ROS) and noise injection while acknowledging their downsides (overfitting and the introduction of noise). However, it then presents synthetic data as a "compelling solution" without providing evidence. While I am a fan of SDV, their generators, preprocessors, and constraints, this article overlooks critical aspects. You definitely can use synthetic data for this case; the key question is whether you should, and how it compares to state-of-the-art (SOTA) techniques in this context.

Throughout this article, I aim to answer this question by building on the aforementioned article and comparing synthetic data produced by SDV generators against alternatives. Specifically, I compare data-level approaches such as noise injection, ROS, Synthetic Minority Over-sampling TEchnique (SMOTE), and CTGAN against the algorithm-level approach of cost-sensitive learning. This exploration is not novel, and adjacent research is available in the literature: Adiputra and Wanchai (2024) compare similar approaches. However, they resample the data (that is, rebalance the classes by adding or removing rows) before performing cross-validation, a common pitfall in imbalanced classification tasks that leads to data leakage.

This article aims to improve on this with a more methodologically sound approach, while providing intuition and explanation for practitioners who are less familiar with imbalanced classification problems.
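To make the leakage pitfall concrete, here is a minimal, dependency-free sketch. It is an illustration under assumptions, not the setup of the cited paper: the toy labels, the helper names `oversample` and `leaked_rows`, and the "every fifth row" validation fold are all made up for the example.

```python
import random

def oversample(indices, labels, rng):
    """Duplicate minority-class indices at random until both classes have equal size."""
    minority = [i for i in indices if labels[i] == 1]
    majority = [i for i in indices if labels[i] == 0]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return indices + extra

def leaked_rows(train, valid):
    """Count validation rows whose source row is also present in the training fold."""
    train_sources = set(train)
    return sum(1 for i in valid if i in train_sources)

rng = random.Random(0)
labels = [1] * 10 + [0] * 990            # toy data: 1% minority class
rows = list(range(len(labels)))

# Pitfall: resample first, then split. Copies of one row land on both sides.
resampled = oversample(rows, labels, rng)
valid_bad = resampled[::5]               # every fifth row is held out
train_bad = [r for pos, r in enumerate(resampled) if pos % 5 != 0]
leak_before = leaked_rows(train_bad, valid_bad)

# Sound: split first, then resample only the training fold.
valid_ok = rows[::5]
train_ok = oversample([r for r in rows if r % 5 != 0], labels, rng)
leak_after = leaked_rows(train_ok, valid_ok)

print(leak_before, leak_after)
```

In the first variant, the validation fold contains copies of minority rows the model has already seen during training, which inflates the validation score. Placing the sampler inside an imblearn `Pipeline`, as done in the rest of this article, corresponds to the second variant because the sampler is applied only when fitting on the training folds.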

### Notes

- Check the helper_functions
    - Make a BaseSampler for the NoiseSampler and CTGANSampler
    - Run unit tests to see whether the samplers work as they should
    - move _get_num_samples to init?

## Imports

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt     
import seaborn as sns

import lightgbm as lgb

from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.pipeline import Pipeline
from sklearn.metrics import f1_score, confusion_matrix, classification_report, roc_auc_score, precision_score, recall_score
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from skopt import BayesSearchCV

from helper_functions import NoiseSampler, ColumnScaler, CTGANSampler

## Data exploration

This analysis uses the credit card fraud dataset from Kaggle (https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud), which contains transactions and a label indicating whether each one was fraudulent. The goal is to predict whether a transaction is fraudulent, making it a binary classification task. Naturally, the number of genuine transactions far outweighs the number of fraudulent ones, resulting in an imbalanced classification task.

For confidentiality reasons, the dataset does not expose the original features: most of its columns are principal components of the original data. In addition, we have a variable called 'Time', which is the number of seconds elapsed since the first transaction in the dataset, 'Amount', which is the amount spent on the transaction, and 'Class', which indicates whether the transaction is fraudulent and is our target variable. Ultimately, the data consists of floats, with the exception of our target, which is binary.

In [3]:
creditcard = pd.read_csv('../data/creditcard.csv')

print(creditcard.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

Next, we see how our target labels are distributed.

In [3]:
# Target label distribution rounded to 2 decimal places
round(creditcard['Class'].value_counts(normalize = True) * 100, 2)

Class
0    99.83
1     0.17
Name: proportion, dtype: float64

It becomes evident that the dataset is highly imbalanced, with only 0.17% of transactions being fraudulent. If this is not accounted for, it can be problematic for machine learning algorithms, as models tend to be biased towards the non-fraudulent cases given their overrepresentation in the data. For example, a model predicting every transaction to be non-fraudulent would already reach an accuracy of 99.83%, even though every fraudulent transaction is misclassified. This highlights the importance of carefully selecting evaluation metrics, as accuracy alone can be misleading in imbalanced classification problems.
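The point about accuracy can be checked with a few lines of arithmetic. The sketch below uses hypothetical toy counts (17 fraud cases in 10,000 transactions, chosen only to mirror the 0.17% rate) rather than the real dataset.

```python
# Hypothetical toy counts mirroring the 0.17% fraud rate (not the real dataset)
n_total = 10_000
n_fraud = 17

y_true = [1] * n_fraud + [0] * (n_total - n_fraud)
y_pred = [0] * n_total                   # "always genuine" baseline

# Share of correct predictions overall
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_total

# Share of fraud cases that were actually caught
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / n_fraud

print(f"accuracy = {accuracy:.4f}, fraud recall = {recall:.4f}")
# prints: accuracy = 0.9983, fraud recall = 0.0000
```

The baseline looks excellent on accuracy while catching no fraud at all, which is why a metric that accounts for the minority class is needed.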

### Data splitting

The next step is to split the data. We stratify on the target variable so that the train and test sets keep the same class proportions.

In [4]:
X = creditcard.drop('Class', axis = 1)
y = creditcard['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 2, stratify = y)

# Target label distribution rounded to 2 decimal places
print(round(y_train.value_counts(normalize = True) * 100, 2))
print(round(y_test.value_counts(normalize = True) * 100, 2))

Class
0    99.83
1     0.17
Name: proportion, dtype: float64
Class
0    99.83
1     0.17
Name: proportion, dtype: float64


## Metrics

As mentioned previously, the choice of metric is far from trivial. However, since a full treatment is beyond the scope of this blog, I will use the F1-score (macro-averaged, via the 'f1_macro' scorer) without diving too deep into the costs of misclassification.
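As a brief illustration of what the macro-averaged F1-score measures, the sketch below computes it by hand for a binary problem. The confusion counts are hypothetical and the helper `f1_from_counts` is introduced only for this example; in the notebook itself the metric comes from scikit-learn.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 here when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical confusion counts for 10,000 transactions (illustration only)
tn, fp, fn, tp = 9_980, 3, 5, 12

f1_fraud = f1_from_counts(tp, fp, fn)      # fraud treated as the positive class
f1_genuine = f1_from_counts(tn, fn, fp)    # roles swapped: genuine as positive
f1_macro = (f1_fraud + f1_genuine) / 2     # unweighted mean over the two classes

print(round(f1_fraud, 4), round(f1_genuine, 4), round(f1_macro, 4))
```

Because both classes count equally in the average, weak performance on the rare fraud class (0.75 here) pulls the macro score down noticeably, even though the majority class is predicted almost perfectly.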

## Algorithm

For the algorithm, I will use an LGBM classifier. Choosing the optimal estimator is beyond the scope of this blog; LightGBM is chosen for its efficiency and relatively strong predictive power, and it is used consistently across all resampling techniques.

## Settings

We define the folds and the hyperparameter search space up front, as these will be consistent across resampling techniques.

In [5]:
CV_FOLDS = StratifiedKFold(n_splits = 5, random_state = 2, shuffle = True)

# Define the hyperparameter search space (the 'LGB__' prefix targets the pipeline's 'LGB' step)
PARAM_GRID = {
    'LGB__n_estimators': [100, 200, 500, 800, 1000], 
    'LGB__max_depth': np.arange(4, 201, 4),  
    'LGB__learning_rate': [0.0001, 0.0005, 0.001, 0.01, 0.1]
}

N_ITER = 50

test_results = {}

## Noise injection

Honestly, I am a bit surprised this was even recommended as an option in the article, as I have never seen anyone use it in a practical setting. Nonetheless, I will apply it as well to compare the results, albeit in a very simplified manner. The article does not clearly indicate what the data-generating process for the noise is, so I will use random sampling: for each variable, I extract its minimum and maximum values and sample from a uniform distribution between them. As a result, the correlations between variables are ignored and the bivariate distributions will not be preserved.

Given that all variables, with the exception of the target variable, are numerical, this step is quite straightforward.
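The sampling scheme described above can be sketched as follows. This is a simplified, dependency-free stand-in and not the actual `NoiseSampler` from `helper_functions`: the function name `sample_uniform_noise`, the toy rows, and the choice to derive the bounds from the minority-class rows are assumptions for illustration.

```python
import random

def sample_uniform_noise(rows, n_samples, seed=0):
    """Draw n_samples synthetic rows, sampling each feature uniformly and
    independently between that feature's observed minimum and maximum."""
    rng = random.Random(seed)
    columns = list(zip(*rows))                        # transpose rows -> columns
    bounds = [(min(col), max(col)) for col in columns]
    return [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_samples)]

# Hypothetical minority-class rows with two perfectly correlated features
minority = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
synthetic = sample_uniform_noise(minority, n_samples=5)

for row in synthetic:
    print([round(v, 2) for v in row])
```

In the toy input the second feature is always ten times the first, but the sampled rows do not follow that relationship, which is exactly the loss of correlation structure mentioned above.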

In [None]:
# Define the sampling pipeline
pipeline_noise = Pipeline(
    [('noise_sampler', NoiseSampler()),
     ('LGB', lgb.LGBMClassifier(random_state = 2, n_jobs = -1))]
)

bs_noise = BayesSearchCV(pipeline_noise, PARAM_GRID, cv = CV_FOLDS, scoring = 'f1_macro',
                         verbose = 1, n_iter = 5, random_state = 2)

bs_noise.fit(X_train, y_train)

# Obtain the best estimator and make predictions
noise_estimator = bs_noise.best_estimator_
y_pred_n = noise_estimator.predict(X_test)

Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[LightGBM] [Info] Number of positive: 227451, number of negative: 227451
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.021799 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 7650
[LightGBM] [Info] Number of data points in the train set: 454902, number of used features: 30
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000


## ROS

In [None]:
# Define the sampling pipeline
pipeline_ros = Pipeline(
    [('ros', RandomOverSampler(random_state = 2)),  # seed the sampler for reproducibility
     ('LGB', lgb.LGBMClassifier(random_state = 2, n_jobs = -1))]
)

bs_ros = BayesSearchCV(pipeline_ros, PARAM_GRID, cv = CV_FOLDS, scoring = 'f1_macro',
                       verbose = 1, n_iter = 5, random_state = 2)

bs_ros.fit(X_train, y_train)

# Obtain the best estimator and make predictions
ros_estimator = bs_ros.best_estimator_
y_pred_ros = ros_estimator.predict(X_test)


## SMOTE

In [None]:
# Define the sampling pipeline
pipeline_smote = Pipeline(
    [('scaler', ColumnScaler()),
     ('smote', SMOTE(random_state = 2)),  # seed the sampler for reproducibility
     ('LGB', lgb.LGBMClassifier(random_state = 2, n_jobs = -1))]
)

bs_smote = BayesSearchCV(pipeline_smote, PARAM_GRID, cv = CV_FOLDS, scoring = 'f1_macro',
                         verbose = 1, n_iter = 5, random_state = 2)

bs_smote.fit(X_train, y_train)

# Obtain the best estimator and make predictions
smote_estimator = bs_smote.best_estimator_
y_pred_smote = smote_estimator.predict(X_test)
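
For readers unfamiliar with SMOTE, the core idea can be sketched in a few lines: a synthetic point is placed somewhere on the line segment between a minority point and one of its nearest minority neighbours. This is a simplified, dependency-free illustration and not imblearn's implementation; the function name `smote_like_sample`, the toy points, and `k = 1` are assumptions made for the example.

```python
import math
import random

def smote_like_sample(minority, k=1, n_samples=4, seed=0):
    """Create synthetic points on the line segment between a random minority
    point and one of its k nearest minority neighbours (Euclidean distance)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_samples):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()                            # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical minority points: feature 2 has a far larger scale than feature 1
minority = [[0.1, 100.0], [0.2, 900.0], [0.9, 110.0], [0.8, 880.0]]
synthetic = smote_like_sample(minority)

for point in synthetic:
    print([round(v, 2) for v in point])
```

In this toy data the nearest neighbour of `[0.1, 100.0]` is `[0.9, 110.0]`, because the large-scale second feature dominates the distance. That sensitivity to feature scale is presumably the reason a scaling step precedes SMOTE in the pipeline above.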

## CTGAN

I will use the CTGAN synthesizer in this case with default parameters and without any constraints.

In [None]:
# Define the sampling pipeline
pipeline_ctgan = Pipeline(
    [('ctgan', CTGANSampler()),
     ('LGB', lgb.LGBMClassifier(random_state = 2, n_jobs = -1))]
)

bs_ctgan = BayesSearchCV(pipeline_ctgan, PARAM_GRID, cv = CV_FOLDS, scoring = 'f1_macro',
                         verbose = 1, n_iter = 5, random_state = 2)

bs_ctgan.fit(X_train, y_train)

# Obtain the best estimator and make predictions
ctgan_estimator = bs_ctgan.best_estimator_
y_pred_ctgan = ctgan_estimator.predict(X_test)

## Cost-sensitive learning

I will use the inverse class frequency as class weights, since the real costs associated with misclassification are not known.

Open question: should I define the weights inside the LGBM classifier? Since it has a 'class_weight' parameter, that would be easier, but could it lead to data leakage? I would have to define the weights based on the entire train set instead of only the observations that end up in each training fold. Does this matter, though? I have stratified everywhere (including the folds), so the weights should be practically the same regardless. Check later with SG. Also check whether there is an easy option to do this in LGBM.

In [10]:
# Inverse class frequency: 1 / (share of each class in the training labels)
1 / y_train.value_counts(normalize = True).values

array([  1.00173224, 578.28680203])
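
As a rough check of the stratification argument, the sketch below compares inverse-frequency weights computed on a full training set with those computed on a single stratified training fold. The label counts are hypothetical and the helper `inverse_frequency_weights` is introduced only for this illustration.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by 1 / (its share of the labels)."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / n for cls, n in sorted(counts.items())}

# Hypothetical training labels: 20 fraud cases in 10,000 rows (illustration only)
y_train_toy = [1] * 20 + [0] * 9_980

# A stratified 5-fold split keeps 4/5 of each class in every training fold
train_fold = [1] * 16 + [0] * 7_984

w_full = inverse_frequency_weights(y_train_toy)
w_fold = inverse_frequency_weights(train_fold)

print(w_full)
print(w_fold)
```

With exact stratification the two sets of weights coincide; with real data they can differ slightly because class counts rarely divide evenly across folds. As far as I know, `LGBMClassifier(class_weight = 'balanced')` derives its weights from the labels passed to `fit`, so inside cross-validation they would be computed per training fold anyway. Note that 'balanced' scales the weights differently from the plain inverse frequency, although the ratio between the classes is the same.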