# Should you use synthetic data for label balancing?

After working with tabular synthetic data for the past 6 months, I have encountered many articles claiming that synthetic data is the solution to many (machine learning) problems. While synthetic data can serve as a useful Privacy Enhancing Technology (PET) and has been shown to be useful in certain tasks, its usefulness and relevance are not always clearly assessed. An example of this, and also the inspiration for this article, is the article by Synthetic Data Vault (SDV) titled "Can you use synthetic data for label balancing?" (https://sdv.dev/blog/synthetic-label-balancing/).

The article addresses a well-known issue in classification problems: imbalanced target labels. It correctly identifies techniques like Random Oversampling (ROS) and noise injection while acknowledging their downsides (overfitting and noisy data). However, it then presents synthetic data as a "compelling solution" without providing enough evidence for this claim. While I am a fan of SDV, their generators, preprocessors, and constraints, the article overlooks important aspects of validating synthetic data for these problems. Although you definitely can use synthetic data for label balancing (to answer the question the article poses), the key question is whether you **should** use synthetic data and how it compares to state-of-the-art (SOTA) techniques.

Throughout this article, I aim to answer this question by building on the aforementioned article and comparing synthetic data produced by SDV generators against alternatives. Specifically, I compare data-level approaches such as noise injection, ROS, Synthetic Minority Over-sampling TEchnique (SMOTE), and CTGAN against the algorithm-level approach of cost-sensitive learning. This idea is not novel, and adjacent research is available in the literature. Adiputra and Wanchai (2024), for instance, compare similar resampling and synthetic data approaches. However, their validation approach resamples the data before applying cross-validation, which is a common pitfall that leads to data leakage.

This article aims to improve on this by providing a more methodologically sound approach while also providing the intuition and explanation for practitioners who are less familiar with imbalanced classification problems.
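
To make the leakage pitfall mentioned above concrete, here is a minimal sketch that tracks only row ids and uses the standard library. The toy sizes (10 minority rows, 90 majority rows, 5 folds) and the helpers `leaked_fraction` and `split_folds` are illustrative assumptions and are not part of the experiments in this article. It compares oversampling before splitting with splitting first and oversampling only the training part.

```python
import random

def leaked_fraction(train_ids, val_ids, minority_ids):
    """Share of minority validation rows whose source row also occurs in the training part."""
    train_set = set(train_ids)
    val_minority = [i for i in val_ids if i in minority_ids]
    if not val_minority:
        return 0.0
    return sum(i in train_set for i in val_minority) / len(val_minority)

def split_folds(ids, n_folds):
    """Split a list of ids into n_folds disjoint parts."""
    return [ids[k::n_folds] for k in range(n_folds)]

rng = random.Random(0)
n_folds = 5
minority = list(range(10))        # toy data: 10 minority rows, identified by id
majority = list(range(10, 100))   # toy data: 90 majority rows
minority_set = set(minority)

# Pitfall: oversample the minority class first, then create the folds.
oversampled = majority + [rng.choice(minority) for _ in range(len(majority))]
rng.shuffle(oversampled)
folds = split_folds(oversampled, n_folds)
leak_before = []
for k, val in enumerate(folds):
    train = [i for j, fold in enumerate(folds) if j != k for i in fold]
    leak_before.append(leaked_fraction(train, val, minority_set))

# Sounder: create stratified folds first, then oversample the training part only.
rng.shuffle(minority)
rng.shuffle(majority)
folds = [a + b for a, b in zip(split_folds(majority, n_folds), split_folds(minority, n_folds))]
leak_after = []
for k, val in enumerate(folds):
    train = [i for j, fold in enumerate(folds) if j != k for i in fold]
    train_minority = [i for i in train if i in minority_set]
    n_extra = (len(train) - len(train_minority)) - len(train_minority)
    train += [rng.choice(train_minority) for _ in range(n_extra)]
    leak_after.append(leaked_fraction(train, val, minority_set))

print("resample before split, leaked share per fold:", leak_before)
print("resample inside folds, leaked share per fold:", leak_after)
```

When resampling happens before the split, most minority rows in a validation fold have an exact copy in the corresponding training folds, so the validation score is partly measured on data the model has already seen. Resampling inside each training fold avoids this by construction.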

### Notes

- Check the helper_functions
    - Make a BaseSampler for the NoiseSampler and CTGANSampler
    - Run unit tests to see whether the samplers work as they should
    - move _get_num_samples to init?

## Imports

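Notably, to ensure a proper cross-validation procedure, I use a pipeline. Specifically, the `Pipeline` from imblearn (imbalanced-learn) is used instead of sklearn's, since it supports resampling steps, which are applied only when fitting and not when predicting.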

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt     
import seaborn as sns

import lightgbm as lgb

from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.pipeline import Pipeline
from sklearn.metrics import f1_score, confusion_matrix, classification_report, roc_auc_score, precision_score, recall_score
from sklearn.model_selection import train_test_split, StratifiedKFold
from skopt import BayesSearchCV

from sdv.metadata import Metadata
from sdv.single_table import CTGANSynthesizer, TVAESynthesizer

from helper_functions import NoiseSampler, ColumnScaler, SDVSampler, get_scores

In [26]:
import warnings
warnings.filterwarnings("ignore")

## Data exploration

For this analysis, the credit card fraud dataset from Kaggle (https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud) will be used. It contains transactions and a label indicating whether each one was fraudulent. The goal is to predict whether a transaction is fraudulent, making it a binary classification task. Naturally, the number of non-fraudulent transactions far outweighs the number of fraudulent ones, resulting in an imbalanced classification task.

The features consist mostly of principal components of the original data, which were transformed for confidentiality reasons. In addition, there is a variable called 'Time', which is the number of seconds elapsed since the first transaction in the dataset, 'Amount', which is the amount spent in the transaction, and 'Class', which indicates whether the transaction is fraudulent and is our target variable. Ultimately, the data consists mostly of floats, with the exception of our target, which is binary.

In [27]:
# Load the credit card fraud dataset
creditcard = pd.read_csv('../data/creditcard.csv')

# Inspect the columns and dtypes; all columns are kept for the analysis
print(creditcard.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

Next, we see how our target labels are distributed.

In [17]:
# Target label distribution rounded to 2 decimal places
round(creditcard['Class'].value_counts(normalize = True) * 100, 2)

Class
0    99.83
1     0.17
Name: proportion, dtype: float64

It becomes evident that the dataset is highly imbalanced, with only 0.17% of transactions being fraudulent. This can be problematic for machine learning algorithms if it is not accounted for, as models tend to be biased towards the non-fraudulent cases given their overrepresentation in the data. As a result, a model predicting every transaction to be non-fraudulent would already achieve an accuracy of 99.83%, even though every fraudulent transaction would be misclassified. This highlights the importance of carefully selecting evaluation metrics, as accuracy alone can be misleading in imbalanced classification problems.
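
As a small illustration of this point, the sketch below scores a trivial model that always predicts the majority class. The sample of 10,000 transactions with 17 fraudulent ones is a made-up example that mirrors the 0.17% fraud rate above, and the metric helpers are hand-written for illustration rather than taken from any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for_class(y_true, y_pred, label):
    """F1-score of a single class, defined as 0.0 when the class is never predicted or present."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical sample: 17 fraudulent (1) and 9,983 non-fraudulent (0) transactions
y_true = [1] * 17 + [0] * 9983
# Trivial model: every transaction is predicted to be non-fraudulent
y_pred = [0] * len(y_true)

acc = accuracy(y_true, y_pred)
f1_fraud = f1_for_class(y_true, y_pred, 1)
f1_macro = (f1_for_class(y_true, y_pred, 0) + f1_fraud) / 2

print(f"accuracy={acc:.4f}, F1 (fraud)={f1_fraud:.4f}, macro F1={f1_macro:.4f}")
```

The accuracy looks excellent, while the F1-score on the fraud class is zero and the macro-averaged F1 stays just below 0.5, which is why accuracy is not used as the evaluation metric here.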

### Data splitting

The next step is to split the data. We stratify on the target variable to ensure that the class proportions are the same across the train/test sets.

In [18]:
X = creditcard.drop('Class', axis = 1)
y = creditcard['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 2, stratify = y)

# Target label distribution rounded to 2 decimal places
print(round(y_train.value_counts(normalize = True) * 100, 2))
print(round(y_test.value_counts(normalize = True) * 100, 2))

Class
0    99.83
1     0.17
Name: proportion, dtype: float64
Class
0    99.83
1     0.17
Name: proportion, dtype: float64


## Metrics

As mentioned previously, the choice of metrics is far from trivial. However, given that it is beyond the scope of this blog, I will use the (macro-averaged) F1-score without diving too deep into the costs of misclassification.

### Algorithm

For the algorithm, I will use an LGBM classifier. Choosing the optimal estimator is beyond the scope of this blog. LGBM is chosen for its efficiency and relatively strong predictive power, and it is used consistently across all approaches.

## Settings

We define the folds and parameters to optimize over as these will be consistent across resampling techniques.

In [19]:
RANDOM_STATE = 2
N_ITER = 25

CV_FOLDS = StratifiedKFold(n_splits = 5, random_state = RANDOM_STATE, shuffle = True)

PARAM_GRID = {
    'LGB__n_estimators': [100, 200, 500, 800, 1000], 
    'LGB__max_depth': np.arange(4, 201, 4),  
    'LGB__learning_rate': [0.0001, 0.0005, 0.001, 0.01, 0.1]
}

## Noise injection

The first approach mentioned in the article is noise injection. Although I have not seen it used in practice and the article does not describe the noise-generating process, a uniform sampling procedure will be used here. Specifically, for each variable I extract the minimum and maximum values and use them as the bounds of a uniform distribution to sample from. As a consequence, the correlations between variables are ignored and the bivariate distributions will not be preserved.

Given that all variables, with the exception of the target variable, are numerical, this step is quite straightforward.
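
The uniform sampling idea can be sketched as follows. This is not the `NoiseSampler` from `helper_functions` (which is not shown here) but a hypothetical stand-in: the function `sample_uniform_noise`, the toy data, and the choice to take the bounds from whatever array is passed in are all assumptions made for illustration.

```python
import numpy as np

def sample_uniform_noise(X, n_samples, seed=0):
    """Draw rows column by column from Uniform(min, max) of each column in X."""
    rng = np.random.default_rng(seed)
    low, high = X.min(axis=0), X.max(axis=0)
    # low and high have one entry per column and broadcast across the sampled rows
    return rng.uniform(low, high, size=(n_samples, X.shape[1]))

# Toy data with two strongly correlated columns
rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)
X_toy = np.column_stack([x1, x2])

X_noise = sample_uniform_noise(X_toy, n_samples=1000)

corr_original = np.corrcoef(X_toy, rowvar=False)[0, 1]
corr_noise = np.corrcoef(X_noise, rowvar=False)[0, 1]
print(f"correlation in original data: {corr_original:.2f}")
print(f"correlation in uniform noise: {corr_noise:.2f}")
```

Because every column is sampled independently, the sampled rows stay within the observed ranges but lose the relationship between the columns, which is the limitation described above.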

In [6]:
# Define the sampling pipeline
pipeline_noise = Pipeline(
    [('noise_sampler', NoiseSampler()),
     ('LGB', lgb.LGBMClassifier(random_state = RANDOM_STATE, n_jobs = -1, verbose = -1))]
)

bs_noise = BayesSearchCV(pipeline_noise, PARAM_GRID, cv = CV_FOLDS, scoring = 'f1_macro',
                         n_iter = N_ITER, random_state = RANDOM_STATE)

bs_noise.fit(X_train, y_train)

# Obtain the best estimator and make predictions
noise_estimator = bs_noise.best_estimator_
y_pred_n = noise_estimator.predict(X_test)

## ROS

In [7]:
# Define the sampling pipeline
pipeline_ros = Pipeline(
    [('ros', RandomOverSampler()),
     ('LGB', lgb.LGBMClassifier(random_state = RANDOM_STATE, n_jobs = -1, verbose = -1))]
)

bs_ros = BayesSearchCV(pipeline_ros, PARAM_GRID, cv = CV_FOLDS, scoring = 'f1_macro',
                       n_iter = N_ITER, random_state = RANDOM_STATE)

bs_ros.fit(X_train, y_train)

# Obtain the best estimator and make predictions
ros_estimator = bs_ros.best_estimator_
y_pred_ros = ros_estimator.predict(X_test)

## SMOTE

In [20]:
# Define the sampling pipeline
pipeline_smote = Pipeline(
    [('scaler', ColumnScaler(['Amount', 'Time'])),
     ('smote', SMOTE()),
     ('LGB', lgb.LGBMClassifier(random_state = RANDOM_STATE, n_jobs = -1, verbose = -1))]
)

bs_smote = BayesSearchCV(pipeline_smote, PARAM_GRID, cv = CV_FOLDS, scoring = 'f1_macro',
                         n_iter = N_ITER, random_state = RANDOM_STATE)

bs_smote.fit(X_train, y_train)

# Obtain the best estimator and make predictions
smote_estimator = bs_smote.best_estimator_
y_pred_smote = smote_estimator.predict(X_test)

In [22]:
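# Evaluate the SMOTE pipeline on the held-out test set (macro-averaged over both classes)
precision = precision_score(y_test, y_pred_smote, average = 'macro')
recall = recall_score(y_test, y_pred_smote, average = 'macro')
f1 = f1_score(y_test, y_pred_smote, average = 'macro')
# Note: this is computed on hard 0/1 predictions rather than predicted probabilities,
# so for this binary task it equals the macro-averaged recall
roc_auc = roc_auc_score(y_test, y_pred_smote)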

In [25]:
f1

0.6850057789419532

## SDV Generators

The SDV generators that will be compared are CTGAN and TVAE.

In [9]:
metadata = Metadata.detect_from_dataframe(X_train)

### CTGAN

I will use the CTGAN synthesizer in this case with default parameters and without any constraints.

In [10]:
# Define the sampling pipeline
pipeline_ctgan = Pipeline(
    [('ctgan', SDVSampler(CTGANSynthesizer, metadata)),
     ('LGB', lgb.LGBMClassifier(random_state = RANDOM_STATE, n_jobs = -1, verbose = -1))]
)

bs_ctgan = BayesSearchCV(pipeline_ctgan, PARAM_GRID, cv = CV_FOLDS, scoring = 'f1_macro',
                         n_iter = N_ITER, random_state = RANDOM_STATE)

bs_ctgan.fit(X_train, y_train)

# Obtain the best estimator and make predictions
ctgan_estimator = bs_ctgan.best_estimator_
y_pred_ctgan = ctgan_estimator.predict(X_test)



### TVAE Synthesizer

I will also use the TVAE Synthesizer. Again, there will be no hyperparameter tuning and no constraints.

In [11]:
# Define the sampling pipeline
pipeline_tvae = Pipeline(
    [('tvae', SDVSampler(TVAESynthesizer, metadata)),
     ('LGB', lgb.LGBMClassifier(random_state = RANDOM_STATE, n_jobs = -1, verbose = -1))]
)

bs_tvae = BayesSearchCV(pipeline_tvae, PARAM_GRID, cv = CV_FOLDS, scoring = 'f1_macro',
                        n_iter = N_ITER, random_state = RANDOM_STATE)

bs_tvae.fit(X_train, y_train)

# Obtain the best estimator and make predictions
tvae_estimator = bs_tvae.best_estimator_
y_pred_tvae = tvae_estimator.predict(X_test)



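Note: this cell was interrupted (`KeyboardInterrupt`) before the search finished, so no TVAE predictions are available and TVAE is not included in the results.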

## Cost sensitive learning

I will use the inverse class frequency as weights, since the real costs associated with misclassification are not known. These weights are specified through the classifier's 'class_weight' parameter. Specifically, it is set to 'balanced' to obtain weights that are inversely proportional to the class frequencies.

Given that the folds are stratified, we assume that the weights for the entire train set are (roughly) equal to the weights within each fold. This results in the following weights being assigned:


In [12]:
X_train.shape[0] / (2 * np.bincount(y_train)) 

array([  0.50086612, 289.14340102])
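
The expression above computes, for each class, the number of samples divided by the number of classes times the class count. A minimal standard-library sketch of that formula is given below. The helper name `balanced_class_weights` is hypothetical, and the class counts (227,451 non-fraudulent and 394 fraudulent training rows) are an assumption, chosen because they are consistent with the weights printed above.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in sorted(counts.items())}

# Assumed training label counts, consistent with the weights shown above
y_toy = [0] * 227451 + [1] * 394

weights = balanced_class_weights(y_toy)
print(weights)
```

With these weights, a misclassified fraudulent transaction counts several hundred times more in the loss than a misclassified non-fraudulent one, so both classes contribute equally in total.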

The pipeline then becomes

In [13]:
# Define the pipeline (no resampling step; class weights are set on the classifier)
pipeline_csl = Pipeline(
    [('LGB', lgb.LGBMClassifier(class_weight= 'balanced', random_state = RANDOM_STATE,
                                n_jobs = -1, verbose = -1))]
)

bs_csl = BayesSearchCV(pipeline_csl, PARAM_GRID, cv = CV_FOLDS, scoring = 'f1_macro',
                       n_iter = N_ITER, random_state = RANDOM_STATE)

bs_csl.fit(X_train, y_train)

# Obtain the best estimator and make predictions
csl_estimator = bs_csl.best_estimator_
y_pred_csl = csl_estimator.predict(X_test)

## Results

In [14]:
y_preds = [y_pred_n, y_pred_ros, y_pred_smote, y_pred_ctgan, y_pred_csl]
sampling_procedures = ['Noise', 'ROS', 'SMOTE', 'CTGAN', 'Cost-Sensitive Learning']	

get_scores(y_test, y_preds, sampling_procedures)

| Method | Precision | Recall | F1 | ROC_AUC |
|---|---|---|---|---|
| Noise | 0.943006 | 0.897871 | 0.919223 | 0.897871 |
| ROS | 0.946105 | 0.923381 | 0.934445 | 0.923381 |
| SMOTE | 0.685136 | 0.92223 | 0.757077 | 0.92223 |
| CTGAN | 0.916508 | 0.908023 | 0.912222 | 0.908023 |
| Cost-Sensitive Learning | 0.950955 | 0.92339 | 0.936737 | 0.92339 |
