# “Synthetic Data for Label Balancing: Should You Use It and How?”

Over the past six months of working with tabular synthetic data, I’ve come across many articles that promote it as a catch-all solution for various machine learning challenges. While synthetic data can serve as a useful Privacy Enhancing Technology (PET) and has been shown to be useful in certain tasks, its usefulness and relevance are not always clearly assessed. One example, and the inspiration for this article, is the Synthetic Data Vault (SDV) blog post titled "Can you use synthetic data for label balancing?" (https://sdv.dev/blog/synthetic-label-balancing/).

SDV’s article addresses a common challenge in classification: imbalanced target labels. It discusses traditional data-level solutions such as noise injection and Random Oversampling (ROS), correctly noting their limitations. However, it then proposes synthetic data as a 'compelling solution' without any empirical evidence. While I am a fan of SDV’s generators, constraints, and preprocessors, the article overlooks important aspects of evaluating synthetic data for label balancing. To answer its title question: yes, you definitely can use synthetic data for label balancing. The key question is whether you **should**, and how it compares to state-of-the-art techniques. In this article, I aim to answer that question by comparing synthetic data produced by SDV synthesizers against various data-level and algorithm-level approaches.

As you might be thinking (and I was too), this idea probably isn't very novel, and indeed similar research exists in the literature. The work of Adiputra and Wanchai (2024), which compares similar data-level approaches, caught my eye. However, their validation approach generates the synthetic data before cross validation (CV), a common pitfall that leads to data leakage and biased results. A similar, though not identical, mistake appears in section 5.7.4 of https://d2l.ai/chapter_multilayer-perceptrons/kaggle-house-price.html, where transformations should be applied separately to the train and holdout data.

This article therefore has two aims. The first is to answer whether you should use synthetic data for imbalanced classification tasks. The second is to address pitfalls in cross validation that lead to data leakage between the train and holdout folds: why this is problematic, and how to set up a CV procedure correctly.



## Imports

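To ensure a proper cross validation procedure, I use a pipeline. Specifically, the Pipeline from imbalanced-learn (imblearn) is used over scikit-learn's, because it supports resampling steps and applies them only while fitting, not when predicting.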

In [None]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt     

import lightgbm as lgb

from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.pipeline import Pipeline
from imblearn.combine import SMOTEENN

from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split

from sdv.metadata import Metadata
from sdv.single_table import CTGANSynthesizer, TVAESynthesizer

from helper_functions import NoiseSampler, ColumnScaler, SDVSampler, SDVENN, display_scores

## Data exploration

For this analysis, I use the credit card fraud dataset from Kaggle (https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud), which contains transactions and whether they were fraudulent or not. To preserve privacy, most of the features are principal components derived from the original data. The goal is to predict whether a transaction is fraudulent, making it a binary classification task. The 'Class' variable indicates whether a transaction is fraudulent (1) or not (0).


In [3]:
creditcard = pd.read_csv('./data/creditcard.csv')

Next, we see how our target labels are distributed.

In [4]:
# Target label distribution rounded to 2 decimal places
round(creditcard['Class'].value_counts(normalize = True) * 100, 2)

Class
0    99.83
1     0.17
Name: proportion, dtype: float64

Naturally, the number of non-fraudulent transactions far outweighs the number of fraudulent ones, resulting in an imbalanced classification task. The imbalance is severe, with only 0.17% of transactions being fraudulent. This can pose challenges for machine learning models, which may become biased toward predicting the majority class. To address this issue, several data-level and algorithm-level techniques exist.

## Settings

Next, we define the random seed for reproducibility, the metrics used to evaluate the models, and the CV folds.

In [None]:
RANDOM_STATE = 2

SCORINGS = ['f1', 'roc_auc', 'precision', 'recall']

CV_FOLDS = StratifiedKFold(n_splits = 5, random_state = RANDOM_STATE, shuffle = True)

## Data splitting

The next step is to split the data into a CV set and a test set. We stratify on the target variable so that both sets have the same class distribution. The CV set is used for cross validation, and the test set remains untouched to show how the results of an improper cross validation procedure generalize to truly unseen data.

In [6]:
X = creditcard.drop('Class', axis = 1)
y = creditcard['Class']

X_cv, X_test, y_cv, y_test = train_test_split(X, y, test_size = 0.2, stratify = y,
                                                    random_state = RANDOM_STATE)



## Incorrect cross validation procedure 

First, let's look at what an incorrect cross validation setup looks like. Suppose we wish to use ROS, which balances an imbalanced dataset by duplicating minority class instances, to randomly oversample the fraud instances. A common mistake, also made by Adiputra and Wanchai (2024), is to perform this resampling before splitting the data into folds.

This cross validation mistake is easily made and looks like this:

In [None]:
# We resample the data before performing cross-validation
ros = RandomOverSampler(random_state = RANDOM_STATE)
X_res, y_res = ros.fit_resample(X_cv, y_cv)

# Feed the already resampled data to the cross-validation
cv_score = cross_validate(
    lgb.LGBMClassifier(random_state = RANDOM_STATE, n_jobs = -1, verbose = -1),
    X_res, y_res, cv = CV_FOLDS, scoring = SCORINGS, n_jobs = -1
)

cv_score = cv_score['test_f1'].mean()
print(f"Cross validation score: {cv_score:.4f}")

Cross validation score: 0.9999


This approach leads to near perfect scores, which should raise some suspicion about the validity of the results. So how do these results translate to truly unseen data, i.e. the test set that was not passed to the RandomOverSampler?


In [8]:
# We define the model and fit it to the resampled cv data
lgbm_classifier = lgb.LGBMClassifier(random_state = RANDOM_STATE, n_jobs = -1, 
                                     verbose = -1)
lgbm_classifier.fit(X_res, y_res)

# predictions on the unseen test data and evaluation
preds = lgbm_classifier.predict(X_test)
test_score = f1_score(y_test, preds)
print(f"Test score: {test_score:.4f}")

Test score: 0.8442


The test results are lower than the cross-validation results, indicating that the cross-validation score obtained this way is not representative of the performance on truly unseen data. Furthermore, you might also want to tune hyperparameters during cross-validation. In this case, hyperparameters are selected based on an incorrect cross-validation procedure and overfit to that process, damaging generalization of your models.



### How do you resample the correct way?

There are multiple ways to do it: iterating over the folds in a loop and manually applying the transformations to the train and holdout sets separately, or using a pipeline. A pipeline is the more concise and easy-to-use alternative.

First, a manual loop is used to get a better feel for what the transformations should look like and what exactly is going wrong in the previous setup. Afterwards, a pipeline is used for more concise code.

In [9]:
# track the test scores for each fold and average them afterwards
cv_results = []

for i, (train_index, hd_index) in enumerate(CV_FOLDS.split(X_cv, y_cv)):
    
    # Obtain train and holdout sets for the current fold
    X_train, X_hd = X_cv.iloc[train_index], X_cv.iloc[hd_index]
    y_train, y_hd = y_cv.iloc[train_index], y_cv.iloc[hd_index]
    
    # Applying the ROS to the training set of the current fold as opposed to beforehand for the entire cv set
    ros = RandomOverSampler(random_state = RANDOM_STATE)
    X_train_res, y_train_res = ros.fit_resample(X_train, y_train)
    
    # Fit the model to train and predict the holdout set that is not resampled
    lgb_clf = lgb.LGBMClassifier(random_state = RANDOM_STATE, n_jobs = -1, verbose = -1)
    lgb_clf.fit(X_train_res, y_train_res)
    cv_preds = lgb_clf.predict(X_hd)
    
    # Calculate and append F1 score for the current fold
    cv_score = f1_score(y_hd, cv_preds)
    cv_results.append(cv_score)
    
print(f"F1 score with correct CV procedure: {np.mean(cv_results):.4f}")

F1 score with correct CV procedure: 0.8384


Although this score is less impressive than that of the first method, it much more accurately reflects the score of 0.8442 on the truly unseen test set. But what went wrong in the first setup?

## What is going wrong?

Although the change in procedure may seem small, resampling before cross-validation can significantly affect how well the results generalize. The main idea behind a train and holdout split is to evaluate the model on data it has not seen before. Cross-validation applies this same concept across multiple folds to limit the variance that comes from making a single random split. In each fold, the holdout set serves as a separate, unseen set of data to test the model on. Therefore, in each fold, no information from the holdout set should leak into the train set.

By using ROS before cross validation in this example, we create duplicates of fraud instances across the ENTIRE dataset. During cross-validation, these duplicated fraud instances can end up in the holdout set of a fold whereas the original ends up in the train set of a fold (or vice versa). As a result, the holdout set does not consist of truly unseen data and the model is tasked to predict the outcome of an observation it has already seen during training.

To demonstrate this, we look at the exact duplicates across folds when ROS is used before and during cross-validation.

In [None]:
# Isolate fraudulent transactions
X_pos = X[y == 1]

# Determine the proportion of duplicates in the positive class
prop_dupl = X_pos.duplicated(keep = False).sum() / X_pos.shape[0]
print(f"Duplicates amongst fraudulent transactions: {prop_dupl:.4f}")

Duplicates amongst fraudulent transactions: 0.0650


This indicates that approx. 6.5% of fraud instances are not unique. Using stratification across the folds, we would expect to see a similar percentage of duplicates throughout cross-validation. 

First, the number of duplicated fraud instances is displayed for the data that is resampled before cross-validation, i.e. the X_res and y_res from earlier. More specifically, we look at how many holdout observations are duplicates of train observations in each fold.

In [12]:
# Again, resampling the data before cross-validation
ros = RandomOverSampler(random_state = RANDOM_STATE)
X_res, y_res = ros.fit_resample(X_cv, y_cv)

# to track the proportion of duplicates
dupl_percentages = []

# Loop over each fold using the already resampled data
for i, (train_index, hd_index) in enumerate(CV_FOLDS.split(X_res, y_res)):
    
    # Obtain train and holdout sets for the current fold
    X_train, X_hd = X_res.iloc[train_index], X_res.iloc[hd_index]
    y_train, y_hd = y_res.iloc[train_index], y_res.iloc[hd_index]
    
    # Isolate the fraud instances
    train_pos = X_train[y_train == 1]
    hd_pos = X_hd[y_hd == 1]
    
    # Check to see the observations in holdout that are duplicates of the training set
    mask = hd_pos.apply(tuple, axis = 1).isin(train_pos.apply(tuple, axis = 1))
    duplicates = hd_pos[mask]
    
    # Determine the proportion of holdout observation that are duplicates
    prop_dupl = duplicates.shape[0] / hd_pos.shape[0] * 100
    dupl_percentages.append(prop_dupl)
    
print(f"Proportion of duplicates in holdout set across folds: {np.mean(dupl_percentages):.2f}%")


Proportion of duplicates in holdout set across folds: 100.00%


The average percentage of duplicates across folds is 100% if we resample before performing cross-validation. This means that every fraud instance in the holdout set is a duplicate of an observation in the train set, which explains the near perfect F1-score we saw earlier. The high percentage of duplicates is largely due to the severe class imbalance, and with other synthesizers the issue will not always present itself in the form of exact duplicates. Still, the approach is fundamentally flawed. It is not a valid evaluation setup, because synthetic or resampled data should not be part of the holdout set used to evaluate a model. More generally, the information contained in the observations that end up in the holdout set should not be used to transform the observations that end up in the train set, a concept that is often overlooked when applying cross-validation.

## What does the correct way look like?

The same steps as above are repeated, i.e. checking the number of duplicates in the holdout set, but now a valid cross-validation procedure is used by applying resampling WITHIN each fold as opposed to beforehand.

In [None]:
# Again, we track the proportion of duplicates 
dupl_percentages = []

for i, (train_index, hd_index) in enumerate(CV_FOLDS.split(X_cv, y_cv)):
   
    # Obtain train and holdout sets for the current fold
    X_train, X_hd = X_cv.iloc[train_index], X_cv.iloc[hd_index]
    y_train, y_hd = y_cv.iloc[train_index], y_cv.iloc[hd_index]
    
    # WITHIN each fold, we resample the training set
    ros = RandomOverSampler(random_state = RANDOM_STATE)
    X_res, y_res = ros.fit_resample(X_train, y_train)
    
    # Only then do we isolate the positive class
    train_pos = X_res[y_res == 1]
    hd_pos = X_hd[y_hd == 1]
    
    # Check to see the observations in holdout that are duplicates of the training set
    mask = hd_pos.apply(tuple, axis = 1).isin(train_pos.apply(tuple, axis = 1))
    duplicates = hd_pos[mask]
    
    # determine the proportion of holdout observation that are duplicates
    prop_dupl = duplicates.shape[0] / hd_pos.shape[0] * 100
    dupl_percentages.append(prop_dupl)
    
print(f"Proportion of duplicates in test set across folds: {np.mean(dupl_percentages):.2f}%")

Proportion of duplicates in test set across folds: 6.35%



With this, we can see that the proportion of duplicates across folds now much more resembles that obtained from the entire dataset, indicating that the folds are representative. This confirms that resampling within each fold preserves the original data characteristics and avoids data leakage. 

The effect of an improper cross validation procedure is clear in this example with ROS because of the exact duplicates. With synthetic data generators such as the ones in SDV, the leakage might be less obvious, depending on the severity of the class imbalance, the generator used, and the degree to which it overfits to the data. Furthermore, data leakage between train and holdout is not limited to resampling methods in imbalanced classification tasks. It also applies to other, perhaps more subtle, cases, such as feature engineering methods that use information from observations that end up in the holdout fold. Think of scaling: ideally, a scaler is fit only on the training set and then applied to both the training and holdout sets, and this is repeated for each fold.

Not all feature engineering leads to data leakage between train and holdout, though. Think of encoding a categorical gender variable from male/female to binary, which is typically fine since the transformation only uses information contained within each row. However, this can get tricky with a high cardinality categorical variable in which certain values are very rare, meaning that they might occur in the holdout fold but not in the train fold.

For this reason, it's important to be aware of what transformations are being applied, when they are applied, and what information they rely on. It's best practice to perform all transformations within cross-validation rather than beforehand to prevent data leakage. 
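
To make the scaling case concrete, here is a minimal sketch of a scaler that is refit inside every fold. The toy data, the logistic regression model, and all variable names are assumptions for illustration only and are not part of the analysis in this article.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (an assumption for illustration, not the credit card dataset)
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_toy = (X_toy[:, 0] + rng.normal(scale=5.0, size=200) > 50.0).astype(int)

# The scaler is a pipeline step, so cross_validate refits it on the
# training part of every fold and only applies it to the holdout part
scaled_model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
toy_scores = cross_validate(scaled_model, X_toy, y_toy, cv=folds, scoring=["f1"])
print(f"Mean F1 across folds: {toy_scores['test_f1'].mean():.3f}")
```

Calling the scaler on the full data before cross_validate would run without errors as well, which is exactly why this kind of leakage is easy to miss.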

# Actual idea

As mentioned previously, I compare data-level and algorithm-level approaches against SDV generators.

The data-level approaches consist of oversampling and hybrid sampling techniques. The oversampling techniques are ROS, noise injection, SMOTE, and, from SDV, CTGAN and TVAE. Hybrid sampling, which combines oversampling with undersampling, is included because the approach of Adiputra and Wanchai (2024) seemed interesting as well.

Finally, cost-sensitive learning (CSL) is included as the algorithm-level approach.

# Data level approaches

First, the data-level approaches. Previously, we performed cross validation by manually looping over the folds. Although this provides a lot of control, there is an easier and cleaner way: a Pipeline performs the same operations. The pipelines used here ensure that transformations are always fit on the train set within each fold, reducing the risk of data leakage.

## Random OverSampling

Again we use ROS, but this time we use the entire dataset and a pipeline throughout cross-validation.

In [None]:
# Define the sampling pipeline
pipeline_ros = Pipeline(
    [('ros', RandomOverSampler(random_state = RANDOM_STATE)),
     ('LGB', lgb.LGBMClassifier(random_state = RANDOM_STATE, n_jobs = -1, verbose = -1))]
)

cv_score_ros = cross_validate(pipeline_ros, X, y, cv = CV_FOLDS, scoring = SCORINGS, 
                               n_jobs = -1)

## Noise injection

The first approach mentioned in the article by SDV is noise injection. Since I have never seen this technique used in practice and the article does not describe the noise generating process, I assume that values are randomly generated from a normal distribution. Specifically, the mean and standard deviation of each variable are determined and used to sample from a normal distribution. Given that all variables, with the exception of the target variable, are numerical, this step is quite straightforward.

In [None]:
pipeline_noise = Pipeline(
    [('noise_sampler', NoiseSampler(random_state = RANDOM_STATE)),
     ('LGB', lgb.LGBMClassifier(random_state = RANDOM_STATE, n_jobs = -1, verbose = -1))]
)

cv_score_noise = cross_validate(pipeline_noise, X, y, cv = CV_FOLDS, scoring = SCORINGS, 
                                n_jobs = -1)
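
The NoiseSampler itself lives in helper_functions and is not shown here. As a rough sketch of the idea described above, the function below draws new minority rows from independent per-feature normal distributions. The function name, the choice to estimate the mean and standard deviation on the minority class only, and the toy data are all assumptions, so the actual helper may differ.

```python
import numpy as np

def gaussian_noise_oversample(X, y, minority_label=1, random_state=None):
    """Balance a binary problem by sampling minority rows from per-feature normals."""
    rng = np.random.default_rng(random_state)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    X_min = X[y == minority_label]
    n_new = int((y != minority_label).sum() - len(X_min))
    if n_new <= 0:
        return X, y  # already balanced, nothing to add

    # Per-feature mean and standard deviation of the minority class (assumption)
    mu = X_min.mean(axis=0)
    sigma = X_min.std(axis=0)

    # Every feature is sampled independently, so correlations are ignored
    X_new = rng.normal(loc=mu, scale=sigma, size=(n_new, X.shape[1]))
    y_new = np.full(n_new, minority_label, dtype=y.dtype)
    return np.vstack([X, X_new]), np.concatenate([y, y_new])

# Toy example: four majority rows and two minority rows
X_demo = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 1.0],
                   [10.0, 10.0], [12.0, 8.0]])
y_demo = np.array([0, 0, 0, 0, 1, 1])
X_bal, y_bal = gaussian_noise_oversample(X_demo, y_demo, random_state=2)
print(np.bincount(y_bal))  # class counts after oversampling
```

As with every other sampler in this article, such a function should only ever see the training part of a fold.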

## SMOTE
Next up is SMOTE (Synthetic Minority Over-sampling Technique), a common oversampling method that generates synthetic examples of the minority class by interpolating between observations based on their distance. Since SMOTE relies on distance calculations, we also preprocess the data by scaling it in our pipeline.

In [21]:
pipeline_smote = Pipeline(
    [('scaler', ColumnScaler(['Amount', 'Time'])),
     ('smote', SMOTE(random_state = RANDOM_STATE)),
     ('LGB', lgb.LGBMClassifier(random_state = RANDOM_STATE, n_jobs = -1, verbose = -1))]
)

cv_score_smote = cross_validate(pipeline_smote, X, y, cv = CV_FOLDS, scoring = SCORINGS, 
                                n_jobs = -1)
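
To illustrate what interpolating between observations means, here is a simplified sketch of the core SMOTE step. It is not imbalanced-learn's implementation: the function name and the brute-force neighbour search are assumptions for illustration, and it expects at least two minority points.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=5, random_state=None):
    """Create n_new points on line segments between minority points and their neighbours."""
    rng = np.random.default_rng(random_state)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbours than other points

    # Pairwise Euclidean distances between minority points
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                       # random base points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour per base point
    gap = rng.random((n_new, 1))                                # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Four minority points on the corners of the unit square
X_corners = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_synth = smote_like_samples(X_corners, n_new=10, k=3, random_state=2)
print(X_synth.round(2))
```

Because the neighbours are found through distances, features on very different scales (such as 'Amount' and 'Time') would dominate the search, which is why the pipeline above scales them first.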


## SDV Generators

SDV offers a wide variety of synthetic data generators, as well as many functions and techniques to improve the quality of the resulting synthetic data. Namely, SDV provides useful preprocessing tools and constraints (i.e., deterministic rules the synthetic data should adhere to) to incorporate domain knowledge and business rules. Since the other methods used in this article work out of the box, I will not apply any of these tools, so that the SDV synthesizers are compared on the same footing.

In [None]:
# We first define the metadata for the SDV synthesizers
metadata = Metadata.detect_from_dataframe(X)

### CTGAN

From both literature and industry experience, CTGAN is one of the more popular techniques due to its generalizability and its ability to handle datasets with mixed data types (although other generators can do this too, to some extent). What sets CTGAN apart is its conditional generator, which helps address class imbalances in the data. Built on the GAN framework, CTGAN is also easily customizable. Many other generators build on the GAN framework as well, such as WGAN, DP-CTGAN, TableGAN, and others.

https://proceedings.neurips.cc/paper/2019/hash/254ed7d2de3b23ab10936522dd547b78-Abstract.html

In [None]:
pipeline_ctgan = Pipeline(
    [('ctgan', SDVSampler(CTGANSynthesizer, metadata, random_state = RANDOM_STATE)),
     ('LGB', lgb.LGBMClassifier(random_state = RANDOM_STATE, n_jobs = -1, verbose = -1))]
)

cv_score_ctgan = cross_validate(pipeline_ctgan, X, y, cv = CV_FOLDS, scoring = SCORINGS, 
                               n_jobs = -1)
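
The SDVSampler helper is not shown in this article. As a hedged sketch of what such a wrapper might do, the function below fits a synthesizer on the minority rows only and samples until the classes are balanced. It only relies on the fit and sample(num_rows=...) methods that SDV synthesizers provide. The function name, the minority-only fitting, and the BootstrapSynthesizer stand-in (which is not part of SDV and only exists so the sketch runs without training a GAN) are assumptions.

```python
import pandas as pd

def oversample_with_synthesizer(X, y, make_synthesizer, minority_label=1):
    """Fit a synthesizer on minority rows and sample until both classes are equal in size."""
    X_min = X[y == minority_label]
    n_new = int((y != minority_label).sum() - len(X_min))
    if n_new <= 0:
        return X, y

    # With SDV this could be: make_synthesizer = lambda: CTGANSynthesizer(metadata)
    synthesizer = make_synthesizer()
    synthesizer.fit(X_min)
    X_new = synthesizer.sample(num_rows=n_new)[X.columns]

    X_res = pd.concat([X, X_new], ignore_index=True)
    y_new = pd.Series([minority_label] * n_new, name=y.name)
    y_res = pd.concat([y, y_new], ignore_index=True)
    return X_res, y_res


class BootstrapSynthesizer:
    """Hypothetical stand-in with the same fit/sample interface (not part of SDV)."""

    def fit(self, data):
        self._data = data

    def sample(self, num_rows):
        drawn = self._data.sample(n=num_rows, replace=True, random_state=0)
        return drawn.reset_index(drop=True)


# Toy example: four majority rows and one minority row
X_frame = pd.DataFrame({"Amount": [1.0, 2.0, 3.0, 4.0, 50.0],
                        "Time": [10.0, 20.0, 30.0, 40.0, 55.0]})
y_frame = pd.Series([0, 0, 0, 0, 1], name="Class")
X_res_demo, y_res_demo = oversample_with_synthesizer(X_frame, y_frame, BootstrapSynthesizer)
print(y_res_demo.value_counts())
```

Wrapped as a pipeline step, the fitting happens on the training part of each fold only, so no holdout rows reach the synthesizer.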

### TVAE Synthesizer

I will also use the TVAE Synthesizer. Again, there will be no hyperparameter tuning and no constraints.

In [None]:
pipeline_tvae = Pipeline(
    [('tvae', SDVSampler(TVAESynthesizer, metadata, random_state = RANDOM_STATE)),
     ('LGB', lgb.LGBMClassifier(random_state = RANDOM_STATE, n_jobs = -1, verbose = -1))]
)

cv_score_tvae = cross_validate(pipeline_tvae, X, y, cv=CV_FOLDS, scoring = SCORINGS,
                               n_jobs = -1)

# Hybrid Sampling

Finally, the hybrid sampling techniques. SMOTE ENN is used as a common technique, followed by CTGAN ENN as proposed by Adiputra and Wanchai (2024).

## SMOTE ENN

This technique combines the already used SMOTE with ENN (Edited Nearest Neighbours). After oversampling, ENN undersamples both classes by removing observations whose class disagrees with that of their nearest neighbours.

In [22]:
pipeline_smoteenn = Pipeline(
    [('scaler', ColumnScaler(['Amount', 'Time'])),
     ('smote enn', SMOTEENN(random_state = RANDOM_STATE)),
     ('LGB', lgb.LGBMClassifier(random_state = RANDOM_STATE, n_jobs = -1, verbose = -1))]
)

cv_score_smoteenn = cross_validate(pipeline_smoteenn, X, y, cv = CV_FOLDS, scoring = SCORINGS, 
                                n_jobs = -1)
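
For intuition about the ENN cleaning step, here is a simplified sketch that keeps an observation only if the majority of its k nearest neighbours share its label. The function name and the brute-force distance matrix (only suitable for small data) are assumptions, and imbalanced-learn's implementation offers additional selection rules, so its output may differ.

```python
import numpy as np

def edited_nearest_neighbours(X, y, k=3):
    """Drop observations whose label disagrees with most of their k nearest neighbours."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Brute-force pairwise distances; fine for a toy example, too costly for large data
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Fraction of neighbours that share the label of the point itself
    agree = (y[neighbours] == y[:, None]).mean(axis=1)
    keep = agree > 0.5
    return X[keep], y[keep]

# Two clusters, with one class-1 point sitting in the middle of the class-0 cluster
X_noisy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0], [6.0, 6.0]])
y_noisy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])
X_clean, y_clean = edited_nearest_neighbours(X_noisy, y_noisy, k=3)
print(len(y_noisy), "->", len(y_clean))  # the point at (0.5, 0.5) is removed
```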


## CTGAN ENN

In addition, I use the hybrid sampling approach proposed by Adiputra and Wanchai (2024). They simply apply CTGAN (to the minority class) first and then ENN to the entire resampled dataframe.

In [8]:
pipeline_ctganenn = Pipeline(
    [('ctgan', SDVENN(CTGANSynthesizer, metadata, random_state = RANDOM_STATE)),
     ('LGB', lgb.LGBMClassifier(random_state = RANDOM_STATE, n_jobs = -1, verbose = -1))]
)

cv_score_ctganenn = cross_validate(pipeline_ctganenn, X, y, cv = CV_FOLDS, 
                                   scoring = SCORINGS, n_jobs = -1)


In [None]:
cv_score_ctganenn

# Algorithm level approach

## Cost sensitive learning

As the only algorithm-level approach, I use cost-sensitive learning (CSL). Instead of changing the data, CSL assigns a higher misclassification cost to the minority class, so that errors on fraud instances weigh more heavily during training. CSL works well for larger datasets and when you know the exact cost of misclassifying an observation. Since I don't know the cost of misclassification, the inverse class frequency is used instead. These weights are specified through the classifier's 'class_weight' parameter, which is set to 'balanced' to obtain weights that are inversely proportional to the class frequencies.

Given that the folds are stratified, we assume that the weights computed on the entire dataset are (roughly) equal to the weights assigned within each fold. This results in the following weights:


In [14]:
X.shape[0] / (2 * np.bincount(y)) 

array([  0.50086524, 289.43800813])

The pipeline then becomes:

In [23]:
# Define the cost-sensitive pipeline (no resampling step)
pipeline_csl = Pipeline(
    [('LGB', lgb.LGBMClassifier(class_weight= 'balanced', random_state = RANDOM_STATE,
                                n_jobs = -1, verbose = -1))]
)

cv_score_csl = cross_validate(pipeline_csl, X, y, cv=CV_FOLDS, scoring = SCORINGS,
                               n_jobs = -1)
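
As a small worked example of the inverse class frequency weights used above, the sketch below applies n_samples / (n_classes * count_per_class) to a toy label vector. The function name and the toy labels are assumptions for illustration, and the function expects integer labels 0, 1, ... with every class present.

```python
import numpy as np

def balanced_class_weights(y):
    """Inverse-frequency weights: n_samples / (n_classes * count_per_class)."""
    counts = np.bincount(y)
    return len(y) / (len(counts) * counts)

# Toy labels: eight majority and two minority observations
y_small = np.array([0] * 8 + [1] * 2)
weights_small = balanced_class_weights(y_small)
print(weights_small)  # 10 / (2 * 8) = 0.625 for class 0 and 10 / (2 * 2) = 2.5 for class 1
```

The rarer a class, the larger its weight, which is why the fraud class in this dataset receives a weight of roughly 289.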

# Results

In [None]:
cv_scores = [cv_score_noise, cv_score_ros, cv_score_smote, cv_score_ctgan, cv_score_tvae, 
             cv_score_smoteenn, cv_score_ctganenn, cv_score_csl]   

In [25]:
# Combine the CV scores of all methods into a single table
r = display_scores(cv_scores, SCORINGS)
r

| Method | F1 | ROC AUC | Precision | Recall |
|---|---|---|---|---|
| Baseline | 0.0 | 0.499 | 0.0 | 0.0 |
| Noise | 0.708 | 0.974 | 0.626 | 0.819 |
| ROS | 0.837 | 0.975 | 0.862 | 0.815 |
| SMOTE | 0.654 | 0.961 | 0.538 | 0.837 |
| CTGAN | 0.764 | 0.971 | 0.721 | 0.817 |
| TVAE | 0.763 | 0.967 | 0.715 | 0.821 |
| Gaussian Copula | 0.847 | 0.978 | 0.906 | 0.797 |
| CSL | 0.839 | 0.975 | 0.843 | 0.835 |


The Gaussian Copula provides the best F1, which is somewhat surprising. This is likely very dataset dependent. To draw firm conclusions for or against synthetic data, this process should be repeated on multiple datasets, which is beyond the scope of this article. The dataset at hand might also be overly simplistic.

However, the Gaussian Copula also has the worst recall of all methods (and the best precision), whereas the other methods are more closely aligned with each other on recall. The preferred method therefore still depends on how you want to use the model and on the cost of misclassification.