# Synthetic Data, Imbalanced Labels, and Cross-Validation: Should You Use It and How to Prevent Data Leakage

# Introduction

Having worked with synthetic data, I’ve come across many articles that promote tabular synthetic data as a catch-all solution for various obstacles (e.g. machine learning applications, privacy, and testing). While synthetic data is definitely useful, its usefulness and relevance are not always clearly assessed. One example, and the inspiration for this article, is the article by Synthetic Data Vault (SDV) titled "Can you use synthetic data for label balancing?" [2].

SDV’s article addresses a common challenge in classification: imbalanced target labels. It proposes synthetic data as a 'compelling solution' to this problem compared to more traditional approaches, but does so without any empirical evidence. Although you definitely can use synthetic data for label balancing (to answer the question in the article's title), the key question is whether you **should**, and how it compares to state-of-the-art techniques.

This article examines whether you should use synthetic data for imbalanced classification tasks. It also demonstrates how to set up a proper cross-validation pipeline that prevents data leakage between training and holdout sets when applying resampling techniques.


## Imports

In [1]:
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd   
from IPython.display import Image

## Settings

We define our random seed for reproducibility, the metrics we will use to evaluate our models, and the cross-validation splits.

In [2]:
from sklearn.model_selection import StratifiedKFold

RANDOM_STATE = 2

SCORINGS = ['f1', 'precision', 'recall']

CV_FOLDS = StratifiedKFold(n_splits = 5, random_state = RANDOM_STATE, shuffle = True)

## Data exploration

For this analysis, we use the creditcard dataset from Kaggle (https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud), which contains transactions and whether they were fraudulent or not. To preserve privacy, most of the features are principal components derived from the original dataset. The goal is to predict whether a transaction is fraudulent, making this a binary classification task. The 'Class' variable indicates whether a transaction is fraudulent (1) or not (0).


In [3]:
import pandas as pd

creditcard = pd.read_csv('./data/creditcard.csv')

Next, we see how our target labels are distributed.

In [4]:
# Target distribution rounded to 2 decimal places
round(creditcard['Class'].value_counts(normalize = True) * 100, 2)

Class
0    99.83
1     0.17
Name: proportion, dtype: float64

Naturally, the number of non-fraudulent transactions far outweighs the number of fraudulent ones, resulting in an imbalanced classification task. Here the dataset is highly imbalanced, with only 0.17% of transactions being fraudulent. This imbalance can pose challenges for machine learning models, which may become biased toward predicting the majority class. Resampling methods are a popular way to address this issue: they balance the dataset by increasing the minority class (oversampling), decreasing the majority class (undersampling), or combining both (hybrid sampling). One way to oversample is with synthetic data, as SDV's article suggests. But how does this approach perform and, more importantly, how does it compare to other techniques?
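To make these resampling families concrete, here is a minimal sketch on a toy dataset using plain pandas. The toy data and variable names are assumptions for illustration only; the analysis below relies on dedicated samplers instead.

```python
import pandas as pd

# Toy imbalanced dataset (illustrative only): 95 legitimate rows, 5 fraud rows
toy = pd.DataFrame({"amount": range(100), "Class": [0] * 95 + [1] * 5})
majority = toy[toy["Class"] == 0]
minority = toy[toy["Class"] == 1]

# Oversampling: add minority rows, drawn with replacement, until both classes match
extra = minority.sample(n=len(majority) - len(minority), replace=True, random_state=0)
oversampled = pd.concat([majority, minority, extra])

# Undersampling: keep only as many majority rows as there are minority rows
undersampled = pd.concat([majority.sample(n=len(minority), random_state=0), minority])

print(oversampled["Class"].value_counts().to_dict())
print(undersampled["Class"].value_counts().to_dict())
```

Hybrid sampling sits in between: it adds minority rows and removes majority rows at the same time.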

As you might be thinking (and I was too), this idea probably isn't very novel, and indeed similar research exists in the literature. This led me to the work of Adiputra and Wanchai (2024) [1], which compares similar data-level approaches. However, after snooping around in the repository containing the code for their research, I noticed that during validation the synthetic data is generated **before** cross-validation is performed, a common pitfall that leads to data leakage and biased results.

We'll first demonstrate how easily this mistake is made and why it leads to data leakage. After showing how to prevent it, we'll compare synthetic data to other state-of-the-art techniques for imbalanced datasets.



## Data splitting

The next step is to split the data into a CV set and a test set, stratifying on the target variable so that both sets keep the same class proportions. The CV set is used for cross-validation, while the test set remains untouched so that we can show the effect of an improper cross-validation procedure and how its results generalize to truly unseen data.

In [5]:
from sklearn.model_selection import train_test_split

X = creditcard.drop('Class', axis = 1)
y = creditcard['Class']

X_cv, X_test, y_cv, y_test = train_test_split(X, y, test_size = 0.2, stratify = y,
                                                    random_state = RANDOM_STATE)
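As a quick sanity check on what stratification buys us, the sketch below splits a toy label vector and compares the share of positives in both parts. The toy data is an assumption for illustration; only `train_test_split` and its `stratify` argument come from the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (illustrative only): 2% positives
y_toy = np.array([0] * 980 + [1] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_a, X_b, y_a, y_b = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# Both parts keep (approximately) the original share of positives
print(f"full: {y_toy.mean():.3f}, 80% part: {y_a.mean():.3f}, 20% part: {y_b.mean():.3f}")
```

Without `stratify`, a rare class can end up over- or under-represented in the smaller part purely by chance.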

## Incorrect cross validation procedure 

First, let's look at what an incorrect cross-validation setup looks like. Suppose we wish to use the RandomOverSampler (ROS), which balances the data by duplicating minority class instances, to randomly oversample the fraudulent transactions. A common mistake, also made by Adiputra and Wanchai (2024), is to perform this resampling before splitting the data into folds.

This cross validation mistake is easily made and looks like this in code:

In [6]:
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import cross_validate
import lightgbm as lgb

# We resample the data before performing cross-validation
ros = RandomOverSampler(random_state = RANDOM_STATE)
X_res, y_res = ros.fit_resample(X_cv, y_cv)

# Feed the already resampled data into cross validation
cv_score = cross_validate(
    lgb.LGBMClassifier(random_state = RANDOM_STATE, n_jobs = -1, verbose = -1),
    X_res, y_res, cv = CV_FOLDS, scoring = SCORINGS, n_jobs = -1
)

# Calculate the mean f1 score
cv_score = cv_score['test_f1'].mean()
print(f"Cross validation score: {cv_score:.4f}")

Cross validation score: 0.9999


This approach leads to near-perfect scores, which seems great. But how do these results translate to truly unseen data, i.e. the test set that was never passed to the RandomOverSampler?


In [7]:
from sklearn.metrics import f1_score

# We define the model and fit it to the resampled cv data
lgbm_classifier = lgb.LGBMClassifier(random_state = RANDOM_STATE, n_jobs = -1, 
                                     verbose = -1)
lgbm_classifier.fit(X_res, y_res)

# predictions on the unseen data
preds = lgbm_classifier.predict(X_test)
test_score = f1_score(y_test, preds)
print(f"Test score: {test_score:.4f}")

Test score: 0.8442


The test results are noticeably lower than the cross-validation results, indicating that the cross-validation score obtained this way is not representative of the performance on truly unseen data. Furthermore, you might also want to tune hyperparameters during cross-validation. In that case, the hyperparameters are selected based on a leaky cross-validation score and end up overfit to it, damaging the generalization of your models.



## What is going wrong?

Although the difference may seem small, resampling before cross-validation can significantly affect the generalizability of results. The main idea behind creating a train and holdout split is to evaluate the model on data that it has not seen before. Cross-validation applies this same exact concept across multiple folds of the data to limit the variance of the results that come from making one single random split. In each fold, the holdout set is used as a separate unseen set of data to test the model on. Therefore, for each fold in the cross-validation procedure, no information from the holdout set should leak into the train set to ensure that it is still unseen.

By using ROS before cross validation in this example, we create duplicates of fraud instances across the ENTIRE dataset. During cross-validation, duplicated fraud instances can appear in the holdout set of a fold whereas the original ends up in the train set of a fold (or vice versa), as shown in Figure 1 below. As a result, the holdout set does not consist of truly unseen data and the model is tasked to predict the outcome of observations it has already seen during training. 

![Incorrect CV approach](attachments/incorrect_cv.png)

Figure 1: Incorrect cross-validation procedure by resampling beforehand for each fold <sub>i</sub>

Demonstrating this in our example, we look at the exact duplicates of fraudulent transactions across folds when ROS is used before cross-validation.

In [8]:
# Isolate fraudulent transactions
X_pos = X[y == 1]

# Determine the proportion of duplicates in the positive class
prop_dupl = X_pos.duplicated(keep = False).sum() / X_pos.shape[0]
print(f"Duplicates amongst fraudulent transactions in the entire dataset: {prop_dupl:.4f}")

Duplicates amongst fraudulent transactions in the entire dataset: 0.0650


This indicates that approx. 6.5% of fraud instances are not unique. Using stratified folds, we would expect to see a similar percentage of duplicates for each fold throughout cross-validation. 

First, we count the duplicated fraud instances across folds for the data that is resampled before cross-validation, i.e. the X_res and y_res from earlier. Specifically, we look at how many holdout observations are duplicates of training observations in each fold.

In [9]:
# Again, resampling the data before cross-validation
ros = RandomOverSampler(random_state = RANDOM_STATE)
X_res, y_res = ros.fit_resample(X_cv, y_cv)

# to track the proportion of duplicates
dupl_percentages = []

# Loop over each fold using the already resampled data
for i, (train_index, hd_index) in enumerate(CV_FOLDS.split(X_res, y_res)):
    
    # Obtain train and holdout sets for the current fold
    X_train, X_hd = X_res.iloc[train_index], X_res.iloc[hd_index]
    y_train, y_hd = y_res.iloc[train_index], y_res.iloc[hd_index]
    
    # Isolate the fraud instances
    train_pos = X_train[y_train == 1]
    hd_pos = X_hd[y_hd == 1]
    
    # Check the observations in holdout that are duplicates of the training set for fold i
    mask = hd_pos.apply(tuple, axis = 1).isin(train_pos.apply(tuple, axis = 1))
    duplicates = hd_pos[mask]
    
    # Determine the proportion of holdout observation that are duplicates
    prop_dupl = duplicates.shape[0] / hd_pos.shape[0] * 100
    dupl_percentages.append(prop_dupl)
    
print(f"Proportion of duplicates in holdout set across folds: {np.mean(dupl_percentages):.2f}%")


Proportion of duplicates in holdout set across folds: 100.00%


When resampling is performed before cross-validation, the average percentage of duplicates among the fraudulent transactions across folds reaches 100%. This means that every fraud instance in the holdout set is a duplicate of an observation in the training set, explaining the near perfect F1-scores observed earlier.

While the high duplication rate is largely attributed to the severe class imbalance in this dataset, the issue may manifest more subtly in other contexts. Moreover, data leakage will not always appear as exact duplicates for all synthesizers. For example, a synthesizer might avoid creating exact duplicates but still generate numerous highly similar observations. Therefore, one should also understand the behavior of the synthesizer used.
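One way to look for this subtler form of leakage is to measure how close each holdout observation is to its nearest training observation. The sketch below assumes numeric, comparably scaled features; the function name `near_duplicate_rate` and the tolerance `tol` are hypothetical and would need to be chosen per dataset.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def near_duplicate_rate(train: np.ndarray, holdout: np.ndarray, tol: float) -> float:
    """Share of holdout rows whose nearest training row lies within `tol`
    (Euclidean distance). `tol` is a hypothetical, problem-specific threshold."""
    nn = NearestNeighbors(n_neighbors=1).fit(train)
    distances, _ = nn.kneighbors(holdout)
    return float((distances[:, 0] <= tol).mean())

# Toy data (illustrative only)
train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
holdout = np.array([[0.0, 0.001],   # almost a copy of the first training row
                    [3.0, 3.0]])    # far from every training row

exact_rate = near_duplicate_rate(train, holdout, tol=0.0)    # exact matches only
near_rate = near_duplicate_rate(train, holdout, tol=0.01)    # also counts near copies
print(exact_rate, near_rate)
```

An exact-match check like the one used earlier corresponds to `tol=0.0`; a small positive tolerance also flags holdout rows that are almost, but not exactly, copies of training rows.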

## Well... how do you resample the correct way?

To resample correctly, we first split the unresampled data into folds and then, within each fold, resample only the training portion while leaving the holdout portion untouched, as shown in Figure 2.

![Correct CV approach](attachments/correct_cv.png)

Figure 2: Correct cross-validation procedure by resampling within each fold

We begin with a manual implementation of the correct approach to illustrate how the transformations should look and to highlight what went wrong in the previous setup. Afterwards, we will use a pipeline for a more efficient implementation.



In [25]:
cv_results = []

# Loop over the folds of the data that is NOT resampled beforehand
for i, (train_index, hd_index) in enumerate(CV_FOLDS.split(X_cv, y_cv)):
    
    # Obtain train and holdout sets for fold i
    X_train, X_hd = X_cv.iloc[train_index], X_cv.iloc[hd_index]
    y_train, y_hd = y_cv.iloc[train_index], y_cv.iloc[hd_index]
    
    # Applying the ROS to the training set WITHIN each fold as opposed to beforehand 
    ros = RandomOverSampler(random_state = RANDOM_STATE)
    X_train_res, y_train_res = ros.fit_resample(X_train, y_train)
    
    # Fit the model to the resampled train set and predict the holdout set that is not resampled
    lgb_clf = lgb.LGBMClassifier(random_state = RANDOM_STATE, n_jobs = -1, verbose = -1)
    lgb_clf.fit(X_train_res, y_train_res)
    cv_preds = lgb_clf.predict(X_hd)
    
    # Calculate and append F1 score for the current fold
    cv_score = f1_score(y_hd, cv_preds)
    cv_results.append(cv_score)
    
print(f"F1 score with correct CV procedure: {np.mean(cv_results):.4f}")

F1 score with correct CV procedure: 0.8384


Although the score is less impressive than that of the first method, it reflects the score of 0.8442 on the truly unseen test set far more accurately. And how is this reflected in the duplicates in the holdout set?

In [11]:
dupl_percentages = []

# Loop over the folds of the data that is NOT resampled beforehand
for i, (train_index, hd_index) in enumerate(CV_FOLDS.split(X_cv, y_cv)):
   
    # Obtain train and holdout sets for fold i
    X_train, X_hd = X_cv.iloc[train_index], X_cv.iloc[hd_index]
    y_train, y_hd = y_cv.iloc[train_index], y_cv.iloc[hd_index]
    
    # WITHIN each fold, we resample the training set
    ros = RandomOverSampler(random_state = RANDOM_STATE)
    X_res, y_res = ros.fit_resample(X_train, y_train)
    
    # Only then do we isolate the positive class
    train_pos = X_res[y_res == 1]
    hd_pos = X_hd[y_hd == 1]
    
    # Check to see the observations in holdout that are duplicates of the training set
    mask = hd_pos.apply(tuple, axis = 1).isin(train_pos.apply(tuple, axis = 1))
    duplicates = hd_pos[mask]
    
    # determine the proportion of holdout observation that are duplicates
    prop_dupl = duplicates.shape[0] / hd_pos.shape[0] * 100
    dupl_percentages.append(prop_dupl)
    
print(f"Proportion of duplicates in holdout set across folds: {np.mean(dupl_percentages):.2f}%")

Proportion of duplicates in holdout set across folds: 6.35%


With this, we can see that the proportion of duplicates across folds resembles the one we obtained from the entire dataset, indicating that resampling within each fold leads to folds that are more representative. The effect of the improper cross-validation procedure is clear in this ROS example because of the large number of exact duplicates. However, depending on the severity of the class imbalance, the synthetic data generator used, and the degree to which the generator overfits, this data leakage issue might be less obvious.

Moreover, data leakage between train and holdout is not limited to resampling methods in imbalanced classification tasks. It also applies to other, perhaps more subtle, cases such as feature engineering methods that use information from observations that end up in the holdout set during cross-validation. Think of scaling, for instance. Ideally, a scaler should be fit only on the training set and then applied to both the training and holdout sets, repeating this for each fold in cross-validation. Not all feature engineering leads to data leakage between train and holdout. For instance, encoding a categorical gender variable from male/female to binary is typically fine, since the required information is contained within each row. However, things can get tricky with a high-cardinality categorical variable in which certain values are quite rare and might appear in the holdout set but not in the train set.

To avoid these issues, it's important to be aware of what transformations are being applied, when they are applied, and what information they rely on. Best practice is to perform all transformations within cross-validation folds rather than beforehand to prevent any data leakage. 
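For the scaling case, a minimal sketch of this practice with scikit-learn could look as follows. The toy data and the choice of logistic regression are assumptions for illustration; the point is only that the scaler is a pipeline step rather than something applied to the full dataset up front.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (illustrative only)
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)

# The scaler is a pipeline step, so it is re-fit on the training part of every
# fold and only applied (never fit) to the corresponding holdout part
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_toy, y_toy, cv=folds, scoring="f1")
print(scores.round(3))
```

Calling `StandardScaler().fit_transform` on all rows before cross-validation would instead let the holdout rows influence the means and standard deviations used during training.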

## How to deal with class imbalances?

As previously mentioned, imbalanced datasets could lead to biased results towards the larger class if not dealt with. Strategies to handle class imbalance fall into two main categories: data-level and algorithm-level approaches. Data-level approaches modify the dataset by oversampling the minority class, undersampling the majority, or a combination of the two (i.e. hybrid sampling) to balance the dataset. Algorithm-level approaches, in contrast, don't modify the dataset, but instead adjust the algorithm to deal with this imbalance in the dataset.

The data-level approaches we use in this article include ROS, noise injection, and synthetic data as covered in SDV's article. While ROS is a common oversampling technique, noise injection is less familiar in practice, and since the article doesn't detail the noise generation process, I assume values are randomly sampled from a normal distribution. Synthetic data refers to artificially generated observations by learning the underlying data patterns. In this scenario, the synthetic data should ideally resemble real observations to help improve model performance. This resemblance is called having high fidelity. However, excessively high fidelity risks creating exact copies of the original data, which can result from the generator overfitting to the data.
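To illustrate the assumption about noise injection, here is a minimal sketch of how such a sampler could work: minority rows are copied and perturbed with Gaussian noise. This is not the NoiseSampler used for the results in this article; the function `noise_oversample` and its `scale` parameter are hypothetical.

```python
import numpy as np
import pandas as pd

def noise_oversample(X, y, scale=0.05, random_state=0):
    """Balance a binary dataset by copying minority rows (label 1) and adding
    Gaussian noise. `scale` (noise std as a fraction of each column's std within
    the minority class) is a hypothetical parameter chosen for illustration."""
    rng = np.random.default_rng(random_state)
    X_min = X[y == 1]
    n_new = int((y == 0).sum() - (y == 1).sum())

    # Draw minority rows with replacement, then perturb every value
    base = X_min.sample(n=n_new, replace=True, random_state=random_state)
    noise = rng.normal(0.0, 1.0, size=base.shape) * (scale * X_min.std().to_numpy())
    X_new = pd.DataFrame(base.to_numpy() + noise, columns=X.columns)

    X_out = pd.concat([X, X_new], ignore_index=True)
    y_new = pd.Series(np.ones(n_new, dtype=y.dtype), name=y.name)
    y_out = pd.concat([y, y_new], ignore_index=True)
    return X_out, y_out

# Toy data (illustrative only): 90 negatives, 10 positives
toy_rng = np.random.default_rng(1)
X_toy = pd.DataFrame(toy_rng.normal(size=(100, 3)), columns=["a", "b", "c"])
y_toy = pd.Series([0] * 90 + [1] * 10, name="Class")

X_bal, y_bal = noise_oversample(X_toy, y_toy)
print(y_bal.value_counts().to_dict())
```

With `scale=0` this reduces to random oversampling; larger values trade exact copies for observations that drift further from the real minority rows.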

The techniques we use to generate synthetic data are Synthetic Minority Over-sampling TEchnique (SMOTE), Conditional Tabular Generative Adversarial Networks (CTGAN), and Tabular Variational Auto Encoder (TVAE). We will use SDV's library for the latter two. SMOTE generates synthetic examples of the minority class by interpolating between observations based on their feature space distance, whereas CTGAN and TVAE use neural networks to learn the underlying data distribution and generate new samples without relying on explicit distance-based interpolation.
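The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified illustration under my own assumptions (Euclidean neighbors, uniform interpolation weights, the hypothetical function name `smote_like_samples`), not imblearn's implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_samples(X_min, n_new, k=5, random_state=0):
    """Simplified SMOTE-style interpolation (a sketch, not imblearn's implementation)."""
    rng = np.random.default_rng(random_state)

    # Ask for k + 1 neighbors because each point is its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neighbor_idx = nn.kneighbors(X_min)

    base_idx = rng.integers(0, len(X_min), size=n_new)      # random minority rows
    neighbor_col = rng.integers(1, k + 1, size=n_new)       # one of their k neighbors
    picked = neighbor_idx[base_idx, neighbor_col]
    lam = rng.random((n_new, 1))                            # interpolation weight in [0, 1)

    # Each new point lies on the line segment between a row and its chosen neighbor
    return X_min[base_idx] + lam * (X_min[picked] - X_min[base_idx])

# Toy minority class (illustrative only)
X_min = np.random.default_rng(1).normal(size=(20, 2))
X_new = smote_like_samples(X_min, n_new=50)
print(X_new.shape)
```

Because every new point is a convex combination of two real minority observations, it never leaves the region spanned by the minority class, which is also why the result depends heavily on how distances behave in the feature space.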

Furthermore, we also use an algorithm-level approach: cost-sensitive learning (CSL). CSL assigns different weights to classes so that the model penalizes misclassification errors differently based on their relative importance. In this context, fraud instances receive higher weights, meaning the model faces a larger penalty for misclassifying fraudulent transactions.

CSL is particularly effective when the true costs of different types of misclassification are known. Since these exact costs are unknown, we use inverse class frequency as the weighting scheme, which naturally assigns higher weights to the rarer fraud class. Given that our folds are stratified, we assume that the weights for each fold are (roughly) equal to the following:

In [12]:
w1, w2 = X.shape[0] / (2 * np.bincount(y)) 
print(f"Majority class weight: {w1:.4f}, minority class weight: {w2:.4f}")

Majority class weight: 0.5009, minority class weight: 289.4380
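The inverse class frequency formula used above is the same one behind scikit-learn's 'balanced' class weights. The sketch below checks this on toy labels, which are an assumption for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (illustrative only): 990 negatives, 10 positives
y_toy = np.array([0] * 990 + [1] * 10)

# Inverse class frequency: n_samples / (n_classes * count_per_class)
manual = len(y_toy) / (2 * np.bincount(y_toy))

# scikit-learn's 'balanced' heuristic uses the same formula
balanced = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_toy)

print(manual, balanced)
```

The rarer a class, the larger its weight, so a misclassified minority observation contributes proportionally more to the loss.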


Finally, inspired by Adiputra and Wanchai (2024), we also use hybrid sampling. Specifically, we throw SMOTE-ENN and their proposed CTGAN-ENN into the mix. Both combine oversampling with undersampling through Edited Nearest Neighbors (ENN), which removes majority-class observations that lie close to the decision boundary. In their study, CTGAN-ENN is reported to outperform CSL and other over- and hybrid sampling techniques.
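The cleaning step of ENN can be sketched as follows. This is a simplified version under my own assumptions (label 0 is the majority class, Euclidean neighbors, a row is kept only if all of its neighbors agree, and the hypothetical function name `enn_keep_mask`), not imblearn's implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def enn_keep_mask(X, y, k=3):
    """Simplified Edited Nearest Neighbours (a sketch, not imblearn's implementation):
    drop majority rows (label 0) unless all k nearest neighbors are also label 0."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    neighbor_labels = y[idx[:, 1:]]              # skip column 0: the row itself
    all_agree = (neighbor_labels == 0).all(axis=1)
    return (y == 1) | all_agree                  # minority rows are always kept

# Toy 1-D data (illustrative only): the last majority row sits inside the minority cluster
X_toy = np.array([[0.0], [0.1], [0.2], [0.3], [5.0], [5.1], [5.2], [5.15]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 0])

mask = enn_keep_mask(X_toy, y_toy)
X_clean, y_clean = X_toy[mask], y_toy[mask]
print(mask)
```

In a hybrid sampler, a step like this runs after oversampling, so majority observations surrounded by (real or synthetic) minority observations are removed.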


# Pipeline setup

Previously, we performed cross-validation by manually looping over the folds. Although this provides a lot of control, a Pipeline performs the same operations in a more concise and easier-to-use way. The pipelines used here keep train and holdout separate when performing the preprocessing steps for each fold, reducing the risk of data leakage.

In this article, we use the Pipeline from the imblearn package. While scikit-learn’s Pipeline can also handle transformations between training and test sets, imblearn’s version supports samplers, making it more suitable for our workflow.


## Random OverSampling pipeline 


First, we define the classifier to be used in the pipeline (and in other pipelines). In this article, we arbitrarily choose the LightGBM classifier, but you could replace this with any classifier of your choosing. Next, we construct the pipeline, which includes the preprocessing steps (e.g., our RandomOverSampler, which can be directly passed into the pipeline) followed by the classifier. Finally, we pass this pipeline into the cross_validate function, which subsequently performs all steps separately within each fold, and evaluates the model’s performance across folds using the metrics defined in our SCORINGS list.


In [13]:
from imblearn.pipeline import Pipeline

# Define the LightGBM classifier to be used in the pipeline
lgb_clf = lgb.LGBMClassifier(random_state=RANDOM_STATE, n_jobs=-1, verbose=-1)

# Define the sampling pipeline
pipeline_ros = Pipeline(
    [('ros', RandomOverSampler(random_state = RANDOM_STATE)),
     ('lgbm', lgb_clf)] 
)

cv_score_ros = cross_validate(pipeline_ros, X, y, cv = CV_FOLDS, scoring = SCORINGS, 
                               n_jobs = -1)

## Noise injection

**not shown in blog, only result**

In [14]:
from helper_functions import NoiseSampler

pipeline_noise = Pipeline(
    [('noise_sampler', NoiseSampler(random_state = RANDOM_STATE)),
     ('lgbm', lgb_clf)]
)

cv_score_noise = cross_validate(pipeline_noise, X, y, cv = CV_FOLDS, scoring = SCORINGS, 
                                n_jobs = -1)

## SMOTE

**not shown in blog, only result**

In [15]:
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

pipeline_smote = Pipeline(
    [('scaler', StandardScaler()),
     ('smote', SMOTE(random_state = RANDOM_STATE)),
     ('lgbm', lgb_clf)]
)

cv_score_smote = cross_validate(pipeline_smote, X, y, cv = CV_FOLDS, scoring = SCORINGS, 
                                n_jobs = -1)

## SDV Generators

In order to use SDV's synthetic data generators, we first need to define the metadata of the dataset used. This can simply be done with the following code.

In [16]:
from sdv.metadata import Metadata

# We first define the metadata for the SDV synthesizers
metadata = Metadata.detect_from_dataframe(X)

It is also possible to update the metadata manually if you wish. Furthermore, SDV offers a wide variety of synthetic data generators, as well as many functions and parameters to try to improve the quality of the synthetic data. Namely, SDV provides useful preprocessing tools and constraints (i.e., deterministic rules the synthetic data should adhere to) to incorporate domain knowledge and business rules. Since the other methods used in this article work out of the box, I will not be applying any of these methods and will compare them in the same way.

## CTGAN Pipeline 



Integrating CTGAN into a pipeline is a bit more involved since there is no standard support for it like with the RandomOverSampler. Therefore, we have to build our custom resampling class, which we can pass into the pipeline.

We first build a BaseSampler class to support resampling procedures that are not natively integrated into the Pipeline from imbalanced-learn (e.g., CTGAN, TVAE, and our custom noise-based sampler). This class provides the shared functionality required across these samplers and ensures that during cross-validation, resampling is applied within each fold while keeping the training and holdout sets properly separated.

The class defines a resample() method that performs the actual resampling of the data, and a fit_resample() method that conforms to the scikit-learn/imblearn API by applying resampling only on the training set. In addition, the _get_num_samples() method determines how many synthetic samples of the minority class are required in order to balance the dataset against the majority class in binary classification problems.


In [17]:
from sklearn.base import BaseEstimator

class BaseSampler(BaseEstimator):
    """Base class for all custom samplers."""
    
    def __init__(self, random_state: int) -> None:
        self.random_state = random_state
    
    def fit_resample(self, X: pd.DataFrame, y: pd.Series) -> tuple[pd.DataFrame, pd.Series]:
        """Fit the sampler to the data and return the resampled data"""
        return self.resample(X, y)
    
    def _get_num_samples(self, y: pd.Series) -> int:
        """Get the number of samples to generate for the minority class for this binary
        classification problem."""
        y_values = y.value_counts()
        num_samples = y_values[0] - y_values[1]
        return num_samples  

    def resample(self, X: pd.DataFrame, y: pd.Series) -> tuple[pd.DataFrame, pd.Series]:
        """Resample the data using the specified sampling method"""
        
        # Create copies of the input data
        X_train = X.copy()
        y_train = y.copy()
        
        # Determine the number of samples to generate
        num_samples = self._get_num_samples(y_train)

        # Isolate the minority class and generate synthetic data
        X_minority = X_train[y_train == 1]
        X_upsampled = self._generate_syn_data(X_minority, num_samples)

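        # Update the independent variables and the target variable; the new labels
        # keep the dtype and name of y, and the index is reset to avoid duplicate labels
        X_sampled = pd.concat([X_train, X_upsampled], axis = 0, ignore_index = True)
        y_sampled = pd.concat(
            [y_train, pd.Series(np.ones(num_samples, dtype = y_train.dtype), name = y_train.name)],
            axis = 0, ignore_index = True
        )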

        return X_sampled, y_sampled
    

Next, we define the SDVSampler class for the CTGAN and TVAE synthesizers, based on BaseSampler as its parent class. This class takes the synthesizer generator in the __init__ method and has a custom _generate_syn_data() function to resample using the SDV synthesizers.

In [18]:
class SDVSampler(BaseSampler):
    """Custom sampler for SDV synthesizers."""
    
    def __init__(self, generator, metadata: Metadata, random_state: int) -> None:
        super().__init__(random_state)
        self.generator = generator
        self.metadata = metadata
       
    def _generate_syn_data(self, X_minority: pd.DataFrame, num_samples: int) -> pd.DataFrame:
        """Generate synthetic data using the SDV synthesizer."""
        
        # Creating a copy of the minority class data
        X_resample = X_minority.copy()
        
        # Creating a new instance of the synthesizer within each fold
        synthesizer = self.generator(self.metadata)

        # Fitting the synthesizer to the minority class and generating new observations
        synthesizer.fit(X_resample)
        X_sds = synthesizer.sample(num_samples)
        
        return X_sds

The final pipeline for CTGAN then looks as follows:

In [19]:
from sdv.single_table import CTGANSynthesizer

# Define the CTGAN pipeline using our custom SDVSampler class
pipeline_ctgan = Pipeline(
    [('ctgan', SDVSampler(CTGANSynthesizer, metadata, random_state = RANDOM_STATE)),
     ('lgbm', lgb_clf)]
)

cv_score_ctgan = cross_validate(pipeline_ctgan, X, y, cv = CV_FOLDS, scoring = SCORINGS, 
                               n_jobs = -1)

### TVAE Synthesizer

**not shown in blog, only result**

In [20]:
from sdv.single_table import TVAESynthesizer

pipeline_tvae = Pipeline(
    [('tvae', SDVSampler(TVAESynthesizer, metadata, random_state = RANDOM_STATE)),
     ('lgbm', lgb_clf)]
)

cv_score_tvae = cross_validate(pipeline_tvae, X, y, cv=CV_FOLDS, scoring = SCORINGS,
                               n_jobs = -1)

# Hybrid Sampling

## SMOTE ENN

**not shown in blog, only result**

In [21]:
from sklearn.preprocessing import StandardScaler
from imblearn.combine import SMOTEENN

pipeline_smoteenn = Pipeline(
    [('scaler', StandardScaler()),
     ('smote enn', SMOTEENN(random_state = RANDOM_STATE)),
     ('lgbm', lgb_clf)]
)

cv_score_smoteenn = cross_validate(pipeline_smoteenn, X, y, cv = CV_FOLDS, scoring = SCORINGS, 
                                n_jobs = -1)


## CTGAN ENN

**not shown in blog, only result**

In [22]:
from helper_functions import SDVENN

pipeline_ctganenn = Pipeline(
    [('ctgan', SDVENN(CTGANSynthesizer, metadata, random_state = RANDOM_STATE)),
     ('lgbm', lgb_clf)]
)

cv_score_ctganenn = cross_validate(pipeline_ctganenn, X, y, cv = CV_FOLDS, 
                                   scoring = SCORINGS, n_jobs = -1)

# Algorithm level approach

## Cost sensitive learning

**not shown in blog, only result**

In [23]:
# Define the cost-sensitive pipeline (class weights only, no resampling step)
pipeline_csl = Pipeline(
    [('LGB', lgb.LGBMClassifier(class_weight= 'balanced', random_state = RANDOM_STATE,
                                n_jobs = -1, verbose = -1))]
)

cv_score_csl = cross_validate(pipeline_csl, X, y, cv=CV_FOLDS, scoring = SCORINGS,
                               n_jobs = -1)

# Results

In [24]:
# loading a custom function to display the results
from helper_functions import display_scores

cv_scores = [cv_score_noise, cv_score_ros, cv_score_smote, cv_score_ctgan, cv_score_tvae, 
             cv_score_smoteenn, cv_score_ctganenn, cv_score_csl]   

result = display_scores(cv_scores, SCORINGS)
result

| Method | F1 | Precision | Recall |
|---|---|---|---|
| Noise | 0.746 | 0.684 | 0.825 |
| ROS | 0.837 | 0.862 | 0.815 |
| SMOTE | 0.663 | 0.552 | 0.831 |
| CTGAN | 0.766 | 0.747 | 0.801 |
| TVAE | 0.775 | 0.730 | 0.827 |
| SMOTENN | 0.649 | 0.521 | 0.862 |
| CTGANENN | 0.812 | 0.920 | 0.730 |
| CSL | 0.839 | 0.843 | 0.835 |


Notably, none of the data-level approaches using synthetic data rank among the best-performing methods, while techniques using SMOTE actually delivered the worst performance. In terms of F1-score, the algorithm-level approach CSL performs best with a score of 0.839. The top-performing data-level approach is ROS, which closely follows CSL with a slightly lower score of 0.837.

Interestingly, the F1-scores for CTGAN and TVAE (0.766 and 0.775, respectively) are closer to that of noise injection (0.746) than to CSL's, suggesting that exploring SDV's pre-processors and constraints could be a relevant next step for improving performance, as mentioned previously.

SMOTE and SMOTE-ENN provide surprisingly low F1-scores, likely due to the dataset's high dimensionality. While the hybrid sampling approach of SMOTE-ENN does not improve upon SMOTE alone, Adiputra and Wanchai (2024)'s proposed CTGAN-ENN achieved an F1-score of 0.812, outperforming standard CTGAN and even resulting in the highest precision among all evaluated techniques.



## Conclusion

Overall, we have seen that resampling and other pre-processing steps should not be performed before cross-validation to prevent data leakage. Ideally, these steps should be performed during cross-validation, independently within each fold, which can easily be accomplished using a pipeline.

This analysis also highlights that synthetic data is not always the best solution for imbalanced classification tasks. While synthetic data can be applied to such problems, other state-of-the-art techniques may provide better performance. 

## References

[1] I. N. M. Adiputra and P. Wanchai, CTGAN-ENN: a tabular GAN-based hybrid sampling method for imbalanced and overlapped data in customer churn prediction (2024), Journal of Big Data, vol. 11, no. 1, Sep. 2024

[2] Neha Patki, Can You Use Synthetic Data for Label Balancing? (2023), DataCebo Blog
