# Should you use synthetic data for label balancing?

After working on synthetic data generation for the past six months, I have come across many articles claiming that synthetic data is the ultimate solution for nearly every machine learning problem. This perception is likely driven by the commercialization of the industry, where companies promote synthetic data as a universal fix. One example, and the inspiration for this article, is the Synthetic Data Vault (SDV) blog post titled "Can you use synthetic data for label balancing?" (https://sdv.dev/blog/synthetic-label-balancing/). The same applies to Gretel.

The article addresses a well-known issue in classification: imbalanced target labels. It correctly identifies common techniques such as Random Oversampling (ROS) and noise injection, and acknowledges their downsides (overfitting and added noise, respectively). However, it then presents synthetic data as a "compelling solution" without providing evidence. While I am a fan of SDV and its generators, preprocessors, and constraints, the article overlooks critical aspects. You definitely can use synthetic data for this purpose; the key question is whether you should, and how it compares to state-of-the-art (SOTA) techniques in this context.

In this article, I aim to answer that question by building on the aforementioned article and comparing synthetic data produced by SDV generators against the alternatives. Specifically, I compare data-level approaches (noise injection, ROS, Synthetic Minority Over-sampling TEchnique (SMOTE), CTGAN, and TVAE) against the algorithm-level approach of cost-sensitive learning. This exploration is not novel, and adjacent research is available in the literature: Adiputra and Wanchai (2024) compare similar approaches. However, they resample the data, that is, add or remove rows to change the class distribution, before performing cross-validation. This is a common pitfall in imbalanced classification: the validation folds are then no longer independent of the training folds, which causes data leakage.

This article aims to improve on this with a more methodologically sound approach, while providing intuition and explanation for practitioners who are less familiar with imbalanced classification problems.
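To make the leakage pitfall concrete, here is a minimal sketch using only the standard library. It is a toy illustration, not the setup used later in this article: the data, the helper names (`oversample`, `split`, `leaked`), and the single 80/20 split standing in for one cross-validation fold are all assumptions made for the example.

```python
import random

def oversample(rows, rng):
    """Randomly duplicate minority rows (label 1) until both classes have equal size."""
    minority = [r for r in rows if r[1] == 1]
    majority = [r for r in rows if r[1] == 0]
    if not minority:
        return rows[:]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return rows + extra

def split(rows, rng):
    """Shuffle a copy of the rows and hold out 20% as the test part."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = len(shuffled) // 5
    return shuffled[n_test:], shuffled[:n_test]

def leaked(train, test):
    """Count test rows whose observation id also occurs in the training part."""
    train_ids = {row_id for row_id, _ in train}
    return sum(1 for row_id, _ in test if row_id in train_ids)

rng = random.Random(0)
# Toy data: (observation id, label) with 980 majority and 20 minority rows.
data = [(i, 0) for i in range(980)] + [(i, 1) for i in range(980, 1000)]

# Wrong order: oversample first, then split.
train_bad, test_bad = split(oversample(data, rng), rng)
leak_bad = leaked(train_bad, test_bad)

# Right order: split first, then oversample only the training part.
train_raw, test_ok = split(data, rng)
train_ok = oversample(train_raw, rng)
leak_ok = leaked(train_ok, test_ok)

print("test rows also seen in training (resample, then split):", leak_bad)
print("test rows also seen in training (split, then resample):", leak_ok)
```

When resampling happens first, copies of the same minority observation land on both sides of the split, so the model is partly evaluated on rows it has already seen. Splitting first keeps the held-out part untouched.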

## Imports

In [5]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt     
import seaborn as sns

from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

## Reading in and preparing data

For this analysis, I use the Credit Card Fraud Detection dataset from Kaggle (https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud), which contains transactions and a label indicating whether each one was fraudulent. The goal is to predict whether a transaction is fraudulent, making this a binary classification task. Naturally, the number of genuine transactions far outweighs the number of fraudulent ones, resulting in an imbalanced classification task.

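For this comparison, only a subset of the columns is kept: the first ten (Time, V1 to V9) and the last two (Amount, Class).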

In [None]:
creditcard = pd.read_csv('../data/creditcard.csv')

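# Keep the first 10 columns (Time, V1 to V9) and the last 2 (Amount, Class)
creditcard = creditcard.iloc[:, list(range(10)) + list(range(-2, 0))]

# info() prints its summary directly; wrapping it in print() would also print "None"
creditcard.info()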

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 12 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  Amount  284807 non-null  float64
 11  Class   284807 non-null  int64  
dtypes: float64(11), int64(1)
memory usage: 26.1 MB


In [12]:
creditcard['Class'].value_counts(normalize = True)

Class
0    0.998273
1    0.001727
Name: proportion, dtype: float64

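If this imbalance is not accounted for, models tend to be biased toward the majority class. A related pitfall is judging them with the wrong metric: a model that labels every transaction as genuine reaches about 99.8% accuracy on this dataset while detecting no fraud at all.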
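The effect of the imbalance on accuracy can be checked with simple arithmetic. The sketch below uses only the row count and class proportion printed above. The "model" is a hypothetical baseline that always predicts class 0, not one of the models compared in this article.

```python
n_total = 284807                      # rows reported by creditcard.info()
n_fraud = round(0.001727 * n_total)   # fraud share from value_counts(normalize=True)
n_genuine = n_total - n_fraud

# Hypothetical baseline that always predicts the majority class (0 = genuine).
true_positives = 0
true_negatives = n_genuine
false_negatives = n_fraud

accuracy = (true_positives + true_negatives) / n_total
recall = true_positives / (true_positives + false_negatives)

print(f"fraud cases:  {n_fraud}")        # fraud cases:  492
print(f"accuracy:     {accuracy:.4f}")   # accuracy:     0.9983
print(f"fraud recall: {recall:.4f}")     # fraud recall: 0.0000
```

Accuracy rewards the baseline for the 284,315 genuine transactions it gets right and barely penalizes the 492 frauds it misses, which is why a metric focused on the minority class, such as recall, is more informative here.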

### Noise injection

Honestly, I am a bit surprised this was even recommended as an option in the article. I have never seen anyone use it.
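For readers who have not come across the technique, here is a rough sketch of what noise injection could look like. It assumes Gaussian noise scaled to each feature's standard deviation, which is only one possible variant and not necessarily the one the SDV article has in mind. The function name `inject_noise`, the `scale` parameter, and the toy rows are made up for illustration.

```python
import random
import statistics

def inject_noise(minority_rows, n_new, scale=0.05, seed=0):
    """Create n_new synthetic rows by jittering randomly chosen minority rows.

    scale is a hypothetical knob: the noise standard deviation expressed
    as a fraction of each feature's standard deviation.
    """
    rng = random.Random(seed)
    # Per-feature standard deviation within the minority class.
    feature_stds = [statistics.pstdev(column) for column in zip(*minority_rows)]
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority_rows)
        synthetic.append([
            value + rng.gauss(0.0, scale * std)
            for value, std in zip(base, feature_stds)
        ])
    return synthetic

# Toy minority class with two numeric features.
minority = [[1.0, 10.0], [1.2, 12.0], [0.8, 9.0], [1.1, 11.5]]
new_rows = inject_noise(minority, n_new=6)

for row in new_rows:
    print([round(value, 3) for value in row])
```

Each synthetic row is an existing minority row with a small random offset, so the new points stay close to observations that already exist.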