<a href="https://colab.research.google.com/github/RDGopal/IB9AU-2026/blob/main/SD4_Synthetic_Loan_Data_Generation_with_CTGAN_%26_TVAE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, we will evaluate two techniques to generate structured synthetic data: **Tabular GAN** and **Tabular Variational Autoencoder**.

# Part 1: Tabular GANs
Tabular GANs are a type of Generative Adversarial Network (GAN) specifically designed to generate synthetic tabular data (data organized in rows and columns, like a spreadsheet or a Pandas DataFrame) that closely resembles a real-world dataset. Traditional GANs were initially more successful in generating continuous data like images. Tabular data presents unique challenges due to the presence of:
* Mixed Data Types: Tables often contain both numerical (continuous or discrete) and categorical features.
* Complex Correlations: Features in a table can have intricate linear and non-linear relationships.
* Unbalanced Categories: Categorical features can have classes with highly varying frequencies.
* Discrete Values: Even numerical columns might represent discrete quantities.


CTGAN (Conditional Tabular Generative Adversarial Network) tackles these challenges while keeping the standard GAN architecture, which has three parts:
* Generator (G):
  * Takes random noise as input.
  * Its goal is to generate synthetic data samples that the discriminator cannot distinguish from real data.
  * It uses neural networks (typically Multi-Layer Perceptrons, or MLPs) to transform the noise into synthetic tabular data.
* Discriminator (D):
  * Takes a batch of data as input, which can be a mix of real samples from the original dataset and synthetic samples produced by the generator.
  * Its goal is to correctly classify each input sample as either "real" or "synthetic."
  * It also uses neural networks (MLPs) for this classification task.
* Adversarial Training:
  * The generator and discriminator are trained against each other.
  * The generator tries to fool the discriminator by producing increasingly realistic synthetic data.
  * The discriminator tries to become better at distinguishing real from synthetic data.
  * This competition drives both networks to improve, ideally leading the generator to produce synthetic data that is statistically very similar to the real data.


In essence, CTGAN aims to learn the underlying data-generating process of a tabular dataset by training a generator whose output fools a discriminator trained to tell it apart from the real data.
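
To make the adversarial game concrete, the sketch below computes the textbook GAN losses (binary cross-entropy) for a few made-up discriminator outputs. It is an illustration only: the function names and numbers are hypothetical, and the `ctgan` library's actual training loss may differ from this textbook form.

```python
import math

def bce(prob, label):
    """Binary cross-entropy for one predicted probability of 'real'."""
    eps = 1e-12  # avoids log(0)
    return -(label * math.log(prob + eps) + (1 - label) * math.log(1 - prob + eps))

def discriminator_loss(d_real, d_fake):
    """D wants real rows scored near 1 and synthetic rows scored near 0."""
    real_term = sum(bce(p, 1) for p in d_real) / len(d_real)
    fake_term = sum(bce(p, 0) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """G wants its synthetic rows to be scored as real (label 1)."""
    return sum(bce(p, 1) for p in d_fake) / len(d_fake)

# Hypothetical discriminator outputs (probability that a row is real)
toy_d_real = [0.9, 0.8, 0.95]
toy_d_fake_early = [0.1, 0.2, 0.05]  # early training: fakes are easy to spot
toy_d_fake_late = [0.5, 0.6, 0.45]   # later: fakes are harder to spot

print("G loss early:", round(generator_loss(toy_d_fake_early), 3))
print("G loss late: ", round(generator_loss(toy_d_fake_late), 3))
print("D loss early:", round(discriminator_loss(toy_d_real, toy_d_fake_early), 3))
print("D loss late: ", round(discriminator_loss(toy_d_real, toy_d_fake_late), 3))
```

As the synthetic rows become harder to spot, the generator's loss falls and the discriminator's loss rises, which is the tug-of-war described above.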

In [None]:
# sdv pulls in ctgan; datasets is needed for load_dataset below
!pip install sdv datasets

In [None]:
import pandas as pd
import numpy as np

from datasets import load_dataset

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, f1_score

from scipy.stats import ks_2samp
from sklearn.neighbors import NearestNeighbors


## 1. Load and inspect `AnguloM/loan_data`

The dataset is a LendingClub‑style consumer loan dataset hosted on Hugging Face.  
Key fields (from the dataset card):

- `not.fully.paid`: **outcome** – 1 if the loan was *not* fully repaid (default/charge‑off), 0 otherwise.
- `credit.policy`: 1 if the customer meets LendingClub's underwriting criteria.
- `purpose`: loan purpose (debt_consolidation, credit_card, etc.).
- Numeric features: `int.rate`, `installment`, `log.annual.inc`, `dti`, `fico`, `days.with.cr.line`, `revol.bal`, `revol.util`, `inq.last.6mths`, `delinq.2yrs`, `pub.rec`.

We will:
1. Load the dataset.
2. Inspect schema and basic statistics.
3. Check the class balance of `not.fully.paid`.


In [None]:
loan_ds = load_dataset("AnguloM/loan_data")
df = loan_ds["train"].to_pandas()

df.head()

In [None]:
df.info()


In [None]:
df["not.fully.paid"].value_counts(normalize=True)


## 2. Preprocessing with `not.fully.paid` as outcome

Our prediction / label variable is:

- `not.fully.paid` (1 = default / not fully repaid, 0 = fully paid).

We define three groups of columns:

- **Target:** `not.fully.paid`
- **Categorical features:** `purpose`, `credit.policy` (treated as discrete category).
- **Numeric features:** rate, installment, income, FICO, etc.

Steps:
1. Drop rows with missing values (dataset is usually clean, but we are defensive).
2. Split into features `X` and target `y`.


In [None]:
# Drop any NA rows to simplify the lab
df = df.dropna().reset_index(drop=True)

target_col = "not.fully.paid"

cat_cols = ["purpose", "credit.policy"]
num_cols = [
    "int.rate",
    "installment",
    "log.annual.inc",
    "dti",
    "fico",
    "days.with.cr.line",
    "revol.bal",
    "revol.util",
    "inq.last.6mths",
    "delinq.2yrs",
    "pub.rec"
]

X = df[cat_cols + num_cols].copy()
y = df[target_col].astype(int)

df[[target_col]].value_counts(normalize=True)


## 3. Train/test split on real data

We will later:

- Train a classifier on **real data** (baseline).
- Train a classifier on **synthetic data** and test on **real data** (TSTR).

To do this, we split the real dataset into `train` and `test` with stratification on `not.fully.paid`.
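
As a rough illustration of what stratification buys us, the sketch below splits a made-up label vector class by class so that both parts keep the same default rate. The helper `stratified_split` and the 16% rate are assumptions for the example; in the notebook itself `train_test_split(..., stratify=...)` does this job.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Split row indices so each class contributes the same share to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_size)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels: 160 defaults out of 1000 loans (16%)
toy_labels = [1] * 160 + [0] * 840
toy_train_idx, toy_test_idx = stratified_split(toy_labels)

toy_train_rate = sum(toy_labels[i] for i in toy_train_idx) / len(toy_train_idx)
toy_test_rate = sum(toy_labels[i] for i in toy_test_idx) / len(toy_test_idx)
print(len(toy_train_idx), len(toy_test_idx), toy_train_rate, toy_test_rate)
```

Without stratification, a random 20% test set could by chance contain noticeably more or fewer defaults than the training set, which makes F1 and AUC comparisons noisier.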


In [None]:
real_train, real_test = train_test_split(
    df,
    test_size=0.2,
    random_state=42,
    stratify=df[target_col]
)

real_train.shape, real_test.shape


## 4. Train CTGAN to generate synthetic loans

We will use **CTGAN**, a GAN variant designed specifically for mixed‑type tabular data.

Key design choices:

- Input to CTGAN: all feature columns **plus** the outcome `not.fully.paid`.
- `discrete_columns` parameter: includes all categorical fields and integer count features, including the binary outcome.

This lets us later **condition** on `not.fully.paid` if we want to oversample defaulted loans.
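
Conditioning matters most for rare categories. The CTGAN paper describes choosing the category to condition on during training in proportion to the logarithm of its frequency rather than the raw frequency, so that rare values are seen more often. The sketch below only illustrates that idea with made-up counts; it is not the `ctgan` library's code, and `log_frequency_probs` is a hypothetical helper.

```python
import math

def raw_frequency_probs(counts):
    """Probability of picking each category in proportion to its count."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def log_frequency_probs(counts):
    """Probability of picking each category in proportion to log(count)."""
    logs = {k: math.log(v) for k, v in counts.items()}
    total = sum(logs.values())
    return {k: v / total for k, v in logs.items()}

# Hypothetical counts for not.fully.paid: 0 = repaid, 1 = default
toy_counts = {0: 8400, 1: 1600}

toy_raw = raw_frequency_probs(toy_counts)
toy_log = log_frequency_probs(toy_counts)
print("raw:", {k: round(v, 3) for k, v in toy_raw.items()})
print("log:", {k: round(v, 3) for k, v in toy_log.items()})
```

Under log-frequency weighting the minority class is picked far more often than its raw share, which is why listing the outcome in `discrete_columns` helps the model learn the rare default class.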


In [None]:
from ctgan import CTGAN

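# Fit on the training split only, so the real test rows used later for TSTR
# never influence the synthetic data
ctgan_data = real_train[cat_cols + num_cols + [target_col]].copy()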

discrete_cols = cat_cols + ["inq.last.6mths", "delinq.2yrs", "pub.rec", target_col]

ctgan = CTGAN(
    epochs=300,       # increase for better quality if you have time/compute
    batch_size=500,
    verbose=True
)

ctgan.fit(ctgan_data, discrete_columns=discrete_cols)

## 5. Generate synthetic data

Let's generate 10,000 synthetic loan records.

We will inspect:

- Schema (column names, dtypes).
- Class balance for `not.fully.paid` in synthetic data vs. real data.


In [None]:
n_synth = 10_000
synthetic = ctgan.sample(n_synth)
synthetic.head()


In [None]:
synthetic.info()


In [None]:
print("Real outcome distribution:")
print(df[target_col].value_counts(normalize=True))

print("\nSynthetic outcome distribution:")
print(synthetic[target_col].value_counts(normalize=True))


### 5.1 Optional: condition on defaults (`not.fully.paid = 1`)

We can ask CTGAN to specifically generate records where the loan is **not fully paid**, which is useful for oversampling the rare default class.
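
Keeping only the rows with `not.fully.paid == 1` is a form of rejection sampling, so the number of usable rows depends on how often the sampler produces a default. The arithmetic below is a sketch under an assumed default rate of 16% (a hypothetical figure for illustration; check the rate printed in section 5), with hypothetical helper names.

```python
import math

def expected_kept(n_draws, positive_rate):
    """Expected number of sampled rows that pass the filter."""
    return n_draws * positive_rate

def draws_needed(target_count, positive_rate):
    """Approximate number of draws needed to keep `target_count` rows on average."""
    return math.ceil(target_count / positive_rate)

assumed_default_rate = 0.16  # hypothetical share of defaults among sampled rows

toy_kept = expected_kept(5000, assumed_default_rate)
toy_needed = draws_needed(2000, assumed_default_rate)
print("Expected defaults in 5000 draws:", toy_kept)
print("Draws needed for about 2000 defaults:", toy_needed)
```

If far fewer than 2000 defaults survive the filter, drawing 2000 of them with replacement repeats many rows, so either sample more rows up front or raise the share of defaults through conditioning.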


In [None]:
# Bias generation toward defaults with CTGAN's conditional sampling.
# Conditioning raises the share of defaults but does not guarantee every row
# is one, so we still filter afterwards.
large_synthetic_sample = ctgan.sample(5000, condition_column=target_col, condition_value=1)

# Keep only the records where 'not.fully.paid' is 1, then draw 2000 of them.
# replace=True covers the case where fewer than 2000 defaults were generated
# (some rows are then duplicated).
synthetic_defaults = large_synthetic_sample[large_synthetic_sample[target_col] == 1].sample(n=2000, random_state=42, replace=True)

synthetic_defaults[target_col].value_counts(normalize=True)

## 6. Is the Synthetic Data Trustworthy?

We will evaluate the generated synthetic data set along the three pillars of quality.

1. **Fidelity**: Does the synthetic data statistically resemble the real data?
2. **Utility**: Can a machine learning algorithm trained on synthetic data perform well on real data?
3. **Privacy**: Can we guarantee that the synthetic data does not expose sensitive information from the real data?

### 6.1 Fidelity: do synthetic and real loans look statistically similar?

We evaluate fidelity for the numeric columns in two ways: one column at a time (marginal distributions) and in pairs (correlations).

#### 6.1.1 Univariate Fidelity

- For each numeric column, run a **Kolmogorov–Smirnov (KS) test** comparing real vs synthetic samples.
- KS statistic near 0 ⇒ distributions are similar.
- Higher KS ⇒ synthetic deviates from real.

We do this feature by feature.
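
For intuition, the KS statistic is the largest vertical gap between the two empirical cumulative distribution functions. The sketch below computes it by hand on tiny made-up samples; `ks_statistic` is a hypothetical helper for illustration, and the notebook itself relies on `scipy.stats.ks_2samp`.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_values, x):
        # Share of values that are <= x
        return bisect_right(sorted_values, x) / len(sorted_values)

    # The gap can only change at observed values, so checking those is enough
    points = a_sorted + b_sorted
    return max(abs(ecdf(a_sorted, x) - ecdf(b_sorted, x)) for x in points)

# Hypothetical FICO-like values
toy_same = ks_statistic([660, 700, 720, 760], [660, 700, 720, 760])
toy_shifted = ks_statistic([660, 700, 720, 760], [720, 760, 780, 800])
toy_disjoint = ks_statistic([600, 620, 640], [780, 800, 820])
print(toy_same, toy_shifted, toy_disjoint)
```

Identical samples give a statistic of 0, fully separated samples give 1, and partial overlap lands in between.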


In [None]:
real = ctgan_data
syn = synthetic

ks_results = {}

for col in num_cols:
    r = real[col].sample(min(5000, len(real)), random_state=42)
    s = syn[col].sample(min(5000, len(syn)), random_state=42)
    stat, pval = ks_2samp(r, s)
    ks_results[col] = {"ks_stat": stat, "p_value": pval}

ks_df = pd.DataFrame(ks_results).T.sort_values("ks_stat")
ks_df


Interpretation:

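- Which features have the **lowest** KS (best‑matched distributions)?
- Which features are hardest for CTGAN to mimic (highest KS)?
- With thousands of rows per sample, the p-value is tiny for almost any visible difference, so rank features by the KS statistic rather than by the p-value.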



#### 6.1.2 Correlation Structure

We now compare **pairwise correlations** between numerical features in real vs synthetic data.

- Compute correlation matrices for real and synthetic numeric features.
- Look at absolute differences between them.
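
The comparison can be sketched for a single pair of columns. The numbers below are made up: the "real" sample has a strong negative link between FICO score and interest rate, while the "synthetic" sample has lost it. `pearson` is a hand-written helper for illustration; the notebook uses `DataFrame.corr()`.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical values for two columns
toy_real_fico = [640, 680, 700, 740, 780]
toy_real_rate = [0.16, 0.14, 0.13, 0.10, 0.08]  # falls as FICO rises
toy_syn_fico = [640, 680, 700, 740, 780]
toy_syn_rate = [0.12, 0.10, 0.16, 0.08, 0.14]   # relationship lost

toy_real_r = pearson(toy_real_fico, toy_real_rate)
toy_syn_r = pearson(toy_syn_fico, toy_syn_rate)
toy_gap = abs(toy_real_r - toy_syn_r)
print(round(toy_real_r, 3), round(toy_syn_r, 3), round(toy_gap, 3))
```

A large gap for a pair of columns means the synthesizer matched each column's marginal distribution without preserving how the two move together.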


In [None]:
real_corr = real[num_cols].corr()
syn_corr = syn[num_cols].corr()

corr_diff = (real_corr - syn_corr).abs()
corr_diff


In [None]:
# Average absolute difference in correlation per feature
corr_diff.mean().sort_values(ascending=False)


### 6.2 Utility: can synthetic data train a useful default model?

We evaluate **utility** using the TSTR protocol:

1. Fit a classifier on **synthetic** data.
2. Test on **held‑out real** data.
3. Compare performance with a classifier trained on **real** data (the reference baseline, and in practice usually an upper bound).

Metrics:

- ROC AUC
- F1‑score (for imbalanced classification)
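
The protocol can be sketched end to end with a deliberately tiny setup: one feature, a threshold classifier, and simulated data. Everything here is hypothetical (the data generator, the 30% default rate, the toy classifier); it only shows the train-on-synthetic, test-on-real bookkeeping, not the notebook's random forest.

```python
import random

def make_toy_loans(n, default_rate, gap, seed):
    """Simulated (interest_rate, label) rows; defaults have a higher mean rate."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = 1 if rng.random() < default_rate else 0
        rows.append((rng.gauss(0.11 + gap * label, 0.02), label))
    return rows

def fit_threshold(rows):
    """Toy classifier: cut halfway between the two class means."""
    pos = [rate for rate, label in rows if label == 1]
    neg = [rate for rate, label in rows if label == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def f1_at_threshold(rows, threshold):
    """F1 for the rule 'predict default when rate >= threshold'."""
    tp = sum(1 for rate, label in rows if rate >= threshold and label == 1)
    fp = sum(1 for rate, label in rows if rate >= threshold and label == 0)
    fn = sum(1 for rate, label in rows if rate < threshold and label == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

toy_real_train = make_toy_loans(2000, 0.3, gap=0.04, seed=1)
toy_real_test = make_toy_loans(1000, 0.3, gap=0.04, seed=2)
# Stand-in for a synthesizer that captured the pattern only imperfectly
toy_synthetic = make_toy_loans(2000, 0.3, gap=0.03, seed=3)

toy_f1_trtr = f1_at_threshold(toy_real_test, fit_threshold(toy_real_train))
toy_f1_tstr = f1_at_threshold(toy_real_test, fit_threshold(toy_synthetic))
print("TRTR F1:", round(toy_f1_trtr, 3), "TSTR F1:", round(toy_f1_tstr, 3))
```

Both models are scored on the same real test rows; only the training data differs, so any gap between the two scores is attributable to the synthetic data.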


#### 6.2.1 Create encoders/scaler on REAL training data

We will:

- Fit **OneHotEncoder** and **StandardScaler** only on the **real training** subset.
- Apply exactly the same transformations to synthetic data and real test data.
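
The point of fitting on the training subset only can be shown with a hand-rolled standardizer. This is a sketch with made-up FICO values and hypothetical helper names; it mirrors what `StandardScaler` does (mean and population standard deviation learned once, then reused unchanged).

```python
def fit_standardizer(values):
    """Learn mean and population standard deviation from the training values."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return mean, std

def standardize(values, mean, std):
    """Apply previously learned statistics; nothing is re-estimated here."""
    return [(v - mean) / std for v in values]

toy_train_fico = [660, 700, 740]  # hypothetical real training values
toy_other_fico = [700, 780]       # hypothetical synthetic or test values

toy_mean, toy_std = fit_standardizer(toy_train_fico)
toy_scaled = standardize(toy_other_fico, toy_mean, toy_std)
print(toy_mean, round(toy_std, 2), [round(z, 3) for z in toy_scaled])
```

Because the statistics come from the real training data, a value of 780 is measured on the same scale whether it appears in the synthetic set or the real test set.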


In [None]:
# Fit on REAL TRAINING DATA ONLY
ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
scaler = StandardScaler()

ohe.fit(real_train[cat_cols])
scaler.fit(real_train[num_cols])

def preprocess_for_model(df_subset):
    X_cat = df_subset[cat_cols]
    X_num = df_subset[num_cols]
    y_out = df_subset[target_col].astype(int)

    X_cat_enc = ohe.transform(X_cat)
    X_num_scaled = scaler.transform(X_num)

    X_all = np.hstack([X_cat_enc, X_num_scaled])
    return X_all, y_out


#### 6.2.2 Baseline: Train on REAL, Test on REAL

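This is our reference point and, in practice, an **upper bound** for performance: a model trained on synthetic data rarely beats one trained on the real data it was derived from.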


In [None]:
X_real_train, y_real_train = preprocess_for_model(real_train)
X_real_test, y_real_test = preprocess_for_model(real_test)

rf_real = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    random_state=42,
    n_jobs=-1
)
rf_real.fit(X_real_train, y_real_train)

y_proba_real = rf_real.predict_proba(X_real_test)[:, 1]
y_pred_real = (y_proba_real >= 0.5).astype(int)

auc_real = roc_auc_score(y_real_test, y_proba_real)
f1_real = f1_score(y_real_test, y_pred_real)

auc_real, f1_real


#### 6.2.3 TSTR: Train on SYNTHETIC, Test on REAL

We now:

- Use our `synthetic` dataframe as training data.
- Evaluate on the **same real test set** as above.

This tells us how good models trained purely on synthetic data are for predicting `not.fully.paid` on real loans.


In [None]:
X_syn_train, y_syn_train = preprocess_for_model(synthetic)

rf_syn = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    random_state=42,
    n_jobs=-1
)
rf_syn.fit(X_syn_train, y_syn_train)

y_proba_syn = rf_syn.predict_proba(X_real_test)[:, 1]
y_pred_syn = (y_proba_syn >= 0.5).astype(int)

auc_syn = roc_auc_score(y_real_test, y_proba_syn)
f1_syn = f1_score(y_real_test, y_pred_syn)

auc_syn, f1_syn


#### 6.2.4 Compare utility: TRTR vs TSTR


In [None]:
pd.DataFrame(
    {
        "AUC": [auc_real, auc_syn],
        "F1": [f1_real, f1_syn]
    },
    index=["Train REAL, Test REAL", "Train SYNTHETIC, Test REAL"]
)


### 6.3 Privacy: Approximate Memorization Check

CTGAN can overfit and memorize real rows, which is a privacy risk.

A simple heuristic:
- Take numeric features from synthetic data.
- For each synthetic point, compute the distance to the **nearest real point**.
- If many synthetic points are at extremely small distance, it may indicate memorization.

This is not a formal privacy guarantee, only a quick screening heuristic.
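
The heuristic can be sketched on a handful of made-up, already standardized rows. The first synthetic row is an exact copy of a real row, so its nearest-real distance is zero. `nearest_real_distance` is a hypothetical brute-force helper; the notebook uses `NearestNeighbors` for the same purpose.

```python
import math

def nearest_real_distance(syn_row, real_rows):
    """Euclidean distance from one synthetic row to its closest real row."""
    return min(math.dist(syn_row, real_row) for real_row in real_rows)

# Hypothetical standardized rows (two numeric features each)
toy_real_rows = [(0.0, 1.0), (1.5, -0.5), (-1.0, 0.2)]
toy_syn_rows = [(0.0, 1.0), (0.7, 0.1)]  # the first one copies a real row

toy_distances = [nearest_real_distance(row, toy_real_rows) for row in toy_syn_rows]
print([round(d, 3) for d in toy_distances])
```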


In [None]:
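# Sample down for speed (guarding against frames with fewer than 5000 rows)
real_num = real[num_cols].sample(min(5000, len(real)), random_state=42)
syn_num = syn[num_cols].sample(min(5000, len(syn)), random_state=42)

# Standardize with the scaler fit on real_train (section 6.2.1) so that
# large-valued columns such as revol.bal do not dominate the Euclidean distance
real_scaled = scaler.transform(real_num)
syn_scaled = scaler.transform(syn_num)

nn = NearestNeighbors(n_neighbors=1)
nn.fit(real_scaled)

# Distance from each synthetic row to its closest real row
distances, indices = nn.kneighbors(syn_scaled)
distances = distances.flatten()

pd.Series(distances).describe()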


Interpretation:

- Very small distances (e.g., many < 1e-6 after scaling) might indicate memorization.
- Larger distances suggest synthetic records are not exact copies.

For a production‑grade system, you would combine such checks with more formal privacy metrics (e.g., membership inference tests, differential privacy variants of CTGAN).


## 7. What did we learn?

In this notebook we:

1. Treated **`not.fully.paid` as the key outcome** for loan default risk.
2. Trained **CTGAN** on mixed‑type loan data to generate synthetic loans.
3. Evaluated **fidelity**:
   - KS test per numeric feature.
   - Correlation structure differences.
4. Evaluated **utility** via **Train‑Synthetic‑Test‑Real (TSTR)** against a real‑trained baseline.
5. Ran a simple **privacy heuristic** using nearest‑neighbor distances.



# Part 2 - TVAE: Tabular Variational Autoencoder

We now train **TVAE**, another SDV model for tabular data.  
TVAE models the joint distribution using a variational autoencoder instead of an adversarial game.
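
A variational autoencoder is trained on a reconstruction term plus a KL-divergence term that pulls the encoder's latent distribution toward a standard normal prior. The sketch below evaluates the closed-form KL term for a diagonal Gaussian. It illustrates the standard VAE formulation only; the helper name is hypothetical and the details of SDV's TVAE loss are not shown here.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv
        for m, lv in zip(mu, log_var)
    )

# Hypothetical encoder outputs for a 2-dimensional latent space
toy_kl_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])    # matches the prior
toy_kl_shifted = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])  # mean moved away
toy_kl_narrow = kl_to_standard_normal([0.0, 0.0], [-2.0, -2.0]) # variance shrunk
print(toy_kl_prior, toy_kl_shifted, round(toy_kl_narrow, 3))
```

The penalty is zero only when the encoder output matches the prior, and it grows as the mean drifts or the variance collapses. This keeps the latent space smooth enough to sample new rows from.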

We will:

1. Train TVAE on the same columns as CTGAN.
2. Generate synthetic loans.
3. Reuse the same evaluation pipeline (fidelity + TSTR utility).


In [None]:
from sdv.single_table import TVAESynthesizer

In [None]:
from sdv.metadata import SingleTableMetadata

# Fit on the training split only, so the real test rows used for TSTR stay unseen
tvae_data = real_train[cat_cols + num_cols + [target_col]].copy()

discrete_cols = cat_cols + ["inq.last.6mths", "delinq.2yrs", "pub.rec", target_col]

# Create metadata from the training data
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(tvae_data)

# Explicitly set discrete columns in the metadata
for col in discrete_cols:
    metadata.update_column(column_name=col, sdtype='categorical')

tvae = TVAESynthesizer(
    metadata=metadata,
    epochs=300,         # similar budget to CTGAN for fairness
    batch_size=500
)

tvae.fit(tvae_data)

### Generate synthetic loans with TVAE

We generate the same number of records (10,000) to make comparisons fair.


In [None]:
n_synth = 10_000
synthetic_tvae = tvae.sample(n_synth)

synthetic_tvae.head()


In [None]:
print("TVAE synthetic outcome distribution:")
print(synthetic_tvae[target_col].value_counts(normalize=True))


## Fidelity: CTGAN vs TVAE (KS test)

We compute the KS statistic per numeric feature for:

- Real vs **CTGAN** synthetic
- Real vs **TVAE** synthetic

Lower KS ⇒ closer match to real marginal distribution.


In [None]:
real = df[cat_cols + num_cols + [target_col]].copy()

def ks_per_feature(real_df, syn_df, num_cols):
    results = {}
    for col in num_cols:
        r = real_df[col].sample(min(5000, len(real_df)), random_state=42)
        s = syn_df[col].sample(min(5000, len(syn_df)), random_state=42)
        stat, pval = ks_2samp(r, s)
        results[col] = {"ks_stat": stat, "p_value": pval}
    return pd.DataFrame(results).T

ks_ctgan = ks_per_feature(real, synthetic, num_cols)
ks_tvae  = ks_per_feature(real, synthetic_tvae, num_cols)

ks_compare = pd.DataFrame({
    "KS_CTGAN": ks_ctgan["ks_stat"],
    "KS_TVAE": ks_tvae["ks_stat"]
}).sort_values("KS_CTGAN")

ks_compare


You can quickly see:

- Which model better fits each numeric feature.
- Whether one model tends to systematically have lower KS across features.


### Correlation structure: CTGAN vs TVAE

We compare correlation matrices as before, now for both synthesizers.


In [None]:
real_corr = real[num_cols].corr()
ctgan_corr = synthetic[num_cols].corr()
tvae_corr  = synthetic_tvae[num_cols].corr()

ctgan_corr_diff = (real_corr - ctgan_corr).abs()
tvae_corr_diff  = (real_corr - tvae_corr).abs()

corr_compare = pd.DataFrame({
    "mean_abs_diff_CTGAN": ctgan_corr_diff.mean(),
    "mean_abs_diff_TVAE": tvae_corr_diff.mean()
}).sort_values("mean_abs_diff_CTGAN")

corr_compare


## Utility: TSTR for CTGAN vs TVAE

We reuse the same **Train‑Synthetic‑Test‑Real** pipeline:

- Encoders (`ohe`) and `scaler` were fit on **real_train**.
- `preprocess_for_model` converts any dataframe to model‑ready `X`, `y`.


In [None]:
X_tvae_train, y_tvae_train = preprocess_for_model(synthetic_tvae)

rf_tvae = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    random_state=42,
    n_jobs=-1
)
rf_tvae.fit(X_tvae_train, y_tvae_train)

y_proba_tvae = rf_tvae.predict_proba(X_real_test)[:, 1]
y_pred_tvae = (y_proba_tvae >= 0.5).astype(int)

auc_tvae = roc_auc_score(y_real_test, y_proba_tvae)
f1_tvae = f1_score(y_real_test, y_pred_tvae)

auc_tvae, f1_tvae


### Compare TRTR, CTGAN‑TSTR, TVAE‑TSTR


In [None]:
utility_df = pd.DataFrame(
    {
        "AUC": [auc_real, auc_syn, auc_tvae],
        "F1":  [f1_real,  f1_syn,  f1_tvae]
    },
    index=[
        "Train REAL, Test REAL",
        "Train CTGAN, Test REAL",
        "Train TVAE, Test REAL"
    ]
)

utility_df


Discussion prompts:

- Which synthesizer gives a classifier whose AUC/F1 is closer to the **real‑trained** baseline?
- Are there differences in calibration or class balance that might explain performance?


## Privacy heuristic: nearest‑neighbor distance (CTGAN vs TVAE)

We reuse the **nearest‑neighbor distance** approach to compare memorization risk for the two models.


In [None]:
# Sample real numeric subset for reference
real_num = real[num_cols].sample(5000, random_state=42)

# Standardize with the scaler fit on real_train so that large-valued columns
# (e.g. revol.bal) do not dominate the Euclidean distance
nn = NearestNeighbors(n_neighbors=1)
nn.fit(scaler.transform(real_num))

# CTGAN
syn_ctgan_num = synthetic[num_cols].sample(5000, random_state=42)
dist_ctgan, _ = nn.kneighbors(scaler.transform(syn_ctgan_num))
dist_ctgan = dist_ctgan.flatten()

# TVAE
syn_tvae_num = synthetic_tvae[num_cols].sample(5000, random_state=42)
dist_tvae, _ = nn.kneighbors(scaler.transform(syn_tvae_num))
dist_tvae = dist_tvae.flatten()

privacy_df = pd.DataFrame(
    {
        "CTGAN_dist": dist_ctgan,
        "TVAE_dist": dist_tvae
    }
)

privacy_df.describe()


Interpretation:

- Higher typical nearest‑neighbor distances suggest less memorization.
- If one model has consistently much smaller distances, it may be overfitting more to specific real points.
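
One way to turn this reading into numbers is to compare the median distance and the share of near-duplicates for each model. The sketch below uses made-up distance lists and an arbitrary threshold of 0.01; both the threshold and the helper `share_below` are assumptions for illustration, not established cut-offs.

```python
from statistics import median

def share_below(distances, threshold):
    """Fraction of synthetic rows closer than `threshold` to a real row."""
    return sum(1 for d in distances if d < threshold) / len(distances)

# Hypothetical nearest-neighbor distances for two synthesizers
toy_dist_model_a = [0.0, 0.001, 0.4, 0.9, 1.2]
toy_dist_model_b = [0.3, 0.5, 0.8, 1.0, 1.4]
near_duplicate_threshold = 0.01  # arbitrary cut-off chosen for this example

for name, dists in [("model A", toy_dist_model_a), ("model B", toy_dist_model_b)]:
    print(name, "median:", median(dists),
          "near-duplicate share:", share_below(dists, near_duplicate_threshold))
```

Here model A has a similar median to model B but two rows that sit almost on top of real records, which is the pattern that a summary table of means alone can hide.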


## CTGAN vs TVAE: what to observe


1. **Fidelity (KS & correlation):**
   - For which features does CTGAN best mimic the real distribution?
   - Where does TVAE do better?
   - Is one model more consistent across features?

2. **Utility (TSTR AUC & F1):**
   - Which synthetic dataset produces more useful classifiers for predicting `not.fully.paid`?
   - How close are they to the real‑trained baseline?

3. **Privacy heuristic (NN distances):**
   - Does either model appear to memorize real records more?

This gives a concrete, model‑agnostic way to discuss **fidelity–utility–privacy trade‑offs** across two popular tabular synthesizers on a realistic loan default problem.


# Required Task 13
Load the file `fraud_transactions.csv` and create a synthetic data set of 5000 records. Evaluate the quality of the synthetic data created.