# Synthetic Digital Footprint Data Generation

This notebook generates synthetic digital footprint variables for credit scoring analysis, using the published distributions and correlations from Berg et al. (2020).

## Synthetic Digital Footprint Data with Copula and Cramér’s V Dependency Structure

Generate a synthetic dataset from the marginal distributions and the Cramér’s V matrix in Berg et al. (2020), ready for downstream CTGAN if desired.



In [7]:
import numpy as np
import pandas as pd
from scipy.stats import norm


## Variables and Definitions

### Digital Footprint Variables (from Table 2)
- **Credit bureau score (quintile):** [1–5], quintiles of external credit score.
- **Device type:** Desktop, Tablet, Mobile.
- **Operating system:** Windows, iOS, Android, Macintosh, Other.
- **E-mail host:** Gmx, Web, T-Online, Gmail, Yahoo, Hotmail, Other.
- **Channel:** Paid, Direct, Affiliate, Organic, Other.
- **Checkout time:** Evening (6pm–midnight), Night (midnight–6am), Morning (6am–noon), Afternoon (noon–6pm).
- **Do-not-track setting:** Yes/No.
- **Name in e-mail:** Yes/No (real name present in e-mail).
- **Number in e-mail:** Yes/No (number present in e-mail).
- **Is lowercase:** Yes/No (e-mail is all lowercase).
- **E-mail error:** Yes/No (typo in e-mail).

### Control Variables (see Table A1, regression notes)
- **Age:** In years, from credit bureau (simulate as int 18–80 if no marginal).
- **Gender:** Female/Male (simulate as Bernoulli, check marginal in paper).
- **Order amount:** Purchase amount in EUR (simulate with log-normal or normal, see paper for mean/stdev if available).
- **Item category:** 16 categories (simulate as categorical, uniform if no marginal).
- **Month:** Categorical (Oct 2015–Dec 2016, i.e., 15 categories).

If more marginals are available in the paper, add them here.


## Define Variable Schemas (Frequencies & Default Rates)


In [25]:
N = 100000  # Number of synthetic records for strong tail coverage
schemas = {
    "credit_score_quintile": ["Q1", "Q2", "Q3", "Q4", "Q5"],
    "device_type": ["Desktop", "Tablet", "Mobile", "Do-not-track"],
    "os": ["Windows", "iOS", "Android", "Macintosh", "Other", "Do-not-track"],
    "email_host": ["Gmx", "Web", "T-Online", "Gmail", "Yahoo", "Hotmail", "Other"],
    "channel": ["Paid", "Direct", "Affiliate", "Organic", "Other", "Do-not-track"],
    "checkout_time": ["Evening", "Night", "Morning", "Afternoon"],
    "name_in_email": ["No", "Yes"],
    "number_in_email": ["No", "Yes"],
    "is_lowercase": ["No", "Yes"],
    "email_error": ["No", "Yes"],
    # Control variables (names and label counts match the `variables` and `categories` definitions used later)
    "age_quintile": ["Q1", "Q2", "Q3", "Q4", "Q5"],
    "gender": ["Female", "Male"],
    "order_amount_quintile": ["Q1", "Q2", "Q3", "Q4", "Q5"],
    "item_category": [f"Cat{i}" for i in range(1, 17)],  # 16 placeholder labels; use the paper's category names if available
    "month": ["Oct15", "Nov15", "Dec15", "Jan16", "Feb16", "Mar16", "Apr16",
              "May16", "Jun16", "Jul16", "Aug16", "Sep16", "Oct16", "Nov16", "Dec16"]  # 15 months, Oct 2015–Dec 2016
}


# Marginal frequencies for each variable (proportions should sum to 1 per variable).
# Example values; replace or extend with figures from the paper where available.
marginals = {
    "credit_bureau_quintile":     [0.20, 0.20, 0.20, 0.20, 0.20],  # Even quintiles
    "device_type":                [0.57, 0.18, 0.11, 0.14],  # Desktop, Tablet, Mobile, Do-not-track (from Table 2)
    "operating_system":           [0.49, 0.16, 0.11, 0.08, 0.01, 0.14],  # Windows, iOS, Android, Macintosh, Other, Do-not-track (sums to 0.99)
    "email_host":                 [0.23, 0.22, 0.12, 0.11, 0.05, 0.04, 0.24],  # Gmx, Web, T-Online, Gmail, Yahoo, Hotmail, Other (sums to 1.01)
    "channel":                    [0.44, 0.18, 0.10, 0.07, 0.07, 0.14],  # Paid, Direct, Affiliate, Organic, Other, Do-not-track
    "checkout_time":              [0.43, 0.03, 0.18, 0.36],  # Evening, Night, Morning, Afternoon
    "do_not_track":               [0.86, 0.14],  # No, Yes
    "name_in_email":              [0.28, 0.72],  # No, Yes (from Table 2)
    "number_in_email":            [0.84, 0.16],  # No, Yes
    "is_lowercase":               [0.99, 0.01],  # No, Yes
    "email_error":                [0.92, 0.08],  # No, Yes
    "age":                        None,  # Placeholder; see note below.
    "gender":                     [0.66, 0.34],  # Female, Male; verify against the paper's marginal.
    "order_amount":               None,  # Placeholder; see note below.
    "item_category":              [1/16.]*16,  # Uniform if nothing better.
    "month":                      [1/15.]*15   # Uniform over 15 months if nothing better.
}
# Note: Age/order amount can be simulated as normal/lognormal if no distribution found—see next cell.
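
The comment above asks for proportions that sum to 1, which is easy to check mechanically. The following is a minimal sketch, assuming the categorical lists are copied from the cell above; `find_bad_marginals` and `marginals_check` are hypothetical names introduced only for this illustration.

```python
import numpy as np

# Categorical marginals copied from the cell above (hypothetical variable name)
marginals_check = {
    "device_type": [0.57, 0.18, 0.11, 0.14],
    "operating_system": [0.49, 0.16, 0.11, 0.08, 0.01, 0.14],
    "email_host": [0.23, 0.22, 0.12, 0.11, 0.05, 0.04, 0.24],
    "channel": [0.44, 0.18, 0.10, 0.07, 0.07, 0.14],
    "checkout_time": [0.43, 0.03, 0.18, 0.36],
}

def find_bad_marginals(marginals, tol=1e-6):
    """Return {name: total} for every list whose proportions do not sum to 1."""
    bad = {}
    for name, probs in marginals.items():
        if probs is None:  # placeholders such as age / order_amount
            continue
        total = float(np.sum(probs))
        if abs(total - 1.0) > tol:
            bad[name] = round(total, 4)
    return bad

bad_marginals = find_bad_marginals(marginals_check)
print(bad_marginals)
```

With the values in this notebook, the operating-system and e-mail-host lists are flagged (0.99 and 1.01), possibly because the percentages were rounded. The threshold mapping used later drops the last cumulative probability, so the final category silently absorbs the difference; this is consistent with the realized shares of about 0.15 for `Do-not-track` under `os` and 0.23 for `Other` under `email_host` in the frequency tables further down.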


## Simulate Age and Order Amount

In [26]:
import numpy as np
import pandas as pd

N = 100000  # Sample size, consistent with earlier definition

# --- Age: Normal Distribution, Clipped to Empirical Range ---
age_mean = 45.06  # Mean from paper
age_std = 13.31   # Standard deviation from paper
age_min = 18      # Minimum age
age_max = 80      # Maximum age (small % >70, but 80 as upper bound)

ages = np.random.normal(loc=age_mean, scale=age_std, size=N)
ages = np.clip(np.round(ages), age_min, age_max).astype(int)

# --- Order Amount: Log-Normal Distribution, Matched to Mean and Median ---
# Paper reports: mean = 318, median = 219, sd = 317, IQR = 120–400
# For log-normal: median = exp(mu), mean = exp(mu + sigma^2/2)
order_median = 219  # Median from paper
order_mean = 318    # Mean from paper

mu = np.log(order_median)               # mu = ln(median)
sigma = np.sqrt(2 * (np.log(order_mean) - mu))  # Solve for sigma

order_amounts = np.random.lognormal(mean=mu, sigma=sigma, size=N)
order_amounts = np.clip(order_amounts, 10, 1500)  # Clip to plausible range
order_amounts = np.round(order_amounts, 2)        # Round to 2 decimal places

# Initialize synthetic DataFrame with continuous variables
synthetic = pd.DataFrame({
    "age": ages,
    "order_amount": order_amounts
})

# Bin into quintiles for use with Cramér’s V matrix
synthetic['age_quintile'] = pd.qcut(synthetic['age'], 5, labels=["Q1", "Q2", "Q3", "Q4", "Q5"])
synthetic['order_amount_quintile'] = pd.qcut(synthetic['order_amount'], 5, labels=["Q1", "Q2", "Q3", "Q4", "Q5"])

# Display first few rows to verify
synthetic.head()

Unnamed: 0,age,order_amount,age_quintile,order_amount_quintile
0,29,127.3,Q1,Q2
1,36,298.13,Q2,Q4
2,66,199.29,Q5,Q3
3,42,366.43,Q2,Q4
4,35,150.19,Q2,Q2
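
The log-normal above is pinned to the reported mean and median only, so it is worth checking what it implies for the other reported figures (sd = 317, IQR = 120–400). The sketch below uses the standard log-normal formulas and the same `mu` and `sigma` as the cell above; the `implied_*` names are introduced here for illustration.

```python
import numpy as np
from scipy.stats import norm

order_median = 219.0  # median used in the cell above
order_mean = 318.0    # mean used in the cell above

# Same parameterization as the cell above
mu = np.log(order_median)
sigma = np.sqrt(2 * (np.log(order_mean) - mu))

# Standard deviation and quartiles implied by a log-normal(mu, sigma)
implied_sd = order_mean * np.sqrt(np.exp(sigma**2) - 1)
z75 = norm.ppf(0.75)
implied_q1 = np.exp(mu - z75 * sigma)
implied_q3 = np.exp(mu + z75 * sigma)

print(f"sigma       = {sigma:.4f}")
print(f"implied sd  = {implied_sd:.1f}")
print(f"implied IQR = {implied_q1:.1f} to {implied_q3:.1f}")
```

These are the moments before clipping. Clipping to the range 10–1500 trims the right tail, which pulls the realized mean and standard deviation below the unclipped values.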


## Load and Clean the Cramér’s V Matrix

In [27]:
import numpy as np
import pandas as pd
from scipy.stats import norm

# Define variables in the order of the Cramér’s V matrix
variables = [
    "credit_score_quintile",
    "device_type",
    "os",
    "email_host",
    "channel",
    "checkout_time",
    "name_in_email",
    "number_in_email",
    "is_lowercase",
    "email_error",
    "age_quintile",
    "order_amount_quintile",
    "item_category",
    "month"
]

# Define categories for each variable (consistent with schemas)
categories = {
    "credit_score_quintile": ["Q1", "Q2", "Q3", "Q4", "Q5"],
    "device_type": ["Desktop", "Tablet", "Mobile", "Do-not-track"],
    "os": ["Windows", "iOS", "Android", "Macintosh", "Other", "Do-not-track"],
    "email_host": ["Gmx", "Web", "T-Online", "Gmail", "Yahoo", "Hotmail", "Other"],
    "channel": ["Paid", "Direct", "Affiliate", "Organic", "Other", "Do-not-track"],
    "checkout_time": ["Evening", "Night", "Morning", "Afternoon"],
    "name_in_email": ["No", "Yes"],
    "number_in_email": ["No", "Yes"],
    "is_lowercase": ["No", "Yes"],
    "email_error": ["No", "Yes"],
    "age_quintile": ["Q1", "Q2", "Q3", "Q4", "Q5"],
    "order_amount_quintile": ["Q1", "Q2", "Q3", "Q4", "Q5"],
    "item_category": [f"Cat{i}" for i in range(1, 17)],  # 16 categories
    "month": ["Oct15", "Nov15", "Dec15", "Jan16", "Feb16", "Mar16", "Apr16", 
              "May16", "Jun16", "Jul16", "Aug16", "Sep16", "Oct16", "Nov16", "Dec16"]  # 15 months
}

# Define marginal probabilities (from your marginals dictionary, adjusted)
marginals_list = [
    [0.20, 0.20, 0.20, 0.20, 0.20],  # credit_score_quintile
    [0.57, 0.18, 0.11, 0.14],        # device_type
    [0.49, 0.16, 0.11, 0.08, 0.01, 0.14],  # os
    [0.23, 0.22, 0.12, 0.11, 0.05, 0.04, 0.24],  # email_host
    [0.44, 0.18, 0.10, 0.07, 0.07, 0.14],  # channel
    [0.43, 0.03, 0.18, 0.36],        # checkout_time
    [0.28, 0.72],                    # name_in_email
    [0.84, 0.16],                    # number_in_email
    [0.99, 0.01],                    # is_lowercase
    [0.92, 0.08],                    # email_error
    [0.20, 0.20, 0.20, 0.20, 0.20],  # age_quintile
    [0.20, 0.20, 0.20, 0.20, 0.20],  # order_amount_quintile
    [1/16]*16,                       # item_category (uniform over 16)
    [1/15]*15                        # month (uniform over 15)
]

# Load Cramér’s V matrix (from your existing code)
cramers_v_array = np.array([
    [1.00, 0.07, 0.05, 0.07, 0.03, 0.03, 0.01, 0.07, 0.02, 0.00, 0.2, 0.01, 0.05, 0.01],
    [0.07, 1.00, 0.71, 0.07, 0.06, 0.04, 0.05, 0.06, 0.07, 0.01, 0.12, 0.03, 0.05, 0.06],
    [0.05, 0.71, 1.00, 0.08, 0.06, 0.04, 0.06, 0.08, 0.06, 0.01, 0.1, 0.02, 0.04, 0.03],
    [0.07, 0.07, 0.08, 1.00, 0.03, 0.03, 0.08, 0.18, 0.04, 0.06, 0.16, 0.02, 0.02, 0.01],
    [0.03, 0.06, 0.06, 0.03, 1.00, 0.02, 0.01, 0.02, 0.04, 0.02, 0.09, 0.04, 0.06, 0.13],
    [0.03, 0.04, 0.04, 0.03, 0.02, 1.00, 0.01, 0.01, 0.01, 0.01, 0.06, 0.01, 0.03, 0.02],
    [0.01, 0.05, 0.06, 0.08, 0.01, 0.01, 1.00, 0.22, 0.01, 0.02, 0.04, 0.01, 0.03, 0.01],
    [0.07, 0.06, 0.08, 0.18, 0.02, 0.01, 0.22, 1.00, 0.02, 0.00, 0.06, 0.01, 0.04, 0.01],
    [0.02, 0.07, 0.06, 0.04, 0.04, 0.01, 0.01, 0.02, 1.00, 0.03, 0.03, 0.02, 0.02, 0.02],
    [0.00, 0.01, 0.01, 0.06, 0.02, 0.01, 0.02, 0.00, 0.03, 1.00, 0.03, 0.01, 0.01, 0.01],
    [0.2, 0.12, 0.1, 0.16, 0.09, 0.06, 0.04, 0.06, 0.03, 0.03, 1.00, 0.05, 0.11, 0.03],
    [0.01, 0.03, 0.02, 0.02, 0.04, 0.01, 0.01, 0.01, 0.02, 0.01, 0.05, 1.00, 0.27, 0.02],
    [0.05, 0.05, 0.04, 0.02, 0.06, 0.03, 0.03, 0.04, 0.02, 0.01, 0.11, 0.27, 1.00, 0.11],
    [0.01, 0.06, 0.03, 0.01, 0.13, 0.02, 0.01, 0.01, 0.02, 0.01, 0.03, 0.02, 0.11, 1.00]
])

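# Check that the Cramér’s V matrix is positive semi-definite (PSD), with a small
# tolerance for floating-point noise. Cramér’s V is unsigned and is not a latent
# correlation, so using it as the Gaussian covariance is a heuristic.
eigvals = np.linalg.eigvalsh(cramers_v_array)
if np.any(eigvals < -1e-10):
    print("Warning: Cramér’s V matrix is not positive semi-definite. Falling back to the identity matrix.")
    cramers_v_array = np.eye(len(variables))  # Fallback: all variables independent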

# Compute thresholds for each variable
thresholds_list = []
for p in marginals_list:
    cumprob = np.cumsum(p)[:-1]  # Cumulative probs excluding 1.0
    thresholds = norm.ppf(cumprob)
    thresholds_list.append(thresholds)

# Generate multivariate normal data with Cramér’s V as covariance
Z = np.random.multivariate_normal(mean=np.zeros(14), cov=cramers_v_array, size=N)

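# Map each latent column of Z to categories via its thresholds.
# Note: this overwrites the age_quintile and order_amount_quintile columns created
# with pd.qcut above, so they no longer correspond to the age and order_amount values.
for i, var in enumerate(variables):
    # searchsorted returns the index of the interval each z value falls into
    cat_indices = np.searchsorted(thresholds_list[i], Z[:, i], side='right')
    synthetic[var] = np.asarray(categories[var])[cat_indices]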

# Add gender independently (not in Cramér’s V matrix)
synthetic['gender'] = np.random.choice(['Female', 'Male'], size=N, p=[0.66, 0.34])

# Display first few rows to verify
synthetic.head()


Unnamed: 0,age,order_amount,age_quintile,order_amount_quintile,credit_score_quintile,device_type,os,email_host,channel,checkout_time,name_in_email,number_in_email,is_lowercase,email_error,item_category,month,gender
0,29,127.3,Q5,Q2,Q1,Desktop,Windows,Web,Other,Evening,No,No,No,No,Cat4,May16,Male
1,36,298.13,Q2,Q3,Q3,Do-not-track,iOS,Gmail,Direct,Afternoon,No,No,No,No,Cat1,Dec15,Female
2,66,199.29,Q4,Q2,Q2,Desktop,Macintosh,T-Online,Paid,Afternoon,Yes,No,No,No,Cat10,Apr16,Female
3,42,366.43,Q3,Q2,Q4,Desktop,Windows,Other,Other,Morning,Yes,No,No,No,Cat1,Mar16,Female
4,35,150.19,Q2,Q1,Q1,Desktop,Windows,Other,Affiliate,Afternoon,Yes,No,No,Yes,Cat9,Jun16,Male
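
The fallback in the cell above replaces a non-PSD matrix with the identity, which discards every dependency at once. A gentler option is to repair the matrix instead. The sketch below is one simple way to do that, assuming plain eigenvalue clipping followed by rescaling to a unit diagonal; `nearest_psd_correlation` is a hypothetical helper, not part of the notebook, and it is not a true nearest-correlation-matrix algorithm.

```python
import numpy as np

def nearest_psd_correlation(mat, eps=1e-8):
    """Clip eigenvalues below eps, then rescale so the diagonal is exactly 1."""
    sym = (mat + mat.T) / 2.0                      # enforce symmetry
    eigvals, eigvecs = np.linalg.eigh(sym)
    clipped = np.clip(eigvals, eps, None)          # remove negative eigenvalues
    repaired = eigvecs @ np.diag(clipped) @ eigvecs.T
    d = np.sqrt(np.diag(repaired))
    repaired = repaired / np.outer(d, d)           # back to unit diagonal
    return (repaired + repaired.T) / 2.0

# Hypothetical 3x3 example that is not PSD (its determinant is negative)
bad = np.array([
    [1.0, 0.9, -0.9],
    [0.9, 1.0, 0.9],
    [-0.9, 0.9, 1.0],
])
fixed = nearest_psd_correlation(bad)
print("min eigenvalue before:", np.linalg.eigvalsh(bad).min())
print("min eigenvalue after: ", np.linalg.eigvalsh(fixed).min())
```

If the check in the notebook ever fires, calling such a helper on `cramers_v_array` would keep most of the dependency structure rather than dropping it.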


## Inspect the Shape and Preview the Data (Without Default Rates)

In [28]:
# Preview the shape and a summary of the synthetic dataset
print("Synthetic data shape:", synthetic.shape)
synthetic.describe(include='all')  # Quick summary for all columns (only the last expression in a cell is displayed)


Synthetic data shape: (100000, 17)


Unnamed: 0,age,order_amount,age_quintile,order_amount_quintile,credit_score_quintile,device_type,os,email_host,channel,checkout_time,name_in_email,number_in_email,is_lowercase,email_error,item_category,month,gender
count,100000.0,100000.0,100000,100000,100000,100000,100000,100000,100000,100000,100000,100000,100000,100000,100000,100000,100000
unique,,,5,5,5,4,6,7,6,4,2,2,2,2,16,15,2
top,,,Q3,Q3,Q1,Desktop,Windows,Other,Paid,Evening,Yes,No,No,No,Cat1,Dec15,Female
freq,,,20345,20101,20108,56992,48925,23060,43924,42915,72255,84053,98924,91910,6466,6889,65835
mean,45.16405,308.978784,,,,,,,,,,,,,,,
std,13.004586,282.897754,,,,,,,,,,,,,,,
min,18.0,10.0,,,,,,,,,,,,,,,
25%,36.0,122.76,,,,,,,,,,,,,,,
50%,45.0,218.7,,,,,,,,,,,,,,,
75%,54.0,389.4875,,,,,,,,,,,,,,,


## Assign Default Status (TARGET) Using Per-Category Rates

In [29]:
# Per-category default rates (keys must match the category labels defined in schemas)
default_rates = {
    "credit_score_quintile": {"Q1": 0.0212, "Q2": 0.0102, "Q3": 0.0068, "Q4": 0.0047, "Q5": 0.0039},
    "device_type": {"Desktop": 0.0074, "Tablet": 0.0091, "Mobile": 0.0214, "Do-not-track": 0.0088},
    "os": {"Windows": 0.0074, "iOS": 0.0107, "Android": 0.0179, "Macintosh": 0.0069, "Other": 0.0109, "Do-not-track": 0.0088},
    "email_host": {"Gmx": 0.0082, "Web": 0.0086, "T-Online": 0.0051, "Gmail": 0.0125, "Yahoo": 0.0196, "Hotmail": 0.0145, "Other": 0.0090},
    "channel": {"Paid": 0.0111, "Direct": 0.0084, "Affiliate": 0.0064, "Organic": 0.0086, "Other": 0.0069, "Do-not-track": 0.0088},
    "checkout_time": {"Evening": 0.0085, "Night": 0.0197, "Morning": 0.0109, "Afternoon": 0.0089},
    "name_in_email": {"No": 0.0124, "Yes": 0.0082},
    "number_in_email": {"No": 0.0084, "Yes": 0.0141},
    "is_lowercase": {"No": 0.0084, "Yes": 0.0214},
    "email_error": {"No": 0.0088, "Yes": 0.0509},
}

# Assign the TARGET column: average the per-category default rates across variables
cat_vars = [var for var in default_rates if var in synthetic.columns]
N = synthetic.shape[0]
default_probs = np.zeros(N)
for var in cat_vars:
    default_probs += synthetic[var].map(default_rates[var]).values
default_probs /= len(cat_vars)

# Rescale so that the mean default probability matches the target rate of 0.94%
desired_mean = 0.0094
current_mean = default_probs.mean()
scaling_factor = desired_mean / current_mean
adjusted_probs = np.clip(default_probs * scaling_factor, 0, 1)

# Assign TARGET using adjusted probabilities
synthetic['TARGET'] = (np.random.rand(N) < adjusted_probs).astype(int)

print("Synthetic default rate:", synthetic['TARGET'].mean())
n_defaults = synthetic['TARGET'].sum()
n_total = len(synthetic)
print(f"Number of defaults: {n_defaults} out of {n_total} ({n_defaults / n_total:.4%})")


Synthetic default rate: 0.00918
Number of defaults: 918 out of 100000 (0.9180%)


## Compare Summary Statistics to Berg et al.

In [31]:
# Summarize categorical variables: frequency tables
for col in cat_vars:
    print(f"\n{col} value counts:")
    print(synthetic[col].value_counts(normalize=True).sort_index())

# For continuous (age, order_amount)
print("\nAge summary:")
print(synthetic['age'].describe())
print("\nOrder amount summary:")
print(synthetic['order_amount'].describe())

# Default rate by category (optional, as in Table 2)
for col in cat_vars:
    print(f"\nDefault rate by {col}:")
    print(synthetic.groupby(col)['TARGET'].mean().round(4))



credit_score_quintile value counts:
credit_score_quintile
Q1    0.20108
Q2    0.19836
Q3    0.20074
Q4    0.20050
Q5    0.19932
Name: proportion, dtype: float64

device_type value counts:
device_type
Desktop         0.56992
Do-not-track    0.13985
Mobile          0.11081
Tablet          0.17942
Name: proportion, dtype: float64

os value counts:
os
Android         0.10915
Do-not-track    0.14985
Macintosh       0.07968
Other           0.01027
Windows         0.48925
iOS             0.16180
Name: proportion, dtype: float64

email_host value counts:
email_host
Gmail       0.10967
Gmx         0.22933
Hotmail     0.03990
Other       0.23060
T-Online    0.11864
Web         0.22107
Yahoo       0.05079
Name: proportion, dtype: float64

channel value counts:
channel
Affiliate       0.10101
Direct          0.18143
Do-not-track    0.13838
Organic         0.06984
Other           0.07010
Paid            0.43924
Name: proportion, dtype: float64

checkout_time value counts:
checkout_time
Afternoon  

## Export Your Synthetic Data

In [32]:
synthetic.to_csv("/home/frederickerleigh/Dokumente/Fintech Seminar/NewCode/FintechSeminar-Synthetic-Dataset/fintech-credit-scoring-seminar/data/processed/synthetic_digital_footprint_with_target.csv", index=False)


## Copula-Based Synthetic Data Generation

This section generates synthetic digital footprint data using a Gaussian copula, the empirical marginals, and the Cramér’s V matrix from Berg et al. (2020), following standard practices in synthetic data literature.


In [45]:
import numpy as np
import pandas as pd
from scipy.stats import norm

# Assume N, categories, marginals_list, and cramers_v_array are already defined and match your previous cells.
# The order of variables must match that of cramers_v_array!
variables = [
    "credit_score_quintile", "device_type", "os", "email_host", "channel", "checkout_time",
    "name_in_email", "number_in_email", "is_lowercase", "email_error",
    "age_quintile", "order_amount_quintile", "item_category", "month"
]

N = 100000  # Sample size; keep consistent with the earlier cells


### Compute Thresholds

For each variable, we convert marginal probabilities into Z-score cut points for the copula simulation.
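
As a small worked example of this step, the sketch below converts the device-type marginals into cut points, maps standard normal draws onto the categories, and compares the realized shares with the targets. The seed and sample size are arbitrary choices made for this illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

labels = np.array(["Desktop", "Tablet", "Mobile", "Do-not-track"])
probs = [0.57, 0.18, 0.11, 0.14]

# Cut points on the standard normal scale: one fewer than the number of categories
cut_points = norm.ppf(np.cumsum(probs)[:-1])

# Draw latent values and find the interval each one falls into
z = rng.standard_normal(200_000)
idx = np.searchsorted(cut_points, z, side="right")
device = labels[idx]

realized = np.bincount(idx, minlength=len(labels)) / z.size
for label, target, got in zip(labels, probs, realized):
    print(f"{label:<13} target={target:.2f} realized={got:.4f}")
```

Because only the first `k - 1` cumulative probabilities are used, the last category always receives whatever probability mass remains.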


In [46]:
# For each variable, compute thresholds that divide the standard normal into the same proportions as marginals
thresholds_list = []
for marg in marginals_list:
    cumprob = np.cumsum(marg)[:-1]
    thresholds = norm.ppf(cumprob)
    thresholds_list.append(thresholds)


### Simulate Correlated Normal Latent Variables

Use the Cramér’s V matrix as a dependency structure (covariance).
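
Cramér’s V and the correlation of the latent normals are different quantities, so passing V in as the covariance does not generally reproduce V in the generated categories. The sketch below illustrates this for the device-type and OS pair (V = 0.71 in the matrix): it draws a bivariate normal with correlation 0.71, discretizes both columns with the notebook's marginals, and measures the realized Cramér’s V. The helper `cramers_v` is a hypothetical name that uses the textbook formula without bias correction.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, norm

def cramers_v(x, y):
    """Cramér's V from the chi-squared statistic of the contingency table."""
    table = pd.crosstab(x, y).to_numpy()
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))

rng = np.random.default_rng(0)  # arbitrary seed
rho = 0.71                      # value used as latent correlation in the notebook
z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=50_000)

# Discretize with the device_type and os marginals from the notebook
device_cuts = norm.ppf(np.cumsum([0.57, 0.18, 0.11, 0.14])[:-1])
os_cuts = norm.ppf(np.cumsum([0.49, 0.16, 0.11, 0.08, 0.01, 0.14])[:-1])
device_idx = np.searchsorted(device_cuts, z[:, 0], side="right")
os_idx = np.searchsorted(os_cuts, z[:, 1], side="right")

realized_v = cramers_v(device_idx, os_idx)
print(f"latent correlation = {rho}, realized Cramér's V = {realized_v:.3f}")
```

The realized value comes out below the latent correlation. If matching the published V matters, one option is to tune each latent correlation until the realized V hits its target, and to run the same helper on pairs of generated columns as a fidelity check.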


In [47]:
# 1. Check Cramér’s V matrix (should be positive semi-definite)
eigvals = np.linalg.eigvalsh(cramers_v_array)
if np.any(eigvals < 0):
    print("Warning: Cramér’s V matrix not PSD. Using identity (independence) instead.")
    cramers_v_array = np.eye(len(variables))

# 2. Simulate multivariate normal
Z = np.random.multivariate_normal(mean=np.zeros(len(variables)), cov=cramers_v_array, size=N)

# 3. Map each column to category using thresholds
copula_df = pd.DataFrame()
for i, var in enumerate(variables):
    thresh = thresholds_list[i]
    z_col = Z[:, i]
    # Assign categories based on cut points
    idx = np.searchsorted(thresh, z_col, side='right')
    copula_df[var] = [categories[var][j] for j in idx]

# The continuous controls (age, order_amount) and gender are added in the next cell.

### Add Gender and Continuous Controls

Gender is not part of the Cramér’s V matrix, so it is drawn independently. The simulated age and order_amount columns are copied over from the `synthetic` DataFrame.


In [48]:
copula_df['gender'] = np.random.choice(['Female', 'Male'], size=N, p=[0.66, 0.34])
copula_df['age'] = synthetic['age']
copula_df['order_amount'] = synthetic['order_amount']
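
Because age and order_amount are drawn independently of `Z`, the quintile columns generated from `Z` do not correspond to the continuous values in the same row (in the first preview, a 29-year-old is labeled `Q5` for age). One way to keep them aligned, shown here as a sketch and not as the notebook's method, is to derive the continuous value from the same latent column and bin it afterwards. `z_age` is a hypothetical stand-in for the age column of `Z`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)      # arbitrary seed
n = 10_000
z_age = rng.standard_normal(n)      # hypothetical stand-in for the age column of Z

# Same age parameters as the simulation cell earlier in the notebook
age_mean, age_std = 45.06, 13.31
age = np.clip(np.round(age_mean + age_std * z_age), 18, 80).astype(int)

# Bin the derived ages, so each quintile label agrees with the age value
age_quintile = pd.qcut(age, 5, labels=["Q1", "Q2", "Q3", "Q4", "Q5"])

demo = pd.DataFrame({"age": age, "age_quintile": age_quintile})
ranges = demo.groupby("age_quintile", observed=True)["age"].agg(["min", "max"])
print(ranges)
```

The same idea would apply to order_amount by exponentiating `mu + sigma * z` for the corresponding latent column, under the log-normal assumption used earlier.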

### Assign Default Status

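Assign TARGET as before, by averaging the per-category default rates over the `copula_df` columns that appear in the `default_rates` dictionary. Unlike the first pass, this cell does not rescale the probabilities to the 0.94% target, so the realized default rate (about 1.02%) is slightly higher.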


In [49]:
cat_vars = [var for var in default_rates if var in copula_df.columns]
default_probs = np.zeros(N)
for var in cat_vars:
    default_probs += copula_df[var].map(default_rates[var]).values
default_probs /= len(cat_vars)
copula_df['TARGET'] = (np.random.rand(N) < default_probs).astype(int)
print("Copula synthetic default rate:", copula_df['TARGET'].mean())


Copula synthetic default rate: 0.01022
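
The two TARGET cells in this notebook share the same averaging logic and differ only in the rescaling step. A minimal sketch of a shared helper follows; `assign_target` and the toy data are hypothetical, and the rates are a two-variable subset of `default_rates` used only to keep the example self-contained.

```python
import numpy as np
import pandas as pd

def assign_target(df, rates, target_rate=None, rng=None):
    """Average per-category default rates, optionally rescale, then draw 0/1."""
    rng = np.random.default_rng() if rng is None else rng
    cols = [c for c in rates if c in df.columns]
    per_var = [df[c].map(rates[c]).to_numpy(dtype=float) for c in cols]
    probs = np.mean(per_var, axis=0)
    if target_rate is not None:
        # Rescale so the mean probability equals the desired default rate
        probs = np.clip(probs * target_rate / probs.mean(), 0, 1)
    target = (rng.random(len(df)) < probs).astype(int)
    return target, probs

# Toy data: two variables with the notebook's marginals and default rates
demo_rates = {
    "name_in_email": {"No": 0.0124, "Yes": 0.0082},
    "email_error": {"No": 0.0088, "Yes": 0.0509},
}
rng = np.random.default_rng(0)  # arbitrary seed
demo_df = pd.DataFrame({
    "name_in_email": rng.choice(["No", "Yes"], size=50_000, p=[0.28, 0.72]),
    "email_error": rng.choice(["No", "Yes"], size=50_000, p=[0.92, 0.08]),
})

target, probs = assign_target(demo_df, demo_rates, target_rate=0.0094, rng=rng)
print("mean probability:", probs.mean())
print("realized default rate:", target.mean())
```

Passing `target_rate=0.0094` for both datasets would make the classic and copula outputs comparable on their overall default rate.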


### Inspect and Save

Check the summary and optionally export your copula-generated synthetic dataset.


In [50]:
copula_df.to_csv("/home/frederickerleigh/Dokumente/Fintech Seminar/NewCode/FintechSeminar-Synthetic-Dataset/fintech-credit-scoring-seminar/data/processed/synthetic_digital_footprint_copula.csv", index=False)
print("Copula synthetic data", copula_df.shape)
copula_df.head()


Copula synthetic data (100000, 16)


Unnamed: 0,credit_score_quintile,device_type,os,email_host,channel,checkout_time,name_in_email,number_in_email,is_lowercase,email_error,age_quintile,order_amount_quintile,item_category,month,gender,TARGET
0,Q3,Desktop,Windows,Yahoo,Do-not-track,Evening,No,No,No,No,Q5,Q1,Cat16,Nov15,Female,0
1,Q3,Desktop,Windows,Other,Paid,Morning,Yes,Yes,No,No,Q5,Q1,Cat1,Aug16,Female,0
2,Q3,Do-not-track,iOS,Other,Paid,Afternoon,Yes,No,No,No,Q5,Q1,Cat8,Aug16,Male,0
3,Q2,Mobile,Windows,Gmail,Other,Evening,Yes,No,No,No,Q3,Q5,Cat10,Apr16,Female,0
4,Q2,Desktop,Windows,Gmx,Do-not-track,Afternoon,Yes,No,No,No,Q2,Q5,Cat16,Mar16,Female,0


## CTGAN-Based Synthetic Data Generation

In this section, we use the Conditional Tabular GAN (CTGAN) model from the SDV project to generate synthetic digital footprint data. CTGAN is a neural-network-based approach that learns the distributions and relationships directly from the seed data. Because the seed data here is itself synthetic, CTGAN can at best reproduce the structure encoded in the previous steps.


In [40]:
# If not installed, run this in a terminal/cell:
# !pip install ctgan   (ctgan is also installed as a dependency of sdv)

from ctgan import CTGAN
import pandas as pd
import numpy as np

# Load your original synthetic dataframe (the “classic” one with all columns, including TARGET)
seed_df = pd.read_csv("/home/frederickerleigh/Dokumente/Fintech Seminar/NewCode/FintechSeminar-Synthetic-Dataset/fintech-credit-scoring-seminar/data/processed/synthetic_digital_footprint_with_target.csv")  # or just use 'synthetic' if in memory

# If you need to limit the number of samples, do so here (optional):
# seed_df = seed_df.sample(n=10000, random_state=42)


### Train CTGAN

We train the CTGAN model on the seed data for 300 epochs. More epochs often improve fidelity at the cost of longer runtime, although GAN training does not improve monotonically. For research, 100–300 epochs is common.


In [41]:
ctgan = CTGAN(epochs=300, verbose=True, cuda=True)  # Set cuda=False if no GPU is available

# Columns that CTGAN should treat as categorical
categorical_columns = [
    "credit_score_quintile", "device_type", "os", "email_host", "channel", "checkout_time",
    "name_in_email", "number_in_email", "is_lowercase", "email_error",
    "age_quintile", "order_amount_quintile", "item_category", "month", "gender"
]
# TARGET is not listed here, so CTGAN models it as a continuous column.
# Add "TARGET" to the list if it should be generated as a binary category.

ctgan.fit(seed_df, discrete_columns=categorical_columns)



Attempting to run cuBLAS, but there was no current CUDA context! Attempting to set the primary context... (Triggered internally at /pytorch/aten/src/ATen/cuda/CublasHandlePool.cpp:181.)

Gen. (-3.84) | Discrim. (0.28): 100%|██████████| 300/300 [23:14<00:00,  4.65s/it] 


### Sample Synthetic Data

Once trained, we generate a new synthetic dataset of the same size as the original.


In [42]:
N = len(seed_df)
ctgan_synth = ctgan.sample(N)
ctgan_synth.to_csv("/home/frederickerleigh/Dokumente/Fintech Seminar/NewCode/FintechSeminar-Synthetic-Dataset/fintech-credit-scoring-seminar/data/processed/synthetic_digital_footprint_ctgan.csv", index=False)


### Quick Check

Preview the summary statistics and value counts for key columns.
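
Beyond eyeballing value counts, a single number per column makes the comparison with the seed data easier to scan. The sketch below computes the total variation distance between two categorical columns; `total_variation_distance` and the toy series are hypothetical and stand in for a column of `seed_df` and the same column of `ctgan_synth`.

```python
import pandas as pd

def total_variation_distance(real, synth):
    """Half the sum of absolute differences between category shares (0 = identical)."""
    p = real.value_counts(normalize=True)
    q = synth.value_counts(normalize=True)
    p, q = p.align(q, fill_value=0.0)  # categories missing on one side count as 0
    return 0.5 * float((p - q).abs().sum())

# Toy columns standing in for seed_df["device_type"] and ctgan_synth["device_type"]
seed_col = pd.Series(["Desktop"] * 57 + ["Tablet"] * 18 + ["Mobile"] * 11 + ["Do-not-track"] * 14)
synth_col = pd.Series(["Desktop"] * 50 + ["Tablet"] * 20 + ["Mobile"] * 15 + ["Do-not-track"] * 15)

tvd = total_variation_distance(seed_col, synth_col)
print(f"total variation distance: {tvd:.3f}")
```

Applied over `categorical_columns`, this would give one distance per column, with values near 0 indicating that the marginal shares were reproduced closely.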


In [43]:
print(ctgan_synth.head())
print(ctgan_synth.describe(include='all'))

# Optionally, value counts for categorical columns
for col in categorical_columns:
    print(f"{col} value counts:")
    print(ctgan_synth[col].value_counts(normalize=True).sort_index())


   age  order_amount age_quintile order_amount_quintile credit_score_quintile  \
0   35    307.037708           Q2                    Q5                    Q4   
1   51    317.069365           Q4                    Q3                    Q5   
2   46    151.610027           Q5                    Q2                    Q3   
3   37    399.188162           Q2                    Q3                    Q2   
4   46    103.616078           Q3                    Q4                    Q2   

    device_type       os email_host       channel checkout_time name_in_email  \
0        Tablet  Android      Other       Organic     Afternoon            No   
1        Mobile  Windows      Other         Other       Evening           Yes   
2  Do-not-track  Windows      Yahoo  Do-not-track       Evening           Yes   
3       Desktop  Windows        Gmx        Direct         Night            No   
4       Desktop  Android      Other  Do-not-track       Morning            No   

  number_in_email is_lower