# # 04: Subsampling and Preparation for Modeling
#
# This notebook prepares synthetic datasets (manual, copula, CTGAN) for modeling by:
# - Subsampling each to match the exact number of defaults and non-defaults in the Home Credit dataset (~8% default rate).
# - Aligning variable columns across all datasets.
# - Exporting ready-to-model merged (Home Credit + synthetic footprint variables) and separate datasets.
#
# **Key Notes:**
# - The Home Credit dataset has a default rate of ~8%, higher than Berg et al. (~0.9%).
# - Subsampling ensures exact matches for default/non-default counts for fair modeling comparisons.
# - Random seeds are set for reproducibility.
# - All datasets are validated for shape, column consistency, and default rates after each major step.
# - File paths are absolute and machine-specific; adjust them to your environment.
# - File loads, subsampling, merging, and saves raise an error on failure rather than continuing silently.


# %% [markdown]
# ## Load and Standardize Datasets
#
# Load the Home Credit and synthetic datasets, standardize the target column to 'TARGET', and validate dataset shapes and target column presence. File loading errors are reported and re-raised.

In [29]:
import pandas as pd
import numpy as np
import os
from sklearn.utils import resample

# Set random seed for reproducibility
np.random.seed(42)

# Define file paths (absolute and machine-specific)
file_paths = {
    "homecredit": "/home/frederickerleigh/Dokumente/Fintech Seminar/NewCode/FintechSeminar-Synthetic-Dataset/fintech-credit-scoring-seminar/data/processed/home_credit_sample.csv",
    "manual": "/home/frederickerleigh/Dokumente/Fintech Seminar/NewCode/FintechSeminar-Synthetic-Dataset/fintech-credit-scoring-seminar/data/processed/synthetic_digital_footprint_with_target.csv",
    "copula": "/home/frederickerleigh/Dokumente/Fintech Seminar/NewCode/FintechSeminar-Synthetic-Dataset/fintech-credit-scoring-seminar/data/processed/synthetic_digital_footprint_copula.csv",
    "ctgan": "/home/frederickerleigh/Dokumente/Fintech Seminar/NewCode/FintechSeminar-Synthetic-Dataset/fintech-credit-scoring-seminar/data/processed{synthetic_digital_footprint_ctgan.csv"
}

# Load datasets with error handling
datasets = {}
for name, path in file_paths.items():
    try:
        datasets[name] = pd.read_csv(path)
        print(f"Successfully loaded {name} from {path}")
    except Exception as e:
        print(f"Error loading {name}: {e}")
        raise

homecredit = datasets["homecredit"]
manual = datasets["manual"]
copula = datasets["copula"]
ctgan = datasets["ctgan"]

# Standardize target column to 'TARGET'
for df, name in [(homecredit, "Home Credit"), (manual, "Manual"), (copula, "Copula"), (ctgan, "CTGAN")]:
    if 'DEFAULT' in df.columns and 'TARGET' not in df.columns:
        df.rename(columns={'DEFAULT': 'TARGET'}, inplace=True)
    elif 'TARGET' not in df.columns:
        raise ValueError(f"Target column not found in {name} dataset")
    print(f"{name}: Target column standardized to 'TARGET'")

# Initial shape and default count check
n_total = len(homecredit)
n_defaults = int(homecredit['TARGET'].sum())
n_nondefaults = n_total - n_defaults
expected_default_rate = n_defaults / n_total

print("\nDataset Shape Check:")
print(f"Home Credit: {homecredit.shape}")
print(f"Manual Synth: {manual.shape}")
print(f"Copula Synth: {copula.shape}")
print(f"CTGAN Synth: {ctgan.shape}")
print(f"\nHome Credit: {n_defaults} defaults, {n_nondefaults} non-defaults, default rate: {expected_default_rate:.2%}")

# Check available samples in synthetic datasets
for label, df in [("Manual", manual), ("Copula", copula), ("CTGAN", ctgan)]:
    available_defaults = df[df['TARGET'] == 1].shape[0]
    available_nondefaults = df[df['TARGET'] == 0].shape[0]
    print(f"{label}: {available_defaults} defaults, {available_nondefaults} non-defaults")
    if available_defaults < n_defaults or available_nondefaults < n_nondefaults:
        print(f"⚠️ WARNING: Not enough samples in {label} synthetic dataset! Will sample with replacement.")



Successfully loaded homecredit from /home/frederickerleigh/Dokumente/Fintech Seminar/NewCode/FintechSeminar-Synthetic-Dataset/fintech-credit-scoring-seminar/data/processed/home_credit_sample.csv
Successfully loaded manual from /home/frederickerleigh/Dokumente/Fintech Seminar/NewCode/FintechSeminar-Synthetic-Dataset/fintech-credit-scoring-seminar/data/processed/synthetic_digital_footprint_with_target.csv
Successfully loaded copula from /home/frederickerleigh/Dokumente/Fintech Seminar/NewCode/FintechSeminar-Synthetic-Dataset/fintech-credit-scoring-seminar/data/processed/synthetic_digital_footprint_copula.csv
Successfully loaded ctgan from /home/frederickerleigh/Dokumente/Fintech Seminar/NewCode/FintechSeminar-Synthetic-Dataset/fintech-credit-scoring-seminar/data/processed{synthetic_digital_footprint_ctgan.csv
Home Credit: Target column standardized to 'TARGET'
Manual: Target column standardized to 'TARGET'
Copula: Target column standardized to 'TARGET'
CTGAN: Target column standardized to 'TARGET'

# %% [markdown]
# ## Validate Default Rates
#
# Calculate and display the default rate of each dataset. Before subsampling, the synthetic datasets sit well below the Home Credit default rate of ~8%.

In [30]:
def print_defaults(df, label):
    n = len(df)
    n_def = int(df['TARGET'].sum())
    rate = n_def / n
    print(f"{label}: {n} rows, {n_def} defaults ({rate:.2%})")

print_defaults(homecredit, "Home Credit")
print_defaults(manual, "Manual Synthetic")
print_defaults(copula, "Copula Synthetic")
print_defaults(ctgan, "CTGAN Synthetic")


Home Credit: 10000 rows, 807 defaults (8.07%)
Manual Synthetic: 100000 rows, 934 defaults (0.93%)
Copula Synthetic: 100000 rows, 964 defaults (0.96%)
CTGAN Synthetic: 100000 rows, 219 defaults (0.22%)
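
# The counts above determine which synthetic datasets can reach Home Credit's 807 defaults without replacement. A minimal sketch of that arithmetic, using only the numbers printed above (the dictionary and variable names are illustrative and not part of the notebook):

```python
# Counts copied from the printed output above.
target_defaults = 807                          # Home Credit defaults
target_nondefaults = 10_000 - target_defaults  # Home Credit non-defaults

available = {
    "Manual": {"rows": 100_000, "defaults": 934},
    "Copula": {"rows": 100_000, "defaults": 964},
    "CTGAN": {"rows": 100_000, "defaults": 219},
}

needs_replacement = {}
for name, counts in available.items():
    nondefaults = counts["rows"] - counts["defaults"]
    short_def = counts["defaults"] < target_defaults
    short_nondef = nondefaults < target_nondefaults
    needs_replacement[name] = short_def or short_nondef
    # Ratio of defaults required to defaults available (above 1 means rows must repeat).
    ratio = target_defaults / counts["defaults"]
    print(f"{name}: replacement needed={needs_replacement[name]}, required/available defaults={ratio:.2f}")
```

# A ratio above 1 means the dataset cannot supply enough distinct defaults, so the subsampling step has to draw some rows more than once.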


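# %% [markdown]
# ## Subsample Each Dataset to 10,000 Rows, Stratified by TARGET
#
# - Draws exactly as many defaults and non-defaults as the Home Credit sample contains, so each subsample has the Home Credit class proportion.
# - Samples with replacement when a synthetic dataset has too few rows of a class.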


In [31]:
from sklearn.utils import resample

def exact_stratified(df, n_def, n_nondef, label="dataset", target_col='TARGET', random_state=42):
    """Subsample df to exactly n_def defaults and n_nondef non-defaults."""
    df_def = df[df[target_col] == 1]
    df_nondef = df[df[target_col] == 0]
    # Sample with replacement only when a class has too few rows
    rep_def = n_def > len(df_def)
    rep_nondef = n_nondef > len(df_nondef)

    # Log replacement usage (label is passed in explicitly, not read from a global)
    if rep_def or rep_nondef:
        classes = [name for name, used in [("defaults", rep_def), ("non-defaults", rep_nondef)] if used]
        print(f"Sampling with replacement for {label}: {', '.join(classes)}")

    # Perform stratified sampling, then shuffle so the classes are interleaved
    sampled_def = resample(df_def, n_samples=n_def, replace=rep_def, random_state=random_state)
    sampled_nondef = resample(df_nondef, n_samples=n_nondef, replace=rep_nondef, random_state=random_state)
    result = pd.concat([sampled_def, sampled_nondef], axis=0).sample(frac=1, random_state=random_state).reset_index(drop=True)

    # Validate output
    n_def_out = int(result[target_col].sum())
    if len(result) != (n_def + n_nondef) or n_def_out != n_def:
        raise ValueError(f"Subsampling failed for {label}: Expected {n_def} defaults and {n_nondef} non-defaults, got {n_def_out} defaults and {len(result) - n_def_out} non-defaults")
    return result

# Subsample synthetic datasets
manual_matched = exact_stratified(manual, n_defaults, n_nondefaults, label="Manual", random_state=42)
copula_matched = exact_stratified(copula, n_defaults, n_nondefaults, label="Copula", random_state=42)
ctgan_matched = exact_stratified(ctgan, n_defaults, n_nondefaults, label="CTGAN", random_state=42)

# Validate subsampled datasets
print("\nSubsampled Dataset Validation:")
for label, df in [("Manual", manual_matched), ("Copula", copula_matched), ("CTGAN", ctgan_matched)]:
    print_defaults(df, f"{label} Subsampled")
    print(f"{label} Shape: {df.shape}, Columns: {list(df.columns)[:5]}...")


Sampling with replacement for CTGAN: defaults 

Subsampled Dataset Validation:
Manual Subsampled: 10000 rows, 807 defaults (8.07%)
Manual Shape: (10000, 18), Columns: ['age', 'order_amount', 'age_quintile', 'order_amount_quintile', 'credit_score_quintile']...
Copula Subsampled: 10000 rows, 807 defaults (8.07%)
Copula Shape: (10000, 18), Columns: ['credit_score_quintile', 'device_type', 'os', 'email_host', 'channel']...
CTGAN Subsampled: 10000 rows, 807 defaults (8.07%)
CTGAN Shape: (10000, 18), Columns: ['age', 'order_amount', 'age_quintile', 'order_amount_quintile', 'credit_score_quintile']...
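
# Because CTGAN offers only 219 defaults, its 807 sampled default rows necessarily contain repeats. The sketch below illustrates this on toy data. It assumes a hypothetical `row_id` column to track source rows and uses pandas `DataFrame.sample` as a stand-in for `sklearn.utils.resample`, so the exact rows drawn will differ from the notebook's.

```python
import pandas as pd

# Toy stand-in for the CTGAN defaults: 219 distinct rows (row_id is hypothetical).
toy_defaults = pd.DataFrame({"row_id": range(219), "TARGET": 1})

# Draw 807 rows with replacement, mirroring what happens when the pool is too small.
sampled = toy_defaults.sample(n=807, replace=True, random_state=42)

n_unique = sampled["row_id"].nunique()
n_repeated = len(sampled) - n_unique
print(f"{len(sampled)} sampled rows, {n_unique} distinct source rows, {n_repeated} repeats")
```

# At least 807 - 219 = 588 of the sampled rows are repeats, which is worth keeping in mind when comparing models trained on the CTGAN subsample.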


# %% [markdown]
# ## Extract Footprint Variables
#
# Keep only the specified footprint variables and the target column for synthetic datasets. Rename the target to 'DEFAULT_SYNTH' to avoid column name conflicts during merging.


In [36]:
# Define footprint variables
footprint_vars = [
    "device_type", "os", "email_host", "channel",
    "checkout_time", "name_in_email", "number_in_email", "is_lowercase", "email_error"
]

# Validate footprint variables exist in synthetic datasets
for label, df in [("Manual", manual_matched), ("Copula", copula_matched), ("CTGAN", ctgan_matched)]:
    missing_vars = [var for var in footprint_vars if var not in df.columns]
    if missing_vars:
        print(f"⚠️ WARNING: {label} missing footprint variables: {missing_vars}. Consider regenerating dataset.")

# Keep only footprint variables and TARGET
keep_cols = footprint_vars + ['TARGET']
manual_fp = manual_matched[keep_cols].copy()
copula_fp = copula_matched[keep_cols].copy()
ctgan_fp = ctgan_matched[keep_cols].copy()

# Rename TARGET to DEFAULT_SYNTH
manual_fp = manual_fp.rename(columns={'TARGET': 'DEFAULT_SYNTH'})
copula_fp = copula_fp.rename(columns={'TARGET': 'DEFAULT_SYNTH'})
ctgan_fp = ctgan_fp.rename(columns={'TARGET': 'DEFAULT_SYNTH'})

# Ensure no duplicate TARGET columns
for df, label in [(manual_fp, "Manual"), (copula_fp, "Copula"), (ctgan_fp, "CTGAN")]:
    if 'TARGET' in df.columns:
        df.drop(columns='TARGET', inplace=True)
    print(f"{label} Footprint Columns: {list(df.columns)}")




Manual Footprint Columns: ['device_type', 'os', 'email_host', 'channel', 'checkout_time', 'name_in_email', 'number_in_email', 'is_lowercase', 'email_error', 'DEFAULT_SYNTH']
Copula Footprint Columns: ['device_type', 'os', 'email_host', 'channel', 'checkout_time', 'name_in_email', 'number_in_email', 'is_lowercase', 'email_error', 'DEFAULT_SYNTH']
CTGAN Footprint Columns: ['device_type', 'os', 'email_host', 'channel', 'checkout_time', 'name_in_email', 'number_in_email', 'is_lowercase', 'email_error', 'DEFAULT_SYNTH']
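
# The selection and renaming step can also be wrapped in one function that fails early when a column is missing, instead of only printing a warning. This is a sketch on toy data; `select_footprint` is a hypothetical helper and not part of the notebook.

```python
import pandas as pd

def select_footprint(df, footprint_cols, target_col="TARGET", new_target="DEFAULT_SYNTH"):
    """Return only the footprint columns plus the renamed target (hypothetical helper)."""
    wanted = list(footprint_cols) + [target_col]
    missing = [c for c in wanted if c not in df.columns]
    if missing:
        # Stop here rather than failing later with a less specific error.
        raise KeyError(f"Missing columns: {missing}")
    return df[wanted].rename(columns={target_col: new_target})

# Toy frame with one extra column ("age") that should be dropped.
toy = pd.DataFrame({
    "device_type": ["mobile", "desktop"],
    "os": ["ios", "windows"],
    "age": [30, 41],
    "TARGET": [0, 1],
})
toy_fp = select_footprint(toy, ["device_type", "os"])
print(list(toy_fp.columns))
```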


# %% [markdown]
# ## Inspect Footprint Datasets
#
# Verify the columns in the footprint datasets to ensure correct renaming and column selection.


In [37]:
print("manual_fp", manual_fp.columns)
print("copula_fp", copula_fp.columns)
print("ctgan_fp", ctgan_fp.columns)

manual_fp Index(['device_type', 'os', 'email_host', 'channel', 'checkout_time',
       'name_in_email', 'number_in_email', 'is_lowercase', 'email_error',
       'DEFAULT_SYNTH'],
      dtype='object')
copula_fp Index(['device_type', 'os', 'email_host', 'channel', 'checkout_time',
       'name_in_email', 'number_in_email', 'is_lowercase', 'email_error',
       'DEFAULT_SYNTH'],
      dtype='object')
ctgan_fp Index(['device_type', 'os', 'email_host', 'channel', 'checkout_time',
       'name_in_email', 'number_in_email', 'is_lowercase', 'email_error',
       'DEFAULT_SYNTH'],
      dtype='object')


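# %% [markdown]
# ## Merge Home Credit with Synthetic Footprint Datasets
#
# Pair the full Home Credit dataset (all columns) with each synthetic footprint dataset. Both frames are sorted by their target column and concatenated side by side, so every row's TARGET equals its DEFAULT_SYNTH, while the pairing of individual rows within a class is arbitrary. Row counts and default counts are validated before merging.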

In [38]:
def merge_by_target(home, synth):
    """Pair Home Credit rows (all columns) with synthetic footprint rows positionally, after sorting both by target."""
    # Validate input shapes and target counts
    if len(home) != len(synth):
        raise ValueError(f"Row count mismatch: Home Credit ({len(home)}) vs Synthetic ({len(synth)})")
    if home['TARGET'].sum() != synth['DEFAULT_SYNTH'].sum():
        raise ValueError(f"Default count mismatch: Home Credit ({home['TARGET'].sum()}) vs Synthetic ({synth['DEFAULT_SYNTH'].sum()})")
    
    # Sort by target columns
    home_sorted = home.sort_values("TARGET").reset_index(drop=True)
    synth_sorted = synth.sort_values("DEFAULT_SYNTH").reset_index(drop=True)
    
    # Merge side by side
    merged = pd.concat([home_sorted, synth_sorted], axis=1, keys=['home', 'synth'])
    merged.columns = [f"{s}_{c}" for s, c in merged.columns]
    
    # Validate merged dataset
    if len(merged) != len(home):
        raise ValueError(f"Merged dataset row count ({len(merged)}) does not match input ({len(home)})")
    return merged

# Perform merges
merged_manual = merge_by_target(homecredit, manual_fp)
merged_copula = merge_by_target(homecredit, copula_fp)
merged_ctgan = merge_by_target(homecredit, ctgan_fp)

# Validate merged datasets
print("\nMerged Dataset Validation:")
for label, df in [("Manual", merged_manual), ("Copula", merged_copula), ("CTGAN", merged_ctgan)]:
    print(f"{label} Merged Shape: {df.shape}")
    print(f"{label} Merged Columns (first 5): {list(df.columns)[:5]}...")
    print(f"{label} Merged Default Count: {df['home_TARGET'].sum()}")


Merged Dataset Validation:
Manual Merged Shape: (10000, 44)
Manual Merged Columns (first 5): ['home_EXT_SOURCE_1', 'home_EXT_SOURCE_2', 'home_EXT_SOURCE_3', 'home_AMT_CREDIT', 'home_AMT_ANNUITY']...
Manual Merged Default Count: 807
Copula Merged Shape: (10000, 44)
Copula Merged Columns (first 5): ['home_EXT_SOURCE_1', 'home_EXT_SOURCE_2', 'home_EXT_SOURCE_3', 'home_AMT_CREDIT', 'home_AMT_ANNUITY']...
Copula Merged Default Count: 807
CTGAN Merged Shape: (10000, 44)
CTGAN Merged Columns (first 5): ['home_EXT_SOURCE_1', 'home_EXT_SOURCE_2', 'home_EXT_SOURCE_3', 'home_AMT_CREDIT', 'home_AMT_ANNUITY']...
CTGAN Merged Default Count: 807
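
# The validation above checks the default count but not that the two target columns agree row by row. A minimal sketch of that extra check on toy data follows. The toy frames are made up, and `add_prefix` is used here as a shorter stand-in for the `keys=` plus column-flattening approach in `merge_by_target`.

```python
import pandas as pd

# Toy frames with the same class counts (3 non-defaults, 2 defaults) in different orders.
home = pd.DataFrame({"TARGET": [0, 1, 0, 0, 1], "AMT_CREDIT": [10, 20, 30, 40, 50]})
synth = pd.DataFrame({"os": ["a", "b", "c", "d", "e"], "DEFAULT_SYNTH": [1, 0, 0, 1, 0]})

# Sort both by target so non-defaults come first, then defaults.
home_sorted = home.sort_values("TARGET", kind="stable").reset_index(drop=True)
synth_sorted = synth.sort_values("DEFAULT_SYNTH", kind="stable").reset_index(drop=True)

# Concatenate side by side with source prefixes on the column names.
merged = pd.concat(
    [home_sorted.add_prefix("home_"), synth_sorted.add_prefix("synth_")],
    axis=1,
)

# Row-level check: the two target columns must be identical.
labels_agree = bool((merged["home_TARGET"] == merged["synth_DEFAULT_SYNTH"]).all())
print(labels_agree)
```

# Applied to the real merged frames, the same comparison would confirm that no default row was paired with a non-default footprint.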


# %% [markdown]
# ## Save Subsampled and Merged Datasets
#
# Save all datasets to CSV files and verify successful writes by checking file existence and row counts.

In [39]:
# Define output directory (the "merged" subfolder of the processed-data directory)
output_path = "/home/frederickerleigh/Dokumente/Fintech Seminar/NewCode/FintechSeminar-Synthetic-Dataset/fintech-credit-scoring-seminar/data/processed/merged"
os.makedirs(output_path, exist_ok=True)  # create the folder if it does not exist yet
output_files = {
    "matched_manual_footprint.csv": manual_fp,
    "matched_copula_footprint.csv": copula_fp,
    "matched_ctgan_footprint.csv": ctgan_fp,
    "merged_homecredit_manual_fp.csv": merged_manual,
    "merged_homecredit_copula_fp.csv": merged_copula,
    "merged_homecredit_ctgan_fp.csv": merged_ctgan
}

# Save datasets and validate
for file_name, df in output_files.items():
    full_path = os.path.join(output_path, file_name)
    try:
        df.to_csv(full_path, index=False)
        if os.path.exists(full_path):
            saved_df = pd.read_csv(full_path)
            print(f"Saved {file_name}: {len(saved_df)} rows, {saved_df.shape[1]} columns")
        else:
            raise FileNotFoundError(f"Failed to save {file_name}")
    except Exception as e:
        print(f"Error saving {file_name}: {e}")
        raise

Saved matched_manual_footprint.csv: 10000 rows, 10 columns
Saved matched_copula_footprint.csv: 10000 rows, 10 columns
Saved matched_ctgan_footprint.csv: 10000 rows, 10 columns
Saved merged_homecredit_manual_fp.csv: 10000 rows, 44 columns
Saved merged_homecredit_copula_fp.csv: 10000 rows, 44 columns
Saved merged_homecredit_ctgan_fp.csv: 10000 rows, 44 columns
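
# The write-then-reload check can be tried in isolation with a temporary directory. The sketch below uses a made-up two-row frame and a hypothetical file name, and compares shape, column names, and default count after the round trip.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"os": ["ios", "android"], "DEFAULT_SYNTH": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "merged")
    os.makedirs(out_dir, exist_ok=True)            # create the output folder if needed
    path = os.path.join(out_dir, "toy_footprint.csv")
    toy.to_csv(path, index=False)                  # write without the index column
    reloaded = pd.read_csv(path)                   # read back while the file still exists

# Compare what was written with what was read back.
same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
same_defaults = int(reloaded["DEFAULT_SYNTH"].sum()) == int(toy["DEFAULT_SYNTH"].sum())
print(same_shape, same_columns, same_defaults)
```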


# %% [markdown]
# ## Final Summary
#
# All datasets have been subsampled, processed, merged, and saved successfully. Key details:
# - **Subsampled Datasets**: `matched_manual_footprint.csv`, `matched_copula_footprint.csv`, `matched_ctgan_footprint.csv` contain only footprint variables and `DEFAULT_SYNTH`, with the same number of defaults (807) and non-defaults (9,193) as Home Credit.
# - **Merged Datasets**: `merged_homecredit_manual_fp.csv`, `merged_homecredit_copula_fp.csv`, `merged_homecredit_ctgan_fp.csv` combine all Home Credit columns with synthetic footprint variables, paired row by row so that TARGET and DEFAULT_SYNTH agree.
# - **Default Rate**: All datasets match the Home Credit default rate of 8.07%.
# - **Columns**: Merged datasets have 44 columns: all Home Credit columns plus 9 footprint variables and DEFAULT_SYNTH. Merged column names carry a `home_` or `synth_` prefix.
# - **Replacement**: The CTGAN subsample draws its 807 defaults with replacement from only 219 available rows, so it contains repeated rows.
# - **Readiness**: All datasets are ready for modeling with consistent shapes, column names, and default proportions.
#
# The datasets are saved in the `data/processed/merged` directory (`output_path`) and can be used for downstream modeling tasks.