# 05: Finalize and Clean Merged Datasets

This notebook standardizes the target column in each merged dataset (manual, copula, CTGAN) so that each one contains exactly one target column, named `TARGET`. After this step, all three datasets are ready for encoding and modeling.


In [8]:
import pandas as pd


In [9]:
# Adjust file paths as needed
merged_manual = pd.read_csv("/home/frederickerleigh/Dokumente/Fintech Seminar/NewCode/FintechSeminar-Synthetic-Dataset/fintech-credit-scoring-seminar/data/processed/merged/merged_homecredit_manual_fp.csv")
merged_copula = pd.read_csv("/home/frederickerleigh/Dokumente/Fintech Seminar/NewCode/FintechSeminar-Synthetic-Dataset/fintech-credit-scoring-seminar/data/processed/merged/merged_homecredit_copula_fp.csv")
merged_ctgan = pd.read_csv("/home/frederickerleigh/Dokumente/Fintech Seminar/NewCode/FintechSeminar-Synthetic-Dataset/fintech-credit-scoring-seminar/data/processed/merged/merged_homecredit_ctgan_fp.csv")
print("Loaded merged datasets.")


Loaded merged datasets.


## Inspect Columns Before Cleaning

In [10]:
print("Manual columns:", merged_manual.columns.tolist())
print("Copula columns:", merged_copula.columns.tolist())
print("CTGAN columns:", merged_ctgan.columns.tolist())


Manual columns: ['home_EXT_SOURCE_1', 'home_EXT_SOURCE_2', 'home_EXT_SOURCE_3', 'home_AMT_CREDIT', 'home_AMT_ANNUITY', 'home_AMT_GOODS_PRICE', 'home_DAYS_BIRTH', 'home_CODE_GENDER', 'home_CNT_CHILDREN', 'home_CNT_FAM_MEMBERS', 'home_NAME_EDUCATION_TYPE', 'home_NAME_FAMILY_STATUS', 'home_NAME_HOUSING_TYPE', 'home_NAME_INCOME_TYPE', 'home_AMT_INCOME_TOTAL', 'home_REGION_POPULATION_RELATIVE', 'home_REGION_RATING_CLIENT', 'home_REGION_RATING_CLIENT_W_CITY', 'home_DAYS_EMPLOYED', 'home_DAYS_REGISTRATION', 'home_DAYS_ID_PUBLISH', 'home_FLAG_MOBIL', 'home_FLAG_EMP_PHONE', 'home_FLAG_WORK_PHONE', 'home_FLAG_EMAIL', 'home_WEEKDAY_APPR_PROCESS_START', 'home_HOUR_APPR_PROCESS_START', 'home_REG_REGION_NOT_LIVE_REGION', 'home_REG_REGION_NOT_WORK_REGION', 'home_LIVE_REGION_NOT_WORK_REGION', 'home_REG_CITY_NOT_LIVE_CITY', 'home_REG_CITY_NOT_WORK_CITY', 'home_LIVE_CITY_NOT_WORK_CITY', 'home_TARGET', 'synth_device_type', 'synth_os', 'synth_email_host', 'synth_channel', 'synth_checkout_time', 'synth_nam
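
With column lists this long, filtering for target-like names can be easier than reading the full printout. The following is a minimal sketch on a toy frame, not part of the original notebook; the helper name `find_target_candidates`, the keyword list, and the toy values are assumptions made for illustration.

```python
import pandas as pd

def find_target_candidates(df, keywords=("TARGET", "DEFAULT")):
    """Return column names that contain any keyword (case-insensitive)."""
    return [c for c in df.columns if any(k in c.upper() for k in keywords)]

# Toy frame with a few column names in the style of the merged datasets
toy = pd.DataFrame({
    "home_AMT_CREDIT": [1000.0, 2500.0],
    "home_TARGET": [0, 1],
    "synth_device_type": ["mobile", "desktop"],
    "synth_DEFAULT_SYNTH": [0, 1],
})

candidates = find_target_candidates(toy)
print(candidates)  # ['home_TARGET', 'synth_DEFAULT_SYNTH']
```

Applied to `merged_manual`, `merged_copula`, and `merged_ctgan`, a filter like this would show which columns the cleanup step has to deal with.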

## Clean Up: Keep Only One TARGET Column

In [11]:
# This cell:
# - drops duplicate target columns (e.g., synth_DEFAULT_SYNTH)
# - renames 'home_TARGET' to 'TARGET' if needed
# - ensures 'TARGET' is the only target column left

def standardize_target(df):
    # Drop duplicate target columns: any column whose name contains 'DEFAULT',
    # except 'home_TARGET' or 'TARGET'. Matching on 'SYNTH' as well would also
    # drop every 'synth_*' feature column, so only 'DEFAULT' is matched here.
    drop_cols = [col for col in df.columns if (
        'DEFAULT' in col.upper()
        and col not in ['home_TARGET', 'TARGET'])]
    df = df.drop(columns=drop_cols)
    # Rename 'home_TARGET' to 'TARGET' if present
    if 'home_TARGET' in df.columns:
        df = df.rename(columns={'home_TARGET': 'TARGET'})
    # (Optional) Move TARGET to the end
    if 'TARGET' in df.columns:
        cols = [c for c in df.columns if c != 'TARGET'] + ['TARGET']
        df = df[cols]
    return df

manual_ready = standardize_target(merged_manual)
copula_ready = standardize_target(merged_copula)
ctgan_ready = standardize_target(merged_ctgan)
print("Target columns standardized.")


Target columns standardized.
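
To see what the cleanup is meant to do, here is a minimal sketch on a toy frame. The helper name `standardize_target_sketch` and the toy values are assumptions. The sketch matches only on 'DEFAULT', because a substring match on 'SYNTH' would also catch every `synth_`-prefixed feature column.

```python
import pandas as pd

def standardize_target_sketch(df):
    """Drop duplicate target columns, rename home_TARGET, move TARGET last."""
    # Assumption: duplicate targets are the columns whose name contains 'DEFAULT'
    drop_cols = [c for c in df.columns
                 if "DEFAULT" in c.upper() and c not in ("home_TARGET", "TARGET")]
    df = df.drop(columns=drop_cols)
    # rename() ignores keys that are absent, so no existence check is needed
    df = df.rename(columns={"home_TARGET": "TARGET"})
    if "TARGET" in df.columns:
        df = df[[c for c in df.columns if c != "TARGET"] + ["TARGET"]]
    return df

toy = pd.DataFrame({
    "home_AMT_CREDIT": [1000.0, 2500.0],
    "home_TARGET": [0, 1],
    "synth_device_type": ["mobile", "desktop"],
    "synth_DEFAULT_SYNTH": [0, 1],
})

cleaned = standardize_target_sketch(toy)
print(cleaned.columns.tolist())
# ['home_AMT_CREDIT', 'synth_device_type', 'TARGET']
```

The synthetic feature `synth_device_type` survives, the duplicate target is gone, and `TARGET` sits in the last position.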


## Quick Sanity Check: Does TARGET Look Correct?

In [12]:
print("Manual default rate:", manual_ready['TARGET'].mean())
print("Copula default rate:", copula_ready['TARGET'].mean())
print("CTGAN default rate:", ctgan_ready['TARGET'].mean())


Manual default rate: 0.0807
Copula default rate: 0.0807
CTGAN default rate: 0.0807
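
A mean alone does not reveal missing or non-binary values, since `Series.mean()` skips NaN by default. As a possible complement to the check above, the sketch below validates a target column more strictly. The helper name `check_target` and the toy frames are assumptions, not part of the original notebook.

```python
import pandas as pd

def check_target(df, name="TARGET"):
    """Return a list of problems found in the target column (empty if none)."""
    if name not in df.columns:
        return [f"missing column {name}"]
    problems = []
    if df[name].isna().any():
        problems.append(f"{name} contains missing values")
    values = set(df[name].dropna().unique())
    if not values <= {0, 1}:
        problems.append(f"{name} has non-binary values: {sorted(values)}")
    return problems

good = pd.DataFrame({"TARGET": [0, 1, 0, 0]})
bad = pd.DataFrame({"TARGET": [0, 2, None]})

good_problems = check_target(good)
bad_problems = check_target(bad)
print(good_problems)  # []
print(len(bad_problems))  # 2
```

Running the same helper on `manual_ready`, `copula_ready`, and `ctgan_ready` should return an empty list for each if the cleanup worked as intended.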


## Save Cleaned Merged Datasets

In [13]:
manual_ready.to_csv("/home/frederickerleigh/Dokumente/Fintech Seminar/NewCode/FintechSeminar-Synthetic-Dataset/fintech-credit-scoring-seminar/data/processed/merged/ready/merged_manual_ready.csv", index=False)
copula_ready.to_csv("/home/frederickerleigh/Dokumente/Fintech Seminar/NewCode/FintechSeminar-Synthetic-Dataset/fintech-credit-scoring-seminar/data/processed/merged/ready/merged_copula_ready.csv", index=False)
ctgan_ready.to_csv("/home/frederickerleigh/Dokumente/Fintech Seminar/NewCode/FintechSeminar-Synthetic-Dataset/fintech-credit-scoring-seminar/data/processed/merged/ready/merged_ctgan_ready.csv", index=False)
print("Cleaned merged datasets saved.")


Cleaned merged datasets saved.
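
`to_csv` does not create missing parent directories, so the `ready/` folder has to exist before saving. The sketch below shows one way to create the folder and confirm that a saved file reloads with the same columns. It writes to a temporary directory, and the file name `toy_ready.csv` and the toy frame are assumptions used only for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({"home_AMT_CREDIT": [1000.0, 2500.0], "TARGET": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    # Create the output folder first; to_csv will not create it
    out_dir = Path(tmp) / "ready"
    out_dir.mkdir(parents=True, exist_ok=True)

    out_path = out_dir / "toy_ready.csv"
    toy.to_csv(out_path, index=False)

    # Reload and compare columns and shape with the frame in memory
    reloaded = pd.read_csv(out_path)

print(reloaded.columns.tolist() == toy.columns.tolist())
print(reloaded.shape == toy.shape)
```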


# 05: Cleaning Merged Datasets — Remove Extra Target Columns

This notebook ensures all merged datasets contain only one correct target column (`TARGET` or `home_TARGET`) and removes unnecessary duplicates (such as `synth_DEFAULT_SYNTH`). This is necessary before encoding, modeling, and final analysis.


In [None]:
# Load merged datasets
import pandas as pd

merged_manual = pd.read_csv("/home/frederickerleigh/Dokumente/Fintech Seminar/NewCode/FintechSeminar-Synthetic-Dataset/fintech-credit-scoring-seminar/data/processed/merged/merged_homecredit_manual_fp.csv")
merged_copula = pd.read_csv("/home/frederickerleigh/Dokumente/Fintech Seminar/NewCode/FintechSeminar-Synthetic-Dataset/fintech-credit-scoring-seminar/data/processed/merged/merged_homecredit_copula_fp.csv")
merged_ctgan = pd.read_csv("/home/frederickerleigh/Dokumente/Fintech Seminar/NewCode/FintechSeminar-Synthetic-Dataset/fintech-credit-scoring-seminar/data/processed/merged/merged_homecredit_ctgan_fp.csv")

print("Shapes:")
print("Manual:", merged_manual.shape)
print("Copula:", merged_copula.shape)
print("CTGAN:", merged_ctgan.shape)
