# 1. Data Loading and Initial Setup

1.  Load and clean the raw data by removing duplicates and imputing missing values.
2.  Create a single, hold-out test set.
3.  From the main training set, generate multiple training datasets with varying **imbalance ratios (IR)** by undersampling the minority class (for the 1:1 case, the majority class is undersampled instead).
4.  For each IR, create **multiple repetitions** with different random samples of the minority class.
5.  For each IR and repetition, create a size-matched **control dataset** with the original class ratio.
6.  Preprocess and save all generated datasets.

# Dataset Configuration

In [None]:
DATASET_NAME = "ilpd"

print(f"Dataset: {DATASET_NAME}")

Dataset: ilpd


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample
from pathlib import Path
import sys

sys.path.append(str(Path("../../../../").resolve()))
from config.config import get_config

config = get_config()

RAW_PATH = Path("../../../../data/raw/Indian_Liver_Patient_Dataset_ILPD.csv")
PROCESSED_PATH = Path(f"../../../../data/processed/{DATASET_NAME}/")

TARGET_FEATURE = "Selector"
CLASS_LIVER_DISEASE = 1  # Majority class
CLASS_NO_DISEASE = 2     # Minority class

RANDOM_STATE = config.experiment.random_state
IMBALANCE_RATIOS = config.experiment.imbalance_ratios
N_REPETITIONS = config.experiment.n_repetitions

RAW_PATH.parent.mkdir(parents=True, exist_ok=True)
PROCESSED_PATH.mkdir(parents=True, exist_ok=True)

print("Configuration:")
print(f"  • Dataset: {DATASET_NAME}")
print(f"  • Raw data path: {RAW_PATH}")
print(f"  • Processed data path: {PROCESSED_PATH}")
print(f"  • Target feature: {TARGET_FEATURE}")
print("\nLoading the raw dataset...")

# The raw CSV has no header row, so the column names are supplied explicitly
column_names = ['Age', 'Gender', 'TB', 'DB', 'Alkphos', 'Sgpt', 'Sgot', 'TP', 'ALB', 'A/G_Ratio', 'Selector']
df_raw = pd.read_csv(RAW_PATH, names=column_names)

print(f"Raw dataset loaded. Shape: {df_raw.shape}")

Configuration:
  • Dataset: ilpd
  • Raw data path: ../../../../data/raw/Indian_Liver_Patient_Dataset_ILPD.csv
  • Processed data path: ../../../../data/processed/ilpd
  • Target feature: Selector

Loading the raw dataset...
Raw dataset loaded. Shape: (583, 11)


# 2. Data Cleaning: Remove Duplicates and Handle Missing Values

As determined in the EDA:
- Remove 13 duplicate rows (2.23%)
- Impute 4 missing values in the `A/G_Ratio` column with the column median

The cleaning cell also encodes `Gender` as binary (Male=1, Female=0).
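The duplicate and missing-value checks behind these numbers can be sketched on a toy frame. The values below are made up for illustration and are not taken from the ILPD data. Note that the median is computed after duplicates are dropped, which matches the order used in the cleaning cell.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw data (illustrative values only).
toy = pd.DataFrame({
    "Age": [65, 65, 40, 52, 33],
    "A/G_Ratio": [0.9, 0.9, np.nan, 1.1, 0.7],
})

# Count exact duplicate rows and their share of the dataset.
n_duplicates = int(toy.duplicated().sum())
pct_duplicates = 100 * n_duplicates / len(toy)

# Count missing values in the column that will be imputed.
n_missing = int(toy["A/G_Ratio"].isnull().sum())

# Same two steps as the notebook: drop duplicates, then median-impute.
cleaned = toy.drop_duplicates().copy()
median_ag = cleaned["A/G_Ratio"].median()
cleaned["A/G_Ratio"] = cleaned["A/G_Ratio"].fillna(median_ag)

print(f"{n_duplicates} duplicate rows ({pct_duplicates:.2f}%), {n_missing} missing values")
print(cleaned)
```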

In [3]:
print("Starting data cleaning...")
df_cleaned = df_raw.copy()

# 1. Remove duplicate rows
initial_shape = df_cleaned.shape[0]
df_cleaned = df_cleaned.drop_duplicates()
duplicates_removed = initial_shape - df_cleaned.shape[0]
print(f"Removed {duplicates_removed} duplicate rows")

# 2. Impute missing values in A/G_Ratio with median
missing_before = df_cleaned['A/G_Ratio'].isnull().sum()
df_cleaned['A/G_Ratio'] = df_cleaned['A/G_Ratio'].fillna(df_cleaned['A/G_Ratio'].median())
print(f"Imputed {missing_before} missing values in A/G_Ratio with median")

# 3. Encode Gender as binary (Male=1, Female=0)
df_cleaned['Gender'] = df_cleaned['Gender'].map({'Male': 1, 'Female': 0})
print("Encoded Gender as binary (Male=1, Female=0)")

print(f"\nData cleaned. Shape: {df_cleaned.shape}")
print("Missing values after cleaning:", df_cleaned.isnull().sum().sum())

Starting data cleaning...
Removed 13 duplicate rows
Imputed 4 missing values in A/G_Ratio with median
Encoded Gender as binary (Male=1, Female=0)

Data cleaned. Shape: (570, 11)
Missing values after cleaning: 0


# 3. Confirming Majority and Minority Classes

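For our imbalance experiments, we need to clearly define the majority and minority classes. The counts below are for the cleaned data (after duplicate removal); the raw data contained 416 and 167 patients, a ratio of 2.49:1.
- **Class 1 (Majority):** Liver Disease (406 patients, 71.23%)
- **Class 2 (Minority):** No Disease (164 patients, 28.77%)
- **Natural Imbalance Ratio:** 2.48:1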
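The ratio and class shares follow directly from the class counts. A minimal sketch, using the raw counts (416 and 167) and the post-cleaning counts printed by the notebook (406 and 164); the helper names are illustrative and not part of the project code.

```python
def imbalance_ratio(n_majority: int, n_minority: int) -> float:
    """Return the majority-to-minority ratio (the X in 'X:1')."""
    if n_minority <= 0:
        raise ValueError("n_minority must be positive")
    return n_majority / n_minority


def class_shares(n_majority: int, n_minority: int) -> tuple:
    """Return (majority %, minority %) of the total sample count."""
    total = n_majority + n_minority
    return 100 * n_majority / total, 100 * n_minority / total


raw_ir = imbalance_ratio(416, 167)        # before duplicate removal
cleaned_ir = imbalance_ratio(406, 164)    # after duplicate removal
cleaned_shares = class_shares(406, 164)

print(f"Raw IR: {raw_ir:.2f}:1")
print(f"Cleaned IR: {cleaned_ir:.2f}:1")
print(f"Cleaned shares: {cleaned_shares[0]:.2f}% / {cleaned_shares[1]:.2f}%")
```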

In [4]:
print("Target variable distribution:")
class_counts = df_cleaned[TARGET_FEATURE].value_counts()
print(class_counts.sort_index())

# Ratio of majority to minority class counts
natural_ir = class_counts[CLASS_LIVER_DISEASE] / class_counts[CLASS_NO_DISEASE]
print(f"\nNatural imbalance ratio: {natural_ir:.2f}:1")

Target variable distribution:
Selector
1    406
2    164
Name: count, dtype: int64

Natural imbalance ratio: 2.48:1


# 4. Create a Hold-Out Test Set

We perform a one-time stratified split to create a final test set. All experimental datasets will be generated from the `train_full_df`.
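As a rough illustration of what stratification buys us, the sketch below splits synthetic labels with the same class counts as the cleaned data (406 vs. 164) and compares the class ratio in each part. The placeholder feature, the seed, and the integer `test_size=114` (the hold-out size shown in the notebook output) are assumptions of this sketch; the notebook itself reads `test_size` from the config.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the cleaned data.
labels = pd.Series([1] * 406 + [2] * 164, name="Selector")
features = pd.DataFrame({"x": range(len(labels))})  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    features,
    labels,
    test_size=114,     # assumed: hold-out size seen in the notebook output
    random_state=0,    # arbitrary seed for this sketch
    stratify=labels,   # keep the class ratio in both parts
)

overall_ratio = 406 / 164
train_ratio = (y_tr == 1).sum() / (y_tr == 2).sum()
test_ratio = (y_te == 1).sum() / (y_te == 2).sum()

print(f"Overall ratio: {overall_ratio:.2f}:1")
print(f"Train ratio:   {train_ratio:.2f}:1 ({len(y_tr)} rows)")
print(f"Test ratio:    {test_ratio:.2f}:1 ({len(y_te)} rows)")
```

Both parts stay close to the natural ratio, so the hold-out set reflects the original class distribution regardless of how the training data is later resampled.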

In [5]:
X = df_cleaned.drop(TARGET_FEATURE, axis=1)
y = df_cleaned[[TARGET_FEATURE]]

X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y,
    test_size=config.experiment.test_size,
    random_state=RANDOM_STATE,
    stratify=y
)

train_full_df = pd.concat([X_train_full, y_train_full], axis=1)

print(f"Full training set shape: {train_full_df.shape}")
print(f"Hold-out test set shape: {X_test.shape}")

# Show class distribution in training set
print(f"\nTraining set class distribution:")
print(train_full_df[TARGET_FEATURE].value_counts().sort_index())

Full training set shape: (456, 11)
Hold-out test set shape: (114, 10)

Training set class distribution:
Selector
1    325
2    131
Name: count, dtype: int64


# 5. Generate Imbalanced and Control Datasets with Multiple Repetitions

For each IR, we create multiple repetitions by sampling different subsets of the minority class. This lets us test whether methods work reliably across different minority class samples.

1.  Start with the **full majority class** ('Liver Disease').
2.  **Undersample the minority class** ('No Disease') to achieve the desired Imbalance Ratio (IR). For IR = 1, the majority class is instead undersampled to match the full minority class.
3.  **Repeat this sampling `N_REPETITIONS` times** with a different random seed for each IR and repetition.
4.  Create a size-matched **control dataset** for each IR and repetition by stratified sampling from the full training set.
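The per-repetition seed used in the generation cell is `RANDOM_STATE + ir * 1000 + rep_id`. A small sketch of why this yields a distinct seed for every (IR, repetition) pair, assuming integer IRs and fewer than 1000 repetitions; the base seed and the IR grid below are hypothetical stand-ins for the values read from the config.

```python
from itertools import product

RANDOM_STATE = 42                  # assumed base seed (the notebook reads it from config)
IMBALANCE_RATIOS = [1, 3, 5, 10]   # hypothetical integer IR grid
N_REPETITIONS = 10                 # matches the repetition count in the notebook output


def repetition_seed(base_seed: int, ir: int, rep_id: int) -> int:
    """Same formula as the notebook: offset the base seed by IR and repetition."""
    return base_seed + ir * 1000 + rep_id


seeds = {
    (ir, rep_id): repetition_seed(RANDOM_STATE, ir, rep_id)
    for ir, rep_id in product(IMBALANCE_RATIOS, range(1, N_REPETITIONS + 1))
}

# Every (IR, repetition) pair should map to its own seed.
all_unique = len(set(seeds.values())) == len(seeds)
print(f"{len(seeds)} seeds generated, all unique: {all_unique}")
```

Because the same seed drives both the undersampling and the control split for a given pair, each imbalanced dataset and its control can be regenerated independently of the others.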

  * **Methodology:** "To generate datasets with varying degrees of class imbalance, the majority class was held constant while the minority class was progressively undersampled to achieve imbalance ratios from 1:1 to 100:1. It should be noted that this approach intrinsically links a higher imbalance ratio with a smaller number of minority class samples."
  * **Discussion:** When interpreting the results, we cannot claim that the degradation in synthetic data quality is due *only* to the imbalance ratio, because a higher IR also means fewer minority class samples.
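The link between IR and minority sample count can be made concrete with the training-set sizes printed by the notebook (325 majority, 131 minority). The sketch below mirrors the sizing rules of the generation cell; the IR grid is hypothetical, since the actual values come from the config.

```python
N_MAJORITY = 325   # from the notebook output
N_MINORITY = 131   # from the notebook output
IMBALANCE_RATIOS = [1, 2, 3, 5, 10, 20, 50, 100]  # hypothetical grid


def class_sizes(ir, n_majority, n_minority):
    """Return (majority used, minority used), or None if the IR is infeasible."""
    if ir == 1:
        # 1:1 case: the majority is undersampled to match the minority.
        return n_minority, n_minority
    n_min_needed = int(n_majority / ir)
    if n_min_needed > n_minority or n_min_needed < 1:
        return None
    return n_majority, n_min_needed


for ir in IMBALANCE_RATIOS:
    sizes = class_sizes(ir, N_MAJORITY, N_MINORITY)
    if sizes is None:
        print(f"IR {ir}:1 -> skipped (not achievable with the available samples)")
        continue
    n_maj, n_min = sizes
    print(f"IR {ir}:1 -> {n_maj} majority, {n_min} minority, "
          f"effective ratio {n_maj / n_min:.2f}:1")
```

Under these assumptions, an IR of 100 leaves only 3 minority samples, so dataset size and minority count shrink together with rising IR. The size-matched control datasets are what allow the two effects to be compared.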

In [6]:
liver_disease_df = train_full_df[train_full_df[TARGET_FEATURE] == CLASS_LIVER_DISEASE]
no_disease_df = train_full_df[train_full_df[TARGET_FEATURE] == CLASS_NO_DISEASE]
n_minority_available = len(no_disease_df)
n_majority_available = len(liver_disease_df)

print(f"\nFull training set composition: {n_majority_available} majority (Liver Disease), {n_minority_available} minority (No Disease).")
print(f"\nGenerating datasets with {N_REPETITIONS} repetitions per imbalance ratio...")

generated_datasets = {}

for ir in IMBALANCE_RATIOS:
    print("\n")
    print(f"Processing Imbalance Ratio (IR) = {ir}:1")
    print("\n")
    
    for rep_id in range(1, N_REPETITIONS + 1):
        print(f"\n  Repetition {rep_id}/{N_REPETITIONS}")
        
        # Derive a distinct, reproducible seed for each (IR, repetition) pair
        rep_seed = RANDOM_STATE + (ir * 1000) + rep_id
        
        if ir == 1:
            # For 1:1 ratio, undersample majority to match minority
            majority_undersampled = resample(
                liver_disease_df,
                replace=False,
                n_samples=n_minority_available, 
                random_state=rep_seed 
            )
            imbalanced_df = pd.concat([majority_undersampled, no_disease_df])
            
        else:
            # Keep all majority class samples
            majority_full_set = liver_disease_df
            
            # Calculate required minority samples for desired IR
            n_minority_imbalanced = int(n_majority_available / ir)

            if n_minority_imbalanced > n_minority_available:
                print(f"    SKIPPING: Cannot create {ir}:1 ratio as it requires more minority samples than available.")
                continue
            if n_minority_imbalanced < 1:
                print(f"    SKIPPING: Ratio {ir}:1 results in zero minority samples.")
                continue

            # Undersample minority class
            minority_undersampled = resample(
                no_disease_df,
                replace=False,
                n_samples=n_minority_imbalanced,
                random_state=rep_seed 
            )

            imbalanced_df = pd.concat([majority_full_set, minority_undersampled])

        total_size = len(imbalanced_df)
        
        dataset_key = f'imbalanced_ir_{ir}_rep{rep_id}'
        generated_datasets[dataset_key] = imbalanced_df
        
        n_maj = len(imbalanced_df[imbalanced_df[TARGET_FEATURE] == CLASS_LIVER_DISEASE])
        n_min = len(imbalanced_df[imbalanced_df[TARGET_FEATURE] == CLASS_NO_DISEASE])
        print(f"    Imbalanced set created: {total_size} samples ({n_maj} majority, {n_min} minority)")

        # Create size-matched control dataset with original class ratio
        if total_size >= len(train_full_df):
            control_df = train_full_df.copy()
        else:
            control_df, _ = train_test_split(
                train_full_df,
                train_size=total_size,
                random_state=rep_seed,  
                stratify=train_full_df[TARGET_FEATURE]
            )
        
        control_key = f'control_ir_{ir}_rep{rep_id}'
        generated_datasets[control_key] = control_df
        print(f"    Control set created:      {len(control_df)} samples (original class ratio)")

print("\nDataset generation complete!")
print(f"Total datasets created: {len(generated_datasets)}")
print(f"  - Imbalanced: {sum(k.startswith('imbalanced') for k in generated_datasets)}")
print(f"  - Control: {sum(k.startswith('control') for k in generated_datasets)}")


Full training set composition: 325 majority (Liver Disease), 131 minority (No Disease).

Generating datasets with 10 repetitions per imbalance ratio...


Processing Imbalance Ratio (IR) = 1:1



  Repetition 1/10
    Imbalanced set created: 262 samples (131 majority, 131 minority)
    Control set created:      262 samples (original class ratio)

  Repetition 2/10
    Imbalanced set created: 262 samples (131 majority, 131 minority)
    Control set created:      262 samples (original class ratio)

  Repetition 3/10
    Imbalanced set created: 262 samples (131 majority, 131 minority)
    Control set created:      262 samples (original class ratio)

  Repetition 4/10
    Imbalanced set created: 262 samples (131 majority, 131 minority)
    Control set created:      262 samples (original class ratio)

  Repetition 5/10
    Imbalanced set created: 262 samples (131 majority, 131 minority)
    Control set created:      262 samples (original class ratio)

  Repetition 6/10
    Imbalanced set cr

# 6. Preprocessing and Saving All Datasets

We fit the scaler **once** on the full training data. Then, we transform all generated training sets and the hold-out test set using this single, consistent scaler.
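A minimal sketch of the difference between a single shared scaler and refitting per dataset, using synthetic data (the array shapes, seed, and distribution are arbitrary choices for illustration). With the shared scaler, a given raw value maps to the same scaled value in every file; refitting on a subset would shift the scale from one dataset to the next.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
toy_train = rng.normal(loc=50.0, scale=10.0, size=(200, 2))  # synthetic "full training set"
toy_subset = toy_train[:20]                                  # stands in for one generated dataset

# Fit once on the full training data, as the notebook does.
shared_scaler = StandardScaler().fit(toy_train)
full_scaled = shared_scaler.transform(toy_train)
subset_scaled = shared_scaler.transform(toy_subset)

# For contrast only: refitting on the subset (NOT what the notebook does).
subset_refit = StandardScaler().fit_transform(toy_subset)

print("Full set mean after shared scaling:", full_scaled.mean(axis=0).round(6))
print("Subset mean with shared scaler:    ", subset_scaled.mean(axis=0).round(3))
print("Subset mean when refit on subset:  ", subset_refit.mean(axis=0).round(6))
```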

In [None]:
FEATURES_TO_SCALE = ['Age', 'Gender', 'TB', 'DB', 'Alkphos', 'Sgpt', 'Sgot', 'TP', 'ALB', 'A/G_Ratio']

scaler = StandardScaler()
scaler.fit(X_train_full[FEATURES_TO_SCALE])

print("Scaling and saving datasets...\n")

for name, df in generated_datasets.items():
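    # Separate features and target for this dataset
    X_temp = df.drop(columns=[TARGET_FEATURE])
    y_temp = df[[TARGET_FEATURE]]

    # Apply the scaler fitted on the full training set (no refitting per dataset)
    X_processed = scaler.transform(X_temp[FEATURES_TO_SCALE])
    X_processed_df = pd.DataFrame(X_processed, columns=FEATURES_TO_SCALE)

    # Reset indices so features and target align row by row
    final_df = X_processed_df.reset_index(drop=True)
    final_df[TARGET_FEATURE] = y_temp[TARGET_FEATURE].reset_index(drop=True)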
    
    save_path = PROCESSED_PATH / f"train_{name}.csv"
    final_df.to_csv(save_path, index=False)
    print(f"Saved: {save_path.name}")

# Transform the hold-out test set with the same training-fitted scaler
X_test_processed = scaler.transform(X_test[FEATURES_TO_SCALE])
X_test_processed_df = pd.DataFrame(X_test_processed, columns=FEATURES_TO_SCALE)

test_df = X_test_processed_df.reset_index(drop=True)
test_df[TARGET_FEATURE] = y_test[TARGET_FEATURE].reset_index(drop=True)
test_df.to_csv(PROCESSED_PATH / "test.csv", index=False)

print("\nSaved test set: test.csv")
print(f"Total training files: {len(generated_datasets)}")

Scaling and saving datasets...

Saved: train_imbalanced_ir_1_rep1.csv
Saved: train_control_ir_1_rep1.csv
Saved: train_imbalanced_ir_1_rep2.csv
Saved: train_control_ir_1_rep2.csv
Saved: train_imbalanced_ir_1_rep3.csv
Saved: train_control_ir_1_rep3.csv
Saved: train_imbalanced_ir_1_rep4.csv
Saved: train_control_ir_1_rep4.csv
Saved: train_imbalanced_ir_1_rep5.csv
Saved: train_control_ir_1_rep5.csv
Saved: train_imbalanced_ir_1_rep6.csv
Saved: train_control_ir_1_rep6.csv
Saved: train_imbalanced_ir_1_rep7.csv
Saved: train_control_ir_1_rep7.csv
Saved: train_imbalanced_ir_1_rep8.csv
Saved: train_control_ir_1_rep8.csv
Saved: train_imbalanced_ir_1_rep9.csv
Saved: train_control_ir_1_rep9.csv
Saved: train_imbalanced_ir_1_rep10.csv
Saved: train_control_ir_1_rep10.csv
Saved: train_imbalanced_ir_3_rep1.csv
Saved: train_control_ir_3_rep1.csv
Saved: train_imbalanced_ir_3_rep2.csv
Saved: train_control_ir_3_rep2.csv
Saved: train_imbalanced_ir_3_rep3.csv
Saved: train_control_ir_3_rep3.csv
Saved: train_imba

In [None]:
import json
from datetime import datetime

metadata = {
    "dataset_name": DATASET_NAME,
    "target_feature": TARGET_FEATURE,
    "processing_timestamp": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
    "n_train_files": len(generated_datasets),
    "imbalance_ratios": IMBALANCE_RATIOS,
    "n_repetitions": N_REPETITIONS,
    "random_state": RANDOM_STATE,
    # Number of rows in the hold-out test set (not the split fraction from the config)
    "test_size": len(test_df),
    "features": FEATURES_TO_SCALE,
    "majority_class": CLASS_LIVER_DISEASE,
    "minority_class": CLASS_NO_DISEASE
}

metadata_path = PROCESSED_PATH / "metadata.json"
with open(metadata_path, 'w') as f:
    json.dump(metadata, f, indent=2)

print(f"\nMetadata saved to: {metadata_path}")
print("\nProcessing complete! All datasets are ready for experiments.")


Metadata saved to: ../../../../data/processed/ilpd/metadata.json

Processing complete! All datasets are ready for experiments.
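The saved training files follow the naming pattern `train_{kind}_ir_{ir}_rep{rep}.csv`, as seen in the output above. A downstream script might recover the dataset kind, IR, and repetition from a file name along these lines. This is only a sketch: the `DatasetFile` type and `parse_dataset_filename` helper are hypothetical, and the pattern assumes integer IRs as in the file names listed.

```python
import re
from typing import NamedTuple, Optional


class DatasetFile(NamedTuple):
    kind: str   # "imbalanced" or "control"
    ir: int     # imbalance ratio encoded in the file name
    rep: int    # repetition index (1-based)


# Matches names such as train_imbalanced_ir_3_rep2.csv
_PATTERN = re.compile(r"^train_(imbalanced|control)_ir_(\d+)_rep(\d+)\.csv$")


def parse_dataset_filename(filename: str) -> Optional[DatasetFile]:
    """Return the parsed parts of a training file name, or None if it does not match."""
    match = _PATTERN.match(filename)
    if match is None:
        return None
    kind, ir, rep = match.groups()
    return DatasetFile(kind, int(ir), int(rep))


for name in ["train_imbalanced_ir_1_rep10.csv", "train_control_ir_3_rep2.csv", "test.csv"]:
    print(name, "->", parse_dataset_filename(name))
```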
