# 1. Data Loading and Initial Setup

1.  Load and clean the raw data using MICE imputation.
2.  Create a single, hold-out test set.
3.  From the main training set, generate multiple training datasets with varying **imbalance ratios (IR)** by undersampling the minority class (for the balanced 1:1 case, the majority class is undersampled instead).
4.  For each IR, create **multiple repetitions**, each with a different random sample of the undersampled class.
5.  For each IR and repetition, create a size-matched **control dataset** with the original class ratio.
6.  Preprocess and save all generated datasets.


In [2]:
import pandas as pd
import numpy as np
from ucimlrepo import fetch_ucirepo
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.utils import resample
from pathlib import Path

RAW_PATH = Path("../data/raw/mammographic_mass.csv")
PROCESSED_PATH = Path("../data/processed/")
TARGET_FEATURE = 'Severity'
RANDOM_STATE = 42
IMBALANCE_RATIOS = [1, 5, 10, 20, 50, 100] 
N_REPETITIONS = 3  

RAW_PATH.parent.mkdir(parents=True, exist_ok=True)
PROCESSED_PATH.mkdir(parents=True, exist_ok=True)

print("Fetching and loading the raw dataset...")
mammographic_mass = fetch_ucirepo(id=161)
X_features = mammographic_mass.data.features
y_target = mammographic_mass.data.targets
df_raw = pd.concat([X_features, y_target], axis=1)
df_raw.to_csv(RAW_PATH, index=False)
print(f"Raw dataset loaded and saved. Shape: {df_raw.shape}")


Fetching and loading the raw dataset...


KeyboardInterrupt: 

# 2. Data Cleaning: Imputing Missing Values

As determined in the EDA, we use MICE (via scikit-learn's `IterativeImputer`) to impute missing values to avoid introducing bias.

In [None]:
cols_to_impute = ['BI-RADS', 'Age', 'Shape', 'Margin', 'Density']
imputer = IterativeImputer(max_iter=10, random_state=RANDOM_STATE)

print("Starting MICE imputation...")
df_cleaned = df_raw.copy()
df_cleaned[cols_to_impute] = imputer.fit_transform(df_raw[cols_to_impute])

df_cleaned['BI-RADS'] = np.clip(df_cleaned['BI-RADS'], 1, 5)
df_cleaned['Shape'] = np.clip(df_cleaned['Shape'], 1, 4)
df_cleaned['Margin'] = np.clip(df_cleaned['Margin'], 1, 5)
df_cleaned['Density'] = np.clip(df_cleaned['Density'], 1, 4)
for col in cols_to_impute:
    df_cleaned[col] = df_cleaned[col].round().astype(int)

print(f"Data cleaned and imputed. Shape: {df_cleaned.shape}")
print("\nMissing values after imputation:", df_cleaned.isnull().sum().sum())


Starting MICE imputation...
Data cleaned and imputed. Shape: (961, 6)

Missing values after imputation: 0
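
The clip-and-round step in the cell above exists because `IterativeImputer` returns continuous estimates, which can fall between or outside the integer codes of an ordinal feature. A minimal sketch on made-up values (the array and the 1 to 5 range are illustrative, not taken from the dataset):

```python
import numpy as np

# Hypothetical continuous imputer outputs for an ordinal feature coded 1..5.
imputed = np.array([0.3, 2.6, 5.8, 3.2])

# Clip into the valid range first, then round to the nearest integer code.
cleaned = np.clip(imputed, 1, 5).round().astype(int)
print(cleaned)
```

Clipping before rounding keeps out-of-range estimates such as 0.3 or 5.8 from turning into codes that do not exist in the feature.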


# 3. Confirming Majority and Minority Classes

For our imbalance experiments, we need to clearly define the majority and minority classes.
- **Class 0 (Majority):** Benign
- **Class 1 (Minority):** Malignant


In [None]:
print("Target variable distribution:")
print(df_cleaned[TARGET_FEATURE].value_counts())


Target variable distribution:
Severity
0    516
1    445
Name: count, dtype: int64


# 4. Create a Hold-Out Test Set

We perform a one-time stratified 80/20 split to create a final hold-out test set. All experimental datasets are generated from `train_full_df`; the test set is never resampled.


In [None]:
X = df_cleaned.drop(TARGET_FEATURE, axis=1)
y = df_cleaned[[TARGET_FEATURE]]

X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
)

train_full_df = pd.concat([X_train_full, y_train_full], axis=1)

print(f"Full training set shape: {train_full_df.shape}")
print(f"Hold-out test set shape: {X_test.shape}")

Full training set shape: (768, 6)
Hold-out test set shape: (193, 5)
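
One way to sanity-check the stratified split is to compare class proportions across the two partitions. The sketch below uses synthetic labels with the class counts reported earlier (516 benign, 445 malignant) rather than the real data; all variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels mirroring the reported class counts (not the real data).
y_demo = np.array([0] * 516 + [1] * 445)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, the malignant fraction should be nearly identical
# in the training and test partitions.
frac_train = y_tr.mean()
frac_test = y_te.mean()
print(f"train: n={len(y_tr)}, malignant fraction={frac_train:.3f}")
print(f"test:  n={len(y_te)}, malignant fraction={frac_test:.3f}")
```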


# 5. Generate Imbalanced and Control Datasets with Multiple Repetitions

For each IR, we create multiple repetitions, each sampling a different subset of the minority class. This lets us check whether a method works reliably across different minority class samples rather than on one particular draw.

1.  Start with the **full majority class** ('Benign').
2.  **Undersample the minority class** ('Malignant') to achieve the desired Imbalance Ratio (IR). For the balanced case (IR = 1), the majority class is undersampled to match the minority class instead.
3.  **Repeat this sampling `N_REPETITIONS` times** with different random seeds.
4.  Create a size-matched **control dataset** for each IR and repetition.



  * **Methodology:** "To generate datasets with varying degrees of class imbalance, the majority class was held constant at 412 samples while the minority class was progressively undersampled to achieve imbalance ratios from 5:1 to 100:1. It should be noted that this approach intrinsically links a higher imbalance ratio with a smaller number of minority class samples."
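  * **Discussion:** When interpreting the results, we cannot claim that any degradation in synthetic data quality is due *only* to the imbalance ratio, because a higher IR also means fewer minority samples and a smaller dataset overall. The size-matched control datasets are intended to help separate the effect of total dataset size from the effect of class imbalance.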
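
The link between IR and minority sample count can be made concrete with a little arithmetic. The sketch below assumes the majority class is fixed at 412 samples, as stated in the methodology note, and uses the same floor division as the generation loop; `minority_count` is a hypothetical helper, not part of the notebook.

```python
# Majority class size as stated in the methodology note (assumed fixed).
N_MAJORITY = 412
RATIOS = [5, 10, 20, 50, 100]


def minority_count(n_majority: int, ir: int) -> int:
    """Minority samples needed for an ir:1 ratio, rounded down."""
    return n_majority // ir


ir_table = {ir: minority_count(N_MAJORITY, ir) for ir in RATIOS}

for ir, n_min in ir_table.items():
    realized = N_MAJORITY / n_min  # actual ratio after rounding down
    print(f"IR {ir:>3}:1 -> {n_min:>2} minority samples (realized {realized:.1f}:1)")
```

At 100:1 only 4 malignant samples remain, so results at the highest ratios reflect very small minority samples as much as the ratio itself.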

In [None]:
malignant_df = train_full_df[train_full_df[TARGET_FEATURE] == 1]
benign_df = train_full_df[train_full_df[TARGET_FEATURE] == 0]
n_minority_available = len(malignant_df)
n_majority_available = len(benign_df)

print(f"\nFull training set composition: {n_majority_available} majority (Benign), {n_minority_available} minority (Malignant).")
print(f"\nGenerating datasets with {N_REPETITIONS} repetitions per imbalance ratio...")

generated_datasets = {}

for ir in IMBALANCE_RATIOS:
    print(f"\n{'='*80}")
    print(f"Processing Imbalance Ratio (IR) = {ir}:1")
    print(f"{'='*80}")
    
    for rep_id in range(1, N_REPETITIONS + 1):
        print(f"\n  Repetition {rep_id}/{N_REPETITIONS}")
        
        # Use different random seed for each repetition
        rep_seed = RANDOM_STATE + (ir * 1000) + rep_id
        
        if ir == 1:
            # For balanced case, undersample majority to match minority
            majority_undersampled = resample(
                benign_df,
                replace=False,
                n_samples=n_minority_available, 
                random_state=rep_seed 
            )
            imbalanced_df = pd.concat([majority_undersampled, malignant_df])
            
        else:
            majority_full_set = benign_df
            
            n_minority_imbalanced = int(n_majority_available / ir)

            if n_minority_imbalanced > n_minority_available:
                print(f"    SKIPPING: Cannot create {ir}:1 ratio as it requires more minority samples than available.")
                continue
            if n_minority_imbalanced < 1:
                print(f"    SKIPPING: Ratio {ir}:1 results in zero minority samples.")
                continue

            # Sample different minority instances each repetition
            minority_undersampled = resample(
                malignant_df,
                replace=False,
                n_samples=n_minority_imbalanced,
                random_state=rep_seed 
            )

            imbalanced_df = pd.concat([majority_full_set, minority_undersampled])

        total_size = len(imbalanced_df)
        
        # Store with repetition ID in the key
        dataset_key = f'imbalanced_ir_{ir}_rep{rep_id}'
        generated_datasets[dataset_key] = imbalanced_df
        
        n_maj = len(imbalanced_df[imbalanced_df[TARGET_FEATURE] == 0])
        n_min = len(imbalanced_df[imbalanced_df[TARGET_FEATURE] == 1])
        print(f"    Imbalanced set created: {total_size} samples ({n_maj} majority, {n_min} minority)")

        if total_size >= len(train_full_df):
            control_df = train_full_df.copy()
        else:
            control_df, _ = train_test_split(
                train_full_df,
                train_size=total_size,
                random_state=rep_seed,  
                stratify=train_full_df[TARGET_FEATURE]
            )
        
        control_key = f'control_ir_{ir}_rep{rep_id}'
        generated_datasets[control_key] = control_df
        print(f"    Control set created:      {len(control_df)} samples (original class ratio)")

print("Dataset generation complete!")
print(f"Total datasets created: {len(generated_datasets)}")
print(f"  - Imbalanced: {sum('imbalanced' in k for k in generated_datasets)}")
print(f"  - Control: {sum('control' in k for k in generated_datasets)}")


NameError: name 'train_full_df' is not defined

# 6. Preprocessing and Saving All Datasets

We fit the scaler **once** on the full training data. Then, we transform all generated training sets and the hold-out test set using this single, consistent scaler.
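
As a toy illustration of what "fit once, transform everywhere" means, the sketch below fits a `StandardScaler` on a made-up "full training set" and applies it to a subset. The data and variable names are hypothetical; only the scikit-learn calls mirror the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
full_train = rng.normal(loc=50.0, scale=10.0, size=(200, 2))  # toy full training set
subset = full_train[:20]  # toy generated dataset drawn from it

# Fit once on the full training data...
scaler_demo = StandardScaler().fit(full_train)

# ...then reuse the same mean and scale for every derived dataset.
subset_scaled = scaler_demo.transform(subset)

# The transform is equivalent to standardizing with the full-data statistics.
expected = (subset - full_train.mean(axis=0)) / full_train.std(axis=0)
print(np.allclose(subset_scaled, expected))
```

A subset scaled this way will generally not have exactly zero mean and unit variance on its own. That is expected: every dataset, including the test set, shares one reference scale, so feature values stay comparable across experiments.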


In [None]:
FEATURES_TO_SCALE = ['BI-RADS', 'Age', 'Shape', 'Margin', 'Density']

scaler = StandardScaler()
scaler.fit(X_train_full[FEATURES_TO_SCALE])

print("Scaling and saving datasets...\n")

for name, df in generated_datasets.items():
    X_temp = df.drop(columns=[TARGET_FEATURE])
    y_temp = df[[TARGET_FEATURE]]

    X_processed = scaler.transform(X_temp[FEATURES_TO_SCALE])
    X_processed_df = pd.DataFrame(X_processed, columns=FEATURES_TO_SCALE)
    
    final_df = X_processed_df.reset_index(drop=True)
    final_df[TARGET_FEATURE] = y_temp.reset_index(drop=True)
    
    save_path = PROCESSED_PATH / f"train_{name}.csv"
    final_df.to_csv(save_path, index=False)
    print(f"Saved: {save_path.name}")

X_test_processed = scaler.transform(X_test[FEATURES_TO_SCALE])
X_test_processed_df = pd.DataFrame(X_test_processed, columns=FEATURES_TO_SCALE)

test_df = X_test_processed_df.reset_index(drop=True)
test_df[TARGET_FEATURE] = y_test.reset_index(drop=True)
test_df.to_csv(PROCESSED_PATH / "test.csv", index=False)

print("Preprocessing complete. All datasets are ready for experiments.")


Scaling and saving datasets...

Saved: train_imbalanced_ir_1_rep1.csv
Saved: train_control_ir_1_rep1.csv
Saved: train_imbalanced_ir_1_rep2.csv
Saved: train_control_ir_1_rep2.csv
Saved: train_imbalanced_ir_1_rep3.csv
Saved: train_control_ir_1_rep3.csv
Saved: train_imbalanced_ir_5_rep1.csv
Saved: train_control_ir_5_rep1.csv
Saved: train_imbalanced_ir_5_rep2.csv
Saved: train_control_ir_5_rep2.csv
Saved: train_imbalanced_ir_5_rep3.csv
Saved: train_control_ir_5_rep3.csv
Saved: train_imbalanced_ir_10_rep1.csv
Saved: train_control_ir_10_rep1.csv
Saved: train_imbalanced_ir_10_rep2.csv
Saved: train_control_ir_10_rep2.csv
Saved: train_imbalanced_ir_10_rep3.csv
Saved: train_control_ir_10_rep3.csv
Saved: train_imbalanced_ir_20_rep1.csv
Saved: train_control_ir_20_rep1.csv
Saved: train_imbalanced_ir_20_rep2.csv
Saved: train_control_ir_20_rep2.csv
Saved: train_imbalanced_ir_20_rep3.csv
Saved: train_control_ir_20_rep3.csv
Saved: train_imbalanced_ir_50_rep1.csv
Saved: train_control_ir_50_rep1.csv
Saved