# 📓 02 – Train / Validation / Test Split

This notebook performs a **stratified split** of the ISIC 2019 dataset into
training, validation, and test sets.

Key objectives:
- Prevent data leakage by assigning each image to exactly one split
- Preserve class distributions across splits
- Organize images into a folder structure compatible with deep learning frameworks
- Generate CSV metadata files for reproducible experiments

This split is reused across both TensorFlow and PyTorch models to ensure fair comparison.


In [1]:
import os
import sys
import shutil
from tqdm import tqdm

import pandas as pd
from sklearn.model_selection import train_test_split

# Add project root to Python path
current_dir = os.getcwd()
parent_dir = os.path.dirname(current_dir)
sys.path.insert(0, parent_dir)

from data_cleaning.paths import PROCESSED_DATA, WORKED_IMGS, DATA


## 📄 Load Processed Metadata

We load the cleaned and preprocessed ISIC 2019 metadata generated during
the data preparation stage.


In [2]:
df = pd.read_csv(os.path.join(PROCESSED_DATA, "ISIC_2019_Training_GroundTruth.csv"))

# Absolute path to processed images
df["filepath"] = df["image"].apply(
    lambda x: os.path.join(WORKED_IMGS, f"{x}.jpg")
)

df.head()

| Unnamed: 0 | image | MEL | NV | BCC | AK | BKL | DF | VASC | SCC | UNK | label_name | label_idx | filepath |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ISIC_0000000 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NV | 1 | c:\Users\hasee\Documents\Python_works\Image_cl... |
| 1 | ISIC_0000001 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NV | 1 | c:\Users\hasee\Documents\Python_works\Image_cl... |
| 2 | ISIC_0000002 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | MEL | 0 | c:\Users\hasee\Documents\Python_works\Image_cl... |
| 3 | ISIC_0000003 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NV | 1 | c:\Users\hasee\Documents\Python_works\Image_cl... |
| 4 | ISIC_0000004 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | MEL | 0 | c:\Users\hasee\Documents\Python_works\Image_cl... |
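Before splitting, it can be worth confirming that the label columns agree with each other. The sketch below rebuilds two of the rows shown above and checks that each one-hot row sums to 1 and that the hot column matches `label_name`. The names `sample`, `one_hot_ok`, and `names_ok` are illustrative and not part of the notebook.

```python
import pandas as pd

# One-hot class columns as they appear in the metadata preview
class_cols = ["MEL", "NV", "BCC", "AK", "BKL", "DF", "VASC", "SCC", "UNK"]

# Two rows copied from the preview (ISIC_0000000 and ISIC_0000002)
sample = pd.DataFrame(
    [[0, 1, 0, 0, 0, 0, 0, 0, 0],
     [1, 0, 0, 0, 0, 0, 0, 0, 0]],
    columns=class_cols,
    dtype=float,
)
sample["label_name"] = ["NV", "MEL"]

# Each row should have exactly one active class
one_hot_ok = bool((sample[class_cols].sum(axis=1) == 1.0).all())

# The active column should match the stored label name
names_ok = bool((sample[class_cols].idxmax(axis=1) == sample["label_name"]).all())

print("one-hot rows valid:", one_hot_ok)
print("label_name matches hot column:", names_ok)
```

The same two checks can be run on the full `df` by replacing `sample` with it.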


## ✂️ Stratified Dataset Split

We split the dataset as follows:
- **70% Training**
- **15% Validation**
- **15% Test**

Stratification is applied on `label_idx` to preserve class distribution,
which is critical in imbalanced medical datasets.
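As a rough illustration of why stratification matters, the sketch below applies the same two-stage 70/15/15 split to a small synthetic, imbalanced label column and measures how far the class proportions drift between splits. The data and the names `toy`, `train_part`, `val_part`, `test_part`, and `max_gap` are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 70% / 20% / 10%
toy = pd.DataFrame({"label_idx": np.repeat([0, 1, 2], [700, 200, 100])})

# Stage 1: hold out 30% of the rows, stratified on the label
train_part, temp_part = train_test_split(
    toy, test_size=0.3, stratify=toy["label_idx"], random_state=42
)

# Stage 2: split the held-out rows in half (15% / 15% of the total)
val_part, test_part = train_test_split(
    temp_part, test_size=0.5, stratify=temp_part["label_idx"], random_state=42
)

# Largest difference in any class proportion between a split and the full set
full = toy["label_idx"].value_counts(normalize=True)
max_gap = max(
    (part["label_idx"].value_counts(normalize=True) - full).abs().max()
    for part in (train_part, val_part, test_part)
)

print(len(train_part), len(val_part), len(test_part))
print("largest proportion gap:", max_gap)
```

Without the `stratify` argument the rare class could easily be over- or under-represented in the smaller splits.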


In [3]:
# Train vs temp (70 / 30)
train_df, temp_df = train_test_split(
    df,
    test_size=0.3,
    stratify=df["label_idx"],
    random_state=42
)

# Validation vs Test (15 / 15)
val_df, test_df = train_test_split(
    temp_df,
    test_size=0.5,
    stratify=temp_df["label_idx"],
    random_state=42
)

print("Train:", len(train_df))
print("Validation:", len(val_df))
print("Test:", len(test_df))


Train: 17731
Validation: 3800
Test: 3800
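The reported sizes can be reproduced by hand. This is a minimal sketch, assuming the held-out size is computed as `ceil(test_size * n_samples)` with the remainder going to the other side of the split; the variable names are illustrative.

```python
import math

n_total = 25331  # total number of images reported in this notebook

# ASSUMPTION: the held-out side is rounded up, the rest stays on the other side
n_temp = math.ceil(0.3 * n_total)   # rows held out in the first split
n_train = n_total - n_temp          # rows kept for training

n_test = math.ceil(0.5 * n_temp)    # second split, held-out half
n_val = n_temp - n_test             # remaining half

print(n_train, n_val, n_test)
```

Under that assumption the three numbers match the counts printed above.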


## 📁 Organizing Files on Disk

Images are copied into the following directory structure:

```
DATA/
├── train/
│   ├── MEL/
│   ├── NV/
│   └── ...
├── val/
│   ├── MEL/
│   └── ...
└── test/
    ├── MEL/
    └── ...
```

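This one-folder-per-class structure is the layout expected by folder-based loaders in both frameworks, such as `tf.keras.utils.image_dataset_from_directory` in TensorFlow and `torchvision.datasets.ImageFolder` in PyTorch.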
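One thing to watch when loading from folders: loaders that infer classes from directory names typically number them in sorted name order, which need not match the `label_idx` column in the metadata. The sketch below compares the two orderings. It assumes that `label_idx` follows the order of the one-hot columns in the CSV; only `MEL = 0` and `NV = 1` are visible in the preview, so the remaining indices are a guess.

```python
# Class names that appear in this dataset's splits
class_names = ["MEL", "NV", "BCC", "AK", "BKL", "DF", "VASC", "SCC"]

# Index a folder-based loader would typically assign (sorted folder names)
folder_index = {name: i for i, name in enumerate(sorted(class_names))}

# ASSUMPTION: label_idx follows the CSV column order (MEL=0, NV=1, ...)
csv_index = {name: i for i, name in enumerate(class_names)}

# Classes whose index differs between the two conventions
mismatched = [n for n in class_names if folder_index[n] != csv_index[n]]

for name in class_names:
    print(f"{name}: folder={folder_index[name]} csv={csv_index[name]}")
print("classes with differing indices:", len(mismatched))
```

If the two conventions disagree, it is safer to pass an explicit class list to the loader or to map labels by name rather than by index.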



In [5]:
def move_files(df: pd.DataFrame, split_name: str) -> None:
    """Copy each image into DATA/<split_name>/<label_name>/ (originals are kept)."""
    for _, row in tqdm(df.iterrows(), total=len(df)):
        label = row["label_name"]
        src = row["filepath"]
        dst = os.path.join(DATA, split_name, label)
        os.makedirs(dst, exist_ok=True)
        # shutil.copy leaves the source file in place
        shutil.copy(src, dst)

# Uncomment to copy the images into their split folders
# move_files(train_df, "train")
# move_files(val_df, "val")
# move_files(test_df, "test")

100%|██████████| 17731/17731 [00:32<00:00, 549.07it/s]
100%|██████████| 3800/3800 [00:06<00:00, 548.80it/s]
100%|██████████| 3800/3800 [00:06<00:00, 558.47it/s]
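If the copy step may be re-run, a variant that skips files already in place avoids redundant work. The following is a sketch on throwaway placeholder files, not the notebook's own function; `copy_split` and the temporary directory layout are hypothetical.

```python
import os
import shutil
import tempfile

import pandas as pd


def copy_split(df, split_name, root):
    """Copy images to root/split_name/label_name/, skipping existing files."""
    copied = 0
    for row in df.itertuples(index=False):
        dst_dir = os.path.join(root, split_name, row.label_name)
        os.makedirs(dst_dir, exist_ok=True)
        dst = os.path.join(dst_dir, os.path.basename(row.filepath))
        if not os.path.exists(dst):
            shutil.copy(row.filepath, dst)
            copied += 1
    return copied


with tempfile.TemporaryDirectory() as tmp:
    # Create three placeholder "images" in a source folder
    src_dir = os.path.join(tmp, "src")
    os.makedirs(src_dir)
    rows = []
    for name, label in [("img_a", "MEL"), ("img_b", "NV"), ("img_c", "NV")]:
        path = os.path.join(src_dir, f"{name}.jpg")
        with open(path, "wb") as fh:
            fh.write(b"placeholder")
        rows.append({"image": name, "label_name": label, "filepath": path})
    demo_df = pd.DataFrame(rows)

    root = os.path.join(tmp, "DATA")
    first_run = copy_split(demo_df, "train", root)   # copies everything
    second_run = copy_split(demo_df, "train", root)  # nothing left to copy
    nv_files = sorted(os.listdir(os.path.join(root, "train", "NV")))

print(first_run, second_run, nv_files)
```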


## 💾 Save Split Metadata

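One CSV file is saved per split, inside that split's folder. The machine-specific `filepath` column is dropped first, so the same split can be reloaded on another machine.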


In [6]:
train_df.drop(columns=["filepath"]).to_csv(
    os.path.join(DATA, "train", "split_train.csv"),
    index=False
)

val_df.drop(columns=["filepath"]).to_csv(
    os.path.join(DATA, "val", "split_val.csv"),
    index=False
)

test_df.drop(columns=["filepath"]).to_csv(
    os.path.join(DATA, "test", "split_test.csv"),
    index=False
)
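Because the saved CSVs omit `filepath`, a later notebook has to rebuild it from the folder layout. The sketch below shows one way to do that on a temporary directory; `load_split` is a hypothetical helper, and the file naming follows the `split_<name>.csv` pattern used above.

```python
import os
import tempfile

import pandas as pd


def load_split(root, split_name):
    """Read root/<split>/split_<split>.csv and rebuild the image paths."""
    csv_path = os.path.join(root, split_name, f"split_{split_name}.csv")
    split = pd.read_csv(csv_path)
    split["filepath"] = [
        os.path.join(root, split_name, label, f"{image}.jpg")
        for image, label in zip(split["image"], split["label_name"])
    ]
    return split


with tempfile.TemporaryDirectory() as root:
    # Write a one-row stand-in for split_val.csv
    os.makedirs(os.path.join(root, "val"))
    pd.DataFrame(
        {"image": ["ISIC_0000000"], "label_name": ["NV"], "label_idx": [1]}
    ).to_csv(os.path.join(root, "val", "split_val.csv"), index=False)

    reloaded = load_split(root, "val")
    expected_path = os.path.join(root, "val", "NV", "ISIC_0000000.jpg")

print(list(reloaded.columns))
```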

## 🔍 File Integrity Check

We verify that every image listed in each split exists at its expected location under `DATA/<split>/<label_name>/`.
Missing files are counted, and their rows are dropped from the split.


In [7]:
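def check_files(split_name, df):
    print(f"\n{split_name} SET")
    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()
    # Expected location of each image inside its split/class folder
    df["filepath"] = df.apply(
        lambda row: os.path.join(DATA, split_name.lower(), row["label_name"], f"{row['image']}.jpg"),
        axis=1
    )

    df["exists"] = df["filepath"].apply(os.path.exists)
    print("Total:", len(df))
    print("Found:", df["exists"].sum())
    print("Missing:", len(df) - df["exists"].sum())

    # Keep only the rows whose image file was found
    return df[df["exists"]].copy()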

train_df = check_files("Train", train_df)
val_df   = check_files("Val", val_df)
test_df  = check_files("Test", test_df)



Train SET
Total: 17731
Found: 17731
Missing: 0

Val SET
Total: 3800
Found: 3800
Missing: 0

Test SET
Total: 3800
Found: 3800
Missing: 0
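Checking that files exist does not by itself show that the splits are disjoint. A short additional check, sketched here on toy frames, looks for identifiers shared between splits. `overlapping_ids` and the demo frames are hypothetical. If the metadata also provides a lesion or patient identifier, passing that column as `key` would catch leakage through multiple images of the same case.

```python
import pandas as pd


def overlapping_ids(*frames, key="image"):
    """Return the key values that occur in more than one of the given frames."""
    seen, dupes = set(), set()
    for frame in frames:
        ids = set(frame[key])
        dupes |= seen & ids
        seen |= ids
    return dupes


# Toy splits using image IDs in the same format as the metadata
train_demo = pd.DataFrame({"image": ["ISIC_0000000", "ISIC_0000001"]})
val_demo = pd.DataFrame({"image": ["ISIC_0000002"]})
test_demo = pd.DataFrame({"image": ["ISIC_0000003"]})
leaky_demo = pd.DataFrame({"image": ["ISIC_0000001"]})  # repeats a train image

clean = overlapping_ids(train_demo, val_demo, test_demo)
leaky = overlapping_ids(train_demo, val_demo, leaky_demo)

print("overlap in clean splits:", clean)
print("overlap in leaky splits:", leaky)
```

On the real data the call would be `overlapping_ids(train_df, val_df, test_df)`, and an empty set is the expected result.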


## 📊 Final Dataset Summary


In [8]:
total = len(train_df) + len(val_df) + len(test_df)

print(f"Total samples: {total}")
print(f"Train: {len(train_df)} ({len(train_df)/total*100:.1f}%)")
print(f"Validation: {len(val_df)} ({len(val_df)/total*100:.1f}%)")
print(f"Test: {len(test_df)} ({len(test_df)/total*100:.1f}%)")


Total samples: 25331
Train: 17731 (70.0%)
Validation: 3800 (15.0%)
Test: 3800 (15.0%)


## 🧪 Class Distribution per Split


In [9]:
for name, split_df in [("Train", train_df), ("Validation", val_df), ("Test", test_df)]:
    print(f"\n{name} distribution:")
    counts = split_df["label_name"].value_counts()
    for label, count in counts.items():
        print(f"{label}: {count} ({count/len(split_df)*100:.1f}%)")


Train distribution:
NV: 9012 (50.8%)
MEL: 3165 (17.9%)
BCC: 2326 (13.1%)
BKL: 1837 (10.4%)
AK: 607 (3.4%)
SCC: 440 (2.5%)
VASC: 177 (1.0%)
DF: 167 (0.9%)

Validation distribution:
NV: 1932 (50.8%)
MEL: 678 (17.8%)
BCC: 498 (13.1%)
BKL: 394 (10.4%)
AK: 130 (3.4%)
SCC: 94 (2.5%)
VASC: 38 (1.0%)
DF: 36 (0.9%)

Test distribution:
NV: 1931 (50.8%)
MEL: 679 (17.9%)
BCC: 499 (13.1%)
BKL: 393 (10.3%)
AK: 130 (3.4%)
SCC: 94 (2.5%)
VASC: 38 (1.0%)
DF: 36 (0.9%)
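To put a number on how well the class balance is preserved, the counts printed above can be compared directly. The sketch below recomputes the per-class proportions and reports the largest gap between the training set and either held-out set. The helper `proportions` and the variable names are illustrative.

```python
# Per-class counts copied from the output above
counts = {
    "train": {"NV": 9012, "MEL": 3165, "BCC": 2326, "BKL": 1837,
              "AK": 607, "SCC": 440, "VASC": 177, "DF": 167},
    "val": {"NV": 1932, "MEL": 678, "BCC": 498, "BKL": 394,
            "AK": 130, "SCC": 94, "VASC": 38, "DF": 36},
    "test": {"NV": 1931, "MEL": 679, "BCC": 499, "BKL": 393,
             "AK": 130, "SCC": 94, "VASC": 38, "DF": 36},
}


def proportions(split_counts):
    """Convert raw class counts into fractions of the split."""
    total = sum(split_counts.values())
    return {label: n / total for label, n in split_counts.items()}


props = {name: proportions(c) for name, c in counts.items()}

# Largest absolute difference in any class share, train vs. val or test
max_gap = max(
    abs(props["train"][label] - props[other][label])
    for other in ("val", "test")
    for label in props["train"]
)

print(f"Largest train-vs-held-out gap: {max_gap * 100:.3f} percentage points")
```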


### ✅ Outcome

- Dataset split so that each image belongs to exactly one of the training, validation, and test sets, avoiding image-level leakage
- Class distributions preserved across splits
- Images organized for deep learning pipelines
- Metadata saved for reproducibility

This split will be reused for:
- TensorFlow training
- PyTorch training
- Model comparison and evaluation
