This notebook takes the resized images produced in `01_data_merge_and_cleaning.ipynb` and performs:

- Deterministic shuffling
- Train / validation / test splitting (80% / 10% / 10%)
- Class‑balanced distribution
- Dataset statistics export

It follows the dataset distribution methodology described in the LEAD‑CNN paper.

In this notebook we:


1. Load the cleaned dataset produced in Notebook 01
2. Randomly shuffle images per class (deterministic)
3. Split data into training, validation, and test sets (80/10/10)
4. Save files into their respective folders
5. Generate dataset statistics for verification

In [None]:
# MRI Dataset Splitting (Train / Validation / Test)

In [2]:
import os
from pathlib import Path
import shutil
import random
import pandas as pd
from tqdm import tqdm

In [3]:
# Dataset paths (built from components so they resolve on Windows, Linux, and macOS)
CLEAN_DIR = Path("..") / "data" / "cleaned_data"
STATS_DIR = Path("..") / "data" / "dataset_stats"
STATS_DIR.mkdir(parents=True, exist_ok=True)


# Configuration
CLASSES = ['glioma', 'meningioma', 'notumor', 'pituitary']
SPLIT_RATIOS = {
'train': 0.8,
'val': 0.1,
'test': 0.1
}


RANDOM_SEED = 42
random.seed(RANDOM_SEED)
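
A note on reproducibility: seeding the global generator only makes the shuffle repeatable if the list being shuffled always starts in the same order, and directory listings are not guaranteed to come back in the same order on every system. The sketch below is one possible way to make the result depend only on the file names and the seed. The helper `deterministic_shuffle` and the file names are hypothetical and are not part of this notebook's pipeline.

```python
import random

def deterministic_shuffle(items, seed=42):
    """Return a shuffled copy that depends only on the items and the seed."""
    ordered = sorted(items)      # fix the starting order before shuffling
    rng = random.Random(seed)    # private generator, unaffected by other random calls
    rng.shuffle(ordered)
    return ordered

# Hypothetical file names, listed in two different orders
listing_a = ["img_003.jpg", "img_001.jpg", "img_002.jpg", "img_004.jpg"]
listing_b = ["img_002.jpg", "img_004.jpg", "img_001.jpg", "img_003.jpg"]

shuffled_a = deterministic_shuffle(listing_a)
shuffled_b = deterministic_shuffle(listing_b)
print(shuffled_a == shuffled_b)  # True: both listings give the same shuffled order
```

Using a private `random.Random` instance also means that re-running cells in a different order does not change the result, which a shared global seed cannot guarantee.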

Manual Check

Before proceeding, ensure the following directory exists and contains images:

`cleaned_data/train/<class_name>/*.jpg`

The validation and test folders should be empty at this point.

In [5]:
for cls in CLASSES:
  train_dir = CLEAN_DIR / 'train' / cls
  files = list(train_dir.glob('*'))
  print(f"{cls}: {len(files)} images in cleaned train folder")

glioma: 1621 images in cleaned train folder
meningioma: 1645 images in cleaned train folder
notumor: 2000 images in cleaned train folder
pituitary: 1757 images in cleaned train folder
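
The manual check can also be automated. The following is a minimal sketch, assuming the `train/`, `val/`, `test/` layout described above. The helper name `preflight_problems` is hypothetical, and the demonstration runs on a throwaway temporary directory rather than on the real dataset.

```python
import tempfile
from pathlib import Path

def preflight_problems(clean_dir, classes):
    """Return a list of problems; an empty list means the layout looks ready."""
    problems = []
    for cls in classes:
        train_dir = clean_dir / "train" / cls
        # The train folder must exist and contain at least one entry
        if not train_dir.is_dir() or not any(train_dir.iterdir()):
            problems.append(f"train/{cls} is missing or empty")
        # The val and test folders may be absent, but must not hold stale files
        for split in ("val", "test"):
            split_dir = clean_dir / split / cls
            if split_dir.is_dir() and any(split_dir.iterdir()):
                problems.append(f"{split}/{cls} already contains files")
    return problems

# Demonstration on a temporary directory tree (not the real dataset)
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "train" / "glioma").mkdir(parents=True)
    (root / "train" / "glioma" / "a.jpg").touch()
    (root / "val" / "glioma").mkdir(parents=True)
    (root / "val" / "glioma" / "stale.jpg").touch()
    demo_problems = preflight_problems(root, ["glioma", "notumor"])

print(demo_problems)
```

In this notebook the call would presumably be `preflight_problems(CLEAN_DIR, CLASSES)`, with the split only proceeding when the returned list is empty.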


In [6]:
def clear_directory(directory: Path):
  """Remove all files from a directory."""
  for f in directory.glob('*'):
    if f.is_file():
      f.unlink()


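def split_list(items, train_ratio, val_ratio, test_ratio):
  """Split list into train/val/test using provided ratios.

  Train and validation sizes are truncated to whole numbers; the test set
  receives every remaining item, so no item is dropped. test_ratio is kept
  in the signature for readability but is not used in the computation.
  """
  n = len(items)

  n_train = int(n * train_ratio)
  n_val = int(n * val_ratio)

  train_items = items[:n_train]
  val_items = items[n_train:n_train + n_val]
  test_items = items[n_train + n_val:]  # remainder, may be slightly larger than n * test_ratio

  return train_items, val_items, test_items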
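
To make the arithmetic in `split_list` concrete: the train and validation sizes are truncated with `int()`, and the test set takes whatever is left, so it can end up one or two images larger than the validation set. The sketch below repeats only that size calculation, using the class counts printed earlier in this notebook. `split_sizes` is a hypothetical helper written for illustration.

```python
def split_sizes(n, train_ratio=0.8, val_ratio=0.1):
    """Mirror the size arithmetic of split_list without touching any files."""
    n_train = int(n * train_ratio)   # truncates toward zero
    n_val = int(n * val_ratio)       # truncates toward zero
    n_test = n - n_train - n_val     # test set absorbs the rounding remainder
    return n_train, n_val, n_test

# Image counts per class, as reported for the cleaned train folder
class_counts = {"glioma": 1621, "meningioma": 1645, "notumor": 2000, "pituitary": 1757}

sizes = {cls: split_sizes(n) for cls, n in class_counts.items()}
for cls, (n_train, n_val, n_test) in sizes.items():
    print(f"{cls}: train={n_train}, val={n_val}, test={n_test}")
```

For example, glioma has 1621 images: `int(1621 * 0.8)` is 1296, `int(1621 * 0.1)` is 162, and the remaining 163 go to the test set.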

In [7]:
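# In hindsight, it is safer to clear any existing validation and test folders first

for split in ['val', 'test']:
  for cls in CLASSES:
    split_dir = CLEAN_DIR / split / cls
    split_dir.mkdir(parents=True, exist_ok=True)  # make sure the target folder exists before files are moved into it
    clear_directory(split_dir)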


print("Validation and test folders cleared.")

Validation and test folders cleared.


In [8]:
# Statistics on the dataset split will be useful for later analysis

dataset_stats = []


for cls in CLASSES:
  print(f"\nProcessing class: {cls}")


  class_train_dir = CLEAN_DIR / 'train' / cls
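  # Sort first: glob order can differ between systems, and the seeded shuffle
  # is only reproducible when it starts from a fixed order
  images = sorted(class_train_dir.glob('*'))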


  if len(images) == 0:
    print(f"No images found for {cls}, skipping.")
    continue


  # Shuffle deterministically
  random.shuffle(images)


  # Split
  train_imgs, val_imgs, test_imgs = split_list(
    images,
    SPLIT_RATIOS['train'],
    SPLIT_RATIOS['val'],
    SPLIT_RATIOS['test']
  )


  # Move validation images
  for img_path in tqdm(val_imgs, desc=f"{cls} → val", leave=False):
    shutil.move(str(img_path), str(CLEAN_DIR / 'val' / cls / img_path.name))


  # Move test images
  for img_path in tqdm(test_imgs, desc=f"{cls} → test", leave=False):
    shutil.move(str(img_path), str(CLEAN_DIR / 'test' / cls / img_path.name))


  dataset_stats.append({
    'class': cls,
    'train': len(train_imgs),
    'val': len(val_imgs),
    'test': len(test_imgs),
    'total': len(images)
  })


  print(f"{cls} → Train: {len(train_imgs)}, Val: {len(val_imgs)}, Test: {len(test_imgs)}")


print("\nDataset splitting completed.")


Processing class: glioma
glioma → Train: 1296, Val: 162, Test: 163

Processing class: meningioma
meningioma → Train: 1316, Val: 164, Test: 165

Processing class: notumor
notumor → Train: 1600, Val: 200, Test: 200

Processing class: pituitary
pituitary → Train: 1405, Val: 175, Test: 177

Dataset splitting completed.

In [9]:
# Saving Dataset Statistics

stats_df = pd.DataFrame(dataset_stats)
stats_df.loc['Total'] = ['TOTAL', stats_df.train.sum(), stats_df.val.sum(), stats_df.test.sum(), stats_df.total.sum()]


stats_path = STATS_DIR / 'split_summary.csv'
stats_df.to_csv(stats_path, index=False)


stats_df, stats_path

(            class  train  val  test  total
 0          glioma   1296  162   163   1621
 1      meningioma   1316  164   165   1645
 2         notumor   1600  200   200   2000
 3       pituitary   1405  175   177   1757
 Total       TOTAL   5617  701   705   7023,
 WindowsPath('../data/dataset_stats/split_summary.csv'))
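
As a quick check of the class balance, the realized percentages can be computed from the counts in the table above. This is a small sketch in plain Python. The dictionary is copied by hand from the table rather than read back from `split_summary.csv`, and the variable names are illustrative.

```python
# (train, val, test) counts per class, copied from the summary table
split_counts = {
    "glioma": (1296, 162, 163),
    "meningioma": (1316, 164, 165),
    "notumor": (1600, 200, 200),
    "pituitary": (1405, 175, 177),
}

realized = {}
for cls, counts in split_counts.items():
    total = sum(counts)
    # Percentage of the class that landed in each split, rounded to 2 decimals
    realized[cls] = tuple(round(100 * n / total, 2) for n in counts)
    print(f"{cls}: train={realized[cls][0]}%, val={realized[cls][1]}%, test={realized[cls][2]}%")

# Column totals should match the TOTAL row of the table
column_totals = tuple(sum(col) for col in zip(*split_counts.values()))
print(column_totals)
```

Every class stays within a tenth of a percentage point of the 80/10/10 target, so the split preserves the original class proportions.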

In [10]:
# Double Checking the Number of Images in Each Split After Moving

for split in ['train', 'val', 'test']:
  print(f"\n{split.upper()} set:")
  for cls in CLASSES:
    count = len(list((CLEAN_DIR / split / cls).glob('*')))
    print(f" {cls}: {count}")


TRAIN set:
 glioma: 1296
 meningioma: 1316
 notumor: 1600
 pituitary: 1405

VAL set:
 glioma: 162
 meningioma: 164
 notumor: 200
 pituitary: 175

TEST set:
 glioma: 163
 meningioma: 165
 notumor: 200
 pituitary: 177
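
Counting files confirms the split sizes, but it does not show whether the same file name ended up in more than one split, for example because of stale files from an earlier run. The sketch below is one way to check for that. It compares file names only, not image content, and the helper `find_overlaps` is hypothetical. The demonstration uses a temporary directory with a deliberately duplicated name.

```python
import tempfile
from pathlib import Path

def find_overlaps(clean_dir, classes, splits=("train", "val", "test")):
    """Return {class: [names]} for file names that occur in more than one split."""
    overlaps = {}
    for cls in classes:
        seen = set()
        duplicated = set()
        for split in splits:
            split_dir = clean_dir / split / cls
            names = {p.name for p in split_dir.glob("*")} if split_dir.is_dir() else set()
            duplicated |= seen & names   # names already seen in an earlier split
            seen |= names
        if duplicated:
            overlaps[cls] = sorted(duplicated)
    return overlaps

# Demonstration on a temporary tree where b.jpg appears in both train and test
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    layout = {"train": ["a.jpg", "b.jpg"], "val": ["c.jpg"], "test": ["b.jpg"]}
    for split, names in layout.items():
        split_dir = root / split / "glioma"
        split_dir.mkdir(parents=True)
        for name in names:
            (split_dir / name).touch()
    demo_overlaps = find_overlaps(root, ["glioma"])

print(demo_overlaps)  # {'glioma': ['b.jpg']}
```

On the real dataset, `find_overlaps(CLEAN_DIR, CLASSES)` would be expected to return an empty dictionary after a clean run.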


After running all cells, the dataset should have the following layout:

- `cleaned_data/train/<class>/`
- `cleaned_data/val/<class>/`
- `cleaned_data/test/<class>/`

A dataset statistics file should also exist at
`data/dataset_stats/split_summary.csv`