MRI Dataset Cleaning

We will perform the following steps on the data:
Merge
Resize
Split and Save

We merge the Training and Testing sets from the original data, resize the images, and then split them randomly. This keeps the data randomised and representative without deviating from the original implementation methods of the paper.

In [None]:
%pip install opencv-python

Collecting opencv-python
  Downloading opencv_python-4.12.0.88-cp37-abi3-win_amd64.whl.metadata (19 kB)
Collecting numpy<2.3.0,>=2 (from opencv-python)
  Downloading numpy-2.2.6-cp312-cp312-win_amd64.whl.metadata (60 kB)
Downloading opencv_python-4.12.0.88-cp37-abi3-win_amd64.whl (39.0 MB)

  You can safely remove it manually.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
contourpy 1.2.0 requires numpy<2.0,>=1.20, but you have numpy 2.2.6 which is incompatible.
gensim 4.3.3 requires numpy<2.0,>=1.18.5, but you have numpy 2.2.6 which is incompatible.
numba 0.60.0 requires numpy<2.1,>=1.22, but you have numpy 2.2.6 which is incompatible.
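
The conflicts reported above mean that some installed packages declare a NumPy range that excludes 2.2.6. A minimal sketch for listing those declared ranges with the standard library only; the helper name `numpy_requirements` is hypothetical, and the package names passed to it would be the ones from the log above:

```python
import re
from importlib import metadata


def numpy_requirements(dist_names):
    """Return {distribution: [declared numpy requirement strings]}.

    Distributions that are not installed are skipped.
    """
    found = {}
    for name in dist_names:
        try:
            requirements = metadata.requires(name) or []
        except metadata.PackageNotFoundError:
            continue
        # Keep entries such as "numpy<2.0,>=1.20" but not e.g. "numpy-financial".
        found[name] = [
            r for r in requirements
            if re.match(r"numpy(?![A-Za-z0-9._-])", r, flags=re.IGNORECASE)
        ]
    return found
```

Calling `numpy_requirements(["contourpy", "gensim", "numba"])` in the same environment would show which declared ranges the installed NumPy has to satisfy.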


In [1]:
import cv2
import numpy as np

print("OpenCV version:", cv2.__version__)
print("NumPy version:", np.__version__)

OpenCV version: 4.12.0
NumPy version: 2.2.6


In [1]:
# Importing Libraries
import os
import shutil
from pathlib import Path
import cv2
import numpy as np
from sklearn.model_selection import train_test_split
from tqdm import tqdm

Since the raw data is already stored in the project repository, we load it from there rather than from Kaggle directly.

We define the paths to the raw data folders and to the folder where the cleaned data will be stored.

In [None]:
# Define Paths
# Built from parts so the separators are correct on any operating system
RAW_TRAIN_DIR = Path("..") / "data" / "raw_data" / "Training"
RAW_TEST_DIR  = Path("..") / "data" / "raw_data" / "Testing"
CLEAN_DIR     = Path("..") / "data" / "cleaned_data"
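
Before processing, it can help to confirm that the raw folders exist and contain files. A minimal sketch, assuming one subfolder per class under each raw directory (the layout the later cells rely on); `count_files_per_class` is a hypothetical helper and not part of the original notebook:

```python
from pathlib import Path


def count_files_per_class(data_dir, classes):
    """Return {class_name: number of files}, with None for a missing class folder."""
    counts = {}
    for cls in classes:
        class_dir = Path(data_dir) / cls
        if class_dir.is_dir():
            counts[cls] = sum(1 for p in class_dir.iterdir() if p.is_file())
        else:
            # A missing folder is reported explicitly rather than counted as zero.
            counts[cls] = None
    return counts
```

For example, `count_files_per_class(RAW_TRAIN_DIR, ['glioma', 'meningioma', 'notumor', 'pituitary'])` returns one count per class, so a mistyped path shows up as `None` values instead of a silently empty dataset.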

Next we create the folder structure in the clean directory, with one subfolder per class under each of the train, validation, and test splits.

In [5]:
# Configuration 
CLASSES = ['glioma', 'meningioma', 'notumor', 'pituitary']
IMG_SIZE = (224, 224)
SPLIT_RATIOS = {'train': 0.8, 'val': 0.1, 'test': 0.1}

# Create Folder Structure 
for split in ['train', 'val', 'test']:
    for cls in CLASSES:
        path = CLEAN_DIR / split / cls
        path.mkdir(parents=True, exist_ok=True)

print("Folder structure created under:", CLEAN_DIR)

Folder structure created under: ..\data\cleaned_data
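
To see what the 80/10/10 ratios mean in image counts, the arithmetic can be worked through by hand. The sketch below assumes the two-stage split used later in this notebook (hold out `1 - 0.8` of the images, then halve the held-out part) and assumes that scikit-learn rounds a fractional `test_size` up to the next whole image; `split_sizes` is a hypothetical helper for illustration only:

```python
import math


def split_sizes(n_images, train_ratio=0.8):
    """Return (train, val, test) counts for a two-stage hold-out split."""
    # Stage 1: hold out (1 - train_ratio) of the images, rounded up.
    n_temp = math.ceil((1 - train_ratio) * n_images)
    n_train = n_images - n_temp
    # Stage 2: half of the held-out images go to test (rounded up), the rest to val.
    n_test = math.ceil(0.5 * n_temp)
    n_val = n_temp - n_test
    return n_train, n_val, n_test


print(split_sizes(1621))  # (1296, 162, 163)
print(split_sizes(2000))  # (1600, 200, 200)
```

Because both stages round up, a class whose held-out part has an odd number of images ends up with one more test image than validation images.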


Next we define the function that loads, resizes, and collects the images. Each image is resized to 224x224, which provided the best results.

In [6]:
# Function: Load, Resize, and Collect Image Paths
def load_and_resize_images(class_name):
    """
    Combines training and testing images for a given class,
    resizes to 224x224, and returns a list of (image_array, filename).
    """
    images = []
    # Combine both train and test directories
    for data_dir in [RAW_TRAIN_DIR, RAW_TEST_DIR]:
        class_dir = data_dir / class_name
        if not class_dir.exists():
            print(f" Skipped missing directory: {class_dir}")
            continue
        # Sort the listing so the image order, and therefore the seeded split, is reproducible
        for img_file in sorted(class_dir.glob('*')):
            try:
                img = cv2.imread(str(img_file))
                if img is None:
                    continue
                img_resized = cv2.resize(img, IMG_SIZE)
                images.append((img_resized, img_file.name))
            except Exception as e:
                print(f"Error processing {img_file}: {e}")
    return images

Then we define the function that saves the resized images to disk.

In [7]:
# Function: Save Images 
def save_images(images, save_dir, class_name):
    """
    Saves a list of (image_array, filename) into a given directory.
    """
    for img_array, filename in images:
        save_path = save_dir / class_name / filename
        cv2.imwrite(str(save_path), img_array)
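
Because `save_images` writes each file under its original name, two source images that share a filename (for example one in Training and one in Testing) would be written to the same path, and the later write would replace the earlier one. A minimal sketch for detecting such clashes before saving; `find_duplicate_names` is a hypothetical helper and not part of the original pipeline:

```python
from collections import Counter
from pathlib import Path


def find_duplicate_names(directories):
    """Return sorted filenames that occur more than once across the given directories."""
    names = Counter()
    for directory in directories:
        directory = Path(directory)
        if not directory.is_dir():
            continue  # missing folders contribute no names
        names.update(p.name for p in directory.iterdir() if p.is_file())
    return sorted(name for name, count in names.items() if count > 1)
```

For one class this could be called as `find_duplicate_names([RAW_TRAIN_DIR / cls, RAW_TEST_DIR / cls])`; an empty list means no filename is shared between the two folders.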

Now we process each class in turn: load and resize its images, split them, and save each split.

In [None]:
# Process Each Class (skip empty classes, handle classes with fewer than 3 images, seeded split)
for cls in CLASSES:
    print(f"\n Processing class: {cls}")
    images = load_and_resize_images(cls)
    total_images = len(images)
    print(f"Total images found (merged): {total_images}")

    # Skip if no images found
    if total_images == 0:
        print(f" Skipping '{cls}': no images found in Training or Testing directories.")
        continue

    # Skip splitting if too few images for 3 sets
    if total_images < 3:
        print(f" Too few images ({total_images}) for splitting. Saving all to 'train' folder.")
        save_images(tqdm(images, leave=False), CLEAN_DIR / 'train', cls)
        continue

    # Prepare numpy arrays for splitting
    X = np.arange(total_images)
    y = [cls] * total_images

    # Split: train (80%) + temp (20%)
    X_train, X_temp, y_train, y_temp = train_test_split(
        X, y, test_size=(1 - SPLIT_RATIOS['train']), random_state=42, shuffle=True
    )
    # Split temp into val (10%) and test (10%)
    X_val, X_test, y_val, y_test = train_test_split(
        X_temp, y_temp, test_size=0.5, random_state=42, shuffle=True
    )

    # Map indices to image data
    train_imgs = [images[i] for i in X_train]
    val_imgs   = [images[i] for i in X_val]
    test_imgs  = [images[i] for i in X_test]

    # Save into respective folders
    print("   Saving training images...")
    save_images(tqdm(train_imgs, leave=False), CLEAN_DIR / 'train', cls)

    print("   Saving validation images...")
    save_images(tqdm(val_imgs, leave=False), CLEAN_DIR / 'val', cls)

    print("   Saving testing images...")
    save_images(tqdm(test_imgs, leave=False), CLEAN_DIR / 'test', cls)

    print(f" Done: {cls} | Train: {len(train_imgs)} | Val: {len(val_imgs)} | Test: {len(test_imgs)}")

print("\n Dataset successfully cleaned, resized, and split!")



 Processing class: glioma
Total images found (merged): 1621
   Saving training images...
   Saving validation images...
   Saving testing images...
 Done: glioma | Train: 1296 | Val: 162 | Test: 163

 Processing class: meningioma
Total images found (merged): 1645
   Saving training images...
   Saving validation images...
   Saving testing images...
 Done: meningioma | Train: 1316 | Val: 164 | Test: 165

 Processing class: notumor
Total images found (merged): 2000
   Saving training images...
   Saving validation images...
   Saving testing images...
 Done: notumor | Train: 1600 | Val: 200 | Test: 200

 Processing class: pituitary
Total images found (merged): 1757
   Saving training images...
   Saving validation images...
   Saving testing images...
 Done: pituitary | Train: 1405 | Val: 176 | Test: 176

 Dataset successfully cleaned, resized, and split!
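
As a sanity check on the saved result, the files on disk can be counted and compared with the numbers printed above. A minimal sketch, assuming the `split/class/filename` layout created earlier; `summarize_splits` is a hypothetical helper, and its overlap check compares filenames only, not pixel content:

```python
from pathlib import Path


def summarize_splits(clean_dir, classes, splits=("train", "val", "test")):
    """Count files per (split, class) and list filenames found in more than one split."""
    clean_dir = Path(clean_dir)
    counts = {}
    overlaps = {}
    for cls in classes:
        seen = set()    # filenames already found in an earlier split
        shared = set()  # filenames found in at least two splits
        for split in splits:
            folder = clean_dir / split / cls
            if folder.is_dir():
                names = [p.name for p in folder.iterdir() if p.is_file()]
            else:
                names = []
            counts[(split, cls)] = len(names)
            for name in names:
                if name in seen:
                    shared.add(name)
                seen.add(name)
        overlaps[cls] = sorted(shared)
    return counts, overlaps
```

Overlaps can appear if the notebook is rerun with a different split without clearing the output folder first, since files from the earlier run remain in place.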


