# Week-2: BraTS2020 Data Preprocessing (FULL DATASET - 369 PATIENTS)

## Overview
This notebook processes **ALL 369 patients** from the BraTS2020 training dataset for maximum performance.

**Key Features:**
- ✅ Process all 369 patients automatically
- ✅ Generate 6,000-10,000 patches total
- ✅ Tumor-focused sampling (85% tumor patches)
- ✅ Robust error handling
- ✅ Progress tracking and checkpointing

**Expected Output:**
- ~369 patients × ~17-27 patches = **6,273-9,963 patches**
- Train set: **~5,000-8,000 patches**
- Val set: **~1,250-2,000 patches**

**Processing Time:**
- Estimated: **6-10 hours** (depends on your CPU)
- Recommended: Run overnight

**Storage Required:**
- ~40-50 GB for preprocessed patches

In [1]:
# Cell 1: Import Libraries

import nibabel as nib
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import glob
from pathlib import Path
from sklearn.model_selection import train_test_split
from tqdm import tqdm
import warnings
import time
from datetime import datetime
warnings.filterwarnings('ignore')

# Set random seed
np.random.seed(42)

# Set style
sns.set_style("whitegrid")
sns.set_context("notebook", font_scale=1.1)

print("✓ All libraries imported successfully")
print(f"NumPy version: {np.__version__}")
print(f"Start time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

✓ All libraries imported successfully
NumPy version: 2.3.4
Start time: 2025-10-30 16:33:39


In [2]:
# Cell 2: Configuration (FULL DATASET - 369 PATIENTS)

# Paths
BASE_PATH = os.path.join('..', 'BraTS2020_training_data', 'MICCAI_BraTS2020_TrainingData')
OUTPUT_DIR = os.path.join('..', 'processed_data')

# AUTO-DISCOVER ALL PATIENTS
print("="*70)
print("SCANNING FOR PATIENTS...")
print("="*70)

# Find all patient directories
patient_dirs = glob.glob(os.path.join(BASE_PATH, 'BraTS20_Training_*'))
PATIENT_IDS = sorted([os.path.basename(p) for p in patient_dirs])

print(f"Found {len(PATIENT_IDS)} patient directories")
if len(PATIENT_IDS) > 0:
    print(f"First patient: {PATIENT_IDS[0]}")
    print(f"Last patient: {PATIENT_IDS[-1]}")
else:
    print("⚠️  WARNING: No patient directories found!")
    print(f"   Check if BASE_PATH is correct: {BASE_PATH}")

# Preprocessing parameters
TARGET_SHAPE = (128, 128, 128)
PATCH_SIZE = (64, 64, 64)
STRIDE = (32, 32, 32)
TRAIN_VAL_SPLIT = 0.2

# Normalization
PERCENTILE_LOWER = 0.5
PERCENTILE_UPPER = 99.5

# Tumor-focused sampling (BALANCED)
TUMOR_THRESHOLD = 0.0001
BACKGROUND_SAMPLE_RATIO = 0.20  # Keep 20% of background patches for better balance

# Processing options
SAVE_CHECKPOINT_EVERY = 50  # Save progress every 50 patients
MAX_PATIENTS = None  # Set to a number to limit (e.g., 100 for testing), None for all

print("\n" + "="*70)
print("PREPROCESSING CONFIGURATION (FULL DATASET)")
print("="*70)
print(f"Total patients found: {len(PATIENT_IDS)}")
if MAX_PATIENTS:
    print(f"Processing limit: {MAX_PATIENTS} patients (for testing)")
    PATIENT_IDS = PATIENT_IDS[:MAX_PATIENTS]
else:
    print(f"Processing: ALL {len(PATIENT_IDS)} patients")

print(f"\nExpected patches per patient: ~17-27")
print(f"Expected total patches: ~{len(PATIENT_IDS) * 20} (conservative estimate)")
print(f"Expected train patches: ~{int(len(PATIENT_IDS) * 20 * 0.8)}")
print(f"Expected val patches: ~{int(len(PATIENT_IDS) * 20 * 0.2)}")

print(f"\nPatch parameters:")
print(f"  Target shape: {TARGET_SHAPE}")
print(f"  Patch size: {PATCH_SIZE}")
print(f"  Stride: {STRIDE}")

print(f"\n🎯 Tumor sampling:")
print(f"  Tumor threshold: {TUMOR_THRESHOLD}")
print(f"  Background ratio: {BACKGROUND_SAMPLE_RATIO} (20%)")
print(f"  Expected tumor patches: ~85%")

print(f"\n💾 Storage:")
print(f"  Estimated size: ~{len(PATIENT_IDS) * 20 * 5 / 1000:.1f} GB")

print(f"\n⏱️  Processing time:")
print(f"  Estimated: {len(PATIENT_IDS) * 1.5 / 60:.1f} hours")
print(f"  Checkpoint every: {SAVE_CHECKPOINT_EVERY} patients")

print("="*70)

if len(PATIENT_IDS) == 0:
    raise ValueError("No patients found! Check your BASE_PATH.")

SCANNING FOR PATIENTS...
Found 369 patient directories
First patient: BraTS20_Training_001
Last patient: BraTS20_Training_369

PREPROCESSING CONFIGURATION (FULL DATASET)
Total patients found: 369
Processing: ALL 369 patients

Expected patches per patient: ~17-27
Expected total patches: ~7380 (conservative estimate)
Expected train patches: ~5904
Expected val patches: ~1476

Patch parameters:
  Target shape: (128, 128, 128)
  Patch size: (64, 64, 64)
  Stride: (32, 32, 32)

🎯 Tumor sampling:
  Tumor threshold: 0.0001
  Background ratio: 0.2 (20%)
  Expected tumor patches: ~85%

💾 Storage:
  Estimated size: ~36.9 GB

⏱️  Processing time:
  Estimated: 9.2 hours
  Checkpoint every: 50 patients


In [3]:
# Cell 3: Create Output Directory Structure

def create_directory_structure(base_dir):
    """Create organized folder structure."""
    directories = [
        os.path.join(base_dir, 'train', 'images'),
        os.path.join(base_dir, 'train', 'masks'),
        os.path.join(base_dir, 'val', 'images'),
        os.path.join(base_dir, 'val', 'masks'),
        os.path.join(base_dir, 'metadata'),
        os.path.join(base_dir, 'checkpoints')  # For saving progress
    ]
    
    for directory in directories:
        os.makedirs(directory, exist_ok=True)
    
    return directories

created_dirs = create_directory_structure(OUTPUT_DIR)

print("✓ Directory structure created:")
for directory in created_dirs:
    print(f"  • {directory}")

✓ Directory structure created:
  • ..\processed_data\train\images
  • ..\processed_data\train\masks
  • ..\processed_data\val\images
  • ..\processed_data\val\masks
  • ..\processed_data\metadata
  • ..\processed_data\checkpoints


In [4]:
# Cell 4: Load NIfTI Volumes Function

def load_nifti_volumes(patient_path, patient_id):
    """Load all MRI modalities and segmentation mask."""
    modalities = {
        'flair': os.path.join(patient_path, f"{patient_id}_flair.nii"),
        't1': os.path.join(patient_path, f"{patient_id}_t1.nii"),
        't1ce': os.path.join(patient_path, f"{patient_id}_t1ce.nii"),
        't2': os.path.join(patient_path, f"{patient_id}_t2.nii"),
        'seg': os.path.join(patient_path, f"{patient_id}_seg.nii")
    }
    
    # Also try .nii.gz extension
    volumes = {}
    for modality, file_path in modalities.items():
        if os.path.exists(file_path):
            volumes[modality] = nib.load(file_path).get_fdata()
        elif os.path.exists(file_path + '.gz'):
            volumes[modality] = nib.load(file_path + '.gz').get_fdata()
    
    return volumes

print("✓ Load function defined")

✓ Load function defined


In [5]:
# Cell 5: Preprocessing Functions

def convert_to_float32(volume):
    return volume.astype(np.float32)

def clip_intensity_percentiles(volume, lower=0.5, upper=99.5):
    if np.max(volume) == 0:
        return volume
    lower_bound = np.percentile(volume[volume > 0], lower)
    upper_bound = np.percentile(volume[volume > 0], upper)
    return np.clip(volume, lower_bound, upper_bound)

def zscore_normalize(volume):
    if np.max(volume) == 0:
        return volume
    mask = volume > 0
    mean = np.mean(volume[mask])
    std = np.std(volume[mask])
    if std > 0:
        volume[mask] = (volume[mask] - mean) / std
    return np.nan_to_num(volume, nan=0.0, posinf=0.0, neginf=0.0)

def center_crop_or_pad(volume, target_shape):
    current_shape = volume.shape
    padded_volume = np.zeros(target_shape, dtype=volume.dtype)
    
    slices = []
    for current, target in zip(current_shape, target_shape):
        if current < target:
            pad_before = (target - current) // 2
            slices.append((pad_before, pad_before + current))
        else:
            crop_before = (current - target) // 2
            slices.append((crop_before, crop_before + target))
    
    if all(c >= t for c, t in zip(current_shape, target_shape)):
        padded_volume = volume[
            slices[0][0]:slices[0][1],
            slices[1][0]:slices[1][1],
            slices[2][0]:slices[2][1]
        ]
    else:
        src_slices = []
        for current, target in zip(current_shape, target_shape):
            if current < target:
                src_slices.append(slice(0, current))
            else:
                crop = (current - target) // 2
                src_slices.append(slice(crop, crop + target))
        padded_volume[
            slices[0][0]:slices[0][1],
            slices[1][0]:slices[1][1],
            slices[2][0]:slices[2][1]
        ] = volume[src_slices[0], src_slices[1], src_slices[2]]
    
    return padded_volume

def preprocess_volume(volume, is_mask=False):
    volume = convert_to_float32(volume)
    if not is_mask:
        volume = clip_intensity_percentiles(volume, PERCENTILE_LOWER, PERCENTILE_UPPER)
        volume = zscore_normalize(volume)
    volume = center_crop_or_pad(volume, TARGET_SHAPE)
    return volume

print("✓ Preprocessing functions defined")

✓ Preprocessing functions defined


In [6]:
# Cell 6: Patch Extraction Functions

def extract_patches_3d(volume, patch_size, stride):
    d, h, w = volume.shape
    pd, ph, pw = patch_size
    sd, sh, sw = stride
    
    patches = []
    positions = []
    
    for z in range(0, d - pd + 1, sd):
        for y in range(0, h - ph + 1, sh):
            for x in range(0, w - pw + 1, sw):
                patch = volume[z:z+pd, y:y+ph, x:x+pw]
                patches.append(patch)
                positions.append((z, y, x))
    
    return patches, positions

def calculate_tumor_ratio(mask_patch):
    total_voxels = mask_patch.size
    tumor_voxels = np.count_nonzero(mask_patch)
    return tumor_voxels / total_voxels

def extract_patches_from_volumes(volumes, patch_size, stride, 
                                 tumor_threshold=0.0001, 
                                 background_ratio=0.20):
    modality_list = ['flair', 't1', 't1ce', 't2']
    stacked_volume = np.stack([volumes[mod] for mod in modality_list], axis=0)
    seg_volume = volumes['seg']
    
    _, positions = extract_patches_3d(volumes['flair'], patch_size, stride)
    
    tumor_patches = []
    background_patches = []
    
    for idx, (z, y, x) in enumerate(positions):
        pd, ph, pw = patch_size
        image_patch = stacked_volume[:, z:z+pd, y:y+ph, x:x+pw]
        mask_patch = seg_volume[z:z+pd, y:y+ph, x:x+pw]
        tumor_ratio = calculate_tumor_ratio(mask_patch)
        
        patch_dict = {
            'image': image_patch,
            'mask': mask_patch,
            'position': (z, y, x),
            'tumor_ratio': tumor_ratio
        }
        
        if tumor_ratio >= tumor_threshold:
            tumor_patches.append(patch_dict)
        else:
            background_patches.append(patch_dict)
    
    n_background_keep = int(len(background_patches) * background_ratio)
    if n_background_keep < len(background_patches):
        background_indices = np.random.choice(len(background_patches), 
                                             n_background_keep, 
                                             replace=False)
        background_patches = [background_patches[i] for i in background_indices]
    
    return tumor_patches, background_patches

print("✓ Patch extraction functions defined")

✓ Patch extraction functions defined


In [7]:
# Cell 7: Process ALL Patients with Progress Tracking

print("\n" + "="*70)
print("PROCESSING ALL PATIENTS")
print("="*70)
print(f"Total patients to process: {len(PATIENT_IDS)}")
print(f"Estimated time: {len(PATIENT_IDS) * 1.5 / 60:.1f} hours")
print(f"Start time: {datetime.now().strftime('%H:%M:%S')}")
print("\n💡 TIP: This will take several hours. Consider running overnight.")
print("="*70 + "\n")

all_tumor_patches = []
all_background_patches = []
processed_patients = []
failed_patients = []

start_time = time.time()

for patient_idx, patient_id in enumerate(PATIENT_IDS, 1):
    patient_start = time.time()
    
    # Progress header
    if patient_idx % 10 == 1 or patient_idx == len(PATIENT_IDS):
        elapsed = time.time() - start_time
        elapsed_hours = elapsed / 3600
        remaining = (elapsed / patient_idx) * (len(PATIENT_IDS) - patient_idx) if patient_idx > 0 else 0
        remaining_hours = remaining / 3600
        
        print(f"\n{'='*70}")
        print(f"Progress: {patient_idx}/{len(PATIENT_IDS)} ({100*patient_idx/len(PATIENT_IDS):.1f}%)")
        print(f"Elapsed: {elapsed_hours:.2f}h | Remaining: ~{remaining_hours:.2f}h")
        print(f"Current: {patient_id}")
        print(f"{'='*70}")
    
    patient_path = os.path.join(BASE_PATH, patient_id)
    
    if not os.path.exists(patient_path):
        failed_patients.append((patient_id, "Directory not found"))
        print(f"⚠️  [{patient_idx}/{len(PATIENT_IDS)}] {patient_id}: Directory not found")
        continue
    
    try:
        # Load volumes
        raw_volumes = load_nifti_volumes(patient_path, patient_id)
        
        # Check completeness
        required_modalities = ['flair', 't1', 't1ce', 't2', 'seg']
        if not all(mod in raw_volumes for mod in required_modalities):
            failed_patients.append((patient_id, "Missing modalities"))
            print(f"⚠️  [{patient_idx}/{len(PATIENT_IDS)}] {patient_id}: Missing modalities")
            continue
        
        # Preprocess
        preprocessed_volumes = {}
        for modality in ['flair', 't1', 't1ce', 't2']:
            preprocessed_volumes[modality] = preprocess_volume(raw_volumes[modality], is_mask=False)
        preprocessed_volumes['seg'] = preprocess_volume(raw_volumes['seg'], is_mask=True)
        
        # Extract patches
        tumor_patches, background_patches = extract_patches_from_volumes(
            preprocessed_volumes,
            PATCH_SIZE,
            STRIDE,
            tumor_threshold=TUMOR_THRESHOLD,
            background_ratio=BACKGROUND_SAMPLE_RATIO
        )
        
        # Add patient ID
        for patch in tumor_patches:
            patch['patient_id'] = patient_id
        for patch in background_patches:
            patch['patient_id'] = patient_id
        
        all_tumor_patches.extend(tumor_patches)
        all_background_patches.extend(background_patches)
        processed_patients.append(patient_id)
        
        patient_time = time.time() - patient_start
        total_patches = len(tumor_patches) + len(background_patches)
        print(f"✓ [{patient_idx}/{len(PATIENT_IDS)}] {patient_id}: {total_patches} patches ({patient_time:.1f}s)")
        
        # Save checkpoint
        if patient_idx % SAVE_CHECKPOINT_EVERY == 0:
            checkpoint_data = {
                'tumor_patches': all_tumor_patches,
                'background_patches': all_background_patches,
                'processed_patients': processed_patients,
                'failed_patients': failed_patients,
                'progress': patient_idx
            }
            checkpoint_path = os.path.join(OUTPUT_DIR, 'checkpoints', f'checkpoint_{patient_idx}.npy')
            np.save(checkpoint_path, checkpoint_data, allow_pickle=True)
            print(f"💾 Checkpoint saved: {patient_idx} patients processed")
        
    except Exception as e:
        failed_patients.append((patient_id, str(e)))
        print(f"❌ [{patient_idx}/{len(PATIENT_IDS)}] {patient_id}: Error - {str(e)[:50]}")
        continue

# Combine all patches
all_patches = all_tumor_patches + all_background_patches
np.random.shuffle(all_patches)

total_time = time.time() - start_time

print(f"\n{'='*70}")
print("PROCESSING COMPLETE!")
print(f"{'='*70}")
print(f"Total time: {total_time/3600:.2f} hours")
print(f"Successfully processed: {len(processed_patients)}/{len(PATIENT_IDS)} patients")
print(f"Failed: {len(failed_patients)} patients")
print(f"\nTotal patches: {len(all_patches)}")
print(f"Tumor patches: {len(all_tumor_patches)} ({100*len(all_tumor_patches)/len(all_patches):.1f}%)")
print(f"Background patches: {len(all_background_patches)} ({100*len(all_background_patches)/len(all_patches):.1f}%)")
print(f"Average patches per patient: {len(all_patches)//len(processed_patients)}")
print(f"{'='*70}")

if len(failed_patients) > 0:
    print(f"\n⚠️  Failed patients ({len(failed_patients)}):")
    for pid, reason in failed_patients[:10]:  # Show first 10
        print(f"  • {pid}: {reason}")
    if len(failed_patients) > 10:
        print(f"  ... and {len(failed_patients)-10} more")


PROCESSING ALL PATIENTS
Total patients to process: 369
Estimated time: 9.2 hours
Start time: 16:33:39

💡 TIP: This will take several hours. Consider running overnight.


Progress: 1/369 (0.3%)
Elapsed: 0.00h | Remaining: ~0.00h
Current: BraTS20_Training_001
✓ [1/369] BraTS20_Training_001: 27 patches (2.0s)
✓ [2/369] BraTS20_Training_002: 17 patches (2.1s)
✓ [3/369] BraTS20_Training_003: 15 patches (2.1s)
✓ [4/369] BraTS20_Training_004: 15 patches (2.1s)
✓ [5/369] BraTS20_Training_005: 11 patches (2.2s)
✓ [6/369] BraTS20_Training_006: 26 patches (2.2s)
✓ [7/369] BraTS20_Training_007: 15 patches (2.4s)
✓ [8/369] BraTS20_Training_008: 15 patches (2.5s)
✓ [9/369] BraTS20_Training_009: 24 patches (1.9s)
✓ [10/369] BraTS20_Training_010: 19 patches (2.1s)

Progress: 11/369 (3.0%)
Elapsed: 0.01h | Remaining: ~0.20h
Current: BraTS20_Training_011
✓ [11/369] BraTS20_Training_011: 18 patches (2.4s)
✓ [12/369] BraTS20_Training_012: 17 patches (2.3s)
✓ [13/369] BraTS20_Training_013: 24 patches (1.8

In [8]:
# Cell 8: Train-Validation Split

def stratified_split_patches(patches, val_ratio=0.2):
    tumor_patches = [p for p in patches if p['tumor_ratio'] >= TUMOR_THRESHOLD]
    background_patches = [p for p in patches if p['tumor_ratio'] < TUMOR_THRESHOLD]
    
    n_tumor_val = int(len(tumor_patches) * val_ratio)
    tumor_val = tumor_patches[:n_tumor_val]
    tumor_train = tumor_patches[n_tumor_val:]
    
    n_bg_val = int(len(background_patches) * val_ratio)
    bg_val = background_patches[:n_bg_val]
    bg_train = background_patches[n_bg_val:]
    
    train_patches = tumor_train + bg_train
    val_patches = tumor_val + bg_val
    
    np.random.shuffle(train_patches)
    np.random.shuffle(val_patches)
    
    return train_patches, val_patches

if len(all_patches) > 0:
    train_patches, val_patches = stratified_split_patches(all_patches, TRAIN_VAL_SPLIT)

    print("="*70)
    print("TRAIN-VALIDATION SPLIT")
    print("="*70)
    print(f"Training patches: {len(train_patches)} ({100*len(train_patches)/(len(train_patches)+len(val_patches)):.1f}%)")
    print(f"  Tumor: {sum(1 for p in train_patches if p['tumor_ratio'] >= TUMOR_THRESHOLD)}")
    print(f"  Background: {sum(1 for p in train_patches if p['tumor_ratio'] < TUMOR_THRESHOLD)}")

    print(f"\nValidation patches: {len(val_patches)} ({100*len(val_patches)/(len(train_patches)+len(val_patches)):.1f}%)")
    print(f"  Tumor: {sum(1 for p in val_patches if p['tumor_ratio'] >= TUMOR_THRESHOLD)}")
    print(f"  Background: {sum(1 for p in val_patches if p['tumor_ratio'] < TUMOR_THRESHOLD)}")
    print("="*70)

TRAIN-VALIDATION SPLIT
Training patches: 5789 (80.0%)
  Tumor: 5385
  Background: 404

Validation patches: 1446 (20.0%)
  Tumor: 1346
  Background: 100


In [9]:
# Cell 9: Save Preprocessed Patches (FIXED - Memory Optimized)

def save_patches(patches, split_name, output_dir):
    """
    Save patches as .npy files with memory optimization.
    Includes error handling and memory cleanup.
    """
    image_dir = os.path.join(output_dir, split_name, 'images')
    mask_dir = os.path.join(output_dir, split_name, 'masks')
    
    print(f"\nSaving {split_name} patches...")
    
    saved_count = 0
    failed_count = 0
    
    for idx, patch in enumerate(tqdm(patches, desc=f"Saving {split_name}")):
        try:
            # Save image
            image_path = os.path.join(image_dir, f"patch_{idx:05d}.npy")
            image_data = patch['image'].copy()  # Make explicit copy
            np.save(image_path, image_data, allow_pickle=False)
            del image_data  # Free memory immediately
            
            # Save mask
            mask_path = os.path.join(mask_dir, f"patch_{idx:05d}.npy")
            mask_data = patch['mask'].copy()  # Make explicit copy
            np.save(mask_path, mask_data, allow_pickle=False)
            del mask_data  # Free memory immediately
            
            saved_count += 1
            
            # Periodic garbage collection for large datasets
            if (idx + 1) % 100 == 0:
                import gc
                gc.collect()
            
        except Exception as e:
            print(f"\n⚠️  Error saving patch {idx}: {str(e)}")
            failed_count += 1
            
            # Try alternative save method (compressed)
            try:
                print(f"   Attempting compressed save...")
                np.savez_compressed(image_path.replace('.npy', '.npz'), 
                                   data=patch['image'])
                np.savez_compressed(mask_path.replace('.npy', '.npz'), 
                                   data=patch['mask'])
                saved_count += 1
                failed_count -= 1
                print(f"   ✓ Compressed save successful")
            except Exception as e2:
                print(f"   ❌ Compressed save also failed: {str(e2)}")
    
    print(f"\n✓ Saved {saved_count}/{len(patches)} {split_name} patches")
    if failed_count > 0:
        print(f"⚠️  Failed to save {failed_count} patches")
    print(f"  Images: {image_dir}")
    print(f"  Masks: {mask_dir}")

if len(all_patches) > 0:
    # Save train patches
    save_patches(train_patches, 'train', OUTPUT_DIR)
    
    # Clean up memory between saves
    import gc
    gc.collect()
    
    # Save validation patches
    save_patches(val_patches, 'val', OUTPUT_DIR)
    
    print("\n" + "="*70)
    print("✓ ALL PATCHES SAVED SUCCESSFULLY")
    print("="*70)
else:
    print("\n❌ No patches to save!")


Saving train patches...


Saving train: 100%|██████████| 5789/5789 [05:52<00:00, 16.41it/s]



✓ Saved 5789/5789 train patches
  Images: ..\processed_data\train\images
  Masks: ..\processed_data\train\masks

Saving val patches...


Saving val: 100%|██████████| 1446/1446 [01:22<00:00, 17.45it/s]


✓ Saved 1446/1446 val patches
  Images: ..\processed_data\val\images
  Masks: ..\processed_data\val\masks

✓ ALL PATCHES SAVED SUCCESSFULLY





In [10]:
# Cell 10: Save Metadata

if len(all_patches) > 0:
    metadata = {
        'num_patients_attempted': len(PATIENT_IDS),
        'num_patients_processed': len(processed_patients),
        'num_patients_failed': len(failed_patients),
        'patient_ids': processed_patients,
        'failed_patients': failed_patients,
        'target_shape': TARGET_SHAPE,
        'patch_size': PATCH_SIZE,
        'stride': STRIDE,
        'train_val_split': TRAIN_VAL_SPLIT,
        'tumor_threshold': TUMOR_THRESHOLD,
        'background_sample_ratio': BACKGROUND_SAMPLE_RATIO,
        'n_train_patches': len(train_patches),
        'n_val_patches': len(val_patches),
        'n_tumor_patches': len(all_tumor_patches),
        'n_background_patches': len(all_background_patches),
        'tumor_patch_percentage': 100 * len(all_tumor_patches) / len(all_patches),
        'processing_time_hours': (time.time() - start_time) / 3600,
        'modalities': ['flair', 't1', 't1ce', 't2'],
        'normalization': 'z-score',
        'dataset': 'BraTS2020_Full'
    }

    metadata_path = os.path.join(OUTPUT_DIR, 'metadata', 'preprocessing_config.npy')
    np.save(metadata_path, metadata)
    print(f"✓ Metadata saved: {metadata_path}")

✓ Metadata saved: ..\processed_data\metadata\preprocessing_config.npy


In [11]:
# Cell 11: Final Summary Report

print("\n" + "="*70)
print(" "*15 + "📊 WEEK-2 FINAL SUMMARY (FULL DATASET)")
print("="*70)

if len(all_patches) > 0:
    print(f"\n{'DATASET STATISTICS'}")
    print("-"*70)
    print(f"  Patients attempted: {len(PATIENT_IDS)}")
    print(f"  Successfully processed: {len(processed_patients)} ({100*len(processed_patients)/len(PATIENT_IDS):.1f}%)")
    print(f"  Failed: {len(failed_patients)} ({100*len(failed_patients)/len(PATIENT_IDS):.1f}%)")
    print(f"  Average patches/patient: {len(all_patches)//len(processed_patients)}")

    print(f"\n{'PATCH STATISTICS'}")
    print("-"*70)
    print(f"  Total patches: {len(all_patches):,}")
    print(f"  Training: {len(train_patches):,} ({100*len(train_patches)/(len(train_patches)+len(val_patches)):.1f}%)")
    print(f"  Validation: {len(val_patches):,} ({100*len(val_patches)/(len(train_patches)+len(val_patches)):.1f}%)")
    print(f"  Tumor patches: {len(all_tumor_patches):,} ({100*len(all_tumor_patches)/len(all_patches):.1f}%)")
    print(f"  Background patches: {len(all_background_patches):,} ({100*len(all_background_patches)/len(all_patches):.1f}%)")

    print(f"\n{'PROCESSING TIME'}")
    print("-"*70)
    total_hours = (time.time() - start_time) / 3600
    print(f"  Total time: {total_hours:.2f} hours")
    print(f"  Time per patient: {total_hours*60/len(processed_patients):.2f} minutes")

    print(f"\n{'STORAGE'}")
    print("-"*70)
    patch_size_bytes = np.prod(PATCH_SIZE) * 4 * 4
    mask_size_bytes = np.prod(PATCH_SIZE) * 4
    total_size_gb = (len(train_patches) + len(val_patches)) * (patch_size_bytes + mask_size_bytes) / (1024**3)
    print(f"  Estimated size: {total_size_gb:.2f} GB")

    print(f"\n{'DATA QUALITY ASSESSMENT'}")
    print("-"*70)
    if len(all_patches) >= 5000:
        print(f"  ✅ EXCELLENT: {len(all_patches):,} patches")
        print(f"     This is professional-grade training data!")
    elif len(all_patches) >= 3000:
        print(f"  ✅ VERY GOOD: {len(all_patches):,} patches")
        print(f"     Sufficient for high-quality training.")
    elif len(all_patches) >= 1000:
        print(f"  ✅ GOOD: {len(all_patches):,} patches")
        print(f"     Acceptable for training.")
    else:
        print(f"  ⚠️  BORDERLINE: {len(all_patches):,} patches")

    print("\n" + "="*70)
    print("✅ WEEK-2 PREPROCESSING COMPLETE (FULL DATASET)!")
    print(f"🎯 Generated {len(all_patches):,} patches from {len(processed_patients)} patients")
    print(f"💾 Data saved to: {OUTPUT_DIR}")
    print("🚀 Ready for Week-4: Training with Maximum Data!")
    print("="*70)
    print(f"\nEnd time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")


               📊 WEEK-2 FINAL SUMMARY (FULL DATASET)

DATASET STATISTICS
----------------------------------------------------------------------
  Patients attempted: 369
  Successfully processed: 368 (99.7%)
  Failed: 3 (0.8%)
  Average patches/patient: 19

PATCH STATISTICS
----------------------------------------------------------------------
  Total patches: 7,235
  Training: 5,789 (80.0%)
  Validation: 1,446 (20.0%)
  Tumor patches: 6,731 (93.0%)
  Background patches: 504 (7.0%)

PROCESSING TIME
----------------------------------------------------------------------
  Total time: 0.47 hours
  Time per patient: 0.08 minutes

STORAGE
----------------------------------------------------------------------
  Estimated size: 35.33 GB

DATA QUALITY ASSESSMENT
----------------------------------------------------------------------
  ✅ EXCELLENT: 7,235 patches
     This is professional-grade training data!

✅ WEEK-2 PREPROCESSING COMPLETE (FULL DATASET)!
🎯 Generated 7,235 patches from 368 pat