# Preprocessing Pipeline: Resample and Crop

This notebook runs the preprocessing pipeline for every subject. It reads the configuration from `configs/main_config.yaml`, imports helper functions from `src/preprocessing.py`, and then resamples each subject's images to a common voxel spacing and crops them to the lung region.

## 1. Setup and Imports

Load the required libraries, the configuration file, and the custom preprocessing functions.

In [2]:
import yaml
from pathlib import Path
from tqdm import tqdm
import sys

import SimpleITK as sitk

# Add the project root to the Python path so that `src.preprocessing` can be imported
project_root = Path.cwd().resolve().parents[1]
sys.path.append(str(project_root))

from src.preprocessing import (
    calculate_resample_template,
    resample_image,
    remove_small_objects,
    get_bounding_box,
    crop_to_bounding_box,
)

# Load main configuration
config_path = project_root / 'configs' / 'main_config.yaml'
with open(config_path, 'r') as f:
    config = yaml.safe_load(f)

print("Configuration loaded successfully:")
print(yaml.dump(config, indent=2))

Configuration loaded successfully:
dataset:
  modalities:
  - pet
  - ct
paths:
  data_root: data/0_nifti
  output_dir: results/
  participants_tsv: metadata/participants.tsv
preprocessing:
  crop_method: lung_ROI
  min_lung_size: 100
  output_dir: data/1_preprocessed
  target_spacing:
  - 1.5
  - 1.5
  - 3.0
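
Because every later step reads values from this configuration, it can help to check the expected keys before the long-running loop starts. The following is a minimal sketch, not part of `src/preprocessing.py`: the helper name `validate_preprocessing_config` and the specific rules it enforces are assumptions based only on the keys printed above.

```python
def validate_preprocessing_config(config: dict) -> list:
    """Return a list of problems found; an empty list means the config looks usable.

    Hypothetical helper: the rules below are assumptions, not project requirements.
    """
    problems = []
    prep = config.get("preprocessing", {})

    # target_spacing is expected to hold three positive numbers
    spacing = prep.get("target_spacing")
    if not isinstance(spacing, (list, tuple)) or len(spacing) != 3:
        problems.append("preprocessing.target_spacing must list three values")
    elif not all(isinstance(s, (int, float)) and s > 0 for s in spacing):
        problems.append("preprocessing.target_spacing values must be positive numbers")

    # min_lung_size is used as a size threshold, so it should be a non-negative integer
    min_size = prep.get("min_lung_size")
    if not isinstance(min_size, int) or min_size < 0:
        problems.append("preprocessing.min_lung_size must be a non-negative integer")

    # Paths that the notebook reads later
    if "data_root" not in config.get("paths", {}):
        problems.append("paths.data_root is missing")
    if "output_dir" not in prep:
        problems.append("preprocessing.output_dir is missing")
    return problems


example_config = {
    "paths": {"data_root": "data/0_nifti"},
    "preprocessing": {
        "min_lung_size": 100,
        "output_dir": "data/1_preprocessed",
        "target_spacing": [1.5, 1.5, 3.0],
    },
}
print(validate_preprocessing_config(example_config))
# → []
```

In the notebook, the same function could be called on the loaded `config` right after `yaml.safe_load`, raising an error if the returned list is not empty.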



## 2. Define Paths and Identify Subjects

First define `preprocess_subject`, which resamples and crops a single subject. Then use the loaded configuration to set up the input and output directories and to list all subjects to process.

In [3]:
def preprocess_subject(subject_id: str, source_dir: Path, output_dir: Path, config: dict):
    """
    Main function to preprocess a single subject: resample and crop.

    :param subject_id: The ID of the subject (e.g., 'sub-AKHABDALLAADELAHMED20091023').
    :param source_dir: The root directory of the raw data.
    :param output_dir: The root directory to save preprocessed data.
    :param config: A dictionary containing preprocessing parameters.
    """
    subject_source_dir = source_dir / subject_id
    
    # 1. Define and check all file paths
    path_map = {
        "ct": subject_source_dir / f"{subject_id}_ct.nii.gz",
        "pet": subject_source_dir / f"{subject_id}_pet.nii.gz",
        "seg_lung": subject_source_dir / f"{subject_id}_seg-lung.nii.gz",
        "seg_lesion": subject_source_dir / f"{subject_id}_seg-lesion.nii.gz",
        "seg_lesion_refined": subject_source_dir / f"{subject_id}_seg-lesion-refined.nii.gz",
    }

    # Check for core files required for the entire process
    if not all(p.exists() for p in [path_map["ct"], path_map["pet"], path_map["seg_lung"]]):
        print(f"Warning: Skipping {subject_id} due to missing CT, PET, or lung segmentation.")
        return

    # 2. Read all existing images
    images = {key: sitk.ReadImage(str(path)) for key, path in path_map.items() if path.exists()}
    
    # 3. Resample
    target_spacing = config['preprocessing']['target_spacing']
    template_image = calculate_resample_template(images["ct"], images["pet"], target_spacing)
    
    resampled_images = {}
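    for key, img in images.items():
        # Nearest-neighbour keeps segmentation labels discrete; linear suits intensity images
        interpolator = sitk.sitkNearestNeighbor if "seg" in key else sitk.sitkLinear
        # Pad outside the original field of view: 0 for PET and masks, -1024 (about air in HU) for CT
        default_value = 0 if "pet" in key or "seg" in key else -1024
        resampled_images[key] = resample_image(img, template_image, default_value, interpolator)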

    # 4. Crop
    min_lung_size = config['preprocessing']['min_lung_size']
    cleaned_lung_mask = remove_small_objects(resampled_images["seg_lung"], min_size=min_lung_size)
    
    try:
        bounding_box = get_bounding_box(cleaned_lung_mask)
    except ValueError as e:
        print(f"Warning: Could not find bounding box for {subject_id}. {e}")
        return

    cropped_images = {key: crop_to_bounding_box(img, bounding_box) for key, img in resampled_images.items()}

    # 5. Save the final cropped images
    subject_output_dir = output_dir / subject_id
    subject_output_dir.mkdir(exist_ok=True, parents=True)
    
    for key, img in cropped_images.items():
        output_filename = path_map[key].name
        sitk.WriteImage(img, str(subject_output_dir / output_filename))
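
The resampling step changes the voxel grid so that every image shares `target_spacing`. The implementation of `calculate_resample_template` is not shown in this notebook, so the sketch below only illustrates the usual arithmetic behind such a template: the number of voxels per axis scales with the ratio of old to new spacing so that the physical extent is preserved. The function name `resampled_size` and the input grid are hypothetical.

```python
def resampled_size(size, spacing, target_spacing):
    """Voxels per axis needed to cover the same physical extent at a new spacing.

    Illustrative only: the project's calculate_resample_template may use a
    different rule (for example, one based on both the CT and the PET extent).
    """
    return tuple(
        max(1, round(n * old / new))
        for n, old, new in zip(size, spacing, target_spacing)
    )


# Hypothetical input grid: 512 x 512 x 100 voxels with spacing 0.75 x 0.75 x 3.0
print(resampled_size((512, 512, 100), (0.75, 0.75, 3.0), (1.5, 1.5, 3.0)))
# → (256, 256, 100)
```

Doubling the in-plane spacing halves the in-plane voxel count, while the axis whose spacing already matches the target is left unchanged.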

In [4]:
# Define paths from config
source_data_dir = project_root / config['paths']['data_root']
preprocessed_data_dir = project_root / config['preprocessing']['output_dir']

# Create the output directory
preprocessed_data_dir.mkdir(exist_ok=True, parents=True)

# Get a list of all subject IDs from the source directory
all_subject_ids = [d.name for d in source_data_dir.iterdir() if d.is_dir() and d.name.startswith('sub-')]

print(f"Found {len(all_subject_ids)} subjects to process.")
print(f"Source directory: {source_data_dir}")
print(f"Output directory: {preprocessed_data_dir}")

Found 1030 subjects to process.
Source directory: /home/yaobo/Project/Lung-Cancer-Subtyping-Classification-V4.0/data/0_nifti
Output directory: /home/yaobo/Project/Lung-Cancer-Subtyping-Classification-V4.0/data/1_preprocessed
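
`preprocess_subject` skips any subject that lacks a CT, PET, or lung segmentation file, but it only reports this while the loop is running. A minimal sketch of checking the same three files up front is shown below. The helper `missing_core_files` and the demo subject IDs are hypothetical; the file naming pattern mirrors the `path_map` used in `preprocess_subject`.

```python
from pathlib import Path
import tempfile

# Mirrors the core-file check in preprocess_subject
REQUIRED_SUFFIXES = ("ct", "pet", "seg-lung")


def missing_core_files(source_dir: Path, subject_id: str) -> list:
    """Return the suffixes of required files that do not exist for this subject."""
    subject_dir = source_dir / subject_id
    return [
        suffix
        for suffix in REQUIRED_SUFFIXES
        if not (subject_dir / f"{subject_id}_{suffix}.nii.gz").exists()
    ]


# Demo on a throwaway directory with one complete and one incomplete subject
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    complete = root / "sub-001"
    complete.mkdir()
    for suffix in REQUIRED_SUFFIXES:
        (complete / f"sub-001_{suffix}.nii.gz").touch()

    partial = root / "sub-002"
    partial.mkdir()
    (partial / "sub-002_ct.nii.gz").touch()

    subject_ids = sorted(
        d.name for d in root.iterdir() if d.is_dir() and d.name.startswith("sub-")
    )
    report = {sid: missing_core_files(root, sid) for sid in subject_ids}

print(report)
# → {'sub-001': [], 'sub-002': ['pet', 'seg-lung']}
```

In the notebook, the same check could be run over `all_subject_ids` with `source_data_dir` to see how many subjects will be skipped before starting the loop.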


## 3. Run Preprocessing Loop

Iterate through all subjects and apply the `preprocess_subject` function.

In [None]:
with tqdm(all_subject_ids, desc="Preprocessing Subjects") as pbar:
    for subject_id in pbar:
        pbar.set_postfix_str(subject_id)
        try:
            preprocess_subject(
                subject_id=subject_id,
                source_dir=source_data_dir,
                output_dir=preprocessed_data_dir,
                config=config
            )
        except Exception as e:
            # Log the failure and continue with the remaining subjects
            print(f"ERROR processing {subject_id}: {e}")

print("\nPreprocessing pipeline finished.")

Preprocessing Subjects:  63%|██████▎   | 648/1030 [23:03<12:42,  2.00s/it, sub-AKHECKLJACEKALFRED20221031]                   
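
With more than a thousand subjects, error messages printed inside the loop are easy to lose among the progress-bar output. One option is to collect failures in a list and review them once the run is done. The sketch below assumes a hypothetical wrapper `run_all` and uses a stand-in `fake_process` function instead of `preprocess_subject`, so that it runs without any image data.

```python
def run_all(subject_ids, process_one):
    """Call process_one on each subject and return (subject_id, error) pairs for failures."""
    failures = []
    for subject_id in subject_ids:
        try:
            process_one(subject_id)
        except Exception as exc:
            # Record the failure and keep going with the remaining subjects
            failures.append((subject_id, f"{type(exc).__name__}: {exc}"))
    return failures


def fake_process(subject_id):
    """Stand-in for preprocess_subject that fails for one subject."""
    if subject_id == "sub-002":
        raise RuntimeError("simulated read failure")


failures = run_all(["sub-001", "sub-002", "sub-003"], fake_process)
print(failures)
# → [('sub-002', 'RuntimeError: simulated read failure')]
```

To use this with the real function, `process_one` could be built with `functools.partial(preprocess_subject, source_dir=source_data_dir, output_dir=preprocessed_data_dir, config=config)`. Note that subjects skipped by the early `return` statements in `preprocess_subject` raise no exception and would not appear in the list.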