# Part 1 of 3: Data Exploration

In this workshop, we will begin by introducing the Google Colab environment, which provides an interactive platform for running Python code in the cloud. We'll start by exploring the dataset that we will use for 3D medical image segmentation. This dataset includes medical imaging volumes, and we will examine both the input modalities and the labels used as ground truth for segmentation tasks. To understand the structure and properties of these 3D volumes, we will use SimpleITK, a powerful toolkit for image analysis, to inspect metadata, dimensions, spacing, and other relevant characteristics of the images in the dataset. This setup will help participants build a foundational understanding of the data before moving on to model training and evaluation.

## Downloading the Dataset

We will start by downloading a small subset of the Pulpy3D dataset. The complete dataset is available at either of these links:

- [Drive](https://drive.google.com/drive/folders/1M5iU1urLOp1rSxKOm7WCzodAKcZrqT5O?usp=sharing)
- [Ditto](https://ditto.ing.unimore.it/pulpy3d/)

In [1]:
# Links to the scan volumes and IAN labels for the first 5 patients

DATA_LINKS = [
	{"patient": "P1", "data": "https://drive.google.com/uc?id=1R9frnx6GKYnkjzs_0wsxBburzzFAIgUA", "label": "https://drive.google.com/uc?id=1y1CDWys-pGbzyrSjtyppKCcMPKdTR93H"},
	{"patient": "P2", "data": "https://drive.google.com/uc?id=1aHuVrhyrBNbi3U49AnhcFQtoT3WJ7MSS", "label": "https://drive.google.com/uc?id=1oiwsOtw1ztIvxif4ZQsUkQNylB1UGx89"},
	{"patient": "P3", "data": "https://drive.google.com/uc?id=1X73Ioh3LbRDnrFyWRGBkFZbA7u5vMLXk", "label": "https://drive.google.com/uc?id=1xr9Smu6CWJE7E5kfCyTPUw1tiDdi0CHC"},
	{"patient": "P4", "data": "https://drive.google.com/uc?id=1NF3x4ssLysrFz9H_b7wBk0_eBytevAo7", "label": "https://drive.google.com/uc?id=11HzgLk9Mb5_JEdXQKWodRYM0o7WzLzDs"},
	{"patient": "P5", "data": "https://drive.google.com/uc?id=1ppGD--trgKzfliVGeD8XJzxd4H0Y0kCN", "label": "https://drive.google.com/uc?id=1MbyFsi9bXlPQ1CyhySPfBZnbcmW63P_C"},
]

In [2]:
import os
import gdown

# Create the dataset directory
os.makedirs('data', exist_ok=True)

# Download each patient
for entry in DATA_LINKS:
  # Create the patient directory that will hold
  # the data and label files
  patient_path = os.path.join('data', entry['patient'])
  os.makedirs(patient_path, exist_ok=True)

  # Specify data and label output file name
  data_output = os.path.join(patient_path, 'data.nii.gz')
  label_output = os.path.join(patient_path, 'label.nii.gz')

  # Download the files
  gdown.download(entry['data'], data_output, quiet=False)
  gdown.download(entry['label'], label_output, quiet=False)

Downloading...
From (original): https://drive.google.com/uc?id=1R9frnx6GKYnkjzs_0wsxBburzzFAIgUA
From (redirected): https://drive.google.com/uc?id=1R9frnx6GKYnkjzs_0wsxBburzzFAIgUA&confirm=t&uuid=8dc6e5ed-f6a6-4a77-a029-2dcbdf4d3d03
To: /content/data/P1/data.nii.gz
100%|██████████| 55.8M/55.8M [00:00<00:00, 66.5MB/s]
Downloading...
From: https://drive.google.com/uc?id=1y1CDWys-pGbzyrSjtyppKCcMPKdTR93H
To: /content/data/P1/label.nii.gz
100%|██████████| 799k/799k [00:00<00:00, 10.2MB/s]
Downloading...
From (original): https://drive.google.com/uc?id=1aHuVrhyrBNbi3U49AnhcFQtoT3WJ7MSS
From (redirected): https://drive.google.com/uc?id=1aHuVrhyrBNbi3U49AnhcFQtoT3WJ7MSS&confirm=t&uuid=b2da0006-9307-4acb-aafb-bee3b90566de
To: /content/data/P2/data.nii.gz
100%|██████████| 49.4M/49.4M [00:01<00:00, 30.0MB/s]
Downloading...
From: https://drive.google.com/uc?id=1oiwsOtw1ztIvxif4ZQsUkQNylB1UGx89
To: /content/data/P2/label.nii.gz
100%|██████████| 723k/723k [00:00<00:00, 9.41MB/s]
Downloading...
From 
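
Before moving on, it can help to confirm that every expected file actually landed on disk, since a download can fail partway through. The sketch below is not part of the original notebook: `find_missing_files` is a hypothetical helper that assumes the `data/<patient>/data.nii.gz` and `label.nii.gz` layout created by the download cell, and it treats zero-byte files as failed downloads.

```python
import os

def find_missing_files(root, patients, filenames=("data.nii.gz", "label.nii.gz")):
    """Return the expected paths under `root` that are absent or empty."""
    missing = []
    for patient in patients:
        for name in filenames:
            path = os.path.join(root, patient, name)
            # A zero-byte file is treated as a failed download
            if not os.path.isfile(path) or os.path.getsize(path) == 0:
                missing.append(path)
    return missing

# Usage in the notebook (assumes the download cell has already run):
# find_missing_files("data", ["P1", "P2", "P3", "P4", "P5"])
```

An empty list means every patient folder holds both a non-empty scan and a non-empty label file.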

## Exploring the Dataset

We explore the downloaded dataset to verify its contents and get a sense of the problem we are trying to solve.

In [3]:
# Install dependencies
!pip install SimpleITK

Collecting SimpleITK
  Downloading SimpleITK-2.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.9 kB)
Downloading SimpleITK-2.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (52.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52.4/52.4 MB 12.6 MB/s eta 0:00:00
Installing collected packages: SimpleITK
Successfully installed SimpleITK-2.4.0


In [11]:
import SimpleITK as sitk
from ipywidgets import interact

# Take P1 for an example
data_path = "data/P1/data.nii.gz"
label_path = "data/P1/label.nii.gz"

# Read Nifti images into SimpleITK images
image = sitk.ReadImage(data_path)
label = sitk.ReadImage(label_path)

# Convert the SimpleITK images to NumPy arrays (indexed as z, y, x)
image_array = sitk.GetArrayFromImage(image)
label_array = sitk.GetArrayFromImage(label)
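
One detail worth keeping in mind after this conversion: SimpleITK reports sizes and indices in (x, y, z) order, while the array returned by `sitk.GetArrayFromImage` is indexed (z, y, x). The sketch below illustrates the reversal with plain tuples so that it runs without any image file. The helper name is an illustrative assumption, and the example size is the one reported for P1 in the Volume Properties output.

```python
def sitk_to_numpy_order(xyz):
    """Reverse an (x, y, z) size or index into NumPy's (z, y, x) order."""
    return tuple(reversed(xyz))

# Size of P1 as reported by image.GetSize(): 370 wide, 352 high, 170 deep
size_xyz = (370, 352, 170)
shape_zyx = sitk_to_numpy_order(size_xyz)
print(shape_zyx)  # (170, 352, 370)

# The same reversal applies to voxel indices:
# image.GetPixel(x, y, z) addresses the same voxel as image_array[z, y, x]
index_xyz = (10, 20, 30)
print(sitk_to_numpy_order(index_xyz))  # (30, 20, 10)
```

This is why axis 0 of `image_array` corresponds to depth rather than width.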

In [15]:
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from IPython.display import clear_output

# Define a function to display a slice
def show_slice(slice_index, axis):
    # Clear previous output to avoid multiple images in the notebook
    clear_output(wait=True)

    # Select the appropriate slice based on the axis
    if axis == 0: # Depth axis
        slice_data = image_array[slice_index, :, :]
        slice_label = label_array[slice_index, :, :]
    elif axis == 1: # Height axis
        slice_data = image_array[:, slice_index, :]
        slice_label = label_array[:, slice_index, :]
    else: # Width axis
        slice_data = image_array[:, :, slice_index]
        slice_label = label_array[:, :, slice_index]

    # Display the slice
    plt.imshow(slice_data, cmap="gray")
    plt.imshow(slice_label, cmap=ListedColormap(["none", "red"]))
    plt.title(f"Slice {slice_index} along axis {axis}")
    plt.axis("off")
    plt.show()
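
Many slices of a segmentation volume may contain no label at all, so it can be handy to know where the annotated structure lies before scrolling through the volume. The following sketch uses a small synthetic array in place of `label_array` and assumes that any non-zero voxel counts as foreground. `labelled_slices` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def labelled_slices(label_array, axis=0):
    """Return indices along `axis` whose slice contains at least one non-zero voxel."""
    other_axes = tuple(a for a in range(label_array.ndim) if a != axis)
    return np.flatnonzero(np.any(label_array != 0, axis=other_axes))

# Synthetic stand-in for label_array, indexed (z, y, x)
demo_label = np.zeros((6, 4, 5), dtype=np.uint8)
demo_label[2:4, 1, 1:3] = 1  # mark a small structure on slices z=2 and z=3

print(labelled_slices(demo_label, axis=0))  # [2 3]
print(labelled_slices(demo_label, axis=2))  # [1 2]

# Fraction of voxels that are foreground, a quick view of class imbalance
foreground_fraction = np.count_nonzero(demo_label) / demo_label.size
print(foreground_fraction)
```

With the real data, `labelled_slices(label_array, axis)` gives the slider positions at which the red overlay should be visible.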

In [16]:
from ipywidgets import interact

# Create an interactive widget for axis and slice selection
def interactive_slicing(axis=0):
    max_index = image_array.shape[axis] - 1
    interact(lambda slice_index: show_slice(slice_index, axis), slice_index=(0, max_index))

In [18]:
# Run the interactive widget
axis_interact = interact(interactive_slicing, axis=(0, 2))



## Volume Properties

Next, we look at the volume properties that matter most when training models on 3D data: spacing, direction, origin, and size.

In [19]:
import SimpleITK as sitk
from pprint import pprint

# Iterate over all patients
patients = ["P1", "P2", "P3", "P4", "P5"]
metadata = {}

for patient in patients:
  # Read image data
  image = sitk.ReadImage(f'data/{patient}/data.nii.gz')

  # Collect important properties
  metadata[patient] = {
    'spacing': image.GetSpacing(),
    'direction': image.GetDirection(),
    'origin': image.GetOrigin(),
    'size': image.GetSize()
  }

# Print to compare properties between volumes
pprint(metadata)

{'P1': {'direction': (-1.0, 0.0, 0.0, 0.0, -1.0, 0.0, 0.0, 0.0, 1.0),
        'origin': (0.0, 0.0, 0.0),
        'size': (370, 352, 170),
        'spacing': (1.0, 1.0, 1.0)},
 'P2': {'direction': (-1.0, 0.0, 0.0, 0.0, -1.0, 0.0, 0.0, 0.0, 1.0),
        'origin': (0.0, 0.0, 0.0),
        'size': (370, 319, 169),
        'spacing': (1.0, 1.0, 1.0)},
 'P3': {'direction': (-1.0, 0.0, 0.0, 0.0, -1.0, 0.0, 0.0, 0.0, 1.0),
        'origin': (0.0, 0.0, 0.0),
        'size': (370, 301, 168),
        'spacing': (1.0, 1.0, 1.0)},
 'P4': {'direction': (-1.0, 0.0, 0.0, 0.0, -1.0, 0.0, 0.0, 0.0, 1.0),
        'origin': (0.0, 0.0, 0.0),
        'size': (371, 356, 168),
        'spacing': (1.0, 1.0, 1.0)},
 'P5': {'direction': (-1.0, 0.0, 0.0, 0.0, -1.0, 0.0, 0.0, 0.0, 1.0),
        'origin': (0.0, 0.0, 0.0),
        'size': (371, 334, 168),
        'spacing': (1.0, 1.0, 1.0)}}
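
Reading the output above: all five volumes share the same spacing, direction, and origin, but their sizes differ, so some form of cropping, padding, or patch extraction is typically needed before volumes can be batched together. The sketch below hard-codes the sizes and spacing printed above and derives each volume's physical extent (size times spacing) along with the per-axis size range. The helper name is illustrative, and the spacing unit is assumed to be millimetres, which the output itself does not confirm.

```python
# Sizes (x, y, z) and spacing copied from the printed metadata
sizes = {
    "P1": (370, 352, 170),
    "P2": (370, 319, 169),
    "P3": (370, 301, 168),
    "P4": (371, 356, 168),
    "P5": (371, 334, 168),
}
spacing = (1.0, 1.0, 1.0)  # identical for all five patients

def physical_extent(size, spacing):
    """Extent along each axis: number of voxels times voxel spacing."""
    return tuple(n * s for n, s in zip(size, spacing))

extents = {p: physical_extent(s, spacing) for p, s in sizes.items()}
print(extents["P1"])  # (370.0, 352.0, 170.0)

# Smallest and largest size along each axis across the five volumes
min_size = tuple(min(axis_sizes) for axis_sizes in zip(*sizes.values()))
max_size = tuple(max(axis_sizes) for axis_sizes in zip(*sizes.values()))
print(min_size)  # (370, 301, 168)
print(max_size)  # (371, 356, 170)
```

Because the spacing is 1.0 along every axis, the physical extent here equals the voxel count. With anisotropic spacing the two would diverge.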
