<a href="https://colab.research.google.com/github/ImagingDataCommons/ai_medima_misc/blob/main/nnunet/notebooks/mwe.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **nnU-Net MWE**

Minimal Working Example for cloud-based analysis of data using the nnU-Net Thoracic Organs at Risk segmentation model.

## **Environment Setup**

This demo notebook is intended to be run using a GPU.

To access a free GPU on Colab, open `Edit > Notebook settings`.

From the dropdown menu under `Hardware accelerator`, select `GPU`. The next cell prints some basic information about the Colab instance (date, working directory, hostname, user, and Python version).

In [1]:
import os
import sys

import yaml

import time
import tqdm


# useful information
curr_dir = !pwd
curr_droid = !hostname
curr_pilot = !whoami

print(time.asctime(time.localtime()))

print("\nCurrent directory :", curr_dir[-1])
print("Hostname          :", curr_droid[-1])
print("Username          :", curr_pilot[-1])

print("Python version    :", sys.version.split('\n')[0])

Wed Jun 29 13:05:29 2022

Current directory : /content
Hostname          : 253f80cdea3c
Username          : root
Python version    : 3.7.13 (default, Apr 24 2022, 01:04:09) 
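
The cell above reports the instance details but does not query the GPU itself. A minimal sketch of such a check follows, assuming the `nvidia-smi` utility is on the `PATH` whenever a GPU runtime is attached. The helper name `gpu_available` is hypothetical and not part of the original notebook.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if `nvidia-smi` exists and lists at least one GPU."""
    # `nvidia-smi` is only on the PATH when an NVIDIA driver is installed
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        # `nvidia-smi -L` prints one line per detected GPU
        result = subprocess.run(["nvidia-smi", "-L"], capture_output=True,
                                text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU available     :", gpu_available())
```

If this prints `False`, the runtime type was most likely not switched to `GPU`.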


Authenticating with Google is necessary to run BigQuery queries.

Every operation in this notebook (BigQuery queries, fetching data from the IDC buckets) is completely free. The only requirement is a Google Cloud project: for the notebook to work as intended, specify the name of your project in the cell that follows the authentication cell.

In [23]:
from google.colab import auth
auth.authenticate_user()

In [3]:
from google.cloud import storage

project_name = "idc-sandbox-000"

Throughout this Colab notebook, we will use [Plastimatch](https://plastimatch.org), a reliable open source software package for image computation, for image pre-processing. We will run Plastimatch through the simple [PyPlastimatch](https://github.com/AIM-Harvard/pyplastimatch/tree/main/pyplastimatch) Python wrapper.

In [4]:
%%capture
!apt install plastimatch

In [5]:
# check plastimatch was correctly installed
!plastimatch --version

plastimatch version 1.7.0


We will use subversion to clone only a few subdirectories of a repository (this is still not simple to do using the git CLI).

In [6]:
%%capture
!apt install subversion

In [7]:
# check subversion was correctly installed
!svn --version | head -n 2

svn, version 1.9.7 (r1800392)
   compiled May 21 2022, 07:24:25 on x86_64-pc-linux-gnu

Copyright (C) 2017 The Apache Software Foundation.
This software consists of contributions made by many people;
see the NOTICE file for more information.
Subversion is open source software, see http://subversion.apache.org/

The following repository access (RA) modules are available:

* ra_svn : Module for accessing a repository using the svn network protocol.
  - with Cyrus SASL authentication
  - handles 'svn' scheme
* ra_local : Module for accessing a repository on local disk.
  - handles 'file' scheme
* ra_serf : Module for accessing a repository via WebDAV protocol using serf.
  - using serf 1.3.9 (compiled with 1.3.9)
  - handles 'http' scheme
  - handles 'https' scheme

The following authentication credential caches are available:

* Plaintext cache in /root/.subversion
* Gnome Keyring
* GPG-Agent
* KWallet (KDE)



Clone only the subfolders of `ImagingDataCommons/ai_medima_misc` we need to run this notebook.

In [8]:
!svn checkout https://github.com/ImagingDataCommons/ai_medima_misc/trunk/nnunet/src
!svn checkout https://github.com/ImagingDataCommons/ai_medima_misc/trunk/nnunet/data

A    src/README.md
A    src/utils
A    src/utils/eval.py
A    src/utils/gcs.py
A    src/utils/postprocessing.py
A    src/utils/preprocessing.py
A    src/utils/processing.py
Checked out revision 26.
A    data/README.md
A    data/dicomseg_base_metadata.json
Checked out revision 26.


Furthermore, to organise the DICOM data in a more common (and human-readable) layout after downloading them from the buckets, we will use [DICOMSort](https://github.com/pieper/dicomsort).

DICOMSort is an open source tool for custom sorting and renaming of DICOM files based on their DICOM tags. In our case, we will use DICOMSort to organise the DICOM data by `PatientID` and `Modality`, so that the final directory looks like the following:

```
data/raw/nsclc-radiomics/dicom/$PatientID
 └─── CT
       ├─── $SOPInstanceUID_slice0.dcm
       ├─── $SOPInstanceUID_slice1.dcm
       ├───  ...
       │
      RTSTRUCT 
       ├─── $SOPInstanceUID_RTSTRUCT.dcm
      SEG
       └─── $SOPInstanceUID_RTSEG.dcm

```
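
To make the layout above concrete, here is a minimal sketch of how a target path can be derived from the three DICOM tags involved. It only illustrates the directory structure and is not how DICOMSort is implemented. The helper `sorted_dicom_path`, the simplified file name (`<SOPInstanceUID>.dcm`), and the identifiers in the example are assumptions made for illustration.

```python
import os


def sorted_dicom_path(base_dir, patient_id, modality, sop_instance_uid):
    """Build the target path of one DICOM file in the sorted tree."""
    # <base_dir>/<PatientID>/<Modality>/<SOPInstanceUID>.dcm
    return os.path.join(base_dir, patient_id, modality, sop_instance_uid + ".dcm")


# made-up identifiers, for illustration only
example = sorted_dicom_path("data/raw/nsclc-radiomics/dicom",
                            "LUNG1-001", "CT", "1.2.3.4")
print(example)
```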

In [9]:
!mkdir -p src

!git clone https://github.com/pieper/dicomsort src/dicomsort
!git clone https://github.com/AIM-Harvard/pyplastimatch src/pyplastimatch

Cloning into 'src/dicomsort'...
remote: Enumerating objects: 126, done.
remote: Total 126 (delta 0), reused 0 (delta 0), pack-reused 126
Receiving objects: 100% (126/126), 37.03 KiB | 4.63 MiB/s, done.
Resolving deltas: 100% (63/63), done.
Cloning into 'src/pyplastimatch'...
remote: Enumerating objects: 333, done.
remote: Counting objects: 100% (76/76), done.
remote: Compressing objects: 100% (47/47), done.
remote: Total 333 (delta 26), reused 75 (delta 26), pack-reused 257
Receiving objects: 100% (333/333), 55.56 MiB | 25.55 MiB/s, done.
Resolving deltas: 100% (28/28), done.


Finally, we will use DCMQI for converting the resulting segmentation into standard DICOM SEG objects.

In [10]:
dcmqi_release_url = "https://github.com/QIICR/dcmqi/releases/download/v1.2.4/dcmqi-1.2.4-linux.tar.gz"
dcmqi_download_path = "/content/dcmqi-1.2.4-linux.tar.gz"
dcmqi_path = "/content/dcmqi-1.2.4-linux"

!wget -O $dcmqi_download_path $dcmqi_release_url

!tar -xvf $dcmqi_download_path

!mv $dcmqi_path/bin/* /bin

--2022-06-29 13:06:11--  https://github.com/QIICR/dcmqi/releases/download/v1.2.4/dcmqi-1.2.4-linux.tar.gz
Resolving github.com (github.com)... 192.30.255.112
Connecting to github.com (github.com)|192.30.255.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/50675718/04f07880-81ee-11eb-92ec-30c7426dae5d?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20220629%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20220629T130611Z&X-Amz-Expires=300&X-Amz-Signature=d20a9803bc0f3b736ea2231bddce1f9b7c668c0e1cd7b9ad35b2369911804669&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=50675718&response-content-disposition=attachment%3B%20filename%3Ddcmqi-1.2.4-linux.tar.gz&response-content-type=application%2Foctet-stream [following]
--2022-06-29 13:06:11--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/50675718/04f07880-81ee-11eb-92ec-30c7426
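
Once the dcmqi binaries have been moved to `/bin`, it can be useful to confirm they are reachable before running the pipeline. The sketch below is an optional check, not part of the original notebook. The helper name `find_tools` is hypothetical; `itkimage2segimage` is the dcmqi command-line tool that writes DICOM SEG objects.

```python
import shutil


def find_tools(names):
    """Map each executable name to its full path, or None if not on the PATH."""
    return {name: shutil.which(name) for name in names}


# `itkimage2segimage` is the dcmqi converter used to create DICOM SEG objects
for tool, path in find_tools(["itkimage2segimage"]).items():
    print(tool, "->", path if path else "NOT FOUND")
```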

---

In [11]:
%%capture
!pip install pydicom SimpleITK nnunet

In [12]:
import numpy as np
import pandas as pd
import SimpleITK as sitk

import src.pyplastimatch.pyplastimatch.pyplastimatch as pypla

from google.cloud import bigquery as bq

Provided everything was set up correctly, we can run the BigQuery query and get all the information we need to download the testing data from the IDC platform.

For this specific use case, we will work with the NSCLC-Radiomics collection (chest CT scans of lung cancer patients, with manual delineation of various organs at risk).

In [13]:
%%bigquery --project=$project_name cohort_df

SELECT
  PatientID,
  collection_id,
  StudyInstanceUID,
  SeriesInstanceUID,
  SOPInstanceUID,
  gcs_url
FROM
  `bigquery-public-data.idc_current.dicom_all`
WHERE
  Modality IN ("CT",
    "RTSTRUCT")
  AND Source_DOI = "10.7937/K9/TCIA.2015.PF0M9REI"
ORDER BY
  PatientID

In [15]:
# create the directory tree
!mkdir -p data models output

!mkdir -p data/raw 
!mkdir -p data/raw/tmp data/raw/nsclc-radiomics
!mkdir -p data/raw/nsclc-radiomics/dicom

!mkdir -p data/processed
!mkdir -p data/processed/nsclc-radiomics
!mkdir -p data/processed/nsclc-radiomics/nrrd
!mkdir -p data/processed/nsclc-radiomics/nii
!mkdir -p data/processed/nsclc-radiomics/dicomseg

!mkdir -p data/model_input/
!mkdir -p data/nnunet_output/

Download the segmentation model(s) from Zenodo. This can be either very fast (2 minutes or less) or very slow (up to 10 minutes), probably depending on the traffic on Zenodo's end and other factors.

If the download is taking a long time, consider interrupting the cell execution and running the cell again.

In [16]:
seg_model_url = "https://zenodo.org/record/4485926/files/Task055_SegTHOR.zip?download=1"
model_download_path = "/content/models/Task055_SegTHOR.zip"

!wget -O $model_download_path $seg_model_url

--2022-06-29 13:06:43--  https://zenodo.org/record/4485926/files/Task055_SegTHOR.zip?download=1
Resolving zenodo.org (zenodo.org)... 137.138.76.77
Connecting to zenodo.org (zenodo.org)|137.138.76.77|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5019434005 (4.7G) [application/octet-stream]
Saving to: ‘/content/models/Task055_SegTHOR.zip’


2022-06-29 13:16:47 (7.95 MB/s) - ‘/content/models/Task055_SegTHOR.zip’ saved [5019434005/5019434005]
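
Because the download may be interrupted and restarted, it can help to confirm that the archive on disk is complete before installing it. A minimal sketch, assuming the byte count reported on the `Length:` line of the log above is the reference size. The helper `download_complete` is hypothetical and not part of the original notebook.

```python
import os


def download_complete(path, expected_bytes):
    """Return True if `path` exists and has exactly `expected_bytes` bytes."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_bytes


# size taken from the `Length:` line of the wget log
expected_bytes = 5019434005
print(download_complete("/content/models/Task055_SegTHOR.zip", expected_bytes))
```

A size match does not prove the archive is intact, but a mismatch is a cheap signal that the cell should be run again.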



Initialize a few environment variables [...]

In [17]:
os.environ["RESULTS_FOLDER"] = "/content/data/nnunet_output/"
os.environ["WEIGHTS_FOLDER"] = "/content/data/nnunet_output/nnUNet"
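
For orientation, the sketch below shows where a pretrained model is expected to end up relative to `RESULTS_FOLDER`. It rests on assumptions about nnU-Net v1 that are not stated in this notebook: that models live under a `nnUNet` subfolder of `RESULTS_FOLDER`, and that the default trainer and plans identifiers are `nnUNetTrainerV2` and `nnUNetPlansv2.1` (the cascade configuration uses a different trainer name). The helper `expected_model_dir` is hypothetical.

```python
import os


def expected_model_dir(results_folder, model, task,
                       trainer="nnUNetTrainerV2", plans="nnUNetPlansv2.1"):
    """Build the folder where nnU-Net v1 is assumed to look for a trained model."""
    # assumed layout: <RESULTS_FOLDER>/nnUNet/<model>/<task>/<trainer>__<plans>
    return os.path.join(results_folder, "nnUNet", model, task,
                        trainer + "__" + plans)


model_dir = expected_model_dir("/content/data/nnunet_output/",
                               "3d_lowres", "Task055_SegTHOR")
print(model_dir)
```

Listing this folder after the installation step is a quick way to see whether the weights were unpacked where the inference step will look for them.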

In [18]:
%%capture
!nnUNet_install_pretrained_model_from_zip $model_download_path

## **Parsing Cohort Information from BigQuery Tables**

We can check the various fields of the table we populated by running the BigQuery query.

This table will store one entry for each DICOM file in the dataset (therefore, expect thousands of rows!)

In [19]:
pat_id_list = sorted(list(set(cohort_df["PatientID"].values)))

print("Total number of unique Patient IDs:", len(pat_id_list))

display(cohort_df.info())

display(cohort_df.head())

Total number of unique Patient IDs: 422
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51652 entries, 0 to 51651
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   PatientID          51652 non-null  object
 1   collection_id      51652 non-null  object
 2   StudyInstanceUID   51652 non-null  object
 3   SeriesInstanceUID  51652 non-null  object
 4   SOPInstanceUID     51652 non-null  object
 5   gcs_url            51652 non-null  object
dtypes: object(6)
memory usage: 2.4+ MB


None

|   | PatientID | collection_id | StudyInstanceUID | SeriesInstanceUID | SOPInstanceUID | gcs_url |
|---|-----------|---------------|------------------|-------------------|----------------|---------|
| 0 | LUNG1-001 | nsclc_radiomics | 1.3.6.1.4.1.32722.99.99.2393413539117143687725... | 1.3.6.1.4.1.32722.99.99.2279381215866080725084... | 1.3.6.1.4.1.32722.99.99.6468474582136099606367... | gs://idc-open-cr/5bcda93e-ef26-4a58-a7b4-47832... |
| 1 | LUNG1-001 | nsclc_radiomics | 1.3.6.1.4.1.32722.99.99.2393413539117143687725... | 1.3.6.1.4.1.32722.99.99.2989917765213423750108... | 1.3.6.1.4.1.32722.99.99.1047764232230739912736... | gs://idc-open-cr/2b028478-80a6-4cc4-95d8-36bd1... |
| 2 | LUNG1-001 | nsclc_radiomics | 1.3.6.1.4.1.32722.99.99.2393413539117143687725... | 1.3.6.1.4.1.32722.99.99.2989917765213423750108... | 1.3.6.1.4.1.32722.99.99.1064644568755722921755... | gs://idc-open-cr/fdbe15bb-a030-4a8d-b041-b4a73... |
| 3 | LUNG1-001 | nsclc_radiomics | 1.3.6.1.4.1.32722.99.99.2393413539117143687725... | 1.3.6.1.4.1.32722.99.99.2989917765213423750108... | 1.3.6.1.4.1.32722.99.99.2781236900059730216785... | gs://idc-open-cr/375fdfe3-a6c7-4e6d-bf14-20fec... |
| 4 | LUNG1-001 | nsclc_radiomics | 1.3.6.1.4.1.32722.99.99.2393413539117143687725... | 1.3.6.1.4.1.32722.99.99.2989917765213423750108... | 1.3.6.1.4.1.32722.99.99.1077431100943926205431... | gs://idc-open-cr/d17ff2f7-de4f-4084-b1bc-eed8a... |
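
Since the table holds one row per DICOM file, a natural first step is to count how many files belong to each series. The sketch below does this with plain dictionaries standing in for table rows, so that it runs without any external package. The rows and their identifiers are made up for illustration. With the actual DataFrame, `cohort_df.groupby("SeriesInstanceUID").size()` would give the equivalent counts.

```python
from collections import Counter

# hypothetical rows standing in for `cohort_df.to_dict("records")`
rows = [
    {"PatientID": "P1", "SeriesInstanceUID": "1.1", "SOPInstanceUID": "1.1.1"},
    {"PatientID": "P1", "SeriesInstanceUID": "1.1", "SOPInstanceUID": "1.1.2"},
    {"PatientID": "P1", "SeriesInstanceUID": "1.2", "SOPInstanceUID": "1.2.1"},
]

# one entry per series, counting the files (rows) that belong to it
files_per_series = Counter(row["SeriesInstanceUID"] for row in rows)

for series_uid, n_files in files_per_series.items():
    print(series_uid, ":", n_files, "file(s)")
```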


---

## **Set Run Parameters**

From this cell, we can configure the nnU-Net inference step - specifying, for instance, the type of model we want to run (among the four different models the framework provides), whether we want to use test time augmentation, or whether we want to export the soft probability maps of the segmentation masks.


In [24]:
# FIXED PARAMETERS
data_base_path = "/content/data"
raw_base_path = "/content/data/raw/tmp"
sorted_base_path = "/content/data/raw/nsclc-radiomics/dicom"

processed_base_path = "/content/data/processed/nsclc-radiomics/"
processed_nrrd_path = os.path.join(processed_base_path, "nrrd")
processed_nifti_path = os.path.join(processed_base_path, "nii")

processed_dicomseg_path = os.path.join(processed_base_path, "dicomseg")
processed_dicompm_path = os.path.join(processed_base_path, "dicompm")

model_input_folder = "/content/data/model_input/"
model_output_folder = "/content/data/nnunet_output/"

dicomseg_json_path = "/content/data/dicomseg_base_metadata.json"

# -----------------
# nnU-Net pipeline parameters

# choose from: "2d", "3d_lowres", "3d_fullres", "3d_cascade_fullres"
nnunet_model = "3d_lowres"
use_tta = False
export_prob_maps = True
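
As a rough guide to what these parameters control, the sketch below assembles an `nnUNet_predict` command line from them. It is an assumption about how the flags map onto the nnU-Net v1 command-line interface (`--disable_tta` to switch off test time augmentation, `--save_npz` to export the softmax probabilities), and the repository's `process_patient_nnunet` function may build the call differently. The helper `build_predict_command` is hypothetical, and the task name is taken from the model archive file name.

```python
def build_predict_command(input_folder, output_folder, task, model,
                          use_tta=True, export_prob_maps=False):
    """Assemble an `nnUNet_predict` argument list from the run parameters."""
    cmd = ["nnUNet_predict",
           "-i", input_folder,    # folder with the *_0000.nii.gz inputs
           "-o", output_folder,   # folder where predictions are written
           "-t", task,            # task name or ID
           "-m", model]           # 2d, 3d_lowres, 3d_fullres, 3d_cascade_fullres
    if not use_tta:
        cmd.append("--disable_tta")
    if export_prob_maps:
        cmd.append("--save_npz")
    return cmd


cmd = build_predict_command("/content/data/model_input/",
                            "/content/data/nnunet_output/",
                            "Task055_SegTHOR", "3d_lowres",
                            use_tta=False, export_prob_maps=True)
print(" ".join(cmd))
```

Returning a list rather than a single string keeps the arguments safe to pass to `subprocess.run` without shell quoting.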


## **Running the Analysis for a Single Patient**

In [21]:
import src.utils.gcs as gcs
import src.utils.preprocessing as preprocessing
import src.utils.processing as processing
import src.utils.postprocessing as postprocessing

The following cell runs the entire processing pipeline, from pre-processing to post-processing.

For the sake of simplicity, all the extra code is organised in fully documented scripts, which can be found at [this GitHub repository](https://github.com/ImagingDataCommons/ai_medima_misc/tree/main/nnunet/src).
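
The pipeline relies on the nnU-Net file naming convention: each input volume is named after the case followed by a four-digit channel index (`_0000` for a single-channel CT), and the prediction is named after the case alone. A minimal sketch of that convention follows. The helper names are hypothetical and not part of the repository.

```python
def nnunet_input_name(case_id, channel=0):
    """File name nnU-Net expects for one input channel of a case."""
    return "%s_%04d.nii.gz" % (case_id, channel)


def nnunet_output_name(case_id):
    """File name of the predicted segmentation for a case."""
    return case_id + ".nii.gz"


print(nnunet_input_name("LUNG1-004"))
print(nnunet_output_name("LUNG1-004"))
```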

In [22]:
# sample patient - feel free to choose randomly!
pat_id = "LUNG1-004"

# -----------------
# init

print("Processing patient: %s"%(pat_id))

patient_df = cohort_df[cohort_df["PatientID"] == pat_id]

dicomseg_fn = pat_id + "_SEG.dcm"

input_nifti_fn = pat_id + "_0000.nii.gz"
input_nifti_path = os.path.join(model_input_folder, input_nifti_fn)

pred_nifti_fn = pat_id + ".nii.gz"
pred_nifti_path = os.path.join(model_output_folder, pred_nifti_fn)

pred_softmax_folder_name = "pred_softmax"
pred_softmax_folder_path = os.path.join(processed_nrrd_path, pat_id, pred_softmax_folder_name)

# -----------------
# cross-load the CT data from the IDC buckets, run the preprocessing

# data cross-loading
gcs.download_patient_data(raw_base_path = raw_base_path,
                          sorted_base_path = sorted_base_path,
                          patient_df = patient_df,
                          remove_raw = True)


# DICOM CT to NRRD - good to have for a number of reasons
preprocessing.pypla_dicom_ct_to_nrrd(sorted_base_path = sorted_base_path,
                                     processed_nrrd_path = processed_nrrd_path,
                                     pat_id = pat_id, verbose = True)

# -----------------
# DL-inference

# DICOM CT to NIfTI - required for the processing
preprocessing.pypla_dicom_ct_to_nifti(sorted_base_path = sorted_base_path,
                                      processed_nifti_path = processed_nifti_path,
                                      pat_id = pat_id, verbose = True)

# prepare the `model_input` folder for the inference phase
preprocessing.prep_input_data(processed_nifti_path = processed_nifti_path,
                              model_input_folder = model_input_folder,
                              pat_id = pat_id)

# run the DL-based prediction
processing.process_patient_nnunet(model_input_folder = model_input_folder,
                                  model_output_folder = model_output_folder, 
                                  nnunet_model = nnunet_model, use_tta = use_tta,
                                  export_prob_maps = export_prob_maps)

# convert the softmax predictions to NRRD files
postprocessing.numpy_to_nrrd(model_output_folder = model_output_folder,
                             processed_nrrd_path = processed_nrrd_path,
                             pat_id = pat_id,
                             output_folder_name = pred_softmax_folder_name)

# remove the NIfTI file the prediction was computed from
!rm $input_nifti_path
  
# -----------------
# post-processing
postprocessing.pypla_postprocess(processed_nrrd_path = processed_nrrd_path,
                                 model_output_folder = model_output_folder,
                                 pat_id = pat_id)

postprocessing.nrrd_to_dicomseg(sorted_base_path = sorted_base_path,
                                processed_base_path = processed_base_path,
                                dicomseg_json_path = dicomseg_json_path,
                                pat_id = pat_id)


Processing patient: LUNG1-004
Copying files from IDC buckets to /content/data/raw/tmp/LUNG1-004...
Done in 11.508 seconds.

Sorting DICOM files...
Done in 1.09926 seconds.
Sorted DICOM data saved at: /content/data/raw/nsclc-radiomics/dicom/LUNG1-004
Removing un-sorted data at /content/data/raw/tmp/LUNG1-004...
... Done.

Running 'plastimatch convert' with the specified arguments:
  --input /content/data/raw/nsclc-radiomics/dicom/LUNG1-004/CT
  --output-img /content/data/processed/nsclc-radiomics/nrrd/LUNG1-004/LUNG1-004_CT.nrrd
... Done.

Running 'plastimatch convert' with the specified arguments:
  --input /content/data/raw/nsclc-radiomics/dicom/LUNG1-004/CT
  --output-img /content/data/processed/nsclc-radiomics/nii/LUNG1-004/LUNG1-004_CT.nii.gz
... Done.
Copying /content/data/processed/nsclc-radiomics/nii/LUNG1-004/LUNG1-004_CT.nii.gz
to /content/data/model_input/LUNG1-004_0000.nii.gz...
... Done.
Running `nnUNet_predict` with `3d_lowres` model...
Processing file at /content/data/mod