<a href="https://colab.research.google.com/github/AIM-Harvard/aimi_alpha/blob/main/aimi/totalsegmentator/notebooks/totalseg_wholebody_mwe.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **ModelHub - Whole Body CT Segmentation**

This notebook provides an example of how to run an end-to-end (cloud-based) data analysis using the TotalSegmentator segmentation pipeline (whole body CT scan segmentation).

All operations, from pulling the data to postprocessing and standardising the results, are executed with the goal of promoting transparency and reproducibility.

## **Environment Setup**

This demo notebook is intended to be run using a GPU.

To access a free GPU on Colab, go to `Edit > Notebook settings`.

From the dropdown menu under `Hardware accelerator`, select `GPU`. The cell below prints some basic information about the Colab instance (date, working directory, hostname, username, and Python version).

In [None]:
import os
import sys
import shutil

import yaml

import time
import tqdm


# useful information
curr_dir = !pwd
curr_droid = !hostname
curr_pilot = !whoami

print(time.asctime(time.localtime()))

print("\nCurrent directory :", curr_dir[-1])
print("Hostname          :", curr_droid[-1])
print("Username          :", curr_pilot[-1])

print("Python version    :", sys.version.split('\n')[0])

Wed Oct 12 20:20:50 2022

Current directory : /content
Hostname          : 024133d2c4fb
Username          : root
Python version    : 3.7.14 (default, Sep  8 2022, 00:06:44) 
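
The information printed above does not include the GPU itself. A minimal sketch of an explicit check, assuming the NVIDIA driver exposes the `nvidia-smi` tool on the instance (the `gpu_summary` helper is hypothetical and not part of the notebook):

```python
import shutil
import subprocess

def gpu_summary():
    """Return the `nvidia-smi -L` listing, or None if no NVIDIA GPU is reachable."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    except OSError:
        return None
    return result.stdout.strip() if result.returncode == 0 else None

summary = gpu_summary()
print(summary if summary else "No GPU detected: check the runtime settings.")
```

If no GPU is reported, changing the hardware accelerator as described above and restarting the runtime should fix it.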


Authenticating with Google is required to run BigQuery queries.

Every operation in this notebook (running BigQuery queries, fetching data from the IDC buckets) is free of charge. The only prerequisite is a Google Cloud project: for the notebook to work as intended, specify the ID of your project in the cell that follows the authentication cell.

In [None]:
from google.colab import auth
auth.authenticate_user()

In [None]:
from google.colab import files
from google.cloud import storage
from google.cloud import bigquery as bq

# INSERT THE ID OF YOUR PROJECT HERE!
project_id = "idc-sandbox-000"

Throughout this Colab notebook, we will use [Plastimatch](https://plastimatch.org), a reliable open-source software package for image computation, for image pre-processing. We will run Plastimatch through the simple [PyPlastimatch](https://github.com/AIM-Harvard/pyplastimatch/tree/main/pyplastimatch) Python wrapper.

In [None]:
%%capture
!apt install plastimatch

In [None]:
# check plastimatch was correctly installed
!plastimatch --version

plastimatch version 1.7.0


---

Start by cloning the AIMI hub repository on the Colab instance.

The AIMI hub repository stores all the code we will use for pulling, preprocessing, processing, and postprocessing the data for this use case, as well as for the other use cases shared through AIMI hub.

In [None]:
!git clone https://github.com/AIM-Harvard/aimi_alpha.git aimi

Cloning into 'aimi'...
remote: Enumerating objects: 488, done.
remote: Counting objects: 100% (126/126), done.
remote: Compressing objects: 100% (94/94), done.
remote: Total 488 (delta 54), reused 66 (delta 26), pack-reused 362
Receiving objects: 100% (488/488), 5.21 MiB | 24.57 MiB/s, done.
Resolving deltas: 100% (248/248), done.


To organise the DICOM data in a more common (and human-readable) fashion after downloading it from the buckets, we will use [DICOMSort](https://github.com/pieper/dicomsort).

DICOMSort is an open-source tool for custom sorting and renaming of DICOM files based on their DICOM tags. In our case, we will use DICOMSort to organise the DICOM data by `PatientID` and `Modality`, so that the final directory looks like the following:

```
data/raw/$dataset_name/dicom/$PatientID
 ├─── CT
 │     ├─── $SOPInstanceUID_slice0.dcm
 │     ├─── $SOPInstanceUID_slice1.dcm
 │     └───  ...
 ├─── RTSTRUCT
 │     └─── $SOPInstanceUID_RTSTRUCT.dcm
 └─── SEG
       └─── $SOPInstanceUID_RTSEG.dcm

```
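
To make the layout concrete, here is a minimal sketch of how a destination path could be derived from the header tags of a single file. The `sorted_dicom_path` helper and the file naming are assumptions made for illustration: DICOMSort builds its paths from a user-supplied pattern, and in practice the tag values would be read from the DICOM header rather than from a plain dictionary.

```python
import os

def sorted_dicom_path(base_dir, tags):
    """Build the destination path of one DICOM file from its header tags.

    `tags` is a plain dict here (hypothetical stand-in for a parsed DICOM header).
    """
    file_name = "%s.dcm" % tags["SOPInstanceUID"]
    return os.path.join(base_dir, tags["PatientID"], tags["Modality"], file_name)

# the SOPInstanceUID below is a made-up placeholder
example = sorted_dicom_path(
    "data/raw/ct_lymph_nodes/dicom",
    {"PatientID": "ABD_LYMPH_008", "Modality": "CT", "SOPInstanceUID": "1.2.3"},
)
print(example)
```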

In [None]:
!git clone https://github.com/pieper/dicomsort dicomsort

Cloning into 'dicomsort'...
remote: Enumerating objects: 130, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 130 (delta 0), reused 1 (delta 0), pack-reused 126
Receiving objects: 100% (130/130), 44.12 KiB | 961.00 KiB/s, done.
Resolving deltas: 100% (63/63), done.


We will also use DCMQI for converting the resulting segmentation into standard DICOM SEG objects.

In [None]:
%%capture
dcmqi_release_url = "https://github.com/QIICR/dcmqi/releases/download/v1.2.4/dcmqi-1.2.4-linux.tar.gz"
dcmqi_download_path = "/content/dcmqi-1.2.4-linux.tar.gz"
dcmqi_path = "/content/dcmqi-1.2.4-linux"

!wget -O $dcmqi_download_path $dcmqi_release_url

!tar -xvf $dcmqi_download_path

!mv $dcmqi_path/bin/* /bin

---

In [None]:
%%capture
!pip install pyplastimatch nnunet ipywidgets
!pip install TotalSegmentator

In [None]:
import shutil
import random

import json
import pprint
import numpy as np
import pandas as pd

import pydicom
import nibabel as nib
import SimpleITK as sitk
import pyplastimatch as pypla

print("Python version               : ", sys.version.split('\n')[0])
print("Numpy version                : ", np.__version__)

# ----------------------------------------

#everything that has to do with plotting goes here below
import matplotlib
matplotlib.use("agg")

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from matplotlib.patches import Patch

%matplotlib inline
%config InlineBackend.figure_format = "png"

import ipywidgets as ipyw

## ----------------------------------------

# create new colormap appending the alpha channel to the selected one
# (so that we don't get a "color overlay" when plotting the segmask superimposed to the CT)
cmap = plt.cm.Reds
my_reds = cmap(np.arange(cmap.N))
my_reds[:, -1] = np.linspace(0, 1, cmap.N)
my_reds = ListedColormap(my_reds)

cmap = plt.cm.Greens
my_greens = cmap(np.arange(cmap.N))
my_greens[:, -1] = np.linspace(0, 1, cmap.N)
my_greens = ListedColormap(my_greens)

cmap = plt.cm.Blues
my_blues = cmap(np.arange(cmap.N))
my_blues[:, -1] = np.linspace(0, 1, cmap.N)
my_blues = ListedColormap(my_blues)

cmap = plt.cm.spring
my_spring = cmap(np.arange(cmap.N))
my_spring[:, -1] = np.linspace(0, 1, cmap.N)
my_spring = ListedColormap(my_spring)
## ----------------------------------------

import seaborn as sns

Python version               :  3.7.14 (default, Sep  8 2022, 00:06:44) 
Numpy version                :  1.21.6


Provided everything was set up correctly, we can run the BigQuery query and get all the information we need to download the testing data from the IDC platform.

For this specific use case, we will work with the "CT lymph nodes" collection hosted on IDC, which groups a collection of series that are close to whole body CT scans.

In [None]:
%%bigquery --project=$project_id cohort_df

SELECT
  dicom_pivot_v11.PatientID,
  dicom_pivot_v11.collection_id,
  dicom_pivot_v11.source_DOI,
  dicom_pivot_v11.StudyInstanceUID,
  dicom_pivot_v11.SeriesInstanceUID,
  dicom_pivot_v11.SOPInstanceUID,
  dicom_pivot_v11.gcs_url
FROM
  `bigquery-public-data.idc_v11.dicom_pivot_v11` dicom_pivot_v11
WHERE
  StudyInstanceUID IN (
    SELECT
      StudyInstanceUID
    FROM
      `bigquery-public-data.idc_v11.dicom_pivot_v11` dicom_pivot_v11
    WHERE
      (
        dicom_pivot_v11.collection_id IN ('Community', 'ct_lymph_nodes')
      )
    GROUP BY
      StudyInstanceUID
  )
GROUP BY
  dicom_pivot_v11.PatientID,
  dicom_pivot_v11.collection_id,
  dicom_pivot_v11.source_DOI,
  dicom_pivot_v11.StudyInstanceUID,
  dicom_pivot_v11.SeriesInstanceUID,
  dicom_pivot_v11.SOPInstanceUID,
  dicom_pivot_v11.gcs_url
ORDER BY
  dicom_pivot_v11.PatientID ASC,
  dicom_pivot_v11.collection_id ASC,
  dicom_pivot_v11.source_DOI ASC,
  dicom_pivot_v11.StudyInstanceUID ASC,
  dicom_pivot_v11.SeriesInstanceUID ASC,
  dicom_pivot_v11.SOPInstanceUID ASC,
  dicom_pivot_v11.gcs_url ASC
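
Outside of a notebook, where the `%%bigquery` magic is not available, a similar table could be fetched with the BigQuery Python client. The following is a sketch under stated assumptions: the query is a simplified version of the one above (same columns, restricted directly to the `ct_lymph_nodes` collection), `fetch_cohort` is a hypothetical helper, and the project ID in the usage comment is a placeholder.

```python
# Simplified query (an assumption, not the exact query used above):
# same columns, one row per DICOM file, single collection.
QUERY = """
SELECT DISTINCT
  PatientID, collection_id, source_DOI,
  StudyInstanceUID, SeriesInstanceUID, SOPInstanceUID, gcs_url
FROM `bigquery-public-data.idc_v11.dicom_pivot_v11`
WHERE collection_id = 'ct_lymph_nodes'
ORDER BY PatientID, SeriesInstanceUID, SOPInstanceUID
"""

def fetch_cohort(project_id, query=QUERY):
    """Run the query with the BigQuery client and return a pandas DataFrame."""
    # imported lazily so that defining the function does not require the library
    from google.cloud import bigquery
    client = bigquery.Client(project=project_id)
    return client.query(query).to_dataframe()

# cohort_df = fetch_cohort("my-project-id")  # "my-project-id" is a placeholder
```

The call is left commented out because it needs valid Google Cloud credentials and network access.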

In [None]:
# this works as intended only if the BQ query parses data from a single dataset
# if not, feel free to set the name manually!
dataset_name = cohort_df["collection_id"].values[0]

dataset_name

'ct_lymph_nodes'
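
Since taking the first value silently hides the case where the query returns more than one collection, an explicit check can be useful. A minimal sketch, with `single_collection` as a hypothetical helper that accepts any iterable of collection IDs:

```python
def single_collection(collection_ids):
    """Return the only collection ID found, or raise if there are several (or none)."""
    unique_ids = sorted(set(collection_ids))
    if len(unique_ids) != 1:
        raise ValueError("Expected exactly one collection, found: %s" % unique_ids)
    return unique_ids[0]

checked_name = single_collection(["ct_lymph_nodes", "ct_lymph_nodes"])
print(checked_name)
# in the notebook: dataset_name = single_collection(cohort_df["collection_id"])
```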

In [None]:
# create the directory tree
!mkdir -p data models

!mkdir -p data/raw 
!mkdir -p data/raw/tmp data/raw/$dataset_name
!mkdir -p data/raw/$dataset_name/dicom

!mkdir -p data/processed
!mkdir -p data/processed/$dataset_name
!mkdir -p data/processed/$dataset_name/nii
!mkdir -p data/processed/$dataset_name/dicomseg

!mkdir -p data/model_input/
!mkdir -p data/totalsegmentator_output/
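
The same tree can be created from Python, which avoids relying on shell commands. This is a sketch rather than the notebook's own code: `make_data_tree` is a hypothetical helper, and the demo creates the folders under a temporary directory instead of `/content`.

```python
import os
import tempfile

def make_data_tree(root, dataset_name):
    """Create the folder layout used by the notebook under `root`."""
    sub_dirs = [
        "models",
        os.path.join("data", "raw", "tmp"),
        os.path.join("data", "raw", dataset_name, "dicom"),
        os.path.join("data", "processed", dataset_name, "nii"),
        os.path.join("data", "processed", dataset_name, "dicomseg"),
        os.path.join("data", "model_input"),
        os.path.join("data", "totalsegmentator_output"),
    ]
    created = [os.path.join(root, sub_dir) for sub_dir in sub_dirs]
    for path in created:
        # exist_ok makes the call safe to repeat, like `mkdir -p`
        os.makedirs(path, exist_ok=True)
    return created

demo_root = tempfile.mkdtemp()
demo_dirs = make_data_tree(demo_root, "ct_lymph_nodes")
print(len(demo_dirs), "folders ready under", demo_root)
```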

## **Parsing Cohort Information from BigQuery Tables**

We can check the various fields of the table we populated by running the BigQuery query.

This table stores one entry for each DICOM file in the dataset, so expect a large number of rows (110,003 for this collection).

In [None]:
pat_id_list = sorted(list(set(cohort_df["PatientID"].values)))

print("Total number of unique Patient IDs:", len(pat_id_list))

display(cohort_df.info())

display(cohort_df.head())

Total number of unique Patient IDs: 176
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110003 entries, 0 to 110002
Data columns (total 7 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   PatientID          110003 non-null  object
 1   collection_id      110003 non-null  object
 2   source_DOI         110003 non-null  object
 3   StudyInstanceUID   110003 non-null  object
 4   SeriesInstanceUID  110003 non-null  object
 5   SOPInstanceUID     110003 non-null  object
 6   gcs_url            110003 non-null  object
dtypes: object(7)
memory usage: 5.9+ MB


None

| | PatientID | collection_id | source_DOI | StudyInstanceUID | SeriesInstanceUID | SOPInstanceUID | gcs_url |
|---|---|---|---|---|---|---|---|
| 0 | ABD_LYMPH_001 | ct_lymph_nodes | 10.7937/K9/TCIA.2015.AQIIDCNM | 61.7.22285965616260355338860879829667630274 | 61.7.167248355135476067044532759811631626828 | 61.7.100530760313930961000572615593503636820 | gs://public-datasets-idc/38101099-8fae-44b5-be... |
| 1 | ABD_LYMPH_001 | ct_lymph_nodes | 10.7937/K9/TCIA.2015.AQIIDCNM | 61.7.22285965616260355338860879829667630274 | 61.7.167248355135476067044532759811631626828 | 61.7.100619337614589303607528629909134919710 | gs://public-datasets-idc/90b51943-20e5-4ce0-b7... |
| 2 | ABD_LYMPH_001 | ct_lymph_nodes | 10.7937/K9/TCIA.2015.AQIIDCNM | 61.7.22285965616260355338860879829667630274 | 61.7.167248355135476067044532759811631626828 | 61.7.100722470958405165423499101883203258976 | gs://public-datasets-idc/949a8429-0b08-4120-ad... |
| 3 | ABD_LYMPH_001 | ct_lymph_nodes | 10.7937/K9/TCIA.2015.AQIIDCNM | 61.7.22285965616260355338860879829667630274 | 61.7.167248355135476067044532759811631626828 | 61.7.100926126811826446149832025888003249166 | gs://public-datasets-idc/9190ed3e-edf4-4771-9d... |
| 4 | ABD_LYMPH_001 | ct_lymph_nodes | 10.7937/K9/TCIA.2015.AQIIDCNM | 61.7.22285965616260355338860879829667630274 | 61.7.167248355135476067044532759811631626828 | 61.7.102568601113976310733671672702929246062 | gs://public-datasets-idc/e050baf5-59e9-4416-8a... |
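
Because the table has one row per DICOM file, a per-patient summary is often easier to read. The sketch below counts files and distinct series for each patient using only the standard library. `summarise_cohort` is a hypothetical helper and the toy rows use made-up identifiers; in the notebook, the rows could be obtained with `cohort_df.to_dict("records")`.

```python
from collections import defaultdict

def summarise_cohort(rows):
    """Map each PatientID to (number of DICOM files, number of distinct series)."""
    n_files = defaultdict(int)
    series = defaultdict(set)
    for row in rows:
        n_files[row["PatientID"]] += 1
        series[row["PatientID"]].add(row["SeriesInstanceUID"])
    return {pat: (n_files[pat], len(series[pat])) for pat in n_files}

# toy rows with made-up identifiers
toy_rows = [
    {"PatientID": "PAT_A", "SeriesInstanceUID": "1.1"},
    {"PatientID": "PAT_A", "SeriesInstanceUID": "1.1"},
    {"PatientID": "PAT_B", "SeriesInstanceUID": "2.1"},
]
cohort_summary = summarise_cohort(toy_rows)
print(cohort_summary)
```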


---

## **Set Run Parameters**

The cell below sets the fixed parameters of the run: the paths of the raw, sorted, and processed data, the input and output folders of the model, and the JSON file with the metadata used for the DICOM SEG conversion.


In [None]:
# FIXED PARAMETERS
data_base_path = "/content/data"
raw_base_path = "/content/data/raw/tmp"
sorted_base_path = os.path.join("/content/data/raw/", dataset_name, "dicom")

processed_base_path = os.path.join("/content/data/processed/", dataset_name)
processed_nifti_path = os.path.join(processed_base_path, "nii")

processed_dicomseg_path = os.path.join(processed_base_path, "dicomseg")

model_input_folder = "/content/data/model_input/"
model_output_folder = "/content/data/totalsegmentator_output/"

dicomseg_json_path = "/content/aimi/aimi/totalsegmentator/config/dicomseg_metadata_whole.json"

## **Running the Analysis for a Single Patient**

In [None]:
import aimi.aimi as aimi

from aimi import general_utils as aimi_utils
from aimi import totalsegmentator as aimi_model

The following cells run the whole processing pipeline, from pre-processing to post-processing.

In [None]:
pat_id = random.choice(cohort_df["PatientID"].values)
pat_df = cohort_df[cohort_df["PatientID"] == pat_id].reset_index(drop = True)

display(pat_df.info())
display(pat_df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 643 entries, 0 to 642
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   PatientID          643 non-null    object
 1   collection_id      643 non-null    object
 2   source_DOI         643 non-null    object
 3   StudyInstanceUID   643 non-null    object
 4   SeriesInstanceUID  643 non-null    object
 5   SOPInstanceUID     643 non-null    object
 6   gcs_url            643 non-null    object
dtypes: object(7)
memory usage: 35.3+ KB


None

| | PatientID | collection_id | source_DOI | StudyInstanceUID | SeriesInstanceUID | SOPInstanceUID | gcs_url |
|---|---|---|---|---|---|---|---|
| 0 | ABD_LYMPH_008 | ct_lymph_nodes | 10.7937/K9/TCIA.2015.AQIIDCNM | 61.7.93273854116647800470730671671118421206 | 61.7.280176113695462316231376981119115690926 | 61.7.100097559128660927446670184150895234239 | gs://public-datasets-idc/ba7f0475-0db0-4dbf-81... |
| 1 | ABD_LYMPH_008 | ct_lymph_nodes | 10.7937/K9/TCIA.2015.AQIIDCNM | 61.7.93273854116647800470730671671118421206 | 61.7.280176113695462316231376981119115690926 | 61.7.100206512389913440886281449668917857549 | gs://public-datasets-idc/bc9a70f2-a1f2-4403-b2... |
| 2 | ABD_LYMPH_008 | ct_lymph_nodes | 10.7937/K9/TCIA.2015.AQIIDCNM | 61.7.93273854116647800470730671671118421206 | 61.7.280176113695462316231376981119115690926 | 61.7.101298939003079484184266337441157027334 | gs://public-datasets-idc/41cccdf6-aff6-442f-9a... |
| 3 | ABD_LYMPH_008 | ct_lymph_nodes | 10.7937/K9/TCIA.2015.AQIIDCNM | 61.7.93273854116647800470730671671118421206 | 61.7.280176113695462316231376981119115690926 | 61.7.101370835362935855626766182843851527054 | gs://public-datasets-idc/8a7a0fb8-8763-49ab-b5... |
| 4 | ABD_LYMPH_008 | ct_lymph_nodes | 10.7937/K9/TCIA.2015.AQIIDCNM | 61.7.93273854116647800470730671671118421206 | 61.7.280176113695462316231376981119115690926 | 61.7.101820270859122602936553500476036960799 | gs://public-datasets-idc/31ee7ac7-92a6-451d-bb... |


In [None]:
# init

print("Processing patient: %s"%(pat_id))

patient_df = cohort_df[cohort_df["PatientID"] == pat_id]

dicomseg_fn = pat_id + "_SEG.dcm"

input_nifti_fn = pat_id + ".nii.gz"
input_nifti_path = os.path.join(model_input_folder, input_nifti_fn)

pred_nifti_fn = pat_id + ".nii.gz"
pred_nifti_path = os.path.join(model_output_folder, pred_nifti_fn)

pred_softmax_folder_name = "pred_softmax"
pred_softmax_folder_path = os.path.join(processed_nifti_path, pat_id, pred_softmax_folder_name)

Processing patient: ABD_LYMPH_008


In [None]:
# data cross-loading
aimi_utils.gcs.download_patient_data(raw_base_path = raw_base_path,
                                     sorted_base_path = sorted_base_path,
                                     patient_df = patient_df,
                                     remove_raw = True)

Copying files from IDC buckets to /content/data/raw/tmp/ABD_LYMPH_008...
Done in 33.056 seconds.

Sorting DICOM files...
Done in 1.34802 seconds.
Sorted DICOM data saved at: /content/data/raw/ct_lymph_nodes/dicom/ABD_LYMPH_008
Removing un-sorted data at /content/data/raw/tmp/ABD_LYMPH_008...
... Done.
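
The internals of `download_patient_data` are not shown here. One common way to fetch a list of `gs://` URLs is to write them to a manifest file and pipe it to `gsutil -m cp -I`, which reads the source URLs from standard input. The sketch below only builds the manifest and the command string. `build_download_command` is a hypothetical helper, and the bucket URLs are placeholders, not real IDC paths.

```python
import os
import shlex
import tempfile

def build_download_command(gcs_urls, dest_dir, manifest_path):
    """Write the URLs to a manifest file and return the gsutil command to fetch them."""
    with open(manifest_path, "w") as fp:
        fp.write("\n".join(gcs_urls) + "\n")
    # `gsutil cp -I` takes the list of source URLs from standard input
    return "cat %s | gsutil -m cp -I %s" % (shlex.quote(manifest_path), shlex.quote(dest_dir))

# placeholder URLs; in the notebook they would come from patient_df["gcs_url"]
demo_urls = ["gs://example-bucket/file-a.dcm", "gs://example-bucket/file-b.dcm"]
manifest = os.path.join(tempfile.mkdtemp(), "gcs_paths.txt")
command = build_download_command(demo_urls, "/content/data/raw/tmp/ABD_LYMPH_008", manifest)
print(command)
```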


In [None]:
# DICOM CT to NIfTI - required for the processing
aimi_utils.preprocessing.pypla_dicom_ct_to_nifti(sorted_base_path = sorted_base_path,
                                                 processed_nifti_path = processed_nifti_path,
                                                 pat_id = pat_id, verbose = True)


Running 'plastimatch convert' with the specified arguments:
  --input /content/data/raw/ct_lymph_nodes/dicom/ABD_LYMPH_008/CT
  --output-img /content/data/processed/ct_lymph_nodes/nii/ABD_LYMPH_008/ABD_LYMPH_008_CT.nii.gz
... Done.
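
As the log above shows, the conversion boils down to a `plastimatch convert` call with an `--input` folder and an `--output-img` file. A minimal sketch of how that command could be assembled (`plastimatch_convert_command` is a hypothetical helper, and the path layout is inferred from the log):

```python
import os

def plastimatch_convert_command(sorted_base_path, processed_nifti_path, pat_id):
    """Build the `plastimatch convert` command for the CT series of one patient."""
    ct_dir = os.path.join(sorted_base_path, pat_id, "CT")
    out_path = os.path.join(processed_nifti_path, pat_id, pat_id + "_CT.nii.gz")
    return ["plastimatch", "convert", "--input", ct_dir, "--output-img", out_path]

convert_cmd = plastimatch_convert_command(
    "/content/data/raw/ct_lymph_nodes/dicom",
    "/content/data/processed/ct_lymph_nodes/nii",
    "ABD_LYMPH_008",
)
print(" ".join(convert_cmd))
# to execute it: subprocess.run(convert_cmd, check=True)
```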


In [None]:
# prepare the `model_input` folder for the inference phase
aimi_utils.preprocessing.prep_ct_input_data(processed_nifti_path = processed_nifti_path,
                                            model_input_folder = model_input_folder,
                                            pat_id = pat_id)

Copying /content/data/processed/ct_lymph_nodes/nii/ABD_LYMPH_008/ABD_LYMPH_008_CT.nii.gz
to /content/data/model_input/ABD_LYMPH_008_0000.nii.gz...
... Done.
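
The `_0000` suffix in the copied file name follows the nnU-Net naming convention, where the four digits identify the input channel (here, the only one: the CT). A sketch of the copy step, assuming the file layout shown in the log above (`prep_input` is a hypothetical helper, and the demo uses an empty placeholder file in a temporary directory):

```python
import os
import shutil
import tempfile

def prep_input(processed_nifti_path, model_input_folder, pat_id):
    """Copy the patient's CT NIfTI to the model input folder with the `_0000` suffix."""
    src = os.path.join(processed_nifti_path, pat_id, pat_id + "_CT.nii.gz")
    dst = os.path.join(model_input_folder, pat_id + "_0000.nii.gz")
    os.makedirs(model_input_folder, exist_ok=True)
    shutil.copy(src, dst)
    return dst

# demo on an empty placeholder file
demo_root = tempfile.mkdtemp()
nii_dir = os.path.join(demo_root, "nii")
os.makedirs(os.path.join(nii_dir, "ABD_LYMPH_008"))
open(os.path.join(nii_dir, "ABD_LYMPH_008", "ABD_LYMPH_008_CT.nii.gz"), "w").close()

copied = prep_input(nii_dir, os.path.join(demo_root, "model_input"), "ABD_LYMPH_008")
print(os.path.basename(copied))
```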


The following cell runs the deep learning-based segmentation with TotalSegmentator.


In [None]:
aimi_model.utils.processing.process_patient(pat_id = pat_id,
                                            model_input_folder = model_input_folder,
                                            model_output_folder = model_output_folder)

Running TotalSegmentator in default mode (1.5mm)
Done in 468.705 seconds.
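
How `process_patient` invokes the model is not shown here. For reference, TotalSegmentator also ships a command-line tool that takes an input NIfTI file (`-i`) and an output folder (`-o`), with an optional `--fast` flag for a lower-resolution, quicker run. The sketch below only assembles such a command. `totalsegmentator_command` is a hypothetical helper, and using the `_0000` file as input is an assumption.

```python
def totalsegmentator_command(input_nifti, output_dir, fast=False):
    """Build a TotalSegmentator command line for one CT volume."""
    cmd = ["TotalSegmentator", "-i", input_nifti, "-o", output_dir]
    if fast:
        # lower-resolution model: faster, less accurate
        cmd.append("--fast")
    return cmd

seg_cmd = totalsegmentator_command(
    "/content/data/model_input/ABD_LYMPH_008_0000.nii.gz",
    "/content/data/totalsegmentator_output/",
)
print(" ".join(seg_cmd))
# to execute it: subprocess.run(seg_cmd, check=True)
```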


---


In [None]:
processed_seg_folder = os.path.join(processed_nifti_path, pat_id, "totalsegmentator")

shutil.copytree(model_output_folder, processed_seg_folder)

'/content/data/processed/ct_lymph_nodes/nii/ABD_LYMPH_008/totalsegmentator'

Given that TotalSegmentator segments more than one hundred structures, the conversion from `*.nii.gz` to DICOM SEG might take a few minutes.

In [None]:
aimi_model.utils.postprocessing.nifti_to_dicomseg(sorted_base_path = sorted_base_path,
                                                  processed_base_path = processed_base_path,
                                                  dicomseg_json_path = dicomseg_json_path,
                                                  pat_id = pat_id)
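
The conversion relies on DCMQI, whose `itkimage2segimage` tool takes the segmentation images, the folder of the source DICOM series, and a metadata JSON file, and writes a DICOM SEG object. The sketch below assembles such a call. It does not claim to reproduce what `nifti_to_dicomseg` does internally: `itkimage2segimage_command` is a hypothetical helper, and the two segment file names are illustrative examples only.

```python
def itkimage2segimage_command(seg_nifti_paths, ct_dicom_dir, metadata_json, output_dcm):
    """Build the dcmqi command that converts NIfTI masks to a DICOM SEG object."""
    return [
        "itkimage2segimage",
        "--inputImageList", ",".join(seg_nifti_paths),  # comma-separated list of masks
        "--inputDICOMDirectory", ct_dicom_dir,          # source CT series
        "--inputMetadata", metadata_json,               # segment descriptions
        "--outputDICOM", output_dcm,
    ]

seg_dir = "/content/data/processed/ct_lymph_nodes/nii/ABD_LYMPH_008/totalsegmentator"
dcmqi_cmd = itkimage2segimage_command(
    [seg_dir + "/liver.nii.gz", seg_dir + "/spleen.nii.gz"],  # illustrative subset
    "/content/data/raw/ct_lymph_nodes/dicom/ABD_LYMPH_008/CT",
    "/content/aimi/aimi/totalsegmentator/config/dicomseg_metadata_whole.json",
    "/content/data/processed/ct_lymph_nodes/dicomseg/ABD_LYMPH_008/ABD_LYMPH_008_SEG.dcm",
)
print(" ".join(dcmqi_cmd))
```

Note that the metadata JSON has to describe the same segments that are passed as input images, so the two-file example above would not work as is with the whole-body metadata file.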

---

## **Data Download**

In [None]:
%%capture

archive_fn = "%s.zip"%(pat_id)

try:
  os.remove(archive_fn)
except OSError:
  pass

seg_dicom_path = os.path.join(processed_dicomseg_path, pat_id, dicomseg_fn)
ct_dicom_path = os.path.join(sorted_base_path, pat_id)

!zip -j -r $archive_fn $ct_dicom_path $seg_dicom_path
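
In the `zip` call above, `-r` recurses into folders and `-j` stores the files without their directory paths, so the archive is flat. A rough Python equivalent using the standard `zipfile` module is sketched below (`zip_flat` is a hypothetical helper; the demo uses empty placeholder files in a temporary directory):

```python
import os
import tempfile
import zipfile

def zip_flat(archive_path, sources):
    """Zip files and folder contents, storing base names only (similar to `zip -j -r`)."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for source in sources:
            if os.path.isdir(source):
                for dir_path, _, file_names in os.walk(source):
                    for file_name in sorted(file_names):
                        zf.write(os.path.join(dir_path, file_name), arcname=file_name)
            else:
                zf.write(source, arcname=os.path.basename(source))
    return archive_path

# demo with empty placeholder files
demo_dir = tempfile.mkdtemp()
ct_dir = os.path.join(demo_dir, "CT")
os.makedirs(ct_dir)
for name in ("slice0.dcm", "slice1.dcm"):
    open(os.path.join(ct_dir, name), "w").close()
seg_path = os.path.join(demo_dir, "SEG.dcm")
open(seg_path, "w").close()

archive = zip_flat(os.path.join(demo_dir, "demo.zip"), [ct_dir, seg_path])
with zipfile.ZipFile(archive) as zf:
    names = sorted(zf.namelist())
print(names)
```

As with `zip -j`, files that share the same base name in different folders would collide in a flat archive.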

In [None]:
filesize = os.stat(archive_fn).st_size/1024e03
print('Starting the download of "%s" (%2.1f MB)...\n'%(archive_fn, filesize))

files.download(archive_fn)

Starting the download of "ABD_LYMPH_008.zip" (180.0 MB)...


