<a href="https://colab.research.google.com/github/AIM-Harvard/mhub/blob/colab/mhub/mhub/totalsegmentator/notebooks/mhub_totalsegmentator_mwe.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **ModelHub - Whole Body CT Segmentation**

This notebook provides a minimal working example of TotalSegmentator, a tool for the segmentation of 104 anatomical structures from CT images. The model was trained on a wide range of CT imaging data covering different pathologies, scanners, protocols, and institutions.

We test TotalSegmentator by implementing an end-to-end (cloud-based) pipeline on publicly available whole body CT scans hosted on the [Imaging Data Commons (IDC)](https://portal.imaging.datacommons.cancer.gov/), starting from raw DICOM CT data and ending with a DICOM SEG object storing the segmentation masks generated by the AI model. The testing dataset is external to and independent of the data used to develop the model (training and validation), and is composed of a wide variety of images, differing in the area covered by the scan, the presence of contrast, and the types of artefacts.

Every operation, from pulling the data to postprocessing and standardising the results, is executed with the goal of promoting transparency and reproducibility.

Please cite the following article if you use this code or pre-trained models:

Wasserthal, J., Meyer, M., Breit, H.C., Cyriac, J., Yang, S. and Segeroth, M., 2022. TotalSegmentator: robust segmentation of 104 anatomical structures in CT images. arXiv preprint arXiv:2208.05868, [https://doi.org/10.48550/arXiv.2208.05868](https://doi.org/10.48550/arXiv.2208.05868).

The original code is published on [GitHub](https://github.com/wasserth/TotalSegmentator) under the [Apache-2.0 license](https://github.com/wasserth/TotalSegmentator/blob/master/LICENSE).

# **Disclaimer**

The code and data of this repository are provided to promote reproducible research. They are not intended for clinical care or commercial use.

The software is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.

## **Environment Setup**

This demo notebook is intended to be run using a GPU.

To access a free GPU on Colab, go to `Edit > Notebook settings` and select `GPU` from the dropdown menu under `Hardware accelerator`.

The cell below prints some basic information about the Colab instance (working directory, hostname, user, and Python version).

In [None]:
import os
import sys
import shutil

import yaml

import time
import tqdm


# useful information
curr_dir = !pwd
curr_droid = !hostname
curr_pilot = !whoami

print(time.asctime(time.localtime()))

print("\nCurrent directory :", curr_dir[-1])
print("Hostname          :", curr_droid[-1])
print("Username          :", curr_pilot[-1])

print("Python version    :", sys.version.split('\n')[0])

Tue Jan 31 17:06:07 2023

Current directory : /content
Hostname          : a691501a5ece
Username          : root
Python version    : 3.8.10 (default, Nov 14 2022, 12:59:47) 
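
The cell above does not query the GPU itself. A minimal sketch of such a check is given below, assuming the NVIDIA driver tools (`nvidia-smi`) are installed on GPU-enabled instances; `gpu_available` is a hypothetical helper name and is not part of the original notebook.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if `nvidia-smi` is on the PATH and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    # `nvidia-smi -L` prints one line per detected GPU
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU available     :", gpu_available())
```

If this prints `False`, the hardware accelerator is probably not enabled in the notebook settings.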


Authenticating with Google is necessary to run BigQuery queries.

Every operation in this notebook (BigQuery queries, fetching data from the IDC buckets) is completely free. The only requirement is a Google Cloud project: for the notebook to work as intended, specify the ID of your project in the cell that follows the authentication cell.

In [None]:
from google.colab import auth
auth.authenticate_user()

In [None]:
from google.colab import files
from google.cloud import storage
from google.cloud import bigquery as bq

# INSERT THE ID OF YOUR PROJECT HERE!
project_id = "idc-external-030"

Throughout this Colab notebook, we will use [Plastimatch](https://plastimatch.org), a reliable open-source software for image computation, for image pre-processing. We will run Plastimatch through the simple [PyPlastimatch](https://github.com/AIM-Harvard/pyplastimatch/tree/main/pyplastimatch) Python wrapper.

In [None]:
%%capture
!pip install yamlmagic

In [None]:
%load_ext yamlmagic

In [None]:
%%capture
!apt install plastimatch

In [None]:
# check plastimatch was correctly installed
!plastimatch --version

plastimatch version 1.8.0


*TODO*: we use Subversion to load our mhubio framework and the model-specific runner module and config into Colab.

In [None]:
!apt install subversion

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  libapr1 libaprutil1 libserf-1-1 libsvn1 libutf8proc2
Suggested packages:
  db5.3-util libapache2-mod-svn subversion-tools
The following NEW packages will be installed:
  libapr1 libaprutil1 libserf-1-1 libsvn1 libutf8proc2 subversion
0 upgraded, 6 newly installed, 0 to remove and 27 not upgraded.
Need to get 2,355 kB of archives.
After this operation, 10.3 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal/main amd64 libapr1 amd64 1.6.5-1ubuntu1 [91.4 kB]
Get:2 http://archive.ubuntu.com/ubuntu focal/main amd64 libaprutil1 amd64 1.6.1-4ubuntu2 [84.7 kB]
Get:3 http://archive.ubuntu.com/ubuntu focal/universe amd64 libserf-1-1 amd64 1.3.9-8build1 [45.2 kB]
Get:4 http://archive.ubuntu.com/ubuntu focal/universe amd64 libutf8proc2 amd64 2.5.0-1 [50.0 kB]
Get:5 http://archive.ubuntu.com/ubuntu focal-updates/un

---

Start by cloning the AIMI hub repository on the Colab instance.

The AIMI hub repository stores all the code we will use for pulling, preprocessing, processing, and postprocessing the data for this use case, as well as for the other use cases shared through AIMI hub.

In [None]:
%%capture
!svn checkout https://github.com/AIM-Harvard/aimi_alpha/trunk/aimi/general_utils/ mhub/aimi_utils

!svn checkout https://github.com/AIM-Harvard/mhub/trunk/mhub/mhubio mhub/mhubio
!svn checkout https://github.com/AIM-Harvard/mhub/trunk/mhub/ymldicomseg mhub/ymldicomseg
!svn checkout https://github.com/AIM-Harvard/mhub/trunk/mhub/totalsegmentator mhub/totalsegmentator

To organise the DICOM data in a more common (and human-understandable) fashion after downloading them from the buckets, we will use [DICOMSort](https://github.com/pieper/dicomsort).

DICOMSort is an open-source tool for custom sorting and renaming of DICOM files based on their DICOM tags. In our case, we will use DICOMSort to organise the DICOM data by `PatientID` and `Modality`, so that the final directory looks like the following:

```
data/raw/nsclc-radiomics/dicom/$PatientID
 ├─── CT
 │     ├─── $SOPInstanceUID_slice0.dcm
 │     ├─── $SOPInstanceUID_slice1.dcm
 │     └─── ...
 ├─── RTSTRUCT
 │     └─── $SOPInstanceUID_RTSTRUCT.dcm
 └─── SEG
       └─── $SOPInstanceUID_RTSEG.dcm
```
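
To illustrate how a tag-based target pattern turns into a file path, here is a minimal sketch. It is not DICOMSort's implementation: `expand_pattern` is a hypothetical helper, the tag values are made up, and the pattern `%PatientID/%Modality/%SOPInstanceUID.dcm` is only an assumption about what a PatientID/Modality layout could look like.

```python
import re


def expand_pattern(pattern: str, tags: dict) -> str:
    """Replace every %TagName placeholder with the matching DICOM tag value."""
    def lookup(match):
        return str(tags[match.group(1)])
    return re.sub(r"%([A-Za-z]+)", lookup, pattern)


# hypothetical tag values for a single CT slice
tags = {"PatientID": "ABD_LYMPH_001", "Modality": "CT", "SOPInstanceUID": "1.2.3"}
target = expand_pattern("%PatientID/%Modality/%SOPInstanceUID.dcm", tags)
print(target)
```

Every file whose tags expand to the same directory prefix ends up in the same folder, which is what produces a tree like the one above.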

In [None]:
!git clone https://github.com/pieper/dicomsort dicomsort

Cloning into 'dicomsort'...
remote: Enumerating objects: 130, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 130 (delta 0), reused 1 (delta 0), pack-reused 126
Receiving objects: 100% (130/130), 44.12 KiB | 11.03 MiB/s, done.
Resolving deltas: 100% (63/63), done.


We will also use DCMQI for converting the resulting segmentation into standard DICOM SEG objects.

In [None]:
%%capture
dcmqi_release_url = "https://github.com/QIICR/dcmqi/releases/download/v1.2.4/dcmqi-1.2.4-linux.tar.gz"
dcmqi_download_path = "/content/dcmqi-1.2.4-linux.tar.gz"
dcmqi_path = "/content/dcmqi-1.2.4-linux"

!wget -O $dcmqi_download_path $dcmqi_release_url

!tar -xvf $dcmqi_download_path

!mv $dcmqi_path/bin/* /bin

In [None]:
!printf '#!/bin/bash\npython3 /content/dicomsort/dicomsort.py "$@"\n' > /usr/bin/dicomsort
!chmod +x /usr/bin/dicomsort

---

In [None]:
%%capture
!pip install pyplastimatch nnunet ipywidgets
!pip install TotalSegmentator

In [None]:
import shutil
import random

import json
import pprint
import numpy as np
import pandas as pd

import pydicom
import nibabel as nib
import SimpleITK as sitk
import pyplastimatch as pypla

print("Python version               : ", sys.version.split('\n')[0])
print("Numpy version                : ", np.__version__)

# ----------------------------------------

#everything that has to do with plotting goes here below
import matplotlib
matplotlib.use("agg")

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from matplotlib.patches import Patch

%matplotlib inline
%config InlineBackend.figure_format = "png"

import ipywidgets as ipyw

## ----------------------------------------

# create new colormap appending the alpha channel to the selected one
# (so that we don't get a "color overlay" when plotting the segmask superimposed to the CT)
cmap = plt.cm.Reds
my_reds = cmap(np.arange(cmap.N))
my_reds[:, -1] = np.linspace(0, 1, cmap.N)
my_reds = ListedColormap(my_reds)

cmap = plt.cm.Greens
my_greens = cmap(np.arange(cmap.N))
my_greens[:, -1] = np.linspace(0, 1, cmap.N)
my_greens = ListedColormap(my_greens)

cmap = plt.cm.Blues
my_blues = cmap(np.arange(cmap.N))
my_blues[:, -1] = np.linspace(0, 1, cmap.N)
my_blues = ListedColormap(my_blues)

cmap = plt.cm.spring
my_spring = cmap(np.arange(cmap.N))
my_spring[:, -1] = np.linspace(0, 1, cmap.N)
my_spring = ListedColormap(my_spring)
## ----------------------------------------

import seaborn as sns

Python version               :  3.8.10 (default, Nov 14 2022, 12:59:47) 
Numpy version                :  1.21.6


Provided everything was set up correctly, we can run the BigQuery query and get all the information we need to download the testing data from the IDC platform.

For this specific use case, we will work with the "CT lymph nodes" collection hosted on IDC, which groups a collection of series that are close to whole body CT scans.

In [None]:
%%bigquery cohort_df --project=$project_id 

SELECT
  dicom_pivot_v11.PatientID,
  dicom_pivot_v11.collection_id,
  dicom_pivot_v11.source_DOI,
  dicom_pivot_v11.StudyInstanceUID,
  dicom_pivot_v11.SeriesInstanceUID,
  dicom_pivot_v11.SOPInstanceUID,
  dicom_pivot_v11.gcs_url
FROM
  `bigquery-public-data.idc_v11.dicom_pivot_v11` dicom_pivot_v11
WHERE
  StudyInstanceUID IN (
    SELECT
      StudyInstanceUID
    FROM
      `bigquery-public-data.idc_v11.dicom_pivot_v11` dicom_pivot_v11
    WHERE
      (
        dicom_pivot_v11.collection_id IN ('Community', 'ct_lymph_nodes')
      )
    GROUP BY
      StudyInstanceUID
  )
GROUP BY
  dicom_pivot_v11.PatientID,
  dicom_pivot_v11.collection_id,
  dicom_pivot_v11.source_DOI,
  dicom_pivot_v11.StudyInstanceUID,
  dicom_pivot_v11.SeriesInstanceUID,
  dicom_pivot_v11.SOPInstanceUID,
  dicom_pivot_v11.gcs_url
ORDER BY
  dicom_pivot_v11.PatientID ASC,
  dicom_pivot_v11.collection_id ASC,
  dicom_pivot_v11.source_DOI ASC,
  dicom_pivot_v11.StudyInstanceUID ASC,
  dicom_pivot_v11.SeriesInstanceUID ASC,
  dicom_pivot_v11.SOPInstanceUID ASC,
  dicom_pivot_v11.gcs_url ASC

In [None]:
# this works as intended only if the BQ query parses data from a single dataset
# if not, feel free to set the name manually!
dataset_name = cohort_df["collection_id"].values[0]

dataset_name

'ct_lymph_nodes'
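
As the comment in the cell above notes, taking the first value is only safe when the query returns a single collection. A minimal sketch of an explicit check follows; `single_collection_id` is a hypothetical helper, not part of the original notebook.

```python
def single_collection_id(collection_ids):
    """Return the only collection ID present, or raise if there are several."""
    unique_ids = sorted(set(collection_ids))
    if len(unique_ids) != 1:
        raise ValueError("expected exactly one collection, found: %s" % unique_ids)
    return unique_ids[0]


# toy input; on the real table one would pass cohort_df["collection_id"]
print(single_collection_id(["ct_lymph_nodes", "ct_lymph_nodes"]))
```

If the check raises, set `dataset_name` manually as suggested above.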

In [None]:
# create the directory tree
!mkdir -p data

!mkdir -p data/input_data 
!mkdir -p data/output_data 

## **Parsing Cohort Information from BigQuery Tables**

We can check the various fields of the table we populated by running the BigQuery query.

The table stores one entry for each DICOM file in the dataset, so expect thousands of rows.

In [None]:
pat_id_list = sorted(list(set(cohort_df["PatientID"].values)))

print("Total number of unique Patient IDs:", len(pat_id_list))

display(cohort_df.info())

display(cohort_df.head())

Total number of unique Patient IDs: 176
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110003 entries, 0 to 110002
Data columns (total 7 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   PatientID          110003 non-null  object
 1   collection_id      110003 non-null  object
 2   source_DOI         110003 non-null  object
 3   StudyInstanceUID   110003 non-null  object
 4   SeriesInstanceUID  110003 non-null  object
 5   SOPInstanceUID     110003 non-null  object
 6   gcs_url            110003 non-null  object
dtypes: object(7)
memory usage: 5.9+ MB


None

| | PatientID | collection_id | source_DOI | StudyInstanceUID | SeriesInstanceUID | SOPInstanceUID | gcs_url |
|---|---|---|---|---|---|---|---|
| 0 | ABD_LYMPH_001 | ct_lymph_nodes | 10.7937/K9/TCIA.2015.AQIIDCNM | 61.7.22285965616260355338860879829667630274 | 61.7.167248355135476067044532759811631626828 | 61.7.100530760313930961000572615593503636820 | gs://public-datasets-idc/38101099-8fae-44b5-be... |
| 1 | ABD_LYMPH_001 | ct_lymph_nodes | 10.7937/K9/TCIA.2015.AQIIDCNM | 61.7.22285965616260355338860879829667630274 | 61.7.167248355135476067044532759811631626828 | 61.7.100619337614589303607528629909134919710 | gs://public-datasets-idc/90b51943-20e5-4ce0-b7... |
| 2 | ABD_LYMPH_001 | ct_lymph_nodes | 10.7937/K9/TCIA.2015.AQIIDCNM | 61.7.22285965616260355338860879829667630274 | 61.7.167248355135476067044532759811631626828 | 61.7.100722470958405165423499101883203258976 | gs://public-datasets-idc/949a8429-0b08-4120-ad... |
| 3 | ABD_LYMPH_001 | ct_lymph_nodes | 10.7937/K9/TCIA.2015.AQIIDCNM | 61.7.22285965616260355338860879829667630274 | 61.7.167248355135476067044532759811631626828 | 61.7.100926126811826446149832025888003249166 | gs://public-datasets-idc/9190ed3e-edf4-4771-9d... |
| 4 | ABD_LYMPH_001 | ct_lymph_nodes | 10.7937/K9/TCIA.2015.AQIIDCNM | 61.7.22285965616260355338860879829667630274 | 61.7.167248355135476067044532759811631626828 | 61.7.102568601113976310733671672702929246062 | gs://public-datasets-idc/e050baf5-59e9-4416-8a... |
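
Because the table holds one row per DICOM file, counting rows per `SeriesInstanceUID` gives the number of files in each series. The sketch below shows the idea on a few made-up rows using only the standard library; the identifiers are hypothetical and are not taken from the cohort.

```python
from collections import Counter

# hypothetical rows: (PatientID, SeriesInstanceUID, SOPInstanceUID)
rows = [
    ("PAT_A", "series-1", "sop-1"),
    ("PAT_A", "series-1", "sop-2"),
    ("PAT_B", "series-2", "sop-3"),
]

# one row per DICOM file, so rows per series equals files per series
files_per_series = Counter(series_uid for _, series_uid, _ in rows)

# collect the distinct series belonging to each patient
series_per_patient = {}
for pat_id_, series_uid, _ in rows:
    series_per_patient.setdefault(pat_id_, set()).add(series_uid)

print(files_per_series)
print(series_per_patient)
```

On the actual dataframe, `cohort_df.groupby("SeriesInstanceUID").size()` should give the same per-series count.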


---

## **Setup mhubio**

TODO: import mhubio


In [None]:
import sys, os
sys.path.append('.')

from mhub.mhubio.Config import Config, DataType, FileType, CT, SEG
from mhub.mhubio.modules.importer.UnsortedDicomImporter import UnsortedInstanceImporter
from mhub.mhubio.modules.importer.DataSorter import DataSorter
from mhub.mhubio.modules.convert.NiftiConverter import NiftiConverter
from mhub.mhubio.modules.convert.DsegConverter import DsegConverter
from mhub.mhubio.modules.organizer.DataOrganizer import DataOrganizer
from mhub.totalsegmentator.utils.TotalSegmentatorRunner import TotalSegmentatorRunner

## **Running the Analysis for a Single Patient**

The following cells run the whole processing pipeline, from pre-processing to post-processing.

In [None]:
pat_id = random.choice(cohort_df["PatientID"].values)
pat_df = cohort_df[cohort_df["PatientID"] == pat_id].reset_index(drop = True)

display(pat_df.info())
display(pat_df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 709 entries, 0 to 708
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   PatientID          709 non-null    object
 1   collection_id      709 non-null    object
 2   source_DOI         709 non-null    object
 3   StudyInstanceUID   709 non-null    object
 4   SeriesInstanceUID  709 non-null    object
 5   SOPInstanceUID     709 non-null    object
 6   gcs_url            709 non-null    object
dtypes: object(7)
memory usage: 38.9+ KB


None

| | PatientID | collection_id | source_DOI | StudyInstanceUID | SeriesInstanceUID | SOPInstanceUID | gcs_url |
|---|---|---|---|---|---|---|---|
| 0 | ABD_LYMPH_040 | ct_lymph_nodes | 10.7937/K9/TCIA.2015.AQIIDCNM | 61.7.310058150941655902813894518477154673975 | 61.7.270112099727492233168245235417633548821 | 61.7.100844096634047711281583818734013145837 | gs://public-datasets-idc/4ca27ae8-e89f-46c7-a4... |
| 1 | ABD_LYMPH_040 | ct_lymph_nodes | 10.7937/K9/TCIA.2015.AQIIDCNM | 61.7.310058150941655902813894518477154673975 | 61.7.270112099727492233168245235417633548821 | 61.7.101281610963381973124306423125276793439 | gs://public-datasets-idc/05be671a-4223-4978-bb... |
| 2 | ABD_LYMPH_040 | ct_lymph_nodes | 10.7937/K9/TCIA.2015.AQIIDCNM | 61.7.310058150941655902813894518477154673975 | 61.7.270112099727492233168245235417633548821 | 61.7.102103938689957191525483034830848431548 | gs://public-datasets-idc/22120445-140a-450a-be... |
| 3 | ABD_LYMPH_040 | ct_lymph_nodes | 10.7937/K9/TCIA.2015.AQIIDCNM | 61.7.310058150941655902813894518477154673975 | 61.7.270112099727492233168245235417633548821 | 61.7.102785575245738404291343301952108872265 | gs://public-datasets-idc/9f1d885d-46ae-4226-9c... |
| 4 | ABD_LYMPH_040 | ct_lymph_nodes | 10.7937/K9/TCIA.2015.AQIIDCNM | 61.7.310058150941655902813894518477154673975 | 61.7.270112099727492233168245235417633548821 | 61.7.103160839026352525904197325495639489450 | gs://public-datasets-idc/e680d2b3-1639-40ba-b2... |


In [None]:
# init

print("Processing patient: %s"%(pat_id))
patient_df = cohort_df[cohort_df["PatientID"] == pat_id]

Processing patient: ABD_LYMPH_040


In [None]:
# -p avoids an error if the cell is run more than once
!mkdir -p data/tmp

In [None]:
# data cross-loading
from mhub.aimi_utils.gcs import download_patient_data
download_patient_data(raw_base_path = "data/tmp",
                      sorted_base_path = "data/input_data",
                      patient_df = patient_df,
                      remove_raw = True)

Copying files from IDC buckets to data/tmp/ABD_LYMPH_040...
Done in 185.57 seconds.

Sorting DICOM files...
Done in 1.46509 seconds.
Sorted DICOM data saved at: data/input_data/ABD_LYMPH_040
Removing un-sorted data at data/tmp/ABD_LYMPH_040...
... Done.


Next, we write a configuration file with the settings for all the modules used in the rest of the pipeline.

In [None]:
%%writefile totalsegmentator_config.yml

general:
  data_base_dir: /content/data
modules:
  UnsortedInstanceImporter:
    input_dir: input_data
  DataSorter:
    base_dir: /content/data/sorted
    structure: '%SeriesInstanceUID/dicom/%SOPInstanceUID.dcm'
  DsegConverter:
    #dicomseg_json_path: /content/mhub/totalsegmentator/config/dicomseg_metadata_whole.json
    dicomseg_yml_path: /content/mhub/totalsegmentator/config/dseg.yml
    skip_empty_slices: True
  TotalSegmentatorRunner:
    use_fast_mode: true

Writing totalsegmentator_config.yml


In [None]:
# config
config = Config('totalsegmentator_config.yml')
config.verbose = True  # TODO: define levels of verbosity and integrate consistently. 


In [None]:
# import 
UnsortedInstanceImporter(config).execute()


--------------------------
Start UnsortedInstanceImporter
Done in 9.15527e-05 seconds.


In [None]:
# sort
DataSorter(config).execute()


--------------------------
Start DataSorter
sorting schema: /content/data/sorted/%SeriesInstanceUID/dicom/%SOPInstanceUID.dcm
>> run:  dicomsort -k -u /content/data/input_data /content/data/sorted/%SeriesInstanceUID/dicom/%SOPInstanceUID.dcm
adding ct in dicom format with resolved path:  /content/data/sorted/61.7.270112099727492233168245235417633548821/dicom
Done in 1.45 seconds.
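
The command logged above can be traced back to the configuration file: the importer's relative `input_dir` is resolved against `data_base_dir`, and the sorter's `structure` is appended to its `base_dir`. The sketch below reconstructs the command from those values. It is inferred from the log only and is not mhubio's actual code; the config is written as a plain dictionary instead of being parsed from YAML.

```python
import posixpath

# values copied from totalsegmentator_config.yml
config = {
    "general": {"data_base_dir": "/content/data"},
    "modules": {
        "UnsortedInstanceImporter": {"input_dir": "input_data"},
        "DataSorter": {
            "base_dir": "/content/data/sorted",
            "structure": "%SeriesInstanceUID/dicom/%SOPInstanceUID.dcm",
        },
    },
}

modules = config["modules"]

# relative input directory resolved against the general base directory
input_dir = posixpath.join(config["general"]["data_base_dir"],
                           modules["UnsortedInstanceImporter"]["input_dir"])

# sorting schema = sorter base directory + tag-based structure
schema = posixpath.join(modules["DataSorter"]["base_dir"],
                        modules["DataSorter"]["structure"])

command = ["dicomsort", "-k", "-u", input_dir, schema]
print(" ".join(command))
```

Changing `structure` in the config is therefore enough to change how the sorted files are laid out on disk.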


In [None]:
# convert (ct:dicom -> ct:nifti)
NiftiConverter(config).execute()


--------------------------
Start NiftiConverter

Running 'plastimatch convert' with the specified arguments:
  --input /content/data/sorted/61.7.270112099727492233168245235417633548821/dicom
  --output-img /content/data/sorted/61.7.270112099727492233168245235417633548821/image.nii.gz
... Done.
Done in 91.2303 seconds.


In [None]:
# execute model (ct:nifti -> seg:nifti)
TotalSegmentatorRunner(config).execute()


--------------------------
Start TotalSegmentatorRunner
Running TotalSegmentator in fast mode ('--fast', 3mm): 
>> run ts:  TotalSegmentator -i /content/data/sorted/61.7.270112099727492233168245235417633548821/image.nii.gz -o /app/tmp/18bd1414-f144-49b0-ad5e-82348d2ca09d --fast


CalledProcessError: ignored

In [None]:
# convert (seg:nifti -> seg:dicomseg)
DsegConverter(config).execute()

In [None]:
# organize data into output folder
# (targets point to the output folder created above under /content/data, the data_base_dir set in the config)
organizer = DataOrganizer(config, set_file_permissions=sys.platform.startswith('linux'))
organizer.setTarget(DataType(FileType.NIFTI, CT), "/content/data/output_data/[i:SeriesID]/[path]")
organizer.setTarget(DataType(FileType.DICOMSEG, SEG), "/content/data/output_data/[i:SeriesID]/TotalSegmentator.seg.dcm")
organizer.execute()

---

## **Data Download**

In [None]:
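%%capture

import glob

archive_fn = "%s.zip"%(pat_id)

try:
  os.remove(archive_fn)
except OSError:
  pass

# sorted DICOM CT data, as saved by download_patient_data above
sorted_base_path = "data/input_data"
ct_dicom_path = os.path.join(sorted_base_path, pat_id)

# DICOM SEG written by DataOrganizer; assumed to sit in one subfolder per series
# under the output folder - adjust `output_base_path` to match the organizer targets
output_base_path = "data/output_data"
dicomseg_fn = "TotalSegmentator.seg.dcm"
seg_dicom_path = " ".join(glob.glob(os.path.join(output_base_path, "*", dicomseg_fn)))

!zip -j -r $archive_fn $ct_dicom_path $seg_dicom_path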

In [None]:
filesize = os.stat(archive_fn).st_size/1024e03
print('Starting the download of "%s" (%2.1f MB)...\n'%(archive_fn, filesize))

files.download(archive_fn)

Starting the download of "ABD_LYMPH_051.zip" (178.4 MB)...


