<a href="https://colab.research.google.com/github/fedorov/AI-Deep-Learning-Lab-2021/blob/idc-tcia/sessions/tcia-idc/RSNA_2021_IDC_and_TCIA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RSNA 2021: Working with public datasets: TCIA and IDC

The goal of this session is to introduce you to the two data repositories supported by the US National Cancer Institute:

* The Cancer Imaging Archive (TCIA)
* Imaging Data Commons (IDC), which is the imaging repository within NCI Cancer Research Data Commons (CRDC)

**Learning Objectives:**
1. Understand basic capabilities of TCIA and IDC, and the differences between the two repositories.
2. Explore relevant functionality of TCIA and IDC to support data exploration, cohort definition, and retrieval of the data.
3. Learn how to analyze the data retrieved from TCIA/IDC on an example of a lung nodule segmentation task.

This notebook will guide you through the complete process of identifying a relevant dataset, retrieving it, preparing it for processing by the specific analysis tool, installing the tool and applying it to the dataset, and visualizing the segmentation results produced by the tool.

Note that it is not the purpose of this tutorial to promote a specific tool, or assess its robustness. 

We aim to provide an example of how a tool can be used for analyzing a sample dataset from TCIA/IDC. We hope that after completing this tutorial you will be empowered and motivated to experiment with more tools and apply them to more datasets in TCIA/IDC!

**Session Authors**

* Andrey Fedorov
* Justin Kirby
* Dennis Bontempi



## Outline

There is a growing number of tools being developed for analyzing medical images. More and more of those are shared openly by the authors to support dissemination of findings and reproducibility of scientific studies. 

However, getting such tools to work is not always straightforward. Some of the challenges include deploying the tool, identifying datasets that are suitable for analysis with that tool, and preprocessing the data.

In this notebook we will guide you through the process of deploying one such tool on a Colab VM, and demonstrate how you can utilize publicly available repositories of cancer imaging data to find relevant datasets, how to preprocess them for analysis by a specific tool, and how to visualize results of image segmentation produced by the tool. 

In this tutorial we will work with the nnU-Net segmentation tool developed by Isensee et al. and described in the following publication:

> Isensee, F., Jaeger, P. F., Kohl, S. A. A., Petersen, J. & Maier-Hein, K. H. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 18, 203–211 (2021).

and shared in https://github.com/MIC-DKFZ/nnUNet.

Specifically, we will utilize the network pretrained to segment 4 thoracic organs at risk (OARs): heart, aorta, trachea and esophagus. The network was trained using the [SegTHOR (Segmentation of THoracic Organs at Risk) dataset](https://arxiv.org/abs/1912.05950) and is shared in this Zenodo entry (see [Task055_SegTHOR.zip](https://zenodo.org/record/4485926/files/Task055_SegTHOR.zip?download=1)):

> Isensee, Fabian, Jäger, Paul F., Kohl, Simon A. A., Petersen, Jens, & Maier-Hein, Klaus H. (2021). pretrained models for 3D semantic image segmentation with nnU-Net (2.1). Zenodo. https://doi.org/10.5281/zenodo.4485926





## Prerequisites

* To use Colab, and to access data in IDC, you will need a [Google Account](https://support.google.com/accounts/answer/27441?hl=en)
* Make sure your Colab instance has a GPU! To do this, open "Runtime > Change runtime type" and choose the GPU runtime.
* To perform queries against IDC, and to download imaging data from Google Storage buckets you will need a cloud project with configured billing. You can apply for an IDC-sponsored cloud project using [this form](https://docs.google.com/forms/d/e/1FAIpQLSfXvXqficGaVEalJI3ym6rKqarmW_YUUWG6A4U8pclvR8MmRQ/viewform). 
  * **NOTE: if you are working with this notebook at RSNA 2021, please mention "RSNA 2021" in the comments section of the application form! This way we will onboard you to a shared sponsored project we set up specifically for the conference. Applications submitted during the RSNA 2021 tutorial session will be processed rapidly!**

## nnU-Net model setup

**Download of the pretrained network is the most time-consuming step in this tutorial, so please execute the cells in this section before anything else!**

While the nnU-Net framework should take care of the model download (from Zenodo), some of the zip files containing the pre-trained weights are particularly large, so the download can take a long time, get stuck, or produce errors (as [reported by other users](https://github.com/MIC-DKFZ/nnUNet/issues/358#issue-726410474) and in the [repository FAQ](https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/common_problems_and_solutions.md#downloading-pretrained-models-unzip-cannot-find-zipfile-directory-in-one-of-homeisenseennunetdownload_16031094034174126)).

For this reason, and to speed up this tutorial, we copied the relevant model weights to a shared Dropbox folder. In the following cells, we use Linux `wget` to pull the files from that folder - and use the nnU-Net framework command `nnUNet_install_pretrained_model_from_zip` to unpack and install the pre-trained model.

In [None]:
# create the directory tree
!mkdir -p tutorial 
!mkdir -p tutorial/models tutorial/data tutorial/output
!mkdir -p tutorial/data/dicom tutorial/data/processed tutorial/data/nnUNet_raw_data

In [None]:
# this will usually take between one and five minutes (but can sometimes take up to eight)
seg_model_url = "https://www.dropbox.com/s/m7es2ojn8h0ybhv/Task055_SegTHOR.zip?dl=0"
output_path = "tutorial/models/Task055_SegTHOR.zip"

!wget -O $output_path $seg_model_url
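
Because large downloads can stall or end up truncated, it can be worth checking the archive before installing it. The following is a minimal sketch that uses only the Python standard library; the helper name `is_valid_zip` is hypothetical, and the demonstration runs on a small throwaway archive rather than on the real model file.

```python
import os
import tempfile
import zipfile

def is_valid_zip(path):
    """Return True if `path` is a readable zip archive whose members pass their CRC checks."""
    if not os.path.isfile(path) or not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        # testzip() returns the name of the first corrupt member, or None if all are fine
        return archive.testzip() is None

# Demonstration on a throwaway archive and on a truncated copy of it
with tempfile.TemporaryDirectory() as tmp_dir:
    good_zip = os.path.join(tmp_dir, "good.zip")
    with zipfile.ZipFile(good_zip, "w") as archive:
        archive.writestr("weights.txt", "placeholder content")

    truncated_zip = os.path.join(tmp_dir, "truncated.zip")
    with open(good_zip, "rb") as src, open(truncated_zip, "wb") as dst:
        dst.write(src.read()[:20])  # simulate an interrupted download

    good_ok = is_valid_zip(good_zip)
    truncated_ok = is_valid_zip(truncated_zip)

print(good_ok, truncated_ok)
```

In the notebook, calling `is_valid_zip(output_path)` after the `wget` cell would be one way to decide whether the download needs to be repeated.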

## The Cancer Imaging Archive (TCIA)


This course assumes you have some basic familiarity with The Cancer Imaging Archive.  If you have never used TCIA you can [watch this presentation from RSNA 2020](https://vimeo.com/595989800) in order to understand the mission of TCIA and services it provides to the research community.  Options for accessing data from TCIA are summarized at https://www.cancerimagingarchive.net/access-data/. The two most relevant data access methods for this course are briefly summarized below.

#### Browsing Collections & Analysis Results

The most basic way to find data on TCIA is to [Browse Collections](https://www.cancerimagingarchive.net/collections) and [Browse Analysis Results](https://www.cancerimagingarchive.net/tcia-analysis-results/). Using the information in the table you can identify potential datasets of interest. Clicking on a given dataset takes you to a page which provides a description, data usage policy and citation guidelines, and links to download the data.  

TCIA hosts a variety of image types and other related files, but the majority of its data are radiology images stored in DICOM format. When downloading DICOM images the download link will save a *.TCIA "manifest" file rather than the actual images. These manifest files must be opened with a helper application called the [NBIA Data Retriever](https://wiki.cancerimagingarchive.net/x/egOnAg). The Data Retriever can be installed on Windows, Mac and Linux operating systems. The Linux version also supports a command-line interface option which can be used on Google Colab.  

#### Example
Let's assume you are interested in lung cancer datasets which have both CT images and segmentations in DICOM format.  From the [Browse Collections](https://www.cancerimagingarchive.net/collections) page you can use the filter box (top right of the table of datasets) to narrow the table down to datasets of interest.  Try typing "lung cancer CT".  This should reduce the table to 24 results.  In order to find out which datasets also have segmentations you can add "SEG" or "RTSTRUCT" to the filter.  For the sake of this example, let's try using "lung cancer CT seg".  This should reduce the results to 3 datasets.  Let's assume that you find the [NSCLC-Radiomics-Interobserver1](https://doi.org/10.7937/tcia.2019.cwvlpd26) collection to be the most interesting.  Clicking on the link to this dataset in the table will open its summary page. 

After reviewing the page to learn more about this dataset, scroll down to the bottom "Data Access" section.  Click the blue "Download" button for the Images and Segmentations to save the associated manifest file.  You can then upload this file to Colab, and open it using the NBIA Data Retriever by running the following code.


In [None]:
# install NBIA Data Retriever for downloading images 
# documentation available at https://wiki.cancerimagingarchive.net/display/NBIA/Downloading+TCIA+Images

!mkdir -p /usr/share/desktop-directories/
!wget -P /content/NBIA-Data-Retriever https://cbiit-download.nci.nih.gov/nbia/releases/ForTCIA/NBIADataRetriever_4.2/nbia-data-retriever-4.2.deb
!dpkg -i /content/NBIA-Data-Retriever/nbia-data-retriever-4.2.deb


In [None]:
# NBIA Data Retriever's Linux CLI documentation is at: https://wiki.cancerimagingarchive.net/display/NBIA/NBIA+Data+Retriever+Command+Line+Interface 

# TODO: is there a way to wget the file directly from the wiki to avoid manual steps of uploading the file into Colab?
# the URL is quoted so that the shell does not interpret the "&" characters in the query string
!wget -O manifest_nsclc.tcia "https://wiki.cancerimagingarchive.net/download/attachments/52756590/NSCLC-RADIOMICS-INTEROBSERVER1-Aug%2031%202020-NBIA-manifest.tcia?version=1&modificationDate=1598890227618&api=v2"

## Imaging Data Commons (IDC)

The National Cancer Institute (NCI) [Cancer Research Data Commons (CRDC)](https://datacommons.cancer.gov/) aims to establish a national cloud-based data science infrastructure. Imaging Data Commons (IDC) is a new component of CRDC supported by the Cancer Moonshot. The goal of IDC is to enable a broad spectrum of cancer researchers, with and without imaging expertise, to easily access and explore the value of deidentified imaging data and to support integrated analyses with nonimaging data. We achieve this goal by colocating versatile imaging collections with cloud-based computing resources and data exploration, visualization, and analysis tools. 

IDC provides access to curated imaging collections, accompanied by documentation, a user forum, and a growing number of analysis use cases that aim to demonstrate the value of a data commons framework applied to cancer imaging research.

Key resources maintained by IDC are the following:
* [public cancer imaging data](https://imaging.datacommons.cancer.gov/collections/) stored in Google Storage buckets and [public metadata tables](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=idc_current&page=dataset) that contain all of the DICOM metadata for IDC-hosted images
  * these are now available in [Google Public Dataset Program](https://console.cloud.google.com/marketplace/product/gcp-public-data-idc/nci-idc-data)
* [radiology](https://viewer.imaging.datacommons.cancer.gov/viewer/1.3.6.1.4.1.14519.5.2.1.6279.6001.224985459390356936417021464571) and [pathology](https://viewer.imaging.datacommons.cancer.gov/slim/studies/1.3.6.1.4.1.5962.99.1.2463087261.2121647220.1625960757917.3.0/series/1.3.6.1.4.1.5962.99.1.2463087261.2121647220.1625960757917.2.0) zero-footprint viewers that can be used to visualize any of the data hosted by IDC in your browser
* radiology and pathology use cases: reproducible analysis workflows that operate on IDC data, [available as Colab notebooks](https://github.com/ImagingDataCommons/IDC-Examples/tree/master/notebooks)
* [user portal](https://imaging.datacommons.cancer.gov/) that can be used to explore the data available in IDC, visualize images and annotations, and build cohorts
* [API](https://api.imaging.datacommons.cancer.gov/v1/swagger) that can be used for programmatic operations with IDC cohorts

At the moment, most of the data you will find in IDC has been replicated from TCIA (the exception is the DICOM-converted digital pathology collections). In the future, IDC will host cancer imaging data from sources other than TCIA.

## Setup of the Colab VM



In the following cells we will confirm you have a GPU before doing anything else, and will install and import all the Python dependencies. 

The main Python packages we need to install are:
* `nnunet` - which is the [codebase for the nnU-Net framework](https://github.com/MIC-DKFZ/nnUNet) we are going to be using for the segmentation step;
* `pydicom`, a Python [package](https://github.com/pydicom/pydicom) that lets the user read, modify, and write DICOM data in an easy "pythonic" way - that we are going to use to distinguish different DICOM objects from each other.

### GPU checks

In [None]:
# check whether the Colab instance was correctly initialized with a GPU
gpu_list = !nvidia-smi --list-gpus

has_gpu = "failed" not in gpu_list[0]

if not has_gpu:
  print("Your Colab VM does not have a GPU - check \"Runtime > Change runtime type\"")
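
The check above relies on the text that `nvidia-smi` prints when it fails. As an alternative, here is a minimal sketch that relies on whether the executable exists and on its exit code instead. The function name `detect_gpus` is hypothetical, and the sketch assumes that `nvidia-smi --list-gpus` prints one line starting with `GPU ` per device.

```python
import shutil
import subprocess

def detect_gpus():
    """Return the GPU description lines reported by nvidia-smi (empty list if none)."""
    # no driver tools installed at all: treat as "no GPU"
    if shutil.which("nvidia-smi") is None:
        return []
    proc = subprocess.run(
        ["nvidia-smi", "--list-gpus"], capture_output=True, text=True
    )
    # a non-zero exit code means the driver could not be queried
    if proc.returncode != 0:
        return []
    return [line for line in proc.stdout.splitlines() if line.startswith("GPU ")]

gpus = detect_gpus()
if not gpus:
    print('No GPU found - check "Runtime > Change runtime type"')
else:
    print("\n".join(gpus))
```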

In [None]:
# check which model of GPU the notebook is equipped with - a Tesla K80 or T4
# T4 is the better performing of the two - and can roughly halve the GPU processing time

!nvidia-smi

### Environment Setup

Here we will configure the Linux environment variables needed to run the nnU-Net pipeline. 

Three main variables are needed by default to run the nnU-Net segmentation pipelines:
* `nnUNet_raw_data_base` is the path to the folder where the segmentation pipeline expects to find the data to process;
* `nnUNet_preprocessed` is the path to the folder where the preprocessed data are saved;
* `RESULTS_FOLDER` is the path to the folder that stores the model weights by default and, in our case, for simplicity, also the segmentation masks produced by the pipeline.

We will use the additional variable `PATH_TO_MODEL_FILE` to point to the location where the pre-trained model weights for the chosen model will be stored (more on this later).

Please note that these variables need to be set using `os.environ[]` in Google Colab - as `!export` is not sufficient to guarantee the variables are kept from one cell to the other. For more in-depth information regarding what the nnU-Net framework uses these folders for, please visit [the dedicated nnU-Net documentation page](https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/setting_up_paths.md).
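
The difference between `os.environ` and `!export` can be checked with a small sketch. It assumes a POSIX `sh` is available (as on a Colab VM), and the variable names `DEMO_NNUNET_PATH` and `DEMO_LOST_VAR` are hypothetical, used only for this demonstration.

```python
import os
import subprocess

# A variable set through os.environ lives in the notebook's Python process ...
os.environ["DEMO_NNUNET_PATH"] = "/content/tutorial/demo"

# ... and is inherited by every child process started afterwards,
# which is what happens when a cell runs a shell command.
result = subprocess.run(
    ["sh", "-c", "echo $DEMO_NNUNET_PATH"],
    capture_output=True, text=True, check=True,
)
inherited_value = result.stdout.strip()

# An export executed inside a child shell disappears when that shell exits,
# so it never reaches the parent process or later commands.
subprocess.run(["sh", "-c", "export DEMO_LOST_VAR=1"], check=True)
lost_is_visible = "DEMO_LOST_VAR" in os.environ

print(inherited_value, lost_is_visible)
```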

In [None]:
# set env variables for the bash process
import os
os.environ['nnUNet_raw_data_base'] = "/content/tutorial/data/nnUNet_raw_data/"
os.environ['nnUNet_preprocessed'] = "/content/tutorial/data/processed/"

os.environ["RESULTS_FOLDER"] = "/content/tutorial/output/"
os.environ["PATH_TO_MODEL_FILE"] = "/content/tutorial/models/Task055_SegTHOR.zip"

dicom_sorted_dir = "/content/tutorial/data/dicom"

### Install command-line tools


The only package we will need for this tutorial is [Plastimatch](https://plastimatch.org/index.html). Plastimatch is considered to be the Swiss Army knife of medical image processing: we will use it to convert DICOM (CT, RTSTRUCT) series to NRRD files - but it can be used for a multitude of other tasks, such as registration, resampling, cropping, and computing statistics, to name a few. Plastimatch is also available as a 3D Slicer plug-in and can be used directly from the Slicer GUI.

For the sake of clarity and simplicity, we will call Plastimatch from a very simple [Python wrapper](https://github.com/denbonte/pyplastimatch) written for the occasion (unfortunately, Plastimatch does not provide an official one) - more on this later.

In [None]:
%%capture
!sudo apt update

!sudo apt install -y plastimatch

In [None]:
!echo $(plastimatch --version)

We are also going to install subversion, a tool that will allow us to clone GitHub repositories only partially (to save time and space).

In [None]:
%%capture

!sudo apt install -y subversion

In [None]:
!echo $(svn --version | head -n 2)

### Install Python packages

In [None]:
%%capture
!pip install nnunet
!pip install pydicom

Unpack and install model (under `PATH_TO_MODEL_FILE`).

In [None]:
%%capture
!nnUNet_install_pretrained_model_from_zip $PATH_TO_MODEL_FILE

In [None]:
import os
import sys
import shutil

import time
import gdown

import json
import pprint
import numpy as np
import pandas as pd

import pydicom
import nibabel as nib
import SimpleITK as sitk

from medpy.metric.binary import dc as dice_coef
from medpy.metric.binary import hd as hausdorff_distance
from medpy.metric.binary import asd as avg_surf_distance

from medpy.filter.binary import largest_connected_component

# use the "tensorflow_version" magic to make sure TF 1.x is imported
%tensorflow_version 1.x
import tensorflow as tf
import keras

print("Python version               : ", sys.version.split('\n')[0])
print("Numpy version                : ", np.__version__)
print("TensorFlow version           : ", tf.__version__)
print("Keras (stand-alone) version  : ", keras.__version__)

# has_gpu was set in the "GPU checks" section above
print("\nGPU detected                 : ", has_gpu)

# ----------------------------------------

#everything that has to do with plotting goes here below
import matplotlib
matplotlib.use("agg")

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

%matplotlib inline
%config InlineBackend.figure_format = "png"

import ipywidgets as ipyw

## ----------------------------------------

# create new colormap appending the alpha channel to the selected one
# (so that we don't get a "color overlay" when plotting the segmask superimposed to the CT)
cmap = plt.cm.Reds
my_reds = cmap(np.arange(cmap.N))
my_reds[:,-1] = np.linspace(0, 1, cmap.N)
my_reds = ListedColormap(my_reds)

cmap = plt.cm.Greens
my_greens = cmap(np.arange(cmap.N))
my_greens[:,-1] = np.linspace(0, 1, cmap.N)
my_greens = ListedColormap(my_greens)

cmap = plt.cm.Blues
my_blues = cmap(np.arange(cmap.N))
my_blues[:,-1] = np.linspace(0, 1, cmap.N)
my_blues = ListedColormap(my_blues)

cmap = plt.cm.spring
my_spring = cmap(np.arange(cmap.N))
my_spring[:,-1] = np.linspace(0, 1, cmap.N)
my_spring = ListedColormap(my_spring)
## ----------------------------------------

import seaborn as sns

In [None]:
# PyPlastimatch - python wrapper for Plastimatch (and interactive notebook visualisation)
!svn checkout https://github.com/AIM-Harvard/pyplastimatch/trunk/pyplastimatch tutorial/pyplastimatch

In [None]:
# dicomsort is a Python package that sorts DICOM files into a
# folder structure based on user-specified DICOM attributes
!git clone https://github.com/pieper/dicomsort.git

## Data selection

The model is trained to segment organs in chest CT. In the following cell we initialize a variable that holds the DICOM `SeriesInstanceUID` of the specific CT image (more precisely, the specific DICOM CT image series) that we will use with the segmentation tool. Using this unique identifier you can retrieve the imaging series from either TCIA or IDC.

Once you are done with this example, we will give you instructions on how to find more chest CT image series that you can use to experiment with the nnU-Net segmentation tool!

In [None]:
my_SeriesInstanceUID = "1.3.6.1.4.1.32722.99.99.232988001551799080335895423941323261228"

## Visualization and download of data from IDC

In order to work with Google Cloud, you will need to have a GCP project configured with billing enabled. If you completed the prerequisites, you should have your project ID handy - please enter it in the following cell.


In [None]:
# initialize this variable with your project ID!
# if you are working with this notebook during RSNA 2021, completed the prerequisites
# and your application has been approved, you should be able to use the project ID
# specified here without changes
my_ProjectID = "idc-external-016"

import os
os.environ["GCP_PROJECT_ID"] = my_ProjectID

In the next cell your google account will be authenticated so that you can interact with the GCP resources. Follow the prompts and enter the verification code.

In [None]:
from google.colab import auth
auth.authenticate_user()

Given the `SeriesInstanceUID` value identifying the image series, we can query the IDC metadata table to get the list of files (defined by the Google Storage URLs) corresponding to this series. You can perform this query using the command line tools from the Google Cloud SDK, the BigQuery Python API, or the `%%bigquery` magic. We will use the Python API, since it is the easiest to parameterize.

All of the DICOM metadata for each of the DICOM files is available in the BigQuery table we will be querying. We will get not just the `gcs_url`, but also identifiers for the Study, Series and Instance, to better understand organization of data, and since `StudyInstanceUID` will be handy later when we get to the visualization of the data.

In [None]:
from google.cloud import bigquery
bq_client = bigquery.Client(my_ProjectID)

selection_query = f"\
  SELECT  \
    StudyInstanceUID, \
    SeriesInstanceUID, \
    SOPInstanceUID, \
    gcs_url \
  FROM \
    `bigquery-public-data.idc_current.dicom_all` \
  WHERE \
    SeriesInstanceUID = \"{my_SeriesInstanceUID}\""

selection_result = bq_client.query(selection_query)
selection_df = selection_result.result().to_dataframe()
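
The query above interpolates the UID into the SQL text with an f-string. As an alternative, the BigQuery Python client also supports named query parameters, which keep the SQL text fixed and pass values separately. The following is a minimal sketch of that approach; the names `SERIES_QUERY` and `fetch_series_files` are hypothetical, and running the function assumes an authenticated session and a valid project ID, as in the cells above.

```python
SERIES_QUERY = """
SELECT
  StudyInstanceUID,
  SeriesInstanceUID,
  SOPInstanceUID,
  gcs_url
FROM
  `bigquery-public-data.idc_current.dicom_all`
WHERE
  SeriesInstanceUID = @series_uid
"""

def fetch_series_files(project_id, series_uid):
    """Run SERIES_QUERY for one series and return the result as a pandas DataFrame."""
    # imported lazily so that defining the function does not require the client library
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            # binds the Python value to the @series_uid placeholder in the SQL text
            bigquery.ScalarQueryParameter("series_uid", "STRING", series_uid),
        ]
    )
    return client.query(SERIES_QUERY, job_config=job_config).result().to_dataframe()
```

Under those assumptions, `fetch_series_files(my_ProjectID, my_SeriesInstanceUID)` should return the same table as the f-string version, without any need to escape quotes inside the query.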

Let's look at the resulting table. Each row corresponds to a single DICOM file, which can be downloaded using the `gcs_url` URL. The values of `StudyInstanceUID` and `SeriesInstanceUID` are identical for all files, since they belong to the same study and series, but `SOPInstanceUID` values uniquely identify the specific DICOM file (instance).

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

selection_df

We will initialize `StudyInstanceUID` as we will need it for several steps in the following cells.

In [None]:
import pandas as pd
my_StudyInstanceUIDs = selection_df['StudyInstanceUID'].unique()
my_StudyInstanceUIDs.sort()
my_StudyInstanceUID = my_StudyInstanceUIDs[0]

We can use the IDC radiology image viewer to visualize the series and study of interest. This can be done using the identifiers we have in the dataframe we obtained in the earlier query.

In [None]:
def get_idc_viewer_url(studyUID, seriesUID=None):
  url = "https://viewer.imaging.datacommons.cancer.gov/viewer/"+studyUID
  if seriesUID is not None:
    url = url+"?seriesInstanceUID="+seriesUID
  return url

print("URL to view the entire study:")
print(get_idc_viewer_url(my_StudyInstanceUID))
print()
print("URL to view the specific series:")
print(get_idc_viewer_url(my_StudyInstanceUID, my_SeriesInstanceUID))

Now that we confirmed the series we identified is indeed a CT of the chest, let's download the DICOM files for this series to the Colab VM. We will use the `gsutil` command line tool to fetch each of the files defined by the list of GCS URLs saved in a plain text file.

If you want to download a large number of files from IDC, make sure to check out [this documentation article](https://learn.canceridc.dev/data/downloading-data) to learn about performance optimizations of the download!

In [None]:
# save the list of GCS URLs into a file
import os
idc_download_folder = "/content/IDC_downloads"
if not os.path.exists(idc_download_folder):
  os.mkdir(idc_download_folder)

selection_manifest = os.path.join(idc_download_folder, "idc_manifest.txt")
selection_df["gcs_url"].to_csv(selection_manifest, header=False, index=False)

In [None]:
# confirm the resulting manifest has as many lines as the number of rows in the
# dataframe we initialized earlier
!cat /content/IDC_downloads/idc_manifest.txt |wc
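
The same check can be done in Python, which also makes it easy to verify what the entries look like. This is a minimal sketch: the helper name `read_manifest` is hypothetical, the demonstration uses placeholder URLs in a temporary file, and it assumes that manifest entries are `gs://` URLs.

```python
import os
import tempfile

def read_manifest(path):
    """Return the non-empty lines of a manifest file, stripped of whitespace."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

# Demonstration with placeholder URLs (not real IDC files)
demo_urls = ["gs://example-bucket/a.dcm", "gs://example-bucket/b.dcm"]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_manifest = os.path.join(tmp_dir, "idc_manifest.txt")
    with open(demo_manifest, "w") as f:
        f.write("\n".join(demo_urls) + "\n")
    entries = read_manifest(demo_manifest)

# every entry should be a Google Storage URL, and none should be lost
all_gcs = all(entry.startswith("gs://") for entry in entries)
print(len(entries), all_gcs)
```

In the notebook, comparing `len(read_manifest(selection_manifest))` with `len(selection_df)` would serve the same purpose as the `wc` call.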

In [None]:
# let's make sure the download folder is clean, in case you ran this cell earlier
# for a different dataset
!rm -rf /content/IDC_downloads/*.dcm

In [None]:
%%capture
# download is this simple!

!cat /content/IDC_downloads/idc_manifest.txt | gsutil -m cp -I /content/IDC_downloads


Now you have the data ready for the next steps of the processing. If you are interested in how to download the same series using the TCIA API, continue to the next section. Otherwise you can skip to **Sort the DICOM files**.

## Download of data from TCIA

For the sake of simplicity, we will download images for a specific case that we know has CT of the chest, and segmentations of the organs of interest. This time, let's utilize the [NBIA REST API](https://wiki.cancerimagingarchive.net/x/fILTB) instead of the NBIA Data Retriever to download the data.



In [None]:
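import requests
# NOTE: my_CollectionID and my_PatientID are not set anywhere above in this notebook;
# initialize them with the TCIA collection and patient of interest before running this cell
params = {"Collection":my_CollectionID, "PatientID":my_PatientID}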
r = requests.get("https://services.cancerimagingarchive.net/nbia-api/services/v1/getSeries", params=params)
if r.status_code == 200:
  df = pd.read_json(r.text)
else:
  print(f"Failed with {r.status_code}")

In [None]:
import json
tcia_manifest_json = json.loads(r.text)


tcia_selection_df = pd.read_json(r.text)

distinct_StudyInstanceUIDs = tcia_selection_df['StudyInstanceUID'].unique()
distinct_StudyInstanceUIDs.sort()
print("Distinct values of StudyInstanceUID:")
print('\n'.join(distinct_StudyInstanceUIDs))

study_uid = distinct_StudyInstanceUIDs[0]
print(f"\nStudy that will be analyzed: {study_uid}")

In [None]:
!mkdir -p /content/TCIA_downloads && rm -rf /content/TCIA_downloads/*.dcm

In [None]:
# download zip file with the series instances
import requests, os, zipfile

tcia_download_folder = "/content/TCIA_downloads"
params = {"SeriesInstanceUID":my_SeriesInstanceUID}
image_request = requests.get("https://services.cancerimagingarchive.net/nbia-api/services/v1/getImage", params=params, stream=True)
print(f"Completed request: {image_request.url}")
if image_request.status_code == 200:
  series_zip_name = os.path.join(tcia_download_folder, f"{my_SeriesInstanceUID}.zip")
  with open(series_zip_name, "wb") as f:
    for chunk in image_request.iter_content(chunk_size=1024):
      f.write(chunk)

  # extract individual instances from the series zip file
  series_folder_name = os.path.join(tcia_download_folder, my_SeriesInstanceUID)
  if not os.path.exists(series_folder_name):
    os.mkdir(series_folder_name)
  with zipfile.ZipFile(series_zip_name, 'r') as zip_ref:
    zip_ref.extractall(series_folder_name)
else:
  print(f"Failed with {image_request.status_code}")

In [None]:
import json, zipfile

series_list = tcia_selection_df[tcia_selection_df["StudyInstanceUID"]==study_uid]["SeriesInstanceUID"].unique()

tcia_download_folder = "/content/TCIA_downloads"
if not os.path.exists(tcia_download_folder):
  os.mkdir(tcia_download_folder)

for series_uid in series_list:

  # download zip file with the series instances
  params = {"SeriesInstanceUID":series_uid}
  image_request = requests.get("https://services.cancerimagingarchive.net/nbia-api/services/v1/getImage", params=params, stream=True)
  print(f"Completed request: {image_request.url}")
  if image_request.status_code == 200:
    series_zip_name = os.path.join(tcia_download_folder, f"{series_uid}.zip")
    with open(series_zip_name, "wb") as f:
      for chunk in image_request.iter_content(chunk_size=1024):
        f.write(chunk)
    print(f"Downloaded and saved series {series_uid}")

    # extract individual instances from the series zip file
    series_folder_name = os.path.join(tcia_download_folder, series_uid)
    if not os.path.exists(series_folder_name):
      os.mkdir(series_folder_name)
    with zipfile.ZipFile(series_zip_name, 'r') as zip_ref:
      zip_ref.extractall(series_folder_name)
  else:
    print(f"Failed with {image_request.status_code}")

Now we will extract DICOM attributes of interest to enable more convenient exploration and subsetting of series within the study.

In [None]:
import os
import pydicom

# DICOM attributes to extract from each downloaded file
dicom_attributes = ["PatientID", "StudyInstanceUID", "SeriesInstanceUID", "SOPInstanceUID", "Modality", "SeriesDescription"]

selection_dict = []
for root, _, files in os.walk(tcia_download_folder):
  for file in files:
    if file.endswith(".dcm"):
      # read only the header: pixel data is not needed for the metadata table
      dcm = pydicom.dcmread(os.path.join(root, file), stop_before_pixels=True)
      dict_item = {}
      for attr in dicom_attributes:
        try:
          dict_item[attr] = dcm.data_element(attr).value
        except (AttributeError, KeyError) as e:
          #print(f"Failed to find {attr} in {file}! Skipping.")
          dict_item[attr] = None
      selection_dict.append(dict_item)
tcia_selection_df = pd.DataFrame(selection_dict)

In [None]:
tcia_selection_df

In [None]:
pd.set_option('display.max_colwidth', None)

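# my_StudyInstanceUID was initialized earlier from the IDC query result; this summary
# requires a selection_df that includes the Modality column
selection_df[selection_df["StudyInstanceUID"]==my_StudyInstanceUID].groupby(['SeriesInstanceUID','Modality']).size().reset_index().rename(columns={0:'count'})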

In [None]:
selection_df

In [None]:
### %%bigquery --project=idc-tcia  --params={"patient_id":"LUNG1-002","collection_id":"NSCLC-Radiomics"} case002_df 
dicom_attributes = ["PatientID", "StudyInstanceUID", "SeriesInstanceUID", "SOPInstanceUID", "Modality", "SeriesDescription"]
dicom_attributes_str = ','.join(dicom_attributes)
dicom_attributes_str

TODO: replace the below with BQ python client to simplify parameterization

In [None]:
%%bigquery --project=idc-tcia  --params={"patient_id":"100014","collection_id":"NLST"} selection_df 

WITH
  idc_manifest AS (
  SELECT
    PatientID,
    StudyInstanceUID,
    SeriesInstanceUID,
    SOPInstanceUID,
    Modality,
    SeriesDescription,
    gcs_url,
    collection_id as idc_collection_id
  FROM
    `bigquery-public-data.idc_current.dicom_all`
  WHERE
    PatientID = @patient_id
)
SELECT
  idc_manifest.*,
  # this is necessary since collection IDs used internally by IDC and TCIA are a bit different,
  # so we need to get the TCIA collection ID that will be recognized by TCIA API
  aux_table.tcia_api_collection_id
FROM
  idc_manifest
JOIN
  `bigquery-public-data.idc_current.auxiliary_metadata` AS aux_table
ON
  idc_manifest.SOPInstanceUID = aux_table.SOPInstanceUID
WHERE 
    # PatientID is unique and parameterization by collection_id is not really necessary,
    # but we use it here for consistency with the query we use with NBIA API, which does
    # require collection ID to be specified
    aux_table.tcia_api_collection_id = @collection_id


In [None]:
from google.cloud import bigquery
bq_client = bigquery.Client(my_ProjectID)

selection_query = f"\
  WITH idc_manifest AS ( \
  SELECT {dicom_attributes_str}, \
    gcs_url, \
    collection_id as idc_collection_id \
  FROM \
    `bigquery-public-data.idc_current.dicom_all` \
  WHERE \
    PatientID = \"{my_PatientID}\" \
) \
SELECT \
  idc_manifest.*, \
  # this is necessary since collection IDs used internally by IDC and TCIA are a bit different,\n \
  # so we need to get the TCIA collection ID that will be recognized by TCIA API\n \
  aux_table.tcia_api_collection_id \
FROM \
  idc_manifest \
JOIN \
  `bigquery-public-data.idc_current.auxiliary_metadata` AS aux_table \
ON \
  idc_manifest.SOPInstanceUID = aux_table.SOPInstanceUID \
WHERE \
    # PatientID is unique and parameterization by collection_id is not really necessary,\n \
    # but we use it here for consistency with the query we use with NBIA API, which does\n \
    # require collection ID to be specified\n \
    aux_table.tcia_api_collection_id = \"{my_CollectionID}\"" 

print(selection_query)
selection_result = bq_client.query(selection_query)
selection_df = selection_result.result().to_dataframe()


## Sort the DICOM files

Let's check that the content downloaded from the two repositories is identical, just in case. This section applies only if you downloaded the data from both IDC and TCIA.

In [None]:
!mkdir -p IDC_sorted && mkdir -p TCIA_sorted
!python dicomsort/dicomsort.py -k -u IDC_downloads IDC_sorted/%PatientID/%StudyInstanceUID/%SeriesInstanceUID/%SOPInstanceUID.dcm
!python dicomsort/dicomsort.py -k -u TCIA_downloads TCIA_sorted/%PatientID/%StudyInstanceUID/%SeriesInstanceUID/%SOPInstanceUID.dcm
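
To make the target layout concrete, here is a minimal sketch of how such a path template maps attribute values to a file location. It is an illustration only, not the `dicomsort` implementation: the function name `expand_sort_template` is hypothetical, and the attribute values are placeholders.

```python
def expand_sort_template(template, attributes):
    """Replace each %Keyword in the template with its value from `attributes`."""
    path = template
    # replace longer keywords first so that a keyword which is a prefix
    # of another one cannot clobber it
    for keyword in sorted(attributes, key=len, reverse=True):
        path = path.replace("%" + keyword, str(attributes[keyword]))
    return path

# Placeholder attribute values (not taken from a real DICOM file)
demo_attributes = {
    "PatientID": "PATIENT-001",
    "StudyInstanceUID": "1.2.3",
    "SeriesInstanceUID": "1.2.3.4",
    "SOPInstanceUID": "1.2.3.4.5",
}
demo_template = "IDC_sorted/%PatientID/%StudyInstanceUID/%SeriesInstanceUID/%SOPInstanceUID.dcm"
sorted_path = expand_sort_template(demo_template, demo_attributes)
print(sorted_path)
```

The resulting patient/study/series folder hierarchy is what the preprocessing code later relies on when it builds the path to the CT series folder.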

Run the cell below only if you downloaded the image series using both IDC and TCIA routes.

In [None]:
# compare if the files downloaded for the series are the same between IDC and TCIA
!diff -r IDC_sorted TCIA_sorted
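
If you prefer to run the comparison from Python, one option is to hash every file in both trees and compare the results. This is a minimal sketch using only the standard library; the helper names `hash_tree` and `trees_identical` are hypothetical, and the demonstration runs on two small temporary folders rather than on the downloaded data.

```python
import hashlib
import os
import tempfile

def hash_tree(root):
    """Map each file path (relative to root) to the SHA-256 digest of its content."""
    digests = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            with open(full_path, "rb") as f:
                digests[os.path.relpath(full_path, root)] = hashlib.sha256(f.read()).hexdigest()
    return digests

def trees_identical(root_a, root_b):
    """True if both trees contain the same relative paths with identical content."""
    return hash_tree(root_a) == hash_tree(root_b)

# Demonstration on two temporary folders
with tempfile.TemporaryDirectory() as dir_a, tempfile.TemporaryDirectory() as dir_b:
    for root in (dir_a, dir_b):
        os.makedirs(os.path.join(root, "series"))
        with open(os.path.join(root, "series", "1.dcm"), "wb") as f:
            f.write(b"same bytes")
    same_before = trees_identical(dir_a, dir_b)

    # change one file in the second tree and compare again
    with open(os.path.join(dir_b, "series", "1.dcm"), "wb") as f:
        f.write(b"different bytes")
    same_after = trees_identical(dir_a, dir_b)

print(same_before, same_after)
```

In the notebook, `trees_identical("IDC_sorted", "TCIA_sorted")` would play the role of the `diff -r` call.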

Move the sorted data into the right place

In [None]:
!mkdir -p tutorial/data/dicom && rm -rf tutorial/data/dicom/* && mv IDC_sorted/* tutorial/data/dicom

## Data Pre-processing

In order to run the AI segmentation pipeline, we need to convert the DICOM data into the format required by nnU-Net.

Using the simple Plastimatch wrapper, let's convert the DICOM CT series to both NRRD (very flexible, simple handling with SimpleITK) and NIfTI (as required by the nnU-Net pipeline) formats.


In [None]:
from tutorial.pyplastimatch import pyplastimatch as pypla
from tutorial.pyplastimatch.utils import viz as viz_utils
from tutorial.pyplastimatch.utils import data as data_utils

In [None]:
pat = os.listdir(dicom_sorted_dir)[0]

# my_StudyInstanceUID and my_SeriesInstanceUID were initialized earlier,
# when we decided which study and series to download

# directories storing NRRD and NIfTI files
base_preproc_path = "tutorial/data/processed"

pat_dir_path_nrrd = os.path.join(base_preproc_path, "nrrd", pat)
pat_dir_path_nii = os.path.join(base_preproc_path, "nii", pat)
  
# patient subfolder where all the preprocessed NRRDs will be stored
if not os.path.exists(pat_dir_path_nrrd): os.makedirs(pat_dir_path_nrrd)
  
# patient subfolder where all the preprocessed NIfTIs will be stored
if not os.path.exists(pat_dir_path_nii): os.makedirs(pat_dir_path_nii)

# path to the directory where the DICOM CT file is stored
path_to_ct_dir = os.path.join("tutorial/data/dicom", pat,
                              my_StudyInstanceUID, my_SeriesInstanceUID)

# path to the files where the NRRD and NIfTI CTs will be stored
ct_nrrd_path = os.path.join(pat_dir_path_nrrd, pat + "_ct.nrrd")
ct_nii_path = os.path.join(pat_dir_path_nii, pat + "_ct.nii.gz")

verbose = True

# logfile for the plastimatch conversion
log_file_path_nrrd = os.path.join(pat_dir_path_nrrd, pat + '_pypla.log')
log_file_path_nii = os.path.join(pat_dir_path_nii, pat + '_pypla.log')
  
# DICOM CT to NRRD conversion (if the file doesn't exist yet)
if not os.path.exists(ct_nrrd_path):
  convert_args_ct = {"input" : path_to_ct_dir,
                     "output-img" : ct_nrrd_path}
  
  # clean old log file if it exists
  if os.path.exists(log_file_path_nrrd): os.remove(log_file_path_nrrd)
  
  pypla.convert(verbose = verbose, path_to_log_file = log_file_path_nrrd, **convert_args_ct)

# DICOM CT to NIfTI conversion (if the file doesn't exist yet)
if not os.path.exists(ct_nii_path):
  convert_args_nii = {"input" : path_to_ct_dir, 
                      "output-img" : ct_nii_path}
  
  # clean old log file if it exists
  if os.path.exists(log_file_path_nii): os.remove(log_file_path_nii)
  
  pypla.convert(verbose = verbose, path_to_log_file = log_file_path_nii, **convert_args_nii)
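
The keys of the `convert_args_*` dictionaries above (`input`, `output-img`) have the same names as options of the `plastimatch convert` command. As an illustration of how a thin wrapper could turn such a dictionary into a command line, here is a sketch. It is not the actual pyplastimatch implementation, and `build_convert_command` is a hypothetical name.

```python
def build_convert_command(**convert_args):
    """Turn keyword options into a `plastimatch convert` argument list."""
    command = ["plastimatch", "convert"]
    for option, value in convert_args.items():
        # each dictionary key becomes a long option, e.g. "input" -> "--input"
        command += ["--" + option, str(value)]
    return command


# placeholder paths, for illustration only
print(build_convert_command(**{"input": "path/to/ct_dir", "output-img": "ct.nrrd"}))
```

The resulting list could be passed to `subprocess.run`, assuming the `plastimatch` executable is installed and on the `PATH`.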

As the final step before running the segmentation pipeline, we need to make sure the folder storing the data follows the structure required by the nnU-Net framework, described on the [dedicated documentation page](https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/dataset_conversion.md).

In [None]:
# create a folder (random task name) for nnU-Net inference
proc_folder_path = os.path.join(os.environ["nnUNet_raw_data_base"],
                                "segthor", "imagesTs")

!mkdir -p $proc_folder_path

# populate the folder following the nnU-Net naming conventions
copy_path = os.path.join(proc_folder_path, pat + "_0000.nii.gz")

# copy NIfTI to the right dir for nnU-Net processing
if not os.path.exists(copy_path):
  shutil.copy(src = ct_nii_path, dst = copy_path)
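
The cell above relies on the nnU-Net naming convention: each input file is named after the case, followed by a four-digit channel (modality) index, while the prediction is named after the case alone. A small sketch of that mapping follows; the helper names are hypothetical and the case identifier is a placeholder.

```python
def nnunet_input_name(case_id, channel=0):
    """File name nnU-Net expects for one input channel of a case."""
    return f"{case_id}_{channel:04d}.nii.gz"


def nnunet_prediction_name(input_name):
    """Strip the channel suffix to get the file name of the prediction."""
    suffix = ".nii.gz"
    stem = input_name[:-len(suffix)]
    case_id, _channel = stem.rsplit("_", 1)
    return case_id + suffix


print(nnunet_input_name("PAT-001"))                         # PAT-001_0000.nii.gz
print(nnunet_prediction_name("PAT-001_0000.nii.gz"))        # PAT-001.nii.gz
```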

## Segmentation of thoracic structures from CT series

### Inference 

In order to run the segmentation pipeline, we can follow the ["run inference" section of the nnU-Net documentation](https://github.com/MIC-DKFZ/nnUNet#how-to-run-inference-with-pretrained-models), specifying the path to the input and output folders defined in the sections above, and the pretrained model we want to use (i.e., the one we downloaded earlier).

For the purpose of this notebook, to make the processing faster, we are not going to use an ensemble of different U-Net configurations for inference or test time augmentation (TTA). You are invited to explore these options later - and if you decide to do so, you can read [this example](https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/inference_example_Prostate.md) from the nnU-Net documentation to learn how this can be achieved.

To learn more about all the arguments that can be specified to the `nnUNet_predict` command, run `nnUNet_predict --help`.

The following step will take a couple of minutes.

In [None]:
# run the inference phase
# accepted options for --model are: 2d, 3d_lowres, 3d_fullres or 3d_cascade_fullres
!nnUNet_predict --input_folder "tutorial/data/nnUNet_raw_data/segthor/imagesTs" \
                --output_folder $RESULTS_FOLDER \
                --task_name "Task055_SegTHOR" --model 2d --disable_tta 

### Post-processing of inference results

After the inference is finished, we can convert the segmentation masks back to NRRD for visualisation purposes and for easier handling.

In [None]:
pred_nii_path = os.path.join(os.environ["RESULTS_FOLDER"], pat + ".nii.gz")

sitk_ct = sitk.ReadImage(ct_nrrd_path)

nrrd_spacing = sitk_ct.GetSpacing()
nrrd_dim = sitk_ct.GetSize()

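# load the prediction once and read its geometry from the image object
pred_nii = nib.load(pred_nii_path)
nii_spacing = tuple(pred_nii.header['pixdim'][1:4])
nii_dim = tuple(pred_nii.shape)

# the predicted mask must share the voxel spacing and size of the input CT
assert (nrrd_spacing == nii_spacing) and (nrrd_dim == nii_dim)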

## ----------------------------------------
# NIfTI TO NRRD CONVERSION

# path to the output NRRD file (inferred segmasks)
pred_nrrd_path = os.path.join(pat_dir_path_nrrd, pat + "_pred_segthor.nrrd")
log_file_path = os.path.join(pat_dir_path_nrrd, pat + "_pypla.log")

# Inferred NIfTI segmask to NRRD
convert_args_pred = {"input" : pred_nii_path, 
                     "output-img" : pred_nrrd_path}

pypla.convert(path_to_log_file = log_file_path, **convert_args_pred)
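
The assertion in the cell above compares the voxel spacings for exact equality. NIfTI headers store `pixdim` as 32-bit floats, whereas SimpleITK reports spacing as 64-bit floats, so for some datasets an exact comparison may fail even when the geometry matches. A tolerant check could look like the sketch below; `same_geometry` is a hypothetical helper and the tolerance is an arbitrary choice.

```python
import math


def same_geometry(spacing_a, size_a, spacing_b, size_b, rel_tol=1e-5):
    """True if two volumes have equal size and (nearly) equal voxel spacing."""
    sizes_match = tuple(size_a) == tuple(size_b)
    spacings_match = len(spacing_a) == len(spacing_b) and all(
        math.isclose(a, b, rel_tol=rel_tol)
        for a, b in zip(spacing_a, spacing_b)
    )
    return sizes_match and spacings_match
```

In the notebook it would be called as `same_geometry(nrrd_spacing, nrrd_dim, nii_spacing, nii_dim)`.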

### Visualising segmentation results

We can visualise the raw AI-inferred segmentation mask (heart, aorta, esophagus, and trachea, shown in green, yellow, red, and blue, respectively) and compare the heart (and esophagus, if available for the randomly selected patient) segmentation to the manual delineation.

In [None]:
# load NRRD volumes
ct_nrrd = sitk.GetArrayFromImage(sitk_ct)

# inferred segmask
pred_nrrd_segthor = sitk.GetArrayFromImage(sitk.ReadImage(pred_nrrd_path))

pred_nrrd_esophagus = np.copy(pred_nrrd_segthor)
pred_nrrd_heart = np.copy(pred_nrrd_segthor)
pred_nrrd_trachea = np.copy(pred_nrrd_segthor)
pred_nrrd_aorta = np.copy(pred_nrrd_segthor)
  
# zero every segmask other than the esophagus and make the mask binary (0/1)
pred_nrrd_esophagus[pred_nrrd_segthor != 1] = 0
pred_nrrd_esophagus[pred_nrrd_esophagus != 0] = 1
  
# zero every segmask other than the heart and make the mask binary (0/1)
pred_nrrd_heart[pred_nrrd_segthor != 2] = 0
pred_nrrd_heart[pred_nrrd_heart != 0] = 1
  
# zero every segmask other than the trachea and make the mask binary (0/1)
pred_nrrd_trachea[pred_nrrd_segthor != 3] = 0
pred_nrrd_trachea[pred_nrrd_trachea != 0] = 1
  
# zero every segmask other than the aorta and make the mask binary (0/1)
pred_nrrd_aorta[pred_nrrd_segthor != 4] = 0
pred_nrrd_aorta[pred_nrrd_aorta != 0] = 1
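
The four copy-and-threshold blocks above follow the same pattern, so they could be collapsed into one helper. A sketch, assuming the label values used in the cell above (1 esophagus, 2 heart, 3 trachea, 4 aorta); `SEGTHOR_LABELS` and `split_labels` are hypothetical names.

```python
import numpy as np

# label values as used in the cell above
SEGTHOR_LABELS = {"Esophagus": 1, "Heart": 2, "Trachea": 3, "Aorta": 4}


def split_labels(label_volume, labels=SEGTHOR_LABELS):
    """Return one binary (0/1) mask per structure from a multi-label volume."""
    return {name: (label_volume == value).astype(np.uint8)
            for name, value in labels.items()}


# tiny made-up label array, for illustration only
toy_labels = np.array([[0, 1, 2], [3, 4, 2]])
print(split_labels(toy_labels)["Heart"])
```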


In [None]:
_ = viz_utils.AxialSliceSegmaskViz(ct_volume = ct_nrrd,
                                   segmask_dict = {"Heart" : pred_nrrd_heart,
                                                   "Aorta" : pred_nrrd_aorta,
                                                   "Trachea" : pred_nrrd_trachea,
                                                   "Esophagus" : pred_nrrd_esophagus},
                                   segmask_cmap_dict = {"Heart" : my_greens,
                                                        "Aorta" : my_spring,
                                                        "Esophagus" : my_reds,
                                                        "Trachea" : my_blues},
                                   dpi = 100)

## Comparison of segmentations with existing annotations

**NOTE: The cells in this section are applicable for the specific study/series we use as the example (`SeriesInstanceUID = 1.3.6.1.4.1.32722.99.99.203715003805996641695765332389135385095`). If you experiment with other series, this will either not be applicable, or you will need to adjust the code.**

Many of the collections contain annotations alongside images, which can be used to train new algorithms, or to evaluate performance of algorithms. We can view the entire imaging study, as we did earlier, to see what annotations are available for the series we segmented. As you can see, this study contains segmentations of organs saved in RTSTRUCT and SEG series.

In [None]:
print(get_idc_viewer_url(my_StudyInstanceUID))

We can get the names of the structures in the RTSTRUCT series that are available in the same DICOM study using the following query:

```sql
SELECT
  SeriesInstanceUID,
  structureSetROISequence.ROIName AS ROIName
FROM
 `bigquery-public-data.idc_current.dicom_all`
CROSS JOIN
 UNNEST (StructureSetROISequence) AS structureSetROISequence
WHERE
 Modality = "RTSTRUCT" AND StudyInstanceUID = <my_StudyInstanceUID>
 ```

In [None]:
rt_selection_query = f"\
  SELECT \
    SeriesInstanceUID,\
    structureSetROISequence.ROIName AS ROIName, \
    gcs_url \
  FROM \
    `bigquery-public-data.idc_current.dicom_all` \
  CROSS JOIN \
    UNNEST (StructureSetROISequence) AS structureSetROISequence \
  WHERE \
    Modality = \"RTSTRUCT\" AND StudyInstanceUID = \"{my_StudyInstanceUID}\""

rt_selection_result = bq_client.query(rt_selection_query)
rt_selection_df = rt_selection_result.result().to_dataframe()

rt_selection_df

As you can see, in this specific case, there is a single RTSTRUCT series that has segmentations of the structures that we also segmented in the previous step using nnU-Net. Let's download the corresponding DICOM file, and convert RTSTRUCT into a representation that we can use for visualization.

In [None]:
!mkdir -p /content/IDC_downloads/RTSTRUCT
!rm -rf /content/IDC_downloads/RTSTRUCT/*.dcm

In [None]:
path_to_rt_dir = "/content/IDC_downloads/RTSTRUCT"
rt_selection_manifest = "/content/IDC_downloads/rt_manifest.txt"
# the query returns one row per ROI, so keep each file URL only once
rt_selection_df["gcs_url"].drop_duplicates().to_csv(rt_selection_manifest, header=False, index=False)

!cat /content/IDC_downloads/rt_manifest.txt | gsutil -m cp -I /content/IDC_downloads/RTSTRUCT

In [None]:
# path to the files where the NRRD RTSTRUCTs will be stored
rt_folder = os.path.join(pat_dir_path_nrrd, "RTSTRUCT")
rt_list_path = os.path.join(pat_dir_path_nrrd, "RTSTRUCT_content")

# DICOM RTSTRUCT to NRRD conversion (if the file doesn't exist yet)
if not os.path.exists(rt_folder):
  convert_args_rt = {"input" : path_to_rt_dir, 
                     "referenced-ct" : path_to_ct_dir,
                     "output-prefix" : rt_folder,
                     "prefix-format" : 'nrrd',
                     "output-ss-list" : rt_list_path}
  
  # clean old log file if it exists
  if os.path.exists(log_file_path_nrrd): os.remove(log_file_path_nrrd)
  
  pypla.convert(verbose = verbose, path_to_log_file = log_file_path_nrrd, **convert_args_rt)


In [None]:
# manual segmask (from the RTSTRUCT)
rt_segmask_heart = os.path.join(pat_dir_path_nrrd, "RTSTRUCT", "Heart.nrrd")
rt_nrrd_heart = sitk.GetArrayFromImage(sitk.ReadImage(rt_segmask_heart))

try:
  rt_segmask_esophagus = os.path.join(pat_dir_path_nrrd, "RTSTRUCT", "Esophagus.nrrd")
  rt_nrrd_esophagus = sitk.GetArrayFromImage(sitk.ReadImage(rt_segmask_esophagus))
except RuntimeError:
  # SimpleITK raises RuntimeError when the file cannot be read;
  # for the sake of simplicity, fill the volume with zeros
  # (so that we can keep the code that comes after the same)
  rt_nrrd_esophagus = np.zeros(rt_nrrd_heart.shape)

In [None]:
_ = viz_utils.AxialSliceSegmaskComparison(ct_volume = ct_nrrd,
                                          segmask_ai_dict = {"Heart" : pred_nrrd_heart,
                                                             "Aorta" : pred_nrrd_aorta,
                                                             "Trachea" : pred_nrrd_trachea,
                                                             "Esophagus" : pred_nrrd_esophagus},
                                          segmask_manual_dict = {"Heart" : rt_nrrd_heart,
                                                                 "Esophagus" : rt_nrrd_esophagus},
                                          segmask_cmap_dict = {"Heart" : my_greens,
                                                               "Aorta" : my_spring,
                                                               "Esophagus" : my_reds,
                                                               "Trachea" : my_blues},
                                          dpi = 100)

### Quantitative assessment of the results

Let's start by defining a function to compute the center of mass (CoM) of the segmentation masks. Before computing the common segmentation metrics, the CoM can give us a rough idea of how different the 3D delineations are and if there are any major labelling errors (which we could correct, e.g., with a largest connected component analysis).

We will base our function on the [implementation](https://github.com/AIM-Harvard/pyradiomics/blob/master/radiomics/generalinfo.py) found in the open source [PyRadiomics library](https://github.com/AIM-Harvard/pyradiomics).

In [None]:
def getCenterOfMassIndexValue(input_mask):
    
    """
    Returns z, y and x coordinates of the center of mass of the ROI in terms of
    the image coordinate space (continuous index).

    Calculation is based on the original (non-resampled) mask.
    Because this represents the continuous index, the order of x, y and z is reversed,
    i.e. the first element is the z index, the second the y index and the last element is the x index.

    @params:
      input_mask - required : numpy (binary) volume storing the segmentation mask.

    """

    if input_mask is not None:
      mask_coordinates = np.array(np.where(input_mask == 1))
      center_index = np.mean(mask_coordinates, axis = 1)
      return tuple(center_index)
    else:
      return None
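
To make the axis order concrete, here is the same computation on a tiny made-up mask with two foreground voxels. This is a compact restatement for illustration; `center_of_mass_index` is a hypothetical name, and the result is the mean of the voxel indices along each array axis, in (z, y, x) order.

```python
import numpy as np


def center_of_mass_index(mask):
    """Mean index of the voxels equal to 1, one value per array axis (z, y, x)."""
    coordinates = np.array(np.where(mask == 1))
    return tuple(float(c) for c in coordinates.mean(axis=1))


toy_mask = np.zeros((3, 4, 5), dtype=np.uint8)
toy_mask[0, 1, 2] = 1
toy_mask[2, 3, 4] = 1
print(center_of_mass_index(toy_mask))  # (1.0, 2.0, 3.0)
```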

In [None]:
com_manual_heart = np.array(getCenterOfMassIndexValue(rt_nrrd_heart))
com_manual_heart_int = np.ceil(com_manual_heart).astype(dtype = np.uint16)

com_raw_heart = np.array(getCenterOfMassIndexValue(pred_nrrd_heart))
com_raw_heart_int = np.ceil(com_raw_heart).astype(dtype = np.uint16)

print("Heart Center of Mass (raw AI segmentation) \t:", com_raw_heart_int)
print("Heart Center of Mass (manual segmentation) \t:", com_manual_heart_int)

## ----------------------------------------

# run this if and only if a manual esophagus segmentation mask is available
if np.sum(rt_nrrd_esophagus):

  com_manual_esophagus = np.array(getCenterOfMassIndexValue(rt_nrrd_esophagus))
  com_manual_esophagus_int = np.ceil(com_manual_esophagus).astype(dtype = np.uint16)

  com_raw_esophagus = np.array(getCenterOfMassIndexValue(pred_nrrd_esophagus))
  com_raw_esophagus_int = np.ceil(com_raw_esophagus).astype(dtype = np.uint16)

  print("\nEsophagus Center of Mass (raw AI segmentation) \t:", com_raw_esophagus_int)
  print("Esophagus Center of Mass (manual segmentation) \t:", com_manual_esophagus_int)

Another common way to evaluate the quality of the segmentation is computing the Dice Coefficient between the AI segmentation and the manual one. To do so, we will use [MedPy's implementation of the Dice coefficient](https://loli.github.io/medpy/generated/medpy.metric.binary.dc.html#medpy-metric-binary-dc) (for binary masks).

We can use other MedPy functions to compute the Hausdorff distance and the average surface distance as well*.

_*In most cases, the Hausdorff distance will be quite high for the heart segmentation and, if a manual delineation is available for the randomly selected patient, for the esophagus as well. This does not necessarily indicate that the model performs poorly: the segmentation guidelines of the two datasets (the one the nnU-Net model was trained on and the independent external validation dataset pulled from IDC) may differ significantly._
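
For reference, the Dice coefficient of two binary masks A and B is 2|A ∩ B| / (|A| + |B|): 1 means perfect overlap and 0 means none. The sketch below spells out that arithmetic with NumPy. It is an illustration, not MedPy's code; `dice` is a hypothetical name, and returning 0 for two empty masks is a convention chosen for this sketch.

```python
import numpy as np


def dice(mask_a, mask_b):
    """Dice coefficient of two binary masks of the same shape."""
    a = np.asarray(mask_a).astype(bool)
    b = np.asarray(mask_b).astype(bool)
    denominator = a.sum() + b.sum()
    if denominator == 0:
        return 0.0  # convention chosen here for two empty masks
    return 2.0 * np.logical_and(a, b).sum() / denominator


# one voxel in common, two foreground voxels in each mask: 2 * 1 / (2 + 2)
print(dice([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.5
```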

In [None]:
pred_nrrd_path = os.path.join(pat_dir_path_nrrd, pat + "_pred_segthor.nrrd")


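# SimpleITK reports spacing as (x, y, z), while the arrays returned by
# GetArrayFromImage are indexed (z, y, x): reverse it to match the array axes
voxel_spacing = list(sitk_ct.GetSpacing())[::-1]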

dc_heart = dice_coef(pred_nrrd_heart, rt_nrrd_heart)
hd_heart = hausdorff_distance(pred_nrrd_heart, rt_nrrd_heart, voxelspacing = voxel_spacing)
asd_heart = avg_surf_distance(pred_nrrd_heart, rt_nrrd_heart, voxelspacing = voxel_spacing)

print("Heart Dice Coefficient (raw segmentation) :", dc_heart)
print("Heart Hausdorff Distance (raw segmentation) [mm]:", hd_heart)
print("Heart Average Surface Distance (raw segmentation) [mm]:", asd_heart)


# run this if and only if a manual esophagus segmentation mask is available
if np.sum(rt_nrrd_esophagus):
  dc_esophagus = dice_coef(pred_nrrd_esophagus, rt_nrrd_esophagus)
  hd_esophagus = hausdorff_distance(pred_nrrd_esophagus, rt_nrrd_esophagus, voxelspacing = voxel_spacing)
  asd_esophagus = avg_surf_distance(pred_nrrd_esophagus, rt_nrrd_esophagus, voxelspacing = voxel_spacing)

  print("\nEsophagus Dice Coefficient (raw segmentation) :", dc_esophagus)
  print("Esophagus Hausdorff Distance (raw segmentation) [mm]:", hd_esophagus)
  print("Esophagus Average Surface Distance (raw segmentation) [mm]:", asd_esophagus)



## Finding more relevant images to test

Let's get a better idea of the CT series we segmented. We can query the IDC BigQuery DICOM metadata table to retrieve some relevant attributes.

```sql
SELECT
  DISTINCT(Manufacturer),
  ManufacturerModelName,
  BodyPartExamined
FROM
 `bigquery-public-data.idc_current.dicom_all`
WHERE
 SeriesInstanceUID = <my_SeriesInstanceUID>
```

In [None]:
details_selection_query = f"\
  SELECT \
    DISTINCT(Manufacturer), \
    ManufacturerModelName, \
    BodyPartExamined \
  FROM \
    `bigquery-public-data.idc_current.dicom_all` \
  WHERE \
    SeriesInstanceUID = \"{my_SeriesInstanceUID}\""

details_selection_result = bq_client.query(details_selection_query)
details_selection_df = details_selection_result.result().to_dataframe()

details_selection_df

We can easily check what other combinations of `Manufacturer` and `ManufacturerModelName` we have in store, to test generalizability of the segmentation network we have been using. In the query below, we will get all unique combinations of `Manufacturer`/`ManufacturerModelName`, and count how many CT studies that have `BodyPartExamined = LUNG` are available for that specific combination.

In a similar fashion, you can utilize any of the DICOM attributes to select representative data to use in testing.

Note that such explorations can also be done using IDC portal or IDC DataStudio dashboard. Exploratory SQL queries are best done in the [BigQuery console](https://console.cloud.google.com/bigquery).

```sql
SELECT
  Manufacturer,
  ARRAY_TO_STRING(ARRAY_AGG(DISTINCT(ManufacturerModelName)),',') AS ManufacturerModelNames,
  COUNT(DISTINCT(StudyInstanceUID)) AS number_of_studies
FROM
  `bigquery-public-data.idc_current.dicom_all`
WHERE
  Modality = "CT"
  AND BodyPartExamined = "LUNG"
GROUP BY
  Manufacturer
ORDER BY
  number_of_studies DESC
```

In [None]:
sample_selection_query = f"\
SELECT \
  Manufacturer, \
  ARRAY_TO_STRING(ARRAY_AGG(DISTINCT(ManufacturerModelName)),',') as ManufacturerModelNames, \
  COUNT(DISTINCT(StudyInstanceUID)) AS number_of_studies \
FROM \
  `bigquery-public-data.idc_current.dicom_all` \
WHERE \
  Modality = \"CT\" \
  AND BodyPartExamined = \"LUNG\" \
GROUP BY \
  Manufacturer \
ORDER BY \
  number_of_studies DESC"

sample_selection_result = bq_client.query(sample_selection_query)
sample_selection_df = sample_selection_result.result().to_dataframe()

sample_selection_df
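
To make the aggregation in the query above concrete, the same grouping can be sketched in plain Python on a handful of made-up records. The vendor, model and UID values below are invented for illustration, and the sketch assumes the table holds one row per DICOM instance; `studies_per_manufacturer` is a hypothetical name.

```python
from collections import defaultdict


def studies_per_manufacturer(records):
    """Mimic the GROUP BY Manufacturer aggregation on a list of dicts."""
    models = defaultdict(set)
    studies = defaultdict(set)
    for record in records:
        manufacturer = record["Manufacturer"]
        models[manufacturer].add(record["ManufacturerModelName"])
        studies[manufacturer].add(record["StudyInstanceUID"])
    # model names are sorted here only to make the output deterministic
    rows = [(name, ",".join(sorted(models[name])), len(studies[name]))
            for name in studies]
    # ORDER BY number_of_studies DESC
    return sorted(rows, key=lambda row: row[2], reverse=True)


# made-up records standing in for table rows
toy_records = [
    {"Manufacturer": "VendorA", "ManufacturerModelName": "ModelX", "StudyInstanceUID": "1.1"},
    {"Manufacturer": "VendorA", "ManufacturerModelName": "ModelX", "StudyInstanceUID": "1.1"},
    {"Manufacturer": "VendorA", "ManufacturerModelName": "ModelY", "StudyInstanceUID": "1.2"},
    {"Manufacturer": "VendorB", "ManufacturerModelName": "ModelZ", "StudyInstanceUID": "2.1"},
]
print(studies_per_manufacturer(toy_records))
```

Because studies are counted with `COUNT(DISTINCT ...)`, the two instances that share study `1.1` contribute a single study to the total.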

Next, we can query for a study that has the `Manufacturer`/`ManufacturerModelName` combination we want to test, and visualize that study.


```sql
SELECT
  SeriesInstanceUID,
  ARRAY_TO_STRING(ARRAY_AGG(DISTINCT(SeriesDescription)),',') AS SeriesDescriptions,
  ARRAY_TO_STRING(ARRAY_AGG(DISTINCT(StudyInstanceUID)),',') AS StudyInstanceUIDs,  
  COUNT(DISTINCT(SOPInstanceUID)) AS numberOfInstances
FROM
  `bigquery-public-data.idc_current.dicom_all`
WHERE
  Modality = "CT"
  AND BodyPartExamined = "LUNG"
  AND Manufacturer = "GE MEDICAL SYSTEMS"
  AND ManufacturerModelName = "LightSpeed Xtra"
GROUP BY
  SeriesInstanceUID
```

This query returns three `SeriesInstanceUID`s.



In [None]:
print(get_idc_viewer_url("1.3.6.1.4.1.14519.5.2.1.3023.4012.146358232297157912733174589662"))

Let's re-initialize `my_SeriesInstanceUID` with the series from this study and re-run the segmentation steps: run the cell below, then return to the **Visualization and download of data from IDC** section of the notebook!

In [None]:
my_SeriesInstanceUID = "1.3.6.1.4.1.14519.5.2.1.3023.4012.162275549801143329076803363880"

## I want to train my network, not run inference - what do I do?

Google Colab limits both the resources available to you and the time for which you can use this free service. If you plan to train deep learning models, consider GCP [Vertex AI Notebooks](https://cloud.google.com/vertex-ai-workbench), a paid service that lets you define highly configurable AI development environments accessible via the JupyterLab interface. You can [apply for a sponsored project from IDC](https://learn.canceridc.dev/introduction/requesting-gcp-cloud-credits) to experiment with those capabilities at no cost to you.