From dfdffc0fdc37bce1179da9464f512a53b8b8e53f Mon Sep 17 00:00:00 2001 From: Andrey Fedorov Date: Thu, 28 Nov 2024 13:32:37 -0500 Subject: [PATCH] add RSNA'24 DLL notebook --- notebooks/labs/idc_rsna2024.ipynb | 3024 +++++++++++++++++++++++++++++ 1 file changed, 3024 insertions(+) create mode 100644 notebooks/labs/idc_rsna2024.ipynb diff --git a/notebooks/labs/idc_rsna2024.ipynb b/notebooks/labs/idc_rsna2024.ipynb new file mode 100644 index 0000000..ee33301 --- /dev/null +++ b/notebooks/labs/idc_rsna2024.ipynb @@ -0,0 +1,3024 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "view-in-github" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KmXfYFZtja2F" + }, + "source": [ + "# Getting started with Imaging Data Commons: Download, searching, visualization, attribution\n", + "\n", + "---\n", + "\n", + "## Summary\n", + "\n", + "[NCI Imaging Data Commons (IDC)](https://imaging.datacommons.cancer.gov/explore/) is a cloud-based environment containing publicly available cancer imaging data co-located with the analysis and exploration tools and resources. 
IDC is a node within the broader NCI Cancer Research Data Commons (CRDC) infrastructure that provides secure access to a large, comprehensive, and expanding collection of cancer research data.\n", + "\n", + "This notebook is part of the [RSNA 2024 Deep Learning Lab series](https://github.com/RSNA/AI-Deep-Learning-Lab-2024), introducing NCI Imaging Data Commons to users who want to access it programmatically.\n", + "\n", + "Upon completion of this tutorial, you will learn:\n", + "* how to download images from IDC\n", + "* how to load and view DICOM images available from IDC\n", + "* what metadata attributes are available for filtering and how to filter data programmatically\n", + "* how to better leverage IDC visualization capabilities by programmatically generating viewer URLs for selected items, and embedding the viewer in your notebook\n", + "* how to get the licensing information for the individual files, and how to comply with the attribution usage requirement, which is in place for most of the data in IDC\n", + "\n", + "To learn more about IDC, check out [this documentation page](https://learn.canceridc.dev/getting-started-with-idc) and the [IDC-Tutorials](https://github.com/ImagingDataCommons/IDC-Tutorials) repository!\n", + "\n", + "---\n", + "Initial version: Nov 2024" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Pr-mqH76gtz7" + }, + "source": [ + "## Warnings: read these to avoid errors!\n", + "\n", + "In order to avoid errors, please keep in mind the following warnings:\n", + "\n", + "1. **Execute each cell of the notebook in order without skipping** any code cells - otherwise you will likely encounter runtime errors!\n", + "2. If you are going through this notebook for the first time, **do not change any code until you have successfully completed the notebook** (unless you are specifically asked to change something). 
Otherwise you may run into errors.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "a4Q-kRDW77Iy" + }, + "source": [ + "## Prerequisites\n", + "\n", + "This tutorial relies on [`idc-index`](https://github.com/ImagingDataCommons/idc-index) - a Python package that accompanies IDC and provides basic functionality for searching and accessing data from IDC.\n", + "\n", + "We will install `idc-index` using `pip`. This package is under active development, so we will use the `--upgrade` option to get the latest version. It should take around 1-2 minutes to install the dependencies and complete the next cell." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "cellView": "form", + "id": "bDGChJBK9ooq" + }, + "outputs": [], + "source": [ + "%%capture\n", + "!pip install idc-index --upgrade" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LwRoq_JxAOgD" + }, + "source": [ + "## Downloading data from IDC\n", + "\n", + "One of the first questions we get from the users of IDC is \"How do I download the images?\"\n", + "\n", + "With `idc-index`, answering this question is easy.\n", + "\n", + "We will head to the IDC Portal, which provides a basic interface to explore the data available in IDC. As you move around the IDC Portal explore page, you will be able to copy identifiers corresponding to the content available in IDC at the different levels of data hierarchy - starting from entire collections, down to the individual image series.\n", + "\n", + "Try this out yourself! Open the IDC Portal explore page [https://portal.imaging.datacommons.cancer.gov/explore/](https://portal.imaging.datacommons.cancer.gov/explore/) in a separate window, and experiment with copying these identifiers. 
Identifiers for collections and patients/cases will be short strings, while unique identifiers for studies and series will be rather long strings of text that contain numbers and \".\" characters.\n", + "\n", + "![IDC Portal demo](https://github.com/ImagingDataCommons/IDC-Tutorials/releases/download/0.2.0/Tutorial-copy.gif)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "erQWF7viAOgD" + }, + "source": [ + "The identifier you have in the clipboard is all you need to access the corresponding images. To download those, we will first instantiate `IDCClient`, which is provided by `idc-index`." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "id": "p26lTqcmAOgE" + }, + "outputs": [], + "source": [ + "from idc_index import IDCClient\n", + "\n", + "idc_client = IDCClient()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IFxe5O3oAOgE" + }, + "source": [ + "In the following cell, we pass the identifier of a study to the function `download_from_selection()` to fetch the corresponding files. For the sake of the tutorial, please proceed with the identifier used in the following cell. 
You can experiment with using the identifiers you copied from the portal after you have completed the notebook.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Wan_LznEAOgE", + "outputId": "1fb5376b-2f60-4b6f-ca8e-8fb5d23d394a" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Downloading data: 100%|██████████| 314M/314M [00:04<00:00, 67.2MB/s]\n" + ] + } + ], + "source": [ + "studyInstanceUID = \"1.2.840.113654.2.55.68425808326883186792123057288612355322\"\n", + "\n", + "\n", + "idc_client.download_from_selection(studyInstanceUID=studyInstanceUID, downloadDir=\".\")\n", + "\n", + "# You can experiment with downloading all of the data for a patient by its PatientID, or for entire collections\n", + "# by uncommenting the lines below!\n", + "#idc_client.download_from_selection(patientId=\"LUNG1-025\", downloadDir=\".\")\n", + "#idc_client.download_from_selection(collection_id=\"ct_phantom4radiomics\", downloadDir=\".\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0s2Ok9iK5Ae-" + }, + "source": [ + "You can also download from IDC using the command-line helper tool that is installed as part of the `idc-index` package. The cell below shows how to download that same study from the command line." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "jyEEknBR5P1V", + "outputId": "4e6ebf93-84db-42eb-9926-1a9b6d4ce870" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2024-11-28 17:08:47,168 - Downloading from IDC v20 index\n", + "2024-11-28 17:08:47,346 - Identified matching StudyInstanceUID: ['1.2.840.113654.2.55.68425808326883186792123057288612355322']\n", + "2024-11-28 17:08:47,415 - Total free space on disk: 206.840569856 GB\n", + "2024-11-28 17:08:47,666 - Not using s5cmd sync as the destination folder is empty or sync or progress bar is not requested\n", + "2024-11-28 17:08:47,677 - Initial size of the directory: 66.35 MB\n", + "2024-11-28 17:08:47,680 - Approximate size of the files that need to be downloaded: 314.45 MB\n", + "Downloading data: 79% 248M/314M [00:05<00:01, 47.1MB/s]\n", + "2024-11-28 17:08:52,956 - Successfully downloaded files to /content\n" + ] + } + ], + "source": [ + "!idc download 1.2.840.113654.2.55.68425808326883186792123057288612355322 --download-dir ." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zxRdcx_kAOgE" + }, + "source": [ + "Once the download is completed, you can check the current directory to examine the content. Note that files are downloaded in to a folder hierarchy to make navigation easier. 
The default hierarchy is `collection_id` > `PatientID` > `StudyInstanceUID` > `Modality`_`SeriesInstanceUID`, and the files you downloaded should be organized as shown below.\n", + "\n", + "```\n", + "├── nlst <--- collection ID\n", + "│ └── 100002 <--- Patient ID\n", + "│ └── 1.2.840.113654.2.55.68425808326883186792123057288612355322 <--- Study ID\n", + "│ ├── CT_1.2.840.113654.2.55.229650531101716203536241646069123704792 <--- Series ID\n", + "│ │ ├── 0025b198-6198-4b33-85cf-92582531ad28.dcm <--- individual instances/files the series\n", + "│ │ ├── 00a93dcb-4cd0-46ca-ae41-9240421cb0c7.dcm (corresponding to slices in radiology and\n", + "│ │ ├── 00aa5957-1b4e-4b8b-879f-c505e12f2dcc.dcm resolution layers in digital pathology)\n", + "│ │ ├── 02da9050-622c-4e00-a0d4-a70dd418973c.dcm\n", + "...\n", + "```\n", + "\n", + "The DICOM study we downloaded is from the National Lung Screening Trial (NLST) collection available in IDC. This study contains two Computed Tomography (CT) DICOM series, two DICOM Segmentations (SEG) with the results of segmenting the CT series using [TotalSegmentator](https://github.com/wasserth/TotalSegmentator), and 4 DICOM Structured Reporting series with the radiomics features extracted from the segmented images. Most of the CT images in the NLST collection in IDC are accompanied by TotalSegmentator segmentations and corresponding radiomics features!" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JI9_m50rK0qb" + }, + "source": [ + "## Loading and visualizing IDC DICOM images and segmentations\n", + "\n", + "In this tutorial we will focus on visualization of radiology images.\n", + "\n", + "Images available from IDC are stored using DICOM format. This is the format that is used virtually for all of the images produced by radiological medical imaging equipment. It is also increasingly adopted by the manufacturers of the digital pathology equipment.\n", + "\n", + "DICOM format is supported by many open source tools and libraries. 
In the following cells we will use:\n", + "* [ITK](https://itk.org/) for loading DICOM images\n", + "* [pydicom-seg]() for loading DICOM segmentations\n", + "* and [itkWidgets](https://itkwidgets.readthedocs.io/en/latest/) for visualization.\n", + "\n", + "In the following cell, which should take about a minute to complete, we will install these prerequisites. You may see some errors about incompatible packages in the end of the installation. Those should be harmless." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "ASa94UB1iPC4", + "outputId": "0da65592-8c64-4335-ebb2-37ff4ae50f90" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001b[?25l \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/4.0 MB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m\r\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.5/4.0 MB\u001b[0m \u001b[31m75.7 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m \u001b[32m4.0/4.0 MB\u001b[0m \u001b[31m92.4 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.0/4.0 MB\u001b[0m \u001b[31m55.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m152.6/152.6 kB\u001b[0m \u001b[31m11.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h Preparing metadata (setup.py) ... 
\u001b[?25l\u001b[?25hdone\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.8/1.8 MB\u001b[0m \u001b[31m49.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m28.0/28.0 MB\u001b[0m \u001b[31m22.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m80.2/80.2 MB\u001b[0m \u001b[31m8.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m44.5/44.5 kB\u001b[0m \u001b[31m3.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m75.8/75.8 kB\u001b[0m \u001b[31m5.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m56.3/56.3 kB\u001b[0m \u001b[31m3.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m52.4/52.4 MB\u001b[0m \u001b[31m14.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m179.9/179.9 kB\u001b[0m \u001b[31m13.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m8.6/8.6 MB\u001b[0m \u001b[31m54.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m210.7/210.7 kB\u001b[0m \u001b[31m15.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.1/3.1 MB\u001b[0m \u001b[31m78.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m117.7/117.7 kB\u001b[0m 
\u001b[31m8.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m6.4/6.4 MB\u001b[0m \u001b[31m92.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m94.9/94.9 kB\u001b[0m \u001b[31m7.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m168.2/168.2 kB\u001b[0m \u001b[31m11.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m442.1/442.1 kB\u001b[0m \u001b[31m27.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m73.2/73.2 kB\u001b[0m \u001b[31m5.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.8/3.8 MB\u001b[0m \u001b[31m88.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m442.6/442.6 kB\u001b[0m \u001b[31m29.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m6.0/6.0 MB\u001b[0m \u001b[31m13.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m63.8/63.8 kB\u001b[0m \u001b[31m5.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h Building wheel for asciitree (setup.py) ... \u001b[?25l\u001b[?25hdone\n", + " Building wheel for elfinder-client (setup.py) ... \u001b[?25l\u001b[?25hdone\n", + "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts.\n", + "albumentations 1.4.20 requires pydantic>=2.7.0, but you have pydantic 1.10.19 which is incompatible.\n", + "langchain 0.3.7 requires pydantic<3.0.0,>=2.7.4, but you have pydantic 1.10.19 which is incompatible.\n", + "langchain-core 0.3.19 requires pydantic<3.0.0,>=2.5.2; python_full_version < \"3.12.4\", but you have pydantic 1.10.19 which is incompatible.\u001b[0m\u001b[31m\n", + "\u001b[0m" + ] + } + ], + "source": [ + "# Install the packages required for sorting and loading the data as well as visualization\n", + "!pip install -q \"pydicom<3.0.0\" pydicom-seg \"itk-io>=5.3.0\" \"itkwidgets[all]>=1.0a32\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fuB8D90qjPWm" + }, + "source": [ + "Next we will import the necessary packages, and will load the CT image from the folder with the files we downloaded earlier." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "id": "TnY_Hox9k0d8" + }, + "outputs": [], + "source": [ + "import itk\n", + "from itkwidgets import view\n", + "# Pydicom-Seg is a layer on top of pydicom that handles DICOM SEG objects.\n", + "import pydicom\n", + "import pydicom_seg\n", + "import numpy as np" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "id": "KwmtVQYiHupY" + }, + "outputs": [], + "source": [ + "ct_image_path = \"./nlst/100002/1.2.840.113654.2.55.68425808326883186792123057288612355322/CT_1.2.840.113654.2.55.229650531101716203536241646069123704792\"\n", + "\n", + "ct_image = itk.imread(ct_image_path)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GOGNhOlMjcXp" + }, + "source": [ + "We can next print the details about the loaded image. If the size of the image you loaded shows up as [512, 512, 126], then everything worked as expected!" 
+ ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "9XjCyv3zIjl3", + "outputId": "3bd3bf11-d213-48c7-809b-33c64ef8b4ad" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Image (0x587603d04560)\n", + " RTTI typeinfo: itk::Image\n", + " Reference Count: 1\n", + " Modified Time: 25563\n", + " Debug: Off\n", + " Object Name: \n", + " Observers: \n", + " none\n", + " Source: (none)\n", + " Source output name: (none)\n", + " Release Data: Off\n", + " Data Released: False\n", + " Global Release Data: Off\n", + " PipelineMTime: 222\n", + " UpdateMTime: 25562\n", + " RealTimeStamp: 0 seconds \n", + " LargestPossibleRegion: \n", + " Dimension: 3\n", + " Index: [0, 0, 0]\n", + " Size: [512, 512, 126]\n", + " BufferedRegion: \n", + " Dimension: 3\n", + " Index: [0, 0, 0]\n", + " Size: [512, 512, 126]\n", + " RequestedRegion: \n", + " Dimension: 3\n", + " Index: [0, 0, 0]\n", + " Size: [512, 512, 126]\n", + " Spacing: [0.703125, 0.703125, 2.5]\n", + " Origin: [-169.6, -180, -301.545]\n", + " Direction: \n", + "1 0 0\n", + "0 1 0\n", + "0 0 1\n", + "\n", + " IndexToPointMatrix: \n", + "0.703125 0 0\n", + "0 0.703125 0\n", + "0 0 2.5\n", + "\n", + " PointToIndexMatrix: \n", + "1.42222 0 0\n", + "0 1.42222 0\n", + "0 0 0.4\n", + "\n", + " Inverse Direction: \n", + "1 0 0\n", + "0 1 0\n", + "0 0 1\n", + "\n", + " PixelContainer: \n", + " ImportImageContainer (0x5876081c1830)\n", + " RTTI typeinfo: itk::ImportImageContainer\n", + " Reference Count: 1\n", + " Modified Time: 611\n", + " Debug: Off\n", + " Object Name: \n", + " Observers: \n", + " none\n", + " Pointer: 0x794bd4fc4010\n", + " Container manages memory: true\n", + " Size: 33030144\n", + " Capacity: 33030144\n", + "\n" + ] + } + ], + "source": [ + "print(ct_image)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1rjWvHbyjndB" + }, + "source": [ + "Next we will load DICOM SEG series 
included in the study. We will use the `pydicom-seg` library to load the segmentation as an array, which we will then convert into an ITK image and resample onto the grid of the CT image we loaded earlier." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "id": "kakilA1TIqJm" + }, + "outputs": [], + "source": [ + "# Read the DICOM SEG object using pydicom and pydicom_seg.\n", + "seg_image_path = \"./nlst/100002/1.2.840.113654.2.55.68425808326883186792123057288612355322/SEG_1.2.276.0.7230010.3.1.3.313263360.15851.1706325185.577017/9629abf6-d1de-4931-bc6a-061890ae275c.dcm\"\n", + "seg_dicom = pydicom.dcmread(seg_image_path)\n", + "seg_reader = pydicom_seg.MultiClassReader()\n", + "seg_obj = seg_reader.read(seg_dicom)\n", + "\n", + "# Convert the DICOM SEG object into an itk image, with correct voxel origin, spacing, and directions in physical space.\n", + "seg_image = itk.image_from_array(seg_obj.data.astype(np.float32))\n", + "seg_image.SetOrigin(seg_obj.origin)\n", + "seg_image.SetSpacing(seg_obj.spacing)\n", + "seg_image.SetDirection(seg_obj.direction)\n", + "interpolator = itk.NearestNeighborInterpolateImageFunction.New(seg_image)\n", + "seg_image = itk.resample_image_filter(Input=seg_image,\n", + " Interpolator=interpolator,\n", + " reference_image=ct_image,\n", + " use_reference_image=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-i_6fZcUmOPz" + }, + "source": [ + "Now that we have both the CT image and the SEG image loaded as ITK images, we can visualize them using the itkWidgets `view()` function.\n", + "\n", + "Note that the following cell may take about a minute to complete. Once the cell execution is done, you will need to wait a bit longer for the visualization widget to populate in the output cell, and a bit longer again (up to several minutes) for the image to properly render. 
While everything is being loaded, you will see a spinning icon next to the word \"Image\" in the upper left corner of the widget to indicate the content is being processed.\n", + "\n", + "**Please be patient!** Until the image is fully processed by the widget, the visualization may not show up correctly. The image below shows what the viewer should look like after everything has loaded correctly.\n", + "\n", + "\"select\"" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 567 + }, + "id": "PK7eCblnJV12", + "outputId": "30477108-6ede-4ac5-a7c1-668adf64a663" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " " + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "\n", + "\n", + "
\n", + "
\n", + "
\n", + "
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "viewer = view(ct_image, label_image=seg_image, ui_collapsed=False)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VtT3d2P1qSZa" + }, + "source": [ + "As you can see, IDC data is easy to load and visualize using popular open source libraries!" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zrbT7voq7yU5" + }, + "source": [ + "## Searching IDC data\n", + "\n", + "In the previous exercise, we identified the data by navigating the IDC Portal. IDC Portal can be very helpful if you want to browse through the data, quickly check out individual images, or get an idea about what is available.\n", + "\n", + "You can also filter IDC data and explore what is available programmatically: `idc-index` to the rescue!\n", + "\n", + "`idc-index` is named this way because it wraps _index_ of IDC data: a table containing most important metadata attributes describing the files available in IDC. 
This metadata index is available in the `index` variable (which is a pandas `DataFrame`) of `IDCClient`, which contains the following columns.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "xfCV61zeKMFp", + "outputId": "1e6f27dd-684d-45fc-b167-cb938415d8bc" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['collection_id' 'analysis_result_id' 'PatientID' 'SeriesInstanceUID'\n", + " 'StudyInstanceUID' 'source_DOI' 'PatientAge' 'PatientSex' 'StudyDate'\n", + " 'StudyDescription' 'BodyPartExamined' 'Modality' 'Manufacturer'\n", + " 'ManufacturerModelName' 'SeriesDate' 'SeriesDescription' 'SeriesNumber'\n", + " 'instanceCount' 'license_short_name' 'series_aws_url' 'series_size_MB'\n", + " 'crdc_series_uuid']\n" + ] + } + ], + "source": [ + "print(idc_client.index.columns.values)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3CX9qqezKjIC" + }, + "source": [ + "We will discuss just a few of those columns - you can learn about those not discussed in [this `idc-index` documentation page](https://idc-index.readthedocs.io/en/latest/column_descriptions.html).\n", + "\n", + "IDC is using DICOM for data representation, and in the DICOM data model, patients (identified by **`PatientID`** column) undergo imaging exams (or _studies_, in DICOM nomenclature).\n", + "\n", + "Each patient will have one or more studies, with each study identified uniquely by the attribute **`StudyInstanceUID`**. During each of the imaging studies one or more imaging _series_ will be collected. As an example, a Computed Tomography (CT) imaging study may include a volume sweep before and after administration of the contrast agent. 
Imaging series are uniquely identified by **`SeriesInstanceUID`**.\n", + "\n", + "Individual collections within IDC group patients/cases, and are recognized by their **`collection_id`** values.\n", + "\n", + "The function we used earlier - `download_from_selection()` - can be used to download images given any of these identifiers: `collection_id`,\n", + "`PatientID`, `StudyInstanceUID`, or `SeriesInstanceUID`." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "H32c6a6PLLdj", + "outputId": "4d04ac54-cf8b-44a3-a3ce-12734eb7bee6" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Help on method download_from_selection in module idc_index.index:\n", + "\n", + "download_from_selection(downloadDir, dry_run=False, collection_id=None, patientId=None, studyInstanceUID=None, seriesInstanceUID=None, sopInstanceUID=None, crdc_series_uuid=None, quiet=True, show_progress_bar=True, use_s5cmd_sync=False, dirTemplate='%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID') method of idc_index.index.IDCClient instance\n", + " Download the files corresponding to the selection. The filtering will be applied in sequence (but does it matter?) by first selecting the collection(s), followed by\n", + " patient(s), study(studies) and series. 
If no filtering is applied, all the files will be downloaded.\n", + " \n", + " Args:\n", + " downloadDir: string containing the path to the directory to download the files to\n", + " dry_run: calculates the size of the cohort but download does not start\n", + " collection_id: string or list of strings containing the values of collection_id to filter by\n", + " patientId: string or list of strings containing the values of PatientID to filter by\n", + " studyInstanceUID: string or list of strings containing the values of DICOM StudyInstanceUID to filter by\n", + " seriesInstanceUID: string or list of strings containing the values of DICOM SeriesInstanceUID to filter by\n", + " sopInstanceUID: string or list of strings containing the values of DICOM SOPInstanceUID to filter by\n", + " crdc_series_uuid: string or list of strings containing the values of crdc_series_uuid to filter by\n", + " quiet (bool): If True, suppresses the output of the subprocess. Defaults to True\n", + " show_progress_bar (bool): If True, tracks the progress of download\n", + " use_s5cmd_sync (bool): If True, will use s5cmd sync operation instead of cp when downloadDirectory is not empty; this can significantly improve the download speed if the content is partially downloaded\n", + " dirTemplate (str): Download directory hierarchy template. This variable defines the folder hierarchy for the organizing the downloaded files in downloadDirectory. Defaults to index.DOWNLOAD_HIERARCHY_DEFAULT set to %collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID. The template string can be built using a combination of selected metadata attributes (PatientID, collection_id, Modality, StudyInstanceUID, SeriesInstanceUID) that must be prefixed by '%'. The following special characters can be used as separators: '-' (hyphen), '/' (slash for subdirectories), '_' (underscore). 
When set to None all files will be downloaded to the download directory with no subdirectories.\n", + "\n" + ] + } + ], + "source": [ + "help(idc_client.download_from_selection)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xjK236ZILQg9" + }, + "source": [ + "Any of the columns included in the index can be used to build filters, or selection queries, subsetting content that meets specific requirements.\n", + "\n", + "For any of the index columns, we can check its unique values to first understand what is available.\n", + "\n", + "As an example, let's look at the `Modality` column, which contains an abbreviation encoding the type of image." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "xku_WpzQL_1y", + "outputId": "c7fe5aa8-0a1d-43b3-ce91-15f55e18c1c1" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['CT' 'PT' 'MR' 'SM' 'DX' 'SEG' 'SR' 'MG' 'RTSTRUCT' 'M3D' 'CR' 'RTPLAN'\n", + " 'US' 'PR' 'RTDOSE' 'NM' 'XA' 'KO' 'REG' 'SC' 'RWV' 'XC' 'FUSION' 'OT'\n", + " 'RF' 'ANN']\n" + ] + } + ], + "source": [ + "print(idc_client.index['Modality'].unique())" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UfV7xNiiN_Oe" + }, + "source": [ + "As an exercise in using the index to subset IDC data, we will search for all Magnetic Resonance series. This can be done using the `Modality` column we used above. If you want to know what each of the abbreviations above stands for, check out this page from the DICOM standard: https://dicom.nema.org/medical/dicom/current/output/chtml/part03/sect_C.7.3.html#sect_C.7.3.1.1.1. For the purposes of this exercise, the \"MR\" value corresponds to \"Magnetic Resonance\".\n", + "\n", + "`index` is just a pandas `DataFrame`, and you can use pandas syntax to do the selection, as shown in the next cell." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "id": "9FQrOSgZSO0q" + }, + "outputs": [], + "source": [ + "mr_selection_pd = idc_client.index[idc_client.index['Modality'] == \"MR\"]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5BQer6HtSdkN" + }, + "source": [ + "As an alternative, you can use Structured Query Language (SQL). The following cell accomplishes the same task using an SQL query." + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "id": "hEp_eHpSScqT" + }, + "outputs": [], + "source": [ + "query = \"\"\"\n", + "SELECT *\n", + "FROM index\n", + "WHERE Modality = 'MR'\n", + "\"\"\"\n", + "\n", + "mr_selection_sql = idc_client.sql_query(query)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sitGwE0KTBEF" + }, + "source": [ + "In both cases, the result will be returned as a pandas `DataFrame` containing rows that have \"Modality\" set to \"MR\". You can see that in both cases, there are 122707 series that meet this selection criterion." + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 982 + }, + "id": "xOu3sGIwS5ok", + "outputId": "87d416c1-227a-44b4-c131-b9d0f78aa13f" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Index.format is deprecated and will be removed in a future version. Convert using index.astype(str) or index.map(formatter) instead.\n" + ] + }, + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "mr_selection_sql" + }, + "text/html": [ + "\n", + "
\n" + ], + "text/plain": [ + " collection_id analysis_result_id PatientID \\\n", + "0 pdmr_292921_168_r None 292921-168-R-1415 \n", + "1 upenn_gbm None UPENN-GBM-00073 \n", + "2 upenn_gbm None UPENN-GBM-00393 \n", + "3 breast_mri_nact_pilot None UCSF-BR-66 \n", + "4 upenn_gbm None UPENN-GBM-00398 \n", + "... ... ... ... \n", + "123157 ispy1 None ISPY1_1045 \n", + "123158 ispy1 None ISPY1_1135 \n", + "123159 ispy1 None ISPY1_1065 \n", + "123160 ispy1 None ISPY1_1117 \n", + "123161 ispy1 None ISPY1_1155 \n", + "\n", + " SeriesInstanceUID \\\n", + "0 2.25.52562508346911983761523927257987425894 \n", + "1 1.3.6.1.4.1.14519.5.2.1.5542074552050843981810... \n", + "2 1.3.6.1.4.1.14519.5.2.1.3194947574846003675214... \n", + "3 1.3.6.1.4.1.14519.5.2.1.7695.2311.961504369910... \n", + "4 1.3.6.1.4.1.14519.5.2.1.1321370275496387006647... \n", + "... ... \n", + "123157 1.3.6.1.4.1.14519.5.2.1.7695.1700.252777698659... \n", + "123158 1.3.6.1.4.1.14519.5.2.1.7695.1700.615305979619... \n", + "123159 1.3.6.1.4.1.14519.5.2.1.7695.1700.154037209906... \n", + "123160 1.3.6.1.4.1.14519.5.2.1.7695.1700.121966219660... \n", + "123161 1.3.6.1.4.1.14519.5.2.1.7695.1700.236782822497... \n", + "\n", + " StudyInstanceUID \\\n", + "0 2.25.208756230218968583908466437463380881762 \n", + "1 1.3.6.1.4.1.14519.5.2.1.2908533414156045303969... \n", + "2 1.3.6.1.4.1.14519.5.2.1.2816649104808466125578... \n", + "3 1.3.6.1.4.1.14519.5.2.1.7695.2311.195372761058... \n", + "4 1.3.6.1.4.1.14519.5.2.1.7948613579489379867954... \n", + "... ... \n", + "123157 1.3.6.1.4.1.14519.5.2.1.7695.1700.118428709069... \n", + "123158 1.3.6.1.4.1.14519.5.2.1.7695.1700.232588206214... \n", + "123159 1.3.6.1.4.1.14519.5.2.1.7695.1700.319023455610... \n", + "123160 1.3.6.1.4.1.14519.5.2.1.7695.1700.642706103327... \n", + "123161 1.3.6.1.4.1.14519.5.2.1.7695.1700.233301049640... 
\n", + "\n", + " source_DOI PatientAge PatientSex StudyDate \\\n", + "0 10.7937/tcia.2020.pcak-8z10 None M 2019-03-20 \n", + "1 10.7937/tcia.709x-dn49 068Y M 2002-11-23 \n", + "2 10.7937/tcia.709x-dn49 069Y F 2010-12-10 \n", + "3 10.7937/k9/tcia.2016.qhsyhjky 048Y F 1992-01-07 \n", + "4 10.7937/tcia.709x-dn49 069Y M 2010-11-19 \n", + "... ... ... ... ... \n", + "123157 10.7937/k9/tcia.2016.hdhpgjlk 049Y F 1986-01-23 \n", + "123158 10.7937/k9/tcia.2016.hdhpgjlk 059Y F 1987-08-02 \n", + "123159 10.7937/k9/tcia.2016.hdhpgjlk None F 1986-04-10 \n", + "123160 10.7937/k9/tcia.2016.hdhpgjlk 049Y F 1986-11-08 \n", + "123161 10.7937/k9/tcia.2016.hdhpgjlk 042Y F 1987-06-07 \n", + "\n", + " StudyDescription ... Manufacturer \\\n", + "0 NCI PDMR Tumor Characterization ... Philips Medical Systems \n", + "1 MRI BRAIN W/INJ/MHDI ... SIEMENS \n", + "2 BRAIN^BRAIN ... SIEMENS \n", + "3 MR BREAS, UNIT ... GE MEDICAL SYSTEMS \n", + "4 BRAIN^BRAIN ... SIEMENS \n", + "... ... ... ... \n", + "123157 MRI BR UNILAT W&WO CONT LEFT ... GE MEDICAL SYSTEMS \n", + "123158 MR BREASTUNI UE ... GE MEDICAL SYSTEMS \n", + "123159 MR BREAST WO/W CONT ... Philips Medical Systems \n", + "123160 MR BREASTUNI UE ... GE MEDICAL SYSTEMS \n", + "123161 HUP5 PROTOCOLS^BREAST ... SIEMENS \n", + "\n", + " ManufacturerModelName SeriesDate \\\n", + "0 Achieva 2019-03-20 \n", + "1 TrioTim 2002-11-23 \n", + "2 TrioTim 2010-12-10 \n", + "3 GENESIS_SIGNA 1992-01-07 \n", + "4 TrioTim 2010-11-19 \n", + "... ... ... \n", + "123157 GENESIS_SIGNA 1986-01-23 \n", + "123158 GENESIS_SIGNA 1987-08-02 \n", + "123159 Gyroscan Intera 1986-04-10 \n", + "123160 GENESIS_SIGNA None \n", + "123161 Sonata 1987-06-07 \n", + "\n", + " SeriesDescription SeriesNumber instanceCount \\\n", + "0 TSE45 split 4013 36 \n", + "1 ep2d_DTI_30dir 9 93 \n", + "2 t1 axial stealth-post : Processed_CaPTk 24 192 \n", + "3 Dynamic-3dfgre: SER 31000 60 \n", + "4 Axial T2 tse: Processed_CaPTk 3 64 \n", + "... ... ... ... 
\n", + "123157 T1 left breast post delay 7 52 \n", + "123158 DWSSFSE Diffusion w fatsat 7 50 \n", + "123159 BRSTCA SENSFS3DSAG 6DYN: PE2 121002 70 \n", + "123160 PJN 648 11 \n", + "123161 localizer 1 4 \n", + "\n", + " license_short_name series_aws_url \\\n", + "0 CC BY 4.0 s3://idc-open-data/1461966c-1a43-4198-a664-921... \n", + "1 CC BY 4.0 s3://idc-open-data/b5ca88ab-bb87-4b53-9711-915... \n", + "2 CC BY 4.0 s3://idc-open-data/6eb21652-c127-4fa4-9561-82c... \n", + "3 CC BY 3.0 s3://idc-open-data/fbbe6ce5-c9d7-4346-9d21-8bb... \n", + "4 CC BY 4.0 s3://idc-open-data/127fb2c9-ccaa-48ad-8c1f-7e3... \n", + "... ... ... \n", + "123157 CC BY 3.0 s3://idc-open-data/c944fe91-04c6-48a2-a60d-777... \n", + "123158 CC BY 3.0 s3://idc-open-data/92f2385f-a493-4541-9d06-a58... \n", + "123159 CC BY 3.0 s3://idc-open-data/0ff2a54f-d42d-493a-b00f-993... \n", + "123160 CC BY 3.0 s3://idc-open-data/2336521a-c503-4576-9dc1-7bf... \n", + "123161 CC BY 3.0 s3://idc-open-data/93373de9-671e-4ba5-958b-f20... \n", + "\n", + " series_size_MB crdc_series_uuid \n", + "0 22.51 1461966c-1a43-4198-a664-9213ce671db1 \n", + "1 110.12 b5ca88ab-bb87-4b53-9711-915d6b6541a1 \n", + "2 25.75 6eb21652-c127-4fa4-9561-82cef589b458 \n", + "3 8.23 fbbe6ce5-c9d7-4346-9d21-8bb9572170a1 \n", + "4 7.00 127fb2c9-ccaa-48ad-8c1f-7e38cc0e1e80 \n", + "... ... ... 
\n", + "123157 7.08 c944fe91-04c6-48a2-a60d-777aecf8189d \n", + "123158 6.77 92f2385f-a493-4541-9d06-a58c47ffc86f \n", + "123159 9.43 0ff2a54f-d42d-493a-b00f-993dd20d7457 \n", + "123160 5.81 2336521a-c503-4576-9dc1-7bf2ed032b32 \n", + "123161 0.53 93373de9-671e-4ba5-958b-f20d88583464 \n", + "\n", + "[123162 rows x 22 columns]" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "mr_selection_sql" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2g_DMz7VZnF2" + }, + "source": [ + "As you search the data, you can combine multiple columns to build your cohort!\n", + "\n", + "The `BodyPartExamined` column describes the anatomy that was imaged. As we did for the `Modality` column, let's find the distinct values, and this time also count how many times each value is encountered." + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "zivORYVUabWF", + "outputId": "9dfdded3-6479-4dbc-b87e-a2a20160f1fa" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "BodyPartExamined\n", + "CHEST 352806\n", + "BREAST 100341\n", + "PROSTATE 22984\n", + "LUNG 10971\n", + "ABDOMEN 8893\n", + "PELVIS 5636\n", + "COLON 3544\n", + "KIDNEY 3457\n", + "LIVER 3267\n", + "BRAIN 2013\n", + "HEADNECK 1448\n", + "HEAD 1412\n", + "CHESTABDPELVIS 1153\n", + "EXTREMITY 934\n", + "BLADDER 846\n", + "OVARY 844\n", + "UTERUS 838\n", + "TSPINE 641\n", + "CERVIX 488\n", + "PANCREAS 428\n", + "PHANTOM 364\n", + "STOMACH 308\n", + "SKULL 286\n", + "WHOLEBODY 268\n", + "ABDOMENPELVIS 230\n", + "MEDIASTINUM 180\n", + "SPINE 131\n", + "ESOPHAGUS 126\n", + "ADRENAL 121\n", + "CSPINE 102\n", + "LSPINE 64\n", + "NECKCHESTABDPELV 63\n", + "CHESTABDOMEN 57\n", + "NECK 45\n", + "RECTUM 34\n", + "TLSPINE 31\n", + "THYROID 28\n", + "THORAX 28\n", + "OTHER 18\n", + "FUSION 18\n", 
+ "LEG 16\n", + "UNKNOWN 15\n", + "NECKCHESTABDOMEN 14\n", + "SEG 12\n", + "TH CT CHEST WO C 11\n", + "ABDOMEN CAVIT 10\n", + "IAC 9\n", + "HEART 8\n", + "FOOT 7\n", + "CHEST (THORAX) W 6\n", + "PELVIC 6\n", + "LSSPINE 6\n", + "ABDOMENPELVIC 4\n", + "BONE 4\n", + "OUTSIDE FIL 3\n", + "WO INTER 2\n", + "EAR 2\n", + "BD CT ABD WO_W C 2\n", + "BODY 1\n", + "Name: count, dtype: int64\n" + ] + } + ], + "source": [ + "sorted_unique_values = idc_client.index['BodyPartExamined'].value_counts().sort_values(ascending=False)\n", + "\n", + "print(sorted_unique_values)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vSfM6uYtav4U" + }, + "source": [ + "For the sake of example, let's filter images that meet two criteria: Magnetic Resonance (`MR`) as `Modality`, and `LIVER` as `BodyPartExamined`." + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": { + "id": "vaCquh8_a8z1" + }, + "outputs": [], + "source": [ + "query = \"\"\"\n", + "SELECT *\n", + "FROM index\n", + "WHERE Modality = 'MR' AND BodyPartExamined = 'LIVER'\n", + "\"\"\"\n", + "\n", + "liver_mr_selection_sql = idc_client.sql_query(query)" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 982 + }, + "id": "3zn44nuHbB5d", + "outputId": "cd5bf17f-13f7-4084-b3a9-367387cc20bd" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Index.format is deprecated and will be removed in a future version. Convert using index.astype(str) or index.map(formatter) instead.\n" + ] + }, + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "liver_mr_selection_sql" + }, + "text/html": [ + "\n", + "
\n" + ], + "text/plain": [ + " collection_id analysis_result_id PatientID \\\n", + "0 tcga_lihc None TCGA-BC-A10Y \n", + "1 tcga_lihc None TCGA-DD-A4NJ \n", + "2 tcga_lihc None TCGA-DD-A4NP \n", + "3 tcga_lihc None TCGA-DD-A4NH \n", + "4 tcga_lihc None TCGA-DD-A1EF \n", + ".. ... ... ... \n", + "928 tcga_lihc None TCGA-DD-A4NF \n", + "929 tcga_lihc None TCGA-DD-A1EF \n", + "930 tcga_lihc None TCGA-BC-A5W4 \n", + "931 tcga_lihc None TCGA-DD-A4NO \n", + "932 tcga_lihc None TCGA-DD-A4NF \n", + "\n", + " SeriesInstanceUID \\\n", + "0 1.3.6.1.4.1.14519.5.2.1.8421.4008.165737829269... \n", + "1 1.3.6.1.4.1.14519.5.2.1.3344.4008.345436872573... \n", + "2 1.3.6.1.4.1.14519.5.2.1.3344.4008.185206193510... \n", + "3 1.3.6.1.4.1.14519.5.2.1.3344.4008.233142894536... \n", + "4 1.3.6.1.4.1.14519.5.2.1.3344.4008.260482159431... \n", + ".. ... \n", + "928 1.3.6.1.4.1.14519.5.2.1.3344.4008.385090740780... \n", + "929 1.3.6.1.4.1.14519.5.2.1.3344.4008.307480562412... \n", + "930 1.3.6.1.4.1.14519.5.2.1.8421.4008.259154043948... \n", + "931 1.3.6.1.4.1.14519.5.2.1.3344.4008.171056092280... \n", + "932 1.3.6.1.4.1.14519.5.2.1.3344.4008.170707128009... \n", + "\n", + " StudyInstanceUID \\\n", + "0 1.3.6.1.4.1.14519.5.2.1.8421.4008.253241154539... \n", + "1 1.3.6.1.4.1.14519.5.2.1.3344.4008.368700536079... \n", + "2 1.3.6.1.4.1.14519.5.2.1.3344.4008.164794594228... \n", + "3 1.3.6.1.4.1.14519.5.2.1.3344.4008.201829600838... \n", + "4 1.3.6.1.4.1.14519.5.2.1.3344.4008.280574940552... \n", + ".. ... \n", + "928 1.3.6.1.4.1.14519.5.2.1.3344.4008.308452142823... \n", + "929 1.3.6.1.4.1.14519.5.2.1.3344.4008.280574940552... \n", + "930 1.3.6.1.4.1.14519.5.2.1.8421.4008.121664577064... \n", + "931 1.3.6.1.4.1.14519.5.2.1.3344.4008.276936118755... \n", + "932 1.3.6.1.4.1.14519.5.2.1.3344.4008.146126813486... 
\n", + "\n", + " source_DOI PatientAge PatientSex StudyDate \\\n", + "0 10.7937/k9/tcia.2016.immqw8uq 076Y M 1993-03-21 \n", + "1 10.7937/k9/tcia.2016.immqw8uq 054Y F 2004-03-30 \n", + "2 10.7937/k9/tcia.2016.immqw8uq 033Y M 1999-01-21 \n", + "3 10.7937/k9/tcia.2016.immqw8uq 065Y F 2004-06-13 \n", + "4 10.7937/k9/tcia.2016.immqw8uq 057Y F 1999-12-01 \n", + ".. ... ... ... ... \n", + "928 10.7937/k9/tcia.2016.immqw8uq 071Y M 2004-02-29 \n", + "929 10.7937/k9/tcia.2016.immqw8uq 057Y F 1999-12-01 \n", + "930 10.7937/k9/tcia.2016.immqw8uq 070Y M 2003-05-06 \n", + "931 10.7937/k9/tcia.2016.immqw8uq 065Y M 2000-03-11 \n", + "932 10.7937/k9/tcia.2016.immqw8uq 071Y M 2004-02-24 \n", + "\n", + " StudyDescription ... Manufacturer \\\n", + "0 MRI ABD W+WO CONT ... SIEMENS \n", + "1 MRI ABDOMEN W/WO CONTRAST (74183) ... SIEMENS \n", + "2 *MRI - ABDOMEN ... GE MEDICAL SYSTEMS \n", + "3 MRI ABDOMEN W/WO CONTRAST (74183) ... SIEMENS \n", + "4 MRI Abd wo&w ... GE MEDICAL SYSTEMS \n", + ".. ... ... ... \n", + "928 MRI LIVER ELASTOGRAPHY W/O ... GE MEDICAL SYSTEMS \n", + "929 MRI Abd wo&w ... GE MEDICAL SYSTEMS \n", + "930 MRI ABD W WO CON ... SIEMENS \n", + "931 MRI Abd wo&w ... GE MEDICAL SYSTEMS \n", + "932 MR ABDOMEN W/WO CONTRAST ... SIEMENS \n", + "\n", + " ManufacturerModelName SeriesDate SeriesDescription SeriesNumber \\\n", + "0 Sonata 1993-03-21 Subtraction_S11_S9_1 14 \n", + "1 Avanto 2004-03-30 AX VIBE_Pre 12 \n", + "2 SIGNA EXCITE 1999-01-21 (lava arc) POST 10 \n", + "3 Avanto 2004-06-13 AX HASTE LONG TE 16 \n", + "4 SIGNA EXCITE 1999-12-01 3 plane loc 1 \n", + ".. ... ... ... ... 
\n", + "928 SIGNA EXCITE 2004-02-29 MRE 700 \n", + "929 SIGNA EXCITE 1999-12-01 lava3d 7 \n", + "930 Symphony 2003-05-06 VIBE_AXIAL_P 10 \n", + "931 SIGNA HDx 2000-03-11 ax rt fse 5 \n", + "932 Symphony 2004-02-24 t1 vibe cor fs HAVE TO DO 22 \n", + "\n", + " instanceCount license_short_name \\\n", + "0 88 CC BY 3.0 \n", + "1 88 CC BY 3.0 \n", + "2 348 CC BY 3.0 \n", + "3 30 CC BY 3.0 \n", + "4 63 CC BY 3.0 \n", + ".. ... ... \n", + "928 4 CC BY 3.0 \n", + "929 336 CC BY 3.0 \n", + "930 72 CC BY 3.0 \n", + "931 40 CC BY 3.0 \n", + "932 44 CC BY 3.0 \n", + "\n", + " series_aws_url series_size_MB \\\n", + "0 s3://idc-open-data/ed74a10a-60fe-4aa1-8c44-736... 34.86 \n", + "1 s3://idc-open-data/e61388db-859e-4b71-abd0-7e3... 34.86 \n", + "2 s3://idc-open-data/2ff18a3a-9390-4cc0-b083-7de... 184.43 \n", + "3 s3://idc-open-data/095ae4d1-3c3d-4e58-9457-574... 3.04 \n", + "4 s3://idc-open-data/cb0165bb-d93e-4955-8020-dd6... 8.62 \n", + ".. ... ... \n", + "928 s3://idc-open-data/cca5d478-50c0-4419-a006-9f5... 0.55 \n", + "929 s3://idc-open-data/0939b7c7-9474-4ad9-845c-10b... 178.12 \n", + "930 s3://idc-open-data/ee7262e1-30f6-415e-91f7-bde... 11.28 \n", + "931 s3://idc-open-data/bc47dad7-da54-4153-8180-cd6... 21.22 \n", + "932 s3://idc-open-data/39e20f0f-ff9a-4bd8-81eb-e1f... 23.22 \n", + "\n", + " crdc_series_uuid \n", + "0 ed74a10a-60fe-4aa1-8c44-736c0eab00a0 \n", + "1 e61388db-859e-4b71-abd0-7e310accb91f \n", + "2 2ff18a3a-9390-4cc0-b083-7de785b7ab18 \n", + "3 095ae4d1-3c3d-4e58-9457-574da16b5286 \n", + "4 cb0165bb-d93e-4955-8020-dd6602590133 \n", + ".. ... 
\n", + "928 cca5d478-50c0-4419-a006-9f5d1220f421 \n", + "929 0939b7c7-9474-4ad9-845c-10bb4d9b5878 \n", + "930 ee7262e1-30f6-415e-91f7-bde02c78a7bf \n", + "931 bc47dad7-da54-4153-8180-cd67bf0f37ae \n", + "932 39e20f0f-ff9a-4bd8-81eb-e1f055059b09 \n", + "\n", + "[933 rows x 22 columns]" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "liver_mr_selection_sql" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5M50eWLifYEe" + }, + "source": [ + "## Working with the search results\n", + "\n", + "Now that we have learned the basics of searching, we will go over some of the operations you can apply to the search results. We will learn how to download your selection, how to visualize individual images, and how to find the terms of use and attribution requirements - this will be important if you use images from IDC in publications or commercial work!" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5sL2ZJUSgWnk" + }, + "source": [ + "### Downloading selected images\n", + "\n", + "Earlier we learned how to download a selected study, where the study identifier was copied from the IDC Portal.\n", + "\n", + "You can use the same function to download images that you found by searching the index. In the following cell we pass a list of `SeriesInstanceUID` values to the `download_from_selection()` function to download the first 10 series." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "0fDcq_Vufukf", + "outputId": "fd03938d-908e-4c3f-8010-b9f94b54a29a" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Downloading data: 100%|██████████| 432M/432M [00:07<00:00, 61.5MB/s]\n" + ] + } + ], + "source": [ + "idc_client.download_from_selection(seriesInstanceUID=list(liver_mr_selection_sql['SeriesInstanceUID'].values[:10]), downloadDir=\".\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HsuVEbqeZPSR" + }, + "source": [ + "### Visualizing selected images\n", + "\n", + "To better understand the images available, you may want to take a look at them first. You do not need to leave the notebook to do that! `idc-index` provides a convenience function to generate a URL that opens the viewer, which you can even embed in your notebook cell!\n", + "\n", + "In the following cell we will pick a random row from the selection we made earlier, and generate the URL to open that image series in the IDC-maintained image viewer." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "RQZj8AK_bipY", + "outputId": "ea3c6f61-929f-4619-e488-01856037c159" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "https://viewer.imaging.datacommons.cancer.gov/v3/viewer/?StudyInstanceUIDs=1.3.6.1.4.1.14519.5.2.1.8421.4008.303533339368406310446855637599&SeriesInstanceUIDs=1.3.6.1.4.1.14519.5.2.1.8421.4008.328010787802907923407541379931\n" + ] + } + ], + "source": [ + "import random\n", + "\n", + "random_series = random.choice(liver_mr_selection_sql['SeriesInstanceUID'].values)\n", + "viewer_url = idc_client.get_viewer_URL(seriesInstanceUID=random_series, viewer_selector=\"ohif_v3\")\n", + "\n", + "print(viewer_url)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "w6zYr7m8cO0h" + }, + "source": [ + "In the following cell, we embed the viewer that opens the URL for the selected series into the output cell. This way you can examine items you selected without leaving the notebook page!" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 941 + }, + "id": "6g5FZV7YeDqQ", + "outputId": "6145fcb4-7531-4f86-d42d-14a191dfe1b6" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " \n", + " " + ], + "text/plain": [ + "" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from IPython.display import IFrame\n", + "IFrame(viewer_url, width=1600, height=900)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0Y3TXMU_eyQx" + }, + "source": [ + "### Getting license information for the selected images\n", + "\n", + "Each of the files available from IDC is accompanied by a license that defines the terms of use. 
In most cases, those are generic, broadly accepted Creative Commons licenses.\n", + "\n", + "The abbreviated license code is available in the `license_short_name` column. In the following cell we list the licenses encountered across all of the selected series. You will see that the only license encountered is the Creative Commons Attribution 3.0 license (https://creativecommons.org/licenses/by/3.0/deed.en), which does not restrict commercial use but requires you to credit the author if you reuse the item.\n", + "\n", + "Most (>95%) of the images in IDC are shared under the permissive CC-BY license that allows commercial use (but does require attribution)!" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "8bn2VezNhcWj", + "outputId": "19b6beb3-6651-422c-e72c-a60e56378ab5" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "license_short_name\n", + "CC BY 3.0 933\n", + "Name: count, dtype: int64\n" + ] + } + ], + "source": [ + "sorted_unique_values = liver_mr_selection_sql['license_short_name'].value_counts().sort_values(ascending=False)\n", + "\n", + "print(sorted_unique_values)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0nh-eGaKh4TW" + }, + "source": [ + "### Getting citation information\n", + "\n", + "To get more details about how the data was collected, use the `source_DOI` column, which contains the Digital Object Identifier (DOI) of the dataset from which the given file originates.\n", + "\n", + "We can easily access the complete list of DOIs that accompany the items in our selection." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "7Zf76kMziVCp", + "outputId": "04eea72f-8e02-47ae-c645-c8be314d9b0a" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "source_DOI\n", + "10.7937/k9/tcia.2016.immqw8uq 910\n", + "10.7937/k9/tcia.2018.oblamn27 23\n", + "Name: count, dtype: int64\n" + ] + } + ], + "source": [ + "sorted_unique_values = liver_mr_selection_sql['source_DOI'].value_counts().sort_values(ascending=False)\n", + "\n", + "print(sorted_unique_values)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7Py8o3VqikO4" + }, + "source": [ + "If you use data from IDC, you should acknowledge IDC as the source of the data and cite the individual datasets that you used.\n", + "\n", + "To help you comply with the attribution requirements, `idc-index` provides a convenience function `citations_from_selection` that will look up the DOIs and generate the list of citations.\n", + "\n", + "WARNING: As of May 30, 2024, due to server issues at api.crossref.org, the following cell may not work. In the future, we will replace the API call to CrossRef with a cached list of publications to address this issue.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "9f-IWgQCiXxf", + "outputId": "3542101b-c4ba-46a8-c3f3-f51f6bddd240" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "['Erickson, B. J., Kirk, S., Lee, Y., Bathe, O., Kearns, M., Gerdes, C., Rieger-Christ, K., & Lemmerman, J. (2016). The Cancer Genome Atlas Liver Hepatocellular Carcinoma Collection (TCGA-LIHC) (Version 5) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/K9/TCIA.2016.IMMQW8UQ',\n", + " 'National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC). (2018). 
The Clinical Proteomic Tumor Analysis Consortium Clear Cell Renal Cell Carcinoma Collection (CPTAC-CCRCC) (Version 13) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/K9/TCIA.2018.OBLAMN27',\n", + " 'Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W. L., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., … Kikinis, R. (2023). National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence. RadioGraphics, 43(12). https://doi.org/10.1148/rg.230180\\n']" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "idc_client.citations_from_selection(seriesInstanceUID=list(liver_mr_selection_sql['SeriesInstanceUID'].values))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "io2DpTiki7OM" + }, + "source": [ + "You can also request the citation list in BibTeX format (learn more in the documentation of the function)." + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "IIE6IWoAifBH", + "outputId": "b521f7fd-21ea-4172-c0a4-928e3759c820" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "['@misc{https://doi.org/10.7937/k9/tcia.2016.immqw8uq,\\n doi = {10.7937/K9/TCIA.2016.IMMQW8UQ},\\n url = {https://www.cancerimagingarchive.net/collection/tcga-lihc/},\\n author = {Erickson, Bradley J. 
and Kirk, Shanah and Lee, Yueh and Bathe, Oliver and Kearns, Melissa and Gerdes, Cindy and Rieger-Christ, Kimberly and Lemmerman, John},\\n title = {The Cancer Genome Atlas Liver Hepatocellular Carcinoma Collection (TCGA-LIHC)},\\n publisher = {The Cancer Imaging Archive},\\n year = {2016},\\n copyright = {Creative Commons Attribution 3.0 Unported}\\n}\\n',\n", + " '@misc{https://doi.org/10.7937/k9/tcia.2018.oblamn27,\\n doi = {10.7937/K9/TCIA.2018.OBLAMN27},\\n url = {https://www.cancerimagingarchive.net/collection/cptac-ccrcc/},\\n author = {{National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC)}},\\n title = {The Clinical Proteomic Tumor Analysis Consortium Clear Cell Renal Cell Carcinoma Collection (CPTAC-CCRCC)},\\n publisher = {The Cancer Imaging Archive},\\n year = {2018},\\n copyright = {Creative Commons Attribution 3.0 Unported}\\n}\\n',\n", + " ' @article{Fedorov_2023, title={National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence}, volume={43}, ISSN={1527-1323}, url={http://dx.doi.org/10.1148/rg.230180}, DOI={10.1148/rg.230180}, number={12}, journal={RadioGraphics}, publisher={Radiological Society of North America (RSNA)}, author={Fedorov, Andrey and Longabaugh, William J. R. and Pot, David and Clunie, David A. and Pieper, Steven D. and Gibbs, David L. and Bridge, Christopher and Herrmann, Markus D. and Homeyer, André and Lewis, Rob and Aerts, Hugo J. W. L. and Krishnaswamy, Deepa and Thiriveedhi, Vamsi Krishna and Ciausu, Cosmin and Schacherer, Daniela P. 
and Bontempi, Dennis and Pihl, Todd and Wagner, Ulrike and Farahani, Keyvan and Kim, Erika and Kikinis, Ron}, year={2023}, month=dec }\\n']" + ] + }, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from idc_index import index\n", + "idc_client.citations_from_selection(seriesInstanceUID=list(liver_mr_selection_sql['SeriesInstanceUID'].values), citation_format=index.IDCClient.CITATION_FORMAT_BIBTEX)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F8PmgbAIuf0-" + }, + "source": [ + "## Summary\n", + "\n", + "That's it! We hope you learned how to search, visualize, and download images from IDC, and how to comply with the usage terms by checking which license covers a specific dataset and attributing its authors.\n", + "\n", + "We hope you enjoyed this tutorial! If you want to learn more about IDC, you can check out the [Getting Started documentation page](https://learn.canceridc.dev/getting-started-with-idc), or take a look at the other tutorials in the [IDC-Tutorials GitHub repository](https://github.com/ImagingDataCommons/IDC-Tutorials/tree/master).\n", + "\n", + "If something didn't work as expected, or if you have feedback or suggestions for what should be added to this tutorial, please contact IDC support by sending an email to support@canceridc.dev or posting your question on the [IDC User forum](https://discourse.canceridc.dev)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "It9Md7yzwTn4" + }, + "source": [ + "## Acknowledgments\n", + "\n", + "Imaging Data Commons has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l.\n", + "\n", + "If you use IDC in your research, please cite the following publication:\n", + "\n", + "> Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. 
L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., Kim, E. & Kikinis, R. _National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence_. RadioGraphics (2023). https://doi.org/10.1148/rg.230180" + ] + } + ], + "metadata": { + "colab": { + "include_colab_link": true, + "provenance": [] + }, + "gpuClass": "standard", + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python" + }, + "orig_nbformat": 4 + }, + "nbformat": 4, + "nbformat_minor": 0 +}