# Exploration of the LIDC-IDRI analysis results

# About

This notebook demonstrates how the standard DICOM objects containing nodule annotations and evaluations for the TCIA [LIDC-IDRI](https://wiki.cancerimagingarchive.net/display/Public/LIDC-IDRI) collection, hosted on the Imaging Data Commons (IDC), can be examined using standard tools and components provided by IDC and Google Cloud Platform.

A detailed description of the dataset is available in the open-access article below.

> Fedorov, A., Hancock, M., Clunie, D., Brochhausen, M., Bona, J., Kirby, J., Freymann, J., Pieper, S., J W L Aerts, H., Kikinis, R. & Prior, F. DICOM re-encoding of volumetrically annotated Lung Imaging Database Consortium (LIDC) nodules. Med. Phys. (2020). https://doi.org/10.1002/mp.14445

**The latest version of the notebook is in this repository: https://github.com/ImagingDataCommons/IDC-Examples/blob/master/notebooks/LIDC_exploration.ipynb**

# Prerequisites

This notebook assumes that you:
* have internet access
* have a Google identity
* have configured a project under Google Cloud Platform (you can see how to complete this step in [this tutorial](https://youtu.be/i08S0KJLnyw))
* have replaced the value assigned to `myProjectID` in the cell below with the ID of the GCP project you configured under your account

Let's authenticate to be able to perform any queries, and import the packages we will be using to work with the data.

In [25]:
myProjectID = "idc-sandbox-000"

In [26]:
from google.colab import auth
auth.authenticate_user()

In [27]:
import pandas as pd
import os, json
import seaborn as sb
import numpy as np
import matplotlib.pyplot as plt

# Colab renders static figures; the interactive "notebook" backend is not supported there
%matplotlib inline

def get_idc_viewer_url(studyUID):
  return "https://viewer.imaging.datacommons.cancer.gov/viewer/"+studyUID

# Data exploration

To explore the data, we will query BigQuery tables maintained by IDC that contain all of the DICOM metadata for hosted content.

You can learn about the organization of the IDC BQ tables here: https://learn.canceridc.dev/data/organization-of-data#bigquery-tables



### CT Images

IDC BQ tables contain one row per DICOM instance. Let's first select all rows that correspond to instances of the CT modality from the `lidc_idri` collection, and count the number of CT series for each patient.

Note the syntax of the `%%bigquery` command: the last argument specifies the name of the pandas data frame that will contain the result of the query.

**NOTE**: make sure that `myProjectID`, which is passed to `--project` in the cell below, is set to the ID of your GCP project!


In [28]:
%%bigquery --project=$myProjectID ct_series_counts

WITH
  all_lidc_ct_series AS (
  SELECT
    DISTINCT(SeriesInstanceUID),
    PatientID
  FROM
    `canceridc-data.idc_views.dicom_all`
  WHERE
    Modality = "CT"
    AND collection_id = "lidc_idri")
SELECT
  PatientID,
  COUNT(PatientID) AS ct_series_count
FROM
  all_lidc_ct_series
GROUP BY
  PatientID
ORDER BY
  ct_series_count DESC
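
To make the logic of the query above easier to follow without a GCP project, here is a minimal sketch that runs the same "distinct series, then count per patient" pattern against an in-memory SQLite table. The table contents and the short UIDs are hypothetical toy values, not real LIDC-IDRI data, and SQLite stands in for BigQuery purely for illustration.

```python
import sqlite3

# Hypothetical toy rows: one row per DICOM instance (not real LIDC-IDRI data).
# Columns: PatientID, SeriesInstanceUID, SOPInstanceUID, Modality
instances = [
    ("P1", "1.2.1", "1.2.1.1", "CT"),
    ("P1", "1.2.1", "1.2.1.2", "CT"),   # second instance of the same series
    ("P1", "1.2.2", "1.2.2.1", "CT"),   # second CT series for P1
    ("P2", "1.3.1", "1.3.1.1", "CT"),
    ("P2", "1.3.9", "1.3.9.1", "SEG"),  # non-CT series, filtered out below
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dicom_all (PatientID TEXT, SeriesInstanceUID TEXT, "
    "SOPInstanceUID TEXT, Modality TEXT)"
)
conn.executemany("INSERT INTO dicom_all VALUES (?, ?, ?, ?)", instances)

# Collapse instances to one row per series, then count series per patient.
query = """
WITH ct_series AS (
  SELECT DISTINCT SeriesInstanceUID, PatientID
  FROM dicom_all
  WHERE Modality = 'CT')
SELECT PatientID, COUNT(*) AS ct_series_count
FROM ct_series
GROUP BY PatientID
ORDER BY ct_series_count DESC, PatientID
"""
ct_series_counts = conn.execute(query).fetchall()
conn.close()

print(ct_series_counts)
```

Without the `DISTINCT` step, the count would reflect the number of instances (slices) per patient rather than the number of series.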

How many subjects do we have? Which subjects have more than one CT series?


In [None]:
num_subjects = ct_series_counts["PatientID"].shape[0]
print(f"Total number of subjects: {num_subjects}")

print("\nSubjects with more than one CT series:")
ct_series_counts[ct_series_counts["ct_series_count"]>1]

We can use BQ to examine various aspects of the dataset, for example those related to the heterogeneity of acquisition in the data.

In [None]:
%%bigquery --project=$myProjectID slice_thickness

WITH
  all_lidc_ct_series AS (
  SELECT
    DISTINCT(SeriesInstanceUID),
    PatientID,
    SliceThickness
  FROM
    `canceridc-data.idc_views.dicom_all`
  WHERE
    Modality = "CT"
    AND collection_id = "lidc_idri")
  SELECT SliceThickness FROM
    all_lidc_ct_series

In [None]:
%matplotlib inline
# histplot replaces distplot, which is deprecated since seaborn 0.11
ax=sb.histplot(slice_thickness["SliceThickness"].astype(float))
ax.set(xlabel="SliceThickness, mm")
slice_thickness["SliceThickness"].astype(float).describe()

In [None]:
%%bigquery --project=$myProjectID pixel_spacing

WITH
  all_lidc_ct_series AS (
  SELECT
    DISTINCT(SeriesInstanceUID),
    PatientID,
    ARRAY_TO_STRING(PixelSpacing,"/") as pixelSpacingStr
  FROM
    `canceridc-data.idc_views.dicom_all`
  WHERE
    Modality = "CT"
    AND collection_id = "lidc_idri")
  SELECT pixelSpacingStr FROM
    all_lidc_ct_series

In [None]:
xSpacing = pixel_spacing["pixelSpacingStr"].str.split('/',n=1,expand=True)[0].astype(float)
# histplot replaces distplot, which is deprecated since seaborn 0.11
ax=sb.histplot(xSpacing)
ax.set(xlabel="PixelSpacing, mm")
xSpacing.describe()
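
The parsing step above can also be sketched in plain Python, which may help when the same strings need to be handled outside of pandas. The spacing strings below are hypothetical examples, and `first_spacing_mm` is an illustrative helper, not part of any IDC tooling. Note that in DICOM the first value of `PixelSpacing` is the spacing between rows; for the square pixels typical of CT this makes no practical difference.

```python
import statistics

# Hypothetical "first/second" spacing strings in mm, in the same form that
# ARRAY_TO_STRING(PixelSpacing, "/") produces (not real query output).
pixel_spacing_strs = [
    "0.703125/0.703125",
    "0.5/0.5",
    "0.976562/0.976562",
    "0.5/0.5",
]


def first_spacing_mm(spacing_str):
    """Return the first PixelSpacing component as a float, in mm."""
    first, _, _ = spacing_str.partition("/")
    return float(first)


spacings = [first_spacing_mm(s) for s in pixel_spacing_strs]

# A small summary, similar in spirit to pandas' describe()
summary = {
    "count": len(spacings),
    "min": min(spacings),
    "median": statistics.median(spacings),
    "max": max(spacings),
}
print(summary)
```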

### Segmentations

The LIDC collection includes segmentations stored as DICOM Segmentation objects. You can read more about what those are here: https://learn.canceridc.dev/dicom/derived-objects.

Since most of the metadata related to segmentations is stored in DICOM sequences, and it is a bit cumbersome to query metadata located in sequences (which are stored in BigQuery RECORD data type), we will use the data views maintained by IDC that flatten some of that data to simplify access. 

You can read more about the data views that are maintained by IDC here: https://learn.canceridc.dev/data/organization-of-data#bigquery-tables.

You can read in detail about the data organization in [this paper](https://doi.org/10.1002/mp.14445), but in a nutshell, a subset of CT series included in the LIDC collection contains lung nodules, which were annotated volumetrically by a group of readers. 

First, let's look at the overall summary of the annotations - number of annotations per nodule, and number of nodules per subject.

In the query below, we take segmentation-specific attributes from the `segmentations` view and join them with selected attributes from the table that contains all of the DICOM metadata and collection-level metadata.

One such collection-level attribute is `Source_DOI`, the Digital Object Identifier (DOI) corresponding to the TCIA collection with the LIDC annotations stored in DICOM format. Since each primary collection can have multiple groups of analysis results associated with it, we use the DOI to select a single analysis results collection, identified by the DOI https://doi.org/10.7937/TCIA.2018.h7umfurq.

In [None]:
%%bigquery --project=$myProjectID segmentations

with lidc_segmentations as (
SELECT
  collection_id, 
  all_attributes.PatientID,
  all_attributes.SeriesDescription,
  TrackingID,
  TrackingUID,
  all_attributes.StudyInstanceUID,
  all_attributes.SOPInstanceUID,
  all_attributes.Source_DOI
FROM
  `canceridc-data.idc_views.segmentations` AS seg_attributes
JOIN
  `canceridc-data.idc_views.dicom_all` AS all_attributes
ON
  seg_attributes.SOPInstanceUID = all_attributes.SOPInstanceUID)
select * from lidc_segmentations
where Source_DOI = "10.7937/TCIA.2018.h7umfurq"
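
As a rough illustration of what the join and the DOI filter in the query above accomplish, here is a small plain-Python sketch. All rows are hypothetical toy values (the second DOI is made up), and `join_and_filter` is an illustrative helper. The sketch assumes that `SOPInstanceUID` is unique within the metadata table.

```python
# Hypothetical rows standing in for the `segmentations` view
seg_rows = [
    {"SOPInstanceUID": "1.1", "TrackingUID": "n1"},
    {"SOPInstanceUID": "1.2", "TrackingUID": "n1"},
    {"SOPInstanceUID": "1.3", "TrackingUID": "n2"},
]

# Hypothetical rows standing in for the `dicom_all` view
all_rows = [
    {"SOPInstanceUID": "1.1", "PatientID": "P1",
     "Source_DOI": "10.7937/TCIA.2018.h7umfurq"},
    {"SOPInstanceUID": "1.2", "PatientID": "P1",
     "Source_DOI": "10.7937/TCIA.2018.h7umfurq"},
    {"SOPInstanceUID": "1.3", "PatientID": "P2",
     "Source_DOI": "10.0000/hypothetical.other"},  # made-up DOI
    {"SOPInstanceUID": "1.4", "PatientID": "P2",
     "Source_DOI": "10.7937/TCIA.2018.h7umfurq"},  # not a segmentation
]

LIDC_DOI = "10.7937/TCIA.2018.h7umfurq"


def join_and_filter(seg_rows, all_rows, doi):
    """Inner-join on SOPInstanceUID, keeping only rows with the given DOI."""
    by_sop = {row["SOPInstanceUID"]: row for row in all_rows}
    joined = []
    for seg in seg_rows:
        match = by_sop.get(seg["SOPInstanceUID"])
        if match is not None and match["Source_DOI"] == doi:
            joined.append({**match, **seg})
    return joined


lidc_segmentations = join_and_filter(seg_rows, all_rows, LIDC_DOI)
for row in lidc_segmentations:
    print(row["PatientID"], row["TrackingUID"], row["SOPInstanceUID"])
```

Only instances that appear in both inputs and carry the requested DOI survive, which is why the result contains segmentations from a single analysis results collection.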
  

Below, an "annotation" corresponds to a segmentation of a nodule, with multiple segmentations potentially available for a given nodule. `TrackingUID` is a unique nodule identifier assigned by the dataset creators (details in the paper!) that can be used to associate individual annotations with a given nodule.

In [None]:
print("Total annotations: "+str(segmentations.shape[0]))
print("Total nodules: "+str(segmentations.drop_duplicates(subset="TrackingUID").shape[0]))

annotationsPerNodule = segmentations["TrackingUID"].value_counts()
# histplot replaces distplot, which is deprecated since seaborn 0.11
ax=sb.histplot(annotationsPerNodule) #.set_title("Number of annotations per nodule")
ax.set(xlabel="annotations per nodule")

In [None]:
# annotations per subject
annotationsPerSubject=segmentations["PatientID"].value_counts()
# histplot replaces distplot, which is deprecated since seaborn 0.11
sb.histplot(annotationsPerSubject).set_title("Number of annotations per subject")
annotationsPerSubject.describe()

Next we form a new table that will have a single row per nodule to look at some nodule-level statistics.

In [None]:
# nodules per case, case being "patient"
oneAnnotationPerNodule=segmentations.drop_duplicates(subset="TrackingUID")["PatientID"].value_counts()
# histplot replaces distplot, which is deprecated since seaborn 0.11
ax=sb.histplot(oneAnnotationPerNodule) #.set_title("Number of nodules per patient")
ax.set(xlabel="nodules per case")
oneAnnotationPerNodule.describe()
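
The two groupings used above (annotations per nodule, and nodules per patient after de-duplication) can be sketched on a toy example. The patient and nodule identifiers are hypothetical, and the sketch assumes that each `TrackingUID` belongs to exactly one patient.

```python
from collections import Counter

# Hypothetical (PatientID, TrackingUID) pairs, one entry per annotation
annotations = [
    ("P1", "n1"), ("P1", "n1"), ("P1", "n1"),  # nodule n1: three readers
    ("P1", "n2"),                              # nodule n2: one reader
    ("P2", "n3"), ("P2", "n3"),                # nodule n3: two readers
]

# Annotations per nodule: count how often each TrackingUID occurs
annotations_per_nodule = Counter(uid for _, uid in annotations)

# Keep each nodule once (the role of drop_duplicates above), then count
# the remaining rows per patient to get nodules per patient.
unique_nodules = {(patient, uid) for patient, uid in annotations}
nodules_per_patient = Counter(patient for patient, _ in unique_nodules)

print("Total annotations:", len(annotations))
print("Total nodules:", len(annotations_per_nodule))
print("Annotations per nodule:", dict(annotations_per_nodule))
print("Nodules per patient:", dict(nodules_per_patient))
```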

## Visualization of interesting cases

If there is an interesting case or annotation, it is easy to visualize it using the IDC-maintained image viewer.

Let's find a case that has the largest number of nodules.

In [None]:
# which case has the largest number of nodules?
oneAnnotationPerNodule.head(3)

Now that we know `PatientID`s for those, we can get `StudyInstanceUID` - and open the corresponding study in a viewer!

In [None]:
segmentations[segmentations["PatientID"] == "LIDC-IDRI-0583"].drop_duplicates(subset="StudyInstanceUID")["StudyInstanceUID"].values[0]

To open the study in the viewer, just append the `StudyInstanceUID` value above (`1.3.6.1.4.1.14519.5.2.1.6279.6001.230901123329037029807195618747`) to the IDC viewer prefix:

In [None]:
print(get_idc_viewer_url("1.3.6.1.4.1.14519.5.2.1.6279.6001.230901123329037029807195618747"))
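
When viewer links are generated for many studies at once, it can be useful to reject values that are clearly not DICOM UIDs (for example, a `PatientID` passed by mistake). The sketch below is one possible way to do that. `is_plausible_dicom_uid` and `viewer_url_checked` are hypothetical helpers, and the check covers only the basic UID shape (dot-separated numeric components, at most 64 characters), not every rule of the standard.

```python
import re

# Viewer prefix as used by get_idc_viewer_url in this notebook
VIEWER_PREFIX = "https://viewer.imaging.datacommons.cancer.gov/viewer/"

# Basic UID shape: numeric components separated by single dots
_UID_PATTERN = re.compile(r"\d+(\.\d+)*")


def is_plausible_dicom_uid(uid):
    """Loose check: at most 64 characters, digits and dots only."""
    return len(uid) <= 64 and _UID_PATTERN.fullmatch(uid) is not None


def viewer_url_checked(study_uid):
    """Build the viewer URL, rejecting values that do not look like UIDs."""
    if not is_plausible_dicom_uid(study_uid):
        raise ValueError(f"not a plausible DICOM UID: {study_uid!r}")
    return VIEWER_PREFIX + study_uid


url = viewer_url_checked(
    "1.3.6.1.4.1.14519.5.2.1.6279.6001.230901123329037029807195618747"
)
print(url)
```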

## Evaluations and measurements

Each annotation of the nodule is accompanied by its qualitative assessment performed by the reader, and quantitative measurements (volume and surface area) calculated based on the definition of the segmented region.

All of those evaluations and measurements are stored in DICOM Structured Reporting (SR) instances that follow SR template TID 1500 (read more about it here: https://learn.canceridc.dev/dicom/derived-objects), with each set of measurements associated with a single segmentation and stored in a single instance of the DICOM SR object.

Similar to the segmentation objects, navigating the content of DICOM SR objects can be quite complex, and IDC provides table views that simplify access to the measurements contained in SR documents.

Let's first get all the measurements, see what kinds of measurements are available for this collection, and how to access them.

In [None]:
%%bigquery --project=$myProjectID quantitative_measurements

with lidc_measurements as (
SELECT
  collection_id, 
  all_attributes.PatientID,
  all_attributes.SeriesDescription,
  trackingIdentifier,
  trackingUniqueIdentifier,
  Quantity.CodeMeaning as Quantity,
  -- unquoted on purpose: a double-quoted token is a string literal in BigQuery standard SQL
  Units.CodeMeaning as Units,
  Value,
  all_attributes.StudyInstanceUID,
  all_attributes.SOPInstanceUID,
  all_attributes.Source_DOI
FROM
  `canceridc-data.idc_views.quantitative_measurements` AS measurements_attributes
JOIN
  `canceridc-data.idc_views.dicom_all` AS all_attributes
ON
  measurements_attributes.SOPInstanceUID = all_attributes.SOPInstanceUID)
select * from lidc_measurements
where Source_DOI = "10.7937/TCIA.2018.h7umfurq"

In [None]:
print(f"Number of quantitative measurements: {quantitative_measurements.shape[0]}")

In [None]:
volumes = quantitative_measurements[quantitative_measurements["Quantity"]=="Volume"]
# histplot replaces distplot, which is deprecated since seaborn 0.11
sb.histplot(volumes["Value"].astype(float).values).set_title("Annotation volume")
#volumes["Value"].astype(float).describe()

Similar to the example above, we can easily find the largest annotation, and open it in a viewer.

In [None]:
# find the annotation with the largest volume
largest = volumes[volumes["Value"].astype(float)==np.max(volumes["Value"].astype(float).values)]
subject = largest["PatientID"].values[0]
noduleUID = largest["trackingUniqueIdentifier"].values[0]
studyUID = pd.unique(largest["StudyInstanceUID"])[0]
#annotationLabel = segmentations[segmentations["TrackingUID"]==noduleUID]["SegmentLabel"].values[0]

print(subject)
print(largest["trackingIdentifier"].values[0])
print(get_idc_viewer_url(studyUID))
#print(annotationLabel)
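
The conversion to `float` in the cell above matters. As a minimal sketch, assume (hypothetically) that the measurement values arrive as strings: comparing them without conversion picks the lexicographically largest value, which is not necessarily the numerically largest one. The rows and the "Surface area" label below are toy values, not real LIDC measurements.

```python
# Hypothetical long-format measurement rows (one row per measurement)
measurements = [
    {"PatientID": "P1", "trackingUniqueIdentifier": "n1",
     "Quantity": "Volume", "Value": "900.0"},
    {"PatientID": "P1", "trackingUniqueIdentifier": "n1",
     "Quantity": "Surface area", "Value": "450.0"},
    {"PatientID": "P2", "trackingUniqueIdentifier": "n2",
     "Quantity": "Volume", "Value": "12000.5"},
]

# Keep only the volume measurements
volume_rows = [m for m in measurements if m["Quantity"] == "Volume"]

# Numeric comparison: convert each value before comparing
largest_row = max(volume_rows, key=lambda m: float(m["Value"]))

# String comparison: "900.0" sorts after "12000.5" because "9" > "1"
lexicographic_largest = max(volume_rows, key=lambda m: m["Value"])

print("Largest by value:", largest_row["PatientID"], largest_row["Value"])
print("Largest by string:", lexicographic_largest["PatientID"],
      lexicographic_largest["Value"])
```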

## Qualitative evaluations

First, retrieve qualitative measurements alongside some additional attributes from the `dicom_all` table.

In [18]:
%%bigquery --project=$myProjectID qualitative_measurements

with lidc_measurements as (
SELECT
  collection_id, 
  all_attributes.PatientID,
  all_attributes.SeriesDescription,
  trackingIdentifier,
  trackingUniqueIdentifier,
  Quantity.CodeMeaning as Quantity,
  "Units.CodeMeaning" as Units,
  Value.CodeMeaning as Value,
  all_attributes.StudyInstanceUID,
  all_attributes.SOPInstanceUID,
  all_attributes.Source_DOI
FROM
  `canceridc-data.idc_views.qualitative_measurements` AS measurements_attributes
JOIN
  `canceridc-data.idc_views.dicom_all` AS all_attributes
ON
  measurements_attributes.SOPInstanceUID = all_attributes.SOPInstanceUID)
select * from lidc_measurements
where Source_DOI = "10.7937/TCIA.2018.h7umfurq"

The `Quantity` attribute identifies the type of evaluation.

In [None]:
qualitative_measurements["Quantity"].unique()

Here is a VERY busy plot summarizing all of the types and values of qualitative evaluations for our cohort.

In [None]:
%matplotlib inline

sb.countplot(y="Quantity", hue="Value", data=qualitative_measurements)

#g = sb.FacetGrid(qualitative, col="subject", col_wrap=3, height=2)
#g.map(sb.countplot, "conceptCode_CodeMeaning", "conceptValue_CodeMeaning", color=".3");

#g = sb.FacetGrid(qualitative, col="conceptCode_CodeMeaning", col_wrap=3)
#g.map(sb.countplot, "conceptValue_CodeMeaning", color=".3", orient="v")
#plt.figure(figsize=(10, 30))
#sb.countplot(y="conceptCode_CodeMeaning", hue="conceptValue_CodeMeaning", data=qualitativeWithContext)


In [None]:

qualitative_measurements[qualitative_measurements["Quantity"]=="Internal structure"]["Value"].value_counts()
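
The per-type tally behind the count plot and the `value_counts()` call above can be sketched as a nested counter. The (Quantity, Value) pairs below are hypothetical stand-ins for rows of `qualitative_measurements`; the specific value labels are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical (Quantity, Value) pairs, one per qualitative evaluation
evaluations = [
    ("Internal structure", "Soft tissue"),
    ("Internal structure", "Soft tissue"),
    ("Internal structure", "Fluid"),
    ("Calcification", "Absent"),
    ("Calcification", "Absent"),
]

# One Counter of values per evaluation type
counts_by_quantity = defaultdict(Counter)
for quantity, value in evaluations:
    counts_by_quantity[quantity][value] += 1

# Equivalent of filtering on one Quantity and calling value_counts()
internal_structure = counts_by_quantity["Internal structure"].most_common()
print(internal_structure)
```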
