<a href="https://colab.research.google.com/github/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/advanced_topics/dicom_private_tags_intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Searching private DICOM tags using IDC BigQuery index

This notebook introduces how private tags can be used to search IDC content and select cohorts using SQL queries against the IDC BigQuery index. Note that this is an advanced topic, and it might be easier to tackle after completing the ["Getting started" tutorial series](https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/getting_started/part3_exploring_cohorts.ipynb), which introduces working with IDC using the BigQuery index.

To be able to use this notebook, you will first need to complete the BigQuery prerequisites as discussed in another notebook here: https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/getting_started/part1_prerequisites.ipynb.

This notebook was designed and tested in Google Colab. Please use the "Open in Colab" button at the top of the notebook to open it in Google Colab. Running it outside of Google Colab will require extra steps. If you need help with that, please open a thread in the IDC Forum: https://discourse.canceridc.dev.

---

Initial release: Dec 2023

Updated: Jan 2025

---

Assuming you have completed the aforementioned prerequisites, enter your GCP project ID in the cell below and then run it.

In [None]:
#@title Enter your Project ID
# initialize this variable with your Google Cloud Project ID!
my_ProjectID = "idc-sandbox-000" #@param {type:"string"}

import os
os.environ["GCP_PROJECT_ID"] = my_ProjectID

from google.colab import auth
auth.authenticate_user()

## Background

Imaging equipment manufacturers, unfortunately, sometimes rely on private elements to communicate some of the acquisition details. Even more unfortunate is that in some cases, information encoded in such private attributes is absolutely essential for interpretation of those images. As an example, key acquisition parameters, such as b-values in Diffusion Weighted MRI (DWI), are stored in private elements.

What is a "private element"? It is a data element whose meaning is defined by a particular vendor rather than by the DICOM standard. Private elements can be recognized by their odd group number, such as `0043` in `(0043,1039)`.

The use of private elements is problematic for a number of reasons:
* users need to rely on other sources of information than the DICOM standard (typically, manufacturer DICOM conformance statements) to learn about those attributes
* the use of private elements can change across software versions
* private elements may be removed as part of image de-identification
* the specific tag group is not guaranteed to be fixed

Fortunately for IDC users, private DICOM elements are extracted and are searchable in BigQuery (see documentation here https://cloud.google.com/healthcare-api/docs/how-tos/dicom-bigquery-schema#private_data_elements).

In this notebook we demonstrate how private tags and the corresponding data can be accessed using BigQuery, taking prostate DWI as the example use case.
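
In DICOM, private elements are identified by an odd group number, with a handful of odd groups (0001, 0003, 0005, 0007 and FFFF) excluded from private use. The sketch below shows how that check could be written. `parse_tag` and `is_private_tag` are hypothetical helper names introduced for illustration, not part of any IDC or DICOM library.

```python
# Odd groups that the DICOM standard does not allow for private elements.
RESERVED_ODD_GROUPS = {0x0001, 0x0003, 0x0005, 0x0007, 0xFFFF}


def parse_tag(tag):
    """Parse a tag written as "(gggg,eeee)" into (group, element) integers."""
    group_hex, element_hex = tag.strip().strip("()").split(",")
    return int(group_hex, 16), int(element_hex, 16)


def is_private_tag(tag):
    """Return True when the tag's group number marks it as a private element."""
    group, _ = parse_tag(tag)
    return group % 2 == 1 and group not in RESERVED_ODD_GROUPS


print(is_private_tag("(0043,1039)"))  # GE element used later in this notebook
print(is_private_tag("(0008,0060)"))  # Modality, a standard attribute
```

The first call prints `True` and the second prints `False`, since group `0043` is odd while `0008` is even.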



## Filtering images by DWI b-values

### Understanding where to find b-values

Private DICOM elements that are not sequences (not DICOM SQ VR) are extracted into the BigQuery column `OtherElements` in the IDC BigQuery index (`bigquery-public-data.idc_current.dicom_all`).

The first thing to remember is that private elements are not harmonized across vendors. You will therefore first need to identify the subset of images that you want to analyze, identify the manufacturer and model name of the imaging equipment that produced this data, locate the corresponding DICOM conformance statement, and understand how the values of interest are encoded.

For the sake of this example, we will look into DWI images that are distributed as part of the [QIN-Prostate-Repeatability collection](http://doi.org/10.7937/K9/TCIA.2018.MR1CKGND). B-value is a key acquisition parameter for DWI.

Let's first get an idea about the manufacturer and model name for the MR images included in this collection.

In [None]:
%%bigquery --project $my_ProjectID
SELECT
  DISTINCT Manufacturer, ManufacturerModelName
FROM
  `bigquery-public-data.idc_current.dicom_all`
WHERE
  collection_id = "qin_prostate_repeatability" AND Modality = "MR"

With a little help from ChatGPT (you will have to use your preferred search engine, as there is no single resource containing all of the DICOM conformance statements from all vendors), we located the DICOM conformance statement from GE titled ["HDxt, Hdi & Signa Vibrant Product Line DICOM CONFORMANCE STATEMENT"](https://www.gehealthcare.com/-/jssmedia/widen/2018/01/25/0204/gehealthcarecom/migrated/2018/02/19/0841/tic-resonance-gehc-dicom-conformance_signa-hdi-hdxt-15-30-vibrant-sw16_doc0878555_rev2_pdf.pdf?hash=2F2C737642001B6425A26406F04E614F&rev=-1).

This is the page that contains a rather long list of conformance statements from GE: https://www.gehealthcare.com/products/interoperability/dicom/magnetic-resonance-imaging-dicom-conformance-statements, which also includes one for Discovery 750w.

Opening the PDF documents for those statements, we can search for the text string `b_value` and locate the table indicating that private element `(0043,1039)` is a multi-valued element (an array of values, not a DICOM SQ sequence), with the b-value stored as its first item.
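
The queries in this notebook refer to this element as `Tag_00431039` rather than `(0043,1039)`. The sketch below converts between the two notations. `to_bigquery_tag` is a hypothetical helper, and the naming pattern (the prefix `Tag_` followed by the eight hex digits, upper-cased) is an assumption inferred from that single example.

```python
import re


def to_bigquery_tag(tag):
    """Convert "(gggg,eeee)" into the "Tag_ggggeeee" form used in the queries.

    Upper-casing the hex digits is an assumption; the only example in this
    notebook contains no letters.
    """
    match = re.fullmatch(r"\(?([0-9A-Fa-f]{4}),\s*([0-9A-Fa-f]{4})\)?", tag.strip())
    if match is None:
        raise ValueError(f"not a DICOM tag: {tag!r}")
    group, element = match.groups()
    return f"Tag_{group.upper()}{element.upper()}"


print(to_bigquery_tag("(0043,1039)"))
```

For the GE b-value element this prints `Tag_00431039`, which is the value compared against `other_elements.Tag` in the queries.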

### Accessing specific private elements in the search query

Now that we know that the b-value is the first item of the multi-valued `(0043,1039)` element, we can access it in a query.

In [None]:
%%bigquery --project $my_ProjectID

 SELECT
    collection_id,
    StudyDate,
    StudyInstanceUID,
    SeriesInstanceUID,
    SeriesDescription,
    SOPInstanceUID,
    other_elements.Data[SAFE_OFFSET(0)] AS b_value
  FROM
    `bigquery-public-data.idc_current.dicom_all`,
    UNNEST(OtherElements) AS other_elements
  WHERE
    collection_id = "qin_prostate_repeatability"
    AND other_elements.Tag = "Tag_00431039"

As you can see, we are getting a lot of images that have a b-value of 0. Remember that each row in the IDC BigQuery index corresponds to a single DICOM file, which for most MR images corresponds to a single slice.
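
To make the shape of this query more concrete, the sketch below emulates it in plain Python on a few made-up rows. The row contents and the helper names `safe_offset` and `b_values_per_instance` are hypothetical. Only the `OtherElements`, `Tag` and `Data` field names come from the query above.

```python
def safe_offset(values, index):
    """Mimic SAFE_OFFSET: return None instead of failing when out of range."""
    return values[index] if 0 <= index < len(values) else None


# Hypothetical rows, one per DICOM file, with the repeated OtherElements field.
rows = [
    {"SOPInstanceUID": "1.2.3.1", "OtherElements": [
        {"Tag": "Tag_00431039", "Data": ["0", "8", "0", "0"]},
        {"Tag": "Tag_00431030", "Data": ["3"]},
    ]},
    {"SOPInstanceUID": "1.2.3.2", "OtherElements": [
        {"Tag": "Tag_00431039", "Data": ["1400", "8", "0", "0"]},
    ]},
    {"SOPInstanceUID": "1.2.3.3", "OtherElements": []},
]


def b_values_per_instance(rows, tag="Tag_00431039"):
    """Emulate `FROM table, UNNEST(OtherElements) ... WHERE Tag = tag`."""
    result = []
    for row in rows:
        for element in row["OtherElements"]:  # UNNEST: one output row per element
            if element["Tag"] == tag:         # WHERE clause on the tag name
                b_value = safe_offset(element["Data"], 0)
                result.append((row["SOPInstanceUID"], b_value))
    return result


print(b_values_per_instance(rows))
```

The third row does not appear in the output: a file with no matching private element contributes nothing once `OtherElements` is unnested and filtered.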

Let's next create a summary table that lists all b-values encountered for every series in our collection.

In [None]:
%%bigquery --project $my_ProjectID

 SELECT
    SeriesInstanceUID,
    ANY_VALUE(SeriesDescription) as SeriesDescription,
    ARRAY_AGG(DISTINCT(other_elements.Data[SAFE_OFFSET(0)])) as b_values
  FROM
    `bigquery-public-data.idc_current.dicom_all`,
    UNNEST(OtherElements) AS other_elements
  WHERE
    collection_id = "qin_prostate_repeatability"
    AND Modality = "MR"
    AND other_elements.Tag = "Tag_00431039"
  GROUP BY
    SeriesInstanceUID

We have 134 MR series that have b-value initialized. Each of those series belongs to one of the following three groups:

1. b-value = [0]
2. b-value = [1400]
3. b-value = [0,1400]

We are interested in high-b DWI images. Note, however, that the "Apparent Diffusion Coefficient" series also has a b-value of 1400, even though it is not a DWI acquisition but a parameter map derived from DWI.
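
The per-series summary is a group-by with a distinct aggregation. As a rough illustration of what `GROUP BY SeriesInstanceUID` combined with `ARRAY_AGG(DISTINCT ...)` computes, here is the same operation in plain Python. The series identifiers and the helper name are made up for the example.

```python
from collections import defaultdict

# Hypothetical per-slice records: (SeriesInstanceUID, b_value).
slices = [
    ("series-A", "0"), ("series-A", "1400"), ("series-A", "0"),
    ("series-B", "1400"),
    ("series-C", "0"),
]


def distinct_b_values_by_series(slices):
    """Collect the distinct b-values seen in each series."""
    grouped = defaultdict(set)
    for series_uid, b_value in slices:
        grouped[series_uid].add(b_value)
    # Sort numerically so the output is stable; this sketch makes no
    # assumption about the order in which ARRAY_AGG returns values.
    return {uid: sorted(values, key=int) for uid, values in grouped.items()}


for uid, b_values in distinct_b_values_by_series(slices).items():
    print(uid, b_values)
```

The three made-up series reproduce the three groups listed above: `['0', '1400']`, `['1400']` and `['0']`.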

When you do this kind of exploration, it is always helpful to visualize individual series so that you can confirm they are what you think they are, and also look at the DICOM tags. Let's modify the query slightly to add the URL that you can use to open each series in the IDC viewer. This time we will use Python so that we can configure the output cell to hyperlink the URLs.



In [None]:
from IPython.display import HTML
from google.cloud import bigquery

# BigQuery client is initialized with the ID of the project
# we specified in the beginning of the notebook!
bq_client = bigquery.Client(my_ProjectID)

selection_query = """
 SELECT
    SeriesInstanceUID,
    ANY_VALUE(SeriesDescription) as SeriesDescription,
    ARRAY_AGG(DISTINCT(other_elements.Data[SAFE_OFFSET(0)])) as b_values,
    ANY_VALUE(CONCAT('https://viewer.imaging.datacommons.cancer.gov/viewer/',
     StudyInstanceUID,
     '?SeriesInstanceUID=',
     SeriesInstanceUID)) as ohif_v2_url,
  FROM
    `bigquery-public-data.idc_current.dicom_all`,
    UNNEST(OtherElements) AS other_elements
  WHERE
    collection_id = "qin_prostate_repeatability"
    AND Modality = "MR"
    AND other_elements.Tag = "Tag_00431039"
  GROUP BY
    SeriesInstanceUID
"""

selection_result = bq_client.query(selection_query)
selection_df = selection_result.result().to_dataframe()

def make_clickable(val):
    return '<a href="{}" target="_blank">{}</a>'.format(val,val)

selection_df['ohif_v2_url'] = selection_df['ohif_v2_url'].apply(make_clickable)

# Display the DataFrame with clickable URLs
HTML(selection_df.to_html(escape=False))

After opening the series in the viewer, you can open "Tag Browser" to see all of the DICOM attributes, confirm your assumptions and understand the data better.
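
If you already have the UIDs in Python, the viewer URL can be assembled without another query. The sketch below follows the same pattern as the `CONCAT` expression in the query above. `viewer_url` and `link_html` are hypothetical helper names, and the UIDs in the example are placeholders, not real IDC identifiers.

```python
from html import escape

VIEWER_BASE = "https://viewer.imaging.datacommons.cancer.gov/viewer/"


def viewer_url(study_uid, series_uid):
    """Build the viewer URL the same way the CONCAT in the query does."""
    return f"{VIEWER_BASE}{study_uid}?SeriesInstanceUID={series_uid}"


def link_html(url):
    """Wrap a URL in an anchor tag, escaping it for safe use in HTML."""
    safe = escape(url, quote=True)
    return f'<a href="{safe}" target="_blank">{safe}</a>'


url = viewer_url("1.2.3", "1.2.3.4")  # placeholder UIDs
print(url)
print(link_html(url))
```

Escaping the URL before embedding it is a small precaution on top of the `make_clickable` function used in this notebook, which inserts the value as is.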

In order to filter out ADC series, we can use another - this time, standard! - DICOM attribute `ImageType`. You can see that for DWI series, the value of this attribute is `ORIGINAL\PRIMARY\OTHER`, while for ADC (since this is a derived image series), it is `DERIVED\SECONDARY\PROCESSED`.

In [None]:
selection_query = """
 SELECT
    SeriesInstanceUID,
    ANY_VALUE(SeriesDescription) as SeriesDescription,
    ANY_VALUE(ImageType[SAFE_OFFSET(0)]) as original_or_not,
    ARRAY_AGG(DISTINCT(other_elements.Data[SAFE_OFFSET(0)])) as b_values,
    ANY_VALUE(CONCAT('https://viewer.imaging.datacommons.cancer.gov/viewer/',
     StudyInstanceUID,
     '?SeriesInstanceUID=',
     SeriesInstanceUID)) as ohif_v2_url,
  FROM
    `bigquery-public-data.idc_current.dicom_all`,
    UNNEST(OtherElements) AS other_elements
  WHERE
    collection_id = "qin_prostate_repeatability"
    AND Modality = "MR"
    AND other_elements.Tag = "Tag_00431039"
  GROUP BY
    SeriesInstanceUID
"""

selection_result = bq_client.query(selection_query)
selection_df = selection_result.result().to_dataframe()

def make_clickable(val):
    return '<a href="{}" target="_blank">{}</a>'.format(val,val)

selection_df['ohif_v2_url'] = selection_df['ohif_v2_url'].apply(make_clickable)

# Display the DataFrame with clickable URLs
HTML(selection_df.to_html(escape=False))

Now we can finally define the selection criteria for high-b DWI images. They should:
* have first element of `ImageType` set to "ORIGINAL"
* have b-value stored in the first element of private tag `(0043,1039)` set to "1400"
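
The two criteria can be written as a single predicate. The sketch below is only an illustration in plain Python. `is_high_b_dwi` is a hypothetical name, and the inputs are assumed to be lists of strings, mirroring the `ImageType` and `Data` arrays used in the queries.

```python
def is_high_b_dwi(image_type, b_value_data, high_b="1400"):
    """Return True for an original (non-derived) image acquired at the high b-value."""
    first_image_type = image_type[0] if image_type else None
    first_b_value = b_value_data[0] if b_value_data else None
    return first_image_type == "ORIGINAL" and first_b_value == high_b


# DWI slice acquired at b=1400
print(is_high_b_dwi(["ORIGINAL", "PRIMARY", "OTHER"], ["1400", "8", "0", "0"]))
# ADC map: carries b=1400 but is a derived image
print(is_high_b_dwi(["DERIVED", "SECONDARY", "PROCESSED"], ["1400", "8", "0", "0"]))
# DWI slice acquired at b=0
print(is_high_b_dwi(["ORIGINAL", "PRIMARY", "OTHER"], ["0", "8", "0", "0"]))
```

Only the first call returns `True`. Note that the b-value is compared as a string, because that is how the values are stored in the `Data` array.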

The query below will select series that meet these criteria, and count the slices.

In [None]:
selection_query = """
 SELECT
    ANY_VALUE(PatientID) as PatientID,
    ANY_VALUE(StudyDate) as StudyDate,
    SeriesInstanceUID,
    ANY_VALUE(SeriesDescription) as SeriesDescription,
    ARRAY_AGG(DISTINCT(other_elements.Data[SAFE_OFFSET(0)])) as b_values,
    COUNT(DISTINCT(SOPInstanceUID)) as n_slices
  FROM
    `bigquery-public-data.idc_current.dicom_all`,
    UNNEST(OtherElements) AS other_elements
  WHERE
    collection_id = "qin_prostate_repeatability"
    AND Modality = "MR"
    AND other_elements.Tag = "Tag_00431039"
    AND ImageType[SAFE_OFFSET(0)] = "ORIGINAL"
    AND other_elements.Data[SAFE_OFFSET(0)] = "1400"
  GROUP BY
    SeriesInstanceUID
  ORDER BY
    PatientID,
    StudyDate
"""

selection_result = bq_client.query(selection_query)
selection_df = selection_result.result().to_dataframe()

# Display the DataFrame
HTML(selection_df.to_html(escape=False))


### Filtering by specific b-value and downloading images

Now that we know how to identify DWI images with multiple b-values, we can filter the images that correspond to the high b-value (this is often the image used by AI tools for prostate cancer (PCa) detection).

In the query below, we get AWS URLs for all slices that have b-value of 1400. For convenience, we use `PatientID` and `SeriesInstanceUID` attributes to sort the files on the fly as we download them.
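
Each manifest line is a `cp` command whose destination folder encodes the patient and the series. A minimal Python sketch of the same string assembly is shown here. `s5cmd_cp_command` is a hypothetical helper, and the bucket URLs and series UID are placeholders rather than real IDC locations.

```python
def s5cmd_cp_command(aws_url, patient_id, series_uid):
    """Build one manifest line: copy a file into ./<PatientID>/<SeriesInstanceUID>/."""
    return f"cp {aws_url} ./{patient_id}/{series_uid}/"


# Placeholder rows: (aws_url, PatientID, SeriesInstanceUID).
rows = [
    ("s3://example-bucket/series-a/slice-1.dcm", "PCAMPMRI-00009", "1.2.3"),
    ("s3://example-bucket/series-a/slice-2.dcm", "PCAMPMRI-00009", "1.2.3"),
]

manifest_lines = [s5cmd_cp_command(*row) for row in rows]
print("\n".join(manifest_lines))
```

Because the destination is derived from the attributes of each file, all slices of one series end up in the same folder regardless of the order of the manifest.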

The query selects one specific series. You can remove the corresponding check from the query to select all series!

In [None]:
selection_query = """
WITH b_values_added AS (
 SELECT
    CONCAT("cp ",aws_url," ./",PatientID,"/",SeriesInstanceUID,"/") as aws_cp_command,
    other_elements.Data[SAFE_OFFSET(0)] AS b_value
  FROM
    `bigquery-public-data.idc_current.dicom_all`,
    UNNEST(OtherElements) AS other_elements
  WHERE
    collection_id = "qin_prostate_repeatability"
    AND other_elements.Tag = "Tag_00431039"
    AND other_elements.Data[SAFE_OFFSET(0)] = "1400"
    AND ImageType[SAFE_OFFSET(0)] = "ORIGINAL"
    # remove the next line to select all high-b images!
    AND SeriesInstanceUID = "1.3.6.1.4.1.14519.5.2.1.3671.4754.234511234806445476378897256789"
  ORDER BY
    aws_cp_command
)

SELECT
  aws_cp_command
FROM
  b_values_added
"""

selection_result = bq_client.query(selection_query)
selection_df = selection_result.result().to_dataframe()


# write one s5cmd command per line into the manifest file
with open("aws_manifest.s5cmd", "w") as f:
    for command in selection_df['aws_cp_command']:
        f.write(f"{command}\n")


Now that we have the manifest, we can install [`idc-index`](https://github.com/ImagingDataCommons/idc-index) Python package, and download the corresponding files.

In [None]:
!pip install --upgrade idc-index

In [None]:
from idc_index import IDCClient

c=IDCClient()

!rm -rf dwi_images
!mkdir -p dwi_images
!cd dwi_images && $c.s5cmdPath --no-sign-request --endpoint-url https://s3.amazonaws.com run ../aws_manifest.s5cmd

### Working with the downloaded images

In the previous section, we downloaded high-b DWI images and organized them by series.

Typically, AI tools expect input in formats such as NIfTI or NRRD, and this can be accomplished easily by converting from DICOM. For DICOM to NIfTI conversion, we recommend `dcm2niix`, which is very easy to install.

In [None]:
%%capture
!apt-get install -y dcm2niix

In the next cell we convert one of the high-b DWI images into NIfTI.

In [None]:
!rm -rf nifti_converted
!mkdir -p nifti_converted
!dcm2niix -o ./nifti_converted /content/dwi_images/PCAMPMRI-00009/1.3.6.1.4.1.14519.5.2.1.3671.4754.234511234806445476378897256789
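
If you download more than one series, the same conversion can be repeated for every `<PatientID>/<SeriesInstanceUID>` folder. The sketch below only builds the command lines. `dcm2niix_commands` is a hypothetical helper, the folder names in the demonstration are made up, and only the `-o` option already used above is assumed.

```python
import tempfile
from pathlib import Path


def dcm2niix_commands(download_root, output_root):
    """Build one dcm2niix command per <PatientID>/<SeriesInstanceUID> folder."""
    commands = []
    for series_dir in sorted(Path(download_root).glob("*/*")):
        if not series_dir.is_dir():
            continue  # skip stray files at the series level
        out_dir = Path(output_root) / series_dir.parent.name / series_dir.name
        commands.append(["dcm2niix", "-o", str(out_dir), str(series_dir)])
    return commands


# Demonstrate on a throwaway folder tree with made-up patient/series names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "dwi_images"
    (root / "patient-1" / "series-1").mkdir(parents=True)
    (root / "patient-2" / "series-2").mkdir(parents=True)
    for command in dcm2niix_commands(root, Path(tmp) / "nifti_converted"):
        # Create the output folder first, as the cell above does with mkdir -p;
        # the command can then be passed to subprocess.run(command, check=True).
        Path(command[2]).mkdir(parents=True, exist_ok=True)
        print(" ".join(command))
```

Keeping the patient and series folders in the output path avoids mixing NIfTI files from different series in one directory.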

## Visualization of the downloaded images

WARNING: `itkWidgets` is a large package with many dependencies. If you run into problems with this cell, continue to the next section!

The high-b DWI image can also be visualized right here in the notebook using `itkWidgets`, which can read images directly from DICOM. See more details in this tutorial: https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/getting_started/part3_exploring_cohorts.ipynb.

Installing `itkWidgets` will take some time!

In [None]:
!pip install -q "itk-io>=5.3.0" "itkwidgets[all]>=1.0a32"

In [None]:
import itk
from itkwidgets import view

image = itk.imread("/content/dwi_images/PCAMPMRI-00009/1.3.6.1.4.1.14519.5.2.1.3671.4754.234511234806445476378897256789")

viewer = view(image, ui_collapsed=False)


viewer.set_background_color([0.1,0.1,0.1])
viewer.set_image_color_map("Grayscale")
viewer.set_image_color_range([0,5000])
viewer.set_label_image_blend(0.75)
viewer.set_view_mode('ZPlane')

## Looking beyond QIN-Prostate-Repeatability

The query below summarizes all of the collections that contain the notorious GE-specific b-value tag. As you can see, some of the values appear to follow a different convention: a value such as 1000000600 cannot be a realistic b-value.

In [None]:
%%bigquery --project $my_ProjectID

WITH
  b_value_added AS (
  SELECT
    collection_id,
    Manufacturer,
    ManufacturerModelName,
    ARRAY_TO_STRING(SoftwareVersions,"/") AS software_versions,
    PatientID,
    StudyDate,
    StudyInstanceUID,
    SeriesInstanceUID,
    SeriesDescription,
    SOPInstanceUID,
    other_elements.Data[SAFE_OFFSET(0)] AS b_value
  FROM
    `bigquery-public-data.idc_current.dicom_all`,
    UNNEST(OtherElements) AS other_elements
  WHERE
    other_elements.Tag = "Tag_00431039" )
SELECT
  collection_id,
  ARRAY_TO_STRING(ARRAY_AGG(DISTINCT(b_value)),",") as b_values_encountered,
  ARRAY_AGG(DISTINCT(CONCAT(SAFE_CAST(Manufacturer AS STRING)," ",ManufacturerModelName)) IGNORE NULLS) as Manufacturer_and_ModelName
FROM
  b_value_added
WHERE
  b_value <> ""
GROUP BY
  collection_id
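
When reviewing the result, it can help to separate values that look like ordinary b-values from those that do not. The sketch below does that for one comma-separated `b_values_encountered` string. `split_by_plausibility` is a hypothetical helper, and the threshold of 10000 is an arbitrary assumption chosen for illustration, not a limit taken from the DICOM standard or from GE documentation.

```python
def split_by_plausibility(b_values_encountered, max_plausible=10000):
    """Split a comma-separated list of b-values into plausible and suspicious ones."""
    plausible, suspicious = [], []
    for raw in b_values_encountered.split(","):
        raw = raw.strip()
        if not raw:
            continue  # ignore empty entries
        try:
            value = float(raw)
        except ValueError:
            suspicious.append(raw)  # not a number at all
            continue
        if 0 <= value <= max_plausible:
            plausible.append(raw)
        else:
            suspicious.append(raw)
    return plausible, suspicious


print(split_by_plausibility("0,600,1400,1000000600"))
```

Values flagged as suspicious are not necessarily wrong. They only indicate that the encoding should be checked against the conformance statement for the corresponding scanner and software version.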

## Next steps

Share your feedback or ask questions about this tutorial in IDC Forum: https://discourse.canceridc.dev.