<a href="https://colab.research.google.com/github/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/advanced_topics/clinical_data_intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IDC clinical data exploration

The goal of this notebook is to introduce the users of NCI Imaging Data Commons (IDC) to the organization of clinical data that accompany some of the IDC imaging data. For a quick visual summary of the clinical data available in IDC, please check out [this DataStudio dashboard](https://datastudio.google.com/u/0/reporting/04cf5976-4ea0-4fee-a749-8bfd162f2e87/page/p_s7mk6eybqc).

[NCI Imaging Data Commons (IDC)](https://imaging.datacommons.cancer.gov) is a cloud-based environment containing publicly available cancer imaging data co-located with analysis and exploration tools and resources. IDC is a node within the broader NCI Cancer Research Data Commons (CRDC) infrastructure that provides secure access to a large, comprehensive, and expanding collection of cancer research data.

If you are not familiar with IDC, we recommend that you first take a look at the [Getting started](https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/getting_started/) notebooks, which serve as an introduction to working with IDC programmatically.

If you have any questions about this tutorial, please post them on the [IDC user forum](https://discourse.canceridc.dev/) (preferred) or email IDC support at support@canceridc.dev!

Authored by Andrey Fedorov and George White

Prepared: July 2022

Updated: Jan 2025

# Prerequisites

The only prerequisite is [`idc-index`](https://github.com/ImagingDataCommons/idc-index), a Python package that contains various utilities to simplify access to IDC data.

In [41]:
%%capture
!pip install --upgrade idc-index

# Clinical data - background

By clinical data we refer to the broad spectrum of non-imaging data that may accompany images. Such data may include patient demographics, observations related to clinical history (therapies, diagnoses, findings), lab tests, and surgeries.

Clinical data is often critical in understanding imaging data, and is essential for the development and validation of imaging biomarkers. However, such data is most often stored in spreadsheets that follow conventions specific to the site that collected the data, may not be accompanied by a dictionary defining the terms used, and is rarely harmonized. As an example, consider examining the clinical data that accompanies the [ACRIN 6698 collection](https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=50135447) curated by The Cancer Imaging Archive (TCIA). The file named "Full Ancillary Patient Information file.xlsx" linked from the collection page contains two sheets: one is the dictionary, and the other contains per-patient data with the columns defined by the dictionary.

Not only are the terms used in the clinical data accompanying individual collections not harmonized, but the format of the spreadsheets is also collection-specific. In order to search and navigate clinical data, one has to parse those collection-specific tables, and there is no interface to support searching across collections.

Starting with IDC release v11, we attempt to lower the barriers to accessing clinical data accompanying IDC imaging collections. We parse collection-specific tables and organize the underlying data into BigQuery tables that can be accessed using standard SQL queries. You can also see the summary of clinical data available for IDC collections in [this dashboard](https://datastudio.google.com/u/0/reporting/04cf5976-4ea0-4fee-a749-8bfd162f2e87/page/p_s7mk6eybqc). Further, we make the content of the extracted clinical data available via the [`idc-index`](https://github.com/ImagingDataCommons/idc-index) Python package.

By the end of this tutorial you will have learned how IDC clinical data is organized, and how to write queries to interrogate this data.

# Organization of clinical data in IDC

`idc-index` packages _indices_ - tables containing key metadata describing data available in IDC. The main index that supports API calls related to download and search is installed by default. To search the clinical data accompanying IDC images you will need the `clinical_index` table, which helps you navigate that data.

In [2]:
from idc_index import IDCClient

c = IDCClient()

c.fetch_index('clinical_index')

print('Columns available in clinical_index:\n'+'\n'.join(c.clinical_index.keys()))

Columns available in clinical_index:
collection_id
table_name
short_table_name
column
column_label
values


This table is documented in https://idc-index.readthedocs.io/en/latest/column_descriptions.html#clinical-index.


Accessing relevant clinical data for a given collection is a two-step process:

* Step 1: use `clinical_index` table to identify relevant metadata attributes and the names of the tables where the corresponding metadata is located.
* Step 2: load the specific clinical data table with the selected attribute referenced from `clinical_index` and access clinical metadata for the individual patients.

As a reminder, all of the data - including clinical metadata - in IDC is anonymized!

For the sake of this example, we will start by identifying the clinical data attributes that accompany the NLST collection.

The `collection_id` column of `clinical_index` can be used to associate a clinical data attribute with the collection it accompanies.
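The two steps above can be sketched on a small stand-in table. The rows below are made up for illustration: `nlst_example_table`, the `age`/`gender` attributes and their labels are hypothetical placeholders, not values taken from the real index. With the real data you would use `c.clinical_index` in place of `toy_index`.

```python
import pandas as pd

# Made-up stand-in for c.clinical_index (same column names, hypothetical rows).
toy_index = pd.DataFrame(
    {
        "collection_id": ["nlst", "nlst", "c4kc_kits"],
        "short_table_name": [
            "nlst_example_table",
            "nlst_example_table",
            "c4kc_kits_clinical",
        ],
        "column": ["age", "gender", "comorbidities__aids"],
        "column_label": ["Age (hypothetical label)", "Gender", "comorbidities__aids"],
    }
)

# Step 1: keep only the attributes that accompany the collection of interest.
nlst_attributes = toy_index[toy_index["collection_id"] == "nlst"]

# Step 2 needs the names of the tables that hold those attributes.
nlst_tables = nlst_attributes["short_table_name"].unique().tolist()
```

With the real index, each name in `nlst_tables` could then be passed to `c.get_clinical_table()`, which is used later in this notebook.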

## Understanding individual clinical data attributes

Here's an example of the columns that correspond to the clinical metadata in one of the ACRIN collections, where `column` is not particularly helpful, but `column_label` provides human readable information to allow interpretation of the column.

In [5]:
acrin_nsclc_fdg_pet_clinical_columns = c.clinical_index[c.clinical_index['collection_id']=='acrin_nsclc_fdg_pet']
acrin_nsclc_fdg_pet_clinical_columns[['collection_id','short_table_name','column','column_label','values']]



Unnamed: 0,collection_id,short_table_name,column,column_label,values
2359,acrin_nsclc_fdg_pet,acrin_nsclc_fdg_pet_A0,inst_no,De-identified Institution Number,"[{'option_code': '""nan""', 'option_description'..."
2360,acrin_nsclc_fdg_pet,acrin_nsclc_fdg_pet_A0,entryage,Patient age at Registration,"[{'option_code': '""nan""', 'option_description'..."
2361,acrin_nsclc_fdg_pet,acrin_nsclc_fdg_pet_A0,a0e26,RTOG INSTITUTION NUMBER,"[{'option_code': '""nan""', 'option_description'..."
2362,acrin_nsclc_fdg_pet,acrin_nsclc_fdg_pet_A0,rec,Data receipt (from base date),"[{'option_code': '""nan""', 'option_description'..."
2363,acrin_nsclc_fdg_pet,acrin_nsclc_fdg_pet_A0,a0e4d,"""Days from Base_dt to DATE THE STUDY-SPECIFIC...","[{'option_code': '""nan""', 'option_description'..."
...,...,...,...,...,...
3643,acrin_nsclc_fdg_pet,acrin_nsclc_fdg_pet_bamf_lung_pet_ct_segmentation,studyinstanceuid,StudyInstanceUID,[]
3644,acrin_nsclc_fdg_pet,acrin_nsclc_fdg_pet_bamf_lung_pet_ct_segmentation,ptseriesinstanceuid,PTSeriesInstanceUID,[]
3645,acrin_nsclc_fdg_pet,acrin_nsclc_fdg_pet_bamf_lung_pet_ct_segmentation,ctseriesinstanceuid,CTSeriesInstanceUID,[]
3646,acrin_nsclc_fdg_pet,acrin_nsclc_fdg_pet_bamf_lung_pet_ct_segmentation,aisegmentation,AISegmentation,[]


For some columns, the values come from a defined set. In the example above, we can examine the values used to encode patient race.

In [7]:
race_values = acrin_nsclc_fdg_pet_clinical_columns[acrin_nsclc_fdg_pet_clinical_columns["column_label"] == "RACE"]["values"]

race_values.tolist()

[array([{'option_code': '1.0', 'option_description': 'American Indian or Alaska Native'},
        {'option_code': '2.0', 'option_description': 'Asian'},
        {'option_code': '3.0', 'option_description': 'Black or African American'},
        {'option_code': '4.0', 'option_description': 'Native Hawaiian or other Pacific Islander'},
        {'option_code': '5.0', 'option_description': 'White'},
        {'option_code': '6.0', 'option_description': 'More than one race'},
        {'option_code': '9.0', 'option_description': 'Unknown'}],
       dtype=object)]
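A value set like the one above can be turned into a lookup table for decoding coded entries. The following is a minimal sketch: the three dictionary entries are copied from the output above, while the `coded_race` series is hypothetical and assumes the codes are stored as strings (in a real clinical table they may be numeric and need converting first).

```python
import pandas as pd

# Value set entries copied from the output above (shortened to three).
race_value_set = [
    {"option_code": "1.0", "option_description": "American Indian or Alaska Native"},
    {"option_code": "2.0", "option_description": "Asian"},
    {"option_code": "5.0", "option_description": "White"},
]

# Build a code -> description lookup.
race_lookup = {v["option_code"]: v["option_description"] for v in race_value_set}

# Hypothetical coded column; assumes codes are stored as strings.
coded_race = pd.Series(["5.0", "2.0", "5.0"], name="race")

# Replace each code with its human-readable description.
decoded_race = coded_race.map(race_lookup)
```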

On the other hand, if we look at the metadata available for the `c4kc_kits` collection, `column_label` and `column` are identical.


In [8]:
c.clinical_index[c.clinical_index["collection_id"] == "c4kc_kits"][:3]



Unnamed: 0,collection_id,table_name,short_table_name,column,column_label,values
3828,c4kc_kits,bigquery-public-data.idc_v20_clinical.c4kc_kit...,c4kc_kits_clinical,comorbidities__dementia,comorbidities__dementia,"[{'option_code': 'False', 'option_description'..."
3829,c4kc_kits,bigquery-public-data.idc_v20_clinical.c4kc_kit...,c4kc_kits_clinical,comorbidities__aids,comorbidities__aids,"[{'option_code': 'False', 'option_description'..."
3830,c4kc_kits,bigquery-public-data.idc_v20_clinical.c4kc_kit...,c4kc_kits_clinical,intraoperative_complications__cardiac_event,intraoperative_complications__cardiac_event,"[{'option_code': 'False', 'option_description'..."


**As a general rule of thumb**:
* when selecting specific columns from clinical tables, use `column` values
* when searching for concepts of interest in `clinical_index`, use `column_label`
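As a rough illustration of this rule of thumb, the sketch below searches a stand-in index by `column_label` and then uses the matching `column` values to select data. Both tables are toy data: `toy_clinical`, the patient IDs and the numbers are hypothetical, and only the two label/column pairs are borrowed from the ACRIN output shown earlier.

```python
import pandas as pd

# Toy stand-in for c.clinical_index.
toy_index = pd.DataFrame(
    {
        "short_table_name": ["toy_clinical", "toy_clinical"],
        "column": ["entryage", "inst_no"],
        "column_label": [
            "Patient age at Registration",
            "De-identified Institution Number",
        ],
    }
)

# Toy stand-in for a clinical table (hypothetical patients and values).
toy_clinical = pd.DataFrame(
    {"dicom_patient_id": ["P1", "P2"], "entryage": [61, 54], "inst_no": [3, 7]}
)

# Search by the human-readable label ...
matches = toy_index[
    toy_index["column_label"].str.contains("patient age", case=False, na=False)
]

# ... then select by the machine-readable name stored in `column`.
selected_columns = ["dicom_patient_id"] + matches["column"].tolist()
ages = toy_clinical[selected_columns]
```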

# Exploring IDC clinical data

In the following sections of the notebook we go over some use cases to demonstrate various options for navigating IDC clinical data.

As always, if you have a use case that is not addressed here, if you have suggestions or are confused - please start a discussion thread on the [IDC User forum](https://discourse.canceridc.dev/)!

## Collection-focused exploration

If you have used clinical data stored in TCIA, you probably started with a specific collection of interest, downloaded the clinical data files for that collection, and parsed them in your code. Let's go over those steps the IDC way!

First, let's see which of the collections in IDC have clinical data available. To do that, we list the distinct values of `collection_id` in the `clinical_index` table we fetched earlier.


In [9]:
c.clinical_index["collection_id"].unique().tolist()

['acrin_6698',
 'acrin_contralateral_breast_mr',
 'acrin_flt_breast',
 'acrin_nsclc_fdg_pet',
 'adrenal_acc_ki67_seg',
 'advanced_mri_breast_lesions',
 'anti_pd_1_lung',
 'b_mode_and_ceus_liver',
 'breast_diagnosis',
 'breast_mri_nact_pilot',
 'c4kc_kits',
 'cc_tumor_heterogeneity',
 'cmmd',
 'colorectal_liver_metastases',
 'covid_19_ar',
 'covid_19_ny_sbu',
 'cptac_brca',
 'cptac_ccrcc',
 'cptac_coad',
 'cptac_gbm',
 'cptac_hnscc',
 'cptac_lscc',
 'cptac_luad',
 'cptac_ov',
 'cptac_pda',
 'cptac_ucec',
 'ctpred_sunitinib_pannet',
 'duke_breast_cancer_mri',
 'ea1141',
 'hcc_tace_seg',
 'htan_hms',
 'htan_ohsu',
 'htan_vanderbilt',
 'htan_wustl',
 'ispy1',
 'ispy2',
 'lidc_idri',
 'lung_fused_ct_pathology',
 'lung_pet_ct_dx',
 'mediastinal_lymph_node_seg',
 'midrc_ricord_1a',
 'midrc_ricord_1b',
 'midrc_ricord_1c',
 'nlst',
 'nsclc_radiogenomics',
 'nsclc_radiomics',
 'nsclc_radiomics_genomics',
 'nsclc_radiomics_interobserver1',
 'prostate_diagnosis',
 'prostatex',
 'qin_breast',
 'rem
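Beyond listing collection names, it can be useful to see how much clinical metadata each collection carries. This is a minimal sketch on made-up rows (the `nlst_a`/`nlst_b` table names and the `x`/`y`/`z` columns are hypothetical); the same `groupby` should apply to `c.clinical_index`, assuming it is a pandas dataframe as used elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-in for c.clinical_index with hypothetical rows.
toy_index = pd.DataFrame(
    {
        "collection_id": ["acrin_6698", "acrin_6698", "nlst", "nlst", "nlst"],
        "short_table_name": [
            "acrin_6698_clinical",
            "acrin_6698_clinical",
            "nlst_a",
            "nlst_b",
            "nlst_b",
        ],
        "column": ["t0", "t1", "x", "y", "z"],
    }
)

# Count distinct tables and total attributes per collection.
summary = toy_index.groupby("collection_id").agg(
    num_tables=("short_table_name", "nunique"),
    num_attributes=("column", "count"),
)
```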

If you are interested in what clinical data is available for a specific collection, you can select only the rows corresponding to that collection in the `clinical_index` table. Here we select a subset of columns to improve readability of the dataframe.

Note that for some collections, clinical data sheets are accompanied by dictionaries, which formalize the values encountered. Examples of such collections are [ISPY1](https://wiki.cancerimagingarchive.net/display/Public/ISPY1) or ACRIN trials.

For many other collections there are no such dictionaries available. In those situations, the values you will see in the `values` columns have been derived by examining the distinct values encountered in the clinical data sheets.
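To make the idea of a derived value set concrete, here is a small sketch of how distinct values in a column could be turned into entries shaped like those in the `values` column. This is only an illustration on a hypothetical sheet, not a description of how the IDC curation pipeline is actually implemented.

```python
import pandas as pd

# Hypothetical clinical sheet with no accompanying dictionary.
toy_sheet = pd.DataFrame({"chemotherapy": ["Yes", "No", "Yes", None]})

# Collect the distinct non-missing values encountered in the column.
distinct_values = sorted(toy_sheet["chemotherapy"].dropna().unique())

# No dictionary means no descriptions, so option_description stays None.
derived_value_set = [
    {"option_code": value, "option_description": None} for value in distinct_values
]
```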

In the following we look at the clinical data columns ("dictionary terms") for the [ACRIN 6698 collection](https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=50135447) mentioned earlier.

In [14]:
acrin6698_clinical_columns = c.clinical_index[c.clinical_index["collection_id"] == "acrin_6698"]

acrin6698_clinical_columns



Unnamed: 0,collection_id,table_name,short_table_name,column,column_label,values
0,acrin_6698,bigquery-public-data.idc_v20_clinical.acrin_66...,acrin_6698_clinical,source_batch,idc_provenance_source_batch,"[{'option_code': '0', 'option_description': No..."
1,acrin_6698,bigquery-public-data.idc_v20_clinical.acrin_66...,acrin_6698_clinical,t0,T0 (baseline) MRI study included in collection,"[{'option_code': '0', 'option_description': ' ..."
2,acrin_6698,bigquery-public-data.idc_v20_clinical.acrin_66...,acrin_6698_clinical,t1,T1 (early-Tx) MRI study included in collection,"[{'option_code': '0', 'option_description': ' ..."
3,acrin_6698,bigquery-public-data.idc_v20_clinical.acrin_66...,acrin_6698_clinical,t2,T2 (inter-regimen) MRI study included in colle...,"[{'option_code': '0', 'option_description': ' ..."
4,acrin_6698,bigquery-public-data.idc_v20_clinical.acrin_66...,acrin_6698_clinical,t3,T3 (pre-surgery) MRI study included in collection,"[{'option_code': '0', 'option_description': ' ..."
5,acrin_6698,bigquery-public-data.idc_v20_clinical.acrin_66...,acrin_6698_clinical,bmmr2_train,Patient included in the BMMR2 challenge traini...,"[{'option_code': '0', 'option_description': ' ..."
6,acrin_6698,bigquery-public-data.idc_v20_clinical.acrin_66...,acrin_6698_clinical,bmmr2_test,Patient included in the BMMR2 challenge test c...,"[{'option_code': '0', 'option_description': ' ..."
7,acrin_6698,bigquery-public-data.idc_v20_clinical.acrin_66...,acrin_6698_clinical,primary_aim_t0,T0 MRI study included in the ACRIN-6698 primar...,"[{'option_code': '0', 'option_description': ' ..."
8,acrin_6698,bigquery-public-data.idc_v20_clinical.acrin_66...,acrin_6698_clinical,primary_aim_t1,T1 MRI study included in the ACRIN-6698 primar...,"[{'option_code': '0', 'option_description': ' ..."
9,acrin_6698,bigquery-public-data.idc_v20_clinical.acrin_66...,acrin_6698_clinical,primary_aim_t2,T2 MRI study included in the ACRIN-6698 primar...,"[{'option_code': '0', 'option_description': ' ..."


Here's how you can select just the specific columns in the table - this way it is easier to examine the data.

In [15]:
acrin6698_clinical_columns[["collection_id", "short_table_name", "column", "column_label", "values"]]



Unnamed: 0,collection_id,short_table_name,column,column_label,values
0,acrin_6698,acrin_6698_clinical,source_batch,idc_provenance_source_batch,"[{'option_code': '0', 'option_description': No..."
1,acrin_6698,acrin_6698_clinical,t0,T0 (baseline) MRI study included in collection,"[{'option_code': '0', 'option_description': ' ..."
2,acrin_6698,acrin_6698_clinical,t1,T1 (early-Tx) MRI study included in collection,"[{'option_code': '0', 'option_description': ' ..."
3,acrin_6698,acrin_6698_clinical,t2,T2 (inter-regimen) MRI study included in colle...,"[{'option_code': '0', 'option_description': ' ..."
4,acrin_6698,acrin_6698_clinical,t3,T3 (pre-surgery) MRI study included in collection,"[{'option_code': '0', 'option_description': ' ..."
5,acrin_6698,acrin_6698_clinical,bmmr2_train,Patient included in the BMMR2 challenge traini...,"[{'option_code': '0', 'option_description': ' ..."
6,acrin_6698,acrin_6698_clinical,bmmr2_test,Patient included in the BMMR2 challenge test c...,"[{'option_code': '0', 'option_description': ' ..."
7,acrin_6698,acrin_6698_clinical,primary_aim_t0,T0 MRI study included in the ACRIN-6698 primar...,"[{'option_code': '0', 'option_description': ' ..."
8,acrin_6698,acrin_6698_clinical,primary_aim_t1,T1 MRI study included in the ACRIN-6698 primar...,"[{'option_code': '0', 'option_description': ' ..."
9,acrin_6698,acrin_6698_clinical,primary_aim_t2,T2 MRI study included in the ACRIN-6698 primar...,"[{'option_code': '0', 'option_description': ' ..."


## From attribute names to data: accessing clinical data at patient level

`short_table_name` gives us the name of the locally stored table that contains the values for the column described in a given `clinical_index` row. All of the clinical data tables were downloaded as part of `idc-index` installation.

Here is how we can load the clinical data from the `acrin_6698_clinical` table into a pandas dataframe.

In [18]:
acrin_6698_clinical_df = c.get_clinical_table("acrin_6698_clinical")

Given the information available in the per-collection clinical data, we can proceed with selecting a subset of patients that meet the criteria of interest. As an example, the following code selects all of the distinct combinations of patient ID and tumor grade, as defined by the contents of the `sbrgrade` column.


In [19]:
# Select distinct combinations of dicom_patient_id and sbrgrade from the acrin_6698_clinical_df dataframe
acrin_6698_tumors = acrin_6698_clinical_df[['dicom_patient_id', 'sbrgrade']].drop_duplicates()
acrin_6698_tumors



Unnamed: 0,dicom_patient_id,sbrgrade
0,ACRIN-6698-453236,II (Intermediate)
1,ACRIN-6698-229047,
2,ACRIN-6698-384705,II (Intermediate)
3,ACRIN-6698-415631,III (High)
4,ACRIN-6698-793283,
...,...,...
380,ACRIN-6698-765671,II (Intermediate)
381,ACRIN-6698-962153,III (High)
382,ACRIN-6698-711476,II (Intermediate)
383,ACRIN-6698-361701,II (Intermediate)
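A table like this one is easy to summarize, for example by counting patients per tumor grade. The sketch below uses a hypothetical four-row stand-in (`toy_tumors`, with made-up patient IDs); the grade strings match those in the output above.

```python
import pandas as pd

# Hypothetical stand-in for the acrin_6698_tumors dataframe.
toy_tumors = pd.DataFrame(
    {
        "dicom_patient_id": ["A", "B", "C", "D"],
        "sbrgrade": ["II (Intermediate)", None, "II (Intermediate)", "III (High)"],
    }
)

# Number of patients per grade (missing grades are excluded by default).
grade_counts = toy_tumors["sbrgrade"].value_counts()

# Number of patients without a recorded grade.
num_missing = int(toy_tumors["sbrgrade"].isna().sum())
```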


We can next use `dicom_patient_id` to link clinical data with the imaging studies available for the given patient (which is available in the `index` table included in `idc-index`). The query becomes a bit more complex, since we need to join data across two tables.

In [25]:
query = """
SELECT
  ANY_VALUE(PatientID) AS PatientID,
  STRING_AGG(DISTINCT(acrin_6698_clinical_df.sbrgrade)) as tumor_grade,
  STRING_AGG(DISTINCT(Modality)) AS modalities,
  COUNT(DISTINCT(SeriesInstanceUID)) AS num_series,
  ANY_VALUE(CONCAT('https://viewer.imaging.datacommons.cancer.gov/viewer/', StudyInstanceUID)) AS viewer_url
FROM
  index
JOIN
  acrin_6698_clinical_df
ON
  index.PatientID = acrin_6698_clinical_df.dicom_patient_id
GROUP BY
  StudyInstanceUID
ORDER BY
  PatientID
"""

acrin_6698_viewable = c.sql_query(query)

In [26]:
acrin_6698_viewable



Unnamed: 0,PatientID,tumor_grade,modalities,num_series,viewer_url
0,ACRIN-6698-102212,III (High),"MR,SEG",19,https://viewer.imaging.datacommons.cancer.gov/...
1,ACRIN-6698-102212,III (High),"SEG,MR",19,https://viewer.imaging.datacommons.cancer.gov/...
2,ACRIN-6698-102212,III (High),"SEG,MR",19,https://viewer.imaging.datacommons.cancer.gov/...
3,ACRIN-6698-102212,III (High),"MR,SEG",15,https://viewer.imaging.datacommons.cancer.gov/...
4,ACRIN-6698-103939,III (High),"SEG,MR",18,https://viewer.imaging.datacommons.cancer.gov/...
...,...,...,...,...,...
1118,ACRIN-6698-995480,III (High),"SEG,MR",18,https://viewer.imaging.datacommons.cancer.gov/...
1119,ACRIN-6698-995480,III (High),"SEG,MR",18,https://viewer.imaging.datacommons.cancer.gov/...
1120,ACRIN-6698-995480,III (High),"SEG,MR",18,https://viewer.imaging.datacommons.cancer.gov/...
1121,ACRIN-6698-995480,III (High),"MR,SEG",14,https://viewer.imaging.datacommons.cancer.gov/...
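For readers more comfortable with pandas than SQL, roughly the same join and per-study aggregation can be sketched with `merge` and `groupby`. Everything below is toy data with hypothetical identifiers, and taking the `first` value per group is a simplification of the `ANY_VALUE`/`STRING_AGG(DISTINCT ...)` calls in the query.

```python
import pandas as pd

# Toy stand-in for the imaging index (one row per series).
toy_imaging = pd.DataFrame(
    {
        "PatientID": ["P1", "P1", "P1", "P2"],
        "StudyInstanceUID": ["1.2.3", "1.2.3", "1.2.4", "1.2.5"],
        "SeriesInstanceUID": ["s1", "s2", "s3", "s4"],
        "Modality": ["MR", "SEG", "MR", "MR"],
    }
)

# Toy stand-in for the clinical table (one row per patient).
toy_clinical = pd.DataFrame(
    {
        "dicom_patient_id": ["P1", "P2"],
        "sbrgrade": ["III (High)", "II (Intermediate)"],
    }
)

# Join imaging and clinical rows on the patient identifier.
joined = toy_imaging.merge(
    toy_clinical, left_on="PatientID", right_on="dicom_patient_id"
)

# Aggregate to one row per study.
per_study = (
    joined.groupby("StudyInstanceUID")
    .agg(
        PatientID=("PatientID", "first"),
        tumor_grade=("sbrgrade", "first"),
        modalities=("Modality", lambda m: ",".join(sorted(m.unique()))),
        num_series=("SeriesInstanceUID", "nunique"),
    )
    .reset_index()
)
```

Sorting the modalities before joining them gives a stable string per study, which avoids the mixed `MR,SEG` / `SEG,MR` ordering visible in the SQL output.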




## Discovery mode

Sometimes you may want to find out whether a specific clinical attribute is available for the imaging data in IDC.

We can start by looking at the distinct values of `column_label` (which in the general case will be either more descriptive, or identical to `column`).


In [29]:
import pandas as pd

pd.DataFrame({"column_label":c.clinical_index["column_label"].unique()})



Unnamed: 0,column_label
0,idc_provenance_source_batch
1,T0 (baseline) MRI study included in collection
2,T1 (early-Tx) MRI study included in collection
3,T2 (inter-regimen) MRI study included in colle...
4,T3 (pre-surgery) MRI study included in collection
...,...
3707,ID
3708,Survival_from_surgery_days_UPDATED
3709,Survival_Censor
3710,Time_since_baseline_preop


Let's say we want to know which of the cases have information related to therapy. We can search the column labels for the presence of the word "therapy" (since the terms in clinical data are not harmonized, we need to account for variability in capitalization).

In [31]:
c.clinical_index[c.clinical_index["column_label"].str.contains("[tT]herapy", na=False)][["column_label", "collection_id", "values"]]



Unnamed: 0,column_label,collection_id,values
59,current or recent hx (6 months prior to MRI) c...,acrin_contralateral_breast_mr,"[{'option_code': '1.0', 'option_description': ..."
201,current use of estrogen replacement therapy,acrin_contralateral_breast_mr,"[{'option_code': '1.0', 'option_description': ..."
202,current use of tamoxifen/serm therapy,acrin_contralateral_breast_mr,"[{'option_code': '1.0', 'option_description': ..."
203,current use aromatase inhibitor therapy,acrin_contralateral_breast_mr,"[{'option_code': '1.0', 'option_description': ..."
205,past use of estrogen replacement therapy,acrin_contralateral_breast_mr,"[{'option_code': '1.0', 'option_description': ..."
206,past use of tamoxifen/serm therapy,acrin_contralateral_breast_mr,"[{'option_code': '1.0', 'option_description': ..."
207,prior use aromatase inhibitor therapy,acrin_contralateral_breast_mr,"[{'option_code': '1.0', 'option_description': ..."
3891,Early response (1-2 months post-therapy),cc_tumor_heterogeneity,"[{'option_code': 'Non-responder', 'option_desc..."
4000,Did the patient have documented renal replacem...,covid_19_ny_sbu,"[{'option_code': 'Yes', 'option_description': ..."
4004,Patient had other anticoagulation therapy as l...,covid_19_ny_sbu,"[{'option_code': 'apixaban', 'option_descripti..."
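The character class `[tT]herapy` handles only the first letter. As an alternative sketch, pandas' `str.contains` accepts `case=False`, which also catches fully capitalized spellings. The labels below are a toy series: two are taken from outputs in this notebook and `"Radiation THERAPY start"` is a hypothetical label added to show the difference.

```python
import pandas as pd

# Toy stand-in for c.clinical_index["column_label"].
toy_labels = pd.Series(
    [
        "current use of tamoxifen/serm therapy",
        "Chemotherapy",
        "Radiation THERAPY start",  # hypothetical label
        None,
        "Patient age at Registration",
    ],
    name="column_label",
)

# Case-insensitive match; na=False treats missing labels as non-matches.
mask = toy_labels.str.contains("therapy", case=False, na=False)
therapy_labels = toy_labels[mask].tolist()

# For comparison: the character-class pattern misses the all-caps spelling.
strict_labels = toy_labels[
    toy_labels.str.contains("[tT]herapy", na=False)
].tolist()
```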


We observe that some collections contain a column related to chemotherapy. Let's filter these values further, in order to identify subjects that underwent chemotherapy.

In [32]:
c.clinical_index[c.clinical_index["column_label"].str.contains("[Cc]hemotherapy", na=False)][[ "collection_id", "table_name", "column", "column_label","values"]]



Unnamed: 0,collection_id,table_name,column,column_label,values
5570,hcc_tace_seg,bigquery-public-data.idc_v20_clinical.hcc_tace...,chemotherapy,chemotherapy used for TACE procedure,"[{'option_code': 'Cisplastin', 'option_descrip..."
7141,nsclc_radiogenomics,bigquery-public-data.idc_v20_clinical.nsclc_ra...,chemotherapy,Chemotherapy,"[{'option_code': 'No', 'option_description': N..."




From the table above we can observe that two collections have clinical metadata related to the chemotherapy regimen of the subject.

Looking at the value sets for those collections and columns, subjects that had any chemotherapy could be selected as follows (non-exhaustive list):
* `nsclc_radiogenomics`: subjects that have value `Yes` in table `nsclc_radiogenomics_clinical` column `chemotherapy`
* `hcc_tace_seg`: subjects that have a value other than `NA` in table `hcc_tace_seg_clinical` column `chemotherapy`
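As a minimal sketch of the first selection rule, the snippet below filters a stand-in table for subjects whose `chemotherapy` value is `Yes`. The dataframe and patient IDs are hypothetical; with the real data you would load the table with `c.get_clinical_table("nsclc_radiogenomics_clinical")`, assuming it exposes the same `dicom_patient_id` and `chemotherapy` columns.

```python
import pandas as pd

# Hypothetical stand-in for the nsclc_radiogenomics_clinical table.
toy_nsclc = pd.DataFrame(
    {
        "dicom_patient_id": ["P1", "P2", "P3"],
        "chemotherapy": ["Yes", "No", "Yes"],
    }
)

# Keep the IDs of subjects that received chemotherapy.
had_chemo = toy_nsclc.loc[
    toy_nsclc["chemotherapy"] == "Yes", "dicom_patient_id"
].tolist()
```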

Let's focus on the clinical data related to chemotherapy for the collection `hcc_tace_seg`.

In [33]:
chemotherapy_subset = c.clinical_index[c.clinical_index["column_label"].str.contains("[Cc]hemotherapy", na=False)][[ "collection_id", "table_name", "column", "column_label","values"]]

chemotherapy_subset = chemotherapy_subset[chemotherapy_subset["collection_id"] == "hcc_tace_seg"]

chemotherapy_subset



Unnamed: 0,collection_id,table_name,column,column_label,values
5570,hcc_tace_seg,bigquery-public-data.idc_v20_clinical.hcc_tace...,chemotherapy,chemotherapy used for TACE procedure,"[{'option_code': 'Cisplastin', 'option_descrip..."


Before we select subjects that meet the criteria defined above, let's confirm that the values encountered in the `chemotherapy` column of the `hcc_tace_seg_clinical` table match the value set in the `clinical_index` table. Here are the dictionary values listed in `clinical_index`.

In [34]:
chemotherapy_subset["values"].tolist()

[array([{'option_code': 'Cisplastin', 'option_description': None},
        {'option_code': 'Cisplatin, Mitomycin-C', 'option_description': None},
        {'option_code': 'Cisplatin, doxorubicin, Mitomycin-C', 'option_description': None},
        {'option_code': 'NA', 'option_description': None},
        {'option_code': 'doxorubicin LC beads', 'option_description': None}],
       dtype=object)]

The code below selects the distinct values encountered in the `chemotherapy` column of the `hcc_tace_seg_clinical` table, which we can confirm match those in the dictionary.

In [35]:
hcc_tace_seg_clinical_df = c.get_clinical_table("hcc_tace_seg_clinical")

hcc_tace_seg_clinical_df["chemotherapy"].unique()


array(['Cisplatin, doxorubicin, Mitomycin-C', 'doxorubicin LC beads',
       'NA', 'Cisplastin', 'Cisplatin, Mitomycin-C'], dtype=object)
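Rather than comparing the two outputs by eye, the check can be expressed as a set comparison. The values below are copied from the two outputs above; with live data they would come from `chemotherapy_subset["values"]` and `hcc_tace_seg_clinical_df["chemotherapy"].unique()`.

```python
# Option codes listed in the dictionary (copied from the output above).
dictionary_values = {
    "Cisplastin",
    "Cisplatin, Mitomycin-C",
    "Cisplatin, doxorubicin, Mitomycin-C",
    "NA",
    "doxorubicin LC beads",
}

# Distinct values observed in the clinical table (copied from the output above).
observed_values = {
    "Cisplatin, doxorubicin, Mitomycin-C",
    "doxorubicin LC beads",
    "NA",
    "Cisplastin",
    "Cisplatin, Mitomycin-C",
}

# Both differences are empty when the dictionary and the data agree.
missing_from_table = dictionary_values - observed_values
missing_from_dictionary = observed_values - dictionary_values
```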

Next we can examine the data to see what therapy individual patients had.

In [36]:
# Select distinct combinations of dicom_patient_id and chemotherapy from hcc_tace_seg_clinical_df, keeping only rows where chemotherapy is not missing

distinct_combinations = hcc_tace_seg_clinical_df[hcc_tace_seg_clinical_df['chemotherapy'].notna()][['dicom_patient_id', 'chemotherapy']].drop_duplicates()
distinct_combinations



Unnamed: 0,dicom_patient_id,chemotherapy
0,HCC_024,"Cisplatin, doxorubicin, Mitomycin-C"
1,HCC_045,"Cisplatin, doxorubicin, Mitomycin-C"
2,HCC_050,"Cisplatin, doxorubicin, Mitomycin-C"
3,HCC_065,"Cisplatin, doxorubicin, Mitomycin-C"
4,HCC_073,doxorubicin LC beads
...,...,...
100,HCC_056,"Cisplatin, doxorubicin, Mitomycin-C"
101,HCC_070,doxorubicin LC beads
102,HCC_061,
103,HCC_075,doxorubicin LC beads


`dicom_patient_id` is the key that connects clinical data with imaging data. Let's see what imaging studies are available for one of the patients in this collection, `HCC_103`. Along the way we can also generate viewer URLs to conveniently examine the images.

In [40]:
query = """
SELECT
  StudyInstanceUID,
  STRING_AGG(DISTINCT(Modality)) AS modalities,
  STRING_AGG(DISTINCT(collection_id)) AS collection_id,
  COUNT(DISTINCT(SeriesInstanceUID)) AS num_series,
  ANY_VALUE(CONCAT('https://viewer.imaging.datacommons.cancer.gov/viewer/', StudyInstanceUID)) as viewer_url
FROM index
WHERE PatientID = 'HCC_103'
GROUP BY StudyInstanceUID
"""

c.sql_query(query)



Unnamed: 0,StudyInstanceUID,modalities,collection_id,num_series,viewer_url
0,1.3.6.1.4.1.14519.5.2.1.1706.8374.304819071818...,"CT,SEG",hcc_tace_seg,3,https://viewer.imaging.datacommons.cancer.gov/...
1,1.3.6.1.4.1.14519.5.2.1.1706.8374.121752675166...,CT,hcc_tace_seg,2,https://viewer.imaging.datacommons.cancer.gov/...
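The viewer URL is simply the viewer prefix used in the query followed by the `StudyInstanceUID`, so it can also be built directly in pandas. The sketch below uses placeholder UIDs rather than real study identifiers.

```python
import pandas as pd

# Viewer prefix as used in the SQL query above.
VIEWER_PREFIX = "https://viewer.imaging.datacommons.cancer.gov/viewer/"

# Placeholder UIDs, not real studies.
toy_studies = pd.DataFrame({"StudyInstanceUID": ["1.2.3", "1.2.4"]})

# Concatenate the prefix with each UID to get one URL per study.
toy_studies["viewer_url"] = VIEWER_PREFIX + toy_studies["StudyInstanceUID"]
```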


# Want to learn more?

* check out other notebooks: https://github.com/ImagingDataCommons/IDC-Tutorials
* join our community forum to ask any questions about IDC: https://discourse.canceridc.dev/
* ask your questions during live discussions with IDC developers at the IDC weekly office hours - join us on Google Meet at https://meet.google.com/xyt-vody-tvb every Tuesday 16:30 – 17:30 (New York) and Wednesday 10:30-11:30 (New York)
* browse IDC portal: https://imaging.datacommons.cancer.gov/explore/
* read IDC paper: https://doi.org/10.1148/rg.230180
* watch a recent presentation about IDC: https://youtu.be/P9ateg9ZUEs