<a href="https://colab.research.google.com/github/ImagingDataCommons/IDC-Examples/blob/master/notebooks/clinical_data_intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [WIP] IDC clinical data exploration

The goal of this notebook is to introduce the users of NCI Imaging Data Commons (IDC) to the organization of clinical data BigQuery tables that accompany some of the IDC imaging data.

[NCI Imaging Data Commons (IDC)](https://imaging.datacommons.cancer.gov) is a cloud-based repository of publicly available cancer imaging data co-located with the analysis and exploration tools and resources. IDC is a node within the broader [NCI Cancer Research Data Commons (CRDC)](https://datacommons.cancer.gov/) infrastructure that provides secure access to a large, comprehensive, and expanding collection of cancer research data.

If you are not familiar with IDC, we recommend you first take a look at the [Getting started](https://github.com/ImagingDataCommons/IDC-Examples/blob/master/notebooks/getting_started.ipynb) notebook, which serves as an introduction to working with IDC programmatically.

If you have any questions about this tutorial, please post your questions on the [IDC user forum](https://discourse.canceridc.dev/) (preferred) or email IDC support at support@canceridc.dev!

Authored by Andrey Fedorov and George White

Prepared: July 2022

Updated: Sept 2022

# Prerequisites

In order to be able to run the cells in this notebook, you must complete the prerequisites to set up your Google Cloud Platform account, as discussed here: https://learn.canceridc.dev/introduction/getting-started-with-gcp.

Once you have completed the prerequisites, insert your Google Cloud Platform project ID in the cell below in place of the example value `idc-sandbox-000`.

In [1]:
# initialize this variable with your Google Cloud Project ID!
my_ProjectID = "idc-sandbox-000"

import os
os.environ["GCP_PROJECT_ID"] = my_ProjectID

!gcloud config set project $GCP_PROJECT_ID

Updated property [core/project].


In the following cell you will be asked to authorize Google Colaboratory to act on your behalf. You must allow this in order to proceed with the cells that follow.

In [2]:
# you will need to authenticate with your Google ID to do anything meaningful with IDC
from google.colab import auth
auth.authenticate_user()

# Clinical data - background

By clinical data we refer to the broad spectrum of image-related data that may accompany images. Such data may include demographics of the patients, observations related to their clinical history (therapies, diagnoses, findings), lab tests, surgeries.

Clinical data is often critical in understanding imaging data, and is essential for the development and validation of imaging biomarkers. However, such data is most often stored in spreadsheets that follow conventions specific to the site that collected the data, may not be accompanied by the dictionary defining the terms used in describing clinical data, and is rarely harmonized. As an example, consider examining the clinical data that accompanies the [ACRIN 6698 collection](https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=50135447) curated by The Cancer Imaging Archive (TCIA). The file named "Full Ancillary Patient Information file.xlsx" linked from the collection page contains two sheets: one is the dictionary, and the other contains per-patient data with the columns defined by the dictionary.

The terms used in the clinical data accompanying individual collections are not harmonized, and the format of the spreadsheets is also collection-specific. In order to search and navigate clinical data, one has to parse those collection-specific tables, and there is no interface to support searching across collections.

With release v11 of IDC, we attempt to lower the barriers to accessing clinical data accompanying IDC imaging collections. We parse collection-specific tables, and organize the underlying data into BigQuery tables that can be accessed using standard SQL queries. You can also see the summary of clinical data available for IDC collections in [this dashboard](https://datastudio.google.com/u/0/reporting/04cf5976-4ea0-4fee-a749-8bfd162f2e87/page/p_s7mk6eybqc).

By the end of this tutorial you will have learned how IDC clinical data is organized, and how to write queries to interrogate this data.

# Overall organization of clinical data in IDC

Clinical data accompanying imaging collections is organized into collection-specific tables in the `idc_current_clinical` BigQuery dataset. To locate this dataset, navigate to the [BigQuery console](https://console.cloud.google.com/bigquery) and find `idc_current_clinical` under the `bigquery-public-data` project, which you should have added in the prerequisites mentioned earlier. The queries in this notebook use the versioned `idc_v11_clinical` dataset.

The only column that is initialized consistently across those per-collection clinical tables is `dicom_patient_id`, which corresponds to the `PatientID` column in the `dicom_all` table.

Note that some imaging collections do not have any clinical data at all, while others may have multiple tables containing various types of clinical data.

`column_metadata` can be used as a "key" for identifying information of interest in clinical metadata. It contains information about the columns across all of the collections for which we have clinical data.

First, instantiate the query client, which can next be configured to run the query.

In [3]:
# python API is the most flexible way to query IDC BigQuery metadata tables
from google.cloud import bigquery
bq_client = bigquery.Client(my_ProjectID)

In [4]:
selection_query = """
SELECT *
FROM bigquery-public-data.idc_v11_clinical.column_metadata
"""

selection_result = bq_client.query(selection_query)
column_metadata_df = selection_result.result().to_dataframe()

In [5]:
column_metadata_df.columns

Index(['collection_id', 'case_col', 'table_name', 'column', 'column_label',
       'data_type', 'original_column_headers', 'values', 'files',
       'sheet_names', 'batch', 'column_numbers'],
      dtype='object')

The most important columns in this table are the following:
* `column`: name of the column in a collection-specific clinical metadata table; note that for some of the collections this name may not contain any human readable information, and may instead be a coded value
* `column_label`: for some collections it may contain a description for a column name that otherwise may not be human readable. For other collections it is the original name of the column in the source Excel file
* `collection_id` and `table_name` identify the collection the specific column corresponds to, and the name of the clinical data table where this column is present.
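
Since a collection may have more than one clinical table, it can help to group `table_name` by `collection_id`. The following is a minimal sketch in plain Python; the `rows` list is an illustrative stand-in for the full table, with table names copied from this notebook's outputs. With the dataframe loaded above, `column_metadata_df.groupby("collection_id")["table_name"].nunique()` should give the same counts.

```python
from collections import defaultdict

# Illustrative stand-in for (collection_id, table_name) pairs from column_metadata.
# The duplicate row mimics the fact that column_metadata has one row per column.
rows = [
    ("acrin_6698", "bigquery-public-data.idc_v11_clinical.acrin_6698_clinical"),
    ("acrin_flt_breast", "bigquery-public-data.idc_v11_clinical.acrin_flt_breast_A0"),
    ("acrin_flt_breast", "bigquery-public-data.idc_v11_clinical.acrin_flt_breast_A1"),
    ("acrin_flt_breast", "bigquery-public-data.idc_v11_clinical.acrin_flt_breast_A0"),
]


def tables_per_collection(pairs):
    """Map each collection_id to the sorted list of its distinct clinical tables."""
    grouped = defaultdict(set)
    for collection_id, table_name in pairs:
        grouped[collection_id].add(table_name)
    return {cid: sorted(names) for cid, names in grouped.items()}


tables = tables_per_collection(rows)
for cid, names in tables.items():
    print(cid, len(names))
```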

## Exploring clinical data: discovery mode

Sometimes you may want to find out whether a specific clinical attribute is available for the imaging data you can find in IDC.

Here's an example of the columns that correspond to the clinical metadata in one of the ACRIN collections, where `column` is not particularly helpful, but `column_label` provides human readable information to allow interpretation of the column.


In [None]:
column_metadata_df[column_metadata_df["collection_id"] == "acrin_nsclc_fdg_pet"][:5]

Unnamed: 0,collection_id,case_col,table_name,column,column_label,data_type,original_column_headers,values,files,sheet_names,batch,column_numbers
750,acrin_nsclc_fdg_pet,False,bigquery-public-data.idc_v11_clinical.acrin_ns...,a0e9,ETHNIC CATEGORY,List,"[['a0e9'], ['a0e9']]","[{'option_code': '1', 'option_description': 'H...",[ACRIN 6668 NSCLC FDG PET - ACRIN Data TCIA An...,[],"[0, 1]","[8, 8]"
751,acrin_nsclc_fdg_pet,False,bigquery-public-data.idc_v11_clinical.acrin_ns...,a0e10,RACE,List,"[['A0E10'], ['A0E10']]","[{'option_code': '1', 'option_description': 'A...",[ACRIN 6668 NSCLC FDG PET - ACRIN Data TCIA An...,[],"[0, 1]","[9, 9]"
752,acrin_nsclc_fdg_pet,False,bigquery-public-data.idc_v11_clinical.acrin_ns...,a0e11,GENDER,List,"[['a0e11'], ['a0e11']]","[{'option_code': '1', 'option_description': 'M...",[ACRIN 6668 NSCLC FDG PET - ACRIN Data TCIA An...,[],"[0, 1]","[10, 10]"
753,acrin_nsclc_fdg_pet,False,bigquery-public-data.idc_v11_clinical.acrin_ns...,a0e12,COUNTRY OF RESIDENCE,List,"[['a0e12'], ['a0e12']]","[{'option_code': '1', 'option_description': 'U...",[ACRIN 6668 NSCLC FDG PET - ACRIN Data TCIA An...,[],"[0, 1]","[11, 11]"
754,acrin_nsclc_fdg_pet,False,bigquery-public-data.idc_v11_clinical.acrin_ns...,a0e14,PARTICIPANT'S INSURANCE STATUS,List,"[['a0e14'], ['a0e14']]","[{'option_code': '3', 'option_description': 'M...",[ACRIN 6668 NSCLC FDG PET - ACRIN Data TCIA An...,[],"[0, 1]","[12, 12]"


For some columns, the values come from a defined set. In the example above, we can examine the values used to encode patient race.

In [None]:
race_values = column_metadata_df[(column_metadata_df["collection_id"] == "acrin_nsclc_fdg_pet") & (column_metadata_df["column_label"] == "RACE")]["values"]

race_values.tolist()

[[{'option_code': '1',
   'option_description': 'American Indian or Alaska Native'},
  {'option_code': '2', 'option_description': 'Asian'},
  {'option_code': '3', 'option_description': 'Black or African American'},
  {'option_code': '4',
   'option_description': 'Native Hawaiian or other Pacific Islander'},
  {'option_code': '5', 'option_description': 'White'},
  {'option_code': '6', 'option_description': 'More than one race'},
  {'option_code': '9', 'option_description': 'Unknown'}]]
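
A value set like this can be turned into a lookup table for decoding the coded values stored in the clinical table. The sketch below copies the value set from the output above; the `coded` list is a hypothetical example of column values, including one code that is not in the dictionary.

```python
# Value set copied from the column_metadata output above.
race_value_set = [
    {"option_code": "1", "option_description": "American Indian or Alaska Native"},
    {"option_code": "2", "option_description": "Asian"},
    {"option_code": "3", "option_description": "Black or African American"},
    {"option_code": "4", "option_description": "Native Hawaiian or other Pacific Islander"},
    {"option_code": "5", "option_description": "White"},
    {"option_code": "6", "option_description": "More than one race"},
    {"option_code": "9", "option_description": "Unknown"},
]


def build_decoder(value_set):
    """Return a dict mapping option_code to option_description."""
    return {opt["option_code"]: opt["option_description"] for opt in value_set}


race_decoder = build_decoder(race_value_set)

# Hypothetical coded values; "7" is deliberately absent from the dictionary.
coded = ["5", "3", "9", "7"]
decoded = [race_decoder.get(code, f"unmapped ({code})") for code in coded]
print(decoded)
```

Falling back to an explicit "unmapped" label, rather than raising an error, keeps unexpected codes visible when the dictionary is incomplete.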

On the other hand, if we look at the metadata available for the `c4kc_kits` collection, `column_label` and `column` are identical. 


In [None]:
column_metadata_df[column_metadata_df["collection_id"] == "c4kc_kits"][:3]

Unnamed: 0,collection_id,case_col,table_name,column,column_label,data_type,original_column_headers,values,files,sheet_names,batch,column_numbers
1186,c4kc_kits,False,bigquery-public-data.idc_v11_clinical.c4kc_kit...,hospitalization,hospitalization,String,[['hospitalization']],"[{'option_code': '0', 'option_description': No...",[C4KC KiTS_Clinical Data_Version 1.csv],[],[0],[31]
1187,c4kc_kits,False,bigquery-public-data.idc_v11_clinical.c4kc_kit...,tumor_histologic_subtype,tumor_histologic_subtype,String,[['tumor_histologic_subtype']],"[{'option_code': 'angiomyolipoma', 'option_des...",[C4KC KiTS_Clinical Data_Version 1.csv],[],[0],[39]
1483,c4kc_kits,True,bigquery-public-data.idc_v11_clinical.c4kc_kit...,patient_id,patient_id,String,[['patient_id']],[],[C4KC KiTS_Clinical Data_Version 1.csv],[],[0],[0]


**As a general rule of thumb**:
* when selecting specific columns from clinical tables, use `column`
* when searching for concepts of interest in `column_metadata`, use `column_label`

Now, let's look at the distinct values of `column_label` (which in the general case will be either more descriptive, or identical to `column`).

In [None]:
import pandas as pd

pd.DataFrame({"column_label":column_metadata_df["column_label"].unique()})

Unnamed: 0,column_label
0,T0 (baseline) MRI study included in collection
1,T1 (early-Tx) MRI study included in collection
2,T2 (inter-regimen) MRI study included in colle...
3,T3 (pre-surgery) MRI study included in collection
4,Patient included in the BMMR2 challenge traini...
...,...
4010,cn
4011,Data receipt (from base date)
4012,Data Entry Date (Days from base date)
4013,idc_provenance_source_batch


Let's say we want to know which of the cases have information related to therapy. We can search column metadata for the presence of the word "therapy" (since the terms in clinical data are not harmonized, we need to account for the variability in capitalization).

In [None]:
column_metadata_df[column_metadata_df["column_label"].str.contains("[tT]herapy", na=False)][["column_label", "collection_id", "values"]]

Unnamed: 0,column_label,collection_id,values
1193,radiotherapy_total_treat_time,head_neck_radiomics_hn1,"[{'option_code': '31', 'option_description': N..."
1197,CCRT Chemotherapy Regimen,hnscc,"[{'option_code': 'Carbolatin weekly', 'option_..."
1246,OnStudy Therapy Surgery Procedure Title,rembrandt,"[{'option_code': ' --', 'option_description': ..."
1604,Induction Chemotherapy,hnscc,[]
1754,OnStudy Therapy Chemo Agent Name,rembrandt,[]
...,...,...,...
5353,Other.anticoagulation.therapy,covid_19_ny_sbu,"[{'option_code': 'apixaban', 'option_descripti..."
5355,Chemotherapy Regimen,hnscc,"[{'option_code': 'Carbolatin weekly', 'option_..."
5356,Chemotherapy Medication,hnscc_3dct_rt,"[{'option_code': 'Carboplatin', 'option_descri..."
5364,OnStudy Therapy Radiation Type,rembrandt,"[{'option_code': ' --', 'option_description': ..."
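
A character class such as `[tT]herapy` covers only the capitalization of the first letter, so a label written as `THERAPY` would be missed. A fully case-insensitive match is a possible alternative. The sketch below shows the idea with the standard `re` module on a few labels copied from this notebook's outputs; with pandas, `str.contains("therapy", case=False, na=False)` should behave the same way.

```python
import re

# A few labels copied from the outputs in this notebook, plus a missing label.
labels = [
    "radiotherapy_total_treat_time",
    "CCRT Chemotherapy Regimen",
    "OnStudy Therapy Surgery Procedure Title",
    "Other.anticoagulation.therapy",
    "RACE",
    None,
]


def find_labels(candidates, term):
    """Return the labels that contain `term`, ignoring case and skipping missing labels."""
    pattern = re.compile(re.escape(term), flags=re.IGNORECASE)
    return [label for label in candidates if label is not None and pattern.search(label)]


matches = find_labels(labels, "therapy")
print(matches)
```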


We observe that several collections contain columns whose label mentions chemotherapy. Let's filter these values further, in order to identify subjects that underwent chemotherapy.

In [None]:
column_metadata_df[column_metadata_df["column_label"].str.contains("[Cc]hemotherapy", na=False)][[ "collection_id", "table_name", "column", "column_label","values"]]

Unnamed: 0,collection_id,table_name,column,column_label,values
1197,hnscc,bigquery-public-data.idc_v11_clinical.hnscc_cl...,ccrt_chemotherapy_regimen,CCRT Chemotherapy Regimen,"[{'option_code': 'Carbolatin weekly', 'option_..."
1604,hnscc,bigquery-public-data.idc_v11_clinical.hnscc_cl...,induction_chemotherapy,Induction Chemotherapy,[]
4085,opc_radiomics,bigquery-public-data.idc_v11_clinical.opc_radi...,chemotherapy_,Chemotherapy,"[{'option_code': 'Yes', 'option_description': ..."
4441,head_neck_radiomics_hn1,bigquery-public-data.idc_v11_clinical.head_nec...,chemotherapy_given,chemotherapy_given,"[{'option_code': 'concomitant', 'option_descri..."
4449,hnscc,bigquery-public-data.idc_v11_clinical.hnscc_cl...,platinum_based_chemotherapy,Platinum-based chemotherapy,"[{'option_code': 'No', 'option_description': N..."
4476,nsclc_radiogenomics,bigquery-public-data.idc_v11_clinical.nsclc_ra...,chemotherapy,Chemotherapy,"[{'option_code': 'No', 'option_description': N..."
4977,hcc_tace_seg,bigquery-public-data.idc_v11_clinical.hcc_tace...,chemotherapy,chemotherapy used for TACE procedure,"[{'option_code': 'Cisplastin', 'option_descrip..."
5355,hnscc,bigquery-public-data.idc_v11_clinical.hnscc_cl...,chemotherapy_regimen,Chemotherapy Regimen,"[{'option_code': 'Carbolatin weekly', 'option_..."
5356,hnscc_3dct_rt,bigquery-public-data.idc_v11_clinical.hnscc_3d...,chemotherapy_medication,Chemotherapy Medication,"[{'option_code': 'Carboplatin', 'option_descri..."


From the table above we can observe that:

1. There are 6 collections that have clinical metadata related to the chemotherapy regimen of the subject;

2. There are no value sets assigned for the column `induction_chemotherapy` in the table `hnscc_clinical`. Value sets may be missing if the dictionary is not provided, or if there is a large number (>20) of distinct values of the column in the table. In such cases one can check the distinct values of the column by querying the table directly.

Looking at the value sets for the collections/columns that have those, we can observe that subjects that had any chemotherapy could be selected as follows for the respective collections (non-exhaustive list):
* `head_neck_radiomics_hn1`: subjects that have value other than `none` in table `head_neck_radiomics_hn1_clinical` column `chemotherapy_given`
* `hnscc`: subjects that have value other than `No` in table `hnscc_clinical` column `chemotherapy_regimen`
* `hnscc_3dct_rt`: subjects that have value other than `None` in table `hnscc_3dct_rt_demographics` column `chemotherapy_medication`
* `nsclc_radiogenomics`: subjects that have value `Yes` in table `nsclc_radiogenomics_clinical` column `chemotherapy`
* `hcc_tace_seg`: table `hcc_tace_seg_clinical` column `chemotherapy` (its values are examined next)
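
The criteria listed above can be written down as data and turned into one query per collection. This is only a sketch: the table names are the ones given in the list, the column names follow the `column` values shown in the metadata table, and `chemo_criteria` and `chemo_query` are hypothetical helper names. Note that in SQL a `<>` comparison also drops rows where the column is `NULL`.

```python
DATASET = "bigquery-public-data.idc_v11_clinical"

# collection_id -> (table, column, operator, value), transcribed from the list above.
chemo_criteria = {
    "head_neck_radiomics_hn1": ("head_neck_radiomics_hn1_clinical", "chemotherapy_given", "<>", "none"),
    "hnscc": ("hnscc_clinical", "chemotherapy_regimen", "<>", "No"),
    "hnscc_3dct_rt": ("hnscc_3dct_rt_demographics", "chemotherapy_medication", "<>", "None"),
    "nsclc_radiogenomics": ("nsclc_radiogenomics_clinical", "chemotherapy", "=", "Yes"),
}


def chemo_query(collection_id):
    """Build the SQL selecting patients that meet the chemotherapy criterion."""
    table, column, op, value = chemo_criteria[collection_id]
    return (
        f"SELECT DISTINCT dicom_patient_id, {column}\n"
        f"FROM `{DATASET}.{table}`\n"
        f"WHERE {column} {op} '{value}'"
    )


print(chemo_query("nsclc_radiogenomics"))
```

Each generated string can be passed to `bq_client.query(...)` in the same way as the other queries in this notebook.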

Before we select subjects that meet the criteria defined above, let's confirm that the values encountered in table `hcc_tace_seg_clinical` column `chemotherapy` match the value set in the `column_metadata` table.

The query below will select the distinct values encountered in the `chemotherapy` column of the `hcc_tace_seg_clinical` table.

In [None]:
selection_query = """
SELECT DISTINCT(chemotherapy)
FROM bigquery-public-data.idc_v11_clinical.hcc_tace_seg_clinical
"""

selection_result = bq_client.query(selection_query)
hcc_tace_seg_clinical_df = selection_result.result().to_dataframe()
hcc_tace_seg_clinical_df

Unnamed: 0,chemotherapy
0,"Cisplatin, doxorubicin, Mitomycin-C"
1,
2,doxorubicin LC beads
3,Cisplastin
4,"Cisplatin, Mitomycin-C"


The following cell will get the dictionary for that column from the `column_metadata` table.

In [None]:
chemotherapy_medication_values = \
  column_metadata_df[(column_metadata_df["table_name"] == "bigquery-public-data.idc_v11_clinical.hcc_tace_seg_clinical") \
                     & (column_metadata_df["column"] == "chemotherapy")]["values"]

print(chemotherapy_medication_values.tolist())


[[{'option_code': 'Cisplastin', 'option_description': None}, {'option_code': 'Cisplatin, Mitomycin-C', 'option_description': None}, {'option_code': 'Cisplatin, doxorubicin, Mitomycin-C', 'option_description': None}, {'option_code': 'doxorubicin LC beads', 'option_description': None}, {'option_code': 'null', 'option_description': None}]]
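
Instead of comparing the two outputs by eye, the check can be expressed as a set comparison. The sketch below copies both lists from the outputs above and assumes that the empty cell in the query result corresponds to the `null` option code in the value set.

```python
# Distinct values returned by the query above (None stands for the empty cell).
observed = [
    "Cisplatin, doxorubicin, Mitomycin-C",
    None,
    "doxorubicin LC beads",
    "Cisplastin",
    "Cisplatin, Mitomycin-C",
]

# option_code entries from the column_metadata value set above.
value_set = [
    "Cisplastin",
    "Cisplatin, Mitomycin-C",
    "Cisplatin, doxorubicin, Mitomycin-C",
    "doxorubicin LC beads",
    "null",
]


def normalize(value):
    """Assumption: a missing value in the table maps to the 'null' option code."""
    return "null" if value is None else value


observed_set = {normalize(v) for v in observed}
expected_set = set(value_set)

# Values seen in the table but absent from the dictionary, and the reverse.
missing_from_dictionary = observed_set - expected_set
unused_in_table = expected_set - observed_set
print(missing_from_dictionary, unused_in_table)
```

Both differences being empty means the table and the dictionary agree under the stated assumption.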


Next we can examine the data to see what therapy individual patients had.

In [None]:
selection_query = """
SELECT DISTINCT(dicom_patient_id), chemotherapy
FROM bigquery-public-data.idc_v11_clinical.hcc_tace_seg_clinical
WHERE chemotherapy <> "null"
"""

selection_result = bq_client.query(selection_query)
hcc_tace_seg_clinical_df = selection_result.result().to_dataframe()
hcc_tace_seg_clinical_df

Unnamed: 0,dicom_patient_id,chemotherapy
0,HCC_001,"Cisplatin, doxorubicin, Mitomycin-C"
1,HCC_006,"Cisplatin, doxorubicin, Mitomycin-C"
2,HCC_010,"Cisplatin, doxorubicin, Mitomycin-C"
3,HCC_017,"Cisplatin, doxorubicin, Mitomycin-C"
4,HCC_026,"Cisplatin, doxorubicin, Mitomycin-C"
...,...,...
83,HCC_050,"Cisplatin, doxorubicin, Mitomycin-C"
84,HCC_065,"Cisplatin, doxorubicin, Mitomycin-C"
85,HCC_073,doxorubicin LC beads
86,HCC_085,doxorubicin LC beads


`dicom_patient_id` is the key to connect clinical data and imaging data. Let's see what imaging studies we have for patient `HCC_103` from this collection. Along the way we can also generate viewer URLs to conveniently examine the images.

In [None]:
%%bigquery --project=idc-sandbox-000

SELECT 
  StudyInstanceUID, 
  STRING_AGG(DISTINCT(Modality)) AS modalities, 
  STRING_AGG(DISTINCT(collection_id)) AS collection_id, 
  STRING_AGG(DISTINCT(Access)) AS access,
  COUNT(DISTINCT(SeriesInstanceUID)) AS num_series,
  # TODO: use production URL once the release is out!
  #ANY_VALUE(CONCAT("https://viewer.imaging.datacommons.cancer.gov/viewer/", StudyInstanceUID)) as viewer_url
  ANY_VALUE(CONCAT("https://testing-viewer.canceridc.dev/viewer/", StudyInstanceUID)) as viewer_url
FROM bigquery-public-data.idc_v11.dicom_all
WHERE PatientID = "HCC_103"
GROUP BY StudyInstanceUID

Unnamed: 0,StudyInstanceUID,modalities,collection_id,access,num_series,viewer_url
0,1.3.6.1.4.1.14519.5.2.1.1706.8374.304819071818...,"CT,SEG",hcc_tace_seg,Public,3,https://testing-viewer.canceridc.dev/viewer/1....
1,1.3.6.1.4.1.14519.5.2.1.1706.8374.121752675166...,CT,hcc_tace_seg,Public,2,https://testing-viewer.canceridc.dev/viewer/1....
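
The clinical filter and the imaging lookup can also be combined into a single query by joining on `dicom_patient_id` = `PatientID`. The following is a sketch rather than a tested query: `chemo_studies_query` is a hypothetical helper, and the table and column names are the ones used elsewhere in this notebook.

```python
def chemo_studies_query(agent):
    """Build a query listing imaging studies of patients who received `agent`.

    NOTE: plain string interpolation is used for brevity; prefer BigQuery
    query parameters if the value comes from untrusted input.
    """
    return f"""
SELECT
  clinical.dicom_patient_id,
  clinical.chemotherapy,
  dicom.StudyInstanceUID,
  COUNT(DISTINCT dicom.SeriesInstanceUID) AS num_series
FROM `bigquery-public-data.idc_v11_clinical.hcc_tace_seg_clinical` AS clinical
JOIN `bigquery-public-data.idc_v11.dicom_all` AS dicom
  ON clinical.dicom_patient_id = dicom.PatientID
WHERE clinical.chemotherapy = '{agent}'
  AND dicom.collection_id = 'hcc_tace_seg'
GROUP BY clinical.dicom_patient_id, clinical.chemotherapy, dicom.StudyInstanceUID
"""


join_query = chemo_studies_query("doxorubicin LC beads")
print(join_query)
# To run it: bq_client.query(join_query).result().to_dataframe()
```

Restricting `dicom_all` to the matching `collection_id` is a precaution in case the same patient identifier appears in more than one collection.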


## Exploring clinical data: collection-focused exploration

Let's see which of the collections in IDC have clinical data in BigQuery:


In [6]:
column_metadata_df["collection_id"].unique().tolist()

['acrin_6698',
 'acrin_flt_breast',
 'acrin_fmiso_brain',
 'acrin_hnscc_fdg_pet_ct',
 'acrin_nsclc_fdg_pet',
 'b_mode_and_ceus_liver',
 'breast_diagnosis',
 'breast_mri_nact_pilot',
 'c4kc_kits',
 'covid_19_ar',
 'head_neck_pet_ct',
 'head_neck_radiomics_hn1',
 'hnscc',
 'hnscc_3dct_rt',
 'ispy2',
 'lidc_idri',
 'lung_pet_ct_dx',
 'midrc_ricord_1a',
 'midrc_ricord_1b',
 'nsclc_radiomics_genomics',
 'nsclc_radiomics_interobserver1',
 'opc_radiomics',
 'qin_headneck',
 'rembrandt',
 'soft_tissue_sarcoma',
 'cptac_brca',
 'cptac_ccrcc',
 'cptac_coad',
 'cptac_gbm',
 'cptac_hnscc',
 'cptac_lscc',
 'cptac_luad',
 'cptac_pda',
 'cptac_ucec',
 'tcga_acc',
 'tcga_blca',
 'tcga_brca',
 'tcga_cesc',
 'tcga_chol',
 'tcga_coad',
 'tcga_dlbc',
 'tcga_esca',
 'tcga_hnsc',
 'tcga_kich',
 'tcga_kirc',
 'tcga_kirp',
 'tcga_lgg',
 'tcga_lihc',
 'tcga_luad',
 'tcga_lusc',
 'tcga_meso',
 'tcga_ov',
 'tcga_paad',
 'tcga_pcpg',
 'tcga_prad',
 'tcga_read',
 'tcga_sarc',
 'tcga_skcm',
 'tcga_stad',
 'tcga_tgc

If you are interested in what clinical data is available for a specific collection, you can select only the rows corresponding to that collection in the `column_metadata` table. Here we select a subset of columns to improve readability of the dataframe.

Note that for some collections, clinical data sheets are accompanied by dictionaries, which formalize the values encountered. Examples of such collections are [ISPY1](https://wiki.cancerimagingarchive.net/display/Public/ISPY1) or ACRIN trials.

For many other collections there are no such dictionaries available. In those situations, the values you will see in the `values` column have been derived by examining the distinct values encountered in the clinical data sheets.

In [8]:
ispy1_clinical_df = column_metadata_df[column_metadata_df["collection_id"] == "acrin_6698"] #[["collection_id", "table_name", "column", "column_label", "data_type", "values"]]

ispy1_clinical_df

Unnamed: 0,collection_id,case_col,table_name,column,column_label,data_type,original_column_headers,values,files,sheet_names,batch,column_numbers
0,acrin_6698,False,bigquery-public-data.idc_v11_clinical.acrin_66...,t0,T0 (baseline) MRI study included in collection,int64,[['T0']],"[{'option_code': '0', 'option_description': ' ...",[Full Collection Ancillary Patient Information...,[ACRIN_6698_Patient_Cohorts_Clin],[0],[2]
1,acrin_6698,False,bigquery-public-data.idc_v11_clinical.acrin_66...,t1,T1 (early-Tx) MRI study included in collection,int64,[['T1']],"[{'option_code': '0', 'option_description': ' ...",[Full Collection Ancillary Patient Information...,[ACRIN_6698_Patient_Cohorts_Clin],[0],[3]
2,acrin_6698,False,bigquery-public-data.idc_v11_clinical.acrin_66...,t2,T2 (inter-regimen) MRI study included in colle...,int64,[['T2']],"[{'option_code': '0', 'option_description': ' ...",[Full Collection Ancillary Patient Information...,[ACRIN_6698_Patient_Cohorts_Clin],[0],[4]
3,acrin_6698,False,bigquery-public-data.idc_v11_clinical.acrin_66...,t3,T3 (pre-surgery) MRI study included in collection,int64,[['T3']],"[{'option_code': '0', 'option_description': ' ...",[Full Collection Ancillary Patient Information...,[ACRIN_6698_Patient_Cohorts_Clin],[0],[5]
4,acrin_6698,False,bigquery-public-data.idc_v11_clinical.acrin_66...,bmmr2_train,Patient included in the BMMR2 challenge traini...,int64,[['BMMR2_TRAIN']],"[{'option_code': '0', 'option_description': ' ...",[Full Collection Ancillary Patient Information...,[ACRIN_6698_Patient_Cohorts_Clin],[0],[6]
5,acrin_6698,False,bigquery-public-data.idc_v11_clinical.acrin_66...,bmmr2_test,Patient included in the BMMR2 challenge test c...,int64,[['BMMR2_TEST']],"[{'option_code': '0', 'option_description': ' ...",[Full Collection Ancillary Patient Information...,[ACRIN_6698_Patient_Cohorts_Clin],[0],[7]
6,acrin_6698,False,bigquery-public-data.idc_v11_clinical.acrin_66...,primary_aim_t0,T0 MRI study included in the ACRIN-6698 primar...,int64,[['PRIMARY_AIM_T0']],"[{'option_code': '0', 'option_description': ' ...",[Full Collection Ancillary Patient Information...,[ACRIN_6698_Patient_Cohorts_Clin],[0],[8]
7,acrin_6698,False,bigquery-public-data.idc_v11_clinical.acrin_66...,primary_aim_t1,T1 MRI study included in the ACRIN-6698 primar...,int64,[['PRIMARY_AIM_T1']],"[{'option_code': '0', 'option_description': ' ...",[Full Collection Ancillary Patient Information...,[ACRIN_6698_Patient_Cohorts_Clin],[0],[9]
8,acrin_6698,False,bigquery-public-data.idc_v11_clinical.acrin_66...,primary_aim_t2,T2 MRI study included in the ACRIN-6698 primar...,int64,[['PRIMARY_AIM_T2']],"[{'option_code': '0', 'option_description': ' ...",[Full Collection Ancillary Patient Information...,[ACRIN_6698_Patient_Cohorts_Clin],[0],[10]
9,acrin_6698,False,bigquery-public-data.idc_v11_clinical.acrin_66...,primary_aim_t3,T3 MRI study included in the ACRIN-6698 primar...,int64,[['PRIMARY_AIM_T3']],"[{'option_code': '0', 'option_description': ' ...",[Full Collection Ancillary Patient Information...,[ACRIN_6698_Patient_Cohorts_Clin],[0],[11]


In [None]:
ispy1_clinical_df[["collection_id", "table_name", "column", "column_label", "data_type", "values"]]

Unnamed: 0,collection_id,table_name,column,column_label,data_type,values
1661,ispy1,bigquery-public-data.idc_v11_clinical.ispy1_cl...,subjectid,SUBJECTID,String,[]
1662,ispy1,bigquery-public-data.idc_v11_clinical.ispy1_cl...,age,age,float64,[]
1663,ispy1,bigquery-public-data.idc_v11_clinical.ispy1_cl...,mri_ld_baseline,MRI LD Baseline,String,[]
1664,ispy1,bigquery-public-data.idc_v11_clinical.ispy1_cl...,mri_ld_1_3dac,MRI LD 1-3dAC,String,[]
1665,ispy1,bigquery-public-data.idc_v11_clinical.ispy1_cl...,mri_ld_interreg,MRI LD InterReg,String,[]
1666,ispy1,bigquery-public-data.idc_v11_clinical.ispy1_cl...,mri_ld_presurg,MRI LD PreSurg,String,[]
1667,ispy1,bigquery-public-data.idc_v11_clinical.ispy1_ou...,subjectid,SUBJECTID,String,[]
1668,ispy1,bigquery-public-data.idc_v11_clinical.ispy1_ou...,survdtd2_tx,survDtD2 (tx),int64,[]
1669,ispy1,bigquery-public-data.idc_v11_clinical.ispy1_ou...,rfs,RFS,int64,[]
2316,ispy1,bigquery-public-data.idc_v11_clinical.ispy1_cl...,dataextractdt,DataExtractDt,String,[{'option_code': '2009-09-03T00:00:00.00000000...


`table_name` gives us the location of the BigQuery table that contains the column described in the `column_metadata` row.

Given the information available in the per-collection clinical data, we can proceed with selecting a subset of patients that meet the criteria of interest. As an example, the following query selects all of the patients from the ISPY1 collection together with their HR/HER2 status, which can then be used to pick out, for instance, the patients with positive HER2 status.

In [None]:
%%bigquery --project=idc-sandbox-000 ispy1_her2_status_patients

SELECT DISTINCT(dicom_patient_id), hr_her2_status
FROM bigquery-public-data.idc_v11_clinical.ispy1_clinical


In [None]:
ispy1_her2_status_patients

Unnamed: 0,dicom_patient_id,hr_her2_status
0,ISPY1_1001,HRposHER2neg
1,ISPY1_1003,HRposHER2neg
2,ISPY1_1005,HRposHER2neg
3,ISPY1_1007,HRposHER2neg
4,ISPY1_1012,HRposHER2neg
...,...,...
216,ISPY1_1208,TripleNeg
217,ISPY1_1216,TripleNeg
218,ISPY1_1220,TripleNeg
219,ISPY1_1227,TripleNeg
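
The query above returns every patient together with the status value, so a further filtering step is needed to keep only HER2-positive patients. The sketch below illustrates this on a few rows in plain Python. The label `HER2pos` is an assumption, since only `HRposHER2neg` and `TripleNeg` are visible in the output above; check the distinct values of `hr_her2_status` before relying on it. With pandas, the equivalent filter would be `ispy1_her2_status_patients["hr_her2_status"].str.contains("HER2pos", na=False)`.

```python
# (dicom_patient_id, hr_her2_status) rows. The first two come from the output
# above; the last one is a HYPOTHETICAL row added to exercise the filter.
rows = [
    ("ISPY1_1001", "HRposHER2neg"),
    ("ISPY1_1208", "TripleNeg"),
    ("ISPY1_9999", "HER2pos"),  # hypothetical patient ID and status label
]


def her2_positive(records, positive_marker="HER2pos"):
    """Return the patient IDs whose status string contains the positive marker."""
    return [
        patient_id
        for patient_id, status in records
        if status is not None and positive_marker in status
    ]


print(her2_positive(rows))
```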


Extensive metadata is available for the CPTAC collections.

In [None]:
cptac_coad_clinical_df = column_metadata_df[column_metadata_df["collection_id"] == "cptac_coad"]

cptac_coad_clinical_df

Unnamed: 0,collection_id,case_col,table_name,column,column_label,data_type,original_column_headers,values,files,sheet_names,batch,column_numbers
1255,cptac_coad,False,bigquery-public-data.idc_v11_clinical.cptac_co...,diag__ajcc_pathologic_stage,diag__ajcc_pathologic_stage,STRING,[],"[{'option_code': 'Stage I', 'option_descriptio...",[],[],[0],[]
1795,cptac_coad,True,bigquery-public-data.idc_v11_clinical.cptac_co...,submitter_id,submitter_id,STRING,[],[],[],[],[0],[]
1796,cptac_coad,False,bigquery-public-data.idc_v11_clinical.cptac_co...,case_id,case_id,STRING,[],[],[],[],[0],[]
1797,cptac_coad,False,bigquery-public-data.idc_v11_clinical.cptac_co...,demo__demographic_id,demo__demographic_id,STRING,[],[],[],[],[0],[]
1798,cptac_coad,False,bigquery-public-data.idc_v11_clinical.cptac_co...,diag__diagnosis_id,diag__diagnosis_id,STRING,[],[],[],[],[0],[]
...,...,...,...,...,...,...,...,...,...,...,...,...
4122,cptac_coad,False,bigquery-public-data.idc_v11_clinical.cptac_co...,diag__prior_malignancy,diag__prior_malignancy,STRING,[],"[{'option_code': 'no', 'option_description': N...",[],[],[0],[]
4516,cptac_coad,False,bigquery-public-data.idc_v11_clinical.cptac_co...,demo__ethnicity,demo__ethnicity,STRING,[],"[{'option_code': 'hispanic or latino', 'option...",[],[],[0],[]
5007,cptac_coad,False,bigquery-public-data.idc_v11_clinical.cptac_co...,demo__race,demo__race,STRING,[],[{'option_code': 'american indian or alaska na...,[],[],[0],[]
9650,cptac_coad,False,bigquery-public-data.idc_v11_clinical.cptac_co...,source_batch,idc_provenance_source_batch,INTEGER,[],"[{'option_code': '0', 'option_description': No...",[],[],[0],[]


## TODO: Exploring clinical data: Patient-focused exploration

Sometimes you may want to know whether a specific patient has any clinical data available. One way to do this is to locate the collection that the patient belongs to, and then check whether any of the clinical data tables available for that collection (if there are any) contain that patient identifier.

Alternatively, we can build a complete list of patients that have clinical data by performing a union on all of the `dicom_patient_id` columns across all of the clinical data tables, which is what we do in the next cell.

In [None]:
all_clinical_tables = column_metadata_df["table_name"].unique()
query = "with patients_unionized as (SELECT dicom_patient_id, collection_id FROM "+all_clinical_tables[0].replace("v11","current")
for clinical_table in all_clinical_tables[1:]:
  query = query+" UNION ALL SELECT dicom_patient_id, collection_id FROM "+clinical_table.replace("v11","current")

selection_query = query+") select distinct(dicom_patient_id) from patients_unionized"

print(selection_query)

selection_result = bq_client.query(selection_query)
patients_df = selection_result.result().to_dataframe()
patients = patients_df["dicom_patient_id"].unique()

with patients_unionized as (SELECT dicom_patient_id, collection_id FROM bigquery-public-data.idc_current_clinical.acrin_6698_clinical UNION ALL SELECT dicom_patient_id, collection_id FROM bigquery-public-data.idc_current_clinical.acrin_flt_breast_A0 UNION ALL SELECT dicom_patient_id, collection_id FROM bigquery-public-data.idc_current_clinical.acrin_flt_breast_A1 UNION ALL SELECT dicom_patient_id, collection_id FROM bigquery-public-data.idc_current_clinical.acrin_flt_breast_A3 UNION ALL SELECT dicom_patient_id, collection_id FROM bigquery-public-data.idc_current_clinical.acrin_flt_breast_AA UNION ALL SELECT dicom_patient_id, collection_id FROM bigquery-public-data.idc_current_clinical.acrin_flt_breast_AB UNION ALL SELECT dicom_patient_id, collection_id FROM bigquery-public-data.idc_current_clinical.acrin_flt_breast_AC UNION ALL SELECT dicom_patient_id, collection_id FROM bigquery-public-data.idc_current_clinical.acrin_flt_breast_AF UNION ALL SELECT dicom_patient_id, collection_id FROM 

BadRequest: ignored
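
The output above does not show the cause of the `BadRequest` error. One possibility, given that `dicom_patient_id` is the only column stated to be consistent across the clinical tables, is that some table lacks `collection_id`; another is that a table name does not exist in `idc_current_clinical`. The sketch below is a more defensive way to assemble the union: it selects only `dicom_patient_id` and quotes each table name with backticks. `build_patient_union_query` is a hypothetical helper, and this may or may not resolve the error.

```python
def build_patient_union_query(table_names):
    """Union dicom_patient_id over the given fully qualified table names."""
    if not table_names:
        raise ValueError("at least one table name is required")
    # One SELECT per table; backticks keep hyphenated project names unambiguous.
    selects = [f"SELECT dicom_patient_id FROM `{name}`" for name in table_names]
    union = "\nUNION ALL\n".join(selects)
    return (
        f"WITH patients_unionized AS (\n{union}\n)\n"
        "SELECT DISTINCT dicom_patient_id FROM patients_unionized"
    )


# Two table names taken from the printed query above.
example_tables = [
    "bigquery-public-data.idc_current_clinical.acrin_6698_clinical",
    "bigquery-public-data.idc_current_clinical.acrin_flt_breast_A0",
]
union_query = build_patient_union_query(example_tables)
print(union_query)
```

Running the generated query on a small subset of tables first makes it easier to find which table, if any, triggers the error.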