<a href="https://colab.research.google.com/github/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/getting_started/exploring_clinical_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Working with IDC clinical data without BigQuery

In this notebook we cover the basics of how you can access and search IDC clinical data without depending on Google BigQuery.

In addition to maintaining clinical data in Google BigQuery tables, we also export those in Parquet format into a public cloud-based storage bucket. Those files are free to download, and are rather small (as of IDC v18, less than 65MB altogether).

Once downloaded, you can search the content using pandas syntax or SQL.

This brief notebook walks you through these steps.

If you have never worked with IDC before, we recommend you first complete the getting started tutorial [here](https://github.com/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/getting_started/part2_searching_basics.ipynb).

---
Initial version: Jul 2024

Updated: Feb 2025

## Prerequisites

The only prerequisite is [`idc-index`](https://github.com/ImagingDataCommons/idc-index), a Python package that contains various utilities to simplify access to IDC data. Installing this package also installs several other packages that we will use later:
* `s5cmd` for very efficient download of data from cloud buckets using the S3 API
* `pandas` for dataframe operations
* `duckdb` for querying pandas dataframes using SQL syntax

In [None]:
%%capture
!pip install --upgrade idc-index

## Fetch clinical data index

`idc-index` packages various tables with the key metadata. We refer to those as _indices_. The main index that supports API calls related to download and search is installed by default. To support search of the clinical data accompanying IDC images you will need the `clinical_index` table, which contains the list of all columns and all tables across all of the IDC collections that are available.

In [None]:
from idc_index import IDCClient

c = IDCClient()

c.fetch_index('clinical_index')

print('Columns available in clinical_index:\n'+'\n'.join(c.clinical_index.keys()))

Columns available in clinical_index:
collection_id
table_name
short_table_name
column
column_label
values


## Find all clinical metadata available for a specific collection

A common use case is to find all clinical data available for a specific IDC collection.

The key columns of the `clinical_index` dataframe are:
* `collection_id`: the collection that a given metadata attribute belongs to
* `table_name`: the name of the table where this metadata attribute is located
* `column`: the name of the column (attribute)

Depending on the specific attribute and how it was provided/documented by the submitter, you may find more information about it in the `column_label` column.

Let's assume we are interested in the clinical data accompanying the `rms_mutation_prediction` collection. We can select all clinical data attributes that are available for this collection as shown next.

In [None]:
# define the query that selects all rows where collection_id is 'rms_mutation_prediction'
# note that we can refer to clinical_index table in the query
query = """
SELECT *
FROM clinical_index
WHERE collection_id = 'rms_mutation_prediction'
"""

# execute the query
matching_items = c.sql_query(query)


In [None]:
matching_items

Unnamed: 0,collection_id,table_name,short_table_name,column,column_label,values
0,rms_mutation_prediction,bigquery-public-data.idc_v20_clinical.rms_muta...,rms_mutation_prediction_demographics,source_batch,idc_provenance_source_batch,"[{'option_code': '0', 'option_description': No..."
1,rms_mutation_prediction,bigquery-public-data.idc_v20_clinical.rms_muta...,rms_mutation_prediction_demographics,race,race,"[{'option_code': '', 'option_description': Non..."
2,rms_mutation_prediction,bigquery-public-data.idc_v20_clinical.rms_muta...,rms_mutation_prediction_demographics,gender,gender,"[{'option_code': 'Female', 'option_description..."
3,rms_mutation_prediction,bigquery-public-data.idc_v20_clinical.rms_muta...,rms_mutation_prediction_demographics,dicom_patient_id,idc_provenance_dicom_patient_id,[]
4,rms_mutation_prediction,bigquery-public-data.idc_v20_clinical.rms_muta...,rms_mutation_prediction_demographics,participant_id,participant_id,[]
5,rms_mutation_prediction,bigquery-public-data.idc_v20_clinical.rms_muta...,rms_mutation_prediction_diagnosis,source_batch,idc_provenance_source_batch,"[{'option_code': '0', 'option_description': No..."
6,rms_mutation_prediction,bigquery-public-data.idc_v20_clinical.rms_muta...,rms_mutation_prediction_diagnosis,diagnosis_id,diagnosis_id,"[{'option_code': '', 'option_description': None}]"
7,rms_mutation_prediction,bigquery-public-data.idc_v20_clinical.rms_muta...,rms_mutation_prediction_diagnosis,disease_type,disease_type,[{'option_code': 'Soft Tissue Tumors and Sarco...
8,rms_mutation_prediction,bigquery-public-data.idc_v20_clinical.rms_muta...,rms_mutation_prediction_diagnosis,primary_diagnosis,primary_diagnosis,"[{'option_code': 'Rhabdomyosarcoma', 'option_d..."
9,rms_mutation_prediction,bigquery-public-data.idc_v20_clinical.rms_muta...,rms_mutation_prediction_diagnosis,primary_diagnosis_reference_source,primary_diagnosis_reference_source,"[{'option_code': '', 'option_description': None}]"
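
The same kind of query can be narrowed to answer "which table holds a given attribute?". The sketch below is only an illustration: it uses Python's built-in `sqlite3` module as a stand-in SQL engine so that it runs without any downloads, and its `clinical_index` table is a toy copy of a few rows from this notebook (the real index has more rows and columns). Note that the `column` field is double-quoted, because `COLUMN` is a keyword in many SQL dialects.

```python
import sqlite3

# Toy stand-in for clinical_index: a few (collection, table, column) rows
# copied from this notebook. This is NOT the full index.
rows = [
    ("rms_mutation_prediction", "rms_mutation_prediction_demographics", "race"),
    ("rms_mutation_prediction", "rms_mutation_prediction_demographics", "gender"),
    ("rms_mutation_prediction", "rms_mutation_prediction_diagnosis", "primary_diagnosis"),
    ("rms_mutation_prediction", "rms_mutation_prediction_sample", "tumor_grade"),
]

# In-memory database used purely for illustration (sqlite3 swapped in for duckdb)
con = sqlite3.connect(":memory:")
con.execute(
    'CREATE TABLE clinical_index (collection_id TEXT, short_table_name TEXT, "column" TEXT)'
)
con.executemany("INSERT INTO clinical_index VALUES (?, ?, ?)", rows)

# Select only the table(s) that contain the attribute of interest
query = """
SELECT short_table_name
FROM clinical_index
WHERE collection_id = 'rms_mutation_prediction'
  AND "column" = 'tumor_grade'
"""
tables_with_attribute = [row[0] for row in con.execute(query)]
con.close()

print(tables_with_attribute)
```

With the toy rows above, the only match is `rms_mutation_prediction_sample`.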


## Access the table that contains a specific metadata attribute

Let's assume we are interested in the `tumor_grade` attribute (row 13 of the query result; the output shown above is truncated and does not include it). The `short_table_name` column tells us that this attribute is contained in the table `rms_mutation_prediction_sample`. `IDCClient` provides a convenient function that loads a table, given its short table name, into a pandas dataframe.

In [None]:
import pandas as pd

clinical_table_df = c.get_clinical_table('rms_mutation_prediction_sample')

clinical_table_df

Unnamed: 0,dicom_patient_id,source_batch,participantparticipant_id,sample_id,sample_type,sample_anatomic_site,participant_age_at_collection,histological_classification,tumor_grade,tumor_stage_clinical_t,tumor_stage_clinical_n,tumor_stage_clinical_m,tumor_morphology,tumor_incidence_type,sample_description,sample_tumor_status
0,RMS2325,0,RMS2325,PAWDLM,Tumor,Leg,44.56,,,,,,,,,Tumor
1,RMS2124,0,RMS2124,PATMDI,Tumor,,0.90,BOTRYOID,,,,,,,,Tumor
2,RMS2137,0,RMS2137,PATVPL,Tumor,,0.83,BOTRYOID,,,,,,,,Tumor
3,RMS2140,0,RMS2140,PATYYW,Tumor,,1.07,BOTRYOID,,,,,,,,Tumor
4,RMS2145,0,RMS2145,PAUKHP,Tumor,,2.72,BOTRYOID,,,,,,,,Tumor
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
398,RMS2374,0,RMS2374,PAUPVA,Tumor,"Paratesticular, left",2.46,SPINDLE CELL RHABDOMYOSARCOMA,,,,,,,,Tumor
399,RMS2352,0,RMS2352,PASGZC,Tumor,"Paratesticular, right",0.68,SPINDLE CELL RHABDOMYOSARCOMA,,,,,,,,Tumor
400,RMS2205,0,RMS2205,PAMSJL,Tumor,Pelvis,2.76,MIXED ALVEOLAR AND EMBRYONAL RHABDOMYOSARCOMA,,,,,,,,Tumor
401,RMS2267,0,RMS2267,PALWAA,Tumor,"Soft tissue, abdomen",17.96,MIXED ALVEOLAR AND EMBRYONAL RHABDOMYOSARCOMA,,,,,,,,Tumor


Now that this table is loaded, you can search it as you would any pandas dataframe (or you can use SQL with duckdb, as shown earlier!).
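
As a minimal sketch of such a search, the example below filters a small dataframe with a boolean mask and with `DataFrame.query`. The dataframe `sample_df` is a hypothetical stand-in built from a few rows and columns of the output above, so the example runs on its own; in the notebook you would apply the same expressions to `clinical_table_df`. The age threshold is arbitrary, and the unit of `participant_age_at_collection` is assumed here to be years.

```python
import pandas as pd

# Stand-in for the clinical table: a few rows and columns copied from the output above
sample_df = pd.DataFrame(
    {
        "dicom_patient_id": ["RMS2325", "RMS2124", "RMS2137", "RMS2374", "RMS2267"],
        "participant_age_at_collection": [44.56, 0.90, 0.83, 2.46, 17.96],
        "histological_classification": [
            None,
            "BOTRYOID",
            "BOTRYOID",
            "SPINDLE CELL RHABDOMYOSARCOMA",
            "MIXED ALVEOLAR AND EMBRYONAL RHABDOMYOSARCOMA",
        ],
    }
)

# Boolean mask: keep rows with a specific histological classification
is_botryoid = sample_df["histological_classification"] == "BOTRYOID"
botryoid_df = sample_df[is_botryoid]

# DataFrame.query: keep rows where age at collection is below 5 (assumed years)
under_five_df = sample_df.query("participant_age_at_collection < 5")

print(botryoid_df["dicom_patient_id"].tolist())
print(under_five_df["dicom_patient_id"].tolist())
```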

Note that the `dicom_patient_id` column, which you will find in **every** clinical data table, can be used to link clinical metadata attributes to the DICOM image metadata!

As an example, we can find all image series that are available for the first patient in the list as follows.

In [None]:
one_patient_images = c.index[c.index['PatientID']==clinical_table_df.iloc[0]['dicom_patient_id']]

one_patient_images

Unnamed: 0,collection_id,analysis_result_id,PatientID,SeriesInstanceUID,StudyInstanceUID,source_DOI,PatientAge,PatientSex,StudyDate,StudyDescription,BodyPartExamined,Modality,Manufacturer,ManufacturerModelName,SeriesDate,SeriesDescription,SeriesNumber,instanceCount,license_short_name,series_aws_url,series_size_MB,crdc_series_uuid
30353,rms_mutation_prediction,,RMS2325,1.3.6.1.4.1.5962.99.1.3049362635.826230255.168...,2.25.178459043216141635419628463075271643072,10.5281/zenodo.8225131,534M,F,2018-11-21,Histopathology,,SM,Leica Biosystems,Aperio converted by com.pixelmed.convert.TIFFT...,2018-11-21,FFPE HE,1,6,CC BY 4.0,s3://idc-open-data/1dcfa24f-bbce-47e3-99bf-ff0...,465.75,1dcfa24f-bbce-47e3-99bf-ff00a8d21221
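
The single-patient lookup above generalizes to a join between a clinical table and the image index on the patient identifier. The following is a sketch under stated assumptions: `clinical_df` and `index_df` are toy stand-ins holding only values that appear in this notebook (the toy index contains a single series), whereas in practice you would merge `clinical_table_df` with `c.index`.

```python
import pandas as pd

# Toy clinical table: two patients taken from the clinical table output above
clinical_df = pd.DataFrame(
    {
        "dicom_patient_id": ["RMS2325", "RMS2124"],
        "sample_anatomic_site": ["Leg", None],
    }
)

# Toy image index: the one series shown in the output above
index_df = pd.DataFrame(
    {
        "PatientID": ["RMS2325"],
        "Modality": ["SM"],
        "series_size_MB": [465.75],
    }
)

# Inner join: one output row per (clinical row, image series) pair for the same patient
merged_df = clinical_df.merge(
    index_df,
    left_on="dicom_patient_id",
    right_on="PatientID",
    how="inner",
)

print(merged_df[["dicom_patient_id", "sample_anatomic_site", "Modality", "series_size_MB"]])
```

Patient `RMS2124` drops out of the result only because the toy index has no series for it; this says nothing about the real index.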


We can also use `IDCClient` to generate the URL to view the image for the given patient!

In [None]:
viewer_url = c.get_viewer_URL(seriesInstanceUID=one_patient_images['SeriesInstanceUID'].values[0], viewer_selector="slim")

viewer_url

'https://viewer.imaging.datacommons.cancer.gov/slim/studies/2.25.178459043216141635419628463075271643072/series/1.3.6.1.4.1.5962.99.1.3049362635.826230255.1686676575435.4.0'
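
The URL above appears to follow the pattern `/<viewer>/studies/<StudyInstanceUID>/series/<SeriesInstanceUID>`. This pattern is inferred from this one example, not taken from the viewer documentation, so treat the sketch below as illustrative. It splits the URL back into its parts using only the Python standard library; `example_url` is a copy of the output above.

```python
from urllib.parse import urlparse

# Copy of the viewer URL printed above
example_url = (
    "https://viewer.imaging.datacommons.cancer.gov/slim/studies/"
    "2.25.178459043216141635419628463075271643072/series/"
    "1.3.6.1.4.1.5962.99.1.3049362635.826230255.1686676575435.4.0"
)

# Path segments: [viewer, 'studies', StudyInstanceUID, 'series', SeriesInstanceUID]
parts = urlparse(example_url).path.strip("/").split("/")
viewer_name, study_uid, series_uid = parts[0], parts[2], parts[4]

print(viewer_name)
print(study_uid)
print(series_uid)
```

The study UID recovered this way matches the `StudyInstanceUID` value in the `one_patient_images` row shown earlier.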

We can even view the image using the IDC viewer directly in the notebook cell.

In [None]:
from IPython.display import IFrame
IFrame(viewer_url, width=1500, height=900)

## Summary

We hope you enjoyed this tutorial! If something didn't work as expected, or if you have any feedback or suggestions about what should be added to this tutorial, please contact IDC support by sending an email to support@canceridc.dev or by posting your question on the [IDC User forum](https://discourse.canceridc.dev).

## Acknowledgments

Imaging Data Commons has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l.

If you use IDC in your research, please cite the following publication:

> Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., Kim, E. & Kikinis, R. _National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence_. RadioGraphics (2023). https://doi.org/10.1148/rg.230180