<a href="https://colab.research.google.com/github/fedorov/IDC-Examples/blob/reorg-intro/notebooks/getting_started/part2_searching_basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting started with IDC - Part 2: Basics of searching IDC data

---


## Summary

This notebook is part of [the series](link) introducing NCI Imaging Data Commons (IDC) to users who want to interact with IDC programmatically.

In Part 1 of this tutorial series, you learned how to complete the prerequisites required to search IDC data using GCP BigQuery.

In this notebook you will be introduced to how IDC organizes the metadata accompanying the images it hosts, and how that metadata can be used to define subsets of data.

---
Initial version: Nov 2022

Updated: 


## What is IDC?

NCI Imaging Data Commons is ...

## Prerequisites

It is assumed that you have completed the first tutorial in the series. You will need a Google identity, Google Cloud activated for that account, and the ID of a project created under that account.

Before doing **anything** you **must** authorize the Colab runtime to act on your behalf, and initialize the variable pointing to your Google Cloud project ID.

The following cell initializes the project ID that is needed for all operations with the cloud. You should have a project ID if you completed the [first tutorial]() in this series.

In [1]:
# initialize this variable with your Google Cloud Project ID!
my_ProjectID = "idc-tcia"

import os
os.environ["GCP_PROJECT_ID"] = my_ProjectID

from google.colab import auth
auth.authenticate_user()

## Why do I need to search?

Think of IDC as a library. Image files are books, and we have ~45 TB of those. When you go to a library, you want to check out just the books that you want to read. In order to find a book in a large library you need a catalog. 

Just as in the library, IDC maintains a catalog that indexes a variety of metadata fields describing the files we curate. That metadata catalog is accessible in a large database table that you should be using to search and subset the images. Each row in that table corresponds to a file, and includes the location of the file alongside the metadata attributes describing that file.


## What does it mean to search?

When you search, or _query_, the IDC catalog, you specify the criteria that the metadata describing the selected files should satisfy.

Queries can be as simple as 

* "_everything in collection X_", 

or as complex as 

* "_files corresponding to CT images of female patients that are accompanied by annotations of lung tumors that are larger than 10 ml in volume_".

Although it would be very nice to just state what you need in free form, in practice queries need to be written in a formal way.

IDC organizes all of the metadata into large tables, where each row corresponds to one image file (as of writing, IDC indexes ~42 million files) and each column represents a metadata attribute present in one or more files in IDC (currently, we have hundreds of such attributes).

IDC metadata tables are maintained in [GCP BigQuery](https://cloud.google.com/bigquery), with only a tiny subset of the attributes indexed in the catalog available via the [IDC Portal exploration page](https://imaging.datacommons.cancer.gov/explore/). IDC metadata can be queried using Structured Query Language (SQL), so you do not need to learn any IDC-specific API.

## TODO - probably not - Searching using IDC portal

IDC Portal is intended to serve as the interface to get basic understanding of the data available in IDC, and 

## First query and BigQuery workspace

To get started, let's build the queries that replicate the information about IDC data shown in the IDC Portal.

As the very first query, let's get the list of all the image collections available in IDC. Here is that query:

```sql
SELECT 
  DISTINCT(collection_id) 
FROM 
  bigquery-public-data.idc_current.dicom_all
```

To run this query interactively, copy the query above to the clipboard, paste it into the Editor tab in the [BigQuery SQL workspace](https://console.cloud.google.com/bigquery), and hit the "Run" button. Within a few moments you should see the list of collections in IDC in the "Query results" section of the interface.

![bq_run](https://www.dropbox.com/s/6ah98n6e9ik18if/bq_run.png?raw=1)

Let's look into how this query works:
* `SELECT` defines the list of columns that should be returned by the query,
* `DISTINCT` indicates that we want to see the distinct values encountered in the selected column,
* `FROM` defines which table should be queried.

In this query we work with the [`dicom_all` table](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=idc_current&t=dicom_all&page=table), which contains the DICOM metadata extracted from IDC images along with collection-level metadata that does not originate from DICOM.
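
To experiment with these three keywords without a cloud project, the same query shape can be tried offline. The following is a minimal sketch that uses Python's built-in `sqlite3` module and a small made-up table. The table contents and the `ORDER BY` clause (added only to make the output order predictable) are assumptions for illustration, and SQLite's SQL dialect is not identical to BigQuery's.

```python
import sqlite3

# Hypothetical toy stand-in for dicom_all: one row per file.
rows = [
    ("collection_a", "CT"),
    ("collection_a", "CT"),
    ("collection_b", "MR"),
    ("collection_a", "SEG"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dicom_all (collection_id TEXT, Modality TEXT)")
conn.executemany("INSERT INTO dicom_all VALUES (?, ?)", rows)

# Same shape as the IDC query: SELECT picks the column,
# DISTINCT collapses repeated values, FROM names the table.
cursor = conn.execute(
    "SELECT DISTINCT collection_id FROM dicom_all ORDER BY collection_id"
)
distinct_collections = [row[0] for row in cursor]
conn.close()

print(distinct_collections)
```

Although the toy table has four rows, only two collection identifiers come back, because `DISTINCT` keeps each value once.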

## Same query using Python SDK

The BigQuery SQL workspace is a very convenient tool for exploring the schemas of the tables, experimenting with writing and debugging queries, and profiling their execution. But you can also run those queries programmatically, which is very convenient if you want to load the result of the query into a pandas dataframe, or automate your searches.

The BigQuery API is implemented in a variety of languages, with the Python bindings available in the `google-cloud-bigquery` package. Conveniently, this package is pre-installed in Colab!

HINT: SQL keywords such as `SELECT` are not case-sensitive, and indentation is ignored - although both are quite helpful for making the query more readable!

In [5]:
from google.cloud import bigquery

# BigQuery client is initialized with the ID of the project
# we specified in the beginning of the notebook!
bq_client = bigquery.Client(my_ProjectID)

In [None]:
selection_query = """
SELECT 
  DISTINCT(collection_id) 
FROM 
  bigquery-public-data.idc_current.dicom_all
"""

selection_result = bq_client.query(selection_query)
selection_df = selection_result.result().to_dataframe()

selection_df
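
Since the cells in this notebook repeat the same two calls to run a query and convert the result, it may be convenient to wrap them in a small function. The sketch below is one way to do that. `query_to_dataframe` is a hypothetical helper name, not part of the `google-cloud-bigquery` package, and it assumes `client` behaves like the `bq_client` created above.

```python
def query_to_dataframe(client, sql):
    """Run a SQL query and return the result as a pandas DataFrame.

    Assumption: `client` behaves like google.cloud.bigquery.Client,
    i.e. client.query(sql) returns a job whose result() has to_dataframe().
    """
    query_job = client.query(sql)  # submit the query
    rows = query_job.result()      # wait for the query to finish
    return rows.to_dataframe()     # convert the rows to a DataFrame
```

With this helper, the cell above could be written as `selection_df = query_to_dataframe(bq_client, selection_query)`.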

## Exploring other IDC portal attributes via SQL

Next we will explore a few other attributes that are available in the IDC portal (with few exceptions, the mapping is pretty straightforward):

![portal_filters](https://www.dropbox.com/s/qt3dhzara1ap7s3/portal_filters.png?raw=1)

* "Primary site location" maps to the `tcia_tumorLocation` column
* "Body Part Examined" maps to the `BodyPartExamined` column
* "Modality" maps to the `Modality` column



### Exercise 1: Write the query to list all values for `tcia_tumorLocation`

In the cell below, we pre-filled the code by copying the previous cell. All you need to do is write the query!

In [None]:
# Execution of this cell will fail unless you wrote the query below!
selection_query = """
REPLACE THIS TEXT WITH YOUR QUERY!
"""

selection_result = bq_client.query(selection_query)
selection_df = selection_result.result().to_dataframe()

selection_df

## Combining multiple attributes in a single query

With the understanding of the attributes that are available in IDC, you can use SQL to build queries that summarize those attributes to better understand the data.

In the following, we use the attributes discussed earlier to list the distinct modalities and tumor locations encountered in each individual collection.

In [7]:
# Summarize the distinct modalities and tumor locations per collection
selection_query = """
SELECT
  collection_id,
  STRING_AGG(DISTINCT(Modality)) as modalities,
  STRING_AGG(DISTINCT(tcia_tumorLocation)) as tumor_locations
FROM
  bigquery-public-data.idc_current.dicom_all
GROUP BY
  collection_id
"""

selection_result = bq_client.query(selection_query)
selection_df = selection_result.result().to_dataframe()

selection_df

| | collection_id | modalities | tumor_locations |
|---|---|---|---|
| 0 | qiba_ct_1c | CT,PR,SR,SEG | Phantom |
| 1 | tcga_brca | MR,SR,SM,MG | Breast |
| 2 | mouse_mammary | MR | Abdomen |
| 3 | cmb_mel | PT,CT,US | Various |
| 4 | naf_prostate | CT,PT | Prostate |
| ... | ... | ... | ... |
| 123 | pdmr_997537_175_t | SR,MR | Colon |
| 124 | pdmr_292921_168_r | SR,MR | Abdomen |
| 125 | acrin_contralateral_breast_mr | MR,CR | Breast |
| 126 | cptac_ccrcc | CT,CR,SR,SM,DX,MR | Kidney |


In the query above, we use the familiar operators `SELECT` and `FROM`, but also a couple of new ones:

* `GROUP BY` at the end of the query indicates that we want to get a single row per distinct value of `collection_id`
* `STRING_AGG` and `DISTINCT` indicate how the values of the selected columns should be aggregated while combining them into a single row per `collection_id`: we take all the distinct values per individual `collection_id`, and then concatenate them into a single string
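
To make the semantics of these operators more concrete, here is a rough plain-Python equivalent of `GROUP BY` combined with `STRING_AGG(DISTINCT ...)`, applied to a few made-up rows. It only illustrates what the query computes, not how BigQuery executes it. The rows are hypothetical, and the sorting is added here only for reproducible output (BigQuery does not guarantee the order of aggregated values unless you request one).

```python
from collections import defaultdict

# Hypothetical toy rows: (collection_id, Modality), one per file.
rows = [
    ("collection_a", "CT"),
    ("collection_a", "CT"),
    ("collection_a", "SEG"),
    ("collection_b", "MR"),
]

# GROUP BY collection_id: one bucket per distinct collection.
# DISTINCT: a set keeps each modality only once per bucket.
groups = defaultdict(set)
for collection_id, modality in rows:
    groups[collection_id].add(modality)

# STRING_AGG: join each bucket into one comma-separated string.
# sorted() is used only to make the output reproducible.
modalities = {cid: ",".join(sorted(vals)) for cid, vals in groups.items()}

print(modalities)
```

Each collection ends up with exactly one entry, and the repeated `CT` value for `collection_a` appears only once in its aggregated string.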



### Exercise 2: Use multiple criteria to select image collections of interest

In the previous query we learned how to get a collection-level list of the modalities and tumor locations. In this exercise your task is to write a query that identifies all of the collections that include the MR modality and "Lung" as a tumor location.

In [None]:
# Execution of this cell will fail unless you wrote the query below!
selection_query = """
SELECT 
  DISTINCT(collection_id)
FROM
  bigquery-public-data.idc_current.dicom_all
WHERE
  # write the selection criteria under this line!
"""

selection_result = bq_client.query(selection_query)
selection_df = selection_result.result().to_dataframe()

selection_df

## DICOM data model: Patients, studies, series and instances

Up to now we searched the data at the granularity of the collections. In practice, we often want to know how many patients meet our search criteria, or which specific images we need to download.

IDC uses DICOM for data representation, and in the DICOM data model, patients (identified by `PatientID`) undergo imaging exams (or _studies_, in DICOM nomenclature).

Each patient will have one or more studies, with each study identified uniquely by the attribute `StudyInstanceUID`. During each of the imaging studies one or more imaging _series_ will be collected. As an example, a Computed Tomography (CT) imaging study may include a volume sweep before and after administration of a contrast agent. Imaging series are uniquely identified by `SeriesInstanceUID`.

Finally, each imaging series contains one or more _instances_, where each instance corresponds to a file. Most often, one instance corresponds to a single slice from a cross-sectional image. Individual instances are identified by unique `SOPInstanceUID` values.
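
The hierarchy described above can be checked by counting distinct identifiers at each level. Below is a minimal sketch on a made-up table using Python's built-in `sqlite3` module. The rows and identifier values are hypothetical, and real DICOM UIDs are long dotted numeric strings rather than these short labels.

```python
import sqlite3

# Hypothetical toy rows, one per instance (file):
# (PatientID, StudyInstanceUID, SeriesInstanceUID, SOPInstanceUID)
rows = [
    ("patient1", "study1", "series1", "instance1"),
    ("patient1", "study1", "series1", "instance2"),
    ("patient1", "study1", "series2", "instance3"),
    ("patient1", "study2", "series3", "instance4"),
    ("patient2", "study3", "series4", "instance5"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dicom_all (PatientID TEXT, StudyInstanceUID TEXT, "
    "SeriesInstanceUID TEXT, SOPInstanceUID TEXT)"
)
conn.executemany("INSERT INTO dicom_all VALUES (?, ?, ?, ?)", rows)

# Count the distinct identifiers at each level of the hierarchy.
counts = conn.execute(
    """
    SELECT
      COUNT(DISTINCT PatientID),
      COUNT(DISTINCT StudyInstanceUID),
      COUNT(DISTINCT SeriesInstanceUID),
      COUNT(DISTINCT SOPInstanceUID)
    FROM dicom_all
    """
).fetchone()
conn.close()

print(counts)  # (patients, studies, series, instances)
```

The counts grow at each level down the hierarchy: two patients, three studies, four series and five instances in this toy table. `COUNT(DISTINCT ...)` is also available in BigQuery, so a similar `SELECT` should work against the `dicom_all` table.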

In the following we will select all distinct segmentation series that are available within the `nsclc_radiomics` collection.