<a href="https://colab.research.google.com/github/ImagingDataCommons/idc-pathomics-use-case-1/blob/development/pathomics/lung_cancer_cptac_DataExploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IDC Tutorial: Data exploration of slide microscopy images

This notebook demonstrates how to explore slide microscopy data using the [Imaging Data Commons (IDC)](https://portal.imaging.datacommons.cancer.gov/). 

Alongside radiology, slide microscopy is the second major imaging modality in the IDC. Slide microscopy images show thin sections of tissue samples (e.g., from a resected tumor) at microscopic resolution. They provide a unique glimpse into cellular architecture and function that is essential for diagnosing complex diseases like cancer. Computerized analysis makes the assessment of slide microscopy images more reproducible and less time-consuming, and it enables the extraction of novel digital biomarkers from tissue images.

This tutorial gives a quick introduction to the way slide microscopy data is organized within the IDC and how best to examine available data and build a data set for further analysis. For a more comprehensive tutorial including training of a tissue classification model on IDC-hosted slide microscopy data, see [here](https://github.com/ImagingDataCommons/idc-pathomics-use-case-1/blob/development/pathomics/lung_cancer_cptac_TissueClassificationModel.ipynb).  
  
To learn more about the IDC platform, please visit the [IDC user guide](https://learn.canceridc.dev/).

If you have any questions, bug reports, or feature requests please feel free to contact us at the [IDC discussion forum](https://discourse.canceridc.dev/)!

## Prerequisites

**Authenticate:** To access IDC resources, you have to authenticate with your Google identity. Follow the link generated by the code below and enter the displayed verification code to complete the Google authentication process.

In [None]:
from google.colab import auth
auth.authenticate_user()

**Create a Google Cloud Platform project:** To run this notebook, you need a Google Cloud Platform (GCP) project. You can learn how to create your own project [here](https://www.youtube.com/watch?v=i08S0KJLnyw). Billing information is not required for running this tutorial. You are still encouraged to apply for free cloud credits from IDC by submitting the application form referenced [here](https://learn.canceridc.dev/introduction/requesting-gcp-cloud-credits) and to use them for other tutorials. Once you have a project, set `my_project_id` below to the ID of your GCP project.

In [None]:
my_project_id = 'idc-pathomics-000'

## Environment setup

Import the required Python modules.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import seaborn as sns
sns.set_theme()
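import warnings
# Hide Python warnings to improve readability. A filter set inside a
# `with warnings.catch_warnings():` block is discarded when the block exits,
# so the filter is set globally here.
warnings.simplefilter('ignore')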

Determine who and where we are.

In [None]:
curr_dir = !pwd
curr_droid = !hostname
curr_pilot = !whoami

print('Current directory :', curr_dir[-1])
print('Hostname          :', curr_droid[-1])
print('Username          :', curr_pilot[-1])

## Dataset selection and exploration

IDC relies on the Google Cloud Platform (GCP) for storage and management of DICOM data. The data are contained in so-called [storage buckets](https://cloud.google.com/storage/docs/key-terms#buckets), from which they can be retrieved on a requester-pays basis. Currently, all pathology whole-slide images (WSI) are located in the *idc-open* bucket.

Metadata for the DICOM files, including standard DICOM tags as well as non-DICOM metadata, are stored in the BigQuery table *canceridc-data.idc_current.dicom_all*. The IDC documentation gives further information on [data organization](https://learn.canceridc.dev/data/organization-of-data) and [code examples](https://learn.canceridc.dev/cookbook/bigquery) on how to query the table. The easiest way to access BigQuery tables from a Jupyter notebook is to use [BigQuery cell magic](https://cloud.google.com/bigquery/docs/visualize-jupyter#querying-and-visualizing-bigquery-data) with the `%%bigquery` command.
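
Outside a notebook, a similar query could be issued with the `google-cloud-bigquery` Python client instead of cell magic. The following is a minimal sketch, assuming that the package is installed and that Google credentials are already configured. The names `SLIDE_COUNT_SQL` and `run_slide_count_query` are hypothetical and not part of IDC or this tutorial.

```python
# Parameterized query: count distinct slides per dataset.
SLIDE_COUNT_SQL = """
SELECT
    ClinicalTrialProtocolID AS dataset,
    COUNT(DISTINCT ContainerIdentifier) AS n_slides
FROM `canceridc-data.idc_current.dicom_all`
WHERE ContainerIdentifier IS NOT NULL
  AND ClinicalTrialProtocolID IN UNNEST(@datasets)
GROUP BY dataset
"""


def run_slide_count_query(project_id, datasets):
    """Run SLIDE_COUNT_SQL and return the result as a pandas data frame.

    Hypothetical helper for illustration; it needs valid credentials
    and a GCP project ID to actually execute.
    """
    # Imported lazily so this module loads even without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            # Passing the dataset names as a parameter avoids string formatting.
            bigquery.ArrayQueryParameter("datasets", "STRING", list(datasets))
        ]
    )
    return client.query(SLIDE_COUNT_SQL, job_config=job_config).to_dataframe()
```

A call such as `run_slide_count_query(my_project_id, ['CPTAC-LUAD', 'CPTAC-LSCC'])` would then return one row per dataset.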

The following statement loads relevant metadata of all slide images from the CPTAC-LUAD and CPTAC-LSCC datasets into a pandas data frame called `slides_df`.

In [None]:
%%bigquery slides_df --project=$my_project_id 

SELECT
    ContainerIdentifier AS slide_id,
    PatientID AS patient_id,
    ClinicalTrialProtocolID AS dataset,
    TotalPixelMatrixColumns AS width,
    TotalPixelMatrixRows AS height,
    StudyInstanceUID AS idc_viewer_id,        
    gcs_url, -- URL of the DICOM file in Google Cloud Storage
    CAST(SharedFunctionalGroupsSequence[OFFSET(0)].
          PixelMeasuresSequence[OFFSET(0)].
          PixelSpacing[OFFSET(0)] AS FLOAT64) AS pixel_spacing,
    -- rename TransferSyntaxUIDs for readability
    CASE TransferSyntaxUID
        WHEN '1.2.840.10008.1.2.4.50' THEN 'jpeg'
        WHEN '1.2.840.10008.1.2.4.91' THEN 'jpeg2000'
        ELSE 'other'
    END AS compression
FROM canceridc-data.idc_current.dicom_all
WHERE
  ContainerIdentifier IS NOT NULL
  AND ClinicalTrialProtocolID IN ('CPTAC-LUAD', 'CPTAC-LSCC')
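
The `gcs_url` column ties each metadata row back to a file in a storage bucket. As a small sketch, assuming the values follow the usual `gs://<bucket>/<object>` form, a URL can be split into its bucket and object name with the standard library. The helper `split_gcs_url` and the example URL are made up for illustration.

```python
from urllib.parse import urlparse


def split_gcs_url(gcs_url):
    """Split 'gs://bucket/path/to/object' into (bucket, object_name)."""
    parsed = urlparse(gcs_url)
    if parsed.scheme != 'gs' or not parsed.netloc:
        raise ValueError(f'Not a gs:// URL: {gcs_url!r}')
    # The leading slash of the path is not part of the object name.
    return parsed.netloc, parsed.path.lstrip('/')


# Made-up URL for illustration only; real values come from the gcs_url column.
bucket, object_name = split_gcs_url('gs://example-bucket/some-folder/example-file.dcm')
print(bucket)
print(object_name)
```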

We then restrict the data frame to images that were digitized at 5x magnification (corresponding to a pixel spacing between 0.0019 and 0.0021 mm) and compressed in JPEG format.

In [None]:
slides_df.query('pixel_spacing > 0.0019 & pixel_spacing < 0.0021 & compression=="jpeg"', inplace=True)
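
To make the link between pixel spacing and magnification more tangible, the sketch below converts pixel spacing in mm to an approximate magnification. It assumes the common rule of thumb that 40x corresponds to about 0.25 µm per pixel, which is not stated in the IDC metadata and varies between scanners. The function name and the toy values are illustrative only.

```python
import pandas as pd


def approx_magnification(pixel_spacing_mm):
    """Estimate objective magnification from pixel spacing given in mm.

    Assumes 40x ~ 0.25 um/pixel, i.e. magnification ~ 10 / (spacing in um).
    Treat the result as a rough label, not a calibrated value.
    """
    return 10.0 / (pixel_spacing_mm * 1000.0)


# Toy pixel spacings in mm (not real IDC values).
toy_df = pd.DataFrame({'pixel_spacing': [0.00025, 0.0005, 0.002, 0.004]})
toy_df['approx_magnification'] = approx_magnification(toy_df['pixel_spacing']).round(1)
print(toy_df)
```

Under this assumption, a pixel spacing of 0.002 mm (2 µm) maps to about 5x, which matches the range used in the filter above.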

The tissue type of a slide (whether it shows tumor tissue or healthy/normal tissue) is not yet included in the *canceridc-data.idc_current.dicom_all* table. This information has to be supplemented from a separate CSV file provided by TCIA. The cancer subtype (LSCC or LUAD) can in principle be inferred from the dataset name, but for clarity we also take it from the CSV file.

In [None]:
!curl -OL https://raw.githubusercontent.com/ImagingDataCommons/idc-pathomics-use-case-1/master/idc_pathomics/tissue_type_data_TCIA.csv

In [None]:
type_df = pd.read_csv('./tissue_type_data_TCIA.csv')[['Slide_ID', 'Specimen_Type', 'Tumor']]
# Harmonize column names and labels
type_df.rename(columns={'Slide_ID': 'slide_id', 'Specimen_Type': 'tissue_type', 'Tumor': 'cancer_subtype'}, inplace=True)
type_df.replace({'tissue_type': {'normal_tissue': 'normal', 'tumor_tissue': 'tumor'}}, inplace=True)
type_df.replace({'cancer_subtype': {'LSCC': 'lscc', 'LUAD': 'luad'}}, inplace=True)
slides_df = pd.merge(slides_df, type_df, how='inner', on='slide_id', sort=True)
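
Because the merge is an inner join, slides that appear in only one of the two tables are dropped silently. One way to see which rows would be lost is an outer merge with `indicator=True`. The sketch below uses made-up toy tables in place of `slides_df` and `type_df`.

```python
import pandas as pd

# Toy stand-ins for slides_df and type_df (made-up slide IDs).
toy_slides = pd.DataFrame({'slide_id': ['S1', 'S2', 'S3']})
toy_types = pd.DataFrame({'slide_id': ['S2', 'S3', 'S4'],
                          'tissue_type': ['tumor', 'normal', 'tumor']})

# An outer merge with indicator=True adds a '_merge' column that records
# whether each row was found in the left table, the right table, or both.
check = pd.merge(toy_slides, toy_types, how='outer', on='slide_id', indicator=True)

# Rows that an inner merge would have dropped.
unmatched = check[check['_merge'] != 'both']
print(unmatched[['slide_id', '_merge']])
```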

With standard [pandas](https://pandas.pydata.org/) functionality, we can easily validate and summarize the compiled metadata.

In [None]:
# Assert uniqueness of slide_id values
assert slides_df.slide_id.is_unique

# Assert validity of class labels
assert set(slides_df.tissue_type.unique()) == set(['normal', 'tumor'])
assert set(slides_df.cancer_subtype.unique()) == set(['luad', 'lscc'])

display(slides_df.head())
print('Total number of slides: ', len(slides_df))
nr_slides = slides_df.groupby('cancer_subtype').size()
nr_patients = slides_df.drop_duplicates('patient_id').groupby('cancer_subtype').size()
print('--> %d slides from %d LUAD patients' % (nr_slides['luad'], nr_patients['luad']))
print('--> %d slides from %d LSCC patients' % (nr_slides['lscc'], nr_patients['lscc']))
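
A cross-tabulation gives a compact view of how slides are distributed over cancer subtype and tissue type. The sketch below uses a made-up toy data frame with the same column names. With the real data, `toy_df` would be replaced by `slides_df`.

```python
import pandas as pd

# Toy data with the same column names as slides_df (made-up labels).
toy_df = pd.DataFrame({
    'cancer_subtype': ['luad', 'luad', 'lscc', 'lscc', 'lscc'],
    'tissue_type': ['tumor', 'normal', 'tumor', 'tumor', 'normal'],
})

# Count slides per (cancer_subtype, tissue_type); margins=True adds 'All' totals.
counts = pd.crosstab(toy_df['cancer_subtype'], toy_df['tissue_type'], margins=True)
print(counts)
```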

Using standard [matplotlib](https://matplotlib.org/) and [seaborn](https://seaborn.pydata.org/) functionality, we can easily visualize some aspects of interest. The following code produces two histograms. The left graph shows how many patients contributed a given number of slides, while the right graph shows the number of slides per cancer subtype, split into healthy and tumor tissue.

In [None]:
fig1, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Number of slides per patient 
slides_per_patient = slides_df.groupby(['patient_id']).size()
plot1 = sns.histplot(data=slides_per_patient, discrete=True, ax=ax1, shrink=0.9, color='C7')
ax1.update({'xlabel': 'Number of slides', 'ylabel': 'Number of patients'})
ax1.xaxis.set_major_locator(MaxNLocator(integer=True)) # Force integer labels on x-axis

# Distribution of tissue types
plot2 = sns.histplot(data=slides_df, x='cancer_subtype', hue='tissue_type', multiple='stack', palette=['C1', 'C2'], ax=ax2, shrink=0.7)
ax2.update({'xlabel': 'Cancer subtype', 'ylabel': 'Number of slides'})
legend = plot2.get_legend()
legend.set_title('Tissue type')
legend.set_bbox_to_anchor((1, 1))

Any slide can also be viewed and explored in detail using the IDC viewer.

In [None]:
def get_idc_viewer_url(study_UID):
    return "https://viewer.imaging.datacommons.cancer.gov/slim/studies/" + study_UID

print(get_idc_viewer_url(slides_df['idc_viewer_id'].iloc[0]))
print(get_idc_viewer_url(slides_df['idc_viewer_id'].iloc[100]))

Finally, we can save the information as the CSV file *slides_metadata.csv* for later use, such as downloading the cohort and training and evaluating a tissue classification model as outlined in [this tutorial](https://github.com/ImagingDataCommons/idc-pathomics-use-case-1/blob/development/pathomics/lung_cancer_cptac_TissueClassificationModel.ipynb).

In [None]:
slides_df.to_csv('./slides_metadata.csv', index=False)