<a href="https://colab.research.google.com/github/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/collections_demos/nlst_exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# National Lung Screening Trial (NLST) @ NCI Imaging Data Commons

This notebook is part of the [NCI Imaging Data Commons](https://imaging.datacommons.cancer.gov) tutorials available in the [IDC-Tutorials GitHub repository](https://github.com/ImagingDataCommons/IDC-Tutorials).

One of the largest and most comprehensive collections in [NCI Imaging Data Commons](https://imaging.datacommons.cancer.gov) is the [National Lung Screening Trial (NLST)](https://cdas.cancer.gov/nlst/) collection. Note that the NLST CDAS website contains outdated information: all of the CT and pathology images for this trial are publicly available and can be downloaded from IDC!

IDC hosts the images collected by the trial, all of which are publicly available for download:
* Computed Tomography images, replicated from The Cancer Imaging Archive (TCIA) [NLST collection](https://doi.org/10.7937/tcia.hmq8-j677)
* [pathology images](https://doi.org/10.5281/zenodo.12689650) converted into DICOM Slide Microscopy representation

In addition, IDC NLST collection includes analysis results contributed by various initiatives:
* [TotalSegmentator-CT-Segmentations](https://doi.org/10.5281/zenodo.8347011): segmentations of the anatomical regions and organs for most of the NLST CT images, produced using [TotalSegmentator](https://github.com/wasserth/TotalSegmentator), plus first-order and shape radiomics features for each segment
* [BAMF-AIMI-Annotations](https://doi.org/10.5281/zenodo.8345959): AI-based volumetric segmentations for a small subset of the CT images
* [nnU-Net-BPR-annotations](https://doi.org/10.5281/zenodo.7473970): AI-based annotation of the body part and volumetric segmentations of chest organs, produced using the [nnUNet](https://github.com/MIC-DKFZ/nnUNet) package

A user of IDC asked the following question:


> _Our lab is interested in downloading all the NLST dataset. We are interested in mirroring the data onto our cluster. However, there is some confusion about how much space the data takes. On the cancerimagearchive page for the NLST, it states that the DICOM takes 11.14TB, while on the IDC platform, it shows that the NLST takes 23.95TB. Would you be able to clarify which is more accurate?_

_**TL;DR** IDC NLST collection contains a lot more than the CT images available in TCIA, which explains the much larger size. You can still download just the CT images, if that's what you need!_

In this notebook you will find a more detailed answer to this question. We also use this example to explain how data is organized in IDC.

If you have any questions about this notebook or the NLST collection in IDC, please post them on the [IDC user forum](https://discourse.canceridc.dev/).

---

Initial release: Sept 2024

Prepared by: Andrey Fedorov

## Prerequisites

This notebook has minimal prerequisites. You will need to have Python and pip on your system, and install the [`idc-index`](https://github.com/ImagingDataCommons/idc-index) package, which simplifies interaction with IDC.

In [1]:
%%capture
!pip install --upgrade idc-index

To interact with IDC we need to instantiate `IDCClient`, which provides the interface to the functionality of the package.

In [2]:
from idc_index import IDCClient

client = IDCClient()

## Navigating IDC metadata

`IDCClient` contains a table - a pandas dataframe - with the key metadata attributes describing the data available in IDC. This table is accessible through the `index` attribute of the client. It has one row per DICOM series, and its columns, listed below, are documented on [this page](https://idc-index.readthedocs.io/en/latest/#the-index-of-idc-index).

In [6]:
print(f'Attributes in IDC index: \n{client.index.dtypes}\n')

Attributes in IDC index: 
collection_id             object
analysis_result_id        object
PatientID                 object
SeriesInstanceUID         object
StudyInstanceUID          object
source_DOI                object
PatientAge                object
PatientSex                object
StudyDate                 object
StudyDescription          object
BodyPartExamined          object
Modality                  object
Manufacturer              object
ManufacturerModelName     object
SeriesDate                object
SeriesDescription         object
SeriesNumber              object
instanceCount              Int64
license_short_name        object
series_aws_url            object
series_size_MB           float64
dtype: object
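
As a small illustration of working with a table of this shape, the sketch below counts distinct series per `Modality` within one collection. To keep it self-contained it runs on a tiny made-up dataframe (`toy_index`) that borrows a few column names from the listing above; the rows and the helper name `series_per_modality` are hypothetical and not part of `idc-index`.

```python
import pandas as pd

# Hypothetical toy rows that mimic a few columns of the IDC index.
# The values are made up for illustration only.
toy_index = pd.DataFrame(
    {
        "collection_id": ["nlst", "nlst", "nlst", "other_collection"],
        "SeriesInstanceUID": ["1.2.3.1", "1.2.3.2", "1.2.3.3", "1.2.3.4"],
        "Modality": ["CT", "CT", "SEG", "MR"],
        "series_size_MB": [120.5, 98.0, 1.5, 40.0],
    }
)


def series_per_modality(index_df: pd.DataFrame, collection_id: str) -> pd.Series:
    """Count distinct series per Modality within one collection."""
    subset = index_df[index_df["collection_id"] == collection_id]
    return subset.groupby("Modality")["SeriesInstanceUID"].nunique()


counts = series_per_modality(toy_index, "nlst")
print(counts)
```

With the real index, the same helper should work when called as `series_per_modality(client.index, "nlst")`, since it relies only on columns listed above.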



## IDC "collections" vs "analysis results"

IDC content is not static. Our goal is to continuously enrich the images available in IDC with annotations, analysis results, extracted features, etc., to make them more usable.

Every DICOM series in IDC is accompanied by the following 3 attributes (among many others!):
* `collection_id`: this attribute is always non-empty, and indicates that the series belongs to a given collection
* `analysis_result_id`: this attribute is assigned when the series was produced by post-processing, annotating or otherwise analyzing the content of the original collection.
* `source_DOI`: Digital Object Identifier (DOI) of the record that describes the dataset that contributed a given series. You should use this DOI to cite the dataset you use, and to learn more about how it was generated and by whom.

In the case of NLST, the "original" content of the collection consisted of CT images in DICOM format and slide microscopy images in Aperio SVS format. Subsequently, the CT images were segmented using different tools, producing analysis results that became part of the collection but can be distinguished from the original content using `analysis_result_id`.
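
The distinction described here can be sketched in pandas as a split on whether `analysis_result_id` is set. This is a minimal sketch on a made-up dataframe: it assumes that a missing `analysis_result_id` appears as a null value (`None`/`NaN`) in the dataframe, and the helper name `split_original_and_derived` is hypothetical.

```python
import pandas as pd

# Hypothetical toy rows; only the column names follow the IDC index.
toy_index = pd.DataFrame(
    {
        "SeriesInstanceUID": ["1.2.3.1", "1.2.3.2", "1.2.3.3"],
        "Modality": ["CT", "SM", "SEG"],
        "analysis_result_id": [None, None, "TotalSegmentator-CT-Segmentations"],
    }
)


def split_original_and_derived(index_df: pd.DataFrame):
    """Return (original, derived) rows based on analysis_result_id.

    Assumption: series from the original collection have a null
    analysis_result_id, while analysis results have it set.
    """
    is_derived = index_df["analysis_result_id"].notna()
    return index_df[~is_derived], index_df[is_derived]


original, derived = split_original_and_derived(toy_index)
print("original modalities:", list(original["Modality"]))
print("derived modalities:", list(derived["Modality"]))
```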

To demonstrate how this works for the NLST collection, let's see the analysis results collections that are included in it, and the DOIs describing those results.

In [36]:
client.index[client.index['collection_id']=='nlst'][['analysis_result_id','source_DOI']].drop_duplicates()

| | analysis_result_id | source_DOI |
|---|---|---|
| 101 | | 10.5281/zenodo.12689650 |
| 173065 | nnU-Net-BPR-annotations | 10.5281/zenodo.7473970 |
| 180879 | BAMF-AIMI-Annotations | 10.5281/zenodo.8345959 |
| 189079 | TotalSegmentator-CT-Segmentations | 10.5281/zenodo.8347011 |
| 622754 | | 10.7937/tcia.hmq8-j677 |


As an example, the DOI for the `TotalSegmentator-CT-Segmentations` is `10.5281/zenodo.8347011`, and the corresponding URL https://doi.org/10.5281/zenodo.8347011 will bring you to the data descriptor with the details of this collection.
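
The DOI-to-URL mapping is mechanical, so it can be applied to every `source_DOI` value at once. The helper below is a small sketch with a hypothetical name (`doi_to_url`); it simply prepends the `https://doi.org/` resolver prefix used in the example above. The DOIs are the ones shown in the table for the NLST collection.

```python
def doi_to_url(doi: str) -> str:
    """Turn a bare DOI into a resolvable https://doi.org URL."""
    return "https://doi.org/" + doi.strip()


# DOIs listed for the NLST collection in the table above
nlst_dois = [
    "10.5281/zenodo.12689650",
    "10.5281/zenodo.7473970",
    "10.5281/zenodo.8345959",
    "10.5281/zenodo.8347011",
    "10.7937/tcia.hmq8-j677",
]

urls = [doi_to_url(doi) for doi in nlst_dois]
for url in urls:
    print(url)
```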

## Breaking down the collection content

We are getting close to answering the question from the user!

Given the organization of data into collections and analysis results, the following cell creates a summary of all of the data in the NLST collection, grouping series by `analysis_result_id`, `source_DOI` and `Modality` to make the content easier to understand.

This cell uses the SQL interface for selection. If SQL is unfamiliar to you, you can also work with the dataframe using the pandas interface - as we show in a bit!

In [39]:
query = """
SELECT
  collection_id,
  analysis_result_id,
  source_DOI,
  Modality,
  SUM(series_size_MB)/1E6 AS total_size_TB,
  COUNT(DISTINCT SeriesInstanceUID) AS series_count
FROM
  index
WHERE
  collection_id = 'nlst'
GROUP BY
  collection_id,
  analysis_result_id,
  source_DOI,
  Modality
ORDER BY
  analysis_result_id,
  Modality
"""

result_df = client.sql_query(query)

result_df

| | collection_id | analysis_result_id | source_DOI | Modality | total_size_TB | series_count |
|---|---|---|---|---|---|---|
| 0 | nlst | BAMF-AIMI-Annotations | 10.5281/zenodo.8345959 | SEG | 0.008014 | 1144 |
| 1 | nlst | TotalSegmentator-CT-Segmentations | 10.5281/zenodo.8347011 | SEG | 14.187099 | 126051 |
| 2 | nlst | TotalSegmentator-CT-Segmentations | 10.5281/zenodo.8347011 | SR | 0.141128 | 252102 |
| 3 | nlst | nnU-Net-BPR-annotations | 10.5281/zenodo.7473970 | SEG | 0.012109 | 1039 |
| 4 | nlst | nnU-Net-BPR-annotations | 10.5281/zenodo.7473970 | SR | 0.000368 | 3117 |
| 5 | nlst | | 10.7937/tcia.hmq8-j677 | CT | 11.137335 | 203087 |
| 6 | nlst | | 10.5281/zenodo.12689650 | SM | 0.846308 | 1232 |


As you can see above, only one row in the summary table has the CT modality. It is accompanied by the DOI that points back to TCIA, `10.7937/tcia.hmq8-j677`, and the total size of the CT component is ~11.14TB.
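
To make the breakdown explicit, the arithmetic can be redone directly from the `total_size_TB` column of the summary table above. This is only a worked sum over the numbers already shown; the variable names are illustrative. The total computed this way need not match the figure quoted in the user's question exactly, since IDC content changes between releases and the unit conventions used by different pages are not verified here.

```python
# (analysis_result_id, Modality) -> total_size_TB, copied from the summary table.
# None stands for rows where analysis_result_id is empty (original content).
sizes_tb = {
    ("BAMF-AIMI-Annotations", "SEG"): 0.008014,
    ("TotalSegmentator-CT-Segmentations", "SEG"): 14.187099,
    ("TotalSegmentator-CT-Segmentations", "SR"): 0.141128,
    ("nnU-Net-BPR-annotations", "SEG"): 0.012109,
    ("nnU-Net-BPR-annotations", "SR"): 0.000368,
    (None, "CT"): 11.137335,
    (None, "SM"): 0.846308,
}

total_tb = sum(sizes_tb.values())
original_tb = sum(
    size for (result_id, _), size in sizes_tb.items() if result_id is None
)
derived_tb = total_tb - original_tb
ct_tb = sizes_tb[(None, "CT")]

print(f"whole collection:       {total_tb:.2f}")
print(f"original images (CT+SM): {original_tb:.2f}")
print(f"analysis results:        {derived_tb:.2f}")
print(f"CT only:                 {ct_tb:.2f}")
```

In this table the analysis results, dominated by the TotalSegmentator segmentations, account for more of the total than the CT images themselves.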

In the following cell, the same query is accomplished using the pandas interface.

In [40]:
result = (
    client.index[client.index['collection_id'] == 'nlst']
    .groupby(['collection_id', 'analysis_result_id', 'source_DOI','Modality'], dropna=False)
    .agg({
        'series_size_MB': lambda x: x.sum() / 1e6,
        'SeriesInstanceUID': 'nunique'
    })
    .reset_index()
    .rename(columns={
        'series_size_MB': 'total_size_TB',
        'SeriesInstanceUID': 'series_count'
    })
    .sort_values(['analysis_result_id', 'Modality'], na_position='first')
)

result

| | collection_id | analysis_result_id | source_DOI | Modality | total_size_TB | series_count |
|---|---|---|---|---|---|---|
| 6 | nlst | | 10.7937/tcia.hmq8-j677 | CT | 11.137335 | 203087 |
| 5 | nlst | | 10.5281/zenodo.12689650 | SM | 0.846308 | 1232 |
| 0 | nlst | BAMF-AIMI-Annotations | 10.5281/zenodo.8345959 | SEG | 0.008014 | 1144 |
| 1 | nlst | TotalSegmentator-CT-Segmentations | 10.5281/zenodo.8347011 | SEG | 14.187099 | 126051 |
| 2 | nlst | TotalSegmentator-CT-Segmentations | 10.5281/zenodo.8347011 | SR | 0.141128 | 252102 |
| 3 | nlst | nnU-Net-BPR-annotations | 10.5281/zenodo.7473970 | SEG | 0.012109 | 1039 |
| 4 | nlst | nnU-Net-BPR-annotations | 10.5281/zenodo.7473970 | SR | 0.000368 | 3117 |


## Downloading specific components of the collection

Now that we have explained how to identify subsets of the collection, a given subset can be downloaded by passing the list of `SeriesInstanceUID`s for the series of interest. The next two cells demonstrate this process using the SQL and pandas interfaces.

For your convenience, the download function will not proceed if you do not have sufficient disk space in the destination directory, and it will report the progress of the download. Further, the downloaded files will be sorted into a collection/patient/study/series folder hierarchy. Learn more about the function from its documentation: `print(client.download_from_selection.__doc__)`.
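
If you want to run a similar free-space check yourself before starting a download, for example when planning storage on a cluster, the sketch below shows one way to do it with the Python standard library. It is not the check implemented inside `idc-index`; the helper name `has_enough_space` is hypothetical, and it assumes that 1 MB equals 10^6 bytes.

```python
import shutil


def has_enough_space(required_mb: float, destination: str = ".") -> bool:
    """Return True if `destination` has at least `required_mb` free.

    Assumption: 1 MB is taken to be 1,000,000 bytes.
    """
    free_bytes = shutil.disk_usage(destination).free
    return required_mb * 1_000_000 <= free_bytes


# ~11.14 TB expressed in MB, matching the size of the CT component
ct_size_mb = 11.14 * 1_000_000
print("Enough space for all NLST CT images:", has_enough_space(ct_size_mb))
```

The required size for any selection can be obtained by summing `series_size_MB` over the selected rows of the index.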

In [28]:
query = """
SELECT
  SeriesInstanceUID
FROM
  index
WHERE
  collection_id = 'nlst' AND
  Modality = 'CT'
"""

selection_df = client.sql_query(query)

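# Pass the full list of CT series, which is what the output below reflects.
# To try a small download first, slice the list, for example with [:10].
client.download_from_selection(
    seriesInstanceUID=list(selection_df["SeriesInstanceUID"]),
    downloadDir=".",
)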

ERROR:idc_index.index:Not enough free space on disk to download the files.
ERROR:idc_index.index:Total size of files to download: 11.14 TB
ERROR:idc_index.index:Total free space on disk: 204.55 GB


In [32]:
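# Select the CT series of the NLST collection with a boolean mask
nlst_ct_series = client.index[
    (client.index['collection_id'] == 'nlst') & (client.index['Modality'] == 'CT')
]['SeriesInstanceUID']

# Use the pandas selection (not the SQL result from the previous cell)
client.download_from_selection(
    seriesInstanceUID=list(nlst_ct_series),
    downloadDir=".",
)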

ERROR:idc_index.index:Not enough free space on disk to download the files.
ERROR:idc_index.index:Total size of files to download: 11.14 TB
ERROR:idc_index.index:Total free space on disk: 204.55 GB
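
The same pattern extends to the other components identified earlier, such as a single analysis result. The sketch below selects the series of one `analysis_result_id` and estimates their combined size before any download is attempted. It runs on a made-up dataframe so that it is self-contained; the rows and the helper name `select_analysis_result` are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows; only the column names follow the IDC index.
toy_index = pd.DataFrame(
    {
        "collection_id": ["nlst", "nlst", "nlst", "nlst"],
        "analysis_result_id": [
            None,
            "BAMF-AIMI-Annotations",
            "BAMF-AIMI-Annotations",
            "TotalSegmentator-CT-Segmentations",
        ],
        "SeriesInstanceUID": ["1.2.3.1", "1.2.3.2", "1.2.3.3", "1.2.3.4"],
        "series_size_MB": [150.0, 2.0, 3.5, 60.0],
    }
)


def select_analysis_result(index_df, collection_id, analysis_result_id):
    """Return (list of SeriesInstanceUIDs, total size in MB) for one analysis result."""
    mask = (index_df["collection_id"] == collection_id) & (
        index_df["analysis_result_id"] == analysis_result_id
    )
    subset = index_df[mask]
    return list(subset["SeriesInstanceUID"]), float(subset["series_size_MB"].sum())


uids, size_mb = select_analysis_result(toy_index, "nlst", "BAMF-AIMI-Annotations")
print(f"{len(uids)} series, {size_mb} MB")
```

With the real index, the returned list could be passed to `client.download_from_selection(seriesInstanceUID=uids, downloadDir=".")`, as in the cells above.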


## Summary

We hope you enjoyed this tutorial! If something didn't work as expected, or if you have feedback or suggestions on what should be added to this tutorial, please contact IDC support by sending an email to support@canceridc.dev or by posting your question on the [IDC user forum](https://discourse.canceridc.dev).

You will find more IDC tutorials in the [IDC-Tutorials GitHub repository](https://github.com/ImagingDataCommons/IDC-Tutorials).

## Acknowledgments

Imaging Data Commons has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l.

If you use IDC in your research, please cite the following publication:

> Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., Kim, E. & Kikinis, R. _National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence_. RadioGraphics (2023). https://doi.org/10.1148/rg.230180