This section describes the organization of IDC data from IDC Version 2. IDC Version 1 data organization is described in the Organization of data in V1 section.
IDC storage and management of DICOM data relies on the Google Cloud Platform. We maintain three representations of the data, which are fully synchronized and correspond to the same dataset, but are intended to serve different use cases.
{% hint style="warning" %} To access the resources listed below, you must first complete the "getting started" steps for accessing the Google Cloud console. {% endhint %}
All of the resources listed below are accessible under the `canceridc-data` GCP project.
{% hint style="info" %} Storage Buckets are basic containers in Google Cloud that provide storage for data objects (you can read more about the relevant terms in the Google Cloud Storage documentation here). {% endhint %}
All IDC DICOM files for all IDC data versions are maintained in Google Cloud Storage (GCS), from which they are available to the user on a Requester Pays basis. Currently all DICOM files are in the `idc-open` bucket.
The object namespace is flat: every object name consists of a standard-format UUID followed by the ".dcm" file extension, e.g. `905c82fd-b1b7-4610-8808-b0c8466b4dee.dcm`. That instance, for example, can be accessed using gsutil as `gs://idc-open/905c82fd-b1b7-4610-8808-b0c8466b4dee.dcm`.
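As a minimal sketch of the naming convention above, the following Python helper builds the GCS URL of an instance from its UUID. The function name `gcs_url_for_instance` is hypothetical and not part of any IDC tooling; the bucket name and the ".dcm" suffix are taken from the text, and the lowercase hyphenated form of the UUID is assumed from the example.

```python
import uuid

IDC_BUCKET = "idc-open"  # bucket named in the text above


def gcs_url_for_instance(instance_uuid: str, bucket: str = IDC_BUCKET) -> str:
    """Return the gs:// URL of the object named <uuid>.dcm in a flat bucket.

    Hypothetical helper: raises ValueError if the name is not a valid UUID.
    """
    # uuid.UUID() validates the string; str() yields the canonical
    # lowercase, hyphenated form (ASSUMPTION: object names use that form).
    canonical = str(uuid.UUID(instance_uuid))
    return f"gs://{bucket}/{canonical}.dcm"


print(gcs_url_for_instance("905c82fd-b1b7-4610-8808-b0c8466b4dee"))
```

Validating the UUID up front catches malformed names before any billed request is made against the bucket.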
You can read about accessing GCP storage buckets from a Compute VM here. Because the `idc-open` bucket has a Requester Pays policy, you will need to provide the Project ID of a project with billing configured in order to download data from that bucket.
{% hint style="warning" %} Make sure you understand the data egress charges! As a general rule of thumb, downloading data to a GCP compute VM in the same GCS location as the bucket is free, while downloading to your laptop, to a Google VM in a different GCS location, or to a VM that belongs to a different cloud provider is expensive!
As an example, if you were to download to your laptop ALL of the DICOM data included in the V2 release of IDC, which is about 6 TB, you would need to pay a total of around $120 in egress charges. {% endhint %}
Assuming you have a list of GCS URLs in `gcs_paths.txt`, you can download the corresponding items using the command below, substituting `$PROJECT_ID` with a valid GCP Project ID (see the complete example in this notebook):
$ cat gcs_paths.txt | gsutil -u $PROJECT_ID -m cp -I .
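The same download can also be scripted. The Python sketch below writes a list of URLs to a manifest file and assembles the `gsutil` invocation shown above. The helper names are hypothetical, and `download` assumes that `gsutil` is installed, on `PATH`, and authenticated for a project with billing configured.

```python
import subprocess
from pathlib import Path
from typing import Iterable, List


def write_manifest(urls: Iterable[str], path: str = "gcs_paths.txt") -> Path:
    """Write one gs:// URL per line, the format `gsutil cp -I` reads from stdin."""
    manifest = Path(path)
    manifest.write_text("".join(f"{u}\n" for u in urls))
    return manifest


def gsutil_command(project_id: str, dest: str = ".") -> List[str]:
    """Build the argument list for: gsutil -u PROJECT_ID -m cp -I DEST."""
    # -u bills the Requester Pays bucket to project_id; -m copies in parallel;
    # -I makes cp read the list of source URLs from standard input.
    return ["gsutil", "-u", project_id, "-m", "cp", "-I", dest]


def download(manifest: Path, project_id: str, dest: str = ".") -> None:
    """Run gsutil with the manifest on stdin (requires gsutil and billing)."""
    with manifest.open("rb") as stdin:
        subprocess.run(gsutil_command(project_id, dest), stdin=stdin, check=True)
```

Passing the command as an argument list rather than a shell string avoids quoting problems if the project ID or destination path contains unusual characters.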
{% hint style="info" %} Google BigQuery (BQ) is a massively-parallel analytics engine ideal for working with tabular data. Data stored in BQ can be accessed using standard SQL queries. {% endhint %}
The flat address space of IDC DICOM objects in GCS storage is accompanied by BigQuery tables that allow the researcher to reconstruct the DICOM hierarchy as it exists for any given version.
There are several important BQ tables and views in which we keep copies of the metadata exposed via the TCIA interface at the time this version was captured and other pertinent information.
There is an instance of each of the following tables and views per IDC version. The set of tables and views corresponding to an IDC version is collected in a single BQ dataset, `idc_v<idc_version_number>`. For example, the BQ tables for IDC version 2 are in the `canceridc-data.idc_v2` dataset.
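Because the dataset name follows a fixed pattern, fully qualified table names can be derived from the version number. The Python sketch below does that and builds a standard SQL query string that looks up the `gcs_url` values of one series in the `auxiliary_metadata` table (both column names appear in that table's description in this section). The function names and the `@series_uid` query parameter are illustrative assumptions, and the string still has to be submitted through a BigQuery client of your choice.

```python
IDC_PROJECT = "canceridc-data"  # GCP project named in the text


def table_ref(idc_version: int, table: str, project: str = IDC_PROJECT) -> str:
    """Return `project.idc_v<N>.table` for the given IDC version."""
    return f"{project}.idc_v{idc_version}.{table}"


def series_urls_query(idc_version: int) -> str:
    """Build a query selecting the GCS URLs of the series named by @series_uid."""
    table = table_ref(idc_version, "auxiliary_metadata")
    return (
        "SELECT gcs_url\n"
        f"FROM `{table}`\n"
        "WHERE SeriesInstanceUID = @series_uid"
    )


print(table_ref(2, "auxiliary_metadata"))
print(series_urls_query(2))
```

Using a named query parameter instead of formatting the UID into the string keeps user-supplied values out of the SQL text.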
Several Google BigQuery (BQ) tables support searches against metadata extracted from the data files. Additional BQ tables define the composition of each IDC data version.
We maintain several additional tables that curate non-DICOM metadata (e.g., attribution of a given item to a specific collection and DOI, collection-level metadata, etc.).
- `canceridc-data.idc_v<idc_version_number>.auxiliary_metadata`: This table defines the contents of the corresponding IDC version. There is a row for each instance in the version.
  - Version attributes:
    - `idc_version_number`: The IDC version number.
    - `version_hash`: The md5 hash of the sorted `collection_hashes` of all collections in this version.
  - Collection attributes:
    - `tcia_api_collection_id`: The ID, as accepted by the TCIA API, of the original data collection containing this instance.
    - `idc_webapp_collection_id`: The ID, as accepted by the IDC web app, of the original data collection containing this instance.
    - `source_doi`: A DOI of the TCIA wiki page corresponding to the original data collection or analysis results that is the source of this instance.
    - `collection_hash`: The md5 hash of the sorted `patient_hashes` of all patients in the collection containing this instance.
  - Patient attributes:
    - `submitter_case_id`: The ID assigned by the submitter (of data to TCIA) to the patient containing this instance. This is the DICOM PatientID.
    - `idc_case_id`: IDC-generated UUID that uniquely identifies the patient containing this instance. This is needed because DICOM PatientIDs are not required to be globally unique.
    - `patient_hash`: The md5 hash of the sorted `study_hashes` of all studies in the patient containing this instance.
  - Study attributes:
    - `StudyInstanceUID`: DICOM UID of the study containing this instance.
    - `study_uuid`: IDC-assigned UUID that identifies a version of the study containing this instance.
    - `study_hash`: The md5 hash of the sorted `series_hashes` of all series in the study containing this instance.
  - Series attributes:
    - `SeriesInstanceUID`: DICOM UID of the series containing this instance.
    - `series_uuid`: IDC-assigned UUID that identifies a version of the series containing this instance.
    - `series_hash`: The md5 hash of the sorted `instance_hashes` of all instances in the series containing this instance.
  - Instance attributes:
    - `SOPInstanceUID`: DICOM UID of this instance.
    - `instance_uuid`: IDC-assigned UUID that identifies a version of this instance.
    - `gcs_url`: The GCS URL of a file containing the version of this instance that is identified by the `instance_uuid`.
    - `instance_hash`: The md5 hash of the version of this instance that is identified by the `instance_uuid`.
    - `instance_size`: The size, in bytes, of the version of this instance that is identified by the `instance_uuid`.
- `canceridc-data.idc_v<idc_version_number>.dicom_metadata`: DICOM metadata for each instance in the corresponding IDC version. IDC utilizes the standard capabilities of the Google Healthcare API to extract all of the DICOM metadata from the hosted collections into a single BQ table. The conventions for converting DICOM attributes of various types into BQ form are covered in the Google Healthcare API documentation article "Understanding the BigQuery DICOM schema". IDC users can access this table to explore the metadata content in detail and to build cohorts using fine-grained controls not available in the IDC portal. The schema is too large to document here; refer to the BQ table and the documentation referenced above.
{% hint style="warning" %} Due to existing limitations of the Google Healthcare API, not all DICOM attributes are extracted and available in the BigQuery tables. Specifically:
- Sequences that have more than 15 levels of nesting are not extracted (see https://cloud.google.com/bigquery/docs/nested-repeated). We believe this limitation does not affect the data stored in IDC.
- Sequences that contain around 1 MiB of data are currently dropped from the BigQuery export and from RetrieveMetadata output. 1 MiB is not an exact limit, but it can be used as a rough estimate of whether the API will drop the tag (this limitation was not documented at the time of writing). We know that some of the instances in IDC are affected by this limitation. According to communication with Google Healthcare support, a fix is targeted for sometime in 2021. {% endhint %}
- `canceridc-data.idc_v<idc_version_number>.original_collections_metadata`: Collection-level metadata for the original TCIA data collections hosted by IDC, for the most part corresponding to the content available in this table at TCIA. One row per collection:
  - `tcia_api_collection_id`: The collection ID as accepted by the TCIA API
  - `tcia_wiki_collection_id`: The collection ID as on the TCIA wiki page
  - `idc_webapp_collection_id`: The collection ID as accepted by the IDC web app
  - `Status`: Collection status: Ongoing or complete
  - `Access`: Collection access conditions: Limited or Public
  - `ImageType`: Enumeration of image types/modalities in the collection
  - `Subjects`: Number of subjects in the collection
  - `DOI`: DOI that can be resolved at doi.org to the TCIA wiki page for this collection
  - `CancerType`: TCIA-assigned cancer type of this collection
  - `SupportingData`: Type(s) of additional data available
  - `Location`: Body location that was studied
  - `Description`: TCIA description of the collection (HTML format)
- `canceridc-data.idc_v<idc_version_number>.analysis_results_metadata`: Metadata for the TCIA analysis results hosted by IDC, for the most part corresponding to the content available in this table at TCIA. One row per analysis result:
  - `Collection`: Descriptive name
  - `DOI`: DOI that can be resolved at doi.org to the TCIA wiki page for this analysis result
  - `CancerType`: TCIA-assigned cancer type of this analysis result
  - `Location`: Body location that was studied
  - `Subjects`: Number of subjects in the analysis result
  - `Collections`: Original collections studied
  - `AnalysisArtifactsonTCIA`: Type(s) of analysis artifacts generated
  - `Updated`: Date when results were last updated
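The hash columns of the `auxiliary_metadata` table form a roll-up: each level is the md5 hash of the sorted hashes of its children. The Python sketch below illustrates the idea with `hashlib`. How IDC serializes the sorted child hashes before hashing is not stated here, so plain concatenation of the hex digests is an assumption, and values computed this way are not guaranteed to match those stored in the table.

```python
import hashlib
from typing import Iterable


def md5_hex(data: bytes) -> str:
    """Hex md5 digest of raw bytes, e.g. the contents of one instance file."""
    return hashlib.md5(data).hexdigest()


def rollup_hash(child_hashes: Iterable[str]) -> str:
    """md5 of the sorted child hashes (ASSUMPTION: joined with no separator)."""
    joined = "".join(sorted(child_hashes))
    return md5_hex(joined.encode("ascii"))


# Toy data standing in for the bytes of three instance files.
instance_hashes = [md5_hex(b) for b in (b"inst-a", b"inst-b", b"inst-c")]
series_hash = rollup_hash(instance_hashes)   # instances -> series
study_hash = rollup_hash([series_hash])      # series -> study
patient_hash = rollup_hash([study_hash])     # studies -> patient

# Sorting makes the result independent of the order of the children.
assert series_hash == rollup_hash(reversed(instance_hashes))
```

A roll-up of this kind lets two versions of a study, patient, or collection be compared by a single value: if the hashes match, nothing beneath that level has changed.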
In addition to the tables above, we provide the following BigQuery views (virtual tables defined by queries) that extract specific subsets of metadata, or combine attributes across different tables, for the convenience of users:
- `canceridc-data.idc_v<idc_version_number>.dicom_all`: DICOM metadata together with selected auxiliary and collection metadata
- `canceridc-data.idc_v<idc_version_number>.segmentations`: Attributes of the segments stored in DICOM Segmentation objects
- `canceridc-data.idc_v<idc_version_number>.measurement_groups`: Measurement group sequences extracted from the DICOM SR TID1500 objects
- `canceridc-data.idc_v<idc_version_number>.qualitative_measurements`: Coded evaluation results extracted from the DICOM SR TID1500 objects
- `canceridc-data.idc_v<idc_version_number>.quantitative_measurements`: Quantitative evaluation results extracted from the DICOM SR TID1500 objects
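Rows retrieved from these tables and views can be regrouped into the DICOM patient/study/series/instance hierarchy on the client side. The following is a minimal Python sketch, assuming each row is a dictionary carrying `PatientID`, `StudyInstanceUID`, `SeriesInstanceUID`, `SOPInstanceUID` and `gcs_url`. Whether a given view exposes all five columns should be checked against its schema, and the sample rows are made-up placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping

Hierarchy = Dict[str, Dict[str, Dict[str, Dict[str, str]]]]


def build_hierarchy(rows: Iterable[Mapping[str, str]]) -> Hierarchy:
    """Nest rows as patient -> study -> series -> {SOPInstanceUID: gcs_url}."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))
    for row in rows:
        series = tree[row["PatientID"]][row["StudyInstanceUID"]][row["SeriesInstanceUID"]]
        series[row["SOPInstanceUID"]] = row["gcs_url"]
    # Convert the nested defaultdicts to plain dicts so that looking up a
    # missing key raises KeyError instead of silently creating an entry.
    return {
        patient: {study: dict(series_map) for study, series_map in studies.items()}
        for patient, studies in tree.items()
    }


# Placeholder rows; real UIDs and URLs would come from a BigQuery result.
rows = [
    {"PatientID": "P1", "StudyInstanceUID": "1.2.1", "SeriesInstanceUID": "1.2.1.1",
     "SOPInstanceUID": "1.2.1.1.1", "gcs_url": "gs://example-bucket/a.dcm"},
    {"PatientID": "P1", "StudyInstanceUID": "1.2.1", "SeriesInstanceUID": "1.2.1.1",
     "SOPInstanceUID": "1.2.1.1.2", "gcs_url": "gs://example-bucket/b.dcm"},
    {"PatientID": "P2", "StudyInstanceUID": "1.3.1", "SeriesInstanceUID": "1.3.1.1",
     "SOPInstanceUID": "1.3.1.1.1", "gcs_url": "gs://example-bucket/c.dcm"},
]
tree = build_hierarchy(rows)
```

Note that DICOM PatientIDs are not required to be globally unique, so keying the top level on `idc_case_id` may be preferable when rows span several collections.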
IDC utilizes a single Google Healthcare DICOM store to host all of the instances in the current IDC version. That store, however, is primarily intended to support visualization of the data using the OHIF Viewer. At this time, we do not support access to the hosted data via the DICOMweb interface for IDC users. See more details in the discussion here, and please comment about your use case if you need to access data via the DICOMweb interface.
In addition to the DICOM data, some of the image-related data hosted by IDC is stored in additional tables. These include the following:
- BigQuery TCGA clinical data: `isb-cgc:TCGA_bioclin_v0.clinical_v1`. Note that this table is hosted under the ISB-CGC Google project, as documented here, and its location may change in the future!