
claude_report_idc_vs_midrc.md

Andrey Fedorov edited this page Feb 12, 2026 · 1 revision

Comparative Analysis: IDC vs MIDRC

Generated on 2026-02-12 using Claude Code (Claude Opus 4.6). The conversation was prompted by Andrey Fedorov (Brigham and Women's Hospital / Harvard Medical School), who is a co-PI of IDC and is also supported in part by the MIDRC project. His dual involvement means the prompts likely reflected more domain knowledge about both platforms than an average user would have, but also that the framing may have been influenced by deeper familiarity with IDC's architecture and capabilities. IDC statistics were verified programmatically using the imaging-data-commons Claude skill, which provides direct access to idc-index (v0.11.9, IDC data v23). MIDRC statistics were verified via live Guppy GraphQL API queries against both the MIDRC portal (data.midrc.org/guppy/graphql) and the BDF Imaging Hub (imaging-hub.data-commons.org/guppy/graphql), both of which are publicly accessible without authentication. The MIDRC data volume was additionally verified via the Indexd API (data.midrc.org/index/_stats).


Executive Summary

The NCI Imaging Data Commons (IDC) and the NIBIB/ARPA-H Medical Imaging and Data Resource Center (MIDRC) are two major federally funded open imaging data platforms in the United States. Both launched in 2020, both use DICOM as their primary data standard, and both are built on open-source infrastructure — but they serve different communities, solve different problems, and make different architectural choices.

IDC is a cancer-focused repository hosting 79,569 patients, 994,073 series, and 95.33 TB across 161 collections spanning 26 imaging modalities (CT, MRI, digital pathology, mammography, PET, and more). All data is harmonized to DICOM and served from public cloud storage (AWS + GCS) with zero authentication required — for both metadata queries and file downloads. IDC's strength is its depth of derived data: 188,013 segmentation series, 6.17 billion pathology annotations, and radiomics features for nearly 10 million structures, all queryable via SQL. IDC operates as a node within the NCI Cancer Research Data Commons (CRDC), enabling cross-commons queries spanning imaging, genomics, and proteomics.

MIDRC is a COVID-19/respiratory disease platform that has expanded to chronic diseases, hosting 84,768 cases, 202,222 studies, and 12.36 TB across 78 dataset batches from 27 U.S. hospitals. While narrower in modality (primarily chest X-ray and CT), MIDRC offers unique capabilities: 203,374 clinical measurements, 12,124 radiology reports, and 59,492 annotations as dedicated queryable data types — plus a 20% sequestered dataset for unbiased AI evaluation. MIDRC's defining strengths are its bias/fairness tooling (MELODY, REACT, AI Reliability Tool), its privacy-preserving record linkage with EHR data (N3C), and the BDF Imaging Hub — a federated search spanning 7 imaging repositories including IDC itself.

Licensing differs significantly. IDC assigns licenses per DICOM object (not per collection), with 97%+ of series under CC-BY (4.0 or 3.0), permitting commercial use with attribution. MIDRC uses a custom Data Use Agreement (DUA) covering 94% of its series, with commercial use requiring a separate agreement through the University of Chicago. IDC also makes 100% of its data publicly accessible, while MIDRC sequesters ~20% for unbiased algorithm evaluation.

Key finding for programmatic users: Both platforms offer public metadata APIs requiring no authentication. IDC provides idc-index (local SQL over Parquet, millisecond latency). MIDRC provides Guppy GraphQL at data.midrc.org/guppy/graphql (remote Elasticsearch, seconds latency, 6 data types). MIDRC's Guppy API being publicly accessible was not well-documented and was discovered during this analysis through API testing — it substantially narrows the programmatic access gap between the platforms for metadata exploration.

The platforms are complementary, not competing. Use IDC for cancer imaging, large-scale cloud analytics, digital pathology, and reproducible versioned pipelines. Use MIDRC for COVID-19/respiratory imaging, AI fairness evaluation, sequestered benchmarking, multi-modal clinical integration (imaging + EHR + genomics via cross-commons linkage), and federated cross-repository search. MIDRC's BDF Imaging Hub federates IDC, making both accessible through a single search interface.


Overview

| Dimension | IDC (Imaging Data Commons) | MIDRC (Medical Imaging & Data Resource Center) |
| --- | --- | --- |
| Funder | NCI (Cancer Moonshot) | NIBIB + ARPA-H |
| Host / Operator | Brigham & Women's / Harvard (Fedorov & Kikinis) | University of Chicago (Giger); co-led by ACR, RSNA, AAPM |
| Ecosystem | Node within NCI Cancer Research Data Commons (CRDC) | Independent commons; ARPA-H Biomedical Data Fabric performer; NAIRR pilot |
| Launch | 2020 | 2020 (COVID-19 response) |

1. Mission & Disease Focus

| | IDC | MIDRC |
| --- | --- | --- |
| Core mission | Cloud-based repository for publicly available cancer imaging data, co-located with analysis tools | AI-ready data commons for machine learning innovation, initially COVID-19, now expanding to chronic diseases |
| Disease scope | Cancer only (all organ sites) | COVID-19 + expanding: cancer, diabetes, chronic liver disease, coronary artery disease, COPD, emphysema |
| Strategic emphasis | Transparency, reproducibility, scalability of imaging AI | Bias mitigation, fairness, real-world AI testing, interoperability |

2. Data Scale & Modalities

IDC numbers verified live via idc-index v0.11.9 (IDC data v23, Feb 2026). MIDRC numbers verified via the MIDRC portal Guppy GraphQL API + Indexd API (Feb 2026). BDF Imaging Hub (BIH) numbers are used where portal data is unavailable.

| | IDC (v23) | MIDRC (portal-verified) |
| --- | --- | --- |
| Collections / Datasets | 161 curated collections | 78 dataset batches (RSNA + ACR submissions; largest: RSNA_20230830 with 21,465 cases) |
| Patients / Cases | 79,569 patients | 84,768 cases (portal); 76,193 subjects (BIH; discrepancy likely due to BIH not indexing all data types) |
| Studies | 160,199 studies | 202,222 imaging studies (portal); 189,854 (BIH) |
| Series | 994,073 series | 469,324 series (BIH; portal does not expose series-level counts) |
| Instances (files) | 46,885,909 DICOM instances | 637,431 files (Indexd; includes 492K DCM, 122K JSON, 12K TXT, 2.5K NIfTI) |
| Data volume | 95.33 TB | 12.36 TB (Indexd: data.midrc.org/index/_stats) |
| Imaging modalities | 26 modalities: CT (35K patients), SM/pathology (20K), MG/mammography (17K), MR (8K), PET (1.1K), CR, DX, US, and 18 others | Portal study_modality: DX (94.6K studies), CR (79.6K), CT (25.6K), MR (2.5K). BIH series: CT (170K), CR (127K), DX (107K), SEG (58K), MR (5.9K), SR (1K), + 6 others |
| Anatomical breadth | 20+ body parts: chest (30.7K), breast (20.9K), abdomen (1.8K), lung (1.4K), prostate (1K), colon (850), pelvis, liver, brain, kidney, etc. | Predominantly thoracic: CHEST (221K series), PORT CHEST (60K), ABDOMEN (12K), HEAD (2.7K), HEART (1.4K); ~60% chest, ~34% unlabeled |
| Image-derived data | 188,013 segmentation series covering 154,233 source series; 23 analysis result collections; 7,108 annotation series | 59,492 annotations (portal): SIFT auto-annotations (57,419), mRALE Mastermind Challenge expert annotations (2,072). 58K SEG + 1K SR series (BIH) |
| Additional data types | | 203,374 measurements, 12,124 radiology reports (portal; unique MIDRC data types not found in IDC) |
| Clinical data tables | 97 collections with clinical data across 224 tables (TCGA, CPTAC, HTAN, NLST, ACRIN, RICORD, etc.) | Demographics (portal: Female 39.9K, Male 38.5K; White 41.5K, Black/AA 22.5K, Asian 3.6K; Not Hispanic 62.7K, Hispanic 8K), conditions, medications, vitals, labs, COVID status (positive: 33.8K, negative: 43.5K) |
| Data sources | TCGA, TCIA, CPTAC, CCDI, HTAN, LIDC, QIN, NLST, VHP | 27 U.S. subsites (academic + community hospitals); 78 RSNA + ACR dataset batches |

3. Data Access & Licensing

| | IDC | MIDRC |
| --- | --- | --- |
| Registration required | No - fully open, anonymous access | Partial - metadata queries via Guppy GraphQL work without authentication (see Appendix D); data download requires free registration |
| Authentication | None for data download; GCP auth only for BigQuery | None for metadata (Guppy GraphQL at data.midrc.org/guppy/graphql is publicly accessible); Gen3 auth (OpenID Connect / OAuth 2.0) required for file downloads |
| Licensing | Per-object (not per-collection): each DICOM series tagged with its own license. Analysis results contributed to a collection may differ from original images. Overall: 97%+ CC-BY (CC BY 4.0: 817K series; CC BY 3.0: 146K series); 3% CC-BY-NC (30K series) | MIDRC DUA (441K series, 94%); CC BY 4.0 (26K series, 5.5%); CC BY-NC 4.0 (2.4K series, 0.5%). Commercial use requires a separate agreement via UChicago |
| Egress fees | Zero - free from both AWS and GCS | Free download from Gen3 portal |
| Sequestered data | None - 100% public | 20% sequestered for unbiased algorithm evaluation / regulatory approval |
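
Because IDC attaches licenses per series rather than per collection, a commercial-use filter has to inspect each row's license. The following is a minimal sketch, assuming license strings take the forms quoted above ("CC BY 4.0", "CC BY 3.0", "CC BY-NC 4.0"); the helper name and the sample rows are hypothetical, and the actual values should be checked against the license_short_name column in idc-index.

```python
def allows_commercial_use(license_short_name: str) -> bool:
    """Return True for CC BY licenses without the NonCommercial (NC) clause.

    Assumes strings such as "CC BY 4.0" or "CC BY-NC 4.0"; anything that is
    not recognizably CC BY is treated as not cleared for commercial use.
    """
    normalized = license_short_name.upper().replace("-", " ").split()
    return normalized[:2] == ["CC", "BY"] and "NC" not in normalized


# Hypothetical rows standing in for an idc-index query result.
sample_series = [
    {"SeriesInstanceUID": "series-a", "license_short_name": "CC BY 4.0"},
    {"SeriesInstanceUID": "series-b", "license_short_name": "CC BY 3.0"},
    {"SeriesInstanceUID": "series-c", "license_short_name": "CC BY-NC 4.0"},
]
commercial_ok = [
    row["SeriesInstanceUID"]
    for row in sample_series
    if allows_commercial_use(row["license_short_name"])
]

# Share of CC BY series implied by the rounded counts quoted above
# (817K + 146K CC BY versus 30K CC BY-NC).
cc_by_share = (817_000 + 146_000) / (817_000 + 146_000 + 30_000)
print(commercial_ok, f"{cc_by_share:.1%}")
```

The default is deliberately conservative: an unrecognized license string (including the MIDRC DUA) is reported as not cleared, so a typo or a new license type fails closed.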

4. Technical Architecture

| | IDC | MIDRC |
| --- | --- | --- |
| Platform | Google Cloud Platform + AWS (dual-cloud) | Gen3 Data Ecosystem (open-source, UChicago data center) |
| Data format | All data harmonized to DICOM | DICOM + multiple delivery formats (DICOM SR, SEG, JSON, NIfTI, CSV, XLSX) |
| Storage | 3 public cloud buckets (GCS + S3 mirrors) | Gen3 object storage |
| Metadata query | BigQuery (4000+ DICOM tags, SQL); idc-index Python package (~50 columns, no auth); Parquet on S3 (DuckDB) | Guppy GraphQL API at data.midrc.org/guppy/graphql (no auth, 6 data types: case, imaging_study, data_file, measurement, annotation, radiology_report); Gen3 portal explorer; GA4GH DRS API; graph database for clinical/phenotype data |
| Image retrieval | DICOMweb (IDC proxy, no auth); S3/GCS direct download; Google Healthcare API | Gen3 download client; TCIA integration |
| Programmatic access | idc-index Python package; BigQuery SQL; DICOMweb REST | Guppy GraphQL (no auth for metadata); Gen3 SDK; GA4GH DRS; Indexd API (/index/_stats); Gen3 mesh services |
| Visualization | OHIF v3 (radiology), SliM (pathology), VolView (3D), 3D Slicer extension | Gen3 portal; limited built-in visualization |
| Compute integration | Google Colab, Vertex AI, NIH Cloud Lab, ACCESS HPC | Gen3 workspaces |
| Data versioning | Versioned releases (v1-v23+), CRDC UUIDs persist across versions | Incremental data additions; no formal versioning scheme documented |
| Federated search | No (centralized) | Yes - MIDRC BDF Imaging Hub federates 7 repositories: IDC, MIDRC, TCIA, ACRDart, Stanford AIMI, NIHCC, OpenNeuro (327,895 subjects, 1.96M series total) |

Interoperability: Standards and Connected Systems

Both platforms invest in interoperability, but with different philosophies: IDC prioritizes DICOM standards + CRDC ecosystem integration, while MIDRC prioritizes federated access + cross-commons record linkage.

Standards Comparison

| Standard | IDC | MIDRC |
| --- | --- | --- |
| DICOM | Core philosophy: all data harmonized to DICOM | Native storage format |
| DICOMweb (WADO-RS, QIDO-RS, STOW-RS) | Full support via Google Healthcare API + IDC proxy (no auth) | WADO-RS support |
| GA4GH DRS | Yes: CRDC UUIDs resolve as DRS IDs (dg.4DFC/<uuid>) via CRDC centralized service | Yes: GUIDs resolve to file locations across Gen3 repositories |
| GA4GH Passports | Via CRDC/DCF infrastructure | Core auth mechanism for cross-commons access |
| OpenID Connect / OAuth 2.0 | GCP auth only (for BigQuery); data access is unauthenticated | Required for file downloads; metadata queries via Guppy GraphQL need no authentication |
| FHIR (HL7) | Not directly (NCPI alignment in progress) | Supported for clinical data interoperability |
| PPRL (Privacy-Preserving Record Linkage) | Not implemented | Gen3 Crosswalk Service for secure patient matching across systems |
| Cloud-native | BigQuery SQL; Parquet on S3; dual-cloud (AWS+GCS) mirroring | Gen3 object storage; single cloud |
| CRDC-H (Harmonized Data Model) | Yes: enables cross-commons queries via Cancer Data Aggregator | No (not a CRDC node) |

IDC: Connected Systems and Mechanisms

IDC connects through the NCI CRDC ecosystem and NCPI program:

| System | Linkage Mechanism | What It Provides |
| --- | --- | --- |
| GDC (Genomic Data Commons) | CRDC shared identifiers; Cancer Data Aggregator | Genomic sequencing and variant data |
| PDC (Proteomic Data Commons) | CRDC shared identifiers; Cancer Data Aggregator | Proteomic analysis data |
| ICDC (Canine Data Commons) | CRDC shared identifiers | Canine clinical trial data |
| CTDC (Clinical & Translational Data Commons) | CRDC shared identifiers | Clinical trial data |
| TCIA | Data ingestion: IDC mirrors all public TCIA collections automatically | Archival repository; IDC adds cloud-native access |
| NCPI platforms (AnVIL, BioData Catalyst, dbGaP, Kids First) | GA4GH DRS; shared auth standards | Cross-NIH platform interoperability |
| PACS systems | DICOMweb (QIDO-RS, WADO-RS) | Direct query from clinical imaging systems |

Cancer Data Aggregator (CDA): Unified search across all 6 CRDC nodes — users can build "virtual cohorts" spanning imaging (IDC), genomics (GDC), proteomics (PDC), and clinical data using disease name, anatomical location, demographics, or data type. Uses the CRDC-H harmonized data model for cross-node element mapping.

CRDC Data Commons Framework (DCF): Gen3-based infrastructure minting persistent GUIDs for all 52 million FAIR objects (4.9 PB) across CRDC. DRS IDs resolve to access methods for both GCS and AWS, decoupling logical identifiers from physical storage.
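
To illustrate how a logical identifier is decoupled from physical storage, the sketch below picks a download location from a DRS object record. It assumes the GA4GH DRS v1 response layout (an access_methods list whose entries carry a type and, when a direct URL is available, an access_url.url); the function name, the sample record, and the bucket paths are placeholders for illustration, not actual CRDC output.

```python
from typing import Optional


def pick_access_url(drs_object: dict, preferred=("s3", "gs")) -> Optional[str]:
    """Return the first direct URL whose access method type matches `preferred`.

    Follows the GA4GH DRS v1 object layout: `access_methods` is a list of
    entries with a `type` and, when a direct URL is available, `access_url.url`.
    """
    by_type = {}
    for method in drs_object.get("access_methods", []):
        url = (method.get("access_url") or {}).get("url")
        if url:
            by_type.setdefault(method.get("type"), url)
    for storage_type in preferred:
        if storage_type in by_type:
            return by_type[storage_type]
    return None


# Illustrative record only: the id suffix and bucket paths are placeholders.
example_object = {
    "id": "dg.4DFC/<uuid>",
    "access_methods": [
        {"type": "gs", "access_url": {"url": "gs://example-bucket/<uuid>.dcm"}},
        {"type": "s3", "access_url": {"url": "s3://example-bucket/<uuid>.dcm"}},
    ],
}
print(pick_access_url(example_object))
print(pick_access_url(example_object, preferred=("gs",)))
```

The same identifier yields an S3 or a GCS location depending only on the caller's preference, which is the property the DCF paragraph describes.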

Dual-cloud mirroring: IDC maintains identical copies on AWS S3 and GCS (migrated 40M DICOM objects / 63 TB via AWS DataSync in <41 hours). Parquet metadata exports in idc-open-metadata S3 bucket enable tool-agnostic access via DuckDB, Spark, etc.

MIDRC: Connected Systems and Mechanisms

MIDRC connects through the Gen3 mesh and ARPA-H BDF Imaging Hub:

| System | Linkage Mechanism | What It Provides | Published Use Case |
| --- | --- | --- | --- |
| N3C (National COVID Cohort Collaborative) | PPRL via honest broker | EHR data (diagnoses, labs, vitals, meds, procedures) | COVID-19 severity prediction from chest X-rays + EHR (Bergquist et al., J Imaging Informatics Med 2025) |
| BioData Catalyst (BDC) | Gen3 mesh / Crosswalk | Genomic + clinical data from NHLBI studies | Multimodal cohort construction (Chen/Whitney et al., Scientific Data 2025) |
| IDC | BDF Imaging Hub federated search | Cancer imaging (95 TB, 161 collections) | Cross-repository discovery via BIH |
| TCIA | Direct collection hosting + BIH | MIDRC-RICORD-1A/1B/1C collections served via TCIA | RICORD datasets published through both |
| ACRDart | BDF Imaging Hub affiliate | Radiology imaging data (ACR registry) | Federated search |
| Stanford AIMI | BDF Imaging Hub affiliate | Stanford medical imaging datasets | Federated search |
| All of Us | Gen3 mesh capability | Large-scale population health data | Supported (in development) |

Gen3 "Narrow Middle" architecture: Standardized core services (OpenAPI/RESTful) sit between diverse data ingest/curation and analysis/processing applications. FAIR APIs auto-generated from data models. The Crosswalk Service passes privacy-preserving MIDRC IDs to match patients across repositories without exposing PHI.

MIDRC BDF Imaging Hub (ARPA-H funded):

  • Federated hub providing unified search across 7 repositories: MIDRC, IDC, TCIA, ACRDart, Stanford AIMI, NIHCC, and OpenNeuro
  • Does not centralize/duplicate imaging files — aggregates structured metadata and provides identifiers for researchers to access from original nodes
  • Features cohort builder over aggregated structured metadata
  • Part of ARPA-H Biomedical Data Fabric Toolbox (September 2023)
  • API: Guppy GraphQL endpoint at https://imaging-hub.data-commons.org/guppy/graphql — works without authentication for read-only queries
  • Live-verified scale (Feb 2026): 327,895 subjects, 722,695 studies, 1,963,865 series across all 7 repositories

| Repository | Subjects | Series |
| --- | --- | --- |
| MIDRC | 76,193 | 469,324 |
| IDC | 69,223 | 946,957 |
| Stanford AIMI | 65,514 | 224,496 |
| OpenNeuro | 49,010 | |
| NIHCC | 35,232 | 14,601 |
| TCIA | 28,434 | 201,337 |
| ACRdart | 4,289 | 107,150 |
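
The per-repository figures can be checked against the hub-wide totals quoted above (327,895 subjects; 1,963,865 series). The sketch below copies the numbers from the table. OpenNeuro is listed with a single figure; reading it as a subject count with no series count is an inference, but it is the reading under which both column sums match the totals.

```python
# (subjects, series) per repository, copied from the table above.
# OpenNeuro is listed with a single figure, read here as a subject count,
# so its series count is None.
hub_counts = {
    "MIDRC": (76_193, 469_324),
    "IDC": (69_223, 946_957),
    "Stanford AIMI": (65_514, 224_496),
    "OpenNeuro": (49_010, None),
    "NIHCC": (35_232, 14_601),
    "TCIA": (28_434, 201_337),
    "ACRdart": (4_289, 107_150),
}

total_subjects = sum(subjects for subjects, _ in hub_counts.values())
total_series = sum(series or 0 for _, series in hub_counts.values())

print(total_subjects)  # 327895
print(total_series)    # 1963865
```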

See Appendix C for live-tested API query examples.

Key Architectural Difference

| | IDC | MIDRC |
| --- | --- | --- |
| Interop model | Hub-in-ecosystem: centralized data with CRDC cross-commons query | Federated mesh: distributed data with cross-repository search |
| Cross-commons search | Cancer Data Aggregator queries all 6 CRDC nodes | BDF Imaging Hub federates 7 imaging repositories (verified via live API) |
| Patient matching | Shared CRDC identifiers (same patient across GDC/PDC/IDC) | Privacy-preserving record linkage (PPRL) for patient matching across separate systems |
| Data movement | Data ingested and harmonized centrally (DICOM) | Data stays at source; identifiers routed through mesh |
| NIH ecosystem | NCPI (AnVIL, BDC, dbGaP, Kids First) | ARPA-H BDF + Gen3 mesh (N3C, BDC, All of Us) |

5. AI / ML Support

| | IDC | MIDRC |
| --- | --- | --- |
| AI-generated annotations | Yes - large-scale (e.g., NLST: 10M structures from 125K CT images with radiomics) | Yes - 59,492 annotations (portal-verified): SIFT auto-annotations (57,419, Retrospective_auto), mRALE Mastermind Challenge expert annotations (2,072, Retrospective_expert) |
| Bias / fairness tools | Not a primary focus | Core strength: MIDRC-MELODY (subgroup performance), MIDRC-REACT (dataset representativeness), AI Reliability Tool (30 bias sources across 5 pipeline stages), MetricTree (metric selection) |
| Sequestered evaluation | No | Yes - 20% held-out data for unbiased benchmarking and regulatory use |
| Model hosting | No | 30+ algorithms on GitHub, HuggingFace, PhysioNet |
| Real-world testing | No | TDP 2: Real-time connection with healthcare facilities for dynamic algorithm testing |
| NLP tools | No | RadGraph (entity/relation extraction from radiology reports) |

6. Clinical Data & Multi-modal Integration

| | IDC | MIDRC |
| --- | --- | --- |
| Clinical data | Cancer staging, treatment history, outcomes (varies by collection); structured via eCRFs | Portal-verified demographics: sex (Female 39.9K, Male 38.5K), race (White 41.5K, Black/AA 22.5K, Asian 3.6K), ethnicity (Not Hispanic 62.7K, Hispanic 8K), COVID status (negative 43.5K, positive 33.8K). Additional: conditions, medications, vitals, labs, procedures. 203,374 measurements and 12,124 radiology reports as dedicated queryable data types |
| Cross-commons linkage | CRDC (GDC genomics, PDC proteomics, ICDC canine) via shared identifiers | N3C (EHR), All of Us, BioData Catalyst (genomics); demonstrated COVID severity prediction use case |
| Clinical data standards | DICOM metadata; BigQuery-queryable | LOINC-harmonized; graph database |

7. Data Standards & De-identification

| | IDC | MIDRC |
| --- | --- | --- |
| Primary standard | DICOM (universal harmonization - all data converted to DICOM) | DICOM (native storage) + multiple export formats |
| Identifiers | CRDC UUIDs (instance, series, study) + standard DICOM UIDs | Gen3 object IDs + DICOM UIDs |
| De-identification | Two-stage process; NEMA/RSNA CTP profiles | Stanford De-identifier (text); RSNA DICOM Anonymizer (images); DICOM Harmonization Tool |
| Metadata richness | 4000+ DICOM tags queryable in BigQuery | Curated subset via Gen3 data model |

8. Governance & Community

| | IDC | MIDRC |
| --- | --- | --- |
| Governance model | NCI-funded contract (Leidos + BWH); advisory boards | NIBIB contract; tri-society leadership (ACR, RSNA, AAPM); 5 TDPs + 12 CRPs |
| Open-source | idc-index, viewers, tutorials (GitHub) | Gen3 platform (Apache License); all AI tools open-source with publications |
| User base | Broad cancer research community | >450 registered data users |
| Community engagement | Tutorials, Colab notebooks, 3D Slicer extension | 27 subsites; equity/diversity initiatives; NAIRR pilot |

9. Programmatic Access Comparison

Setup & Authentication

IDC - No setup beyond pip install idc-index. No authentication for queries or downloads:

from idc_index import IDCClient
client = IDCClient()  # Ready immediately

MIDRC - Metadata queries require no authentication (Guppy GraphQL is public). Data downloads require registration + Gen3 SDK:

import requests

# NO AUTH needed for metadata queries via Guppy GraphQL
response = requests.post(
    "https://data.midrc.org/guppy/graphql",
    json={"query": '{ _aggregation { case { _totalCount } imaging_study { _totalCount } } }'}
)
print(response.json())  # Returns: case 84,768; imaging_study 202,222

# For data DOWNLOADS, Gen3 auth is required:
# pip install gen3
from gen3.auth import Gen3Auth
# credentials.json downloaded from https://data.midrc.org/identity (valid 30 days)
auth = Gen3Auth(endpoint="https://data.midrc.org", refresh_file="credentials.json")

Querying Metadata

IDC - SQL queries against a local Parquet index (~50 columns, no network call):

results = client.sql_query("""
    SELECT collection_id, PatientID, SeriesInstanceUID, Modality, series_size_MB
    FROM index
    WHERE Modality = 'CT' AND BodyPartExamined = 'CHEST'
    LIMIT 100
""")

MIDRC - GraphQL queries against Guppy (Elasticsearch-backed, no auth for metadata):

import requests

# Direct Guppy GraphQL — no authentication needed
response = requests.post(
    "https://data.midrc.org/guppy/graphql",
    json={"query": """{ imaging_study(first: 100, filter: {AND: [
        {IN: {body_part_examined: ["CHEST"]}},
        {IN: {study_modality: ["CT"]}}
    ]}) { case_ids study_uid study_modality body_part_examined } }"""}
)
studies = response.json()["data"]["imaging_study"]

# Or via Gen3 SDK (requires auth; `auth` is the Gen3Auth object from the setup step):
# from gen3.query import Gen3Query
# query = Gen3Query(auth)
# response = query.query(data_type='imaging_study', first=100,
#     fields=["case_ids", "study_uid", "study_modality", "object_id"],
#     filters={"body_part_examined": "CHEST", "study_modality": "CT"})
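
Writing the filter inline in the query string becomes error-prone as cohort definitions grow. The sketch below builds the same AND/IN filter from a plain dictionary and passes it as a GraphQL variable. It assumes Guppy accepts a `$filter: JSON` variable (the pattern the Gen3 SDK is understood to use); `guppy_payload` is a hypothetical helper, not part of any MIDRC or Gen3 library.

```python
import json


def guppy_payload(data_type, fields, filters, first=100):
    """Build a Guppy request body that passes the filter as a GraphQL variable."""
    guppy_filter = {
        "AND": [{"IN": {field: list(values)}} for field, values in filters.items()]
    }
    query = (
        "query ($filter: JSON) { "
        f"{data_type}(first: {int(first)}, filter: $filter) "
        "{ " + " ".join(fields) + " } }"
    )
    return {"query": query, "variables": {"filter": guppy_filter}}


payload = guppy_payload(
    "imaging_study",
    ["case_ids", "study_uid", "study_modality", "body_part_examined"],
    {"body_part_examined": ["CHEST"], "study_modality": ["CT"]},
)
print(json.dumps(payload, indent=2))
# To send it (network access required):
#   requests.post("https://data.midrc.org/guppy/graphql", json=payload, timeout=30)
```

Keeping the filter in `variables` avoids hand-escaping values inside the query text and lets the same query string be reused across cohorts.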

Downloading Data

IDC - One-liner, downloads from public S3/GCS buckets (no auth):

client.download_from_selection(
    seriesInstanceUID=["1.3.6.1.4..."],
    downloadDir="./data"
)

MIDRC - Two-step: resolve DRS URI, then download via presigned URL:

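# CLI approach
# gen3 --endpoint https://data.midrc.org --auth creds.json drs-pull object "dg.XXTS/<guid>"

# Or Python: resolve object_id → presigned URL → download
import requests
# `auth` is the Gen3Auth object created in the setup step; it plugs into requests as an auth handler
url = f"https://data.midrc.org/index/{object_id}"
file_info = requests.get(url, auth=auth).json()  # Indexd record (file name, size, hashes)
# Presigned URL: Gen3 deployments typically expose Fence at /user/data/download/<guid>
# (assumed here for MIDRC; not verified in this analysis)
signed = requests.get(f"https://data.midrc.org/user/data/download/{object_id}", auth=auth).json()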

Joining Imaging with Clinical Data

IDC - SQL JOINs across local index tables:

client.fetch_index("clinical_index")
clinical_df = client.get_clinical_table("nlst_prsn")  # loads locally
# Join with imaging data via PatientID or collection_id

MIDRC - Separate queries on Gen3 nodes, merged in pandas:

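import pandas as pd
from gen3.query import Gen3Query

query = Gen3Query(auth)  # `auth` is the Gen3Auth object from the setup step
# "..." marks arguments elided here (for example first= and filters=)
imaging = query.query(data_type='imaging_study', fields=["case_ids", "study_uid", "days_to_study"], ...)
measurements = query.query(data_type='measurement', fields=["case_ids", "test_days_from_index"], ...)
cohort = pd.merge(pd.DataFrame(imaging), pd.DataFrame(measurements), on='case_ids')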
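
One detail worth checking before merging: if case_ids comes back as a list per record (an assumption suggested by the plural field name, not verified here), the records have to be expanded to one row per case before they can be joined. A dependency-free sketch of that join on made-up records:

```python
from collections import defaultdict


def join_on_case(imaging_rows, measurement_rows):
    """Inner-join two record lists on case ID, expanding list-valued case_ids."""
    measurements_by_case = defaultdict(list)
    for row in measurement_rows:
        for case_id in row["case_ids"]:
            measurements_by_case[case_id].append(row)

    joined = []
    for study in imaging_rows:
        for case_id in study["case_ids"]:
            for measurement in measurements_by_case.get(case_id, []):
                joined.append({
                    "case_id": case_id,
                    "study_uid": study["study_uid"],
                    "days_to_study": study["days_to_study"],
                    "test_days_from_index": measurement["test_days_from_index"],
                })
    return joined


# Made-up records shaped like the two query results above.
imaging = [
    {"case_ids": ["case-1"], "study_uid": "1.2.3", "days_to_study": 0},
    {"case_ids": ["case-2"], "study_uid": "1.2.4", "days_to_study": 5},
]
measurements = [
    {"case_ids": ["case-1"], "test_days_from_index": 1},
    {"case_ids": ["case-1"], "test_days_from_index": 3},
    {"case_ids": ["case-3"], "test_days_from_index": 2},
]
cohort = join_on_case(imaging, measurements)
print(len(cohort))
```

In pandas the equivalent step is to call DataFrame.explode("case_ids") on both frames before pd.merge. Note that the join is one-to-many: a study is repeated once per matching measurement.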

Browser Visualization

IDC - Generate viewer URL, opens OHIF/SliM automatically:

import webbrowser

url = client.get_viewer_URL(seriesInstanceUID="1.3.6.1.4...")
webbrowser.open(url)

MIDRC - No equivalent programmatic viewer URL generation; visualization is portal-based.

Side-by-Side Summary

| Scenario | IDC | MIDRC |
| --- | --- | --- |
| Install | pip install idc-index | None for metadata (direct HTTP to Guppy); pip install gen3 + register for downloads |
| Auth | None | None for metadata queries; API key (30-day expiry) for downloads |
| Query language | SQL (DuckDB on local Parquet) | GraphQL (Guppy on remote Elasticsearch) |
| Query latency | Milliseconds (local) | Seconds (network) |
| Download method | Direct S3/GCS public URLs | DRS URI resolution → presigned URL (requires auth) |
| Batch download | download_from_selection() with list of UIDs | gen3 drs-pull manifest manifest.json ./output/ |
| Clinical data join | SQL JOIN on local tables | Separate GraphQL queries on 6 data types, merge in pandas |
| Viewer integration | get_viewer_URL() → OHIF/SliM | Portal only |
| Offline capability | Full (index is local Parquet) | None (all queries require network) |
| Queryable data types | 1 primary index + 9 specialized indices | 6 data types: case, imaging_study, data_file, measurement, annotation, radiology_report |
| Tutorials | IDC-Tutorials | MIDRC tutorial_notebooks |

10. Relationship Between the Two

IDC and MIDRC are complementary, not competing:

  • MIDRC's BDF Imaging Hub federates IDC - users can search IDC data through MIDRC's unified interface alongside TCIA, ACRDart, Stanford AIMI, NIHCC, and OpenNeuro (7 repositories total; see Appendix C)
  • Different funding agencies (NCI vs NIBIB) with different mandates
  • Different technical philosophies: IDC emphasizes DICOM harmonization + cloud-native analytics; MIDRC emphasizes interoperability + federated access + bias-aware AI
  • Minimal data overlap: IDC = cancer imaging from NCI programs; MIDRC = COVID-19/respiratory + multi-institutional clinical data
  • Shared standards: Both use DICOM, both integrate with TCIA

11. Summary: When to Use Which

| Use case | Recommended platform |
| --- | --- |
| Cancer imaging research (any organ) | IDC |
| COVID-19 / respiratory disease imaging | MIDRC |
| Large-scale cloud analytics (BigQuery, SQL) | IDC |
| AI fairness / bias evaluation | MIDRC |
| Unbiased model benchmarking (sequestered data) | MIDRC |
| No-registration, immediate data access | IDC |
| Multi-modal integration (imaging + EHR + genomics) | MIDRC (via N3C, BioData Catalyst links) |
| Digital pathology | IDC |
| Real-world clinical AI testing | MIDRC (TDP 2) |
| Cross-repository federated search | MIDRC (BDF Imaging Hub, which includes IDC) |
| Reproducible, versioned research pipelines | IDC (formal versioning + CRDC UUIDs) |
| Commercial use of data | IDC (97%+ of series CC-BY; check license_short_name per object) |

Appendix A: Image-Derived Data Deep Dive

IDC Image-Derived Data (verified via idc-index)

IDC stores all derived data (segmentations, annotations, measurements) in standard DICOM formats alongside original images. This is a key architectural distinction — derived data is queryable and downloadable using the same tools as original images.

Derived Modality Summary

| DICOM Modality | Description | Patients | Series | Instances | Size (GB) |
| --- | --- | --- | --- | --- | --- |
| SR | Structured Reports (radiomics, measurements) | 29,065 | 270,687 | 270,687 | 143 |
| SEG | DICOM Segmentation objects | 42,023 | 188,013 | 214,182 | 19,474 |
| ANN | Bulk Simple Annotations (pathology) | 5,431 | 7,108 | 7,108 | 1,824 |
| RTSTRUCT | RT Structure Sets | 1,747 | 4,938 | 4,938 | 10 |
| M3D | 3D model objects | 842 | 2,328 | 2,328 | 0.3 |
| PR, KO, RWV, REG | Presentation states, key objects, etc. | varies | 1,833 | 1,932 | ~0.1 |
| Total derived | | | 475,431 series | | 21,455 GB |

23 Analysis Result Collections

Top collections by coverage:

| Analysis Result | Description | Subjects | Source Collections | Modalities |
| --- | --- | --- | --- | --- |
| TotalSegmentator v1.5.6 | AI segmentation of 104 anatomical structures in CT | 26,194 | nlst | SEG, SR |
| TCGA-SBU-TIL-Maps | Tumor-infiltrating lymphocyte maps | 7,600 | 23 TCGA collections | SEG |
| Pan-Cancer-Nuclei-Seg | Nuclei segmentation in H&E pathology | 5,185 | 14 TCGA collections | ANN, SEG |
| BAMF-AIMI-Annotations | Multi-organ AI segmentations | 4,226 | 22 collections | SEG |
| nnU-Net-BPR-annotations | Body part regression for CT | 985 | nlst, nsclc_radiomics | SEG, SR |
| DICOM-LIDC-IDRI-Nodules | Standardized lung nodule annotations | 875 | lidc_idri | SEG, SR |

Segmentation Detail (188,013 series)

Top segmented collections:

| Collection | Modality | Source Series Segmented | Seg Series |
| --- | --- | --- | --- |
| nlst | CT | 126,077 | 128,830 |
| ispy2 | MR | 2,688 | 2,688 |
| acrin_6698 | MR | 2,213 | 2,213 |
| upenn_gbm | MR | 2,164 | 2,384 |
| ispy1 | MR | 1,992 | 2,568 |
| tcga_brca | SM (pathology) | 1,130 | 4,224 |

Top segmentation algorithms:

| Algorithm | Type | Seg Series | Source Series |
| --- | --- | --- | --- |
| TotalSegmentator v1.5.6 | AUTOMATIC | 126,051 | 126,051 |
| Stony Brook TIL Inception-V4 2022 | AUTOMATIC | 15,868 | 7,934 |
| Pan-Cancer-Nuclei-Seg | AUTOMATIC | 6,074 | 6,064 |
| BAMF-Brain-MR | AUTOMATIC | 2,164 | 2,164 |
| 3d_fullres-tta_nnU-Net | AUTOMATIC | 1,453 | 1,453 |
| BAMF-Prostate-MR | AUTOMATIC | 1,164 | 1,161 |
| BAMF-Lung-CT-v2 | AUTOMATIC | 1,158 | 1,158 |

Segments per segmentation: Most TotalSegmentator SEGs contain 73-80 segments (anatomical structures per scan); single-structure segmentations (44,603 series) are common for tumor/lesion annotations.

Pathology Annotations (7,108 ANN series)

| Graphic Type | Groups | Total Annotations |
| --- | --- | --- |
| POLYGON | 6,075 | 6.17 billion |
| RECTANGLE | 9,452 | 111,181 |

All polygon annotations were generated by the Pan-Cancer-Nuclei-Seg algorithm: they are nuclei boundary polygons across 14 TCGA pathology collections.

Structured Reports (270,687 SR series)

Top SR collections (radiomics features, measurements):

| Collection | Patients | SR Series |
| --- | --- | --- |
| nlst | 26,205 | 257,391 |
| lidc_idri | 875 | 6,859 |
| nsclc_radiomics | 414 | 2,419 |
| lung_pet_ct_dx | 354 | 1,091 |
| ispy1 | 221 | 845 |

The NLST SRs contain radiomics features (~20 per segmented structure: volume, surface area, flatness, CT intensity statistics) for nearly 10 million structures from 125,000 CT images.
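
As a rough consistency check on these rounded figures, the arithmetic below derives the implied number of structures per CT image and the order of magnitude of stored feature values. The inputs are the approximations quoted in the paragraph above, so the results are estimates only.

```python
structures = 10_000_000      # "nearly 10 million structures"
ct_images = 125_000          # "125,000 CT images"
features_per_structure = 20  # "~20 per segmented structure"

structures_per_image = structures / ct_images
feature_values = structures * features_per_structure

print(structures_per_image)  # 80.0
print(feature_values)        # 200000000
```

About 80 structures per image sits at the upper end of the 73-80 segments reported per TotalSegmentator SEG, so the figures are mutually consistent; the total comes to roughly 200 million individual feature values.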

MIDRC Image-Derived Data

MIDRC's annotation approach differs fundamentally from IDC's:

| Aspect | IDC | MIDRC |
| --- | --- | --- |
| Annotation format | DICOM SEG, SR, ANN, RTSTRUCT (standardized, machine-readable) | DICOM + NIfTI, JSON, TXT, CSV, XLSX (multiple formats); portal reports 492K DCM, 122K JSON, 12K TXT, 2.5K NIfTI files |
| Scale | 475,431 derived series, 6.17 billion nuclei annotations | 59,492 annotations (portal-verified) + 58K SEG + 1K SR series (BIH) + 203,374 measurements + 12,124 radiology reports |
| Annotation type | Volumetric segmentations, radiomics features, nuclei polygons, TIL maps | SIFT auto-segmentations (57,419), mRALE severity scores (2,072 expert), diagnostic labels, COVID severity |
| Discovery | SQL-queryable via seg_index, ann_index, ann_group_index | Guppy GraphQL on annotation data type (no auth); queryable fields include annotation_method (Retrospective_auto, Retrospective_expert) and annotation_name (SIFT, midrc_mRALE_Mastermind_Challenge) |
| AI-generated at scale | Yes (TotalSegmentator, BAMF-AIMI, Nuclei-Seg) | Yes: SIFT algorithm produced 57,419 auto annotations; 30+ tools also published externally (GitHub/HuggingFace) |
| Algorithms catalogued | Yes (AlgorithmName, AlgorithmType indexed) | Partially: annotation_name and annotation_method are queryable via Guppy; not as granular as IDC's per-segment metadata |
| Additional data types | | 203,374 measurements and 12,124 radiology reports are distinct queryable data types in MIDRC (not available in IDC) |
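
Counts per annotation_name or annotation_method can be retrieved with a Guppy aggregation query. The sketch below builds the request body and flattens the reply; it assumes the standard Guppy histogram syntax (`_aggregation { <type> { <field> { histogram { key count } } } }`). Both helper names are hypothetical, and the sample response reuses the counts quoted in this report rather than a fresh API result.

```python
def annotation_histogram_payload(field: str) -> dict:
    """Build a Guppy aggregation request counting annotations per value of `field`."""
    if not field.isidentifier():
        raise ValueError(f"not a valid field name: {field!r}")
    query = (
        "{ _aggregation { annotation { "
        + field
        + " { histogram { key count } } } } }"
    )
    return {"query": query}


def histogram_to_counts(response_json: dict, field: str) -> dict:
    """Flatten the histogram buckets of a Guppy response into {value: count}."""
    buckets = response_json["data"]["_aggregation"]["annotation"][field]["histogram"]
    return {bucket["key"]: bucket["count"] for bucket in buckets}


payload = annotation_histogram_payload("annotation_name")

# Response shaped like a Guppy histogram reply; the counts are the figures
# quoted in this report, not a fresh API result.
sample_response = {"data": {"_aggregation": {"annotation": {"annotation_name": {
    "histogram": [
        {"key": "SIFT", "count": 57419},
        {"key": "midrc_mRALE_Mastermind_Challenge", "count": 2072},
    ]
}}}}}
print(histogram_to_counts(sample_response, "annotation_name"))
```

Posting `payload` as JSON to data.midrc.org/guppy/graphql (no authentication) would return the live counts; the field-name check guards against accidentally injecting arbitrary text into the query string.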

Programmatic Access: Finding and Downloading Annotations

IDC: Find segmentations by algorithm and type

The seg_index provides per-series attributes: AlgorithmName, AlgorithmType (AUTOMATIC, MANUAL, SEMIAUTOMATIC), SegmentationType (BINARY, FRACTIONAL), total_segments, and segmented_SeriesInstanceUID (link to source image).

from idc_index import IDCClient
client = IDCClient()
client.fetch_index("seg_index")

# Find TotalSegmentator automatic segmentations in NLST
nlst_segs = client.sql_query("""
    SELECT
        s.SeriesInstanceUID as seg_series,
        s.segmented_SeriesInstanceUID as source_series,
        s.AlgorithmName,
        s.AlgorithmType,
        s.SegmentationType,
        s.total_segments
    FROM seg_index s
    JOIN index i ON s.segmented_SeriesInstanceUID = i.SeriesInstanceUID
    WHERE i.collection_id = 'nlst'
      AND s.AlgorithmName = 'TotalSegmentator v1.5.6'
      AND s.AlgorithmType = 'AUTOMATIC'
    LIMIT 5
""")
# total_segments is reported per SEG series; TotalSegmentator SEGs typically hold 73-80 of the 104 structures

# Download segmentation + its source CT together
source_uid = nlst_segs.iloc[0]['source_series']
seg_uid = nlst_segs.iloc[0]['seg_series']
client.download_from_selection(
    seriesInstanceUID=[source_uid, seg_uid],
    downloadDir="./nlst_with_seg"
)

IDC: Find pathology annotations by property type

The ann_group_index provides rich DICOM-coded attributes: AnnotationGroupLabel, AnnotationPropertyCategory_CodeMeaning, AnnotationPropertyType_CodeMeaning, GraphicType (POLYGON, RECTANGLE), NumberOfAnnotations, AlgorithmName, and AnnotationGroupGenerationType (AUTOMATIC, MANUAL).

client.fetch_index("ann_index")
client.fetch_index("ann_group_index")

# Find nuclei polygon annotations specifically (not all ANN are nuclei!)
nuclei_ann = client.sql_query("""
    SELECT
        g.SeriesInstanceUID as ann_series,
        a.referenced_SeriesInstanceUID as source_slide,
        g.AnnotationGroupLabel,
        g.AnnotationPropertyType_CodeMeaning,
        g.GraphicType,
        g.NumberOfAnnotations,
        g.AlgorithmName
    FROM ann_group_index g
    JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID
    JOIN index i ON a.referenced_SeriesInstanceUID = i.SeriesInstanceUID
    WHERE g.AnnotationPropertyType_CodeMeaning = 'Nucleus'
      AND g.AnnotationGroupGenerationType = 'AUTOMATIC'
      AND i.collection_id = 'tcga_brca'
    LIMIT 5
""")

# Summarize manually created annotation groups for anatomical structures (e.g. blood cells in bone marrow pathology)
blood_cells = client.sql_query("""
    SELECT
        g.AnnotationGroupLabel,
        g.AnnotationPropertyType_CodeMeaning,
        g.AnnotationGroupGenerationType,
        COUNT(DISTINCT g.SeriesInstanceUID) as ann_series,
        SUM(g.NumberOfAnnotations) as total
    FROM ann_group_index g
    WHERE g.AnnotationGroupGenerationType = 'MANUAL'
      AND g.AnnotationPropertyCategory_CodeMeaning = 'Anatomical structure'
    GROUP BY g.AnnotationGroupLabel, g.AnnotationPropertyType_CodeMeaning,
             g.AnnotationGroupGenerationType
    ORDER BY ann_series DESC
    LIMIT 10
""")

IDC: Find measurement Structured Reports

SR series contain measurements and radiomics features, but their content varies — use SeriesDescription and analysis_result_id to identify what they contain. Individual radiomics features within SRs are not searchable via idc-index; use BigQuery for feature-level queries.

# Find TotalSegmentator radiomics SRs (shape and firstorder features)
ts_radiomics = client.sql_query("""
    SELECT
        SeriesInstanceUID,
        PatientID,
        SeriesDescription,
        analysis_result_id
    FROM index
    WHERE Modality = 'SR'
      AND analysis_result_id = 'TotalSegmentator-CT-Segmentations'
      AND SeriesDescription LIKE '%shape%'
    LIMIT 5
""")
# These contain shape radiomics (volume, surface area, flatness, etc.)

# Find non-radiomics SRs: lesion bounding boxes, clinical reports
other_sr = client.sql_query("""
    SELECT SeriesDescription, analysis_result_id, COUNT(*) AS count
    FROM index
    WHERE Modality = 'SR'
      AND analysis_result_id NOT IN ('TotalSegmentator-CT-Segmentations')
    GROUP BY SeriesDescription, analysis_result_id
    ORDER BY count DESC
    LIMIT 10
""")
# Returns: BPR annotations, breast imaging reports, tumor bounding boxes, etc.

# For feature-level radiomics queries, BigQuery is required:
# SELECT * FROM `bigquery-public-data.idc_current.measurement_groups`
# WHERE finding_category = 'Radiomics' AND ...

MIDRC: Find and download annotated data

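MIDRC annotations are linked to imaging studies through the Gen3 graph model. With the Gen3 SDK, reaching them means navigating Gen3 nodes (case → imaging_study → annotation file) rather than querying per-segment attributes directly; the portal's public Guppy index (Appendix D) exposes only annotation_method and annotation_name.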

from gen3.auth import Gen3Auth
from gen3.query import Gen3Query

auth = Gen3Auth(endpoint="https://data.midrc.org", refresh_file="credentials.json")
query = Gen3Query(auth)

# Query imaging studies — annotations are linked as separate file nodes
response = query.query(
    data_type='imaging_study',
    first=100,
    fields=["case_ids", "study_uid", "study_modality", "object_id",
            "study_description", "body_part_examined"],
    filters={
        "body_part_examined": "CHEST",
        "study_modality": "CR"
    }
)

# Annotation files are separate nodes in Gen3 graph
# Navigate: case → imaging_study → annotation_file
# No equivalent to IDC's seg_index/ann_group_index for querying
# annotation attributes (algorithm, segment count, property type) directly
# Download via DRS:
# gen3 drs-pull object "dg.XXTS/<object_id>"

Key difference: In IDC, segmentations and annotations are first-class DICOM objects with dedicated queryable indices exposing algorithm names, segment counts, annotation property types, and graphic types. In MIDRC, annotations are queryable via Guppy GraphQL as a dedicated data type with annotation_method and annotation_name fields — less granular than IDC's per-segment metadata, but publicly accessible without authentication. MIDRC also offers unique data types (measurements, radiology reports) not available in IDC.
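
The Guppy route mentioned above can be sketched in Python using only the standard library. The endpoint and the annotation_method / annotation_name fields are taken from Appendix D; the helper names build_annotation_query and post_graphql are hypothetical, and the network helper is defined but not called here, so this is a starting point rather than a tested client.

```python
import json
import urllib.request

# Public MIDRC portal Guppy endpoint cited in this report (Appendix D).
MIDRC_GUPPY_URL = "https://data.midrc.org/guppy/graphql"


def build_annotation_query(fields=("annotation_method", "annotation_name")):
    """Build a Guppy aggregation payload with one histogram per annotation field."""
    histograms = " ".join(
        f"{name} {{ histogram {{ key count }} }}" for name in fields
    )
    return {"query": f"{{ _aggregation {{ annotation {{ {histograms} }} }} }}"}


def post_graphql(url, payload, timeout=30):
    """POST a GraphQL payload as JSON and return the decoded response (network call)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Building the payload is offline; sending it would be:
#   post_graphql(MIDRC_GUPPY_URL, payload)
payload = build_annotation_query()
print(payload["query"])
```

The generated query string is the same one used with curl in Appendix D, Query 4, so the two can be cross-checked against each other.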


Appendix B: IDC Live Data Breakdown (idc-index v0.11.9, IDC v23)

Modality Distribution

| Modality | Patients | Series | Instances | Size (GB) |
| --- | --- | --- | --- | --- |
| SEG (Segmentation) | 42,023 | 188,013 | 214,182 | 19,474 |
| CT | 35,172 | 252,008 | 29,163,879 | 15,501 |
| SR (Structured Report) | 29,065 | 270,687 | 270,687 | 143 |
| SM (Slide Microscopy) | 20,344 | 71,132 | 380,054 | 47,224 |
| MG (Mammography) | 17,026 | 48,125 | 268,365 | 5,304 |
| MR | 8,062 | 124,072 | 15,189,444 | 4,837 |
| ANN (Annotations) | 5,431 | 7,108 | 7,108 | 1,824 |
| RTSTRUCT | 1,747 | 4,938 | 4,938 | 10 |
| CR (Computed Radiography) | 1,705 | 12,416 | 12,427 | 186 |
| US (Ultrasound) | 1,411 | 2,240 | 5,128 | 380 |
| PT (PET) | 1,143 | 4,065 | 1,338,343 | 74 |
| Other (M3D, DX, OT, RT*, PR, NM, REG, etc.) | varies | ~8,900 | ~28,000 | ~375 |

Top Body Parts

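| Body Part | Patients | Series |
| --- | --- | --- |
| CHEST | 30,745 | 356,133 |
| BREAST | 20,919 | 106,757 |
| ABDOMEN | 1,841 | 10,601 |
| LUNG | 1,442 | 10,971 |
| PROSTATE | 1,035 | 23,008 |
| COLON | 850 | 3,544 |
| PELVIS | 808 | 5,713 |
| LIVER | 527 | 3,989 |
| BRAIN | 494 | 2,044 |
| KIDNEY | 359 | 3,466 |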

License Distribution

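| License | Collections | Patients | Series | Size (GB) |
| --- | --- | --- | --- | --- |
| CC BY 4.0 | 102 | 55,097 | 817,461 | 68,228 |
| CC BY 3.0 | 92 | 27,797 | 146,372 | 24,708 |
| CC BY-NC 4.0 | 5 | 6,570 | 28,783 | 2,039 |
| CC BY-NC 3.0 | 3 | 533 | 1,418 | 46 |
| NLM Terms | 1 | 2 | 39 | 313 |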

Note: Licenses are assigned per DICOM object (series), not per collection. A single collection may contain objects under different licenses (e.g., original images under CC BY 3.0 and contributed analysis results under CC BY 4.0). Always check license_short_name on individual series before use. The per-license collection counts sum to 203, more than the 161 collections in IDC, because a collection is counted once for every license it contains.
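
A per-series license check along the lines of the note above might look like the following sketch. It works on plain dictionaries shaped like idc-index result rows; the sample rows, their UIDs, the helper names, and the choice of which licenses count as acceptable are all assumptions made for illustration.

```python
from collections import defaultdict

# Licenses accepted without further review in this sketch (an assumption;
# confirm against the actual license texts for your use case).
ACCEPTED_LICENSES = {"CC BY 4.0", "CC BY 3.0"}


def split_by_license(series_rows, accepted=ACCEPTED_LICENSES):
    """Partition series rows into (usable, needs_review) by license_short_name."""
    usable, needs_review = [], []
    for row in series_rows:
        target = usable if row.get("license_short_name") in accepted else needs_review
        target.append(row)
    return usable, needs_review


def licenses_per_collection(series_rows):
    """Map each collection_id to the set of licenses found among its series."""
    found = defaultdict(set)
    for row in series_rows:
        found[row["collection_id"]].add(row["license_short_name"])
    return dict(found)


# Hypothetical rows (UIDs and collection names are made up).
rows = [
    {"SeriesInstanceUID": "1.2.3.1", "collection_id": "demo", "license_short_name": "CC BY 3.0"},
    {"SeriesInstanceUID": "1.2.3.2", "collection_id": "demo", "license_short_name": "CC BY 4.0"},
    {"SeriesInstanceUID": "1.2.3.3", "collection_id": "demo_nc", "license_short_name": "CC BY-NC 4.0"},
]

usable, needs_review = split_by_license(rows)
print(len(usable), "usable;", len(needs_review), "need review")
print(licenses_per_collection(rows))
```

Checking at the series level rather than the collection level is the point: the hypothetical "demo" collection above carries two licenses, which is the situation the note warns about.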


Appendix C: MIDRC BDF Imaging Hub API (Live-Tested)

All queries below were tested on 2026-02-12 against the live BIH Guppy GraphQL endpoint. No authentication was required.

Endpoint

POST https://imaging-hub.data-commons.org/guppy/graphql
Content-Type: application/json

The BIH uses Gen3's Guppy service backed by Elasticsearch. Four data types are indexed: subject, imaging_study, imaging_series, and dataset.

Query 1: Total Counts Across All Repositories

curl -s -X POST https://imaging-hub.data-commons.org/guppy/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "{ _aggregation { subject { _totalCount } imaging_study { _totalCount } imaging_series { _totalCount } } }"}'

Result:

{
  "data": {
    "_aggregation": {
      "subject": { "_totalCount": 327895 },
      "imaging_study": { "_totalCount": 722695 },
      "imaging_series": { "_totalCount": 1963865 }
    }
  }
}
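
For scripted use, the same request and response can be handled in Python. The sketch below only builds the payload and parses the response shown above; the function names are hypothetical, and actually sending the payload (for example with urllib.request or requests) is left out so that the block runs offline.

```python
import json


def build_total_count_query(data_types):
    """Build a Guppy payload requesting _totalCount for each data type."""
    body = " ".join(f"{name} {{ _totalCount }}" for name in data_types)
    return {"query": f"{{ _aggregation {{ {body} }} }}"}


def extract_total_counts(response):
    """Flatten a Guppy _aggregation response into {data_type: total_count}."""
    aggregation = response["data"]["_aggregation"]
    return {name: node["_totalCount"] for name, node in aggregation.items()}


# Response copied from the result above.
sample_response = json.loads("""
{
  "data": {
    "_aggregation": {
      "subject": { "_totalCount": 327895 },
      "imaging_study": { "_totalCount": 722695 },
      "imaging_series": { "_totalCount": 1963865 }
    }
  }
}
""")

payload = build_total_count_query(["subject", "imaging_study", "imaging_series"])
print(payload["query"])
print(extract_total_counts(sample_response))
```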

Query 2: Subjects Per Repository

curl -s -X POST https://imaging-hub.data-commons.org/guppy/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "{ _aggregation { subject { commons_name { histogram { key count } } } } }"}'

Result:

| Repository | Subjects |
| --- | --- |
| MIDRC | 76,193 |
| IDC | 69,223 |
| Stanford AIMI | 65,514 |
| OpenNeuro | 49,010 |
| NIHCC | 35,232 |
| TCIA | 28,434 |
| ACRdart | 4,289 |

Query 3: Series Per Repository

curl -s -X POST https://imaging-hub.data-commons.org/guppy/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "{ _aggregation { imaging_series { commons_name { histogram { key count } } } } }"}'

Result:

| Repository | Series |
| --- | --- |
| IDC | 946,957 |
| MIDRC | 469,324 |
| Stanford AIMI | 224,496 |
| TCIA | 201,337 |
| ACRdart | 107,150 |
| NIHCC | 14,601 |

Note: IDC leads in series count despite fewer subjects due to its multi-modality, multi-series-per-study collections (e.g., MRI protocols with T1, T2, DWI, ADC series per exam).
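
The series-per-subject ratio behind this note can be computed directly from the Query 2 and Query 3 results. This is a rough sketch: it assumes the subject and series indices cover the same population for each repository, which the BIH results do not guarantee (OpenNeuro, for instance, reports subjects but no series).

```python
# Counts copied from the Query 2 (subjects) and Query 3 (series) results.
subjects = {
    "MIDRC": 76193, "IDC": 69223, "Stanford AIMI": 65514, "OpenNeuro": 49010,
    "NIHCC": 35232, "TCIA": 28434, "ACRdart": 4289,
}
series = {
    "IDC": 946957, "MIDRC": 469324, "Stanford AIMI": 224496,
    "TCIA": 201337, "ACRdart": 107150, "NIHCC": 14601,
}


def series_per_subject(series_counts, subject_counts):
    """Return series/subject ratios for repositories present in both inputs."""
    return {
        repo: series_counts[repo] / subject_counts[repo]
        for repo in series_counts
        if subject_counts.get(repo)
    }


ratios = series_per_subject(series, subjects)
for repo, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{repo}: {ratio:.1f} series per subject")
```

With these counts IDC comes out at roughly 13.7 series per subject against roughly 6.2 for MIDRC, which is consistent with the explanation in the note. The ratio is only indicative, since it mixes two separately indexed entity types.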

Query 4: Filtered Query — IDC CT Series with Metadata

curl -s -X POST https://imaging-hub.data-commons.org/guppy/graphql \
  -H "Content-Type: application/json" \
  -d '{
    "query": "{ imaging_series(first: 3, filter: {AND: [{IN: {commons_name: [\"IDC\"]}}, {IN: {Modality: [\"CT\"]}}]}) { SeriesInstanceUID Modality BodyPartExamined commons_name collection_id } }"
  }'

Result: Returns individual series with SeriesInstanceUID, Modality, BodyPartExamined, commons_name, and collection_id — sufficient to then retrieve the actual data from the source repository (e.g., via idc-index for IDC series).
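
The escaped quotes in the curl command are easy to get wrong. One way to avoid them is to build the filter as a Python dictionary and pass it as a GraphQL variable. The sketch below assumes the endpoint accepts a `$filter: JSON` variable, as described in the Gen3 Guppy documentation; that form was not part of the live testing reported here, and the helper names are hypothetical.

```python
import json


def in_filter(field, values):
    """Guppy-style IN clause: the field must match one of the given values."""
    return {"IN": {field: list(values)}}


def build_series_query(filters, fields, first=3):
    """Build an imaging_series payload with the filter passed as a variable."""
    selection = " ".join(fields)
    query = (
        "query ($filter: JSON) { "
        f"imaging_series(first: {int(first)}, filter: $filter) {{ {selection} }} "
        "}"
    )
    return {"query": query, "variables": {"filter": {"AND": list(filters)}}}


payload = build_series_query(
    filters=[in_filter("commons_name", ["IDC"]), in_filter("Modality", ["CT"])],
    fields=["SeriesInstanceUID", "Modality", "BodyPartExamined",
            "commons_name", "collection_id"],
)
print(json.dumps(payload, indent=2))
```

The filter object has the same AND/IN structure as the inline filter in the curl command, so either form should select the same series if the variable form is accepted.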

Query 5: Modality Distribution Across All Repositories

curl -s -X POST https://imaging-hub.data-commons.org/guppy/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "{ _aggregation { imaging_series { Modality { histogram { key count } } } } }"}'

Returns series counts per modality (CT, MR, CR, DX, SM, etc.) aggregated across all 7 federated repositories.

Query 6: MIDRC-Specific Data Breakdown

The BIH API can be filtered by commons_name to extract verified statistics for any single repository. This was used to independently verify MIDRC data scale.

MIDRC totals: 76,193 subjects, 189,854 studies, 469,324 series

MIDRC modalities (series):

| Modality | Series |
| --- | --- |
| CT | 170,150 |
| CR (Computed Radiography) | 126,858 |
| DX (Digital X-ray) | 106,539 |
| SEG (Segmentation) | 58,000 |
| MR | 5,894 |
| SR (Structured Report) | 1,074 |
| PT, RF, NM, US, XA, MG | <300 each |

MIDRC collections (series):

| Collection | Series |
| --- | --- |
| MIDRC-Open-R1 | 209,142 |
| MIDRC-Open-A1 | 202,202 |
| MIDRC-TCIA-COVID-19-NY-SBU | 24,482 |
| MIDRC-Open-A1_SCCM_VIRUS | 13,906 |
| MIDRC-Open-A1_PETAL_BLUECORAL | 7,948 |
| MIDRC-Open-A1_PETAL_REDCORAL | 7,913 |
| MIDRC-TCIA-RICORD_1c | 2,048 |
| + 4 smaller collections | <700 each |

MIDRC licenses (series):

| License | Series | % |
| --- | --- | --- |
| MIDRC DUA | 441,111 | 94.0% |
| CC BY 4.0 | 25,816 | 5.5% |
| CC BY-NC 4.0 | 2,397 | 0.5% |

MIDRC race distribution (subjects):

| Race | Subjects |
| --- | --- |
| White | 37,990 |
| Black or African American | 21,393 |
| Not Reported | 5,667 |
| No data | 5,146 |
| Asian | 3,410 |
| Other | 2,256 |
| American Indian or Alaska Native | 176 |
| Native Hawaiian or other Pacific Islander | 155 |

MIDRC top study descriptions (series):

| Study Description | Series |
| --- | --- |
| XR Chest AP or PA | 215,531 |
| CHEST PORT 1 VIEW | 13,385 |
| CHEST AP PORT | 13,289 |
| CT CHEST WITH CONTRAST | 9,287 |
| CTA CHEST (PE STUDY) | 8,893 |
| CT CHEST PULMONARY EMBOLISM | 8,473 |
| CT CHEST WO CONTRAST | 8,359 |
| CT CHEST W CONTRAST | 8,204 |

Key Observations from Live Testing

  1. No authentication required for read-only queries: the BIH Guppy endpoint is publicly accessible, unlike authenticated Gen3 workflows on the MIDRC portal (e.g., file downloads), which require registration
  2. 7 repositories federated (not 5 as initially documented): MIDRC, IDC, TCIA, ACRdart, Stanford AIMI, NIHCC, and OpenNeuro
  3. Sub-second response times for aggregation queries; filtered queries with first: N also return quickly
  4. Cross-repository discovery workflow: Query BIH to find series across repositories → use commons_name + SeriesInstanceUID/collection_id to retrieve data from source (e.g., idc-index for IDC, Gen3 DRS for MIDRC)
  5. Metadata only — the BIH indexes structured metadata (identifiers, modality, body part, collection); actual DICOM files remain at source repositories
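
The cross-repository discovery workflow in item 4 can be sketched as a grouping step: take series records returned by BIH and split them by commons_name so each group can be handed to its source repository's own tooling. The records and UIDs below are hypothetical, and the sketch assumes the SeriesInstanceUID values returned by BIH match those held by the source repository.

```python
from collections import defaultdict


def group_by_repository(series_records):
    """Group SeriesInstanceUIDs by the repository (commons_name) that hosts them."""
    grouped = defaultdict(list)
    for record in series_records:
        grouped[record["commons_name"]].append(record["SeriesInstanceUID"])
    return dict(grouped)


# Hypothetical BIH results (UIDs are made up).
records = [
    {"commons_name": "IDC", "SeriesInstanceUID": "1.2.840.1", "Modality": "CT"},
    {"commons_name": "MIDRC", "SeriesInstanceUID": "1.2.840.2", "Modality": "CR"},
    {"commons_name": "IDC", "SeriesInstanceUID": "1.2.840.3", "Modality": "CT"},
]

plan = group_by_repository(records)
idc_uids = plan.get("IDC", [])
print(plan)

# The IDC group could then be passed to idc-index, for example:
#   client.download_from_selection(seriesInstanceUID=idc_uids,
#                                  downloadDir="./from_bih")
# The MIDRC group would instead go through Gen3 DRS, which needs credentials.
```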

Appendix D: MIDRC Portal Guppy API (Live-Tested)

All queries below were tested on 2026-02-12 against the live MIDRC portal Guppy GraphQL endpoint. No authentication was required. This is distinct from the BDF Imaging Hub API (Appendix C) — the portal Guppy indexes MIDRC-specific data at higher granularity.

Endpoints

| Endpoint | Method | Auth | Returns |
| --- | --- | --- | --- |
| POST https://data.midrc.org/guppy/graphql | GraphQL | None | Metadata queries across 6 data types |
| GET https://data.midrc.org/guppy/_status | REST | None | Index schema (data types, fields, array fields) |
| GET https://data.midrc.org/index/_stats | REST | None | Total file count and data size |
| GET https://data.midrc.org/mds/metadata?limit=N | REST | None | Object GUIDs and metadata |

Six Queryable Data Types

The MIDRC portal indexes 6 data types via Guppy (vs 4 in BIH):

| Data Type | Total Count | Key Fields |
| --- | --- | --- |
| case | 84,768 | sex, race, ethnicity, age_at_index, covid19_positive, zip, conditions, medications, procedures, measurements, imaging_studies (array fields) |
| imaging_study | 202,222 | case_ids, study_uid, study_modality, body_part_examined, days_to_study, loinc_code, loinc_contrast, loinc_long_common_name, sex, race, age_at_index, covid19_positive |
| data_file | 631,786 | data_format (DCM/JSON/TXT/NII), data_type, data_category, instance_uids, contrast_bolus_agent, convolution_kernel, diffusion_b_value, echo_time, pixel_spacing, slice_thickness, study_year |
| measurement | 203,374 | case_ids, test_days_from_index, test_name, test_result_text, dataset_submitter_id |
| annotation | 59,492 | case_ids, instance_uids, annotation_method, annotation_name |
| radiology_report | 12,124 | body_part_examined, study_modality, sex, race, age_at_index, covid19_positive, ethnicity |

Query 1: Total Counts for All Data Types

curl -s -X POST https://data.midrc.org/guppy/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "{ _aggregation { case { _totalCount } imaging_study { _totalCount } data_file { _totalCount } measurement { _totalCount } annotation { _totalCount } radiology_report { _totalCount } } }"}'

Result:

{
  "data": {
    "_aggregation": {
      "case": {"_totalCount": 84768},
      "imaging_study": {"_totalCount": 202222},
      "data_file": {"_totalCount": 631786},
      "measurement": {"_totalCount": 203374},
      "annotation": {"_totalCount": 59492},
      "radiology_report": {"_totalCount": 12124}
    }
  }
}

Query 2: Demographics Breakdown

curl -s -X POST https://data.midrc.org/guppy/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "{ _aggregation { case { sex { histogram { key count } } race { histogram { key count } } ethnicity { histogram { key count } } covid19_positive { histogram { key count } } } } }"}'

Result:

| Field | Value | Count |
| --- | --- | --- |
| sex | Female | 39,914 |
| | Male | 38,500 |
| | no data | 6,330 |
| | Not reported | 24 |
| race | White | 41,466 |
| | Black or African American | 22,467 |
| | no data | 8,484 |
| | Not Reported | 5,655 |
| | Asian | 3,571 |
| | Other | 2,267 |
| | American Indian or Alaska Native | 179 |
| | Native Hawaiian or other Pacific Islander | 156 |
| ethnicity | Not Hispanic or Latino | 62,678 |
| | Hispanic or Latino | 8,004 |
| | no data | 8,396 |
| | Not reported | 5,670 |
| | Unknown | 20 |
| covid19_positive | No | 43,473 |
| | Yes | 33,842 |
| | no data | 7,453 |
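
Histogram results like these are easier to compare as percentages. The sketch below converts a list of {key, count} buckets into shares, using the covid19_positive counts from the result above; it assumes the buckets have already been extracted from the response (expected under data._aggregation.case.covid19_positive.histogram), and the function name is hypothetical.

```python
def histogram_shares(buckets):
    """Convert Guppy histogram buckets into {key: percentage of total}."""
    total = sum(bucket["count"] for bucket in buckets)
    if total == 0:
        return {bucket["key"]: 0.0 for bucket in buckets}
    return {bucket["key"]: 100.0 * bucket["count"] / total for bucket in buckets}


# covid19_positive buckets copied from the result above.
covid_histogram = [
    {"key": "No", "count": 43473},
    {"key": "Yes", "count": 33842},
    {"key": "no data", "count": 7453},
]

shares = histogram_shares(covid_histogram)
for key, percentage in shares.items():
    print(f"{key}: {percentage:.1f}%")
```

On these counts about 39.9% of cases are flagged COVID-19 positive, and the three buckets add up to the 84,768 cases reported in Query 1.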

Query 3: Study Modality and Data File Format Distribution

curl -s -X POST https://data.midrc.org/guppy/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "{ _aggregation { imaging_study { study_modality { histogram { key count } } } data_file { data_format { histogram { key count } } data_type { histogram { key count } } } } }"}'

Result:

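| Dimension | Value | Count |
| --- | --- | --- |
| study_modality | DX | 94,587 |
| | CR | 79,646 |
| | CT | 25,615 |
| | MR | 2,479 |
| data_format | DCM | 492,134 |
| | JSON | 122,091 |
| | TXT | 12,124 |
| | NII | 2,510 |
| | CSV | 1,888 |
| | XLSX | 1,039 |
| data_type | DICOM | 433,060 |
| | External Annotation | 177,070 |
| | Radiology Report | 12,124 |
| | Internal Annotation | 8,676 |
| | Clinical Supplement | 856 |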

Query 4: Annotation Details

curl -s -X POST https://data.midrc.org/guppy/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "{ _aggregation { annotation { annotation_method { histogram { key count } } annotation_name { histogram { key count } } } } }"}'

Result:

| Field | Value | Count |
| --- | --- | --- |
| annotation_method | Retrospective_auto | 57,420 |
| | Retrospective_expert | 2,072 |
| annotation_name | SIFT | 57,419 |
| | midrc_mRALE_Mastermind_Challenge | 2,072 |
| | no data | 1 |

Query 5: Dataset/Collection Breakdown (Top 10)

curl -s -X POST https://data.midrc.org/guppy/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "{ _aggregation { case { dataset { histogram { key count } } } } }"}'

Result (top 10 of 78 datasets):

| Dataset | Cases |
| --- | --- |
| RSNA_20230830 | 21,465 |
| RSNA_20230725 | 13,605 |
| RSNA_20231012 | 10,456 |
| RSNA_20221011 | 8,279 |
| RSNA_20230420 | 5,576 |
| ACR_20230530 | 5,026 |
| RSNA_20220412 | 3,972 |
| ACR_20230823 | 3,395 |
| ACR_20240226 | 2,539 |
| RSNA_20240315 | 2,369 |

Query 6: Total Data Size (Indexd)

curl -s https://data.midrc.org/index/_stats

Result:

{"fileCount": 637431, "totalFileSize": 12360285440149}

Total: 637,431 files, 12.36 TB
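
The TB figure is a decimal conversion of totalFileSize. The short sketch below shows the arithmetic and also gives the binary (TiB) value and the mean file size for comparison; the function name is hypothetical and the input is the response copied from above.

```python
# Response copied from the Indexd result above.
stats = {"fileCount": 637431, "totalFileSize": 12360285440149}


def summarize_indexd_stats(stats):
    """Derive decimal TB, binary TiB, and mean file size (MB) from Indexd stats."""
    size_bytes = stats["totalFileSize"]
    return {
        "terabytes": size_bytes / 10**12,      # decimal TB, as quoted in the text
        "tebibytes": size_bytes / 2**40,       # binary TiB, as many tools report
        "mean_file_megabytes": size_bytes / stats["fileCount"] / 10**6,
    }


summary = summarize_indexd_stats(stats)
print(f"{summary['terabytes']:.2f} TB")
print(f"{summary['tebibytes']:.2f} TiB")
print(f"{summary['mean_file_megabytes']:.1f} MB per file on average")
```

The distinction matters when comparing against storage tools that report binary units: the same byte count reads as 12.36 TB but about 11.24 TiB.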

Key Observations: Portal vs BIH Discrepancy

| Metric | MIDRC Portal | BIH (MIDRC filter) | Difference |
| --- | --- | --- | --- |
| Cases/Subjects | 84,768 | 76,193 | +11.3% |
| Studies | 202,222 | 189,854 | +6.5% |
| Files vs Series | 631,786 files | 469,324 series | Different units |

Likely explanations:

  1. BIH indexes at series level, not file level: one series may contain multiple files (DICOM instances), and annotation files (JSON, NIfTI, TXT) count as files but are not DICOM series
  2. BIH may lag behind the portal — federated index may not be synchronized in real-time
  3. Different data types indexed — the portal indexes 6 data types including measurements and radiology reports; BIH indexes 4 (subject, imaging_study, imaging_series, dataset)
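
The percentage differences for cases and studies are computed relative to the BIH counts. A minimal sketch of that arithmetic, using the counts reported in this appendix and a hypothetical helper name:

```python
def percent_more(portal_count, bih_count):
    """Percentage by which the portal count exceeds the BIH count."""
    return 100.0 * (portal_count - bih_count) / bih_count


# Counts from the MIDRC portal (Appendix D) and BIH with the MIDRC filter (Appendix C).
portal = {"cases": 84768, "studies": 202222}
bih = {"cases": 76193, "studies": 189854}

differences = {metric: percent_more(portal[metric], bih[metric]) for metric in portal}
for metric, value in differences.items():
    print(f"{metric}: portal is {value:+.1f}% relative to BIH")
```

Files and series are left out deliberately: they are different units, so a percentage difference between them would not be meaningful.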

MIDRC Portal Guppy vs IDC idc-index: Comparison of Public Queryability

| Capability | IDC (idc-index) | MIDRC (Portal Guppy) |
| --- | --- | --- |
| Auth required | No | No |
| Query interface | SQL (local DuckDB) | GraphQL (remote Elasticsearch) |
| Data types | 1 primary + 9 specialized indices | 6 data types |
| Latency | Milliseconds (local) | Seconds (network) |
| Offline | Yes (Parquet cached locally) | No |
| Schema discovery | client.indices_overview | GET /guppy/_status |
| Filter + paginate | SQL LIMIT/OFFSET/WHERE | GraphQL first/offset/filter |
| Aggregations | SQL GROUP BY, COUNT, SUM | GraphQL _aggregation with histogram |
| Instance-level metadata | Yes (4000+ DICOM tags via BigQuery) | Partial (select fields on data_file type: pixel_spacing, slice_thickness, echo_time, etc.) |
| Annotation metadata | Rich (algorithm name, type, segment count, property codes) | Basic (annotation_method, annotation_name) |
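
The "Filter + paginate" row can be illustrated with a small pagination loop over Guppy's first/offset arguments. This is a sketch under assumptions: fetch_page stands in for whatever function sends the request and returns one page of records, both helper names are hypothetical, and any server-side cap on offset is not handled here.

```python
def build_page_query(data_type, fields, first, offset):
    """Build a Guppy payload requesting one page of records."""
    selection = " ".join(fields)
    query = f"{{ {data_type}(first: {int(first)}, offset: {int(offset)}) {{ {selection} }} }}"
    return {"query": query}


def paginate(fetch_page, page_size=100, max_records=None):
    """Yield records from fetch_page(first, offset) until a short page is returned."""
    offset = 0
    while True:
        first = page_size
        if max_records is not None:
            remaining = max_records - offset
            if remaining <= 0:
                return
            first = min(page_size, remaining)
        page = fetch_page(first, offset)
        yield from page
        if len(page) < first:
            return  # short page: nothing left on the server
        offset += len(page)


# Offline stand-in for the server: 250 fake records sliced by first/offset.
fake_records = [{"study_uid": f"uid-{i}"} for i in range(250)]


def fake_fetch(first, offset):
    return fake_records[offset:offset + first]


print(build_page_query("imaging_study", ["study_uid", "study_modality"], 100, 0)["query"])
print(len(list(paginate(fake_fetch, page_size=100))))
```

In real use, fetch_page would post build_page_query(...) to the Guppy endpoint and return the list under data.imaging_study. The idc-index side needs no such loop, since SQL LIMIT/OFFSET runs against the local index.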

Limitations of This Analysis

Verification depth

| Data Source | Verification Method | Depth |
| --- | --- | --- |
| IDC | idc-index v0.11.9 programmatic queries | Exact counts (patients, series, instances, sizes, modalities, body parts, licenses, segmentation algorithms, annotation types) |
| MIDRC (portal Guppy) | data.midrc.org/guppy/graphql + data.midrc.org/index/_stats | Exact counts (cases, studies, files, data volume in TB, demographics, COVID status, modalities, annotations, measurements, radiology reports, datasets) |
| MIDRC (via BIH) | imaging-hub.data-commons.org/guppy/graphql | Exact counts (subjects, studies, series, modalities, body parts, collections, licenses, race demographics, study descriptions) |

Both platforms were verified programmatically using publicly accessible APIs that require no authentication. The MIDRC portal Guppy and Indexd APIs resolved several items previously listed as "not publicly stated" — including data volume (12.36 TB), file count (637,431), annotation methods, and demographic breakdowns.

Things not verified or not possible to verify

  1. MIDRC data volume in TB. RESOLVED: 12.36 TB via data.midrc.org/index/_stats
  2. MIDRC instance count. RESOLVED: 637,431 files via Indexd (492K DCM + 122K JSON + 12K TXT + 2.5K NIfTI + others); note this counts files, not DICOM instances per se
  3. MIDRC annotation-level metadata. RESOLVED: Portal Guppy exposes annotation_method (Retrospective_auto, Retrospective_expert) and annotation_name (SIFT, midrc_mRALE_Mastermind_Challenge) as queryable fields. Less granular than IDC's per-segment metadata but searchable.
  4. MIDRC sequestered data characteristics — the 80/20 open/sequestered split is documented, but both portal and BIH likely index only the public portion. The actual composition of the sequestered portion could not be independently verified
  5. Portal vs BIH discrepancy — the MIDRC portal shows more data than BIH (84,768 vs 76,193 cases; 202,222 vs 189,854 studies). Likely explanations: (a) BIH federation lag, (b) BIH indexes fewer data types (4 vs 6), (c) different entity definitions (BIH "subject" vs portal "case"). MIDRC self-reports ">300,000 studies collected" which likely includes sequestered data and pre-release collections
  6. Cross-commons query performance — CDA (IDC/CRDC) query capabilities were described from documentation, not tested programmatically. Both BIH and MIDRC portal Guppy APIs were tested live (see Appendices C and D) and confirmed to work without authentication
  7. MIDRC Gen3 SDK code examples — the Gen3 SDK download examples were assembled from documentation but not executed (requires registration). However, the Guppy GraphQL metadata query examples were executed live and verified
  8. Radiomics feature-level content — IDC's SR series were described by SeriesDescription and analysis_result_id, but individual radiomics features within SRs are only searchable via BigQuery (not idc-index), and this was not demonstrated
  9. MIDRC tool maturity — 30+ algorithms are listed on midrc.org, but their integration depth with the data portal was not verified. SIFT appears to be the primary integrated algorithm (57,419 annotations indexed in portal)
  10. Data freshness — IDC version v23 was current at time of analysis. MIDRC portal and BIH data may differ from each other and from the latest additions
  11. MIDRC data_file field granularity — the portal Guppy data_file type exposes DICOM-level fields (pixel_spacing, slice_thickness, echo_time, contrast_bolus_agent, diffusion_b_value) but the actual queryable values and their coverage across files were not tested

Potential biases

Tool-induced bias

This analysis was generated using Claude Code with an imaging-data-commons skill loaded into context. This skill provided a detailed, authoritative reference for IDC — its data model, SQL patterns, index table schemas, API capabilities, and exact column names. No equivalent skill existed for MIDRC. This created an asymmetry — not by restricting MIDRC research, but by making IDC exploration substantially easier and more directed.

How the skill created asymmetry:

  1. Structured guidance vs. open-ended search. The IDC skill gave the LLM a roadmap of what to explore (e.g., seg_index, ann_group_index, clinical_index, contrast metadata). For MIDRC, there was no equivalent guide — so the analysis of MIDRC's derived data, clinical tables, and internal metadata structure is shallower, not necessarily because MIDRC has less capability, but because there was less guidance on what to look for.

  2. Programmatic verification. The skill enabled live idc-index queries producing exact, authoritative numbers. For MIDRC, programmatic verification initially ran only through the BIH Guppy API (aggregate statistics); the portal's own public Guppy and Indexd endpoints were discovered late in the session, and authenticated Gen3 workflows (which require registration) were never exercised. IDC's numbers are therefore verified at finer granularity (instance counts, per-algorithm segmentation counts, per-annotation-group property types) than MIDRC's.

  3. Iterative deepening. Each round of user feedback ("add more detail on annotations", "fix the annotation queries") drove deeper into IDC's capabilities because that's where programmatic tools were available. MIDRC didn't receive equivalent iterative deepening until the BIH API was tested late in the process.

Important caveat: The LLM was not prevented or discouraged from researching MIDRC more thoroughly. Web search, web fetch, and API testing tools were available throughout the session and were used for MIDRC research. The asymmetry was one of pull rather than push — the IDC skill actively directed exploration toward specific metadata tables and query patterns, while MIDRC research required the LLM to independently discover what to look for.

This was confirmed by the late-session discovery of MIDRC's public Guppy API. When explicitly prompted to "search more aggressively for MIDRC information," the LLM discovered that data.midrc.org/guppy/graphql is publicly accessible without authentication — resolving multiple "not publicly stated" gaps in a single round of API testing. The LLM could have made this discovery much earlier in the process; the fact that it required user prompting demonstrates the behavioral nature of the bias. The BIH API discovery also came from user prompting, not LLM initiative. The skill made IDC exploration easier, but nothing made MIDRC exploration harder — the LLM simply didn't try as hard without the structured guidance the IDC skill provided.

Mitigating factors

  • The MIDRC portal Guppy API discovery substantially corrected the asymmetry by providing independently verified MIDRC numbers for cases (84,768), studies (202,222), files (637,431), data volume (12.36 TB), 6 data types, demographics, annotation methods, and dataset breakdowns
  • The BIH API added verified series-level statistics (469,324 series, modality/body part/collection/license breakdowns) for cross-repository context
  • The Indexd API confirmed total data size (12.36 TB) — previously listed as "not publicly stated"
  • Web research covered MIDRC's unique strengths (bias/fairness tools, sequestered evaluation, PPRL, federated search, measurements, radiology reports) that IDC does not offer
  • This limitations section explicitly documents the asymmetry and its behavioral nature

What would reduce the bias further

  • A comparable MIDRC/Gen3 skill providing structured guidance for MIDRC exploration (e.g., documenting Guppy data types, field schemas, query patterns — analogous to the IDC skill's index table guide)
  • Authenticated Gen3 portal access for testing download workflows and exploring data not exposed via public Guppy (e.g., individual file downloads, workspace integration)
  • Testing the Cancer Data Aggregator (CDA) API for IDC's cross-commons capabilities, paralleling the BIH/portal API testing done for MIDRC
  • Exploring MIDRC's data_file Guppy fields (pixel_spacing, slice_thickness, echo_time, contrast_bolus_agent, etc.) to assess DICOM-level metadata queryability — potentially comparable to IDC's BigQuery
  • More proactive LLM exploration of MIDRC resources earlier in the analysis process, without waiting for user prompting

Key Sources

Publications

  • Fedorov et al., RadioGraphics 2023 - IDC transparency & reproducibility
  • Drukker et al., Scientific Data 2025 - MIDRC interoperability use cases
  • Bergquist et al., J Imaging Informatics Med 2025 - MIDRC + N3C COVID severity prediction
  • NCI CRDC Core Standards and Services - Cancer Research 2024

Portals & Documentation

APIs Tested Live (Feb 2026, No Auth Required)

  • IDC: idc-index v0.11.9 Python package (local Parquet queries)
  • MIDRC Portal Guppy: POST https://data.midrc.org/guppy/graphql — 6 data types, full metadata queries
  • MIDRC Portal Indexd: GET https://data.midrc.org/index/_stats — file count and data volume
  • MIDRC Portal Guppy Status: GET https://data.midrc.org/guppy/_status — schema discovery
  • BDF Imaging Hub Guppy: POST https://imaging-hub.data-commons.org/guppy/graphql — 7-repository federated search

Tools
