
Comparative Analysis: IDC vs MIDRC

Generated on 2026-02-12 using Claude Code (Claude Opus 4.6). The conversation was prompted by Andrey Fedorov (Brigham and Women's Hospital / Harvard Medical School), who is a co-PI of IDC and is also supported in part by the MIDRC project. His dual involvement means the prompts likely reflected more domain knowledge about both platforms than an average user would have, but also that the framing may have been influenced by deeper familiarity with IDC's architecture and capabilities.

IDC statistics were verified programmatically using the imaging-data-commons Claude skill, which provides direct access to idc-index (v0.11.9, IDC data v23). MIDRC statistics were verified via live Guppy GraphQL API queries against both the MIDRC portal (data.midrc.org/guppy/graphql) and the BDF Imaging Hub (imaging-hub.data-commons.org/guppy/graphql), both of which are publicly accessible without authentication. MIDRC data volume was additionally verified via the Indexd API (data.midrc.org/index/_stats).

Executive Summary

The NCI Imaging Data Commons (IDC) and the NIBIB/ARPA-H Medical Imaging and Data Resource Center (MIDRC) are two major federally funded open imaging data platforms in the United States. Both launched in 2020, both use DICOM as their primary data standard, and both are built on open-source infrastructure — but they serve different communities, solve different problems, and make different architectural choices.

IDC is a cancer-focused repository hosting 79,569 patients, 994,073 series, and 95.33 TB across 161 collections spanning 26 imaging modalities (CT, MRI, digital pathology, mammography, PET, and more). All data is harmonized to DICOM and served from public cloud storage (AWS + GCS) with zero authentication required — for both metadata queries and file downloads. IDC's strength is its depth of derived data: 188,013 segmentation series, 6.17 billion pathology annotations, and radiomics features for nearly 10 million structures, all queryable via SQL. IDC operates as a node within the NCI Cancer Research Data Commons (CRDC), enabling cross-commons queries spanning imaging, genomics, and proteomics.

MIDRC is a COVID-19/respiratory disease platform that has expanded to chronic diseases, hosting 84,768 cases, 202,222 studies, and 12.36 TB across 78 dataset batches from 27 U.S. hospitals. While narrower in modality (primarily chest X-ray and CT), MIDRC offers unique capabilities: 203,374 clinical measurements, 12,124 radiology reports, and 59,492 annotations as dedicated queryable data types — plus a 20% sequestered dataset for unbiased AI evaluation. MIDRC's defining strengths are its bias/fairness tooling (MELODY, REACT, AI Reliability Tool), its privacy-preserving record linkage with EHR data (N3C), and the BDF Imaging Hub — a federated search spanning 7 imaging repositories including IDC itself.

Licensing differs significantly. IDC assigns licenses per DICOM object (not per collection), with about 97% of series under CC-BY (4.0 or 3.0), permitting commercial use with attribution. MIDRC uses a custom Data Use Agreement (DUA) covering 94% of its series, with commercial use requiring a separate agreement through the University of Chicago. IDC also makes 100% of its data publicly accessible, while MIDRC sequesters ~20% for unbiased algorithm evaluation.

Key finding for programmatic users: Both platforms offer public metadata APIs requiring no authentication. IDC provides idc-index (local SQL over Parquet, millisecond latency). MIDRC provides Guppy GraphQL at data.midrc.org/guppy/graphql (remote Elasticsearch, seconds latency, 6 data types). The public accessibility of MIDRC's Guppy API was not well documented and was discovered during this analysis through API testing; it substantially narrows the programmatic access gap between the platforms for metadata exploration.

The platforms are complementary, not competing. Use IDC for cancer imaging, large-scale cloud analytics, digital pathology, and reproducible versioned pipelines. Use MIDRC for COVID-19/respiratory imaging, AI fairness evaluation, sequestered benchmarking, multi-modal clinical integration (imaging + EHR + genomics via cross-commons linkage), and federated cross-repository search. MIDRC's BDF Imaging Hub federates IDC, making both accessible through a single search interface.

Overview

Dimension	IDC (Imaging Data Commons)	MIDRC (Medical Imaging & Data Resource Center)
Funder	NCI (Cancer Moonshot)	NIBIB + ARPA-H
Host / Operator	Brigham & Women's / Harvard (Fedorov & Kikinis)	University of Chicago (Giger); co-led by ACR, RSNA, AAPM
Ecosystem	Node within NCI Cancer Research Data Commons (CRDC)	Independent commons; ARPA-H Biomedical Data Fabric performer; NAIRR pilot
Launch	2020	2020 (COVID-19 response)

1. Mission & Disease Focus

	IDC	MIDRC
Core mission	Cloud-based repository for publicly available cancer imaging data, co-located with analysis tools	AI-ready data commons for machine learning innovation, initially COVID-19, now expanding to chronic diseases
Disease scope	Cancer only (all organ sites)	COVID-19 + expanding: cancer, diabetes, chronic liver disease, coronary artery disease, COPD, emphysema
Strategic emphasis	Transparency, reproducibility, scalability of imaging AI	Bias mitigation, fairness, real-world AI testing, interoperability

2. Data Scale & Modalities

IDC numbers verified live via idc-index v0.11.9 (IDC data v23, Feb 2026). MIDRC numbers verified via MIDRC portal Guppy GraphQL API + Indexd API (Feb 2026). BDF Imaging Hub (BIH) numbers are used where portal data is unavailable.

	IDC (v23)	MIDRC (portal-verified)
Collections / Datasets	161 curated collections	78 dataset batches (RSNA + ACR submissions; largest: RSNA_20230830 with 21,465 cases)
Patients / Cases	79,569 patients	84,768 cases (portal); 76,193 subjects (BIH — discrepancy likely due to BIH not indexing all data types)
Studies	160,199 studies	202,222 imaging studies (portal); 189,854 (BIH)
Series	994,073 series	469,324 series (BIH — portal does not expose series-level counts)
Instances (files)	46,885,909 DICOM instances	637,431 files (Indexd; includes 492K DCM, 122K JSON, 12K TXT, 2.5K NIfTI)
Data volume	95.33 TB	12.36 TB (Indexd: `data.midrc.org/index/_stats`)
Imaging modalities	26 modalities: CT (35K patients), SM/pathology (20K), MG/mammography (17K), MR (8K), PET (1.1K), CR, DX, US, and 18 others	Portal study_modality: DX (94.6K studies), CR (79.6K), CT (25.6K), MR (2.5K). BIH series: CT (170K), CR (127K), DX (107K), SEG (58K), MR (5.9K), SR (1K), + 6 others
Anatomical breadth	20+ body parts: chest (30.7K), breast (20.9K), abdomen (1.8K), lung (1.4K), prostate (1K), colon (850), pelvis, liver, brain, kidney, etc.	Predominantly thoracic: CHEST (221K series), PORT CHEST (60K), ABDOMEN (12K), HEAD (2.7K), HEART (1.4K); ~60% chest, ~34% unlabeled
Image-derived data	188,013 segmentation series covering 154,233 source series; 23 analysis result collections; 7,108 annotation series	59,492 annotations (portal): SIFT auto-annotations (57,419), mRALE Mastermind Challenge expert annotations (2,072). 58K SEG + 1K SR series (BIH)
Additional data types	—	203,374 measurements, 12,124 radiology reports (portal — unique MIDRC data types not found in IDC)
Clinical data tables	97 collections with clinical data across 224 tables (TCGA, CPTAC, HTAN, NLST, ACRIN, RICORD, etc.)	Demographics (portal: Female 39.9K, Male 38.5K; White 41.5K, Black/AA 22.5K, Asian 3.6K; Not Hispanic 62.7K, Hispanic 8K), conditions, medications, vitals, labs, COVID status (positive: 33.8K, negative: 43.5K)
Data sources	TCGA, TCIA, CPTAC, CCDI, HTAN, LIDC, QIN, NLST, VHP	27 U.S. subsites (academic + community hospitals); 78 RSNA + ACR dataset batches

3. Data Access & Licensing

	IDC	MIDRC
Registration required	No - fully open, anonymous access	Partial - metadata queries via Guppy GraphQL work without authentication (see Appendix D); data download requires free registration
Authentication	None for data download; GCP auth only for BigQuery	None for metadata (Guppy GraphQL at `data.midrc.org/guppy/graphql` is publicly accessible); Gen3 auth (OpenID Connect / OAuth 2.0) required for file downloads
Licensing	Per-object (not per-collection): each DICOM series tagged with its own license. Analysis results contributed to a collection may carry a different license than the original images. Overall: about 97% CC-BY (CC BY 4.0: 817K series; CC BY 3.0: 146K series); 3% CC-BY-NC (30K series)	MIDRC DUA (441K series, 94%); CC BY 4.0 (26K series, 5.5%); CC BY-NC 4.0 (2.4K series, 0.5%). Commercial use requires a separate agreement via UChicago
Egress fees	Zero - free from both AWS and GCS	Free download from Gen3 portal
Sequestered data	None - 100% public	20% sequestered for unbiased algorithm evaluation / regulatory approval
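
The percentages in the Licensing row can be re-derived from the series counts quoted there. The sketch below is only an arithmetic check: the counts are the rounded values from the table, and using the sum of the listed counts as the denominator is an assumption (the table does not say whether any series fall outside these licenses).

```python
# Series counts as quoted in the Licensing row (rounded to thousands there).
idc_series = {"CC BY 4.0": 817_000, "CC BY 3.0": 146_000, "CC BY-NC": 30_000}
midrc_series = {"MIDRC DUA": 441_000, "CC BY 4.0": 26_000, "CC BY-NC 4.0": 2_400}


def license_shares(counts):
    """Return each license's share of the listed series, in percent (1 decimal)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


idc_shares = license_shares(idc_series)
midrc_shares = license_shares(midrc_series)

# Combined share of the two attribution-only CC BY versions in IDC.
idc_cc_by = round(100 * (817_000 + 146_000) / sum(idc_series.values()), 1)

print(idc_shares)
print(midrc_shares)
print(idc_cc_by)
```

With these rounded inputs, roughly 97% of IDC series are CC BY and roughly 94% of MIDRC series fall under the MIDRC DUA, in line with the table.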

4. Technical Architecture

	IDC	MIDRC
Platform	Google Cloud Platform + AWS (dual-cloud)	Gen3 Data Ecosystem (open-source, UChicago data center)
Data format	All data harmonized to DICOM	DICOM + multiple delivery formats (DICOM SR, SEG, JSON, NIfTI, CSV, XLSX)
Storage	3 public cloud buckets (GCS + S3 mirrors)	Gen3 object storage
Metadata query	BigQuery (4000+ DICOM tags, SQL); idc-index Python package (~50 columns, no auth); Parquet on S3 (DuckDB)	Guppy GraphQL API at `data.midrc.org/guppy/graphql` (no auth, 6 data types: case, imaging_study, data_file, measurement, annotation, radiology_report); Gen3 portal explorer; GA4GH DRS API; graph database for clinical/phenotype data
Image retrieval	DICOMweb (IDC proxy, no auth); S3/GCS direct download; Google Healthcare API	Gen3 download client; TCIA integration
Programmatic access	`idc-index` Python package; BigQuery SQL; DICOMweb REST	Guppy GraphQL (no auth for metadata); Gen3 SDK; GA4GH DRS; Indexd API (`/index/_stats`); Gen3 mesh services
Visualization	OHIF v3 (radiology), SliM (pathology), VolView (3D), 3D Slicer extension	Gen3 portal; limited built-in visualization
Compute integration	Google Colab, Vertex AI, NIH Cloud Lab, ACCESS HPC	Gen3 workspaces
Data versioning	Versioned releases (v1-v23+), CRDC UUIDs persist across versions	Incremental data additions; no formal versioning scheme documented
Federated search	No (centralized)	Yes - MIDRC BDF Imaging Hub federates 7 repositories: IDC, MIDRC, TCIA, ACRDart, Stanford AIMI, NIHCC, OpenNeuro (327,895 subjects, 1.96M series total)

Interoperability: Standards and Connected Systems

Both platforms invest in interoperability, but with different philosophies: IDC prioritizes DICOM standards + CRDC ecosystem integration, while MIDRC prioritizes federated access + cross-commons record linkage.

Standards Comparison

Standard	IDC	MIDRC
DICOM	Core philosophy — all data harmonized to DICOM	Native storage format
DICOMweb (WADO-RS, QIDO-RS, STOW-RS)	Full support via Google Healthcare API + IDC proxy (no auth)	WADO-RS support
GA4GH DRS	Yes — CRDC UUIDs resolve as DRS IDs (`dg.4DFC/<uuid>`) via CRDC centralized service	Yes — GUIDs resolve to file locations across Gen3 repositories
GA4GH Passports	Via CRDC/DCF infrastructure	Core auth mechanism for cross-commons access
OpenID Connect / OAuth 2.0	GCP auth only (for BigQuery); data access is unauthenticated	Required for file downloads; metadata queries via Guppy GraphQL need no authentication
FHIR (HL7)	Not directly (NCPI alignment in progress)	Supported for clinical data interoperability
PPRL (Privacy-Preserving Record Linkage)	Not implemented	Gen3 Crosswalk Service for secure patient matching across systems
Cloud-native	BigQuery SQL; Parquet on S3; dual-cloud (AWS+GCS) mirroring	Gen3 object storage; single cloud
CRDC-H (Harmonized Data Model)	Yes — enables cross-commons queries via Cancer Data Aggregator	No (not a CRDC node)

IDC: Connected Systems and Mechanisms

IDC connects through the NCI CRDC ecosystem and NCPI program:

System	Linkage Mechanism	What It Provides
GDC (Genomic Data Commons)	CRDC shared identifiers; Cancer Data Aggregator	Genomic sequencing and variant data
PDC (Proteomic Data Commons)	CRDC shared identifiers; Cancer Data Aggregator	Proteomic analysis data
ICDC (Canine Data Commons)	CRDC shared identifiers	Canine clinical trial data
CTDC (Clinical & Translational Data Commons)	CRDC shared identifiers	Clinical trial data
TCIA	Data ingestion — IDC mirrors all public TCIA collections automatically	Archival repository; IDC adds cloud-native access
NCPI platforms (AnVIL, BioData Catalyst, dbGaP, Kids First)	GA4GH DRS; shared auth standards	Cross-NIH platform interoperability
PACS systems	DICOMweb (QIDO-RS, WADO-RS)	Direct query from clinical imaging systems

Cancer Data Aggregator (CDA): Unified search across all 6 CRDC nodes — users can build "virtual cohorts" spanning imaging (IDC), genomics (GDC), proteomics (PDC), and clinical data using disease name, anatomical location, demographics, or data type. Uses the CRDC-H harmonized data model for cross-node element mapping.

CRDC Data Commons Framework (DCF): Gen3-based infrastructure minting persistent GUIDs for all 52 million FAIR objects (4.9 PB) across CRDC. DRS IDs resolve to access methods for both GCS and AWS, decoupling logical identifiers from physical storage.

Dual-cloud mirroring: IDC maintains identical copies on AWS S3 and GCS (migrated 40M DICOM objects / 63 TB via AWS DataSync in <41 hours). Parquet metadata exports in idc-open-metadata S3 bucket enable tool-agnostic access via DuckDB, Spark, etc.

MIDRC: Connected Systems and Mechanisms

MIDRC connects through the Gen3 mesh and ARPA-H BDF Imaging Hub:

System	Linkage Mechanism	What It Provides	Published Use Case
N3C (National COVID Cohort Collaborative)	PPRL via honest broker	EHR data (diagnoses, labs, vitals, meds, procedures)	COVID-19 severity prediction from chest X-rays + EHR (Bergquist et al., J Imaging Informatics Med 2025)
BioData Catalyst (BDC)	Gen3 mesh / Crosswalk	Genomic + clinical data from NHLBI studies	Multimodal cohort construction (Chen/Whitney et al., Scientific Data 2025)
IDC	BDF Imaging Hub federated search	Cancer imaging (95 TB, 161 collections)	Cross-repository discovery via BIH
TCIA	Direct collection hosting + BIH	MIDRC-RICORD-1A/1B/1C collections served via TCIA	RICORD datasets published through both
ACRDart	BDF Imaging Hub affiliate	Radiology imaging data (ACR registry)	Federated search
Stanford AIMI	BDF Imaging Hub affiliate	Stanford medical imaging datasets	Federated search
All of Us	Gen3 mesh capability	Large-scale population health data	Supported (in development)

Gen3 "Narrow Middle" architecture: Standardized core services (OpenAPI/RESTful) sit between diverse data ingest/curation and analysis/processing applications. FAIR APIs auto-generated from data models. The Crosswalk Service passes privacy-preserving MIDRC IDs to match patients across repositories without exposing PHI.
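
To make the idea of matching patients without exposing PHI concrete, here is a deliberately simplified sketch of token-based record linkage using a keyed hash (HMAC-SHA256). This is a generic illustration, not the protocol used by MIDRC, N3C, or the Gen3 Crosswalk Service; the function name, the shared secret, the choice of identifiers, and all records are hypothetical.

```python
import hashlib
import hmac


def linkage_token(secret, first, last, dob):
    """Keyed hash of normalized identifiers; the raw values never leave the site."""
    normalized = "|".join(part.strip().lower() for part in (first, last, dob))
    return hmac.new(secret, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# Illustration only: real deployments manage keys and matching rules very differently.
secret = b"shared-secret-for-illustration-only"

# Each site tokenizes its own records locally (all values are fabricated).
site_a = {linkage_token(secret, "Ada", "Lovelace", "1815-12-10"): "imaging-id-001"}
site_b = {linkage_token(secret, " ada ", "LOVELACE", "1815-12-10"): "ehr-id-942"}

# The honest broker sees only tokens and local IDs, and pairs IDs with equal tokens.
crosswalk = {site_a[t]: site_b[t] for t in site_a.keys() & site_b.keys()}
print(crosswalk)
```

The broker ends up with a mapping between local identifiers without ever receiving names or dates of birth, which is the property the crosswalk approach relies on.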

MIDRC BDF Imaging Hub (ARPA-H funded):

Federated hub providing unified search across 7 repositories: MIDRC, IDC, TCIA, ACRDart, Stanford AIMI, NIHCC, and OpenNeuro
Does not centralize/duplicate imaging files — aggregates structured metadata and provides identifiers for researchers to access from original nodes
Features cohort builder over aggregated structured metadata
Part of ARPA-H Biomedical Data Fabric Toolbox (September 2023)
API: Guppy GraphQL endpoint at https://imaging-hub.data-commons.org/guppy/graphql — works without authentication for read-only queries
Live-verified scale (Feb 2026): 327,895 subjects, 722,695 studies, 1,963,865 series across all 7 repositories

Repository	Subjects	Series
MIDRC	76,193	469,324
IDC	69,223	946,957
Stanford AIMI	65,514	224,496
OpenNeuro	49,010	—
NIHCC	35,232	14,601
TCIA	28,434	201,337
ACRDart	4,289	107,150
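
As a quick consistency check, the per-repository rows can be summed and compared with the hub-wide totals quoted above. The numbers are copied from the table; OpenNeuro has no series count listed, so it contributes only to the subject total.

```python
# Per-repository (subjects, series) counts from the table; None = no series count listed.
bih = {
    "MIDRC": (76_193, 469_324),
    "IDC": (69_223, 946_957),
    "Stanford AIMI": (65_514, 224_496),
    "OpenNeuro": (49_010, None),
    "NIHCC": (35_232, 14_601),
    "TCIA": (28_434, 201_337),
    "ACRDart": (4_289, 107_150),
}

total_subjects = sum(subjects for subjects, _ in bih.values())
total_series = sum(series for _, series in bih.values() if series is not None)

print(total_subjects, total_series)
```

Both sums agree with the live-verified totals of 327,895 subjects and 1,963,865 series.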

See Appendix C for live-tested API query examples.

Key Architectural Difference

	IDC	MIDRC
Interop model	Hub-in-ecosystem — centralized data with CRDC cross-commons query	Federated mesh — distributed data with cross-repository search
Cross-commons search	Cancer Data Aggregator queries all 6 CRDC nodes	BDF Imaging Hub federates 7 imaging repositories (verified via live API)
Patient matching	Shared CRDC identifiers (same patient across GDC/PDC/IDC)	Privacy-preserving record linkage (PPRL) for patient matching across separate systems
Data movement	Data ingested and harmonized centrally (DICOM)	Data stays at source; identifiers routed through mesh
NIH ecosystem	NCPI (AnVIL, BDC, dbGaP, Kids First)	ARPA-H BDF + Gen3 mesh (N3C, BDC, All of Us)

5. AI / ML Support

	IDC	MIDRC
AI-generated annotations	Yes - large-scale (e.g., NLST: 10M structures from 125K CT images with radiomics)	Yes - 59,492 annotations (portal-verified): SIFT auto-annotations (57,419, `Retrospective_auto`), mRALE Mastermind Challenge expert annotations (2,072, `Retrospective_expert`)
Bias / fairness tools	Not a primary focus	Core strength: MIDRC-MELODY (subgroup performance), MIDRC-REACT (dataset representativeness), AI Reliability Tool (30 bias sources across 5 pipeline stages), MetricTree (metric selection)
Sequestered evaluation	No	Yes - 20% held-out data for unbiased benchmarking and regulatory use
Model hosting	No	30+ algorithms on GitHub, HuggingFace, PhysioNet
Real-world testing	No	TDP 2: Real-time connection with healthcare facilities for dynamic algorithm testing
NLP tools	No	RadGraph (entity/relation extraction from radiology reports)

6. Clinical Data & Multi-modal Integration

	IDC	MIDRC
Clinical data	Cancer staging, treatment history, outcomes (varies by collection); structured via eCRFs	Portal-verified demographics: sex (Female 39.9K, Male 38.5K), race (White 41.5K, Black/AA 22.5K, Asian 3.6K), ethnicity (Not Hispanic 62.7K, Hispanic 8K), COVID status (negative 43.5K, positive 33.8K). Additional: conditions, medications, vitals, labs, procedures. 203,374 measurements and 12,124 radiology reports as dedicated queryable data types
Cross-commons linkage	CRDC (GDC genomics, PDC proteomics, ICDC canine) via shared identifiers	N3C (EHR), All of Us, BioData Catalyst (genomics); demonstrated COVID severity prediction use case
Clinical data standards	DICOM metadata; BigQuery-queryable	LOINC-harmonized; graph database

7. Data Standards & De-identification

	IDC	MIDRC
Primary standard	DICOM (universal harmonization - all data converted to DICOM)	DICOM (native storage) + multiple export formats
Identifiers	CRDC UUIDs (instance, series, study) + standard DICOM UIDs	Gen3 object IDs + DICOM UIDs
De-identification	Two-stage process; NEMA/RSNA CTP profiles	Stanford De-identifier (text); RSNA DICOM Anonymizer (images); DICOM Harmonization Tool
Metadata richness	4000+ DICOM tags queryable in BigQuery	Curated subset via Gen3 data model

8. Governance & Community

	IDC	MIDRC
Governance model	NCI-funded contract (Leidos + BWH); advisory boards	NIBIB contract; tri-society leadership (ACR, RSNA, AAPM); 5 TDPs + 12 CRPs
Open-source	idc-index, viewers, tutorials (GitHub)	Gen3 platform (Apache License); all AI tools open-source with publications
User base	Broad cancer research community	>450 registered data users
Community engagement	Tutorials, Colab notebooks, 3D Slicer extension	27 subsites; equity/diversity initiatives; NAIRR pilot

9. Programmatic Access Comparison

Setup & Authentication

IDC - No setup beyond pip install idc-index. No authentication for queries or downloads:

from idc_index import IDCClient
client = IDCClient()  # Ready immediately

MIDRC - Metadata queries require no authentication (Guppy GraphQL is public). Data downloads require registration + Gen3 SDK:

import requests

# NO AUTH needed for metadata queries via Guppy GraphQL
response = requests.post(
    "https://data.midrc.org/guppy/graphql",
    json={"query": '{ _aggregation { case { _totalCount } imaging_study { _totalCount } } }'}
)
print(response.json())  # Returns: case 84,768; imaging_study 202,222

# For data DOWNLOADS, Gen3 auth is required:
# pip install gen3
from gen3.auth import Gen3Auth
# credentials.json downloaded from https://data.midrc.org/identity (valid 30 days)
auth = Gen3Auth(endpoint="https://data.midrc.org", refresh_file="credentials.json")
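
Because a GraphQL response mirrors the shape of the query, the counts can be pulled out of the `_aggregation` result with a few lines. The helper name `total_counts` is hypothetical, and the sample payload is a hand-written stand-in for `response.json()` built from the counts quoted above, so the sketch runs without network access.

```python
def total_counts(payload):
    """Map each data type in a Guppy `_aggregation` response to its `_totalCount`."""
    aggregation = payload["data"]["_aggregation"]
    return {data_type: node["_totalCount"] for data_type, node in aggregation.items()}


# Hand-written stand-in for response.json(), using the counts quoted above.
sample = {
    "data": {
        "_aggregation": {
            "case": {"_totalCount": 84768},
            "imaging_study": {"_totalCount": 202222},
        }
    }
}

counts = total_counts(sample)
print(counts)
```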

Querying Metadata

IDC - SQL queries against a local Parquet index (~50 columns, no network call):

results = client.sql_query("""
    SELECT collection_id, PatientID, SeriesInstanceUID, Modality, series_size_MB
    FROM index
    WHERE Modality = 'CT' AND BodyPartExamined = 'CHEST'
    LIMIT 100
""")

MIDRC - GraphQL queries against Guppy (Elasticsearch-backed, no auth for metadata):

import requests

# Direct Guppy GraphQL — no authentication needed
response = requests.post(
    "https://data.midrc.org/guppy/graphql",
    json={"query": """{ imaging_study(first: 100, filter: {AND: [
        {IN: {body_part_examined: ["CHEST"]}},
        {IN: {study_modality: ["CT"]}}
    ]}) { case_ids study_uid study_modality body_part_examined } }"""}
)
studies = response.json()["data"]["imaging_study"]

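# Or via Gen3 SDK (requires auth; `auth` is the Gen3Auth object from Setup & Authentication):
# from gen3.query import Gen3Query
# query = Gen3Query(auth)
# response = query.query(data_type='imaging_study', first=100,
#     fields=["case_ids", "study_uid", "study_modality", "object_id"],
#     filters={"body_part_examined": "CHEST", "study_modality": "CT"})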
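
Hand-writing nested filter strings gets error-prone as conditions multiply. One option is to build the request body from a plain dict, as sketched below. The helper `guppy_payload` is hypothetical, and the sketch assumes Guppy accepts the filter as a `$filter: JSON` GraphQL variable rather than inline; the block only constructs and prints the payload and makes no network call.

```python
import json


def guppy_payload(data_type, fields, filters, first=100):
    """Build a JSON body for a Guppy query; `filters` maps field -> allowed values."""
    clauses = [{"IN": {field: list(values)}} for field, values in filters.items()]
    query = (
        "query ($filter: JSON) { "
        f"{data_type}(first: {first}, filter: $filter) "
        "{ " + " ".join(fields) + " } }"
    )
    return {"query": query, "variables": {"filter": {"AND": clauses}}}


payload = guppy_payload(
    "imaging_study",
    ["case_ids", "study_uid", "study_modality", "body_part_examined"],
    {"body_part_examined": ["CHEST"], "study_modality": ["CT"]},
)
print(json.dumps(payload, indent=2))
```

The resulting dict can be passed as the `json=` argument of `requests.post`, in the same way as the inline query above.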

Downloading Data

IDC - One-liner, downloads from public S3/GCS buckets (no auth):

client.download_from_selection(
    seriesInstanceUID=["1.3.6.1.4..."],
    downloadDir="./data"
)

MIDRC - Two-step: resolve DRS URI, then download via presigned URL:

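# CLI approach
# gen3 --endpoint https://data.midrc.org --auth creds.json drs-pull object "dg.XXTS/<guid>"

# Or Python, step 1: look up the file record for an object_id returned by a metadata query
# (`auth` is the Gen3Auth object from Setup & Authentication; step 2, the presigned URL, is not shown)
import requests
url = f"https://data.midrc.org/index/{object_id}"
# Gen3Auth is a requests auth handler, so it attaches the access token itself
file_info = requests.get(url, auth=auth).json()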

Joining Imaging with Clinical Data

IDC - SQL JOINs across local index tables:

client.fetch_index("clinical_index")
clinical_df = client.get_clinical_table("nlst_prsn")  # loads locally
# Join with imaging data via PatientID or collection_id

MIDRC - Separate queries on Gen3 nodes, merged in pandas:

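import pandas as pd
from gen3.query import Gen3Query
query = Gen3Query(auth)  # auth: Gen3Auth object from Setup & Authentication
imaging = query.query(data_type='imaging_study', fields=["case_ids", "study_uid", "days_to_study"])  # add first=/filters= as needed
measurements = query.query(data_type='measurement', fields=["case_ids", "test_days_from_index"])  # add first=/filters= as needed
cohort = pd.merge(pd.DataFrame(imaging), pd.DataFrame(measurements), on='case_ids')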
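
One practical wrinkle with the merge above: if `case_ids` comes back as a list rather than a single string (an assumption; the field's shape is not stated here), a direct pandas merge on that column fails because lists are not hashable. The sketch below shows the join logic in plain Python on made-up records, handling both shapes. The helper names and all records are hypothetical.

```python
from collections import defaultdict


def as_list(value):
    """Treat a scalar case id and a list of case ids uniformly."""
    return value if isinstance(value, list) else [value]


def join_on_case(studies, measurements):
    """Inner-join study and measurement records that share a case id."""
    by_case = defaultdict(list)
    for m in measurements:
        for case_id in as_list(m["case_ids"]):
            by_case[case_id].append(m)
    rows = []
    for s in studies:
        for case_id in as_list(s["case_ids"]):
            for m in by_case[case_id]:
                rows.append({
                    "case_id": case_id,
                    "study_uid": s["study_uid"],
                    "days_to_study": s["days_to_study"],
                    "test_days_from_index": m["test_days_from_index"],
                })
    return rows


# Made-up records, shaped like the fields requested above.
studies = [
    {"case_ids": ["case-A"], "study_uid": "1.2.3", "days_to_study": 2},
    {"case_ids": ["case-B"], "study_uid": "1.2.4", "days_to_study": 5},
]
measurements = [
    {"case_ids": ["case-A"], "test_days_from_index": 1},
    {"case_ids": ["case-A"], "test_days_from_index": 3},
]

cohort = join_on_case(studies, measurements)
print(cohort)
```

Each output row pairs one study with one measurement for the same case, so a case with several measurements yields several rows, just as an inner merge would.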

Browser Visualization

IDC - Generate viewer URL, opens OHIF/SliM automatically:

import webbrowser

url = client.get_viewer_URL(seriesInstanceUID="1.3.6.1.4...")
webbrowser.open(url)

MIDRC - No equivalent programmatic viewer URL generation; visualization is portal-based.

Side-by-Side Summary

Scenario	IDC	MIDRC
Install	`pip install idc-index`	None for metadata (direct HTTP to Guppy); `pip install gen3` + register for downloads
Auth	None	None for metadata queries; API key (30-day expiry) for downloads
Query language	SQL (DuckDB on local Parquet)	GraphQL (Guppy on remote Elasticsearch)
Query latency	Milliseconds (local)	Seconds (network)
Download method	Direct S3/GCS public URLs	DRS URI resolution → presigned URL (requires auth)
Batch download	`download_from_selection()` with list of UIDs	`gen3 drs-pull manifest manifest.json ./output/`
Clinical data join	SQL JOIN on local tables	Separate GraphQL queries on 6 data types, merge in pandas
Viewer integration	`get_viewer_URL()` → OHIF/SliM	Portal only
Offline capability	Full (index is local Parquet)	None (all queries require network)
Queryable data types	1 primary index + 9 specialized indices	6 data types: case, imaging_study, data_file, measurement, annotation, radiology_report
Tutorials	IDC-Tutorials	MIDRC tutorial_notebooks

10. Relationship Between the Two

IDC and MIDRC are complementary, not competing:

MIDRC's BDF Imaging Hub federates IDC - users can search IDC data through MIDRC's unified interface alongside TCIA, ACRDart, Stanford AIMI, NIHCC, and OpenNeuro (7 repositories total; see Appendix C)
Different funding agencies (NCI vs NIBIB) with different mandates
Different technical philosophies: IDC emphasizes DICOM harmonization + cloud-native analytics; MIDRC emphasizes interoperability + federated access + bias-aware AI
Minimal data overlap: IDC = cancer imaging from NCI programs; MIDRC = COVID-19/respiratory + multi-institutional clinical data
Shared standards: Both use DICOM, both integrate with TCIA

11. Summary: When to Use Which

Use case	Recommended platform
Cancer imaging research (any organ)	IDC
COVID-19 / respiratory disease imaging	MIDRC
Large-scale cloud analytics (BigQuery, SQL)	IDC
AI fairness / bias evaluation	MIDRC
Unbiased model benchmarking (sequestered data)	MIDRC
No-registration, immediate data access	IDC
Multi-modal integration (imaging + EHR + genomics)	MIDRC (via N3C, BioData Catalyst links)
Digital pathology	IDC
Real-world clinical AI testing	MIDRC (TDP 2)
Cross-repository federated search	MIDRC (BDF Imaging Hub, which includes IDC)
Reproducible, versioned research pipelines	IDC (formal versioning + CRDC UUIDs)
Commercial use of data	IDC (about 97% of series CC-BY; check `license_short_name` per object)

Appendix A: Image-Derived Data Deep Dive

IDC Image-Derived Data (verified via idc-index)

IDC stores all derived data (segmentations, annotations, measurements) in standard DICOM formats alongside original images. This is a key architectural distinction — derived data is queryable and downloadable using the same tools as original images.

Derived Modality Summary

DICOM Modality	Description	Patients	Series	Instances	Size (GB)
SR	Structured Reports (radiomics, measurements)	29,065	270,687	270,687	143
SEG	DICOM Segmentation objects	42,023	188,013	214,182	19,474
ANN	Bulk Simple Annotations (pathology)	5,431	7,108	7,108	1,824
RTSTRUCT	RT Structure Sets	1,747	4,938	4,938	10
M3D	3D model objects	842	2,328	2,328	0.3
PR, KO, RWV, REG	Presentation states, key objects, etc.	varies	1,833	1,932	~0.1
Total derived			475,431 series		21,455 GB

23 Analysis Result Collections

Top collections by coverage:

Analysis Result	Description	Subjects	Source Collections	Modalities
TotalSegmentator v1.5.6	AI segmentation of 104 anatomical structures in CT	26,194	nlst	SEG, SR
TCGA-SBU-TIL-Maps	Tumor-infiltrating lymphocyte maps	7,600	23 TCGA collections	SEG
Pan-Cancer-Nuclei-Seg	Nuclei segmentation in H&E pathology	5,185	14 TCGA collections	ANN, SEG
BAMF-AIMI-Annotations	Multi-organ AI segmentations	4,226	22 collections	SEG
nnU-Net-BPR-annotations	Body part regression for CT	985	nlst, nsclc_radiomics	SEG, SR
DICOM-LIDC-IDRI-Nodules	Standardized lung nodule annotations	875	lidc_idri	SEG, SR

Segmentation Detail (188,013 series)

Top segmented collections:

Collection	Modality	Source Series Segmented	Seg Series
nlst	CT	126,077	128,830
ispy2	MR	2,688	2,688
acrin_6698	MR	2,213	2,213
upenn_gbm	MR	2,164	2,384
ispy1	MR	1,992	2,568
tcga_brca	SM (pathology)	1,130	4,224

Top segmentation algorithms:

Algorithm	Type	Seg Series	Source Series
TotalSegmentator v1.5.6	AUTOMATIC	126,051	126,051
Stony Brook TIL Inception-V4 2022	AUTOMATIC	15,868	7,934
Pan-Cancer-Nuclei-Seg	AUTOMATIC	6,074	6,064
BAMF-Brain-MR	AUTOMATIC	2,164	2,164
3d_fullres-tta_nnU-Net	AUTOMATIC	1,453	1,453
BAMF-Prostate-MR	AUTOMATIC	1,164	1,161
BAMF-Lung-CT-v2	AUTOMATIC	1,158	1,158

Segments per segmentation: Most TotalSegmentator SEGs contain 73-80 segments (anatomical structures per scan); single-structure segmentations (44,603 series) are common for tumor/lesion annotations.

Pathology Annotations (7,108 ANN series)

Graphic Type	Groups	Total Annotations
POLYGON	6,075	6.17 billion
RECTANGLE	9,452	111,181

All polygon annotations were generated by the Pan-Cancer-Nuclei-Seg algorithm: nuclei boundary polygons across 14 TCGA pathology collections.

Structured Reports (270,687 SR series)

Top SR collections (radiomics features, measurements):

Collection	Patients	SR Series
nlst	26,205	257,391
lidc_idri	875	6,859
nsclc_radiomics	414	2,419
lung_pet_ct_dx	354	1,091
ispy1	221	845

The NLST SRs contain radiomics features (~20 per segmented structure: volume, surface area, flatness, CT intensity statistics) for nearly 10 million structures from 125,000 CT images.
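
The scale figures in the sentence above can be cross-checked with simple arithmetic. The inputs below are the rounded values quoted in the text ("nearly 10 million", "125,000", "~20"), so the results are order-of-magnitude estimates only.

```python
structures = 10_000_000       # "nearly 10 million" segmented structures
ct_scans = 125_000            # CT images quoted above
features_per_structure = 20   # "~20 per segmented structure"

structures_per_ct = structures / ct_scans
feature_values = structures * features_per_structure

print(structures_per_ct, feature_values)
```

This gives about 80 structures per CT, consistent with the 73-80 segments per TotalSegmentator SEG noted earlier in this appendix, and on the order of 200 million individual feature values.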

MIDRC Image-Derived Data

MIDRC's annotation approach differs fundamentally from IDC's:

Aspect	IDC	MIDRC
Annotation format	DICOM SEG, SR, ANN, RTSTRUCT (standardized, machine-readable)	DICOM + NIfTI, JSON, TXT, CSV, XLSX (multiple formats); portal reports 492K DCM, 122K JSON, 12K TXT, 2.5K NIfTI files
Scale	475,431 derived series, 6.17 billion nuclei annotations	59,492 annotations (portal-verified) + 58K SEG + 1K SR series (BIH) + 203,374 measurements + 12,124 radiology reports
Annotation type	Volumetric segmentations, radiomics features, nuclei polygons, TIL maps	SIFT auto-segmentations (57,419), mRALE severity scores (2,072 expert), diagnostic labels, COVID severity
Discovery	SQL-queryable via seg_index, ann_index, ann_group_index	Guppy GraphQL on `annotation` data type (no auth); queryable fields include `annotation_method` (Retrospective_auto, Retrospective_expert) and `annotation_name` (SIFT, midrc_mRALE_Mastermind_Challenge)
AI-generated at scale	Yes (TotalSegmentator, BAMF-AIMI, Nuclei-Seg)	Yes — SIFT algorithm produced 57,419 auto annotations; 30+ tools also published externally (GitHub/HuggingFace)
Algorithms catalogued	Yes (AlgorithmName, AlgorithmType indexed)	Partially — `annotation_name` and `annotation_method` are queryable via Guppy; not as granular as IDC's per-segment metadata
Additional data types	—	203,374 measurements and 12,124 radiology reports are distinct queryable data types in MIDRC (not available in IDC)

Programmatic Access: Finding and Downloading Annotations

IDC: Find segmentations by algorithm and type

The seg_index provides per-series attributes: AlgorithmName, AlgorithmType (AUTOMATIC, MANUAL, SEMIAUTOMATIC), SegmentationType (BINARY, FRACTIONAL), total_segments, and segmented_SeriesInstanceUID (link to source image).

from idc_index import IDCClient
client = IDCClient()
client.fetch_index("seg_index")

# Find TotalSegmentator automatic segmentations in NLST
nlst_segs = client.sql_query("""
    SELECT
        s.SeriesInstanceUID as seg_series,
        s.segmented_SeriesInstanceUID as source_series,
        s.AlgorithmName,
        s.AlgorithmType,
        s.SegmentationType,
        s.total_segments
    FROM seg_index s
    JOIN index i ON s.segmented_SeriesInstanceUID = i.SeriesInstanceUID
    WHERE i.collection_id = 'nlst'
      AND s.AlgorithmName = 'TotalSegmentator v1.5.6'
      AND s.AlgorithmType = 'AUTOMATIC'
    LIMIT 5
""")
# Each row reports total_segments; most TotalSegmentator SEGs hold 73-80 of the 104 possible structures

# Download segmentation + its source CT together
source_uid = nlst_segs.iloc[0]['source_series']
seg_uid = nlst_segs.iloc[0]['seg_series']
client.download_from_selection(
    seriesInstanceUID=[source_uid, seg_uid],
    downloadDir="./nlst_with_seg"
)

IDC: Find pathology annotations by property type

The ann_group_index provides rich DICOM-coded attributes: AnnotationGroupLabel, AnnotationPropertyCategory_CodeMeaning, AnnotationPropertyType_CodeMeaning, GraphicType (POLYGON, RECTANGLE), NumberOfAnnotations, AlgorithmName, and AnnotationGroupGenerationType (AUTOMATIC, MANUAL).

client.fetch_index("ann_index")
client.fetch_index("ann_group_index")

# Find nuclei polygon annotations specifically (not all ANN are nuclei!)
nuclei_ann = client.sql_query("""
    SELECT
        g.SeriesInstanceUID as ann_series,
        a.referenced_SeriesInstanceUID as source_slide,
        g.AnnotationGroupLabel,
        g.AnnotationPropertyType_CodeMeaning,
        g.GraphicType,
        g.NumberOfAnnotations,
        g.AlgorithmName
    FROM ann_group_index g
    JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID
    JOIN index i ON a.referenced_SeriesInstanceUID = i.SeriesInstanceUID
    WHERE g.AnnotationPropertyType_CodeMeaning = 'Nucleus'
      AND g.AnnotationGroupGenerationType = 'AUTOMATIC'
      AND i.collection_id = 'tcga_brca'
    LIMIT 5
""")

# Summarize manually created annotation groups for anatomical structures (e.g., blood cells in bone marrow pathology)
blood_cells = client.sql_query("""
    SELECT
        g.AnnotationGroupLabel,
        g.AnnotationPropertyType_CodeMeaning,
        g.AnnotationGroupGenerationType,
        COUNT(DISTINCT g.SeriesInstanceUID) as ann_series,
        SUM(g.NumberOfAnnotations) as total
    FROM ann_group_index g
    WHERE g.AnnotationGroupGenerationType = 'MANUAL'
      AND g.AnnotationPropertyCategory_CodeMeaning = 'Anatomical structure'
    GROUP BY g.AnnotationGroupLabel, g.AnnotationPropertyType_CodeMeaning,
             g.AnnotationGroupGenerationType
    ORDER BY ann_series DESC
    LIMIT 10
""")

IDC: Find measurement Structured Reports

SR series contain measurements and radiomics features, but their content varies — use SeriesDescription and analysis_result_id to identify what they contain. Individual radiomics features within SRs are not searchable via idc-index; use BigQuery for feature-level queries.

# Find TotalSegmentator radiomics SRs (shape and firstorder features)
ts_radiomics = client.sql_query("""
    SELECT
        SeriesInstanceUID,
        PatientID,
        SeriesDescription,
        analysis_result_id
    FROM index
    WHERE Modality = 'SR'
      AND analysis_result_id = 'TotalSegmentator-CT-Segmentations'
      AND SeriesDescription LIKE '%shape%'
    LIMIT 5
""")
# These contain shape radiomics (volume, surface area, flatness, etc.)

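# Find non-radiomics SRs: lesion bounding boxes, clinical reports
# NULL is handled explicitly: `NOT IN (...)` would silently drop SR series
# whose analysis_result_id is NULL (series that are not analysis results).
other_sr = client.sql_query("""
    SELECT SeriesDescription, analysis_result_id, COUNT(*) AS series_count
    FROM index
    WHERE Modality = 'SR'
      AND (analysis_result_id IS NULL
           OR analysis_result_id <> 'TotalSegmentator-CT-Segmentations')
    GROUP BY SeriesDescription, analysis_result_id
    ORDER BY series_count DESC
    LIMIT 10
""")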
# Returns: BPR annotations, breast imaging reports, tumor bounding boxes, etc.

# For feature-level radiomics queries, BigQuery is required:
# SELECT * FROM `bigquery-public-data.idc_current.measurement_groups`
# WHERE finding_category = 'Radiomics' AND ...

MIDRC: Find and download annotated data

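MIDRC annotations are linked to imaging studies through the Gen3 graph model. With the Gen3 SDK, reaching annotation files means navigating Gen3 nodes (case → imaging_study → annotation file) rather than querying per-annotation attributes. The portal Guppy API separately exposes an annotation data type with two queryable attributes, annotation_method and annotation_name (see Appendix D).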

from gen3.auth import Gen3Auth
from gen3.query import Gen3Query

auth = Gen3Auth(endpoint="https://data.midrc.org", refresh_file="credentials.json")
query = Gen3Query(auth)

# Query imaging studies — annotations are linked as separate file nodes
response = query.query(
    data_type='imaging_study',
    first=100,
    fields=["case_ids", "study_uid", "study_modality", "object_id",
            "study_description", "body_part_examined"],
    filters={
        "body_part_examined": "CHEST",
        "study_modality": "CR"
    }
)

# Annotation files are separate nodes in the Gen3 graph.
# Navigate: case → imaging_study → annotation_file
# Guppy exposes only annotation_method and annotation_name (Appendix D);
# there is no equivalent to IDC's seg_index/ann_group_index for querying
# algorithm, segment count, or property type directly.
# Download via DRS:
# gen3 drs-pull object "dg.XXTS/<object_id>"

Key difference: In IDC, segmentations and annotations are first-class DICOM objects with dedicated queryable indices exposing algorithm names, segment counts, annotation property types, and graphic types. In MIDRC, annotations are queryable via Guppy GraphQL as a dedicated data type with annotation_method and annotation_name fields — less granular than IDC's per-segment metadata, but publicly accessible without authentication. MIDRC also offers unique data types (measurements, radiology reports) not available in IDC.

Appendix B: IDC Live Data Breakdown (idc-index v0.11.9, IDC v23)

Modality Distribution

Modality	Patients	Series	Instances	Size (GB)
SEG (Segmentation)	42,023	188,013	214,182	19,474
CT	35,172	252,008	29,163,879	15,501
SR (Structured Report)	29,065	270,687	270,687	143
SM (Slide Microscopy)	20,344	71,132	380,054	47,224
MG (Mammography)	17,026	48,125	268,365	5,304
MR	8,062	124,072	15,189,444	4,837
ANN (Annotations)	5,431	7,108	7,108	1,824
RTSTRUCT	1,747	4,938	4,938	10
CR (Computed Radiography)	1,705	12,416	12,427	186
US (Ultrasound)	1,411	2,240	5,128	380
PT (PET)	1,143	4,065	1,338,343	74
Other (M3D, DX, OT, RT*, PR, NM, REG, etc.)	varies	~8,900	~28,000	~375

Top Body Parts

Body Part	Patients	Series
CHEST	30,745	356,133
BREAST	20,919	106,757
ABDOMEN	1,841	10,601
LUNG	1,442	10,971
PROSTATE	1,035	23,008
COLON	850	3,544
PELVIS	808	5,713
LIVER	527	3,989
BRAIN	494	2,044
KIDNEY	359	3,466

License Distribution

License	Collections	Patients	Series	Size (GB)
CC BY 4.0	102	55,097	817,461	68,228
CC BY 3.0	92	27,797	146,372	24,708
CC BY-NC 4.0	5	6,570	28,783	2,039
CC BY-NC 3.0	3	533	1,418	46
NLM Terms	1	2	39	313

Note: Licenses are assigned per DICOM object (series), not per collection. A single collection may contain objects under different licenses (e.g., original images under CC BY 3.0 and contributed analysis results under CC BY 4.0). Always check license_short_name on individual series before use. The per-license collection counts sum to 203, more than the 161 collections in IDC, because a collection that spans several licenses is counted once under each.
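One way to act on the advice above is to group a collection's series by license before downloading anything. The following is a minimal sketch, not part of the original analysis: the helper name `license_breakdown_sql` is hypothetical, and it assumes the `index` table has one row per series with the `collection_id` and `license_short_name` columns used elsewhere in this report.

```python
import re


def license_breakdown_sql(collection_id: str) -> str:
    """Return SQL that counts series per license within one collection.

    The collection_id is validated because it is interpolated into the
    SQL text (hypothetical helper; column names follow this report).
    """
    if not re.fullmatch(r"[A-Za-z0-9_]+", collection_id):
        raise ValueError(f"unexpected collection_id: {collection_id!r}")
    return (
        "SELECT license_short_name, COUNT(*) AS series_count\n"
        "FROM index\n"
        f"WHERE collection_id = '{collection_id}'\n"
        "GROUP BY license_short_name\n"
        "ORDER BY series_count DESC"
    )
```

With the `client` object from the earlier idc-index examples, `client.sql_query(license_breakdown_sql("nlst"))` would return one row per license found in that collection.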

Appendix C: MIDRC BDF Imaging Hub API (Live-Tested)

All queries below were tested on 2026-02-12 against the live BIH Guppy GraphQL endpoint. No authentication was required.

Endpoint

POST https://imaging-hub.data-commons.org/guppy/graphql
Content-Type: application/json

The BIH uses Gen3's Guppy service backed by Elasticsearch. Four data types are indexed: subject, imaging_study, imaging_series, and dataset.

Query 1: Total Counts Across All Repositories

curl -s -X POST https://imaging-hub.data-commons.org/guppy/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "{ _aggregation { subject { _totalCount } imaging_study { _totalCount } imaging_series { _totalCount } } }"}'

Result:

{
  "data": {
    "_aggregation": {
      "subject": { "_totalCount": 327895 },
      "imaging_study": { "_totalCount": 722695 },
      "imaging_series": { "_totalCount": 1963865 }
    }
  }
}
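The same count query can be issued from Python with only the standard library. This is a sketch under stated assumptions: the helper names `build_count_query`, `post_graphql`, and `extract_counts` are hypothetical, and the response shape is taken from the result shown above.

```python
import json
import urllib.request

BIH_GUPPY_URL = "https://imaging-hub.data-commons.org/guppy/graphql"


def build_count_query(data_types):
    """Build a Guppy _aggregation query asking for _totalCount of each type."""
    parts = " ".join(f"{t} {{ _totalCount }}" for t in data_types)
    return f"{{ _aggregation {{ {parts} }} }}"


def post_graphql(url, query, timeout=30):
    """POST a GraphQL query and return the decoded JSON body (network call)."""
    body = json.dumps({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def extract_counts(response):
    """Flatten {"data": {"_aggregation": {type: {"_totalCount": n}}}} to {type: n}."""
    aggregation = response["data"]["_aggregation"]
    return {name: node["_totalCount"] for name, node in aggregation.items()}
```

A call such as `extract_counts(post_graphql(BIH_GUPPY_URL, build_count_query(["subject", "imaging_study", "imaging_series"])))` would reproduce the three totals; the counts returned will change as the hub is updated.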

Query 2: Subjects Per Repository

curl -s -X POST https://imaging-hub.data-commons.org/guppy/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "{ _aggregation { subject { commons_name { histogram { key count } } } } }"}'

Result:

Repository	Subjects
MIDRC	76,193
IDC	69,223
Stanford AIMI	65,514
OpenNeuro	49,010
NIHCC	35,232
TCIA	28,434
ACRdart	4,289

Query 3: Series Per Repository

curl -s -X POST https://imaging-hub.data-commons.org/guppy/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "{ _aggregation { imaging_series { commons_name { histogram { key count } } } } }"}'

Result:

Repository	Series
IDC	946,957
MIDRC	469,324
Stanford AIMI	224,496
TCIA	201,337
ACRdart	107,150
NIHCC	14,601

Note: IDC leads in series count despite fewer subjects due to its multi-modality, multi-series-per-study collections (e.g., MRI protocols with T1, T2, DWI, ADC series per exam).
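OpenNeuro appears in the subjects histogram (Query 2) but not in the series histogram above. A quick consistency check, sketched here with the numbers copied from this appendix, shows whether the listed repositories already account for every indexed series; the function name `unaccounted_series` is hypothetical.

```python
# Series per repository, copied from the Query 3 result above.
series_by_repo = {
    "IDC": 946_957,
    "MIDRC": 469_324,
    "Stanford AIMI": 224_496,
    "TCIA": 201_337,
    "ACRdart": 107_150,
    "NIHCC": 14_601,
}

# imaging_series._totalCount from Query 1.
reported_total = 1_963_865


def unaccounted_series(counts, total):
    """Return how many series in the total are not covered by the histogram."""
    return total - sum(counts.values())


missing = unaccounted_series(series_by_repo, reported_total)
print(f"Series not attributed to a listed repository: {missing}")
```

A result of zero means the six listed repositories account for the full series total, so a repository absent from the histogram contributed no series to the index at the time of testing.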

Query 4: Filtered Query — IDC CT Series with Metadata

curl -s -X POST https://imaging-hub.data-commons.org/guppy/graphql \
  -H "Content-Type: application/json" \
  -d '{
    "query": "{ imaging_series(first: 3, filter: {AND: [{IN: {commons_name: [\"IDC\"]}}, {IN: {Modality: [\"CT\"]}}]}) { SeriesInstanceUID Modality BodyPartExamined commons_name collection_id } }"
  }'

Result: Returns individual series with SeriesInstanceUID, Modality, BodyPartExamined, commons_name, and collection_id — sufficient to then retrieve the actual data from the source repository (e.g., via idc-index for IDC series).
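Escaping quotes inside an inline filter gets awkward once filters grow. A common alternative in Guppy usage is to pass the filter as a GraphQL variable of type JSON. The sketch below builds such a payload; the helper names are hypothetical, and it is an assumption that the BIH endpoint accepts the variable form in the same way as the inline form tested above.

```python
import json


def in_filter(field, values):
    """Build a Guppy IN clause: the field must match one of the values."""
    return {"IN": {field: list(values)}}


def build_series_payload(filters, fields, first=3):
    """Build a request body that passes the filter as a GraphQL variable."""
    query = (
        "query ($filter: JSON) { imaging_series(first: %d, filter: $filter) { %s } }"
        % (first, " ".join(fields))
    )
    return {"query": query, "variables": {"filter": {"AND": list(filters)}}}


payload = build_series_payload(
    filters=[in_filter("commons_name", ["IDC"]), in_filter("Modality", ["CT"])],
    fields=["SeriesInstanceUID", "Modality", "BodyPartExamined",
            "commons_name", "collection_id"],
)
print(json.dumps(payload, indent=2))
```

The printed JSON can be sent as the POST body in place of the `-d` argument in the curl call.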

Query 5: Modality Distribution Across All Repositories

curl -s -X POST https://imaging-hub.data-commons.org/guppy/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "{ _aggregation { imaging_series { Modality { histogram { key count } } } } }"}'

Returns series counts per modality (CT, MR, CR, DX, SM, etc.) aggregated across every federated repository that has series indexed (the Query 3 histogram lists six of the seven).

Query 6: MIDRC-Specific Data Breakdown

The BIH API can be filtered by commons_name to extract verified statistics for any single repository. This was used to independently verify MIDRC data scale.

MIDRC totals: 76,193 subjects, 189,854 studies, 469,324 series

MIDRC modalities (series):

Modality	Series
CT	170,150
CR (Computed Radiography)	126,858
DX (Digital X-ray)	106,539
SEG (Segmentation)	58,000
MR	5,894
SR (Structured Report)	1,074
PT, RF, NM, US, XA, MG	<300 each

MIDRC collections (series):

Collection	Series
MIDRC-Open-R1	209,142
MIDRC-Open-A1	202,202
MIDRC-TCIA-COVID-19-NY-SBU	24,482
MIDRC-Open-A1_SCCM_VIRUS	13,906
MIDRC-Open-A1_PETAL_BLUECORAL	7,948
MIDRC-Open-A1_PETAL_REDCORAL	7,913
MIDRC-TCIA-RICORD_1c	2,048
+ 4 smaller collections	<700 each

MIDRC licenses (series):

License	Series	%
MIDRC DUA	441,111	94.0%
CC BY 4.0	25,816	5.5%
CC BY-NC 4.0	2,397	0.5%

MIDRC race distribution (subjects):

Race	Subjects
White	37,990
Black or African American	21,393
Not Reported	5,667
No data	5,146
Asian	3,410
Other	2,256
American Indian or Alaska Native	176
Native Hawaiian or other Pacific Islander	155

MIDRC top study descriptions (series):

Study Description	Series
XR Chest AP or PA	215,531
CHEST PORT 1 VIEW	13,385
CHEST AP PORT	13,289
CT CHEST WITH CONTRAST	9,287
CTA CHEST (PE STUDY)	8,893
CT CHEST PULMONARY EMBOLISM	8,473
CT CHEST WO CONTRAST	8,359
CT CHEST W CONTRAST	8,204

Key Observations from Live Testing

No authentication required for read-only queries: the BIH Guppy endpoint is publicly accessible, as is the MIDRC portal's own Guppy endpoint (Appendix D); registration is needed only for Gen3 file downloads
7 repositories federated (not 5 as initially documented): MIDRC, IDC, TCIA, ACRdart, Stanford AIMI, NIHCC, and OpenNeuro
Sub-second response times for aggregation queries; filtered queries with first: N also return quickly
Cross-repository discovery workflow: Query BIH to find series across repositories → use commons_name + SeriesInstanceUID/collection_id to retrieve data from source (e.g., idc-index for IDC, Gen3 DRS for MIDRC)
Metadata only — the BIH indexes structured metadata (identifiers, modality, body part, collection); actual DICOM files remain at source repositories
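The cross-repository discovery workflow listed above comes down to grouping BIH results by source repository and handing each group to that repository's own download tool. A minimal sketch of the grouping step follows; the function name is hypothetical, the sample UIDs are made up, and the row shape (a list of dicts with `commons_name` and `SeriesInstanceUID`) is assumed from the fields requested in Query 4.

```python
from collections import defaultdict


def group_series_by_repository(rows):
    """Map each commons_name to the list of SeriesInstanceUIDs found there."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["commons_name"]].append(row["SeriesInstanceUID"])
    return dict(grouped)


# Made-up rows standing in for a BIH imaging_series result.
sample_rows = [
    {"commons_name": "IDC", "SeriesInstanceUID": "1.2.3.1"},
    {"commons_name": "MIDRC", "SeriesInstanceUID": "1.2.3.2"},
    {"commons_name": "IDC", "SeriesInstanceUID": "1.2.3.3"},
]

for repository, uids in group_series_by_repository(sample_rows).items():
    print(repository, len(uids))
```

The IDC group could then be passed to `client.download_from_selection(seriesInstanceUID=uids, downloadDir=...)` as in Appendix A, while the MIDRC group would go through Gen3 DRS.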

Appendix D: MIDRC Portal Guppy API (Live-Tested)

All queries below were tested on 2026-02-12 against the live MIDRC portal Guppy GraphQL endpoint. No authentication was required. This is distinct from the BDF Imaging Hub API (Appendix C) — the portal Guppy indexes MIDRC-specific data at higher granularity.

Endpoints

Endpoint	Method	Auth	Returns
`POST https://data.midrc.org/guppy/graphql`	GraphQL	None	Metadata queries across 6 data types
`GET https://data.midrc.org/guppy/_status`	REST	None	Index schema (data types, fields, array fields)
`GET https://data.midrc.org/index/_stats`	REST	None	Total file count and data size
`GET https://data.midrc.org/mds/metadata?limit=N`	REST	None	Object GUIDs and metadata

Six Queryable Data Types

The MIDRC portal indexes 6 data types via Guppy (vs 4 in BIH):

Data Type	Total Count	Key Fields
case	84,768	sex, race, ethnicity, age_at_index, covid19_positive, zip, conditions, medications, procedures, measurements, imaging_studies (array fields)
imaging_study	202,222	case_ids, study_uid, study_modality, body_part_examined, days_to_study, loinc_code, loinc_contrast, loinc_long_common_name, sex, race, age_at_index, covid19_positive
data_file	631,786	data_format (DCM/JSON/TXT/NII), data_type, data_category, instance_uids, contrast_bolus_agent, convolution_kernel, diffusion_b_value, echo_time, pixel_spacing, slice_thickness, study_year
measurement	203,374	case_ids, test_days_from_index, test_name, test_result_text, dataset_submitter_id
annotation	59,492	case_ids, instance_uids, annotation_method, annotation_name
radiology_report	12,124	body_part_examined, study_modality, sex, race, age_at_index, covid19_positive, ethnicity

Query 1: Total Counts for All Data Types

curl -s -X POST https://data.midrc.org/guppy/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "{ _aggregation { case { _totalCount } imaging_study { _totalCount } data_file { _totalCount } measurement { _totalCount } annotation { _totalCount } radiology_report { _totalCount } } }"}'

Result:

{
  "data": {
    "_aggregation": {
      "case": {"_totalCount": 84768},
      "imaging_study": {"_totalCount": 202222},
      "data_file": {"_totalCount": 631786},
      "measurement": {"_totalCount": 203374},
      "annotation": {"_totalCount": 59492},
      "radiology_report": {"_totalCount": 12124}
    }
  }
}

Query 2: Demographics Breakdown

curl -s -X POST https://data.midrc.org/guppy/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "{ _aggregation { case { sex { histogram { key count } } race { histogram { key count } } ethnicity { histogram { key count } } covid19_positive { histogram { key count } } } } }"}'

Result:

Field	Value	Count
sex	Female	39,914
	Male	38,500
	no data	6,330
	Not reported	24
race	White	41,466
	Black or African American	22,467
	no data	8,484
	Not Reported	5,655
	Asian	3,571
	Other	2,267
	American Indian or Alaska Native	179
	Native Hawaiian or other Pacific Islander	156
ethnicity	Not Hispanic or Latino	62,678
	Hispanic or Latino	8,004
	no data	8,396
	Not reported	5,670
	Unknown	20
covid19_positive	No	43,473
	Yes	33,842
	no data	7,453
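Histogram responses are easier to work with once flattened into plain dictionaries. The sketch below does that for one field and computes shares; the helper names are hypothetical, the nested response shape is assumed from the `histogram { key count }` selection in the query, and the sample values are the covid19_positive counts from the table above.

```python
def histogram_to_dict(response, data_type, field):
    """Flatten one Guppy histogram into {key: count}."""
    buckets = response["data"]["_aggregation"][data_type][field]["histogram"]
    return {bucket["key"]: bucket["count"] for bucket in buckets}


def shares(counts, exclude=()):
    """Return each key's fraction of the total, optionally skipping some keys."""
    kept = {key: n for key, n in counts.items() if key not in exclude}
    total = sum(kept.values())
    return {key: n / total for key, n in kept.items()}


# Sample response built from the covid19_positive rows in the table above.
sample_response = {
    "data": {"_aggregation": {"case": {"covid19_positive": {"histogram": [
        {"key": "No", "count": 43473},
        {"key": "Yes", "count": 33842},
        {"key": "no data", "count": 7453},
    ]}}}}
}

covid = histogram_to_dict(sample_response, "case", "covid19_positive")
print(f"Positive, all cases: {shares(covid)['Yes']:.1%}")
print(f"Positive, known status only: {shares(covid, exclude=('no data',))['Yes']:.1%}")
```

Reporting both denominators matters here because the "no data" bucket is large enough to shift the positive share by several percentage points.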

Query 3: Study Modality and Data File Format Distribution

curl -s -X POST https://data.midrc.org/guppy/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "{ _aggregation { imaging_study { study_modality { histogram { key count } } } data_file { data_format { histogram { key count } } data_type { histogram { key count } } } } }"}'

Result:

Dimension	Value	Count
study_modality	DX	94,587
	CR	79,646
	CT	25,615
	MR	2,479
data_format	DCM	492,134
	JSON	122,091
	TXT	12,124
	NII	2,510
	CSV	1,888
	XLSX	1,039
data_type	DICOM	433,060
	External Annotation	177,070
	Radiology Report	12,124
	Internal Annotation	8,676
	Clinical Supplement	856

Query 4: Annotation Details

curl -s -X POST https://data.midrc.org/guppy/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "{ _aggregation { annotation { annotation_method { histogram { key count } } annotation_name { histogram { key count } } } } }"}'

Result:

Field	Value	Count
annotation_method	Retrospective_auto	57,420
	Retrospective_expert	2,072
annotation_name	SIFT	57,419
	midrc_mRALE_Mastermind_Challenge	2,072
	no data	1

Query 5: Dataset/Collection Breakdown (Top 10)

curl -s -X POST https://data.midrc.org/guppy/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "{ _aggregation { case { dataset { histogram { key count } } } } }"}'

Result (top 10 of 78 datasets):

Dataset	Cases
RSNA_20230830	21,465
RSNA_20230725	13,605
RSNA_20231012	10,456
RSNA_20221011	8,279
RSNA_20230420	5,576
ACR_20230530	5,026
RSNA_20220412	3,972
ACR_20230823	3,395
ACR_20240226	2,539
RSNA_20240315	2,369

Query 6: Total Data Size (Indexd)

curl -s https://data.midrc.org/index/_stats

Result:

{"fileCount": 637431, "totalFileSize": 12360285440149}

Total: 637,431 files, 12.36 TB
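The 12.36 TB figure uses decimal terabytes (10^12 bytes). The short sketch below makes the unit explicit and also shows the binary-unit equivalent, since the two differ by roughly 10% at this scale; the function names are hypothetical.

```python
# Values copied from the Indexd _stats response above.
stats = {"fileCount": 637431, "totalFileSize": 12360285440149}


def bytes_to_tb(n_bytes):
    """Convert bytes to decimal terabytes (1 TB = 10**12 bytes)."""
    return n_bytes / 10**12


def bytes_to_tib(n_bytes):
    """Convert bytes to binary tebibytes (1 TiB = 2**40 bytes)."""
    return n_bytes / 2**40


size = stats["totalFileSize"]
print(f"{bytes_to_tb(size):.2f} TB, {bytes_to_tib(size):.2f} TiB")
```

When comparing against IDC's reported size, both figures should be expressed in the same unit.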

Key Observations: Portal vs BIH Discrepancy

Metric	MIDRC Portal	BIH (MIDRC filter)	Difference
Cases/Subjects	84,768	76,193	+11.3%
Studies	202,222	189,854	+6.5%
Files vs Series	631,786 files	469,324 series	Different units

Likely explanations:

BIH indexes at series level, not file level: one series may contain multiple files (DICOM instances), and annotation files (JSON, NIfTI, TXT) count as files but are not DICOM series
BIH may lag behind the portal — federated index may not be synchronized in real-time
Different data types indexed — the portal indexes 6 data types including measurements and radiology reports; BIH indexes 4 (subject, imaging_study, imaging_series, dataset)
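The percentage differences in the table can be recomputed from the raw counts. This sketch takes the BIH count as the baseline, which is an assumption about how the Difference column was derived; the function name `percent_more` is hypothetical.

```python
def percent_more(portal_count, bih_count):
    """Return how much larger the portal count is, as a percent of the BIH count."""
    return (portal_count - bih_count) / bih_count * 100


# Counts copied from the Portal vs BIH table above.
comparisons = {
    "cases/subjects": (84_768, 76_193),
    "studies": (202_222, 189_854),
}

for label, (portal, bih) in comparisons.items():
    print(f"{label}: portal has {portal - bih:,} more ({percent_more(portal, bih):.1f}%)")
```

Files and series are left out because they are different units and a percentage difference between them would not be meaningful.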

MIDRC Portal Guppy vs IDC idc-index: Comparison of Public Queryability

Capability	IDC (idc-index)	MIDRC (Portal Guppy)
Auth required	No	No
Query interface	SQL (local DuckDB)	GraphQL (remote Elasticsearch)
Data types	1 primary + 9 specialized indices	6 data types
Latency	Milliseconds (local)	Seconds (network)
Offline	Yes (Parquet cached locally)	No
Schema discovery	`client.indices_overview`	`GET /guppy/_status`
Filter + paginate	SQL LIMIT/OFFSET/WHERE	GraphQL `first`/`offset`/`filter`
Aggregations	SQL GROUP BY, COUNT, SUM	GraphQL `_aggregation` with `histogram`
Instance-level metadata	Yes (4000+ DICOM tags via BigQuery)	Partial (select fields on `data_file` type: pixel_spacing, slice_thickness, echo_time, etc.)
Annotation metadata	Rich (algorithm name, type, segment count, property codes)	Basic (annotation_method, annotation_name)
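The `first`/`offset` pagination noted in the table can be wrapped in a small generator so that calling code sees one continuous stream of rows. This is a sketch, not a tested client: `fetch_page` is a hypothetical callable that would issue one Guppy query per page, and the 10,000-record ceiling is an assumption about the server's result window rather than a value confirmed in this report.

```python
from typing import Callable, Dict, Iterator, List


def paginate(
    fetch_page: Callable[[int, int], List[Dict]],
    page_size: int = 100,
    max_records: int = 10_000,
) -> Iterator[Dict]:
    """Yield rows page by page until a short or empty page is returned.

    fetch_page(first, offset) must return a list of row dicts. max_records
    stops paging before an assumed server-side result window is exceeded.
    """
    offset = 0
    while offset < max_records:
        first = min(page_size, max_records - offset)
        rows = fetch_page(first, offset)
        yield from rows
        if len(rows) < first:
            break  # a short page means there is nothing left to fetch
        offset += len(rows)
```

With idc-index the same pattern is rarely needed, because the whole index is local and a single SQL query with LIMIT/OFFSET, or no limit at all, returns immediately.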

Limitations of This Analysis

Verification depth

Data Source	Verification Method	Depth
IDC	`idc-index` v0.11.9 programmatic queries	Exact counts (patients, series, instances, sizes, modalities, body parts, licenses, segmentation algorithms, annotation types)
MIDRC (portal Guppy)	`data.midrc.org/guppy/graphql` + `data.midrc.org/index/_stats`	Exact counts (cases, studies, files, data volume in TB, demographics, COVID status, modalities, annotations, measurements, radiology reports, datasets)
MIDRC (via BIH)	`imaging-hub.data-commons.org/guppy/graphql`	Exact counts (subjects, studies, series, modalities, body parts, collections, licenses, race demographics, study descriptions)

Both platforms were verified programmatically using publicly accessible APIs that require no authentication. The MIDRC portal Guppy and Indexd APIs resolved several items previously listed as "not publicly stated" — including data volume (12.36 TB), file count (637,431), annotation methods, and demographic breakdowns.

Things not verified or not possible to verify

~~MIDRC data volume in TB~~: RESOLVED. 12.36 TB via data.midrc.org/index/_stats
~~MIDRC instance count~~: RESOLVED. Indexd reports 637,431 files; the portal Guppy data_file type separately indexes 631,786 files (492K DCM + 122K JSON + 12K TXT + 2.5K NIfTI + others). Both figures count files, not DICOM instances per se
~~MIDRC annotation-level metadata~~: RESOLVED. Portal Guppy exposes annotation_method (Retrospective_auto, Retrospective_expert) and annotation_name (SIFT, midrc_mRALE_Mastermind_Challenge) as queryable fields. Less granular than IDC's per-segment metadata but searchable.
MIDRC sequestered data characteristics — the 80/20 open/sequestered split is documented, but both portal and BIH likely index only the public portion. The actual composition of the sequestered portion could not be independently verified
Portal vs BIH discrepancy — the MIDRC portal shows more data than BIH (84,768 vs 76,193 cases; 202,222 vs 189,854 studies). Likely explanations: (a) BIH federation lag, (b) BIH indexes fewer data types (4 vs 6), (c) different entity definitions (BIH "subject" vs portal "case"). MIDRC self-reports ">300,000 studies collected" which likely includes sequestered data and pre-release collections
Cross-commons query performance — CDA (IDC/CRDC) query capabilities were described from documentation, not tested programmatically. Both BIH and MIDRC portal Guppy APIs were tested live (see Appendices C and D) and confirmed to work without authentication
MIDRC Gen3 SDK code examples — the Gen3 SDK download examples were assembled from documentation but not executed (requires registration). However, the Guppy GraphQL metadata query examples were executed live and verified
Radiomics feature-level content — IDC's SR series were described by SeriesDescription and analysis_result_id, but individual radiomics features within SRs are only searchable via BigQuery (not idc-index), and this was not demonstrated
MIDRC tool maturity — 30+ algorithms are listed on midrc.org, but their integration depth with the data portal was not verified. SIFT appears to be the primary integrated algorithm (57,419 annotations indexed in portal)
Data freshness — IDC version v23 was current at time of analysis. MIDRC portal and BIH data may differ from each other and from the latest additions
MIDRC data_file field granularity — the portal Guppy data_file type exposes DICOM-level fields (pixel_spacing, slice_thickness, echo_time, contrast_bolus_agent, diffusion_b_value) but the actual queryable values and their coverage across files were not tested

Potential biases

Tool-induced bias

This analysis was generated using Claude Code with an imaging-data-commons skill loaded into context. This skill provided a detailed, authoritative reference for IDC — its data model, SQL patterns, index table schemas, API capabilities, and exact column names. No equivalent skill existed for MIDRC. This created an asymmetry — not by restricting MIDRC research, but by making IDC exploration substantially easier and more directed.

How the skill created asymmetry:

Structured guidance vs. open-ended search. The IDC skill gave the LLM a roadmap of what to explore (e.g., seg_index, ann_group_index, clinical_index, contrast metadata). For MIDRC, there was no equivalent guide — so the analysis of MIDRC's derived data, clinical tables, and internal metadata structure is shallower, not necessarily because MIDRC has less capability, but because there was less guidance on what to look for.
Programmatic verification. The skill enabled live idc-index queries producing exact, authoritative numbers from the start of the session. For MIDRC, programmatic verification initially came only through the BIH Guppy API (aggregate statistics), because the portal's public Guppy and Indexd endpoints had not yet been discovered and the Gen3 SDK route requires registration. Even after the portal APIs were tested (Appendix D), IDC's numbers remain verified at finer granularity (instance counts, per-algorithm segmentation counts, per-annotation-group property types) than MIDRC's.
Iterative deepening. Each round of user feedback ("add more detail on annotations", "fix the annotation queries") drove deeper into IDC's capabilities because that's where programmatic tools were available. MIDRC didn't receive equivalent iterative deepening until the BIH API was tested late in the process.

Important caveat: The LLM was not prevented or discouraged from researching MIDRC more thoroughly. Web search, web fetch, and API testing tools were available throughout the session and were used for MIDRC research. The asymmetry was one of pull rather than push — the IDC skill actively directed exploration toward specific metadata tables and query patterns, while MIDRC research required the LLM to independently discover what to look for.

This was confirmed by the late-session discovery of MIDRC's public Guppy API. When explicitly prompted to "search more aggressively for MIDRC information," the LLM discovered that data.midrc.org/guppy/graphql is publicly accessible without authentication — resolving multiple "not publicly stated" gaps in a single round of API testing. The LLM could have made this discovery much earlier in the process; the fact that it required user prompting demonstrates the behavioral nature of the bias. The BIH API discovery also came from user prompting, not LLM initiative. The skill made IDC exploration easier, but nothing made MIDRC exploration harder — the LLM simply didn't try as hard without the structured guidance the IDC skill provided.

Mitigating factors

The MIDRC portal Guppy API discovery substantially corrected the asymmetry by providing independently verified MIDRC numbers for cases (84,768), studies (202,222), files (637,431), data volume (12.36 TB), 6 data types, demographics, annotation methods, and dataset breakdowns
The BIH API added verified series-level statistics (469,324 series, modality/body part/collection/license breakdowns) for cross-repository context
The Indexd API confirmed total data size (12.36 TB) — previously listed as "not publicly stated"
Web research covered MIDRC's unique strengths (bias/fairness tools, sequestered evaluation, PPRL, federated search, measurements, radiology reports) that IDC does not offer
This limitations section explicitly documents the asymmetry and its behavioral nature

What would reduce the bias further

A comparable MIDRC/Gen3 skill providing structured guidance for MIDRC exploration (e.g., documenting Guppy data types, field schemas, query patterns — analogous to the IDC skill's index table guide)
Authenticated Gen3 portal access for testing download workflows and exploring data not exposed via public Guppy (e.g., individual file downloads, workspace integration)
Testing the Cancer Data Aggregator (CDA) API for IDC's cross-commons capabilities, paralleling the BIH/portal API testing done for MIDRC
Exploring MIDRC's data_file Guppy fields (pixel_spacing, slice_thickness, echo_time, contrast_bolus_agent, etc.) to assess DICOM-level metadata queryability — potentially comparable to IDC's BigQuery
More proactive LLM exploration of MIDRC resources earlier in the analysis process, without waiting for user prompting

Key Sources

Publications

Fedorov et al., RadioGraphics 2023 - IDC transparency & reproducibility
Drukker et al., Scientific Data 2025 - MIDRC interoperability use cases
Bergquist et al., J Imaging Informatics Med 2025 - MIDRC + N3C COVID severity prediction
NCI CRDC Core Standards and Services - Cancer Research 2024

Portals & Documentation

APIs Tested Live (Feb 2026, No Auth Required)

IDC: idc-index v0.11.9 Python package (local Parquet queries)
MIDRC Portal Guppy: POST https://data.midrc.org/guppy/graphql — 6 data types, full metadata queries
MIDRC Portal Indexd: GET https://data.midrc.org/index/_stats — file count and data volume
MIDRC Portal Guppy Status: GET https://data.midrc.org/guppy/_status — schema discovery
BDF Imaging Hub Guppy: POST https://imaging-hub.data-commons.org/guppy/graphql — 7-repository federated search

Tools

imaging-data-commons Claude skill - used for programmatic IDC data verification
