claude_report_idc_vs_midrc.md
Generated on 2026-02-12 using Claude Code (Claude Opus 4.6). The conversation was prompted by Andrey Fedorov (Brigham and Women's Hospital / Harvard Medical School), who is a co-PI of IDC and is also supported in part by the MIDRC project. His dual involvement means the prompts likely reflected more domain knowledge about both platforms than an average user would have, but also that the framing may have been influenced by deeper familiarity with IDC's architecture and capabilities. IDC statistics were verified programmatically using the imaging-data-commons Claude skill, which provides direct access to idc-index (v0.11.9, IDC data v23). MIDRC statistics were verified via live Guppy GraphQL API queries against both the MIDRC portal (data.midrc.org/guppy/graphql) and the BDF Imaging Hub (imaging-hub.data-commons.org/guppy/graphql) — both publicly accessible without authentication. Additional MIDRC data size verified via Indexd API (data.midrc.org/index/_stats).
The NCI Imaging Data Commons (IDC) and the NIBIB/ARPA-H Medical Imaging and Data Resource Center (MIDRC) are two major federally funded open imaging data platforms in the United States. Both launched in 2020, both use DICOM as their primary data standard, and both are built on open-source infrastructure — but they serve different communities, solve different problems, and make different architectural choices.
IDC is a cancer-focused repository hosting 79,569 patients, 994,073 series, and 95.33 TB across 161 collections spanning 26 imaging modalities (CT, MRI, digital pathology, mammography, PET, and more). All data is harmonized to DICOM and served from public cloud storage (AWS + GCS) with zero authentication required — for both metadata queries and file downloads. IDC's strength is its depth of derived data: 188,013 segmentation series, 6.17 billion pathology annotations, and radiomics features for nearly 10 million structures, all queryable via SQL. IDC operates as a node within the NCI Cancer Research Data Commons (CRDC), enabling cross-commons queries spanning imaging, genomics, and proteomics.
MIDRC is a COVID-19/respiratory disease platform that has expanded to chronic diseases, hosting 84,768 cases, 202,222 studies, and 12.36 TB across 78 dataset batches from 27 U.S. hospitals. While narrower in modality (primarily chest X-ray and CT), MIDRC offers unique capabilities: 203,374 clinical measurements, 12,124 radiology reports, and 59,492 annotations as dedicated queryable data types — plus a 20% sequestered dataset for unbiased AI evaluation. MIDRC's defining strengths are its bias/fairness tooling (MELODY, REACT, AI Reliability Tool), its privacy-preserving record linkage with EHR data (N3C), and the BDF Imaging Hub — a federated search spanning 7 imaging repositories including IDC itself.
Licensing differs significantly. IDC assigns licenses per DICOM object (not per collection), with 97%+ of series under CC-BY (4.0 or 3.0), permitting commercial use with attribution. MIDRC uses a custom Data Use Agreement (DUA) covering 94% of its series, with commercial use requiring a separate agreement through the University of Chicago. IDC also makes 100% of its data publicly accessible, while MIDRC sequesters ~20% for unbiased algorithm evaluation.
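Because IDC licenses are assigned per object, a pipeline that needs commercially usable data has to check each series rather than each collection. The following is a minimal sketch of such a check. It assumes license labels follow the "CC BY 4.0" / "CC BY-NC 4.0" naming used in this report; the function name `allows_commercial_use` is hypothetical and not part of idc-index. It is a naming heuristic, not legal advice.

```python
def allows_commercial_use(license_short_name: str) -> bool:
    """Heuristic: True for Creative Commons attribution licenses without an NC clause.

    Assumes labels such as "CC BY 4.0" or "CC BY-NC 4.0" (the style used in this report).
    Anything that is not a "CC BY ..." label, e.g. a custom DUA, is treated as not cleared.
    """
    tokens = license_short_name.strip().upper().replace("-", " ").split()
    if tokens[:2] != ["CC", "BY"]:
        return False
    return "NC" not in tokens


if __name__ == "__main__":
    for label in ["CC BY 4.0", "CC BY 3.0", "CC BY-NC 4.0", "MIDRC DUA"]:
        print(f"{label!r}: commercial use cleared = {allows_commercial_use(label)}")
```

In practice the labels would come from the per-series license column of the metadata index rather than a hard-coded list.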
Key finding for programmatic users: Both platforms offer public metadata APIs requiring no authentication. IDC provides idc-index (local SQL over Parquet, millisecond latency). MIDRC provides Guppy GraphQL at data.midrc.org/guppy/graphql (remote Elasticsearch, seconds latency, 6 data types). The public accessibility of MIDRC's Guppy API was not well documented and was discovered during this analysis through API testing. It substantially narrows the programmatic access gap between the platforms for metadata exploration.
The platforms are complementary, not competing. Use IDC for cancer imaging, large-scale cloud analytics, digital pathology, and reproducible versioned pipelines. Use MIDRC for COVID-19/respiratory imaging, AI fairness evaluation, sequestered benchmarking, multi-modal clinical integration (imaging + EHR + genomics via cross-commons linkage), and federated cross-repository search. MIDRC's BDF Imaging Hub federates IDC, making both accessible through a single search interface.
| Dimension | IDC (Imaging Data Commons) | MIDRC (Medical Imaging & Data Resource Center) |
|---|---|---|
| Funder | NCI (Cancer Moonshot) | NIBIB + ARPA-H |
| Host / Operator | Brigham & Women's / Harvard (Fedorov & Kikinis) | University of Chicago (Giger); co-led by ACR, RSNA, AAPM |
| Ecosystem | Node within NCI Cancer Research Data Commons (CRDC) | Independent commons; ARPA-H Biomedical Data Fabric performer; NAIRR pilot |
| Launch | 2020 | 2020 (COVID-19 response) |
| | IDC | MIDRC |
|---|---|---|
| Core mission | Cloud-based repository for publicly available cancer imaging data, co-located with analysis tools | AI-ready data commons for machine learning innovation, initially COVID-19, now expanding to chronic diseases |
| Disease scope | Cancer only (all organ sites) | COVID-19 + expanding: cancer, diabetes, chronic liver disease, coronary artery disease, COPD, emphysema |
| Strategic emphasis | Transparency, reproducibility, scalability of imaging AI | Bias mitigation, fairness, real-world AI testing, interoperability |
IDC numbers verified live via idc-index v0.11.9 (IDC data v23, Feb 2026). MIDRC numbers verified via MIDRC portal Guppy GraphQL API + Indexd API (Feb 2026). BIH numbers used where portal data unavailable.
| | IDC (v23) | MIDRC (portal-verified) |
|---|---|---|
| Collections / Datasets | 161 curated collections | 78 dataset batches (RSNA + ACR submissions; largest: RSNA_20230830 with 21,465 cases) |
| Patients / Cases | 79,569 patients | 84,768 cases (portal); 76,193 subjects (BIH — discrepancy likely due to BIH not indexing all data types) |
| Studies | 160,199 studies | 202,222 imaging studies (portal); 189,854 (BIH) |
| Series | 994,073 series | 469,324 series (BIH — portal does not expose series-level counts) |
| Instances (files) | 46,885,909 DICOM instances | 637,431 files (Indexd; includes 492K DCM, 122K JSON, 12K TXT, 2.5K NIfTI) |
| Data volume | 95.33 TB | 12.36 TB (Indexd: data.midrc.org/index/_stats) |
| Imaging modalities | 26 modalities: CT (35K patients), SM/pathology (20K), MG/mammography (17K), MR (8K), PET (1.1K), CR, DX, US, and 18 others | Portal study_modality: DX (94.6K studies), CR (79.6K), CT (25.6K), MR (2.5K). BIH series: CT (170K), CR (127K), DX (107K), SEG (58K), MR (5.9K), SR (1K), + 6 others |
| Anatomical breadth | 20+ body parts: chest (30.7K), breast (20.9K), abdomen (1.8K), lung (1.4K), prostate (1K), colon (850), pelvis, liver, brain, kidney, etc. | Predominantly thoracic: CHEST (221K series), PORT CHEST (60K), ABDOMEN (12K), HEAD (2.7K), HEART (1.4K); ~60% chest, ~34% unlabeled |
| Image-derived data | 188,013 segmentation series covering 154,233 source series; 23 analysis result collections; 7,108 annotation series | 59,492 annotations (portal): SIFT auto-annotations (57,419), mRALE Mastermind Challenge expert annotations (2,072). 58K SEG + 1K SR series (BIH) |
| Additional data types | — | 203,374 measurements, 12,124 radiology reports (portal — unique MIDRC data types not found in IDC) |
| Clinical data tables | 97 collections with clinical data across 224 tables (TCGA, CPTAC, HTAN, NLST, ACRIN, RICORD, etc.) | Demographics (portal: Female 39.9K, Male 38.5K; White 41.5K, Black/AA 22.5K, Asian 3.6K; Not Hispanic 62.7K, Hispanic 8K), conditions, medications, vitals, labs, COVID status (positive: 33.8K, negative: 43.5K) |
| Data sources | TCGA, TCIA, CPTAC, CCDI, HTAN, LIDC, QIN, NLST, VHP | 27 U.S. subsites (academic + community hospitals); 78 RSNA + ACR dataset batches |
| | IDC | MIDRC |
|---|---|---|
| Registration required | No - fully open, anonymous access | Partial - metadata queries via Guppy GraphQL work without authentication (see Appendix D); data download requires free registration |
| Authentication | None for data download; GCP auth only for BigQuery | None for metadata (Guppy GraphQL at data.midrc.org/guppy/graphql is publicly accessible); Gen3 auth (OpenID Connect / OAuth 2.0) required for file downloads |
| Licensing | Per-object (not per-collection): each DICOM series tagged with its own license. Analysis results contributed to a collection may differ from original images. Overall: 97%+ CC-BY (CC BY 4.0: 817K series; CC BY 3.0: 146K series); 3% CC-BY-NC (30K series) | MIDRC DUA (441K series, 94%); CC BY 4.0 (26K series, 5.5%); CC BY-NC 4.0 (2.4K series, 0.5%). Commercial use requires separate DUA agreement via UChicago |
| Egress fees | Zero - free from both AWS and GCS | Free download from Gen3 portal |
| Sequestered data | None - 100% public | 20% sequestered for unbiased algorithm evaluation / regulatory approval |
| | IDC | MIDRC |
|---|---|---|
| Platform | Google Cloud Platform + AWS (dual-cloud) | Gen3 Data Ecosystem (open-source, UChicago data center) |
| Data format | All data harmonized to DICOM | DICOM + multiple delivery formats (DICOM SR, SEG, JSON, NIfTI, CSV, XLSX) |
| Storage | 3 public cloud buckets (GCS + S3 mirrors) | Gen3 object storage |
| Metadata query | BigQuery (4000+ DICOM tags, SQL); idc-index Python package (~50 columns, no auth); Parquet on S3 (DuckDB) | Guppy GraphQL API at data.midrc.org/guppy/graphql (no auth, 6 data types: case, imaging_study, data_file, measurement, annotation, radiology_report); Gen3 portal explorer; GA4GH DRS API; graph database for clinical/phenotype data |
| Image retrieval | DICOMweb (IDC proxy, no auth); S3/GCS direct download; Google Healthcare API | Gen3 download client; TCIA integration |
| Programmatic access | idc-index Python package; BigQuery SQL; DICOMweb REST | Guppy GraphQL (no auth for metadata); Gen3 SDK; GA4GH DRS; Indexd API (/index/_stats); Gen3 mesh services |
| Visualization | OHIF v3 (radiology), SliM (pathology), VolView (3D), 3D Slicer extension | Gen3 portal; limited built-in visualization |
| Compute integration | Google Colab, Vertex AI, NIH Cloud Lab, ACCESS HPC | Gen3 workspaces |
| Data versioning | Versioned releases (v1-v23+), CRDC UUIDs persist across versions | Incremental data additions; no formal versioning scheme documented |
| Federated search | No (centralized) | Yes - MIDRC BDF Imaging Hub federates 7 repositories: IDC, MIDRC, TCIA, ACRDart, Stanford AIMI, NIHCC, OpenNeuro (327,895 subjects, 1.96M series total) |
Both platforms invest in interoperability, but with different philosophies: IDC prioritizes DICOM standards + CRDC ecosystem integration, while MIDRC prioritizes federated access + cross-commons record linkage.
| Standard | IDC | MIDRC |
|---|---|---|
| DICOM | Core philosophy — all data harmonized to DICOM | Native storage format |
| DICOMweb (WADO-RS, QIDO-RS, STOW-RS) | Full support via Google Healthcare API + IDC proxy (no auth) | WADO-RS support |
| GA4GH DRS | Yes: CRDC UUIDs resolve as DRS IDs (dg.4DFC/<uuid>) via CRDC centralized service | Yes: GUIDs resolve to file locations across Gen3 repositories |
| GA4GH Passports | Via CRDC/DCF infrastructure | Core auth mechanism for cross-commons access |
| OpenID Connect / OAuth 2.0 | GCP auth only (for BigQuery); data access is unauthenticated | Required for file downloads; Guppy metadata queries work without authentication |
| FHIR (HL7) | Not directly (NCPI alignment in progress) | Supported for clinical data interoperability |
| PPRL (Privacy-Preserving Record Linkage) | Not implemented | Gen3 Crosswalk Service for secure patient matching across systems |
| Cloud-native | BigQuery SQL; Parquet on S3; dual-cloud (AWS+GCS) mirroring | Gen3 object storage; single cloud |
| CRDC-H (Harmonized Data Model) | Yes — enables cross-commons queries via Cancer Data Aggregator | No (not a CRDC node) |
IDC connects through the NCI CRDC ecosystem and NCPI program:
| System | Linkage Mechanism | What It Provides |
|---|---|---|
| GDC (Genomic Data Commons) | CRDC shared identifiers; Cancer Data Aggregator | Genomic sequencing and variant data |
| PDC (Proteomic Data Commons) | CRDC shared identifiers; Cancer Data Aggregator | Proteomic analysis data |
| ICDC (Canine Data Commons) | CRDC shared identifiers | Canine clinical trial data |
| CTDC (Clinical & Translational Data Commons) | CRDC shared identifiers | Clinical trial data |
| TCIA | Data ingestion — IDC mirrors all public TCIA collections automatically | Archival repository; IDC adds cloud-native access |
| NCPI platforms (AnVIL, BioData Catalyst, dbGaP, Kids First) | GA4GH DRS; shared auth standards | Cross-NIH platform interoperability |
| PACS systems | DICOMweb (QIDO-RS, WADO-RS) | Direct query from clinical imaging systems |
Cancer Data Aggregator (CDA): Unified search across all 6 CRDC nodes — users can build "virtual cohorts" spanning imaging (IDC), genomics (GDC), proteomics (PDC), and clinical data using disease name, anatomical location, demographics, or data type. Uses the CRDC-H harmonized data model for cross-node element mapping.
CRDC Data Commons Framework (DCF): Gen3-based infrastructure minting persistent GUIDs for all 52 million FAIR objects (4.9 PB) across CRDC. DRS IDs resolve to access methods for both GCS and AWS, decoupling logical identifiers from physical storage.
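To make the identifier structure concrete, here is a minimal sketch that splits a GUID of the form `dg.4DFC/<uuid>` into its namespace prefix and UUID. The helper name `parse_guid` and the all-zero example UUID are hypothetical; resolving the identifier to a storage location still requires calling the DRS service, which is not shown.

```python
import uuid
from typing import NamedTuple


class ParsedGuid(NamedTuple):
    prefix: str             # namespace prefix, e.g. "dg.4DFC"
    object_uuid: uuid.UUID  # the UUID part that identifies the object


def parse_guid(guid: str) -> ParsedGuid:
    """Split a '<prefix>/<uuid>' GUID and validate the UUID part."""
    prefix, sep, raw_uuid = guid.partition("/")
    if not sep or not prefix.startswith("dg."):
        raise ValueError(f"expected '<dg.prefix>/<uuid>', got {guid!r}")
    # uuid.UUID raises ValueError if the second part is not a well-formed UUID
    return ParsedGuid(prefix, uuid.UUID(raw_uuid))


if __name__ == "__main__":
    # Hypothetical GUID for illustration only; not a real IDC object.
    example = "dg.4DFC/00000000-0000-0000-0000-000000000000"
    print(parse_guid(example))
```

Keeping the prefix separate from the UUID mirrors the decoupling described above: the UUID names the object, while the prefix tells a resolver which service to ask for its current location.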
Dual-cloud mirroring: IDC maintains identical copies on AWS S3 and GCS (migrated 40M DICOM objects / 63 TB via AWS DataSync in <41 hours). Parquet metadata exports in idc-open-metadata S3 bucket enable tool-agnostic access via DuckDB, Spark, etc.
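For a sense of scale, the migration figures above imply a sustained transfer rate that can be worked out directly. This sketch assumes decimal terabytes (1 TB = 10^12 bytes) and treats 41 hours as an upper bound on duration, so the computed rates are lower bounds.

```python
# Figures quoted in the text: 40M DICOM objects, 63 TB, under 41 hours.
OBJECTS = 40_000_000
BYTES_MOVED = 63 * 10**12      # assumption: decimal terabytes
MAX_SECONDS = 41 * 3600        # "<41 hours" taken as an upper bound

min_megabytes_per_second = BYTES_MOVED / MAX_SECONDS / 10**6
min_objects_per_second = OBJECTS / MAX_SECONDS
mean_object_megabytes = BYTES_MOVED / OBJECTS / 10**6

print(f"Sustained throughput: at least {min_megabytes_per_second:.0f} MB/s")
print(f"Object rate: at least {min_objects_per_second:.0f} objects/s")
print(f"Mean object size: about {mean_object_megabytes:.2f} MB")
```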
MIDRC connects through the Gen3 mesh and ARPA-H BDF Imaging Hub:
| System | Linkage Mechanism | What It Provides | Published Use Case |
|---|---|---|---|
| N3C (National COVID Cohort Collaborative) | PPRL via honest broker | EHR data (diagnoses, labs, vitals, meds, procedures) | COVID-19 severity prediction from chest X-rays + EHR (Bergquist et al., J Imaging Informatics Med 2025) |
| BioData Catalyst (BDC) | Gen3 mesh / Crosswalk | Genomic + clinical data from NHLBI studies | Multimodal cohort construction (Chen/Whitney et al., Scientific Data 2025) |
| IDC | BDF Imaging Hub federated search | Cancer imaging (95 TB, 161 collections) | Cross-repository discovery via BIH |
| TCIA | Direct collection hosting + BIH | MIDRC-RICORD-1A/1B/1C collections served via TCIA | RICORD datasets published through both |
| ACRDart | BDF Imaging Hub affiliate | Radiology imaging data (ACR registry) | Federated search |
| Stanford AIMI | BDF Imaging Hub affiliate | Stanford medical imaging datasets | Federated search |
| All of Us | Gen3 mesh capability | Large-scale population health data | Supported (in development) |
Gen3 "Narrow Middle" architecture: Standardized core services (OpenAPI/RESTful) sit between diverse data ingest/curation and analysis/processing applications. FAIR APIs auto-generated from data models. The Crosswalk Service passes privacy-preserving MIDRC IDs to match patients across repositories without exposing PHI.
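The general idea behind privacy-preserving linkage can be illustrated with a toy example: each site derives a keyed hash from normalized identifying fields, and only the resulting tokens are compared. This is a conceptual sketch only. It is not the algorithm used by the Gen3 Crosswalk Service, N3C, or any honest broker; the function name `linkage_token`, the choice of fields, and the key handling are all assumptions made for illustration.

```python
import hashlib
import hmac
import unicodedata


def linkage_token(first_name: str, last_name: str, birth_date: str, secret_key: bytes) -> str:
    """Derive a keyed, non-reversible token from normalized identifying fields.

    Sites sharing the same secret key produce the same token for the same person,
    so tokens can be matched without exchanging the underlying identifiers.
    """
    parts = (first_name, last_name, birth_date)
    normalized = "|".join(
        unicodedata.normalize("NFKC", part).strip().casefold() for part in parts
    )
    return hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


if __name__ == "__main__":
    key = b"demo-key-held-by-the-linkage-broker"  # hypothetical shared secret
    token_site_a = linkage_token("Ada", "Lovelace", "1815-12-10", key)
    token_site_b = linkage_token(" ada ", "LOVELACE", "1815-12-10", key)
    print("Tokens match:", token_site_a == token_site_b)
```

Real deployments add safeguards this sketch omits, such as tolerance for typos and protection against dictionary attacks on the token space.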
MIDRC BDF Imaging Hub (ARPA-H funded):
- Federated hub providing unified search across 7 repositories: MIDRC, IDC, TCIA, ACRDart, Stanford AIMI, NIHCC, and OpenNeuro
- Does not centralize/duplicate imaging files — aggregates structured metadata and provides identifiers for researchers to access from original nodes
- Features cohort builder over aggregated structured metadata
- Part of ARPA-H Biomedical Data Fabric Toolbox (September 2023)
- API: Guppy GraphQL endpoint at https://imaging-hub.data-commons.org/guppy/graphql, which works without authentication for read-only queries
- Live-verified scale (Feb 2026): 327,895 subjects, 722,695 studies, 1,963,865 series across all 7 repositories
| Repository | Subjects | Series |
|---|---|---|
| MIDRC | 76,193 | 469,324 |
| IDC | 69,223 | 946,957 |
| Stanford AIMI | 65,514 | 224,496 |
| OpenNeuro | 49,010 | — |
| NIHCC | 35,232 | 14,601 |
| TCIA | 28,434 | 201,337 |
| ACRdart | 4,289 | 107,150 |
See Appendix C for live-tested API query examples.
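The per-repository figures in the table above can be cross-checked against the hub-wide totals quoted earlier (327,895 subjects and 1,963,865 series). The sketch below simply re-adds the table values; OpenNeuro has no series count in the table and is treated as contributing none.

```python
# (subjects, series) per repository, copied from the table above.
# None marks the missing OpenNeuro series count.
bih_counts = {
    "MIDRC": (76_193, 469_324),
    "IDC": (69_223, 946_957),
    "Stanford AIMI": (65_514, 224_496),
    "OpenNeuro": (49_010, None),
    "NIHCC": (35_232, 14_601),
    "TCIA": (28_434, 201_337),
    "ACRdart": (4_289, 107_150),
}

total_subjects = sum(subjects for subjects, _ in bih_counts.values())
total_series = sum(series for _, series in bih_counts.values() if series is not None)

print(f"Subjects: {total_subjects:,}")
print(f"Series:   {total_series:,}")
```

Both sums reproduce the reported totals, which indicates the hub-wide series figure excludes OpenNeuro.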
| | IDC | MIDRC |
|---|---|---|
| Interop model | Hub-in-ecosystem — centralized data with CRDC cross-commons query | Federated mesh — distributed data with cross-repository search |
| Cross-commons search | Cancer Data Aggregator queries all 6 CRDC nodes | BDF Imaging Hub federates 7 imaging repositories (verified via live API) |
| Patient matching | Shared CRDC identifiers (same patient across GDC/PDC/IDC) | Privacy-preserving record linkage (PPRL) for patient matching across separate systems |
| Data movement | Data ingested and harmonized centrally (DICOM) | Data stays at source; identifiers routed through mesh |
| NIH ecosystem | NCPI (AnVIL, BDC, dbGaP, Kids First) | ARPA-H BDF + Gen3 mesh (N3C, BDC, All of Us) |
| | IDC | MIDRC |
|---|---|---|
| AI-generated annotations | Yes - large-scale (e.g., NLST: 10M structures from 125K CT images with radiomics) | Yes - 59,492 annotations (portal-verified): SIFT auto-annotations (57,419, Retrospective_auto), mRALE Mastermind Challenge expert annotations (2,072, Retrospective_expert) |
| Bias / fairness tools | Not a primary focus | Core strength: MIDRC-MELODY (subgroup performance), MIDRC-REACT (dataset representativeness), AI Reliability Tool (30 bias sources across 5 pipeline stages), MetricTree (metric selection) |
| Sequestered evaluation | No | Yes - 20% held-out data for unbiased benchmarking and regulatory use |
| Model hosting | No | 30+ algorithms on GitHub, HuggingFace, PhysioNet |
| Real-world testing | No | TDP 2: Real-time connection with healthcare facilities for dynamic algorithm testing |
| NLP tools | No | RadGraph (entity/relation extraction from radiology reports) |
| | IDC | MIDRC |
|---|---|---|
| Clinical data | Cancer staging, treatment history, outcomes (varies by collection); structured via eCRFs | Portal-verified demographics: sex (Female 39.9K, Male 38.5K), race (White 41.5K, Black/AA 22.5K, Asian 3.6K), ethnicity (Not Hispanic 62.7K, Hispanic 8K), COVID status (negative 43.5K, positive 33.8K). Additional: conditions, medications, vitals, labs, procedures. 203,374 measurements and 12,124 radiology reports as dedicated queryable data types |
| Cross-commons linkage | CRDC (GDC genomics, PDC proteomics, ICDC canine) via shared identifiers | N3C (EHR), All of Us, BioData Catalyst (genomics); demonstrated COVID severity prediction use case |
| Clinical data standards | DICOM metadata; BigQuery-queryable | LOINC-harmonized; graph database |
| | IDC | MIDRC |
|---|---|---|
| Primary standard | DICOM (universal harmonization - all data converted to DICOM) | DICOM (native storage) + multiple export formats |
| Identifiers | CRDC UUIDs (instance, series, study) + standard DICOM UIDs | Gen3 object IDs + DICOM UIDs |
| De-identification | Two-stage process; NEMA/RSNA CTP profiles | Stanford De-identifier (text); RSNA DICOM Anonymizer (images); DICOM Harmonization Tool |
| Metadata richness | 4000+ DICOM tags queryable in BigQuery | Curated subset via Gen3 data model |
| | IDC | MIDRC |
|---|---|---|
| Governance model | NCI-funded contract (Leidos + BWH); advisory boards | NIBIB contract; tri-society leadership (ACR, RSNA, AAPM); 5 TDPs + 12 CRPs |
| Open-source | idc-index, viewers, tutorials (GitHub) | Gen3 platform (Apache License); all AI tools open-source with publications |
| User base | Broad cancer research community | >450 registered data users |
| Community engagement | Tutorials, Colab notebooks, 3D Slicer extension | 27 subsites; equity/diversity initiatives; NAIRR pilot |
IDC - No setup beyond pip install idc-index. No authentication for queries or downloads:
```python
from idc_index import IDCClient

client = IDCClient()  # Ready immediately
```
MIDRC - Metadata queries require no authentication (Guppy GraphQL is public). Data downloads require registration + Gen3 SDK:
```python
import requests

# NO AUTH needed for metadata queries via Guppy GraphQL
response = requests.post(
    "https://data.midrc.org/guppy/graphql",
    json={"query": "{ _aggregation { case { _totalCount } imaging_study { _totalCount } } }"},
)
print(response.json())  # Returns: case 84,768; imaging_study 202,222

# For data DOWNLOADS, Gen3 auth is required:
# pip install gen3
from gen3.auth import Gen3Auth

# credentials.json downloaded from https://data.midrc.org/identity (valid 30 days)
auth = Gen3Auth(endpoint="https://data.midrc.org", refresh_file="credentials.json")
```
IDC - SQL queries against a local Parquet index (~50 columns, no network call):
```python
results = client.sql_query("""
    SELECT collection_id, PatientID, SeriesInstanceUID, Modality, series_size_MB
    FROM index
    WHERE Modality = 'CT' AND BodyPartExamined = 'CHEST'
    LIMIT 100
""")
```
MIDRC - GraphQL queries against Guppy (Elasticsearch-backed, no auth for metadata):
```python
import requests

# Direct Guppy GraphQL: no authentication needed
response = requests.post(
    "https://data.midrc.org/guppy/graphql",
    json={"query": """{ imaging_study(first: 100, filter: {AND: [
        {IN: {body_part_examined: ["CHEST"]}},
        {IN: {study_modality: ["CT"]}}
    ]}) { case_ids study_uid study_modality body_part_examined } }"""},
)
studies = response.json()["data"]["imaging_study"]

# Or via Gen3 SDK (requires auth):
# response = query.query(data_type='imaging_study', first=100,
#     fields=["case_ids", "study_uid", "study_modality", "object_id"],
#     filters={"body_part_examined": "CHEST", "study_modality": "CT"})
```
IDC - One-liner, downloads from public S3/GCS buckets (no auth):
```python
client.download_from_selection(
    seriesInstanceUID=["1.3.6.1.4..."],
    downloadDir="./data"
)
```
MIDRC - Two-step: resolve DRS URI, then download via presigned URL:
```python
# CLI approach
# gen3 --endpoint https://data.midrc.org --auth creds.json drs-pull object "dg.XXTS/<guid>"

# Or Python: resolve object_id → presigned URL → download
import requests

object_id = "dg.XXTS/<guid>"  # placeholder: take the object_id returned by a metadata query
url = f"https://data.midrc.org/index/{object_id}"
# auth is the Gen3Auth object created during setup; it exchanges the API key for an access token
headers = {"Authorization": f"Bearer {auth.get_access_token()}"}
file_info = requests.get(url, headers=headers).json()
```
IDC - SQL JOINs across local index tables:
```python
client.fetch_index("clinical_index")
clinical_df = client.get_clinical_table("nlst_prsn")  # loads locally
# Join with imaging data via PatientID or collection_id
```
MIDRC - Separate queries on Gen3 nodes, merged in pandas:
```python
import pandas as pd
from gen3.query import Gen3Query

query = Gen3Query(auth)  # auth is the Gen3Auth object created during setup
# Further arguments (e.g. first, filters) were elided in the original example
imaging = query.query(data_type='imaging_study', fields=["case_ids", "study_uid", "days_to_study"])
measurements = query.query(data_type='measurement', fields=["case_ids", "test_days_from_index"])
cohort = pd.merge(pd.DataFrame(imaging), pd.DataFrame(measurements), on='case_ids')
```
IDC - Generate viewer URL, opens OHIF/SliM automatically:
```python
import webbrowser

url = client.get_viewer_URL(seriesInstanceUID="1.3.6.1.4...")
webbrowser.open(url)
```
MIDRC - No equivalent programmatic viewer URL generation; visualization is portal-based.
| Scenario | IDC | MIDRC |
|---|---|---|
| Install | pip install idc-index | None for metadata (direct HTTP to Guppy); pip install gen3 + register for downloads |
| Auth | None | None for metadata queries; API key (30-day expiry) for downloads |
| Query language | SQL (DuckDB on local Parquet) | GraphQL (Guppy on remote Elasticsearch) |
| Query latency | Milliseconds (local) | Seconds (network) |
| Download method | Direct S3/GCS public URLs | DRS URI resolution → presigned URL (requires auth) |
| Batch download | download_from_selection() with list of UIDs | gen3 drs-pull manifest manifest.json ./output/ |
| Clinical data join | SQL JOIN on local tables | Separate GraphQL queries on 6 data types, merge in pandas |
| Viewer integration | get_viewer_URL() → OHIF/SliM | Portal only |
| Offline capability | Full (index is local Parquet) | None (all queries require network) |
| Queryable data types | 1 primary index + 9 specialized indices | 6 data types: case, imaging_study, data_file, measurement, annotation, radiology_report |
| Tutorials | IDC-Tutorials | MIDRC tutorial_notebooks |
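The Guppy queries shown earlier embed the filter directly in the query text. As an alternative, the filter can be passed as a JSON variable, which avoids hand-assembling GraphQL literals. The sketch below builds such a request body without sending it. The helper name `build_guppy_query` is hypothetical, and the `$filter: JSON` variable form is an assumption about the Guppy schema that should be confirmed against the live endpoint.

```python
import json
from typing import Dict, List


def build_guppy_query(data_type: str, fields: List[str],
                      filters: Dict[str, List[str]], first: int = 100) -> dict:
    """Build a Guppy request body with an AND-of-IN filter passed as a variable."""
    clauses = [{"IN": {field: list(values)}} for field, values in filters.items()]
    query = (
        "query ($filter: JSON) { "
        f"{data_type}(first: {int(first)}, filter: $filter) "
        f"{{ {' '.join(fields)} }} }}"
    )
    return {"query": query, "variables": {"filter": {"AND": clauses}}}


if __name__ == "__main__":
    payload = build_guppy_query(
        data_type="imaging_study",
        fields=["case_ids", "study_uid", "study_modality", "body_part_examined"],
        filters={"body_part_examined": ["CHEST"], "study_modality": ["CT"]},
    )
    print(json.dumps(payload, indent=2))
    # To run it (network required):
    # requests.post("https://data.midrc.org/guppy/graphql", json=payload)
```

Separating the filter from the query text also makes it easier to reuse one query across several cohorts by swapping only the variables.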
IDC and MIDRC are complementary, not competing:
- MIDRC's BDF Imaging Hub federates IDC - users can search IDC data through MIDRC's unified interface alongside TCIA, ACRDart, Stanford AIMI, NIHCC, and OpenNeuro (7 repositories total; see Appendix C)
- Different funding agencies (NCI vs NIBIB) with different mandates
- Different technical philosophies: IDC emphasizes DICOM harmonization + cloud-native analytics; MIDRC emphasizes interoperability + federated access + bias-aware AI
- Minimal data overlap: IDC = cancer imaging from NCI programs; MIDRC = COVID-19/respiratory + multi-institutional clinical data
- Shared standards: Both use DICOM, both integrate with TCIA
| Use case | Recommended platform |
|---|---|
| Cancer imaging research (any organ) | IDC |
| COVID-19 / respiratory disease imaging | MIDRC |
| Large-scale cloud analytics (BigQuery, SQL) | IDC |
| AI fairness / bias evaluation | MIDRC |
| Unbiased model benchmarking (sequestered data) | MIDRC |
| No-registration, immediate data access | IDC |
| Multi-modal integration (imaging + EHR + genomics) | MIDRC (via N3C, BioData Catalyst links) |
| Digital pathology | IDC |
| Real-world clinical AI testing | MIDRC (TDP 2) |
| Cross-repository federated search | MIDRC (BDF Imaging Hub, which includes IDC) |
| Reproducible, versioned research pipelines | IDC (formal versioning + CRDC UUIDs) |
| Commercial use of data | IDC (97%+ of series CC-BY; check license_short_name per object) |
IDC stores all derived data (segmentations, annotations, measurements) in standard DICOM formats alongside original images. This is a key architectural distinction — derived data is queryable and downloadable using the same tools as original images.
| DICOM Modality | Description | Patients | Series | Instances | Size (GB) |
|---|---|---|---|---|---|
| SR | Structured Reports (radiomics, measurements) | 29,065 | 270,687 | 270,687 | 143 |
| SEG | DICOM Segmentation objects | 42,023 | 188,013 | 214,182 | 19,474 |
| ANN | Bulk Simple Annotations (pathology) | 5,431 | 7,108 | 7,108 | 1,824 |
| RTSTRUCT | RT Structure Sets | 1,747 | 4,938 | 4,938 | 10 |
| M3D | 3D model objects | 842 | 2,328 | 2,328 | 0.3 |
| PR, KO, RWV, REG | Presentation states, key objects, etc. | varies | 1,833 | 1,932 | ~0.1 |
| Total derived | | | 475,431 | | 21,455 |
Top collections by coverage:
| Analysis Result | Description | Subjects | Source Collections | Modalities |
|---|---|---|---|---|
| TotalSegmentator v1.5.6 | AI segmentation of 104 anatomical structures in CT | 26,194 | nlst | SEG, SR |
| TCGA-SBU-TIL-Maps | Tumor-infiltrating lymphocyte maps | 7,600 | 23 TCGA collections | SEG |
| Pan-Cancer-Nuclei-Seg | Nuclei segmentation in H&E pathology | 5,185 | 14 TCGA collections | ANN, SEG |
| BAMF-AIMI-Annotations | Multi-organ AI segmentations | 4,226 | 22 collections | SEG |
| nnU-Net-BPR-annotations | Body part regression for CT | 985 | nlst, nsclc_radiomics | SEG, SR |
| DICOM-LIDC-IDRI-Nodules | Standardized lung nodule annotations | 875 | lidc_idri | SEG, SR |
Top segmented collections:
| Collection | Modality | Source Series Segmented | Seg Series |
|---|---|---|---|
| nlst | CT | 126,077 | 128,830 |
| ispy2 | MR | 2,688 | 2,688 |
| acrin_6698 | MR | 2,213 | 2,213 |
| upenn_gbm | MR | 2,164 | 2,384 |
| ispy1 | MR | 1,992 | 2,568 |
| tcga_brca | SM (pathology) | 1,130 | 4,224 |
Top segmentation algorithms:
| Algorithm | Type | Seg Series | Source Series |
|---|---|---|---|
| TotalSegmentator v1.5.6 | AUTOMATIC | 126,051 | 126,051 |
| Stony Brook TIL Inception-V4 2022 | AUTOMATIC | 15,868 | 7,934 |
| Pan-Cancer-Nuclei-Seg | AUTOMATIC | 6,074 | 6,064 |
| BAMF-Brain-MR | AUTOMATIC | 2,164 | 2,164 |
| 3d_fullres-tta_nnU-Net | AUTOMATIC | 1,453 | 1,453 |
| BAMF-Prostate-MR | AUTOMATIC | 1,164 | 1,161 |
| BAMF-Lung-CT-v2 | AUTOMATIC | 1,158 | 1,158 |
Segments per segmentation: Most TotalSegmentator SEGs contain 73-80 segments (anatomical structures per scan); single-structure segmentations (44,603 series) are common for tumor/lesion annotations.
| Graphic Type | Groups | Total Annotations |
|---|---|---|
| POLYGON | 6,075 | 6.17 billion |
| RECTANGLE | 9,452 | 111,181 |
All polygon annotations generated by Pan-Cancer-Nuclei-Seg algorithm — nuclei boundary polygons across 14 TCGA pathology collections.
Top SR collections (radiomics features, measurements):
| Collection | Patients | SR Series |
|---|---|---|
| nlst | 26,205 | 257,391 |
| lidc_idri | 875 | 6,859 |
| nsclc_radiomics | 414 | 2,419 |
| lung_pet_ct_dx | 354 | 1,091 |
| ispy1 | 221 | 845 |
The NLST SRs contain radiomics features (~20 per segmented structure: volume, surface area, flatness, CT intensity statistics) for nearly 10 million structures from 125,000 CT images.
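The NLST figures above can be sanity-checked with simple arithmetic. The sketch below uses the rounded values quoted in the text ("nearly 10 million" structures, 125,000 CT images, roughly 20 features per structure), so the results are order-of-magnitude estimates rather than exact counts.

```python
# Rounded figures quoted in the text; treat results as approximate.
STRUCTURES = 10_000_000        # "nearly 10 million" segmented structures
CT_IMAGES = 125_000            # CT images with segmentations
FEATURES_PER_STRUCTURE = 20    # "~20 per segmented structure"

structures_per_image = STRUCTURES / CT_IMAGES
approx_feature_values = STRUCTURES * FEATURES_PER_STRUCTURE

print(f"Structures per CT image: about {structures_per_image:.0f}")
print(f"Radiomics feature values: about {approx_feature_values:,}")
```

About 80 structures per image agrees with the 73-80 segments per TotalSegmentator segmentation reported earlier, and the feature total is on the order of 200 million values.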
MIDRC's annotation approach differs fundamentally from IDC's:
| Aspect | IDC | MIDRC |
|---|---|---|
| Annotation format | DICOM SEG, SR, ANN, RTSTRUCT (standardized, machine-readable) | DICOM + NIfTI, JSON, TXT, CSV, XLSX (multiple formats); Indexd reports 492K DCM, 122K JSON, 12K TXT, 2.5K NIfTI files |
| Scale | 475,431 derived series, 6.17 billion nuclei annotations | 59,492 annotations (portal-verified) + 58K SEG + 1K SR series (BIH) + 203,374 measurements + 12,124 radiology reports |
| Annotation type | Volumetric segmentations, radiomics features, nuclei polygons, TIL maps | SIFT auto-segmentations (57,419), mRALE severity scores (2,072 expert), diagnostic labels, COVID severity |
| Discovery | SQL-queryable via seg_index, ann_index, ann_group_index | Guppy GraphQL on annotation data type (no auth); queryable fields include annotation_method (Retrospective_auto, Retrospective_expert) and annotation_name (SIFT, midrc_mRALE_Mastermind_Challenge) |
| AI-generated at scale | Yes (TotalSegmentator, BAMF-AIMI, Nuclei-Seg) | Yes — SIFT algorithm produced 57,419 auto annotations; 30+ tools also published externally (GitHub/HuggingFace) |
| Algorithms catalogued | Yes (AlgorithmName, AlgorithmType indexed) | Partially — annotation_name and annotation_method are queryable via Guppy; not as granular as IDC's per-segment metadata |
| Additional data types | — | 203,374 measurements and 12,124 radiology reports are distinct queryable data types in MIDRC (not available in IDC) |
The seg_index provides per-series attributes: AlgorithmName, AlgorithmType (AUTOMATIC, MANUAL, SEMIAUTOMATIC), SegmentationType (BINARY, FRACTIONAL), total_segments, and segmented_SeriesInstanceUID (link to source image).
```python
from idc_index import IDCClient

client = IDCClient()
client.fetch_index("seg_index")

# Find TotalSegmentator automatic segmentations in NLST
nlst_segs = client.sql_query("""
    SELECT
        s.SeriesInstanceUID AS seg_series,
        s.segmented_SeriesInstanceUID AS source_series,
        s.AlgorithmName,
        s.AlgorithmType,
        s.SegmentationType,
        s.total_segments
    FROM seg_index s
    JOIN index i ON s.segmented_SeriesInstanceUID = i.SeriesInstanceUID
    WHERE i.collection_id = 'nlst'
      AND s.AlgorithmName = 'TotalSegmentator v1.5.6'
      AND s.AlgorithmType = 'AUTOMATIC'
    LIMIT 5
""")
# Each row has 77 segments (104 anatomical structures grouped into series)

# Download segmentation + its source CT together
source_uid = nlst_segs.iloc[0]['source_series']
seg_uid = nlst_segs.iloc[0]['seg_series']
client.download_from_selection(
    seriesInstanceUID=[source_uid, seg_uid],
    downloadDir="./nlst_with_seg"
)
```
The ann_group_index provides rich DICOM-coded attributes: AnnotationGroupLabel, AnnotationPropertyCategory_CodeMeaning, AnnotationPropertyType_CodeMeaning, GraphicType (POLYGON, RECTANGLE), NumberOfAnnotations, AlgorithmName, and AnnotationGroupGenerationType (AUTOMATIC, MANUAL).
client.fetch_index("ann_index")
client.fetch_index("ann_group_index")
# Find nuclei polygon annotations specifically (not all ANN are nuclei!)
nuclei_ann = client.sql_query("""
SELECT
g.SeriesInstanceUID as ann_series,
a.referenced_SeriesInstanceUID as source_slide,
g.AnnotationGroupLabel,
g.AnnotationPropertyType_CodeMeaning,
g.GraphicType,
g.NumberOfAnnotations,
g.AlgorithmName
FROM ann_group_index g
JOIN ann_index a ON g.SeriesInstanceUID = a.SeriesInstanceUID
JOIN index i ON a.referenced_SeriesInstanceUID = i.SeriesInstanceUID
WHERE g.AnnotationPropertyType_CodeMeaning = 'Nucleus'
AND g.AnnotationGroupGenerationType = 'AUTOMATIC'
AND i.collection_id = 'tcga_brca'
LIMIT 5
""")
# Find manual blood cell annotations (bone marrow pathology)
blood_cells = client.sql_query("""
SELECT
g.AnnotationGroupLabel,
g.AnnotationPropertyType_CodeMeaning,
g.AnnotationGroupGenerationType,
COUNT(DISTINCT g.SeriesInstanceUID) as ann_series,
SUM(g.NumberOfAnnotations) as total
FROM ann_group_index g
WHERE g.AnnotationGroupGenerationType = 'MANUAL'
AND g.AnnotationPropertyCategory_CodeMeaning = 'Anatomical structure'
GROUP BY g.AnnotationGroupLabel, g.AnnotationPropertyType_CodeMeaning,
g.AnnotationGroupGenerationType
ORDER BY ann_series DESC
LIMIT 10
""")

SR series contain measurements and radiomics features, but their content varies, so use SeriesDescription and analysis_result_id to identify what they contain. Individual radiomics features within SRs are not searchable via idc-index; use BigQuery for feature-level queries.
# Find TotalSegmentator radiomics SRs (shape and firstorder features)
ts_radiomics = client.sql_query("""
SELECT
SeriesInstanceUID,
PatientID,
SeriesDescription,
analysis_result_id
FROM index
WHERE Modality = 'SR'
AND analysis_result_id = 'TotalSegmentator-CT-Segmentations'
AND SeriesDescription LIKE '%shape%'
LIMIT 5
""")
# These contain shape radiomics (volume, surface area, flatness, etc.)
# Find non-radiomics SRs: lesion bounding boxes, clinical reports
other_sr = client.sql_query("""
SELECT SeriesDescription, analysis_result_id, COUNT(*) as count
FROM index
WHERE Modality = 'SR'
AND analysis_result_id NOT IN ('TotalSegmentator-CT-Segmentations')
GROUP BY SeriesDescription, analysis_result_id
ORDER BY count DESC
LIMIT 10
""")
# Returns: BPR annotations, breast imaging reports, tumor bounding boxes, etc.
# For feature-level radiomics queries, BigQuery is required:
# SELECT * FROM `bigquery-public-data.idc_current.measurement_groups`
# WHERE finding_category = 'Radiomics' AND ...

MIDRC annotations are linked to imaging studies through the Gen3 graph model. Querying annotation metadata requires navigating Gen3 nodes rather than querying annotation-level attributes directly.
from gen3.auth import Gen3Auth
from gen3.query import Gen3Query
auth = Gen3Auth(endpoint="https://data.midrc.org", refresh_file="credentials.json")
query = Gen3Query(auth)
# Query imaging studies — annotations are linked as separate file nodes
response = query.query(
data_type='imaging_study',
first=100,
fields=["case_ids", "study_uid", "study_modality", "object_id",
"study_description", "body_part_examined"],
filters={
"body_part_examined": "CHEST",
"study_modality": "CR"
}
)
# Annotation files are separate nodes in Gen3 graph
# Navigate: case → imaging_study → annotation_file
# No equivalent to IDC's seg_index/ann_group_index for querying
# annotation attributes (algorithm, segment count, property type) directly
# Download via DRS:
# gen3 drs-pull object "dg.XXTS/<object_id>"

Key difference: In IDC, segmentations and annotations are first-class DICOM objects with dedicated queryable indices exposing algorithm names, segment counts, annotation property types, and graphic types. In MIDRC, annotations are queryable via Guppy GraphQL as a dedicated data type with annotation_method and annotation_name fields; this is less granular than IDC's per-segment metadata, but publicly accessible without authentication. MIDRC also offers unique data types (measurements, radiology reports) not available in IDC.
IDC modalities:
| Modality | Patients | Series | Instances | Size (GB) |
|---|---|---|---|---|
| SEG (Segmentation) | 42,023 | 188,013 | 214,182 | 19,474 |
| CT | 35,172 | 252,008 | 29,163,879 | 15,501 |
| SR (Structured Report) | 29,065 | 270,687 | 270,687 | 143 |
| SM (Slide Microscopy) | 20,344 | 71,132 | 380,054 | 47,224 |
| MG (Mammography) | 17,026 | 48,125 | 268,365 | 5,304 |
| MR | 8,062 | 124,072 | 15,189,444 | 4,837 |
| ANN (Annotations) | 5,431 | 7,108 | 7,108 | 1,824 |
| RTSTRUCT | 1,747 | 4,938 | 4,938 | 10 |
| CR (Computed Radiography) | 1,705 | 12,416 | 12,427 | 186 |
| US (Ultrasound) | 1,411 | 2,240 | 5,128 | 380 |
| PT (PET) | 1,143 | 4,065 | 1,338,343 | 74 |
| Other (M3D, DX, OT, RT*, PR, NM, REG, etc.) | varies | ~8,900 | ~28,000 | ~375 |
IDC body parts:
| Body Part | Patients | Series |
|---|---|---|
| CHEST | 30,745 | 356,133 |
| BREAST | 20,919 | 106,757 |
| ABDOMEN | 1,841 | 10,601 |
| LUNG | 1,442 | 10,971 |
| PROSTATE | 1,035 | 23,008 |
| COLON | 850 | 3,544 |
| PELVIS | 808 | 5,713 |
| LIVER | 527 | 3,989 |
| BRAIN | 494 | 2,044 |
| KIDNEY | 359 | 3,466 |
IDC licenses:
| License | Collections | Patients | Series | Size (GB) |
|---|---|---|---|---|
| CC BY 4.0 | 102 | 55,097 | 817,461 | 68,228 |
| CC BY 3.0 | 92 | 27,797 | 146,372 | 24,708 |
| CC BY-NC 4.0 | 5 | 6,570 | 28,783 | 2,039 |
| CC BY-NC 3.0 | 3 | 533 | 1,418 | 46 |
| NLM Terms | 1 | 2 | 39 | 313 |
Note: Licenses are assigned per DICOM object (series), not per collection. A single collection may contain objects under different licenses (e.g., original images under CC BY 3.0 and contributed analysis results under CC BY 4.0). Always check license_short_name on individual series before use. Collection counts per license sum to more than 161 because some collections span multiple licenses.
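Because licenses attach to series, a download list is best filtered row by row. The following is a minimal sketch, assuming rows shaped like idc-index query results with a license_short_name column. The helper name, the sample rows, and the placeholder UIDs are hypothetical, and the label strings are assumed to match the table above.

```python
# Licenses from the table above that permit commercial reuse (assumption:
# these label strings match the values stored in license_short_name).
COMMERCIAL_OK = {"CC BY 4.0", "CC BY 3.0"}

def split_by_license(rows, allowed=COMMERCIAL_OK):
    """Partition series rows into (usable, excluded) UIDs by license."""
    usable, excluded = [], []
    for row in rows:
        target = usable if row.get("license_short_name") in allowed else excluded
        target.append(row["SeriesInstanceUID"])
    return usable, excluded

# Hypothetical rows: one collection whose series carry different licenses.
sample_rows = [
    {"SeriesInstanceUID": "1.2.3.1", "license_short_name": "CC BY 3.0"},
    {"SeriesInstanceUID": "1.2.3.2", "license_short_name": "CC BY 4.0"},
    {"SeriesInstanceUID": "1.2.3.3", "license_short_name": "CC BY-NC 4.0"},
]
usable, excluded = split_by_license(sample_rows)
print(usable, excluded)
```

Rows with a missing or unrecognized license fall into the excluded list, which errs on the side of caution.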
All queries below were tested on 2026-02-12 against the live BIH Guppy GraphQL endpoint. No authentication was required.
POST https://imaging-hub.data-commons.org/guppy/graphql
Content-Type: application/json
The BIH uses Gen3's Guppy service backed by Elasticsearch. Four data types are indexed: subject, imaging_study, imaging_series, and dataset.
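The same kind of request can also be issued from Python with only the standard library. This is a sketch rather than an official client: total_count_query and post_graphql are hypothetical helper names, and the live call is left commented out because it needs network access.

```python
import json
import urllib.request

BIH_GUPPY_URL = "https://imaging-hub.data-commons.org/guppy/graphql"

def total_count_query(data_types):
    """Build a Guppy _aggregation query asking for _totalCount per data type."""
    parts = " ".join(f"{name} {{ _totalCount }}" for name in data_types)
    return f"{{ _aggregation {{ {parts} }} }}"

def post_graphql(url, query, timeout=30):
    """POST a GraphQL query as JSON and return the decoded response body."""
    body = json.dumps({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)

query = total_count_query(["subject", "imaging_study", "imaging_series"])
print(query)
# To run against the live endpoint (network access required):
# counts = post_graphql(BIH_GUPPY_URL, query)["data"]["_aggregation"]
```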
curl -s -X POST https://imaging-hub.data-commons.org/guppy/graphql \
-H "Content-Type: application/json" \
-d '{"query": "{ _aggregation { subject { _totalCount } imaging_study { _totalCount } imaging_series { _totalCount } } }"}'

Result:
{
"data": {
"_aggregation": {
"subject": { "_totalCount": 327895 },
"imaging_study": { "_totalCount": 722695 },
"imaging_series": { "_totalCount": 1963865 }
}
}
}

curl -s -X POST https://imaging-hub.data-commons.org/guppy/graphql \
-H "Content-Type: application/json" \
-d '{"query": "{ _aggregation { subject { commons_name { histogram { key count } } } } }"}'

Result:
| Repository | Subjects |
|---|---|
| MIDRC | 76,193 |
| IDC | 69,223 |
| Stanford AIMI | 65,514 |
| OpenNeuro | 49,010 |
| NIHCC | 35,232 |
| TCIA | 28,434 |
| ACRdart | 4,289 |
curl -s -X POST https://imaging-hub.data-commons.org/guppy/graphql \
-H "Content-Type: application/json" \
-d '{"query": "{ _aggregation { imaging_series { commons_name { histogram { key count } } } } }"}'

Result:
| Repository | Series |
|---|---|
| IDC | 946,957 |
| MIDRC | 469,324 |
| Stanford AIMI | 224,496 |
| TCIA | 201,337 |
| ACRdart | 107,150 |
| NIHCC | 14,601 |
Note: IDC leads in series count despite fewer subjects due to its multi-modality, multi-series-per-study collections (e.g., MRI protocols with T1, T2, DWI, ADC series per exam).
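The ratio behind this note can be computed from the two tables above. The sketch below covers only the repositories that appear in both tables (OpenNeuro has no row in the series table), and it is a rough check, since subjects and series come from separate aggregations.

```python
# Subject and series counts per repository, copied from the two tables above.
subjects = {"IDC": 69_223, "MIDRC": 76_193, "Stanford AIMI": 65_514,
            "NIHCC": 35_232, "TCIA": 28_434, "ACRdart": 4_289}
series = {"IDC": 946_957, "MIDRC": 469_324, "Stanford AIMI": 224_496,
          "NIHCC": 14_601, "TCIA": 201_337, "ACRdart": 107_150}

# Average number of series per subject, rounded to two decimals.
series_per_subject = {
    repo: round(series[repo] / subjects[repo], 2) for repo in series
}
for repo, ratio in sorted(series_per_subject.items(), key=lambda kv: -kv[1]):
    print(f"{repo}: {ratio} series per subject")
```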
curl -s -X POST https://imaging-hub.data-commons.org/guppy/graphql \
-H "Content-Type: application/json" \
-d '{
"query": "{ imaging_series(first: 3, filter: {AND: [{IN: {commons_name: [\"IDC\"]}}, {IN: {Modality: [\"CT\"]}}]}) { SeriesInstanceUID Modality BodyPartExamined commons_name collection_id } }"
}'

Result: individual series with SeriesInstanceUID, Modality, BodyPartExamined, commons_name, and collection_id, which is sufficient to then retrieve the actual data from the source repository (e.g., via idc-index for IDC series).
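Hand-escaping quotes inside the curl payload is error-prone, so the filter can be assembled programmatically instead. The sketch below rebuilds the query string used above from Python values; in_filter and series_query are hypothetical helper names, and json.dumps handles the quoting of both the value lists and the final request body.

```python
import json

def in_filter(field, values):
    """Render one Guppy IN clause as a GraphQL object literal."""
    # json.dumps yields a double-quoted string list, which is valid here.
    return f"{{IN: {{{field}: {json.dumps(values)}}}}}"

def series_query(commons, modality, fields, first=3):
    """Build an imaging_series query filtered by repository and modality."""
    clauses = ", ".join([in_filter("commons_name", [commons]),
                         in_filter("Modality", [modality])])
    selection = " ".join(fields)
    return (f"{{ imaging_series(first: {first}, filter: {{AND: [{clauses}]}}) "
            f"{{ {selection} }} }}")

query = series_query(
    "IDC", "CT",
    ["SeriesInstanceUID", "Modality", "BodyPartExamined",
     "commons_name", "collection_id"],
)
payload = json.dumps({"query": query})  # request body with quotes escaped
print(payload)
```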
curl -s -X POST https://imaging-hub.data-commons.org/guppy/graphql \
-H "Content-Type: application/json" \
-d '{"query": "{ _aggregation { imaging_series { Modality { histogram { key count } } } } }"}'

Returns series counts per modality (CT, MR, CR, DX, SM, etc.) aggregated across all 7 federated repositories.
The BIH API can be filtered by commons_name to extract verified statistics for any single repository. This was used to independently verify MIDRC data scale.
MIDRC totals: 76,193 subjects, 189,854 studies, 469,324 series
MIDRC modalities (series):
| Modality | Series |
|---|---|
| CT | 170,150 |
| CR (Computed Radiography) | 126,858 |
| DX (Digital X-ray) | 106,539 |
| SEG (Segmentation) | 58,000 |
| MR | 5,894 |
| SR (Structured Report) | 1,074 |
| PT, RF, NM, US, XA, MG | <300 each |
MIDRC collections (series):
| Collection | Series |
|---|---|
| MIDRC-Open-R1 | 209,142 |
| MIDRC-Open-A1 | 202,202 |
| MIDRC-TCIA-COVID-19-NY-SBU | 24,482 |
| MIDRC-Open-A1_SCCM_VIRUS | 13,906 |
| MIDRC-Open-A1_PETAL_BLUECORAL | 7,948 |
| MIDRC-Open-A1_PETAL_REDCORAL | 7,913 |
| MIDRC-TCIA-RICORD_1c | 2,048 |
| + 4 smaller collections | <700 each |
MIDRC licenses (series):
| License | Series | % |
|---|---|---|
| MIDRC DUA | 441,111 | 94.0% |
| CC BY 4.0 | 25,816 | 5.5% |
| CC BY-NC 4.0 | 2,397 | 0.5% |
MIDRC race distribution (subjects):
| Race | Subjects |
|---|---|
| White | 37,990 |
| Black or African American | 21,393 |
| Not Reported | 5,667 |
| No data | 5,146 |
| Asian | 3,410 |
| Other | 2,256 |
| American Indian or Alaska Native | 176 |
| Native Hawaiian or other Pacific Islander | 155 |
MIDRC top study descriptions (series):
| Study Description | Series |
|---|---|
| XR Chest AP or PA | 215,531 |
| CHEST PORT 1 VIEW | 13,385 |
| CHEST AP PORT | 13,289 |
| CT CHEST WITH CONTRAST | 9,287 |
| CTA CHEST (PE STUDY) | 8,893 |
| CT CHEST PULMONARY EMBOLISM | 8,473 |
| CT CHEST WO CONTRAST | 8,359 |
| CT CHEST W CONTRAST | 8,204 |
- No authentication required for read-only queries: the BIH Guppy endpoint is publicly accessible, unlike file downloads through the MIDRC Gen3 portal, which require registration
- 7 repositories federated (not 5 as initially documented): MIDRC, IDC, TCIA, ACRdart, Stanford AIMI, NIHCC, and OpenNeuro
- Sub-second response times for aggregation queries; filtered queries with `first: N` also return quickly
- Cross-repository discovery workflow: Query BIH to find series across repositories → use `commons_name` + `SeriesInstanceUID`/`collection_id` to retrieve data from source (e.g., `idc-index` for IDC, Gen3 DRS for MIDRC)
- Metadata only: the BIH indexes structured metadata (identifiers, modality, body part, collection); actual DICOM files remain at source repositories
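The discovery workflow in the list above amounts to grouping BIH results by commons_name and handing each group to the right retrieval tool. A minimal sketch under stated assumptions: the routing table, the fallback label, and the sample rows with placeholder UIDs are all hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping from commons_name to a retrieval route. The IDC and
# MIDRC routes follow the workflow above; the fallback label is illustrative.
RETRIEVAL_ROUTE = {"IDC": "idc-index", "MIDRC": "Gen3 DRS"}

def group_by_route(rows):
    """Group BIH series rows by the tool needed to fetch them at the source."""
    plan = defaultdict(list)
    for row in rows:
        route = RETRIEVAL_ROUTE.get(row["commons_name"], "source repository portal")
        plan[route].append(row["SeriesInstanceUID"])
    return dict(plan)

# Hypothetical BIH rows (placeholder UIDs, not real series).
rows = [
    {"commons_name": "IDC", "SeriesInstanceUID": "1.2.840.1"},
    {"commons_name": "MIDRC", "SeriesInstanceUID": "1.2.840.2"},
    {"commons_name": "IDC", "SeriesInstanceUID": "1.2.840.3"},
    {"commons_name": "TCIA", "SeriesInstanceUID": "1.2.840.4"},
]
plan = group_by_route(rows)
print(plan)
```

The list under "idc-index" could then be passed to client.download_from_selection(seriesInstanceUID=...), as in the IDC download example earlier in this report.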
All queries below were tested on 2026-02-12 against the live MIDRC portal Guppy GraphQL endpoint. No authentication was required. This is distinct from the BDF Imaging Hub API (Appendix C) — the portal Guppy indexes MIDRC-specific data at higher granularity.
| Endpoint | Method | Auth | Returns |
|---|---|---|---|
| POST https://data.midrc.org/guppy/graphql | GraphQL | None | Metadata queries across 6 data types |
| GET https://data.midrc.org/guppy/_status | REST | None | Index schema (data types, fields, array fields) |
| GET https://data.midrc.org/index/_stats | REST | None | Total file count and data size |
| GET https://data.midrc.org/mds/metadata?limit=N | REST | None | Object GUIDs and metadata |
The MIDRC portal indexes 6 data types via Guppy (vs 4 in BIH):
| Data Type | Total Count | Key Fields |
|---|---|---|
| case | 84,768 | sex, race, ethnicity, age_at_index, covid19_positive, zip, conditions, medications, procedures, measurements, imaging_studies (array fields) |
| imaging_study | 202,222 | case_ids, study_uid, study_modality, body_part_examined, days_to_study, loinc_code, loinc_contrast, loinc_long_common_name, sex, race, age_at_index, covid19_positive |
| data_file | 631,786 | data_format (DCM/JSON/TXT/NII), data_type, data_category, instance_uids, contrast_bolus_agent, convolution_kernel, diffusion_b_value, echo_time, pixel_spacing, slice_thickness, study_year |
| measurement | 203,374 | case_ids, test_days_from_index, test_name, test_result_text, dataset_submitter_id |
| annotation | 59,492 | case_ids, instance_uids, annotation_method, annotation_name |
| radiology_report | 12,124 | body_part_examined, study_modality, sex, race, age_at_index, covid19_positive, ethnicity |
curl -s -X POST https://data.midrc.org/guppy/graphql \
-H "Content-Type: application/json" \
-d '{"query": "{ _aggregation { case { _totalCount } imaging_study { _totalCount } data_file { _totalCount } measurement { _totalCount } annotation { _totalCount } radiology_report { _totalCount } } }"}'

Result:
{
"data": {
"_aggregation": {
"case": {"_totalCount": 84768},
"imaging_study": {"_totalCount": 202222},
"data_file": {"_totalCount": 631786},
"measurement": {"_totalCount": 203374},
"annotation": {"_totalCount": 59492},
"radiology_report": {"_totalCount": 12124}
}
}
}

curl -s -X POST https://data.midrc.org/guppy/graphql \
-H "Content-Type: application/json" \
-d '{"query": "{ _aggregation { case { sex { histogram { key count } } race { histogram { key count } } ethnicity { histogram { key count } } covid19_positive { histogram { key count } } } } }"}'

Result:
| Field | Value | Count |
|---|---|---|
| sex | Female | 39,914 |
| | Male | 38,500 |
| | no data | 6,330 |
| | Not reported | 24 |
| race | White | 41,466 |
| | Black or African American | 22,467 |
| | no data | 8,484 |
| | Not Reported | 5,655 |
| | Asian | 3,571 |
| | Other | 2,267 |
| | American Indian or Alaska Native | 179 |
| | Native Hawaiian or other Pacific Islander | 156 |
| ethnicity | Not Hispanic or Latino | 62,678 |
| | Hispanic or Latino | 8,004 |
| | no data | 8,396 |
| | Not reported | 5,670 |
| | Unknown | 20 |
| covid19_positive | No | 43,473 |
| | Yes | 33,842 |
| | no data | 7,453 |
curl -s -X POST https://data.midrc.org/guppy/graphql \
-H "Content-Type: application/json" \
-d '{"query": "{ _aggregation { imaging_study { study_modality { histogram { key count } } } data_file { data_format { histogram { key count } } data_type { histogram { key count } } } } }"}'

Result:
| Dimension | Value | Count |
|---|---|---|
| study_modality | DX | 94,587 |
| | CR | 79,646 |
| | CT | 25,615 |
| | MR | 2,479 |
| data_format | DCM | 492,134 |
| | JSON | 122,091 |
| | TXT | 12,124 |
| | NII | 2,510 |
| | CSV | 1,888 |
| | XLSX | 1,039 |
| data_type | DICOM | 433,060 |
| | External Annotation | 177,070 |
| | Radiology Report | 12,124 |
| | Internal Annotation | 8,676 |
| | Clinical Supplement | 856 |
curl -s -X POST https://data.midrc.org/guppy/graphql \
-H "Content-Type: application/json" \
-d '{"query": "{ _aggregation { annotation { annotation_method { histogram { key count } } annotation_name { histogram { key count } } } } }"}'

Result:
| Field | Value | Count |
|---|---|---|
| annotation_method | Retrospective_auto | 57,420 |
| | Retrospective_expert | 2,072 |
| annotation_name | SIFT | 57,419 |
| | midrc_mRALE_Mastermind_Challenge | 2,072 |
| | no data | 1 |
curl -s -X POST https://data.midrc.org/guppy/graphql \
-H "Content-Type: application/json" \
-d '{"query": "{ _aggregation { case { dataset { histogram { key count } } } } }"}'

Result (top 10 of 78 datasets):
| Dataset | Cases |
|---|---|
| RSNA_20230830 | 21,465 |
| RSNA_20230725 | 13,605 |
| RSNA_20231012 | 10,456 |
| RSNA_20221011 | 8,279 |
| RSNA_20230420 | 5,576 |
| ACR_20230530 | 5,026 |
| RSNA_20220412 | 3,972 |
| ACR_20230823 | 3,395 |
| ACR_20240226 | 2,539 |
| RSNA_20240315 | 2,369 |
curl -s https://data.midrc.org/index/_stats

Result:
{"fileCount": 637431, "totalFileSize": 12360285440149}

Total: 637,431 files, 12.36 TB
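The 12.36 TB figure uses decimal terabytes. A short sketch of the conversion from the byte count in the response above; the binary (TiB) value and the mean file size are added here only for comparison.

```python
import json

# Response body returned by GET /index/_stats, as shown above.
stats = json.loads('{"fileCount": 637431, "totalFileSize": 12360285440149}')

size_tb = stats["totalFileSize"] / 1000**4    # decimal terabytes
size_tib = stats["totalFileSize"] / 1024**4   # binary tebibytes
mean_mb = stats["totalFileSize"] / stats["fileCount"] / 1000**2

print(f"{size_tb:.2f} TB, {size_tib:.2f} TiB, {mean_mb:.1f} MB per file on average")
```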
| Metric | MIDRC Portal | BIH (MIDRC filter) | Difference |
|---|---|---|---|
| Cases/Subjects | 84,768 | 76,193 | +11.3% |
| Studies | 202,222 | 189,854 | +6.5% |
| Files vs Series | 631,786 files | 469,324 series | Different units |
Likely explanations:
- BIH indexes at series level, not file level — one series may contain multiple files (DICOM instances), plus annotation files (JSON, NIfTI, TXT) are files but not DICOM series
- BIH may lag behind the portal — federated index may not be synchronized in real-time
- Different data types indexed — the portal indexes 6 data types including measurements and radiology reports; BIH indexes 4 (subject, imaging_study, imaging_series, dataset)
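The size of the gap can be recomputed from the counts in the comparison table. The sketch assumes the BIH count is the baseline for the percentage, which is an assumption about how the Difference column was derived.

```python
def percent_more(portal, bih):
    """Percent by which the portal count exceeds the BIH count."""
    return 100 * (portal - bih) / bih

cases_gap = percent_more(84_768, 76_193)      # cases vs subjects
studies_gap = percent_more(202_222, 189_854)  # imaging studies
print(f"cases: +{cases_gap:.1f}%, studies: +{studies_gap:.1f}%")
```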
| Capability | IDC (idc-index) | MIDRC (Portal Guppy) |
|---|---|---|
| Auth required | No | No |
| Query interface | SQL (local DuckDB) | GraphQL (remote Elasticsearch) |
| Data types | 1 primary + 9 specialized indices | 6 data types |
| Latency | Milliseconds (local) | Seconds (network) |
| Offline | Yes (Parquet cached locally) | No |
| Schema discovery | `client.indices_overview` | `GET /guppy/_status` |
| Filter + paginate | SQL LIMIT/OFFSET/WHERE | GraphQL `first`/`offset`/`filter` |
| Aggregations | SQL GROUP BY, COUNT, SUM | GraphQL `_aggregation` with `histogram` |
| Instance-level metadata | Yes (4000+ DICOM tags via BigQuery) | Partial (select fields on data_file type: pixel_spacing, slice_thickness, echo_time, etc.) |
| Annotation metadata | Rich (algorithm name, type, segment count, property codes) | Basic (annotation_method, annotation_name) |
| Data Source | Verification Method | Depth |
|---|---|---|
| IDC | `idc-index` v0.11.9 programmatic queries | Exact counts (patients, series, instances, sizes, modalities, body parts, licenses, segmentation algorithms, annotation types) |
| MIDRC (portal Guppy) | `data.midrc.org/guppy/graphql` + `data.midrc.org/index/_stats` | Exact counts (cases, studies, files, data volume in TB, demographics, COVID status, modalities, annotations, measurements, radiology reports, datasets) |
| MIDRC (via BIH) | `imaging-hub.data-commons.org/guppy/graphql` | Exact counts (subjects, studies, series, modalities, body parts, collections, licenses, race demographics, study descriptions) |
Both platforms were verified programmatically using publicly accessible APIs that require no authentication. The MIDRC portal Guppy and Indexd APIs resolved several items previously listed as "not publicly stated" — including data volume (12.36 TB), file count (637,431), annotation methods, and demographic breakdowns.
- MIDRC data volume in TB: RESOLVED, 12.36 TB via `data.midrc.org/index/_stats`
- MIDRC instance count: RESOLVED, 637,431 files via Indexd. The portal Guppy data_file index lists 631,786 files by format (492K DCM + 122K JSON + 12K TXT + 2.5K NIfTI + others); note that both figures count files, not DICOM instances per se
- MIDRC annotation-level metadata: RESOLVED. Portal Guppy exposes `annotation_method` (Retrospective_auto, Retrospective_expert) and `annotation_name` (SIFT, mRALE_Mastermind_Challenge) as queryable fields. Less granular than IDC's per-segment metadata but searchable.
- MIDRC sequestered data characteristics: the 80/20 open/sequestered split is documented, but both portal and BIH likely index only the public portion. The actual composition of the sequestered portion could not be independently verified
- Portal vs BIH discrepancy — the MIDRC portal shows more data than BIH (84,768 vs 76,193 cases; 202,222 vs 189,854 studies). Likely explanations: (a) BIH federation lag, (b) BIH indexes fewer data types (4 vs 6), (c) different entity definitions (BIH "subject" vs portal "case"). MIDRC self-reports ">300,000 studies collected" which likely includes sequestered data and pre-release collections
- Cross-commons query performance — CDA (IDC/CRDC) query capabilities were described from documentation, not tested programmatically. Both BIH and MIDRC portal Guppy APIs were tested live (see Appendices C and D) and confirmed to work without authentication
- MIDRC Gen3 SDK code examples — the Gen3 SDK download examples were assembled from documentation but not executed (requires registration). However, the Guppy GraphQL metadata query examples were executed live and verified
- Radiomics feature-level content: IDC's SR series were described by `SeriesDescription` and `analysis_result_id`, but individual radiomics features within SRs are only searchable via BigQuery (not idc-index), and this was not demonstrated
- MIDRC tool maturity: 30+ algorithms are listed on midrc.org, but their integration depth with the data portal was not verified. SIFT appears to be the primary integrated algorithm (57,419 annotations indexed in portal)
- Data freshness — IDC version v23 was current at time of analysis. MIDRC portal and BIH data may differ from each other and from the latest additions
- MIDRC data_file field granularity: the portal Guppy `data_file` type exposes DICOM-level fields (pixel_spacing, slice_thickness, echo_time, contrast_bolus_agent, diffusion_b_value) but the actual queryable values and their coverage across files were not tested
This analysis was generated using Claude Code with an imaging-data-commons skill loaded into context. This skill provided a detailed, authoritative reference for IDC — its data model, SQL patterns, index table schemas, API capabilities, and exact column names. No equivalent skill existed for MIDRC. This created an asymmetry — not by restricting MIDRC research, but by making IDC exploration substantially easier and more directed.
How the skill created asymmetry:
- Structured guidance vs. open-ended search. The IDC skill gave the LLM a roadmap of what to explore (e.g., `seg_index`, `ann_group_index`, `clinical_index`, contrast metadata). For MIDRC, there was no equivalent guide, so the analysis of MIDRC's derived data, clinical tables, and internal metadata structure is shallower, not necessarily because MIDRC has less capability, but because there was less guidance on what to look for.
- Programmatic verification. The skill enabled live `idc-index` queries producing exact, authoritative numbers. For most of the session, programmatic verification of MIDRC was possible only through the BIH Guppy API (aggregate statistics), not through the Gen3 portal itself, which was assumed to require registration. As a result, IDC's numbers are verified at finer granularity (instance counts, per-algorithm segmentation counts, per-annotation-group property types) than MIDRC's.
- Iterative deepening. Each round of user feedback ("add more detail on annotations", "fix the annotation queries") drove deeper into IDC's capabilities because that's where programmatic tools were available. MIDRC didn't receive equivalent iterative deepening until the BIH API was tested late in the process.
Important caveat: The LLM was not prevented or discouraged from researching MIDRC more thoroughly. Web search, web fetch, and API testing tools were available throughout the session and were used for MIDRC research. The asymmetry was one of pull rather than push — the IDC skill actively directed exploration toward specific metadata tables and query patterns, while MIDRC research required the LLM to independently discover what to look for.
This was confirmed by the late-session discovery of MIDRC's public Guppy API. When explicitly prompted to "search more aggressively for MIDRC information," the LLM discovered that data.midrc.org/guppy/graphql is publicly accessible without authentication — resolving multiple "not publicly stated" gaps in a single round of API testing. The LLM could have made this discovery much earlier in the process; the fact that it required user prompting demonstrates the behavioral nature of the bias. The BIH API discovery also came from user prompting, not LLM initiative. The skill made IDC exploration easier, but nothing made MIDRC exploration harder — the LLM simply didn't try as hard without the structured guidance the IDC skill provided.
- The MIDRC portal Guppy API discovery substantially corrected the asymmetry by providing independently verified MIDRC numbers for cases (84,768), studies (202,222), files (637,431), data volume (12.36 TB), 6 data types, demographics, annotation methods, and dataset breakdowns
- The BIH API added verified series-level statistics (469,324 series, modality/body part/collection/license breakdowns) for cross-repository context
- The Indexd API confirmed total data size (12.36 TB) — previously listed as "not publicly stated"
- Web research covered MIDRC's unique strengths (bias/fairness tools, sequestered evaluation, PPRL, federated search, measurements, radiology reports) that IDC does not offer
- This limitations section explicitly documents the asymmetry and its behavioral nature
- A comparable MIDRC/Gen3 skill providing structured guidance for MIDRC exploration (e.g., documenting Guppy data types, field schemas, query patterns — analogous to the IDC skill's index table guide)
- Authenticated Gen3 portal access for testing download workflows and exploring data not exposed via public Guppy (e.g., individual file downloads, workspace integration)
- Testing the Cancer Data Aggregator (CDA) API for IDC's cross-commons capabilities, paralleling the BIH/portal API testing done for MIDRC
- Exploring MIDRC's `data_file` Guppy fields (pixel_spacing, slice_thickness, echo_time, contrast_bolus_agent, etc.) to assess DICOM-level metadata queryability, potentially comparable to IDC's BigQuery
- More proactive LLM exploration of MIDRC resources earlier in the analysis process, without waiting for user prompting
- Fedorov et al., RadioGraphics 2023 - IDC transparency & reproducibility
- Drukker et al., Scientific Data 2025 - MIDRC interoperability use cases
- Bergquist et al., J Imaging Informatics Med 2025 - MIDRC + N3C COVID severity prediction
- NCI CRDC Core Standards and Services - Cancer Research 2024
- IDC Portal | IDC Docs
- MIDRC Portal | MIDRC Website
- MIDRC BDF Imaging Hub
- Gen3 Data Commons Architecture
- NIH Cloud Platform Interoperability
- IDC: `idc-index` v0.11.9 Python package (local Parquet queries)
- MIDRC Portal Guppy: `POST https://data.midrc.org/guppy/graphql`, 6 data types, full metadata queries
- MIDRC Portal Indexd: `GET https://data.midrc.org/index/_stats`, file count and data volume
- MIDRC Portal Guppy Status: `GET https://data.midrc.org/guppy/_status`, schema discovery
- BDF Imaging Hub Guppy: `POST https://imaging-hub.data-commons.org/guppy/graphql`, 7-repository federated search
- imaging-data-commons Claude skill - used for programmatic IDC data verification