claude_report_clinical_data.md

Andrey Fedorov edited this page Feb 12, 2026 · 1 revision

Clinical Data and Age Analysis in NCI Imaging Data Commons

Analysis Date: 2026-02-10
IDC Version: v23 (idc-index 0.11.8)
Collections Analyzed: 97 collections with clinical data
Generated using: IDC Claude Skill


Executive Summary

This report provides a comprehensive analysis of clinical data in the NCI Imaging Data Commons (IDC), with a focus on patient age information. The analysis covers:

  1. Clinical data structure and access patterns in IDC
  2. Semantic commonality of clinical columns across collections
  3. Age data availability from both clinical tables and DICOM metadata
  4. Cross-source consistency between clinical and DICOM age values
  5. Data quality issues and recommendations

Key Findings

| Metric | Count | Percentage |
|---|---|---|
| Total collections with clinical data | 97 | 100% |
| Collections with meaningful clinical age | 55 | 57% |
| Collections with DICOM PatientAge | 69 | 71% |
| Collections with both sources | 39 | 40% |

Critical discoveries:

  • dicom_patient_id is the only universal linking column across all clinical tables (source_batch is also present everywhere, but only tracks provenance)
  • TCGA provides the most standardized clinical data (32 collections with identical schemas)
  • CPTAC collections store age in days since birth, not years (requires ÷365.25 conversion)
  • Clinical and DICOM ages may reference different time points (diagnosis vs. imaging vs. surgery)
  • DICOM PatientAge is sparse or missing in many collections, even when clinical age exists

Table of Contents

  1. Understanding Clinical Data in IDC
  2. Semantic Commonality Across Collections
  3. Age Column Naming Conventions
  4. Complete Collection Analysis
  5. Data Quality Issues
  6. Cross-Source Consistency
  7. Recommendations
  8. Appendix: Code Examples

Understanding Clinical Data in IDC

What is Clinical Data?

Clinical data in IDC refers to non-imaging information that accompanies medical images:

  • Patient demographics (age, sex, race, ethnicity)
  • Clinical history (diagnoses, surgeries, therapies)
  • Lab tests and pathology results
  • Cancer staging (clinical and pathological TNM)
  • Treatment outcomes and survival data

Key Characteristics

| Characteristic | Description |
|---|---|
| Not harmonized | Terms and formats vary across collections (each data submitter uses their own conventions) |
| Not universal | Not all collections have clinical data |
| Anonymized | dicom_patient_id links clinical records to imaging data (PatientID in imaging index) |

The clinical_index Table

The clinical_index serves as a dictionary/catalog of all available clinical data:

| Column | Purpose | Use For |
|---|---|---|
| collection_id | Collection identifier | Filtering by collection |
| table_name | Full BigQuery table reference | BigQuery queries (if needed) |
| short_table_name | Short name | get_clinical_table() method |
| column | Column name in table | Selecting data columns |
| column_label | Human-readable description | Searching for concepts |
| values | Observed attribute values | Interpreting coded values |

Basic Access Pattern

from idc_index import IDCClient

client = IDCClient()

# 1. Fetch the clinical index (catalog of all clinical data)
client.fetch_index('clinical_index')

# 2. See which collections have clinical data
collections = client.clinical_index["collection_id"].unique()
print(f"{len(collections)} collections have clinical data")

# 3. Explore clinical attributes for a specific collection
nlst_columns = client.clinical_index[client.clinical_index['collection_id']=='nlst']

# 4. Load a specific clinical table
nlst_prsn = client.get_clinical_table("nlst_prsn")

# 5. Join with imaging data using dicom_patient_id = PatientID

Semantic Commonality Across Collections

Universal Columns (100% coverage)

Only two columns appear in every clinical table:

| Column | Description |
|---|---|
| dicom_patient_id | Links clinical data to imaging (PatientID in index) |
| source_batch | IDC provenance tracking |

Columns by Prevalence

The analysis identified columns appearing across multiple collections. Many columns at ~33% coverage are TCGA-specific (32 collections with identical schemas):

| Category | Column | Coverage | TCGA | Non-TCGA |
|---|---|---|---|---|
| Demographics | race | 44% (43) | 32 | 11 |
| | gender | 40% (39) | 32 | 7 |
| | ethnicity | 38% (37) | 32 | 5 |
| | age_at_diagnosis | 36% (35) | 32 | 3 |
| | age | 14% (14) | 0 | 14 |
| | sex | 12% (12) | 0 | 12 |
| Survival | vital_status | 36% (35) | 32 | 3 |
| | days_to_death | 33% (32) | 32 | 0 |
| | days_to_last_followup | 34% (33) | 32 | 1 |
| Cancer Staging | clinical_stage | 33% (32) | 32 | 0 |
| | clinical_T, clinical_N, clinical_M | 33% (32) | 32 | 0 |
| | pathologic_stage | 33% (32) | 32 | 0 |
| | pathologic_T, pathologic_N, pathologic_M | 33% (32) | 32 | 0 |
| Histology | histological_type | 34% (33) | 32 | 1 |
| | icd_o_3_histology | 33% (32) | 32 | 0 |
| Tumor Site | tumor_tissue_site | 33% (32) | 32 | 0 |
| | icd_o_3_site | 33% (32) | 32 | 0 |
| Body Metrics | height, weight, bmi | 33% (32) | 32 | 0 |
| Treatment | history_of_neoadjuvant_treatment | 33% (32) | 32 | 0 |
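
Coverage figures of this kind can be derived by counting, for each column name, the number of distinct collections whose clinical tables contain it. The following is a minimal sketch on toy (collection_id, column) pairs standing in for rows of clinical_index; the sample rows and the helper name column_coverage are illustrative assumptions, not part of the original analysis.

```python
from collections import defaultdict

def column_coverage(rows, total_collections):
    """Map each column name to (n_collections, rounded percent of total)."""
    seen = defaultdict(set)
    for collection_id, column in rows:
        # A set per column means a collection is counted once,
        # even if the column appears in several of its tables.
        seen[column].add(collection_id)
    return {
        col: (len(ids), round(100 * len(ids) / total_collections))
        for col, ids in seen.items()
    }

# Toy stand-in for (collection_id, column) pairs taken from clinical_index
rows = [
    ("tcga_brca", "dicom_patient_id"), ("tcga_brca", "gender"),
    ("nlst", "dicom_patient_id"), ("nlst", "age"),
    ("cmmd", "dicom_patient_id"), ("cmmd", "age"),
    ("nlst", "age"),  # same column in a second table of the same collection
]
coverage = column_coverage(rows, total_collections=3)
```

Counting distinct collections rather than rows matters because a collection can ship several clinical tables that repeat the same column.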

Non-TCGA Common Columns

Among the 65 non-TCGA collections, these columns appear most frequently:

| Column | Collections | Notes |
|---|---|---|
| patientid | 17 | Alternative patient identifier |
| studyinstanceuid | 16 | Links to specific imaging studies |
| collection | 15 | Collection identifier |
| age | 14 | Patient age (various meanings) |
| seriesinstanceuid | 14 | Links to specific series |
| patient_id | 13 | Alternative patient identifier |
| sex | 12 | Patient sex |
| disease_type | 12 | Cancer/disease classification |
| race | 11 | Demographics |
| primary_site | 11 | Tumor location |

Key Takeaways on Semantic Commonality

  1. dicom_patient_id is the only universal linking column - every clinical table has it for joining to imaging (source_batch is also universal, but only records provenance)

  2. TCGA provides the most standardized clinical data - 32 collections with identical column names for staging, histology, and treatment

  3. Demographics are the most cross-compatible - race, gender/sex, age/age_at_diagnosis appear across many collection types

  4. Naming varies - The same concept uses different column names (e.g., gender vs sex, age vs age_at_diagnosis)

  5. Survival data is moderately common - vital_status and time-to-event columns appear in ~35% of collections
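
Because the same concept appears under different column names, code that combines tables from several collections usually needs a small alias map first. The sketch below assumes a hand-maintained alias list built from column names reported in this document; the canonical concept names and the helper find_column are hypothetical.

```python
# Canonical concept -> column names seen in this report, in preference order.
# The concept keys ("sex", "age_years", "age_days") are illustrative labels.
ALIASES = {
    "sex": ["sex", "gender"],
    "age_years": ["age", "age_at_diagnosis", "Age_at_Diagnosis"],
    "age_days": ["diag__age_at_diagnosis", "demo__age_at_index"],
}

def find_column(concept, table_columns):
    """Return the first alias for `concept` present in a table, else None."""
    available = set(table_columns)
    for name in ALIASES.get(concept, []):
        if name in available:
            return name
    return None

# Toy column lists for a TCGA-style table and a non-TCGA table
tcga_like = ["dicom_patient_id", "gender", "age_at_diagnosis"]
other = ["dicom_patient_id", "sex", "age"]
```

Matching is exact and case-sensitive, and finding a column says nothing about its unit or reference time point; those still have to be checked per collection.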


Age Column Naming Conventions

Clinical data tables use various column names for patient age:

| Column Name | Collections | Unit | Description |
|---|---|---|---|
| age | 14 | Years | Generic age (context varies by collection) |
| age_at_diagnosis | 35 | Years | Age when cancer was diagnosed (TCGA standard) |
| diag__age_at_diagnosis | 10 | Days | Age at diagnosis in days (CPTAC format) |
| demo__age_at_index | 10 | Days | Age at index date (CPTAC, often empty) |
| Age_at_Diagnosis | 4 | Years | HTAN format |

The 14 Collections with "age" Column

| Collection | Table | Column | Description |
|---|---|---|---|
| acrin_6698 | acrin_6698_clinical | age | Age in years at enrollment |
| adrenal_acc_ki67_seg | adrenal_acc_ki67_seg_clinical | age | Integer years |
| cmmd | cmmd_clinical | age | Age |
| colorectal_liver_metastases | colorectal_liver_metastases_clinical | age | Age at Operation |
| covid_19_ar | covid_19_ar_clinical | age | AGE |
| cptac_aml | cptac_aml_demographic_classification | age | Age (all "Not Reported") |
| ea1141 | ea1141_demographics | age | Participant age at study enrollment |
| hcc_tace_seg | hcc_tace_seg_clinical | age | Patient age in years during diagnosis |
| ispy1 | ispy1_clinical | age | age |
| lung_pet_ct_dx | lung_pet_ct_dx_clinical | age | Age |
| nlst | nlst_prsn | age | Age at randomization (years) |
| nsclc_radiomics | nsclc_radiomics_clinical | age | age |
| remind | remind_clinical | age | Years between birth and surgery date |
| soft_tissue_sarcoma | soft_tissue_sarcoma_clinical | age | Age |

Note: Even with the same column name, age can refer to different time points:

  • Age at enrollment/randomization (clinical trials)
  • Age at diagnosis
  • Age at surgery/operation
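
Since the reference event is only stated in the free-text column label, a simple keyword pass over column_label can give a first guess at which event an age column describes. This is a rough sketch using the label wording from the table above; the function name, the keyword lists, and the "unspecified" fallback are assumptions, and any result should be confirmed against the collection's documentation.

```python
def age_reference_event(column_label):
    """Guess which event an age column refers to from its label text."""
    label = column_label.lower()
    # Checked in order; the first group with a matching keyword wins.
    keywords = [
        ("enrollment", ("enrollment", "randomization")),
        ("diagnosis", ("diagnosis",)),
        ("surgery", ("surgery", "operation")),
    ]
    for event, words in keywords:
        if any(word in label for word in words):
            return event
    # Labels such as "Age" or "AGE" carry no information about the event.
    return "unspecified"

labels = [
    "Age at randomization (years)",
    "Age at Operation",
    "Patient age in years during diagnosis",
    "AGE",
]
events = [age_reference_event(label) for label in labels]
```

Labels that fall through to "unspecified" are the ones where the reference time point cannot be recovered from the index alone.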

Complete Collection Analysis

Summary Statistics

| Category | Collections | Notes |
|---|---|---|
| TCGA | 32 | All have age_at_diagnosis in years |
| CPTAC | 11 | Use diag__age_at_diagnosis in days (8 of the 11 have valid values) |
| Other with clinical age | 15 | Various column names |
| DICOM age only | 30 | No clinical age column |
| No age data | 12 | Neither source available |

TCGA Collections (32 total)

All TCGA collections use age_at_diagnosis in years with consistent formatting.

| Collection | Patients | Valid Ages | Age Range (years) | Mean (years) | Quality | DICOM Age Available |
|---|---|---|---|---|---|---|
| tcga_acc | 92 | 92 | 14-83 | 47 | GOOD | No |
| tcga_blca | 412 | 412 | 34-90 | 68 | GOOD | Yes (38 values) |
| tcga_brca | 1,098 | 1,096 | 26-90 | 58 | GOOD | Yes (45 values) |
| tcga_cesc | 307 | 307 | 20-88 | 48 | GOOD | Yes (25 values) |
| tcga_chol | 51 | 45 | 29-82 | 64 | PARTIAL | No |
| tcga_coad | 460 | 458 | 31-90 | 67 | GOOD | Yes (23 values) |
| tcga_dlbc | 48 | 48 | 23-82 | 56 | GOOD | No |
| tcga_esca | 185 | 185 | 27-90 | 62 | GOOD | Yes (13 values) |
| tcga_gbm | 607 | 595 | 10-89 | 58 | GOOD | No |
| tcga_hnsc | 523 | 522 | 19-90 | 61 | GOOD | No |
| tcga_kich | 113 | 113 | 17-86 | 51 | GOOD | Yes (14 values) |
| tcga_kirc | 537 | 537 | 26-90 | 61 | GOOD | Yes (54 values) |
| tcga_kirp | 291 | 288 | 28-88 | 61 | GOOD | Yes (27 values) |
| tcga_lgg | 516 | 515 | 14-87 | 43 | GOOD | No |
| tcga_lihc | 377 | 376 | 16-90 | 59 | GOOD | Yes (49 values) |
| tcga_luad | 560 | 503 | 33-88 | 65 | PARTIAL | Yes (36 values) |
| tcga_lusc | 504 | 495 | 39-90 | 67 | GOOD | Yes (25 values) |
| tcga_meso | 87 | 87 | 28-81 | 63 | GOOD | No |
| tcga_ov | 590 | 587 | 26-89 | 60 | GOOD | Yes (44 values) |
| tcga_paad | 185 | 185 | 35-88 | 65 | GOOD | No |
| tcga_pcpg | 179 | 179 | 19-83 | 47 | GOOD | No |
| tcga_prad | 500 | 500 | 41-78 | 61 | GOOD | Yes (13 values) |
| tcga_read | 171 | 170 | 31-90 | 64 | GOOD | Yes (3 values) |
| tcga_sarc | 261 | 261 | 20-90 | 61 | GOOD | Yes (5 values) |
| tcga_skcm | 470 | 462 | 15-90 | 58 | GOOD | No |
| tcga_stad | 443 | 438 | 30-90 | 66 | GOOD | Yes (24 values) |
| tcga_tgct | 150 | 134 | 14-67 | 32 | PARTIAL | No |
| tcga_thca | 507 | 507 | 15-89 | 47 | GOOD | Yes (6 values) |
| tcga_thym | 124 | 123 | 17-84 | 58 | GOOD | No |
| tcga_ucec | 560 | 545 | 31-90 | 64 | GOOD | Yes (44 values) |
| tcga_ucs | 57 | 57 | 51-90 | 70 | GOOD | No |
| tcga_uvm | 80 | 80 | 22-86 | 62 | GOOD | No |

CPTAC Collections (11 total)

Important: CPTAC stores diag__age_at_diagnosis in days since birth. Values must be divided by 365.25 to convert to years.

| Collection | Patients | Valid Ages | Age Range (years) | Mean (years) | Quality | DICOM Age |
|---|---|---|---|---|---|---|
| cptac_aml | 275 | 0 | N/A | N/A | NO DATA | Yes (30 values) |
| cptac_brca | 123 | 112 | 31-90 | 60 | GOOD | Yes (47 values) |
| cptac_ccrcc | 110 | 110 | 30-90 | 61 | GOOD | Yes (51 values) |
| cptac_coad | 106 | 0 | N/A | N/A | NO DATA | Yes (42 values) |
| cptac_gbm | 99 | 99 | 24-89 | 58 | GOOD | Yes (51 values) |
| cptac_hnscc | 110 | 110 | 23-81 | 62 | GOOD | Yes (34 values) |
| cptac_lscc | 108 | 105 | 41-88 | 66 | GOOD | Yes (40 values) |
| cptac_luad | 111 | 111 | 35-82 | 63 | GOOD | Yes (46 values) |
| cptac_ov | 63 | 0 | N/A | N/A | NO DATA | Yes (29 values) |
| cptac_pda | 140 | 138 | 32-85 | 65 | GOOD | Yes (47 values) |
| cptac_ucec | 100 | 100 | 38-90 | 64 | GOOD | Yes (46 values) |

Example conversion:

Raw value: 21747 days
Converted: 21747 ÷ 365.25 ≈ 59.5 years

Other Collections with Clinical Age

| Collection | Column | Patients | Valid Ages | Age Range (years) | Mean (years) | Quality | DICOM Age |
|---|---|---|---|---|---|---|---|
| acrin_6698 | age | 385 | 381 | 23-77 | 49 | GOOD | No |
| adrenal_acc_ki67_seg | age | 53 | 53 | 22-82 | 53 | GOOD | Yes |
| cmmd | age | 1,872 | 1,872 | 17-87 | 47 | GOOD | Yes |
| colorectal_liver_metastases | age | 197 | 197 | 30-88 | 60 | GOOD | Yes |
| covid_19_ar | age | 105 | 105 | 19-91 | 54 | GOOD | Yes |
| ctpred_sunitinib_pannet | age_at_diagnosis | 38 | 38 | 21-72 | 48 | GOOD | Yes |
| ea1141 | age | 500 | 500 | 40-75 | 55 | GOOD | Yes |
| hcc_tace_seg | age | 105 | 105 | 31-93 | 68 | GOOD | Sparse (1 value) |
| ispy1 | age | 221 | 221 | 27-69 | 48 | GOOD | Yes |
| lung_fused_ct_pathology | age_at_diagnosis | 6 | 6 | 56-85 | 75 | GOOD | Yes |
| lung_pet_ct_dx | age | 355 | 354 | 28-90 | 61 | GOOD | Yes |
| nlst | age | 53,452 | 53,452 | 43-79 | 61 | GOOD | Sparse (20 values) |
| nsclc_radiomics | age | 422 | 400 | 34-92 | 68 | GOOD | Yes |
| remind | age | 114 | 114 | 20-76 | 47 | GOOD | No |
| soft_tissue_sarcoma | age | 51 | 51 | 16-83 | 55 | GOOD | Yes |

Collections with DICOM Age Only (No Clinical Age Column)

These collections have DICOM PatientAge but no corresponding clinical age column:

| Collection | DICOM Patients | Unique Ages | Range (years) | Mean (years) |
|---|---|---|---|---|
| acrin_contralateral_breast_mr | 984 | 57 | 27-86 | 56 |
| acrin_flt_breast | 83 | 41 | 22-83 | 50 |
| acrin_nsclc_fdg_pet | 242 | 53 | 1-90 | 56 |
| advanced_mri_breast_lesions | 632 | 60 | 19-86 | 53 |
| anti_pd_1_lung | 46 | 32 | 37-85 | 65 |
| breast_mri_nact_pilot | 64 | 38 | 29-72 | 48 |
| c4kc_kits | 210 | 57 | 10-90 | 55 |
| covid_19_ny_sbu | 1,384 | 72 | 18-90 | 54 |
| duke_breast_cancer_mri | 922 | 57 | 24-89 | 54 |
| lidc_idri | 1,010 | 58 | 14-88 | 55 |
| nsclc_radiogenomics | 211 | 39 | 24-86 | 65 |
| prostate_mri_us_biopsy | 1,151 | 41 | 45-90 | 66 |
| prostatex | 346 | 38 | 35-78 | 59 |
| upenn_gbm | 630 | 63 | 18-88 | 56 |

Collections with No Age Data

| Collection | Notes |
|---|---|
| bonemarrowwsi_pediatricleukemia | Pediatric - DICOM shows ages 2-17 |
| cbis_ddsm | 6,671 patients, no age in DICOM |
| cc_radiomics_phantom_3 | Phantom data (not human) |
| htan_hms, htan_ohsu, htan_vanderbilt, htan_wustl | HTAN collections - age columns exist but empty |
| ispy2 | 719 patients, no DICOM age |
| mediastinal_lymph_node_seg | 513 patients |
| nsclc_radiomics_interobserver1 | 22 patients |
| prostate_diagnosis | 92 patients |

Data Quality Issues

1. All Clinical Ages are Meaningful

Finding: No placeholder values (999), zeros, or negatives were found in the clinical age column. Of the 14 collections with this column, 13 contain populated values, and all of those are plausible patient ages (16-93 years); cptac_aml contains only "Not Reported" (see issue 3).

2. CPTAC Age in Days

Problem: CPTAC collections store age as days since birth, not years.

Example values:

| Raw Value (days) | Converted (years) |
|---|---|
| 21747 | 59.5 |
| 19709 | 54.0 |
| 15512 | 42.5 |
| 26492 | 72.5 |

Solution: Divide by 365.25 before use.
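
The conversion is easy to get wrong when a column mixes numbers with text such as "Not Reported". The sketch below reproduces the example values above and returns None for non-numeric entries; the helper name days_to_years is an assumption introduced for illustration.

```python
def days_to_years(raw):
    """Convert an age in days to years; return None for non-numeric input."""
    try:
        days = float(raw)
    except (TypeError, ValueError):
        # Covers None and text placeholders such as "Not Reported"
        return None
    return days / 365.25

examples = [21747, 19709, 15512, 26492, "Not Reported"]
converted = [days_to_years(value) for value in examples]
# Round to one decimal place, as in the table above
rounded = [None if years is None else round(years, 1) for years in converted]
```

Dividing by 365.25 averages over leap years, so the result is approximate to within a few days, which is well below the precision of the other age sources discussed here.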

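3. Collections with No Usable Clinical Age

Finding: cptac_aml has an age column, but all values are "Not Reported". Two other CPTAC collections, cptac_coad and cptac_ov, likewise have no valid diag__age_at_diagnosis values.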

4. DICOM PatientAge Availability Varies Significantly

| DICOM Status | Collections | Examples |
|---|---|---|
| Not available | 2 | acrin_6698, remind |
| Mostly missing | 2 | hcc_tace_seg (1 value), nlst (20 values for 53K patients) |
| Incomplete | 3 | ea1141, ispy1 (zeros + empties) |
| Good coverage | 8 | cmmd, covid_19_ar, nsclc_radiomics, soft_tissue_sarcoma |

5. DICOM Placeholder Values

Problem: Some collections use "000Y" as a placeholder for missing age.

| Collection | Count of "000Y" | Empty/Null |
|---|---|---|
| ea1141 | 5 | 76 |
| ispy1 | 1 | 26 |
| lung_pet_ct_dx | 1 | 0 |
| c4kc_kits | Present | - |

Solution: Filter out "000Y" (a parsed age of 0) as well as empty values when analyzing DICOM PatientAge.
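
DICOM encodes PatientAge as an Age String: three digits followed by a unit letter (D, W, M, or Y), so year-only parsing silently drops ages recorded in days, weeks, or months. The sketch below is one way to normalize all four units to years while discarding zero-valued placeholders. The function name dicom_age_to_years and the choice to treat every zero value as a placeholder are assumptions made for illustration, not part of the original analysis.

```python
import re

# DICOM Age String (AS): exactly three digits plus one unit letter
_AS_PATTERN = re.compile(r"^(\d{3})([DWMY])$")
# How many of each unit make up one year
_UNITS_PER_YEAR = {"D": 365.25, "W": 365.25 / 7, "M": 12.0, "Y": 1.0}

def dicom_age_to_years(value):
    """Parse a DICOM Age String; return None if missing, malformed, or zero."""
    if value is None:
        return None
    match = _AS_PATTERN.match(str(value).strip().upper())
    if not match:
        return None
    number, unit = int(match.group(1)), match.group(2)
    if number == 0:
        # Assumption: zero ages such as "000Y" are placeholders, not real ages
        return None
    return number / _UNITS_PER_YEAR[unit]

samples = ["045Y", "000Y", "", None, "006M"]
parsed = [dicom_age_to_years(s) for s in samples]
```

This parser is stricter than a digits-only pattern: a bare "29" without a unit letter is rejected rather than assumed to be years.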

6. Decimal vs. Integer Precision

Problem: Some clinical tables store age with decimal precision (e.g., 28.76 years) while DICOM stores integers.

| Collection | Clinical Example | DICOM Example |
|---|---|---|
| ispy1 | 28.76 | 029Y |
| nsclc_radiomics | 68.6 | 069Y |

Impact: 0% exact match in both collections, but 100% (ispy1) and 95.7% (nsclc_radiomics) of patients fall within ±1 year.
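
Exact equality is therefore a misleading consistency metric when one source is fractional. A tolerance-based comparison, sketched below on the two example rows from the table above, separates precision effects from real disagreement; the helper name compare_ages and the default tolerance of 1 year are assumptions.

```python
def compare_ages(pairs, tolerance=1.0):
    """Return (% exact matches, % within tolerance) for (clinical, dicom) pairs."""
    if not pairs:
        raise ValueError("need at least one (clinical, dicom) pair")
    n = len(pairs)
    exact = sum(1 for clinical, dicom in pairs if clinical == dicom)
    close = sum(1 for clinical, dicom in pairs if abs(clinical - dicom) <= tolerance)
    return 100 * exact / n, 100 * close / n

# Clinical (decimal years) vs DICOM (integer years) examples from the table above
pairs = [(28.76, 29), (68.6, 69)]
exact_pct, close_pct = compare_ages(pairs)
```

Rounding the clinical value before comparing is an alternative, but it hides which source was imprecise, so keeping the raw difference is usually more informative.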

7. Different Reference Time Points

Problem: Clinical "age" and DICOM "PatientAge" may measure age at different events.

Case Study: colorectal_liver_metastases

| Patient | Clinical Age (at Operation) | DICOM Age (at CT Scan) | Difference |
|---|---|---|---|
| CRLM-CT-1190 | 80 | 33 | 47 years |
| CRLM-CT-1185 | 74 | 30 | 44 years |
| CRLM-CT-1093 | 71 | 32 | 39 years |

Explanation: CT scans from 1991-1995 were acquired decades before the surgical intervention. The clinical age refers to age at operation, not at imaging.
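
Gaps this large are straightforward to surface automatically before two age sources are pooled. The sketch below flags patients whose clinical and DICOM ages differ by more than a threshold, using the three case-study rows plus one consistent row that is hypothetical and not taken from the report; the helper name flag_large_gaps and the 2-year threshold are likewise assumptions.

```python
def flag_large_gaps(records, max_gap_years=2):
    """Return patient IDs whose clinical and DICOM ages differ by more than max_gap_years."""
    return [
        patient_id
        for patient_id, clinical_age, dicom_age in records
        if abs(clinical_age - dicom_age) > max_gap_years
    ]

records = [
    ("CRLM-CT-1190", 80, 33),
    ("CRLM-CT-1185", 74, 30),
    ("CRLM-CT-1093", 71, 32),
    ("EXAMPLE-0001", 60, 60),  # hypothetical consistent patient, not from the report
]
flagged = flag_large_gaps(records)
```

A flagged patient does not necessarily indicate an error; as in this case study, it may simply mean the two ages describe different events.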

8. Sparse DICOM PatientAge

Problem: Some large collections have very few unique DICOM age values.

| Collection | Patients | Unique DICOM Ages | Coverage |
|---|---|---|---|
| nlst | 26,410 | 20 | <0.1% |
| hcc_tace_seg | 105 | 1 | <1% |
| tcga_gbm | 607 | 0 | 0% |

Cross-Source Consistency

Detailed Comparison Results

Analysis was performed on collections with both clinical age and DICOM PatientAge available:

| Collection | Patients Compared | Exact Match | Within ±1 yr | Within ±2 yr | Mean Diff | Std Diff | Range |
|---|---|---|---|---|---|---|---|
| cmmd | 1,872 | 100% | 100% | 100% | 0.00 | 0.00 | [0, 0] |
| lung_pet_ct_dx | 353 | 99.4% | 100% | 100% | 0.01 | 0.08 | [0, 1] |
| covid_19_ar | 103 | 97.1% | 100% | 100% | 0.03 | 0.17 | [0, 1] |
| soft_tissue_sarcoma | 51 | 94.1% | 100% | 100% | -0.06 | 0.24 | [-1, 0] |
| adrenal_acc_ki67_seg | 50 | 56.0% | 100% | 100% | 0.38 | 0.46 | [0, 1] |
| ea1141 | 478 | 52.9% | 95.8% | 99.2% | -0.20 | 4.48 | [-2, 73] |
| nsclc_radiomics | 304 | 0% | 95.7% | 99.3% | 0.52 | 0.45 | [-4, 3] |
| ispy1 | 221 | 0% | 100% | 100% | 0.01 | 0.30 | [0, 0] |
| colorectal_liver_metastases | 195 | 3.1% | 8.2% | 10.8% | 0.18 | 17.01 | [-38, 47] |

Collections with High Consistency

| Collection | Patients Compared | Exact Match | Within ±1 yr | Notes |
|---|---|---|---|---|
| cmmd | 1,872 | 100% | 100% | Perfect match |
| lung_pet_ct_dx | 353 | 99.4% | 100% | Near-perfect |
| covid_19_ar | 103 | 97.1% | 100% | Excellent |
| soft_tissue_sarcoma | 51 | 94.1% | 100% | Excellent |

Collections with Precision Differences

| Collection | Exact Match | Within ±1 yr | Cause |
|---|---|---|---|
| ispy1 | 0% | 100% | Decimal clinical (28.76), integer DICOM (029Y) |
| nsclc_radiomics | 0% | 95.7% | Decimal clinical (68.6), integer DICOM |
| adrenal_acc_ki67_seg | 56% | 100% | Rounding differences |

Collections with Semantic Differences

| Collection | Clinical Meaning | DICOM Meaning | Consistency |
|---|---|---|---|
| colorectal_liver_metastases | Age at operation | Age at CT scan | Poor (up to 47 years difference) |
| nlst | Age at randomization | Age at scan | Cannot compare (sparse DICOM) |

Root Causes of Discrepancies

| Cause | Collections Affected | Impact |
|---|---|---|
| Different reference points | colorectal_liver_metastases | Large differences (up to 47 years) |
| Decimal vs integer precision | ispy1, nsclc_radiomics | Differences mostly within ±1 year |
| Rounding at data entry | adrenal_acc_ki67_seg | Differences of at most 1 year |
| Missing DICOM data | acrin_6698, nlst, remind | Cannot compare |

Recommendations

For Data Users

  1. Always check the column label to understand what "age" refers to in each collection

    client.clinical_index[client.clinical_index['column'] == 'age']['column_label']

  2. Convert CPTAC ages from days to years

    age_years = age_days / 365.25

  3. Filter DICOM placeholder values

    valid_ages = dicom_df[dicom_df['PatientAge'] != '000Y']

  4. Use clinical age for clinical events, DICOM age for imaging timing

  5. Best collections for cross-validated age data:

    • cmmd, covid_19_ar, soft_tissue_sarcoma, lung_pet_ct_dx
    • Most CPTAC collections (after days→years conversion)
    • Most TCGA collections with DICOM coverage

For Data Integration

| Use Case | Recommended Source |
|---|---|
| Age at diagnosis | Clinical age_at_diagnosis |
| Age at imaging | DICOM PatientAge |
| Age at treatment/surgery | Clinical (check column label) |
| Cross-study comparisons | Use single source consistently |

Collections by Age Data Availability

Both Clinical + DICOM Age (39 collections) - Best for cross-validation:

  • TCGA: blca, brca, cesc, coad, esca, kich, kirc, kirp, lihc, luad, lusc, ov, prad, read, sarc, stad, thca, ucec
  • CPTAC: brca, ccrcc, gbm, hnscc, lscc, luad, pda, ucec
  • Others: adrenal_acc_ki67_seg, cmmd, colorectal_liver_metastases, covid_19_ar, ctpred_sunitinib_pannet, ea1141, ispy1, lung_fused_ct_pathology, lung_pet_ct_dx, nsclc_radiomics, soft_tissue_sarcoma

Clinical Age Only (16 collections):

  • TCGA: acc, chol, dlbc, gbm, hnsc, lgg, meso, paad, pcpg, skcm, tgct, thym, ucs, uvm
  • Others: acrin_6698, hcc_tace_seg, nlst, remind

DICOM Age Only (30 collections):

  • Large collections: covid_19_ny_sbu (1,384 patients), prostate_mri_us_biopsy (1,151), lidc_idri (1,010), acrin_contralateral_breast_mr (984)
  • Various imaging studies whose clinical tables have no age column

No Age Data (12 collections):

  • htan_*, cbis_ddsm, ispy2, bonemarrowwsi_pediatricleukemia, mediastinal_lymph_node_seg, nsclc_radiomics_interobserver1, prostate_diagnosis, cc_radiomics_phantom_3

Appendix: Code Examples

Fetching Clinical Age Data

from idc_index import IDCClient

client = IDCClient()
client.fetch_index('clinical_index')

# Find collections with age data
age_collections = client.clinical_index[
    client.clinical_index['column'].isin(['age', 'age_at_diagnosis'])
]['collection_id'].unique()

# Load clinical table
clinical_df = client.get_clinical_table('tcga_brca_clinical')

Converting CPTAC Age

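import pandas as pd

# CPTAC stores age in days; coerce non-numeric entries (e.g. "Not Reported") to NaN first
clinical_df = client.get_clinical_table('cptac_brca_clinical')
age_days = pd.to_numeric(clinical_df['diag__age_at_diagnosis'], errors='coerce')
clinical_df['age_years'] = age_days / 365.25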

Parsing DICOM PatientAge

import re
import pandas as pd
import numpy as np

def parse_dicom_age(age_str):
    """Convert DICOM PatientAge in years (e.g., '045Y') to an integer.

    Ages in other DICOM units (D, W, M) and malformed values return NaN.
    """
    if pd.isna(age_str) or age_str == '':
        return np.nan
    match = re.match(r'^(\d+)[Yy]?$', str(age_str).strip())
    if match:
        return int(match.group(1))
    return np.nan

# Apply to DICOM data
dicom_df['age_numeric'] = dicom_df['PatientAge'].apply(parse_dicom_age)
# Drop placeholders ('000Y' parses to 0) and unparseable values (NaN > 0 is False)
dicom_df = dicom_df[dicom_df['age_numeric'] > 0]

Joining Clinical and DICOM Age

import pandas as pd

# Get clinical data
clinical_df = client.get_clinical_table('cmmd_clinical')

# Get DICOM data
dicom_df = client.index[client.index['collection_id'] == 'cmmd'][
    ['PatientID', 'PatientAge']
].drop_duplicates()

# Parse DICOM age
dicom_df['dicom_age'] = dicom_df['PatientAge'].apply(parse_dicom_age)

# Merge on patient ID
merged = pd.merge(
    clinical_df[['dicom_patient_id', 'age']],
    dicom_df[['PatientID', 'dicom_age']],
    left_on='dicom_patient_id',
    right_on='PatientID'
)

# Calculate difference (coerce clinical age to numeric in case it is stored as text)
merged['age_diff'] = pd.to_numeric(merged['age'], errors='coerce') - merged['dicom_age']

Finding Value Mappings

# Get observed values for coded columns
age_values = client.clinical_index[
    (client.clinical_index['collection_id'] == 'nlst') &
    (client.clinical_index['column'] == 'age')
]['values'].values[0]

# Create mapping dictionary
mapping = {item['option_code']: item['option_description']
           for item in age_values}

Complete Age Analysis Workflow

from idc_index import IDCClient
import pandas as pd
import numpy as np
import re

client = IDCClient()
client.fetch_index('clinical_index')

def parse_dicom_age(age_str):
    if pd.isna(age_str) or age_str == '':
        return np.nan
    match = re.match(r'^(\d+)[Yy]?$', str(age_str).strip())
    if match:
        return int(match.group(1))
    return np.nan

def analyze_collection_age(collection_id, age_column, table_name, is_days=False):
    """Analyze age consistency for a collection"""

    # Load clinical data
    clinical_df = client.get_clinical_table(table_name)

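    # Parse clinical age; coerce text such as "Not Reported" to NaN before any arithmetic
    clinical_df['clinical_age'] = pd.to_numeric(clinical_df[age_column], errors='coerce')
    if is_days:
        # CPTAC-style columns hold days since birth
        clinical_df['clinical_age'] = clinical_df['clinical_age'] / 365.25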

    # Get DICOM data
    dicom_df = client.index[client.index['collection_id'] == collection_id][
        ['PatientID', 'PatientAge']
    ].drop_duplicates()
    dicom_df['dicom_age'] = dicom_df['PatientAge'].apply(parse_dicom_age)

    # Merge
    merged = pd.merge(
        clinical_df[['dicom_patient_id', 'clinical_age']],
        dicom_df.groupby('PatientID')['dicom_age'].first().reset_index(),
        left_on='dicom_patient_id',
        right_on='PatientID'
    ).dropna(subset=['clinical_age', 'dicom_age'])

    if len(merged) == 0:
        return None

    # Calculate statistics
    merged['diff'] = merged['clinical_age'] - merged['dicom_age']

    return {
        'collection': collection_id,
        'n_compared': len(merged),
        'exact_match': (merged['diff'] == 0).mean() * 100,
        'within_1yr': (merged['diff'].abs() <= 1).mean() * 100,
        'mean_diff': merged['diff'].mean(),
        'std_diff': merged['diff'].std()
    }

# Example usage
result = analyze_collection_age('cmmd', 'age', 'cmmd_clinical')
print(result)
