claude_report_clinical_data.md
- Analysis Date: 2026-02-10
- IDC Version: v23 (idc-index 0.11.8)
- Collections Analyzed: 97 collections with clinical data
- Generated using: IDC Claude Skill
This report provides a comprehensive analysis of clinical data in the NCI Imaging Data Commons (IDC), with a focus on patient age information. The analysis covers:
- Clinical data structure and access patterns in IDC
- Semantic commonality of clinical columns across collections
- Age data availability from both clinical tables and DICOM metadata
- Cross-source consistency between clinical and DICOM age values
- Data quality issues and recommendations
| Metric | Count | Percentage |
|---|---|---|
| Total collections with clinical data | 97 | 100% |
| Collections with meaningful clinical age | 55 | 57% |
| Collections with DICOM PatientAge | 69 | 71% |
| Collections with both sources | 39 | 40% |
Critical discoveries:
- Only `dicom_patient_id` is truly universal across all clinical tables
- TCGA provides the most standardized clinical data (32 collections with identical schemas)
- CPTAC collections store age in days since birth, not years (requires ÷365.25 conversion)
- Clinical and DICOM ages may reference different time points (diagnosis vs. imaging vs. surgery)
- DICOM PatientAge is sparse or missing in many collections, even when clinical age exists
- Understanding Clinical Data in IDC
- Semantic Commonality Across Collections
- Age Column Naming Conventions
- Complete Collection Analysis
- Data Quality Issues
- Cross-Source Consistency
- Recommendations
- Appendix: Code Examples
Clinical data in IDC refers to non-imaging information that accompanies medical images:
- Patient demographics (age, sex, race, ethnicity)
- Clinical history (diagnoses, surgeries, therapies)
- Lab tests and pathology results
- Cancer staging (clinical and pathological TNM)
- Treatment outcomes and survival data
| Characteristic | Description |
|---|---|
| Not harmonized | Terms and formats vary across collections (each data submitter uses their own conventions) |
| Not universal | Not all collections have clinical data |
| Anonymized | `dicom_patient_id` links clinical records to imaging data (`PatientID` in imaging index) |
The clinical_index serves as a dictionary/catalog of all available clinical data:
| Column | Purpose | Use For |
|---|---|---|
| `collection_id` | Collection identifier | Filtering by collection |
| `table_name` | Full BigQuery table reference | BigQuery queries (if needed) |
| `short_table_name` | Short name | `get_clinical_table()` method |
| `column` | Column name in table | Selecting data columns |
| `column_label` | Human-readable description | Searching for concepts |
| `values` | Observed attribute values | Interpreting coded values |
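Because column names are not harmonized, searching `column_label` is often more reliable than guessing a column name. The following is a minimal sketch of that idea on a hand-written stand-in for a few `clinical_index` rows; the `catalog` list and the helper `find_columns_by_label` are illustrative assumptions (only the labels are copied from this report), and the same filter could be written against the real DataFrame.

```python
import re

# Hypothetical stand-in for a few clinical_index rows; labels are copied from this report.
catalog = [
    {"collection_id": "nlst", "short_table_name": "nlst_prsn",
     "column": "age", "column_label": "Age at randomization (years)"},
    {"collection_id": "colorectal_liver_metastases",
     "short_table_name": "colorectal_liver_metastases_clinical",
     "column": "age", "column_label": "Age at Operation"},
    {"collection_id": "remind", "short_table_name": "remind_clinical",
     "column": "age", "column_label": "Years between birth and surgery date"},
]

def find_columns_by_label(rows, keywords):
    """Return (collection_id, short_table_name, column) for rows whose
    label contains any keyword as a whole word (case-insensitive)."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    return [
        (r["collection_id"], r["short_table_name"], r["column"])
        for r in rows
        if pattern.search(r["column_label"])
    ]

# Searching for "age" alone misses labels that describe age indirectly.
by_age = find_columns_by_label(catalog, ["age"])
by_age_or_birth = find_columns_by_label(catalog, ["age", "birth"])
```

Whole-word matching is a deliberate choice here: a plain substring search for "age" would also hit labels such as "stage" or "image".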
```python
from idc_index import IDCClient

client = IDCClient()

# 1. Fetch the clinical index (catalog of all clinical data)
client.fetch_index('clinical_index')

# 2. See which collections have clinical data
collections = client.clinical_index["collection_id"].unique()
print(f"{len(collections)} collections have clinical data")

# 3. Explore clinical attributes for a specific collection
nlst_columns = client.clinical_index[client.clinical_index['collection_id'] == 'nlst']

# 4. Load a specific clinical table
nlst_prsn = client.get_clinical_table("nlst_prsn")

# 5. Join with imaging data using dicom_patient_id = PatientID
```

Only two columns appear in every clinical table:
| Column | Description |
|---|---|
| `dicom_patient_id` | Links clinical data to imaging (`PatientID` in index) |
| `source_batch` | IDC provenance tracking |
The analysis identified columns appearing across multiple collections. Many columns at ~33% coverage are TCGA-specific (32 collections with identical schemas):
| Category | Column | Coverage | TCGA | Non-TCGA |
|---|---|---|---|---|
| Demographics | `race` | 44% (43) | 32 | 11 |
| | `gender` | 40% (39) | 32 | 7 |
| | `ethnicity` | 38% (37) | 32 | 5 |
| | `age_at_diagnosis` | 36% (35) | 32 | 3 |
| | `age` | 14% (14) | 0 | 14 |
| | `sex` | 12% (12) | 0 | 12 |
| Survival | `vital_status` | 36% (35) | 32 | 3 |
| | `days_to_death` | 33% (32) | 32 | 0 |
| | `days_to_last_followup` | 34% (33) | 32 | 1 |
| Cancer Staging | `clinical_stage` | 33% (32) | 32 | 0 |
| | `clinical_T`, `clinical_N`, `clinical_M` | 33% (32) | 32 | 0 |
| | `pathologic_stage` | 33% (32) | 32 | 0 |
| | `pathologic_T`, `pathologic_N`, `pathologic_M` | 33% (32) | 32 | 0 |
| Histology | `histological_type` | 34% (33) | 32 | 1 |
| | `icd_o_3_histology` | 33% (32) | 32 | 0 |
| Tumor Site | `tumor_tissue_site` | 33% (32) | 32 | 0 |
| | `icd_o_3_site` | 33% (32) | 32 | 0 |
| Body Metrics | `height`, `weight`, `bmi` | 33% (32) | 32 | 0 |
| Treatment | `history_of_neoadjuvant_treatment` | 33% (32) | 32 | 0 |
Among the 65 non-TCGA collections, these columns appear most frequently:
| Column | Collections | Notes |
|---|---|---|
| `patientid` | 17 | Alternative patient identifier |
| `studyinstanceuid` | 16 | Links to specific imaging studies |
| `collection` | 15 | Collection identifier |
| `age` | 14 | Patient age (various meanings) |
| `seriesinstanceuid` | 14 | Links to specific series |
| `patient_id` | 13 | Alternative patient identifier |
| `sex` | 12 | Patient sex |
| `disease_type` | 12 | Cancer/disease classification |
| `race` | 11 | Demographics |
| `primary_site` | 11 | Tumor location |
- Only `dicom_patient_id` is truly universal - every clinical table has this for linking to imaging
- TCGA provides the most standardized clinical data - 32 collections with identical column names for staging, histology, and treatment
- Demographics are the most cross-compatible - `race`, `gender`/`sex`, `age`/`age_at_diagnosis` appear across many collection types
- Naming varies - the same concept uses different column names (e.g., `gender` vs `sex`, `age` vs `age_at_diagnosis`)
- Survival data is moderately common - `vital_status` and time-to-event columns appear in ~35% of collections
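Since the same concept appears under different names, a small alias lookup can locate the relevant column in any one table. The sketch below is illustrative only: `CONCEPT_ALIASES` and `find_concept_column` are hypothetical names, and the alias lists contain just the variants mentioned in this report. It returns the actual column name so that the caller can still check that column's label and unit.

```python
# Aliases per concept, in order of preference (variants taken from this report).
CONCEPT_ALIASES = {
    "sex": ["sex", "gender"],
    "age": ["age", "age_at_diagnosis", "Age_at_Diagnosis", "diag__age_at_diagnosis"],
}

def find_concept_column(columns, concept):
    """Return the first alias of `concept` present in `columns`, or None."""
    present = set(columns)
    for alias in CONCEPT_ALIASES.get(concept, []):
        if alias in present:
            return alias
    return None

# Example with a TCGA-style column list.
tcga_like = ["dicom_patient_id", "gender", "age_at_diagnosis", "vital_status"]
sex_column = find_concept_column(tcga_like, "sex")
age_column = find_concept_column(tcga_like, "age")
```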
Clinical data tables use various column names for patient age:
| Column Name | Collections | Unit | Description |
|---|---|---|---|
| `age` | 14 | Years | Generic age (context varies by collection) |
| `age_at_diagnosis` | 35 | Years | Age when cancer was diagnosed (TCGA standard) |
| `diag__age_at_diagnosis` | 10 | Days | Age at diagnosis in days (CPTAC format) |
| `demo__age_at_index` | 10 | Days | Age at index date (CPTAC, often empty) |
| `Age_at_Diagnosis` | 4 | Years | HTAN format |
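Because the unit depends on the column, it can help to keep the unit next to the column name and convert everything to years in one place. This is a minimal sketch under that assumption; `AGE_COLUMN_UNITS` and `age_in_years` are hypothetical names, and the units are the ones listed in the table above.

```python
DAYS_PER_YEAR = 365.25

# Unit of each age column, as listed in this report.
AGE_COLUMN_UNITS = {
    "age": "years",
    "age_at_diagnosis": "years",
    "Age_at_Diagnosis": "years",
    "diag__age_at_diagnosis": "days",
    "demo__age_at_index": "days",
}

def age_in_years(value, column):
    """Convert a raw clinical age to years.

    Returns None for missing or non-numeric values such as "Not Reported".
    Raises KeyError for a column whose unit is not known.
    """
    unit = AGE_COLUMN_UNITS[column]
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    if number != number:  # NaN is the only value not equal to itself
        return None
    return number / DAYS_PER_YEAR if unit == "days" else number

cptac_example = age_in_years(21747, "diag__age_at_diagnosis")
```

Raising on an unknown column is intentional: silently assuming years is exactly the mistake the days-based columns invite.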
| Collection | Table | Column | Description |
|---|---|---|---|
| `acrin_6698` | `acrin_6698_clinical` | `age` | Age in years at enrollment |
| `adrenal_acc_ki67_seg` | `adrenal_acc_ki67_seg_clinical` | `age` | Integer years |
| `cmmd` | `cmmd_clinical` | `age` | Age |
| `colorectal_liver_metastases` | `colorectal_liver_metastases_clinical` | `age` | Age at Operation |
| `covid_19_ar` | `covid_19_ar_clinical` | `age` | AGE |
| `cptac_aml` | `cptac_aml_demographic_classification` | `age` | Age (all "Not Reported") |
| `ea1141` | `ea1141_demographics` | `age` | Participant age at study enrollment |
| `hcc_tace_seg` | `hcc_tace_seg_clinical` | `age` | Patient age in years during diagnosis |
| `ispy1` | `ispy1_clinical` | `age` | age |
| `lung_pet_ct_dx` | `lung_pet_ct_dx_clinical` | `age` | Age |
| `nlst` | `nlst_prsn` | `age` | Age at randomization (years) |
| `nsclc_radiomics` | `nsclc_radiomics_clinical` | `age` | age |
| `remind` | `remind_clinical` | `age` | Years between birth and surgery date |
| `soft_tissue_sarcoma` | `soft_tissue_sarcoma_clinical` | `age` | Age |
Note: Even with the same column name, age can refer to different time points:
- Age at enrollment/randomization (clinical trials)
- Age at diagnosis
- Age at surgery/operation
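The reference point can often be inferred from the column label. The sketch below applies a few keyword rules to labels quoted in this report; the rules and the names `REFERENCE_POINT_RULES` and `age_reference_point` are illustrative assumptions and would need extending for other collections.

```python
import re

# (reference point, regex) pairs checked in order; keyword choices are an assumption.
REFERENCE_POINT_RULES = [
    ("enrollment", r"enrol|randomi[sz]ation"),
    ("diagnosis", r"diagnos"),
    ("surgery", r"surgery|operation"),
]

def age_reference_point(column_label):
    """Guess which event an age column refers to from its label."""
    for name, pattern in REFERENCE_POINT_RULES:
        if re.search(pattern, column_label, re.IGNORECASE):
            return name
    return "unspecified"

labels = {
    "acrin_6698": "Age in years at enrollment",
    "colorectal_liver_metastases": "Age at Operation",
    "hcc_tace_seg": "Patient age in years during diagnosis",
    "nlst": "Age at randomization (years)",
    "remind": "Years between birth and surgery date",
    "cmmd": "Age",
}
reference_points = {cid: age_reference_point(label) for cid, label in labels.items()}
```

Labels such as "Age" or "AGE" carry no timing information, so they fall through to "unspecified" and need the collection's documentation instead.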
| Category | Collections | Notes |
|---|---|---|
| TCGA | 32 | All have age_at_diagnosis in years |
| CPTAC | 11 | Use diag__age_at_diagnosis in days |
| Other with clinical age | 15 | Various column names |
| DICOM age only | 30 | No clinical age column |
| No age data | 12 | Neither source available |
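A collection can be assigned to one of these availability groups from two counts: valid clinical ages and valid DICOM ages. The sketch below assumes a source counts only when it has at least one valid value (one reading of "meaningful clinical age" in the summary); the function name and the category labels are hypothetical, and the example counts are taken from the tables in this report.

```python
from collections import Counter

def age_source_category(n_valid_clinical, n_valid_dicom):
    """Classify a collection by which age sources hold at least one valid value."""
    has_clinical = n_valid_clinical > 0
    has_dicom = n_valid_dicom > 0
    if has_clinical and has_dicom:
        return "both"
    if has_clinical:
        return "clinical_only"
    if has_dicom:
        return "dicom_only"
    return "none"

# (valid clinical ages, DICOM age values) as reported in this document.
examples = {
    "tcga_brca": (1096, 45),
    "tcga_acc": (92, 0),
    "cptac_aml": (0, 30),   # age column exists but holds only "Not Reported"
}
categories = {cid: age_source_category(*counts) for cid, counts in examples.items()}
tally = Counter(categories.values())
```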
All TCGA collections use age_at_diagnosis in years with consistent formatting.
| Collection | Patients | Valid Age | Range | Mean | Quality | DICOM Age Available |
|---|---|---|---|---|---|---|
| tcga_acc | 92 | 92 | 14-83 | 47 | GOOD | No |
| tcga_blca | 412 | 412 | 34-90 | 68 | GOOD | Yes (38 values) |
| tcga_brca | 1,098 | 1,096 | 26-90 | 58 | GOOD | Yes (45 values) |
| tcga_cesc | 307 | 307 | 20-88 | 48 | GOOD | Yes (25 values) |
| tcga_chol | 51 | 45 | 29-82 | 64 | PARTIAL | No |
| tcga_coad | 460 | 458 | 31-90 | 67 | GOOD | Yes (23 values) |
| tcga_dlbc | 48 | 48 | 23-82 | 56 | GOOD | No |
| tcga_esca | 185 | 185 | 27-90 | 62 | GOOD | Yes (13 values) |
| tcga_gbm | 607 | 595 | 10-89 | 58 | GOOD | No |
| tcga_hnsc | 523 | 522 | 19-90 | 61 | GOOD | No |
| tcga_kich | 113 | 113 | 17-86 | 51 | GOOD | Yes (14 values) |
| tcga_kirc | 537 | 537 | 26-90 | 61 | GOOD | Yes (54 values) |
| tcga_kirp | 291 | 288 | 28-88 | 61 | GOOD | Yes (27 values) |
| tcga_lgg | 516 | 515 | 14-87 | 43 | GOOD | No |
| tcga_lihc | 377 | 376 | 16-90 | 59 | GOOD | Yes (49 values) |
| tcga_luad | 560 | 503 | 33-88 | 65 | PARTIAL | Yes (36 values) |
| tcga_lusc | 504 | 495 | 39-90 | 67 | GOOD | Yes (25 values) |
| tcga_meso | 87 | 87 | 28-81 | 63 | GOOD | No |
| tcga_ov | 590 | 587 | 26-89 | 60 | GOOD | Yes (44 values) |
| tcga_paad | 185 | 185 | 35-88 | 65 | GOOD | No |
| tcga_pcpg | 179 | 179 | 19-83 | 47 | GOOD | No |
| tcga_prad | 500 | 500 | 41-78 | 61 | GOOD | Yes (13 values) |
| tcga_read | 171 | 170 | 31-90 | 64 | GOOD | Yes (3 values) |
| tcga_sarc | 261 | 261 | 20-90 | 61 | GOOD | Yes (5 values) |
| tcga_skcm | 470 | 462 | 15-90 | 58 | GOOD | No |
| tcga_stad | 443 | 438 | 30-90 | 66 | GOOD | Yes (24 values) |
| tcga_tgct | 150 | 134 | 14-67 | 32 | PARTIAL | No |
| tcga_thca | 507 | 507 | 15-89 | 47 | GOOD | Yes (6 values) |
| tcga_thym | 124 | 123 | 17-84 | 58 | GOOD | No |
| tcga_ucec | 560 | 545 | 31-90 | 64 | GOOD | Yes (44 values) |
| tcga_ucs | 57 | 57 | 51-90 | 70 | GOOD | No |
| tcga_uvm | 80 | 80 | 22-86 | 62 | GOOD | No |
Important: CPTAC stores diag__age_at_diagnosis in days since birth. Values must be divided by 365.25 to convert to years.
| Collection | Patients | Valid Age | Range (years) | Mean | Quality | DICOM Age |
|---|---|---|---|---|---|---|
| cptac_aml | 275 | 0 | N/A | N/A | NO DATA | Yes (30 values) |
| cptac_brca | 123 | 112 | 31-90 | 60 | GOOD | Yes (47 values) |
| cptac_ccrcc | 110 | 110 | 30-90 | 61 | GOOD | Yes (51 values) |
| cptac_coad | 106 | 0 | N/A | N/A | NO DATA | Yes (42 values) |
| cptac_gbm | 99 | 99 | 24-89 | 58 | GOOD | Yes (51 values) |
| cptac_hnscc | 110 | 110 | 23-81 | 62 | GOOD | Yes (34 values) |
| cptac_lscc | 108 | 105 | 41-88 | 66 | GOOD | Yes (40 values) |
| cptac_luad | 111 | 111 | 35-82 | 63 | GOOD | Yes (46 values) |
| cptac_ov | 63 | 0 | N/A | N/A | NO DATA | Yes (29 values) |
| cptac_pda | 140 | 138 | 32-85 | 65 | GOOD | Yes (47 values) |
| cptac_ucec | 100 | 100 | 38-90 | 64 | GOOD | Yes (46 values) |
Example conversion:
Raw value: 21747 days
Converted: 21747 ÷ 365.25 = 59.5 years
| Collection | Column | Patients | Valid Age | Range | Mean | Quality | DICOM Age |
|---|---|---|---|---|---|---|---|
| acrin_6698 | age | 385 | 381 | 23-77 | 49 | GOOD | No |
| adrenal_acc_ki67_seg | age | 53 | 53 | 22-82 | 53 | GOOD | Yes |
| cmmd | age | 1,872 | 1,872 | 17-87 | 47 | GOOD | Yes |
| colorectal_liver_metastases | age | 197 | 197 | 30-88 | 60 | GOOD | Yes |
| covid_19_ar | age | 105 | 105 | 19-91 | 54 | GOOD | Yes |
| ctpred_sunitinib_pannet | age_at_diagnosis | 38 | 38 | 21-72 | 48 | GOOD | Yes |
| ea1141 | age | 500 | 500 | 40-75 | 55 | GOOD | Yes |
| hcc_tace_seg | age | 105 | 105 | 31-93 | 68 | GOOD | Sparse (1 value) |
| ispy1 | age | 221 | 221 | 27-69 | 48 | GOOD | Yes |
| lung_fused_ct_pathology | age_at_diagnosis | 6 | 6 | 56-85 | 75 | GOOD | Yes |
| lung_pet_ct_dx | age | 355 | 354 | 28-90 | 61 | GOOD | Yes |
| nlst | age | 53,452 | 53,452 | 43-79 | 61 | GOOD | Sparse (20 values) |
| nsclc_radiomics | age | 422 | 400 | 34-92 | 68 | GOOD | Yes |
| remind | age | 114 | 114 | 20-76 | 47 | GOOD | No |
| soft_tissue_sarcoma | age | 51 | 51 | 16-83 | 55 | GOOD | Yes |
These collections have DICOM PatientAge but no corresponding clinical age column:
| Collection | DICOM Patients | Unique Ages | Range | Mean |
|---|---|---|---|---|
| acrin_contralateral_breast_mr | 984 | 57 | 27-86 | 56 |
| acrin_flt_breast | 83 | 41 | 22-83 | 50 |
| acrin_nsclc_fdg_pet | 242 | 53 | 1-90 | 56 |
| advanced_mri_breast_lesions | 632 | 60 | 19-86 | 53 |
| anti_pd_1_lung | 46 | 32 | 37-85 | 65 |
| breast_mri_nact_pilot | 64 | 38 | 29-72 | 48 |
| c4kc_kits | 210 | 57 | 10-90 | 55 |
| covid_19_ny_sbu | 1,384 | 72 | 18-90 | 54 |
| duke_breast_cancer_mri | 922 | 57 | 24-89 | 54 |
| lidc_idri | 1,010 | 58 | 14-88 | 55 |
| nsclc_radiogenomics | 211 | 39 | 24-86 | 65 |
| prostate_mri_us_biopsy | 1,151 | 41 | 45-90 | 66 |
| prostatex | 346 | 38 | 35-78 | 59 |
| upenn_gbm | 630 | 63 | 18-88 | 56 |
| Collection | Notes |
|---|---|
| bonemarrowwsi_pediatricleukemia | Pediatric - DICOM shows ages 2-17 |
| cbis_ddsm | 6,671 patients, no age in DICOM |
| cc_radiomics_phantom_3 | Phantom data (not human) |
| htan_hms, htan_ohsu, htan_vanderbilt, htan_wustl | HTAN collections - age columns exist but empty |
| ispy2 | 719 patients, no DICOM age |
| mediastinal_lymph_node_seg | 513 patients |
| nsclc_radiomics_interobserver1 | 22 patients |
| prostate_diagnosis | 92 patients |
Finding: No placeholder values (999), zeros, or negatives were found in clinical age data across all 14 collections with the age column. All values represent clinically plausible ages for adult cancer patients (typically 17-93 years).
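A check of this kind can be sketched as follows. The placeholder value 999 comes from the finding above; the upper bound of 120 years and the function name `flag_implausible_ages` are assumptions made for illustration.

```python
def flag_implausible_ages(ages, max_age=120, placeholders=(999,)):
    """Return {position: reason} for ages that look like placeholders or errors.

    `ages` is a sequence of numbers in years; None and NaN count as missing.
    """
    issues = {}
    for i, age in enumerate(ages):
        if age is None or age != age:   # None or NaN
            issues[i] = "missing"
        elif age in placeholders:
            issues[i] = "placeholder"
        elif age < 0:
            issues[i] = "negative"
        elif age == 0:
            issues[i] = "zero"
        elif age > max_age:
            issues[i] = "above max_age"
    return issues

sample = [47, 999, 0, -3, 130, None, 68.6]
problems = flag_implausible_ages(sample)
```

An empty result for a column corresponds to the situation reported above, where no placeholders, zeros, or negatives were found.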
Problem: CPTAC collections store age as days since birth, not years.
Example values:
| Raw Value (days) | Converted (years) |
|---|---|
| 21747 | 59.5 |
| 19709 | 54.0 |
| 15512 | 42.5 |
| 26492 | 72.5 |
Solution: Divide by 365.25 before use.
Finding: cptac_aml has an age column, but all values are "Not Reported".
| DICOM Status | Collections | Examples |
|---|---|---|
| Not available | 2 | acrin_6698, remind |
| Mostly missing | 2 | hcc_tace_seg (1 value), nlst (20 values for 53K patients) |
| Incomplete | 3 | ea1141, ispy1 (zeros + empties) |
| Good coverage | 8 | cmmd, covid_19_ar, nsclc_radiomics, soft_tissue_sarcoma |
Problem: Some collections use "000Y" as a placeholder for missing age.
| Collection | Count of "000Y" | Empty/Null |
|---|---|---|
| ea1141 | 5 | 76 |
| ispy1 | 1 | 26 |
| lung_pet_ct_dx | 1 | 0 |
| c4kc_kits | Present | - |
Solution: Filter out age = 0 when analyzing DICOM PatientAge.
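DICOM encodes `PatientAge` as an Age String of the form `nnnD`, `nnnW`, `nnnM`, or `nnnY`, so a parser that only understands years will drop pediatric values given in months or weeks. The following is a minimal sketch that handles all four units and treats zero as missing; the function name is hypothetical, and treating every zero value (not only `000Y`) as a placeholder is an assumption that may not suit neonatal data.

```python
import re

_AGE_STRING = re.compile(r"^(\d{1,3})([DWMY])$", re.IGNORECASE)

# How many of each unit make up one year.
_UNITS_PER_YEAR = {"D": 365.25, "W": 365.25 / 7, "M": 12, "Y": 1}

def dicom_age_to_years(age_str):
    """Parse a DICOM Age String into years.

    Returns None for empty, malformed, or zero values (e.g. '000Y').
    """
    if age_str is None:
        return None
    match = _AGE_STRING.match(str(age_str).strip())
    if not match:
        return None
    number = int(match.group(1))
    unit = match.group(2).upper()
    if number == 0:
        return None  # assumption: zero is a missing-value placeholder
    return number / _UNITS_PER_YEAR[unit]

adult = dicom_age_to_years("045Y")
infant = dicom_age_to_years("018M")
placeholder = dicom_age_to_years("000Y")
```

With pandas this could be applied via `dicom_df['PatientAge'].apply(dicom_age_to_years)`, after which missing values can be dropped in one step.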
Problem: Some clinical tables store age with decimal precision (e.g., 28.76 years) while DICOM stores integers.
| Collection | Clinical Example | DICOM Example |
|---|---|---|
| ispy1 | 28.76 | 029Y |
| nsclc_radiomics | 68.6 | 069Y |
Impact: 0% exact match, but nearly all values agree within ±1 year (100% for ispy1, 95.7% for nsclc_radiomics).
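When one source is decimal and the other an integer, comparing with a tolerance is more informative than testing for equality. The sketch below classifies a pair of ages under that idea; the function name, the category labels, and the default tolerance of one year are illustrative choices rather than the method used to produce the figures in this report.

```python
def age_agreement(clinical_age, dicom_age, tolerance=1.0):
    """Classify agreement between a clinical age and a DICOM age (both in years)."""
    diff = clinical_age - dicom_age
    if diff == 0:
        return "exact"
    if abs(diff) <= tolerance:
        return "within_tolerance"
    return "discordant"

# Pairs quoted in this report: (clinical age, DICOM age).
pairs = {
    "ispy1 example": (28.76, 29),
    "nsclc_radiomics example": (68.6, 69),
    "CRLM-CT-1190": (80, 33),
}
agreement = {name: age_agreement(c, d) for name, (c, d) in pairs.items()}
```

Discordant pairs are worth inspecting individually, since they may reflect different reference events rather than data errors.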
Problem: Clinical "age" and DICOM "PatientAge" may measure age at different events.
Case Study: colorectal_liver_metastases
| Patient | Clinical Age (at Operation) | DICOM Age (at CT Scan) | Difference |
|---|---|---|---|
| CRLM-CT-1190 | 80 | 33 | 47 years |
| CRLM-CT-1185 | 74 | 30 | 44 years |
| CRLM-CT-1093 | 71 | 32 | 39 years |
Explanation: CT scans from 1991-1995 were acquired decades before the surgical intervention. The clinical age refers to age at operation, not at imaging.
Problem: Some large collections have very few unique DICOM age values.
| Collection | Patients | Unique DICOM Ages | Coverage |
|---|---|---|---|
| nlst | 26,410 | 20 | <0.1% |
| hcc_tace_seg | 105 | 1 | <1% |
| tcga_gbm | 607 | 0 | 0% |
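One way to quantify sparsity is the share of patients with a usable `PatientAge`. The sketch below computes that share from a mapping of patient IDs to raw age strings; it is not necessarily how the Coverage column above was derived, and the function name and the sample IDs are hypothetical.

```python
def dicom_age_coverage(patient_ages):
    """Percentage of patients with a non-empty, non-placeholder PatientAge.

    `patient_ages` maps PatientID -> raw PatientAge string (or None).
    """
    if not patient_ages:
        return 0.0
    missing_markers = (None, "", "000Y")
    valid = sum(1 for value in patient_ages.values() if value not in missing_markers)
    return 100.0 * valid / len(patient_ages)

# Hypothetical patients: one valid age, three unusable values.
sample = {"P1": "061Y", "P2": "", "P3": None, "P4": "000Y"}
coverage = dicom_age_coverage(sample)
```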
Analysis was performed on collections with both clinical age and DICOM PatientAge available:
| Collection | Patients Compared | Exact Match | Within ±1 yr | Within ±2 yr | Mean Diff | Std Diff | Range |
|---|---|---|---|---|---|---|---|
| cmmd | 1,872 | 100% | 100% | 100% | 0.00 | 0.00 | [0, 0] |
| lung_pet_ct_dx | 353 | 99.4% | 100% | 100% | 0.01 | 0.08 | [0, 1] |
| covid_19_ar | 103 | 97.1% | 100% | 100% | 0.03 | 0.17 | [0, 1] |
| soft_tissue_sarcoma | 51 | 94.1% | 100% | 100% | -0.06 | 0.24 | [-1, 0] |
| adrenal_acc_ki67_seg | 50 | 56.0% | 100% | 100% | 0.38 | 0.46 | [0, 1] |
| ea1141 | 478 | 52.9% | 95.8% | 99.2% | -0.20 | 4.48 | [-2, 73] |
| nsclc_radiomics | 304 | 0% | 95.7% | 99.3% | 0.52 | 0.45 | [-4, 3] |
| ispy1 | 221 | 0% | 100% | 100% | 0.01 | 0.30 | [0, 0] |
| colorectal_liver_metastases | 195 | 3.1% | 8.2% | 10.8% | 0.18 | 17.01 | [-38, 47] |
| Collection | Patients Compared | Exact Match | Within ±1 yr | Notes |
|---|---|---|---|---|
| cmmd | 1,872 | 100% | 100% | Perfect match |
| lung_pet_ct_dx | 353 | 99.4% | 100% | Near-perfect |
| covid_19_ar | 103 | 97.1% | 100% | Excellent |
| soft_tissue_sarcoma | 51 | 94.1% | 100% | Excellent |
| Collection | Exact Match | Within ±1 yr | Cause |
|---|---|---|---|
| ispy1 | 0% | 100% | Decimal clinical (28.76), integer DICOM (029Y) |
| nsclc_radiomics | 0% | 95.7% | Decimal clinical (68.6), integer DICOM |
| adrenal_acc_ki67_seg | 56% | 100% | Rounding differences |
| Collection | Clinical Meaning | DICOM Meaning | Consistency |
|---|---|---|---|
| colorectal_liver_metastases | Age at operation | Age at CT scan | Poor (up to 47 years difference) |
| nlst | Age at randomization | Age at scan | Cannot compare (sparse DICOM) |
| Cause | Collections Affected | Impact |
|---|---|---|
| Different reference points | colorectal_liver_metastases | Large differences (up to 47 years) |
| Decimal vs integer precision | ispy1, nsclc_radiomics | < 1 year differences |
| Rounding at data entry | adrenal_acc_ki67_seg | < 1 year differences |
| Missing DICOM data | acrin_6698, nlst, remind | Cannot compare |
- Always check the column label to understand what "age" refers to in each collection

  ```python
  client.clinical_index[client.clinical_index['column'] == 'age']['column_label']
  ```

- Convert CPTAC ages from days to years

  ```python
  age_years = age_days / 365.25
  ```

- Filter DICOM placeholder values

  ```python
  valid_ages = dicom_df[dicom_df['PatientAge'] != '000Y']
  ```

- Use clinical age for clinical events, DICOM age for imaging timing
- Best collections for cross-validated age data:
  - cmmd, covid_19_ar, soft_tissue_sarcoma, lung_pet_ct_dx
  - Most CPTAC collections (after days→years conversion)
  - Most TCGA collections with DICOM coverage
| Use Case | Recommended Source |
|---|---|
| Age at diagnosis | Clinical `age_at_diagnosis` |
| Age at imaging | DICOM `PatientAge` |
| Age at treatment/surgery | Clinical (check column label) |
| Cross-study comparisons | Use single source consistently |
Both Clinical + DICOM Age (39 collections) - Best for cross-validation:
- TCGA: blca, brca, cesc, coad, esca, kich, kirc, kirp, lihc, luad, lusc, ov, prad, read, sarc, stad, thca, ucec
- CPTAC: brca, ccrcc, gbm, hnscc, lscc, luad, pda, ucec
- Others: adrenal_acc_ki67_seg, cmmd, colorectal_liver_metastases, covid_19_ar, ctpred_sunitinib_pannet, ea1141, ispy1, lung_fused_ct_pathology, lung_pet_ct_dx, nsclc_radiomics, soft_tissue_sarcoma
Clinical Age Only (16 collections):
- TCGA: acc, chol, dlbc, gbm, hnsc, lgg, meso, paad, pcpg, skcm, tgct, thym, ucs, uvm
- Others: acrin_6698, hcc_tace_seg, nlst, remind
DICOM Age Only (30 collections):
- Large collections: cbis_ddsm (6,671 patients), covid_19_ny_sbu (1,384), lidc_idri (1,010), ispy2 (719)
- Various imaging studies without linked clinical tables
No Age Data (12 collections):
- htan_*, mediastinal_lymph_node_seg, nsclc_radiomics_interobserver1, prostate_diagnosis, cc_radiomics_phantom_3
```python
from idc_index import IDCClient

client = IDCClient()
client.fetch_index('clinical_index')

# Find collections with age data
age_collections = client.clinical_index[
    client.clinical_index['column'].isin(['age', 'age_at_diagnosis'])
]['collection_id'].unique()

# Load clinical table
clinical_df = client.get_clinical_table('tcga_brca_clinical')
```

```python
import pandas as pd

# CPTAC stores age in days
clinical_df = client.get_clinical_table('cptac_brca_clinical')
# Coerce to numeric first, in case the values are stored as strings
age_days = pd.to_numeric(clinical_df['diag__age_at_diagnosis'], errors='coerce')
clinical_df['age_years'] = age_days / 365.25
```

```python
import re
import pandas as pd
import numpy as np

def parse_dicom_age(age_str):
    """Convert DICOM PatientAge (e.g., '045Y') to numeric years.

    Values given in days, weeks, or months (D/W/M suffix) return NaN.
    """
    if pd.isna(age_str) or age_str == '':
        return np.nan
    match = re.match(r'^(\d+)[Yy]?$', str(age_str).strip())
    if match:
        return int(match.group(1))
    return np.nan

# Apply to DICOM data (dicom_df is any DataFrame with a PatientAge column)
dicom_df['age_numeric'] = dicom_df['PatientAge'].apply(parse_dicom_age)

# Filter placeholders
dicom_df = dicom_df[dicom_df['age_numeric'] > 0]
```

```python
import pandas as pd

# Get clinical data
clinical_df = client.get_clinical_table('cmmd_clinical')

# Get DICOM data (copy so that adding a column does not modify a view)
dicom_df = client.index[client.index['collection_id'] == 'cmmd'][
    ['PatientID', 'PatientAge']
].drop_duplicates().copy()

# Parse DICOM age
dicom_df['dicom_age'] = dicom_df['PatientAge'].apply(parse_dicom_age)

# Merge on patient ID
merged = pd.merge(
    clinical_df[['dicom_patient_id', 'age']],
    dicom_df[['PatientID', 'dicom_age']],
    left_on='dicom_patient_id',
    right_on='PatientID'
)

# Calculate difference (coerce clinical age in case it is stored as text)
merged['age_diff'] = pd.to_numeric(merged['age'], errors='coerce') - merged['dicom_age']
```

```python
# Get observed values for coded columns
age_values = client.clinical_index[
    (client.clinical_index['collection_id'] == 'nlst') &
    (client.clinical_index['column'] == 'age')
]['values'].values[0]

# Create mapping dictionary
mapping = {item['option_code']: item['option_description']
           for item in age_values}
```

```python
from idc_index import IDCClient
import pandas as pd
import numpy as np
import re

client = IDCClient()
client.fetch_index('clinical_index')

def parse_dicom_age(age_str):
    """Convert DICOM PatientAge (e.g., '045Y') to numeric years"""
    if pd.isna(age_str) or age_str == '':
        return np.nan
    match = re.match(r'^(\d+)[Yy]?$', str(age_str).strip())
    if match:
        return int(match.group(1))
    return np.nan

def analyze_collection_age(collection_id, age_column, table_name, is_days=False):
    """Analyze age consistency for a collection"""
    # Load clinical data
    clinical_df = client.get_clinical_table(table_name)

    # Parse clinical age (non-numeric entries such as "Not Reported" become NaN)
    clinical_df['clinical_age'] = pd.to_numeric(clinical_df[age_column], errors='coerce')
    if is_days:
        clinical_df['clinical_age'] = clinical_df['clinical_age'] / 365.25

    # Get DICOM data
    dicom_df = client.index[client.index['collection_id'] == collection_id][
        ['PatientID', 'PatientAge']
    ].drop_duplicates().copy()
    dicom_df['dicom_age'] = dicom_df['PatientAge'].apply(parse_dicom_age)

    # Merge, keeping the first DICOM age per patient
    merged = pd.merge(
        clinical_df[['dicom_patient_id', 'clinical_age']],
        dicom_df.groupby('PatientID')['dicom_age'].first().reset_index(),
        left_on='dicom_patient_id',
        right_on='PatientID'
    ).dropna(subset=['clinical_age', 'dicom_age'])

    if len(merged) == 0:
        return None

    # Calculate statistics
    merged['diff'] = merged['clinical_age'] - merged['dicom_age']
    return {
        'collection': collection_id,
        'n_compared': len(merged),
        'exact_match': (merged['diff'] == 0).mean() * 100,
        'within_1yr': (merged['diff'].abs() <= 1).mean() * 100,
        'mean_diff': merged['diff'].mean(),
        'std_diff': merged['diff'].std()
    }

# Example usage
result = analyze_collection_age('cmmd', 'age', 'cmmd_clinical')
print(result)
```