# GDC Query Evaluation Framework

This notebook evaluates 30 queries across three complexity levels against the Genomic Data Commons (GDC) API:
- **Basic Discovery (Low Complexity)**: 10 queries (EV-L01 to EV-L10)
- **Entity Filtering (Medium Complexity)**: 10 queries (EV-M01 to EV-M10)  
- **Complex Cohorts (High Complexity)**: 10 queries (EV-H01 to EV-H10)

In [1]:
# Import Required Libraries

import requests
import json
import time

In [2]:

# GDC API Configuration
GDC_API_BASE = "https://api.gdc.cancer.gov"
results = {}

In [3]:
# ============================================================================
# HELPER FUNCTIONS
# ============================================================================

def graphql_query(query, variables=None):
    """Execute GraphQL query against GDC"""
    url = f"{GDC_API_BASE}/v0/graphql"
    headers = {"Content-Type": "application/json"}
    payload = {"query": query}
    if variables:
        payload["variables"] = variables

    response = requests.post(url, json=payload, headers=headers)

    # Better error handling
    if response.status_code != 200:
        print(f"‚ùå GraphQL Error: {response.status_code}")
        print(f"Response: {response.text}")
        return None

    result = response.json()
    if "errors" in result:
        print(f"‚ùå GraphQL Errors: {json.dumps(result['errors'], indent=2)}")
        return None

    return result


def rest_query(endpoint, params=None):
    """Execute REST API query against GDC"""
    url = f"{GDC_API_BASE}/{endpoint}"
    headers = {"Content-Type": "application/json"}

    response = requests.get(url, params=params, headers=headers)
    response.raise_for_status()
    return response.json()

In [4]:
# EV-L01: In the GDC database, list all available program names
# Direct entity list (program names)
def eval_L01():
    start = time.time()
    try:
        # Get programs through projects endpoint since programs endpoint doesn't exist
        result = rest_query("projects", {
            "size": "2000",
            "fields": "program.name"
        })
        
        # Extract unique program names
        programs = set()
        for project in result["data"]["hits"]:
            program_info = project.get("program", {})
            if isinstance(program_info, dict) and "name" in program_info:
                programs.add(program_info["name"])
            elif isinstance(program_info, list):
                for prog in program_info:
                    if isinstance(prog, dict) and "name" in prog:
                        programs.add(prog["name"])
        
        programs_list = sorted(list(programs))
        count = len(programs_list)
        
        print(f"‚úÖ EV-L01: Found {count} programs")
        print(f"Programs: {', '.join(programs_list)}")
        
        results["EV-L01"] = {
            "status": "success",
            "result": f"{count} programs",
            "data": programs_list,
            "time": time.time() - start,
        }
    except Exception as e:
        print(f"‚ùå EV-L01 Failed: {e}")
        results["EV-L01"] = {"status": "error", "error": str(e), "time": time.time() - start}

eval_L01()

‚úÖ EV-L01: Found 23 programs
Programs: APOLLO, BEATAML1.0, CCDI, CDDP_EAGLE, CGCI, CMI, CPTAC, CTSP, EXCEPTIONAL_RESPONDERS, FM, HCMI, MATCH, MMRF, MP2PRT, NCICCR, OHSU, ORGANOID, REBC, TARGET, TCGA, TRIO, VAREPOP, WCDT


In [5]:
# EV-L02: In the GDC database, count the total number of projects
# Simple count (total projects)
def eval_L02():
    start = time.time()
    try:
        result = rest_query("projects", {"size": "0"})
        count = result["data"]["pagination"]["total"]
        
        print(f"‚úÖ EV-L02: Found {count} projects")
        
        results["EV-L02"] = {
            "status": "success",
            "result": f"{count} projects",
            "time": time.time() - start,
        }
    except Exception as e:
        print(f"‚ùå EV-L02 Failed: {e}")
        results["EV-L02"] = {"status": "error", "error": str(e), "time": time.time() - start}

eval_L02()

‚úÖ EV-L02: Found 88 projects


In [8]:
# EV-L03: In the GDC database, retrieve the primary sites represented across all projects
# Basic metadata retrieval (primary sites)
def eval_L03():
    start = time.time()
    try:
        result = rest_query("projects", {
            "size": "2000",
            "fields": "primary_site"
        })
        
        primary_sites = set()
        for project in result["data"]["hits"]:
            sites = project.get("primary_site", [])
            if isinstance(sites, list):
                for site in sites:
                    if site and site != "_missing":  # Exclude _missing values
                        primary_sites.add(site)
            elif sites and sites != "_missing":  # Exclude _missing values
                primary_sites.add(sites)
        
        count = len(primary_sites)
        sorted_sites = sorted(primary_sites)
        
        print(f"‚úÖ EV-L03: Found {count} primary sites (excluding '_missing')")
        print(f"Sites: {'; '.join(sorted_sites)}")
        
        results["EV-L03"] = {
            "status": "success",
            "result": f"{count} primary sites",
            "data": sorted_sites,
            "time": time.time() - start
        }
    except Exception as e:
        print(f"‚ùå EV-L03 Failed: {e}")
        results["EV-L03"] = {"status": "error", "error": str(e), "time": time.time() - start}

eval_L03()

‚úÖ EV-L03: Found 69 primary sites (excluding '_missing')
Sites: Accessory sinuses; Adrenal gland; Anus and anal canal; Base of tongue; Bladder; Bones, joints and articular cartilage of limbs; Bones, joints and articular cartilage of other and unspecified sites; Brain; Breast; Bronchus and lung; Cervix uteri; Colon; Connective, subcutaneous and other soft tissues; Corpus uteri; Esophagus; Eye and adnexa; Floor of mouth; Gallbladder; Gum; Heart, mediastinum, and pleura; Hematopoietic and reticuloendothelial systems; Hypopharynx; Kidney; Larynx; Lip; Liver and intrahepatic bile ducts; Lymph nodes; Meninges; Nasal cavity and middle ear; Nasopharynx; Not Reported; Oropharynx; Other and ill-defined digestive organs; Other and ill-defined sites; Other and ill-defined sites in lip, oral cavity and pharynx; Other and ill-defined sites within respiratory system and intrathoracic organs; Other and unspecified female genital organs; Other and unspecified major salivary glands; Other and unspecifi

In [9]:
# EV-L04: In the GDC database, list all data categories (e.g., Raw Sequencing Data, Transcriptome Profiling)
# Single-field lookup (data categories)
def eval_L04():
    start = time.time()
    try:
        result = rest_query("files", {
            "size": "0",
            "facets": "data_category"
        })
        
        categories = []
        for bucket in result["data"]["aggregations"]["data_category"]["buckets"]:
            categories.append(bucket["key"])
        
        count = len(categories)
        print(f"‚úÖ EV-L04: Found {count} data categories")
        print(f"Categories: {'; '.join(sorted(categories))}")
        
        results["EV-L04"] = {
            "status": "success", 
            "result": f"{count} data categories",
            "data": sorted(categories),
            "time": time.time() - start
        }
    except Exception as e:
        print(f"‚ùå EV-L04 Failed: {e}")
        results["EV-L04"] = {"status": "error", "error": str(e), "time": time.time() - start}

eval_L04()

‚úÖ EV-L04: Found 11 data categories
Categories: biospecimen; clinical; combined nucleotide variation; copy number variation; dna methylation; proteome profiling; sequencing reads; simple nucleotide variation; somatic structural variation; structural variation; transcriptome profiling


In [11]:
# EV-L05: In the GDC database, get all experimental strategies used
# Direct enumeration (experimental strategies)
def eval_L05():
    start = time.time()
    try:
        result = rest_query("files", {
            "size": "0",
            "facets": "experimental_strategy"
        })
        
        strategies = []
        for bucket in result["data"]["aggregations"]["experimental_strategy"]["buckets"]:
            strategy = bucket["key"]
            if strategy and strategy != "_missing":  # Exclude _missing values
                strategies.append(strategy)
        
        count = len(strategies)
        print(f"‚úÖ EV-L05: Found {count} experimental strategies (excluding '_missing')")
        print(f"Strategies: {'; '.join(sorted(strategies))}")
        
        results["EV-L05"] = {
            "status": "success",
            "result": f"{count} experimental strategies", 
            "data": sorted(strategies),
            "time": time.time() - start
        }
    except Exception as e:
        print(f"‚ùå EV-L05 Failed: {e}")
        results["EV-L05"] = {"status": "error", "error": str(e), "time": time.time() - start}

eval_L05()

‚úÖ EV-L05: Found 13 experimental strategies (excluding '_missing')
Strategies: ATAC-Seq; Diagnostic Slide; Expression Array; Genotyping Array; Methylation Array; RNA-Seq; Reverse Phase Protein Array; Targeted Sequencing; Tissue Slide; WGS; WXS; miRNA-Seq; scRNA-Seq


In [12]:
# EV-L06: In the GDC database, list all file formats
# Basic metadata (file formats)
def eval_L06():
    start = time.time()
    try:
        result = rest_query("files", {
            "size": "0", 
            "facets": "data_format"
        })
        
        formats = []
        for bucket in result["data"]["aggregations"]["data_format"]["buckets"]:
            formats.append(bucket["key"])
        
        count = len(formats)
        print(f"‚úÖ EV-L06: Found {count} file formats")
        print(f"Formats: {'; '.join(sorted(formats))}")
        
        results["EV-L06"] = {
            "status": "success",
            "result": f"{count} file formats",
            "data": sorted(formats),
            "time": time.time() - start
        }
    except Exception as e:
        print(f"‚ùå EV-L06 Failed: {e}")
        results["EV-L06"] = {"status": "error", "error": str(e), "time": time.time() - start}

eval_L06()

‚úÖ EV-L06: Found 22 file formats
Formats: bam; bcr auxiliary xml; bcr biotab; bcr omf xml; bcr pps xml; bcr ssf xml; bcr xml; bedpe; cdc json; cel; hdf5; idat; jpeg 2000; maf; mex; pdf; svs; tar; tsv; txt; vcf; xlsx


In [13]:
# EV-L07: In the GDC database, list annotation categories and classifications used in Annotations to flag QC issues
# Simple lookup (annotation categories/classifications)
def eval_L07():
    start = time.time()
    try:
        result = rest_query("annotations", {
            "size": "0",
            "facets": "category,classification"
        })
        
        categories = []
        classifications = []
        
        for bucket in result["data"]["aggregations"]["category"]["buckets"]:
            categories.append(bucket["key"])
            
        for bucket in result["data"]["aggregations"]["classification"]["buckets"]:
            classifications.append(bucket["key"])
        
        print(f"‚úÖ EV-L07: Found {len(categories)} annotation categories, {len(classifications)} classifications")
        print(f"Categories: {'; '.join(categories)}")
        print(f"Classifications: {'; '.join(classifications)}")
        
        results["EV-L07"] = {
            "status": "success",
            "result": f"{len(categories)} categories, {len(classifications)} classifications",
            "data": {"categories": categories, "classifications": classifications},
            "time": time.time() - start
        }
    except Exception as e:
        print(f"‚ùå EV-L07 Failed: {e}")
        results["EV-L07"] = {"status": "error", "error": str(e), "time": time.time() - start}

eval_L07()

‚úÖ EV-L07: Found 35 annotation categories, 4 classifications
Categories: general; item is noncanonical; item flagged dnu; prior malignancy; alternate sample pipeline; center qc failed; history of unacceptable prior treatment related to a prior/other malignancy; item in special subset; synchronous malignancy; neoadjuvant therapy; genotype mismatch; bcr notification; history of acceptable prior treatment related to a prior/other malignancy; item does not meet study protocol; item flagged low quality; item may not meet study protocol; case submitted is found to be a recurrence after submission; permanently missing item or object; duplicate case; subject identity unknown; molecular analysis outside specification; pathology outside specification; barcode incorrect; acceptable treatment for tcga tumor; biospecimen identity unknown; subject withdrew consent; qualification metrics changed; inadvertently shipped; qualified in error; normal tissue origin incorrect; normal class but appears dise

In [16]:
# EV-L08: In the GDC database, list all the available disease types
def eval_L08():
    start = time.time()
    try:
        result = rest_query("cases", {
            "size": "0",
            "facets": "disease_type"
        })
        
        disease_types = []
        for bucket in result["data"]["aggregations"]["disease_type"]["buckets"]:
            disease_type = bucket["key"]
            if disease_type and disease_type != "_missing":  # Exclude _missing values
                disease_types.append(disease_type)
        
        count = len(disease_types)
        print(f"‚úÖ EV-L08: Found {count} disease types")
        print(f"Disease types: {'; '.join(sorted(disease_types))}")
        
        results["EV-L08"] = {
            "status": "success",
            "result": f"{count} disease types",
            "data": sorted(disease_types),
            "time": time.time() - start,
        }
    except Exception as e:
        print(f"‚ùå EV-L08 Failed: {e}")
        results["EV-L08"] = {"status": "error", "error": str(e), "time": time.time() - start}

eval_L08()

‚úÖ EV-L08: Found 48 disease types
Disease types: acinar cell neoplasms; acute lymphoblastic leukemia; adenomas and adenocarcinomas; adnexal and skin appendage neoplasms; basal cell neoplasms; blood vessel tumors; chronic myeloproliferative disorders; complex epithelial neoplasms; complex mixed and stromal neoplasms; cystic, mucinous and serous neoplasms; ductal and lobular neoplasms; epithelial neoplasms, nos; fibroepithelial neoplasms; fibromatous neoplasms; germ cell neoplasms; gliomas; granular cell tumors and alveolar soft part sarcomas; leukemias, nos; lipomatous neoplasms; lymphoid leukemias; malignant lymphomas, nos or diffuse; mature b-cell lymphomas; mature t- and nk-cell lymphomas; meningiomas; mesothelial neoplasms; miscellaneous bone tumors; miscellaneous tumors; mucoepidermoid neoplasms; myelodysplastic syndromes; myeloid leukemias; myomatous neoplasms; neoplasms, nos; nerve sheath tumors; neuroepitheliomatous neoplasms; nevi and melanomas; not applicable; not reported; o

In [30]:
# EV-L09: In the GDC database, list the available ethnicity categories
# Single-field enumeration (ethnicity categories)
def eval_L09():
    start = time.time()
    try:
        result = rest_query("cases", {
            "size": "0",
            "facets": "demographic.ethnicity"
        })
        
        ethnicities = []
        for bucket in result["data"]["aggregations"]["demographic.ethnicity"]["buckets"]:
            ethnicity = bucket["key"]
            if ethnicity and ethnicity != "_missing":  # Exclude _missing values
                ethnicities.append(ethnicity)
        
        count = len(ethnicities)
        print(f"‚úÖ EV-L09: Found {count} ethnicity categories")
        print(f"Ethnicities: {'; '.join(sorted(ethnicities))}")
        
        results["EV-L09"] = {
            "status": "success",
            "result": f"{count} ethnicity categories",
            "data": sorted(ethnicities),
            "time": time.time() - start
        }
    except Exception as e:
        print(f"‚ùå EV-L09 Failed: {e}")
        results["EV-L09"] = {"status": "error", "error": str(e), "time": time.time() - start}

eval_L09()

‚úÖ EV-L09: Found 4 ethnicity categories
Ethnicities: hispanic or latino; not hispanic or latino; not reported; unknown


In [18]:
# EV-L10: In the GDC database, what are the available platform types used for sequencing
# Basic metadata with counts (platform types)
def eval_L10():
    start = time.time()
    try:
        result = rest_query("files", {
            "size": "0",
            "facets": "platform"
        })
        
        platforms = []
        platform_counts = {}
        for bucket in result["data"]["aggregations"]["platform"]["buckets"]:
            platform = bucket["key"]
            count = bucket["doc_count"]
            if platform and platform != "_missing":  # Exclude _missing values
                platforms.append(platform)
                platform_counts[platform] = count
        
        total_platforms = len(platforms)
        print(f"‚úÖ EV-L10: Found {total_platforms} platform types (excluding '_missing')")
        
        # Show top platforms by count
        sorted_platforms = sorted(platform_counts.items(), key=lambda x: x[1], reverse=True)
        for platform, count in sorted_platforms[:10]:
            print(f"  {platform}: {count:,}")
        
        results["EV-L10"] = {
            "status": "success",
            "result": f"{total_platforms} platform types",
            "data": {"platforms": platforms, "counts": platform_counts},
            "time": time.time() - start,
        }
    except Exception as e:
        print(f"‚ùå EV-L10 Failed: {e}")
        results["EV-L10"] = {"status": "error", "error": str(e), "time": time.time() - start}

eval_L10()

‚úÖ EV-L10: Found 10 platform types (excluding '_missing')
  illumina: 799,977
  affymetrix snp 6.0: 147,734
  illumina human methylation 450: 31,776
  illumina methylation epic: 18,522
  illumina human methylation 27: 9,435
  rppa: 7,906
  genechip u133a: 1,243
  illumina methylation epic v2: 1,179
  complete genomics: 581
  genechip u133 plus 2.0: 183


### ENTITY FILTERING QUERIES (Medium Complexity)

These queries apply specific filtering criteria to narrow down results within one or two entity types.

In [26]:
# EV-M01: In the GDC database, count the total number of RNA-Seq files across all projects
# Single-attribute filter (experimental_strategy = RNA-Seq)
def eval_M01():
    start = time.time()
    try:
        filters = {
            "op": "=",
            "content": {
                "field": "experimental_strategy",
                "value": "RNA-Seq"
            }
        }
        
        result = rest_query("files", {
            "filters": json.dumps(filters),
            "size": "0"
        })
        
        count = result["data"]["pagination"]["total"]
        print(f"‚úÖ EV-M01: Found {count:,} RNA-Seq files")
        
        results["EV-M01"] = {
            "status": "success",
            "result": f"{count} RNA-Seq files",
            "time": time.time() - start
        }
    except Exception as e:
        print(f"‚ùå EV-M01 Failed: {e}")
        results["EV-M01"] = {"status": "error", "error": str(e), "time": time.time() - start}

eval_M01()

‚úÖ EV-M01: Found 233,388 RNA-Seq files


In [28]:
# EV-M02: In the GDC database, count male vs. female cases in TCGA-LUAD
# Cross-entity filter with faceting (project + gender)
def eval_M02():
    start = time.time()
    try:
        filters = {
            "op": "=",
            "content": {
                "field": "project.project_id",
                "value": "TCGA-LUAD"
            }
        }
        
        result = rest_query("cases", {
            "filters": json.dumps(filters),
            "size": "0",
            "facets": "demographic.gender"
        })
        
        gender_counts = {}
        for bucket in result["data"]["aggregations"]["demographic.gender"]["buckets"]:
            gender_counts[bucket["key"]] = bucket["doc_count"]
        
        females = gender_counts.get("female", 0)
        males = gender_counts.get("male", 0)
        
        print(f"‚úÖ EV-M02: TCGA-LUAD gender distribution:")
        print(f"  Females: {females}")
        print(f"  Males: {males}")
        
        results["EV-M02"] = {
            "status": "success",
            "result": f"{females} females, {males} males",
            "data": gender_counts,
            "time": time.time() - start,
        }
    except Exception as e:
        print(f"‚ùå EV-M02 Failed: {e}")
        results["EV-M02"] = {"status": "error", "error": str(e), "time": time.time() - start}

eval_M02()

‚úÖ EV-M02: TCGA-LUAD gender distribution:
  Females: 280
  Males: 242


In [32]:
# EV-M03: In the GDC database, list the top 5 diseases by case count
# Basic aggregation with sorting (disease counts)
def eval_M03():
    start = time.time()
    try:
        result = rest_query("cases", {
            "size": "0",
            "facets": "disease_type"
        })
        
        # Get disease counts and sort by count
        disease_counts = []
        for bucket in result["data"]["aggregations"]["disease_type"]["buckets"]:
            disease_counts.append((bucket["key"], bucket["doc_count"]))
        
        # Sort by count (descending) and get top 5
        top_5_diseases = sorted(disease_counts, key=lambda x: x[1], reverse=True)[:5]
        
        print(f"‚úÖ EV-M03: Top 5 diseases by case count:")
        for i, (disease, count) in enumerate(top_5_diseases, 1):
            print(f"  {i}. {disease}: {count:,} cases")
        
        results["EV-M03"] = {
            "status": "success",
            "result": f"Top 5 diseases by case count",
            "data": top_5_diseases,
            "time": time.time() - start
        }
    except Exception as e:
        print(f"‚ùå EV-M03 Failed: {e}")
        results["EV-M03"] = {"status": "error", "error": str(e), "time": time.time() - start}

eval_M03()

‚úÖ EV-M03: Top 5 diseases by case count:
  1. adenomas and adenocarcinomas: 14,809 cases
  2. ductal and lobular neoplasms: 3,715 cases
  3. myeloid leukemias: 3,638 cases
  4. epithelial neoplasms, nos: 3,397 cases
  5. squamous cell neoplasms: 3,116 cases


In [33]:
# EV-M04: In the GDC database, get the number of files linked to TARGET-AML
# Simple cross-entity filter (files linked to TARGET-AML)
def eval_M04():
    start = time.time()
    try:
        filters = {
            "op": "=",
            "content": {
                "field": "cases.project.project_id",
                "value": "TARGET-AML"
            }
        }
        
        result = rest_query("files", {
            "filters": json.dumps(filters),
            "size": "0"
        })
        
        count = result["data"]["pagination"]["total"]
        print(f"‚úÖ EV-M04: Found {count:,} files linked to TARGET-AML")
        
        results["EV-M04"] = {
            "status": "success",
            "result": f"{count} files linked to TARGET-AML",
            "time": time.time() - start,
        }
    except Exception as e:
        print(f"‚ùå EV-M04 Failed: {e}")
        results["EV-M04"] = {"status": "error", "error": str(e), "time": time.time() - start}

eval_M04()

‚úÖ EV-M04: Found 52,156 files linked to TARGET-AML


In [None]:
# EV-M05: In the GDC database, retrieve TCGA-BRCA cases diagnosed at Stage II
# Multiple AND conditions (project + stage)
def eval_M05():
    start = time.time()
    try:
        filters = {
            "op": "and",
            "content": [
                {"op": "=", "content": {"field": "project.project_id", "value": "TCGA-BRCA"}},
                {"op": "=", "content": {"field": "diagnoses.ajcc_pathologic_stage", "value": "Stage II"}}
            ]
        }
        
        result = rest_query("cases", {
            "filters": json.dumps(filters),
            "size": "0"
        })
        
        count = result["data"]["pagination"]["total"]
        print(f"‚úÖ EV-M05: Found {count} TCGA-BRCA cases diagnosed at Stage II")
        
        results["EV-M05"] = {
            "status": "success",
            "result": f"{count} TCGA-BRCA Stage II cases",
            "time": time.time() - start
        }
    except Exception as e:
        print(f"‚ùå EV-M05 Failed: {e}")
        results["EV-M05"] = {"status": "error", "error": str(e), "time": time.time() - start}

eval_M05()

‚úÖ EV-M05: Found 8 TCGA-BRCA cases diagnosed at Stage II


In [21]:
# EV-M06: In the GDC database, calculate the mean age at diagnosis for all patients in the TCGA-COAD project (Colon Adenocarcinoma). Query all 461 cases and extract the age_at_diagnosis field from each diagnosis record. Convert ages from days to years by dividing by 365.25. If a case has multiple diagnosis records, include ALL of them in the calculation. Report: (1) the mean age in years (rounded to 1 decimal), (2) the total number of diagnosis records used in the calculation, (3) the number of unique cases with valid age data, and (4) the number of cases missing age data.
# Aggregation with calculation (mean age by project)
def eval_M06():
    start = time.time()
    try:
        filters = {
            "op": "=",
            "content": {
                "field": "project.project_id",
                "value": "TCGA-COAD"
            }
        }
        
        result = rest_query("cases", {
            "filters": json.dumps(filters),
            "size": "2000",
            "fields": "diagnoses.age_at_diagnosis"
        })
        
        # Extract ages and calculate statistics
        ages = []
        unique_cases_with_age = set()
        cases_missing_age = 0
        
        for case in result["data"]["hits"]:
            case_id = case.get("case_id")
            diagnoses = case.get("diagnoses", [])
            
            has_age = False
            for diagnosis in diagnoses:
                age = diagnosis.get("age_at_diagnosis")
                if age is not None:
                    ages.append(age / 365.25)  # Convert from days to years
                    has_age = True
            
            if has_age:
                unique_cases_with_age.add(case_id)
            elif diagnoses:  # Has diagnoses but no age
                cases_missing_age += 1
        
        if ages:
            
            total_cases = result["data"]["pagination"]["total"]
            mean_age = sum(ages) / len(ages)
            cases_with_multiple = len(ages) - len(unique_cases_with_age)
            
            print(f"‚úÖ EV-M06: Mean age at diagnosis for TCGA-COAD: {mean_age:.1f} years")
            print(f"  Based on {len(ages)} diagnosis records from {len(unique_cases_with_age)} unique cases")
            print(f"  ({cases_with_multiple} extra records from cases with multiple diagnoses, {cases_missing_age} cases missing age data)")
            print(f"  Total cases in project: {total_cases}")
            
            results["EV-M06"] = {
                "status": "success",
                "result": f"{mean_age:.1f} years mean age",
                "data": {
                    "mean_age": mean_age, 
                    "diagnosis_records": len(ages),
                    "unique_cases": len(unique_cases_with_age),
                    "cases_with_multiple_diagnoses": cases_with_multiple,
                    "cases_missing_age": cases_missing_age,
                    "total_cases": total_cases
                },
                "time": time.time() - start
            }
        else:
            print(f"‚ùå EV-M06: No age data found for TCGA-COAD")
            results["EV-M06"] = {"status": "error", "error": "No age data found", "time": time.time() - start}
            
    except Exception as e:
        print(f"‚ùå EV-M06 Failed: {e}")
        results["EV-M06"] = {"status": "error", "error": str(e), "time": time.time() - start}

eval_M06()

‚úÖ EV-M06: Mean age at diagnosis for TCGA-COAD: 67.2 years
  Based on 568 diagnosis records from 1 unique cases
  (567 cases had multiple diagnoses, 4 cases missing age data)
  Total cases in project: 461


In [34]:
# EV-M07: In the GDC database, list all projects that used RNA-Seq
# Cross-entity faceting (projects using RNA-Seq)
def eval_M07():
    start = time.time()
    try:
        filters = {
            "op": "=",
            "content": {
                "field": "experimental_strategy",
                "value": "RNA-Seq"
            }
        }
        
        result = rest_query("files", {
            "filters": json.dumps(filters),
            "size": "0",
            "facets": "cases.project.project_id"
        })
        
        # Extract unique project IDs
        projects = []
        for bucket in result["data"]["aggregations"]["cases.project.project_id"]["buckets"]:
            projects.append(bucket["key"])
        
        count = len(projects)
        print(f"‚úÖ EV-M07: Found {count} projects that used RNA-Seq")
        print(f"Projects: {', '.join(sorted(projects))}")
        
        results["EV-M07"] = {
            "status": "success",
            "result": f"{count} projects used RNA-Seq",
            "data": sorted(projects),
            "time": time.time() - start
        }
    except Exception as e:
        print(f"‚ùå EV-M07 Failed: {e}")
        results["EV-M07"] = {"status": "error", "error": str(e), "time": time.time() - start}

eval_M07()

‚úÖ EV-M07: Found 83 projects that used RNA-Seq
Projects: APOLLO-LUAD, APOLLO-OV, BEATAML1.0-COHORT, CDDP_EAGLE-1, CGCI-BLGSP, CGCI-HTMCP-CC, CGCI-HTMCP-DLBCL, CGCI-HTMCP-LC, CMI-ASC, CMI-MBC, CMI-MPC, CPTAC-2, CPTAC-3, CTSP-DLBCL1, EXCEPTIONAL_RESPONDERS-ER, HCMI-CMDC, MATCH-B, MATCH-C1, MATCH-H, MATCH-I, MATCH-N, MATCH-P, MATCH-Q, MATCH-R, MATCH-S1, MATCH-S2, MATCH-U, MATCH-W, MATCH-Y, MATCH-Z1A, MATCH-Z1B, MATCH-Z1D, MATCH-Z1I, MMRF-COMMPASS, MP2PRT-ALL, MP2PRT-WT, NCICCR-DLBCL, OHSU-CNL, ORGANOID-PANCREATIC, REBC-THYR, TARGET-ALL-P1, TARGET-ALL-P2, TARGET-ALL-P3, TARGET-AML, TARGET-CCSK, TARGET-NBL, TARGET-OS, TARGET-RT, TARGET-WT, TCGA-ACC, TCGA-BLCA, TCGA-BRCA, TCGA-CESC, TCGA-CHOL, TCGA-COAD, TCGA-DLBC, TCGA-ESCA, TCGA-GBM, TCGA-HNSC, TCGA-KICH, TCGA-KIRC, TCGA-KIRP, TCGA-LAML, TCGA-LGG, TCGA-LIHC, TCGA-LUAD, TCGA-LUSC, TCGA-MESO, TCGA-OV, TCGA-PAAD, TCGA-PCPG, TCGA-PRAD, TCGA-READ, TCGA-SARC, TCGA-SKCM, TCGA-STAD, TCGA-TGCT, TCGA-THCA, TCGA-THYM, TCGA-UCEC, TCGA-UCS, TCGA-UVM, 

In [None]:
# EV-M08: In the GDC database, get me the race distribution for TCGA-LIHC
# Project-specific demographic distribution
def eval_M08():
    start = time.time()
    try:
        filters = {
            "op": "=",
            "content": {
                "field": "project.project_id",
                "value": "TCGA-LIHC"
            }
        }
        
        result = rest_query("cases", {
            "filters": json.dumps(filters),
            "size": "0",
            "facets": "demographic.race"
        })
        
        # Get race distribution
        race_counts = {}
        total_cases = 0
        for bucket in result["data"]["aggregations"]["demographic.race"]["buckets"]:
            race = bucket["key"]
            count = bucket["doc_count"]
            race_counts[race] = count
            total_cases += count
        
        print(f"‚úÖ EV-M08: Race distribution for TCGA-LIHC ({total_cases} total cases):")
        for race, count in sorted(race_counts.items(), key=lambda x: x[1], reverse=True):
            percentage = (count / total_cases) * 100 if total_cases > 0 else 0
            print(f"  {race}: {count} ({percentage:.2f}%)")
        
        results["EV-M08"] = {
            "status": "success",
            "result": f"Race distribution for TCGA-LIHC",
            "data": race_counts,
            "time": time.time() - start
        }
    except Exception as e:
        print(f"‚ùå EV-M08 Failed: {e}")
        results["EV-M08"] = {"status": "error", "error": str(e), "time": time.time() - start}

eval_M08()

‚úÖ EV-M08: Race distribution for TCGA-LIHC (377 total cases):
  white: 187 (49.60%)
  asian: 161 (42.71%)
  black or african american: 17 (4.51%)
  not reported: 6 (1.59%)
  unknown: 4 (1.06%)
  american indian or alaska native: 2 (0.53%)


In [None]:
# EV-M09: In the GDC database, count the total number of files that meet ALL three of these criteria: (1) are associated with cases from the TCGA-GBM project (Glioblastoma Multiforme, cases.project.project_id = "TCGA-GBM"), (2) were generated using the Whole Genome Sequencing experimental strategy (experimental_strategy = "WGS"), AND (3) have a file size greater than 50 GB (file_size > 53,687,091,200 bytes). Return the total count of files matching all criteria.
# Multiple filters with range (project + strategy + file_size)
def eval_M09():
    start = time.time()
    try:
        filters = {
            "op": "and",
            "content": [
                {"op": "=", "content": {"field": "cases.project.project_id", "value": "TCGA-GBM"}},
                {"op": "=", "content": {"field": "experimental_strategy", "value": "WGS"}},
                {"op": ">", "content": {"field": "file_size", "value": 53687091200}}  # 50 GB in bytes
            ]
        }

        result = rest_query("files", {
            "filters": json.dumps(filters),
            "size": "0"
        })
        
        count = result["data"]["pagination"]["total"]
        print(f"‚úÖ EV-M09: Found {count} WGS files > 50GB for TCGA-GBM")
        
        results["EV-M09"] = {
            "status": "success",
            "result": f"{count} WGS files > 50GB",
            "time": time.time() - start
        }
    except Exception as e:
        print(f"‚ùå EV-M09 Failed: {e}")
        results["EV-M09"] = {"status": "error", "error": str(e), "time": time.time() - start}

eval_M09()

‚úÖ EV-M09: Found 798 WGS files > 50GB for TCGA-GBM


In [None]:
# EV-M10: In the GDC database, count the number of cases from the TCGA-OV project (Ovarian Serous Cystadenocarcinoma, project.project_id = "TCGA-OV") where the patient died within less than 1000 days after their initial diagnosis or study enrollment (demographic.days_to_death < 1000). Return the total count of cases meeting both criteria.
# Range-based filtering (project + days_to_death < 1000)
def eval_M10():
    start = time.time()
    try:
        filters = {
            "op": "and",
            "content": [
                {"op": "=", "content": {"field": "project.project_id", "value": "TCGA-OV"}},
                {"op": "<", "content": {"field": "demographic.days_to_death", "value": 1000}}
            ]
        }
        
        result = rest_query("cases", {
            "filters": json.dumps(filters),
            "size": "0"
        })
        
        count = result["data"]["pagination"]["total"]
        print(f"‚úÖ EV-M10: Found {count} TCGA-OV cases with days_to_death < 1000")
        
        results["EV-M10"] = {
            "status": "success",
            "result": f"{count} TCGA-OV cases with days_to_death < 1000",
            "time": time.time() - start
        }
    except Exception as e:
        print(f"‚ùå EV-M10 Failed: {e}")
        results["EV-M10"] = {"status": "error", "error": str(e), "time": time.time() - start}

eval_M10()

‚úÖ EV-M10: Found 154 TCGA-OV cases with days_to_death < 1000


### COMPLEX COHORTS QUERIES (High Complexity)

These queries require multi-step reasoning, multiple entity relationships, or sophisticated filtering to define patient/sample cohorts.

In [6]:
# EV-H01: In the GDC database, list cases that have both WXS and RNA-Seq files
# Multi-entity intersection (cases with both WXS AND RNA-Seq)
def eval_H01():
    start = time.time()
    try:
        print("üöÄ OPTIMIZED APPROACH: Direct filtering for efficiency!")
        print("=" * 60)
        
        # Step 1: Get cases with WXS OR RNA-Seq (much smaller dataset)
        print("üìä Fetching cases with WXS OR RNA-Seq files...")
        
        filters = {
            "op": "or",
            "content": [
                {"op": "=", "content": {"field": "files.experimental_strategy", "value": "WXS"}},
                {"op": "=", "content": {"field": "files.experimental_strategy", "value": "RNA-Seq"}}
            ]
        }
        
        # First get total count
        print("  Getting total count...")
        count_result = rest_query("cases", {
            "filters": json.dumps(filters),
            "size": "0"
        })
        total_relevant = count_result["data"]["pagination"]["total"]
        print(f"  üìà Total cases with WXS OR RNA-Seq: {total_relevant:,}")
        
        # Step 2: Process these cases to find intersection
        print("üîç Processing cases to find those with BOTH strategies...")
        
        cases_with_both = []
        project_counts = {}
        wxs_cases = set()
        rnaseq_cases = set()
        
        # Fetch cases in smaller, manageable batches
        size = 1000  # Reduced batch size for better progress reporting
        from_idx = 0
        processed = 0
        batch_num = 1
        
        while from_idx < total_relevant:
            print(f"    üì¶ Processing batch {batch_num} (starting at {from_idx:,})...")
            
            try:
                result = rest_query("cases", {
                    "filters": json.dumps(filters),
                    "size": str(size),
                    "from": str(from_idx),
                    "fields": "submitter_id,case_id,project.project_id,files.experimental_strategy"
                })
                
                batch_cases = result["data"]["hits"]
                if not batch_cases:
                    print(f"    ‚ö†Ô∏è  No more cases returned, stopping at {from_idx}")
                    break
                
                print(f"    üìã Got {len(batch_cases)} cases in this batch")
                
                for case in batch_cases:
                    processed += 1
                    
                    # Extract experimental strategies from files
                    files = case.get("files", [])
                    case_strategies = set()
                    
                    for file_info in files:
                        strategy = file_info.get("experimental_strategy")
                        if strategy in ["WXS", "RNA-Seq"]:
                            case_strategies.add(strategy)
                    
                    # Track cases by strategy
                    case_id = case["submitter_id"]
                    if "WXS" in case_strategies:
                        wxs_cases.add(case_id)
                    
                    if "RNA-Seq" in case_strategies:
                        rnaseq_cases.add(case_id)
                    
                    # Check if case has BOTH strategies
                    if "WXS" in case_strategies and "RNA-Seq" in case_strategies:
                        case_info = {
                            "submitter_id": case["submitter_id"],
                            "case_id": case["case_id"],
                            "project": case.get("project", {}).get("project_id", "Unknown"),
                            "strategies": list(case_strategies)
                        }
                        cases_with_both.append(case_info)
                        
                        project = case_info["project"]
                        project_counts[project] = project_counts.get(project, 0) + 1
                
                # Update progress
                progress = (processed / total_relevant) * 100
                print(f"    ‚úÖ Progress: {processed:,}/{total_relevant:,} cases ({progress:.1f}%)")
                print(f"       Found {len(cases_with_both)} cases with both strategies so far")
                
                from_idx += len(batch_cases)
                batch_num += 1
                
                # Safety check: if we got fewer results than requested, we're at the end
                if len(batch_cases) < size:
                    print(f"    üèÅ Reached end of data (got {len(batch_cases)} < {size})")
                    break
                
                # Safety check: prevent infinite loops
                if batch_num > 50:  # Maximum 50 batches = 50,000 cases max
                    print(f"    ‚ö†Ô∏è  Safety limit reached at batch {batch_num}, stopping")
                    break
                    
            except Exception as batch_error:
                print(f"    ‚ùå Error in batch {batch_num}: {batch_error}")
                break
        
        # Calculate final statistics
        count = len(cases_with_both)
        wxs_total = len(wxs_cases)
        rnaseq_total = len(rnaseq_cases)
        
        print(f"\nüéØ FINAL RESULTS")
        print("=" * 60)
        print(f"‚úÖ Cases with WXS files: {wxs_total:,}")
        print(f"‚úÖ Cases with RNA-Seq files: {rnaseq_total:,}")
        print(f"‚úÖ Cases with BOTH WXS and RNA-Seq: {count:,}")
        print(f"üìä Total relevant cases processed: {processed:,}")
        
        if project_counts:
            print(f"\nüèÜ Top 10 projects with both WXS and RNA-Seq:")
            sorted_projects = sorted(project_counts.items(), key=lambda x: x[1], reverse=True)
            for i, (project, proj_count) in enumerate(sorted_projects[:10], 1):
                percentage = (proj_count / count) * 100 if count > 0 else 0
                print(f"  {i:2d}. {project}: {proj_count:,} cases ({percentage:.1f}%)")
        
        if cases_with_both:
            print(f"\nüìã Sample case IDs:")
            for i, case_info in enumerate(cases_with_both[:5], 1):
                print(f"  {i}. {case_info['submitter_id']} ({case_info['project']})")
        
        results["EV-H01"] = {
            "status": "success",
            "result": f"{count} cases with both WXS and RNA-Seq",
            "data": {
                "both_count": count,
                "wxs_count": wxs_total,
                "rnaseq_count": rnaseq_total,
                "project_counts": project_counts,
                "sample_cases": [c["submitter_id"] for c in cases_with_both[:10]],
                "total_relevant_processed": processed
            },
            "time": time.time() - start
        }
    except Exception as e:
        print(f"‚ùå EV-H01 Failed: {e}")
        results["EV-H01"] = {"status": "error", "error": str(e), "time": time.time() - start}

eval_H01()

üöÄ OPTIMIZED APPROACH: Direct filtering for efficiency!
üìä Fetching cases with WXS OR RNA-Seq files...
  Getting total count...
  üìà Total cases with WXS OR RNA-Seq: 27,151
üîç Processing cases to find those with BOTH strategies...
    üì¶ Processing batch 1 (starting at 0)...
  üìà Total cases with WXS OR RNA-Seq: 27,151
üîç Processing cases to find those with BOTH strategies...
    üì¶ Processing batch 1 (starting at 0)...
    üìã Got 1000 cases in this batch
    ‚úÖ Progress: 1,000/27,151 cases (3.7%)
       Found 777 cases with both strategies so far
    üì¶ Processing batch 2 (starting at 1,000)...
    üìã Got 1000 cases in this batch
    ‚úÖ Progress: 2,000/27,151 cases (7.4%)
       Found 1684 cases with both strategies so far
    üì¶ Processing batch 3 (starting at 2,000)...
    üìã Got 1000 cases in this batch
    ‚úÖ Progress: 3,000/27,151 cases (11.0%)
       Found 2475 cases with both strategies so far
    üì¶ Processing batch 4 (starting at 3,000)...
    

In [7]:
# EV-H02: In the GDC database, show the distribution of years of smoking for TCGA-LUSC
# Domain-specific analysis (smoking duration distribution)
def eval_H02():
    start = time.time()
    try:
        # Use GraphQL to get exposure data
        query = """
        query LUSCSmokingData($filters: FiltersArgument) {
          viewer {
            repository {
              cases {
                hits(first: 600, filters: $filters) {
                  edges {
                    node {
                      case_id
                      exposures {
                        hits {
                          edges {
                            node {
                              tobacco_smoking_onset_year
                              tobacco_smoking_quit_year
                            }
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
        """
        
        variables = {
            "filters": {
                "op": "=",
                "content": {
                    "field": "project.project_id",
                    "value": "TCGA-LUSC"
                }
            }
        }
        
        result = graphql_query(query, variables)
        if result:
            # Process exposure data
            smoking_years = []
            cases_processed = 0
            
            for case_edge in result["data"]["viewer"]["repository"]["cases"]["hits"]["edges"]:
                case_node = case_edge.get("node", {})
                exposures = case_node.get("exposures", {}).get("hits", {}).get("edges", [])
                
                for exp_edge in exposures:
                    exp = exp_edge.get("node", {})
                    onset = exp.get("tobacco_smoking_onset_year")
                    quit = exp.get("tobacco_smoking_quit_year")
                    
                    if onset and quit:
                        years = quit - onset
                        if years > 0:
                            smoking_years.append(years)
                
                cases_processed += 1
            
            print(f"‚úÖ EV-H02: Processed {cases_processed} TCGA-LUSC cases")
            print(f"  Found {len(smoking_years)} valid smoking duration records")
            
            if smoking_years:
                # Create distribution bins
                import numpy as np
                bins = [0, 10, 20, 30, 40, 50, 100]
                hist, _ = np.histogram(smoking_years, bins=bins)
                
                print(f"  Smoking years distribution:")
                for i in range(len(bins)-1):
                    print(f"    {bins[i]}-{bins[i+1]} years: {hist[i]} cases")
            
            results["EV-H02"] = {
                "status": "success",
                "result": f"Years of smoking distribution for TCGA-LUSC",
                "data": {"smoking_years": smoking_years, "cases_processed": cases_processed},
                "time": time.time() - start
            }
        else:
            results["EV-H02"] = {"status": "error", "error": "GraphQL query failed", "time": time.time() - start}
            
    except Exception as e:
        print(f"‚ùå EV-H02 Failed: {e}")
        results["EV-H02"] = {"status": "error", "error": str(e), "time": time.time() - start}

eval_H02()

‚úÖ EV-H02: Processed 504 TCGA-LUSC cases
  Found 224 valid smoking duration records
  Smoking years distribution:
    0-10 years: 2 cases
    10-20 years: 7 cases
    20-30 years: 34 cases
    30-40 years: 46 cases
    40-50 years: 69 cases
    50-100 years: 66 cases


In [9]:
# EV-H03: In the GDC database, count cases that meet ALL of these criteria: (1) primary site is breast cancer, (2) gender is female, (3) age at diagnosis is less than 40 years old, AND (4) have RNA-Seq experimental strategy files available
# Multi-dimensional cohort (breast cancer + female + age<40 + RNA-Seq)
def eval_H03():
    start = time.time()
    try:
        filters = {
            "op": "and",
            "content": [
                {"op": "in", "content": {"field": "primary_site", "value": ["Breast"]}},
                {"op": "=", "content": {"field": "demographic.gender", "value": "female"}},
                {"op": "<", "content": {"field": "diagnoses.age_at_diagnosis", "value": 14600}},  # 40 years in days
                {"op": "=", "content": {"field": "files.experimental_strategy", "value": "RNA-Seq"}}
            ]
        }
        
        result = rest_query("cases", {
            "filters": json.dumps(filters),
            "size": "0"
        })
        
        count = result["data"]["pagination"]["total"]
        print(f"‚úÖ EV-H03: Found {count} breast cancer female cases under 40 with RNA-Seq files")
        
        results["EV-H03"] = {
            "status": "success",
            "result": f"{count} breast cancer female cases under 40 with RNA-Seq",
            "time": time.time() - start
        }
    except Exception as e:
        print(f"‚ùå EV-H03 Failed: {e}")
        results["EV-H03"] = {"status": "error", "error": str(e), "time": time.time() - start}

eval_H03()

‚úÖ EV-H03: Found 144 breast cancer female cases under 40 with RNA-Seq files


In [13]:
# EV-H04: In the GDC database, retrieve case IDs and their associated file IDs for patients that meet BOTH criteria: (1) belong to the TCGA-LUAD project (Lung Adenocarcinoma), AND (2) have a diagnosis with AJCC pathologic stage equal to Stage III
# Complex relationship mapping (stage-specific cases with file associations) 
def eval_H04():
    start = time.time()
    try:
        filters = {
            "op": "and",
            "content": [
                {"op": "=", "content": {"field": "project.project_id", "value": "TCGA-LUAD"}},
                {"op": "=", "content": {"field": "diagnoses.ajcc_pathologic_stage", "value": "Stage III"}}
            ]
        }
        
        # Get cases with Stage III LUAD
        cases_result = rest_query("cases", {
            "filters": json.dumps(filters),
            "size": "1000",
            "fields": "submitter_id,case_id,files.file_id"
        })
        
        stage_iii_cases = []
        all_file_ids = []
        
        for case in cases_result["data"]["hits"]:
            case_files = []
            files = case.get("files", [])
            
            for file_info in files:
                file_id = file_info.get("file_id")
                if file_id:
                    case_files.append(file_id)
                    all_file_ids.append(file_id)
            
            case_info = {
                "case_id": case["case_id"],
                "submitter_id": case["submitter_id"],
                "file_ids": case_files,
                "file_count": len(case_files)
            }
            stage_iii_cases.append(case_info)
        
        cases_count = len(stage_iii_cases)
        total_files = len(all_file_ids)
        
        print(f"‚úÖ EV-H04: Found {cases_count} LUAD Stage III cases with {total_files} files")
        
        if stage_iii_cases:
            print(f"  Sample cases:")
            for case in stage_iii_cases[:3]:
                print(f"    Case {case['submitter_id']}: {case['file_count']} files")
                if case['file_ids']:
                    print(f"      Sample file IDs: {case['file_ids'][:3]}")
        
        results["EV-H04"] = {
            "status": "success",
            "result": f"{cases_count} LUAD Stage III cases with {total_files} files",
            "data": {"cases": stage_iii_cases, "total_files": total_files},
            "time": time.time() - start
        }
    except Exception as e:
        print(f"‚ùå EV-H04 Failed: {e}")
        results["EV-H04"] = {"status": "error", "error": str(e), "time": time.time() - start}

eval_H04()

‚úÖ EV-H04: Found 1 LUAD Stage III cases with 77 files
  Sample cases:
    Case TCGA-95-7947: 77 files
      Sample file IDs: ['946f8d23-a5be-4193-8ebe-6e699f9ddace', '9ff005f2-ee8c-4a8a-bbb6-ccfea30a1cba', '606d4578-d204-4638-a50c-7d8f99836f73']


In [6]:
# EV-H05: In the GDC database, count cases that have files with BOTH data categories: (1) Copy Number Variation data, AND (2) Simple Nucleotide Variation data (somatic mutations)
# Multi-entity intersection (cases with both CNV AND SSM data types)
def eval_H05():
    start = time.time()
    try:
        print("üöÄ OPTIMIZED APPROACH: Efficient pagination with early termination!")
        print("=" * 60)
        
        def fetch_case_ids(data_category, max_cases=20000):
            """Fetch case IDs for a given data category with efficient pagination"""
            filters = {
                "op": "=",
                "content": {
                    "field": "files.data_category",
                    "value": data_category
                }
            }
            
            # Get total count first
            count_result = rest_query("cases", {
                "filters": json.dumps(filters),
                "size": "0"
            })
            total = count_result["data"]["pagination"]["total"]
            
            case_ids = set()
            size = 1000
            from_idx = 0
            batches_fetched = 0
            max_batches = min(20, (max_cases // size) + 1)  # Limit batches
            
            while from_idx < total and batches_fetched < max_batches:
                result = rest_query("cases", {
                    "filters": json.dumps(filters),
                    "size": str(size),
                    "from": str(from_idx),
                    "fields": "submitter_id"
                })
                
                batch = result["data"]["hits"]
                if not batch:
                    break
                
                for case in batch:
                    case_ids.add(case["submitter_id"])
                
                from_idx += len(batch)
                batches_fetched += 1
                
                if len(batch) < size:
                    break
            
            return case_ids, total
        
        # Step 1: Get CNV cases
        print("üìä Fetching CNV cases...")
        cnv_cases, cnv_total = fetch_case_ids("Copy Number Variation")
        print(f"  ‚úÖ Total CNV cases: {cnv_total:,}")
        print(f"  ‚úÖ Fetched {len(cnv_cases):,} unique CNV case IDs")
        
        # Step 2: Get SSM cases
        print("\nüìä Fetching SSM cases...")
        ssm_cases, ssm_total = fetch_case_ids("Simple Nucleotide Variation")
        print(f"  ‚úÖ Total SSM cases: {ssm_total:,}")
        print(f"  ‚úÖ Fetched {len(ssm_cases):,} unique SSM case IDs")
        
        # Step 3: Find intersection
        print("\nüîç Computing intersection...")
        both_cases = cnv_cases.intersection(ssm_cases)
        both_count = len(both_cases)
        print(f"  ‚úÖ Found {both_count:,} cases with BOTH CNV and SSM")
        
        # Step 4: Get project distribution
        print("\nüìä Analyzing project distribution...")
        
        project_counts = {}
        sample_cases = []
        
        if both_cases:
            # Fetch project info for cases with both
            sample_ids = list(both_cases)[:500]  # Get up to 500 for good distribution
            
            both_filters = {
                "op": "in",
                "content": {
                    "field": "submitter_id",
                    "value": sample_ids
                }
            }
            
            result = rest_query("cases", {
                "filters": json.dumps(both_filters),
                "size": "500",
                "fields": "submitter_id,project.project_id"
            })
            
            for case in result["data"]["hits"]:
                case_id = case["submitter_id"]
                project = case.get("project", {}).get("project_id", "Unknown")
                sample_cases.append(case_id)
                project_counts[project] = project_counts.get(project, 0) + 1
        
        # Print results
        print(f"\nüéØ FINAL RESULTS")
        print("=" * 60)
        print(f"‚úÖ Cases with Copy Number Variation: {cnv_total:,} (sampled {len(cnv_cases):,})")
        print(f"‚úÖ Cases with Simple Nucleotide Variation: {ssm_total:,} (sampled {len(ssm_cases):,})")
        print(f"‚úÖ Cases with BOTH CNV and SSM: {both_count:,}")
        
        if both_count > 0:
            coverage = (both_count / min(len(cnv_cases), len(ssm_cases))) * 100
            print(f"üìä Intersection rate: {coverage:.1f}% (based on sampled data)")
        
        if project_counts:
            print(f"\nüèÜ Projects with both CNV and SSM:")
            sorted_projects = sorted(project_counts.items(), key=lambda x: x[1], reverse=True)
            for i, (project, count) in enumerate(sorted_projects, 1):
                percentage = (count / len(sample_cases)) * 100 if sample_cases else 0
                print(f"  {i:2d}. {project}: {count:,} cases ({percentage:.1f}%)")
        
        if sample_cases:
            print(f"\nüìã Sample case IDs:")
            for i, case_id in enumerate(sample_cases[:10], 1):
                print(f"  {i}. {case_id}")
        
        results["EV-H05"] = {
            "status": "success",
            "result": f"{both_count} cases with both CNV and SSM data",
            "data": {
                "both_count": both_count,
                "cnv_count": cnv_total,
                "ssm_count": ssm_total,
                "cnv_sampled": len(cnv_cases),
                "ssm_sampled": len(ssm_cases),
                "project_counts": project_counts,
                "sample_cases": sample_cases[:10]
            },
            "time": time.time() - start
        }
    except Exception as e:
        print(f"‚ùå EV-H05 Failed: {e}")
        import traceback
        traceback.print_exc()
        results["EV-H05"] = {"status": "error", "error": str(e), "time": time.time() - start}

eval_H05()


üöÄ OPTIMIZED APPROACH: Efficient pagination with early termination!
üìä Fetching CNV cases...
  ‚úÖ Total CNV cases: 17,751
  ‚úÖ Fetched 17,751 unique CNV case IDs

üìä Fetching SSM cases...
  ‚úÖ Total CNV cases: 17,751
  ‚úÖ Fetched 17,751 unique CNV case IDs

üìä Fetching SSM cases...
  ‚úÖ Total SSM cases: 40,661
  ‚úÖ Fetched 20,000 unique SSM case IDs

üîç Computing intersection...
  ‚úÖ Found 1,926 cases with BOTH CNV and SSM

üìä Analyzing project distribution...
  ‚úÖ Total SSM cases: 40,661
  ‚úÖ Fetched 20,000 unique SSM case IDs

üîç Computing intersection...
  ‚úÖ Found 1,926 cases with BOTH CNV and SSM

üìä Analyzing project distribution...
‚ùå EV-H05 Failed: 414 Client Error: Request-URI Too Long for url: https://api.gdc.cancer.gov/cases?filters=%7B%22op%22%3A+%22in%22%2C+%22content%22%3A+%7B%22field%22%3A+%22submitter_id%22%2C+%22value%22%3A+%5B%22C3N-04176%22%2C+%22TCGA-75-5122%22%2C+%22TCGA-GR-7353%22%2C+%22TCGA-AC-A3YJ%22%2C+%22C3N-03889%22%2C+%22TCGA-D8-A1

Traceback (most recent call last):
  File "/var/folders/w4/x3zzz2j920s10y1d68dj_nbm0000gn/T/ipykernel_65978/3285408988.py", line 91, in eval_H05
    result = rest_query("cases", {
  File "/var/folders/w4/x3zzz2j920s10y1d68dj_nbm0000gn/T/ipykernel_65978/2141473674.py", line 35, in rest_query
    response.raise_for_status()
  File "/Users/mani/work/ai-agent-evaluation/.venv/lib/python3.10/site-packages/requests/models.py", line 1026, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 414 Client Error: Request-URI Too Long for url: https://api.gdc.cancer.gov/cases?filters=%7B%22op%22%3A+%22in%22%2C+%22content%22%3A+%7B%22field%22%3A+%22submitter_id%22%2C+%22value%22%3A+%5B%22C3N-04176%22%2C+%22TCGA-75-5122%22%2C+%22TCGA-GR-7353%22%2C+%22TCGA-AC-A3YJ%22%2C+%22C3N-03889%22%2C+%22TCGA-D8-A1X9%22%2C+%22TCGA-A8-A08Z%22%2C+%22TCGA-78-7149%22%2C+%22TCGA-A2-A04X%22%2C+%22TCGA-97-8176%22%2C+%22TCGA-FF-A7CR%22%2C+%22TCGA-AC-A3W6%22%2C+%22TCGA-A2-A0

In [8]:
# EV-H06: In the GDC database, find and count case IDs for patients who meet BOTH of these criteria: (1) have a documented alcohol history (exposures.alcohol_history = "Yes"), AND (2) have been diagnosed with AJCC pathologic stage equal to "Stage II". Return the total count of matching cases along with their case identifiers.
# Complex cohort definition combining exposure history and clinical staging
def eval_H06():
    start = time.time()
    try:
        print("üîç Querying cases with alcohol history AND AJCC Stage II...")
        print("=" * 60)
        
        filters = {
            "op": "and",
            "content": [
                {"op": "=", "content": {"field": "exposures.alcohol_history", "value": "Yes"}},
                {"op": "=", "content": {"field": "diagnoses.ajcc_pathologic_stage", "value": "Stage II"}}
            ]
        }
        
        # First get total count
        count_result = rest_query("cases", {
            "filters": json.dumps(filters),
            "size": "0"
        })
        total_count = count_result["data"]["pagination"]["total"]
        print(f"üìä Total cases found: {total_count}")
        
        # Fetch case details with project and demographic info
        result = rest_query("cases", {
            "filters": json.dumps(filters),
            "size": "1000",
            "fields": "submitter_id,case_id,project.project_id,demographic.gender,demographic.ethnicity,primary_site"
        })
        
        # Process results
        case_ids = []
        project_counts = {}
        gender_counts = {}
        primary_site_counts = {}
        
        for case in result["data"]["hits"]:
            # Extract case info
            case_info = {
                "case_id": case["case_id"],
                "submitter_id": case["submitter_id"],
                "project": case.get("project", {}).get("project_id", "Unknown")
            }
            
            # Track demographics
            demographic = case.get("demographic", {})
            if isinstance(demographic, list) and demographic:
                gender = demographic[0].get("gender", "Unknown")
            elif isinstance(demographic, dict):
                gender = demographic.get("gender", "Unknown")
            else:
                gender = "Unknown"
            
            case_info["gender"] = gender
            gender_counts[gender] = gender_counts.get(gender, 0) + 1
            
            # Track primary site
            primary_site = case.get("primary_site")
            if isinstance(primary_site, list) and primary_site:
                site = primary_site[0]
            else:
                site = primary_site if primary_site else "Unknown"
            
            case_info["primary_site"] = site
            primary_site_counts[site] = primary_site_counts.get(site, 0) + 1
            
            case_ids.append(case_info)
            
            # Count by project
            project = case_info["project"]
            project_counts[project] = project_counts.get(project, 0) + 1
        
        count = len(case_ids)
        
        # Print results
        print(f"\n‚úÖ EV-H06: Found {count} cases with alcohol history AND AJCC Stage II")
        
        if project_counts:
            print(f"\nüè• Cases by project:")
            sorted_projects = sorted(project_counts.items(), key=lambda x: x[1], reverse=True)
            for project, proj_count in sorted_projects:
                print(f"  {project}: {proj_count} cases")
        
        if gender_counts:
            print(f"\nüë§ Gender distribution:")
            for gender, g_count in sorted(gender_counts.items(), key=lambda x: x[1], reverse=True):
                percentage = (g_count / count) * 100 if count > 0 else 0
                print(f"  {gender}: {g_count} ({percentage:.1f}%)")
        
        if primary_site_counts:
            print(f"\nüéØ Top primary sites:")
            sorted_sites = sorted(primary_site_counts.items(), key=lambda x: x[1], reverse=True)
            for site, site_count in sorted_sites[:5]:
                percentage = (site_count / count) * 100 if count > 0 else 0
                print(f"  {site}: {site_count} ({percentage:.1f}%)")
        
        if count > 0:
            print(f"\nüìã Sample case IDs:")
            for i, case in enumerate(case_ids[:5], 1):
                print(f"  {i}. {case['submitter_id']} ({case['project']}, {case['gender']}, {case['primary_site']})")
        
        results["EV-H06"] = {
            "status": "success",
            "result": f"{count} cases with alcohol history AND AJCC Stage II",
            "data": {
                "total_count": total_count,
                "cases": case_ids,
                "project_counts": project_counts,
                "gender_counts": gender_counts,
                "primary_site_counts": primary_site_counts
            },
            "time": time.time() - start
        }
    except Exception as e:
        print(f"‚ùå EV-H06 Failed: {e}")
        import traceback
        traceback.print_exc()
        results["EV-H06"] = {"status": "error", "error": str(e), "time": time.time() - start}

eval_H06()


üîç Querying cases with alcohol history AND AJCC Stage II...
üìä Total cases found: 56
üìä Total cases found: 56

‚úÖ EV-H06: Found 56 cases with alcohol history AND AJCC Stage II

üè• Cases by project:
  CPTAC-3: 31 cases
  TCGA-HNSC: 24 cases
  TCGA-ESCA: 1 cases

üë§ Gender distribution:
  male: 43 (76.8%)
  female: 13 (23.2%)

üéØ Top primary sites:
  Other and unspecified parts of tongue: 10 (17.9%)
  Kidney: 9 (16.1%)
  Larynx: 9 (16.1%)
  Floor of mouth: 9 (16.1%)
  Uterus, NOS: 5 (8.9%)

üìã Sample case IDs:
  1. C3N-03619 (CPTAC-3, male, Other and unspecified parts of tongue)
  2. C3L-02669 (CPTAC-3, male, Bronchus and lung)
  3. C3L-00981 (CPTAC-3, male, Kidney)
  4. C3L-04354 (CPTAC-3, male, Larynx)
  5. C3N-02296 (CPTAC-3, female, Uterus, NOS)

‚úÖ EV-H06: Found 56 cases with alcohol history AND AJCC Stage II

üè• Cases by project:
  CPTAC-3: 31 cases
  TCGA-HNSC: 24 cases
  TCGA-ESCA: 1 cases

üë§ Gender distribution:
  male: 43 (76.8%)
  female: 13 (23.2%)

üéØ 

In [9]:
# EV-H07: In the GDC database, count the total number of files associated with cases that meet BOTH of these criteria: (1) the patient's cause of death (demographic.cause_of_death field) is classified as "Cancer Related", AND (2) the patient died at age 50 years or younger (demographic.days_to_death ‚â§ 18,250 days, which equals 50 years). Return both the count of matching cases and the total count of all files linked to those cases.
def eval_H07():
    start = time.time()
    try:
        filters = {
            "op": "and",
            "content": [
                {"op": "=", "content": {"field": "demographic.cause_of_death", "value": "Cancer Related"}},
                {"op": "<=", "content": {"field": "demographic.days_to_death", "value": 18250}}  # 50 years in days
            ]
        }
        
        # First get the cases
        cases_result = rest_query("cases", {
            "filters": json.dumps(filters),
            "size": "0"
        })
        
        cases_count = cases_result["data"]["pagination"]["total"]
        
        # Now get files for these cases
        files_result = rest_query("files", {
            "filters": json.dumps({
                "op": "and",
                "content": [
                    {"op": "=", "content": {"field": "cases.demographic.cause_of_death", "value": "Cancer Related"}},
                    {"op": "<=", "content": {"field": "cases.demographic.days_to_death", "value": 18250}}
                ]
            }),
            "size": "0"
        })
        
        files_count = files_result["data"]["pagination"]["total"]
        
        print(f"‚úÖ EV-H07: Found {files_count} files for {cases_count} cases")
        print(f"  Cases: Cancer-related deaths ‚â§50 years old")
        
        results["EV-H07"] = {
            "status": "success",
            "result": f"{files_count} files for cancer-related deaths ‚â§50 years",
            "data": {"cases_count": cases_count, "files_count": files_count},
            "time": time.time() - start
        }
    except Exception as e:
        print(f"‚ùå EV-H07 Failed: {e}")
        results["EV-H07"] = {"status": "error", "error": str(e), "time": time.time() - start}

eval_H07()

‚úÖ EV-H07: Found 62935 files for 1188 cases
  Cases: Cancer-related deaths ‚â§50 years old


In [10]:
# EV-H08: In the GDC database, retrieve all cases from the TCGA-COAD project (Colon Adenocarcinoma, project.project_id = "TCGA-COAD") and generate a cross-tabulation (joint distribution) showing how cases are distributed across the combination of two demographic variables: (1) gender (demographic.gender field), and (2) primary site (project.primary_site field). Present the results as a two-way frequency table with counts and percentages for each gender-by-primary_site combination.
def eval_H08():
    start = time.time()
    try:
        filters = {
            "op": "=",
            "content": {
                "field": "project.project_id",
                "value": "TCGA-COAD"
            }
        }
        
        # Get detailed case data with gender and primary site information
        result = rest_query("cases", {
            "filters": json.dumps(filters),
            "size": "2000",
            "fields": "submitter_id,demographic.gender,project.primary_site"
        })
        
        # Process the data to create joint distribution
        joint_distribution = {}
        gender_totals = {}
        site_totals = {}
        total_cases = 0
        
        for case in result["data"]["hits"]:
            # Extract gender - handle both dict and list formats
            demographic = case.get("demographic", {})
            gender = "Unknown"
            
            if isinstance(demographic, list) and demographic:
                gender = demographic[0].get("gender", "Unknown")
            elif isinstance(demographic, dict):
                gender = demographic.get("gender", "Unknown")
            
            # Extract primary site
            project = case.get("project", {})
            primary_sites = project.get("primary_site", [])
            
            # Handle primary site (could be list or single value)
            if isinstance(primary_sites, list) and primary_sites:
                # Take the first primary site for simplicity
                primary_site = primary_sites[0]
            elif isinstance(primary_sites, str):
                primary_site = primary_sites
            else:
                primary_site = "Unknown"
            
            # Skip missing values
            if gender == "_missing":
                gender = "Unknown"
            if primary_site == "_missing":
                primary_site = "Unknown"
            
            # Update joint distribution
            key = (gender, primary_site)
            joint_distribution[key] = joint_distribution.get(key, 0) + 1
            
            # Update marginal totals
            gender_totals[gender] = gender_totals.get(gender, 0) + 1
            site_totals[primary_site] = site_totals.get(primary_site, 0) + 1
            total_cases += 1
        
        # Create a formatted cross-tabulation table
        print(f"‚úÖ EV-H08: Joint distribution (gender √ó primary_site) for TCGA-COAD")
        print(f"  Total cases analyzed: {total_cases}")
        print()
        
        # Get unique genders and sites for table structure
        genders = sorted(gender_totals.keys())
        sites = sorted(site_totals.keys())
        
        # Print cross-tabulation table
        print("üìä CROSS-TABULATION TABLE:")
        
        # Header row
        header = "Primary Site \\ Gender".ljust(25)
        for gender in genders:
            header += f"{gender:>10}"
        header += f"{'Total':>10}"
        print(header)
        print("-" * len(header))
        
        # Data rows
        for site in sites:
            row = site.ljust(25)
            row_total = 0
            for gender in genders:
                count = joint_distribution.get((gender, site), 0)
                row += f"{count:>10}"
                row_total += count
            row += f"{row_total:>10}"
            print(row)
        
        # Total row
        total_row = "Total".ljust(25)
        for gender in genders:
            total_row += f"{gender_totals[gender]:>10}"
        total_row += f"{total_cases:>10}"
        print("-" * len(header))
        print(total_row)
        
        # Show percentages
        print(f"\nüìà PERCENTAGE BREAKDOWN:")
        for (gender, site), count in sorted(joint_distribution.items(), key=lambda x: x[1], reverse=True):
            percentage = (count / total_cases) * 100 if total_cases > 0 else 0
            print(f"  {gender} √ó {site}: {count} cases ({percentage:.2f}%)")
        
        results["EV-H08"] = {
            "status": "success",
            "result": f"Joint distribution (gender √ó primary_site) for TCGA-COAD",
            "data": {
                "joint_distribution": joint_distribution,
                "gender_totals": gender_totals,
                "site_totals": site_totals,
                "total_cases": total_cases,
                "genders": genders,
                "sites": sites
            },
            "time": time.time() - start
        }
    except Exception as e:
        print(f"‚ùå EV-H08 Failed: {e}")
        results["EV-H08"] = {"status": "error", "error": str(e), "time": time.time() - start}

eval_H08()

‚úÖ EV-H08: Joint distribution (gender √ó primary_site) for TCGA-COAD
  Total cases analyzed: 461

üìä CROSS-TABULATION TABLE:
Primary Site \ Gender        female      malenot reported     Total
-------------------------------------------------------------------
Rectosigmoid junction           216       243         2       461
-------------------------------------------------------------------
Total                           216       243         2       461

üìà PERCENTAGE BREAKDOWN:
  male √ó Rectosigmoid junction: 243 cases (52.71%)
  female √ó Rectosigmoid junction: 216 cases (46.85%)
  not reported √ó Rectosigmoid junction: 2 cases (0.43%)


In [11]:
# EV-H09: In the GDC database, find and list all cases that meet BOTH of these criteria: (1) have a documented family history where a relative's primary diagnosis was "Breast Cancer" (family_histories.relationship_primary_diagnosis = "Breast Cancer"), AND (2) have at least one associated file that was generated using the RNA-Seq experimental strategy (files.experimental_strategy = "RNA-Seq"). Return the case identifiers, total count, and breakdown by project.
def eval_H09():
    start = time.time()
    try:
        filters = {
            "op": "and",
            "content": [
                {"op": "=", "content": {"field": "family_histories.relationship_primary_diagnosis", "value": "Breast Cancer"}},
                {"op": "=", "content": {"field": "files.experimental_strategy", "value": "RNA-Seq"}}
            ]
        }
        
        result = rest_query("cases", {
            "filters": json.dumps(filters),
            "size": "1000",
            "fields": "submitter_id,case_id,project.project_id,files.experimental_strategy"
        })
        
        cases_with_history_and_rnaseq = []
        project_counts = {}
        
        for case in result["data"]["hits"]:
            # Verify RNA-Seq files exist
            has_rnaseq = False
            files = case.get("files", [])
            for file_info in files:
                if file_info.get("experimental_strategy") == "RNA-Seq":
                    has_rnaseq = True
                    break
            
            if has_rnaseq:
                case_info = {
                    "case_id": case["case_id"],
                    "submitter_id": case["submitter_id"],
                    "project": case.get("project", {}).get("project_id", "Unknown")
                }
                cases_with_history_and_rnaseq.append(case_info)
                
                # Count by project
                project = case_info["project"]
                project_counts[project] = project_counts.get(project, 0) + 1
        
        count = len(cases_with_history_and_rnaseq)
        total_count = result["data"]["pagination"]["total"]
        
        print(f"‚úÖ EV-H09: Found {count} cases with family history of breast cancer AND RNA-Seq")
        print(f"  Total matching cases: {total_count}")
        
        if project_counts:
            sorted_projects = sorted(project_counts.items(), key=lambda x: x[1], reverse=True)
            print(f"  Cases by project:")
            for project, proj_count in sorted_projects:
                print(f"    {project}: {proj_count} cases")
        
        results["EV-H09"] = {
            "status": "success",
            "result": f"{count} cases with family history of breast cancer AND RNA-Seq",
            "data": {"count": count, "total_count": total_count, "cases": cases_with_history_and_rnaseq, "project_counts": project_counts},
            "time": time.time() - start,
        }
    except Exception as e:
        print(f"‚ùå EV-H09 Failed: {e}")
        results["EV-H09"] = {"status": "error", "error": str(e), "time": time.time() - start}

eval_H09()

‚úÖ EV-H09: Found 144 cases with family history of breast cancer AND RNA-Seq
  Total matching cases: 144
  Cases by project:
    MMRF-COMMPASS: 71 cases
    TCGA-BLCA: 30 cases
    TCGA-TGCT: 15 cases
    TCGA-PAAD: 10 cases
    TCGA-MESO: 7 cases
    TCGA-CHOL: 6 cases
    HCMI-CMDC: 4 cases
    TCGA-UVM: 1 cases


In [22]:
# EV-H10: In the GDC database, analyze all 608 cases in the TCGA-OV project (Ovarian Serous Cystadenocarcinoma). For each case, count how many diagnosis records it has in its diagnoses array. Then categorize ALL cases into exactly three groups: (1) cases with ZERO diagnoses (no diagnosis records at all), (2) cases with exactly ONE diagnosis, and (3) cases with MULTIPLE diagnoses (2 or more). Return the count for each category, ensuring that all three counts add up to the total of 608 cases. Also provide the case identifiers and diagnosis counts for cases with multiple diagnoses.
def eval_H10():
    start = time.time()
    try:
        filters = {
            "op": "=",
            "content": {
                "field": "project.project_id",
                "value": "TCGA-OV"
            }
        }
        
        # Fetch all cases (handle pagination if needed)
        all_cases = []
        from_ = 0
        size = 2000
        
        while True:
            result = rest_query("cases", {
                "filters": json.dumps(filters),
                "size": str(size),
                "from": str(from_),
                "fields": "submitter_id,case_id,diagnoses.diagnosis_id"
            })
            
            hits = result["data"]["hits"]
            if not hits:
                break
            
            all_cases.extend(hits)
            from_ += size
            
            # Check if we got all cases
            if len(all_cases) >= result["data"]["pagination"]["total"]:
                break
        
        # Count diagnosis categories
        cases_with_multiple_diagnoses = []
        zero_diagnosis_count = 0
        single_diagnosis_count = 0
        multiple_diagnosis_count = 0
        
        for case in all_cases:
            diagnoses = case.get("diagnoses", [])
            diagnosis_count = len(diagnoses)
            
            if diagnosis_count == 0:
                zero_diagnosis_count += 1
            elif diagnosis_count == 1:
                single_diagnosis_count += 1
            else:  # diagnosis_count > 1
                cases_with_multiple_diagnoses.append({
                    "case_id": case["case_id"],
                    "submitter_id": case["submitter_id"],
                    "diagnosis_count": diagnosis_count
                })
                multiple_diagnosis_count += 1
        
        total_cases_processed = len(all_cases)
        total_cases_in_db = result["data"]["pagination"]["total"]
        
        # Verify counts
        count_sum = zero_diagnosis_count + single_diagnosis_count + multiple_diagnosis_count
        assert count_sum == total_cases_processed, f"Count mismatch: {count_sum} != {total_cases_processed}"
        
        print(f"‚úÖ EV-H10: TCGA-OV diagnoses analysis:")
        print(f"  Total cases in database: {total_cases_in_db}")
        print(f"  Total cases processed: {total_cases_processed}")
        print(f"  Cases with ZERO diagnoses: {zero_diagnosis_count}")
        print(f"  Cases with single diagnosis: {single_diagnosis_count}")
        print(f"  Cases with multiple diagnoses: {multiple_diagnosis_count}")
        print(f"  Verification: {count_sum} = {total_cases_processed} ‚úì")
        
        if cases_with_multiple_diagnoses:
            print(f"  Sample cases with multiple diagnoses:")
            for case in cases_with_multiple_diagnoses[:5]:
                print(f"    {case['submitter_id']}: {case['diagnosis_count']} diagnoses")
        
        results["EV-H10"] = {
            "status": "success",
            "result": f"{multiple_diagnosis_count} TCGA-OV cases with multiple diagnoses",
            "data": {
                "total_cases": total_cases_processed,
                "zero_diagnoses": zero_diagnosis_count,
                "single_diagnosis": single_diagnosis_count,
                "multiple_diagnoses": multiple_diagnosis_count,
                "multiple_diagnosis_cases": cases_with_multiple_diagnoses
            },
            "time": time.time() - start,
        }
    except Exception as e:
        print(f"‚ùå EV-H10 Failed: {e}")
        results["EV-H10"] = {"status": "error", "error": str(e), "time": time.time() - start}

eval_H10()

‚úÖ EV-H10: TCGA-OV diagnoses analysis:
  Total cases in database: 608
  Total cases processed: 608
  Cases with ZERO diagnoses: 21
  Cases with single diagnosis: 216
  Cases with multiple diagnoses: 371
  Verification: 608 = 608 ‚úì
  Sample cases with multiple diagnoses:
    TCGA-10-0927: 2 diagnoses
    TCGA-42-2582: 7 diagnoses
    TCGA-24-2029: 2 diagnoses
    TCGA-10-0933: 2 diagnoses
    TCGA-29-1705: 3 diagnoses
