# COSMIC Cancer Annotation Example

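This notebook demonstrates the `knownCancer` method on both MAF and VCF data. The method annotates each variant's gene with information from the COSMIC Cancer Gene Census and the OncoKB cancer gene list.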

## Overview
- Load MAF data using `read_maf`
- Load VCF data using `read_vcf`
- Apply COSMIC and OncoKB cancer annotation to both objects with `knownCancer`
- Display annotated columns and results


In [8]:
# Setup and imports
import sys
import logging
from pathlib import Path

# Add src to path
sys.path.insert(0, str(Path.cwd() / "src"))

# Configure logging to show only important messages
logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger('pyMut')
logger.setLevel(logging.INFO)

# Import pyMut functions
from pyMut.input import read_maf, read_vcf

# Define the specific columns we want to display from knownCancer annotation
knowncancer_columns = [
    "COSMIC_ROLE_IN_CANCER",
    "COSMIC_TIER",
    "OncoKB_Is Oncogene",
    "OncoKB_Is Tumor Suppressor Gene",
    "OncoKB_OncoKB Annotated",
    "OncoKB_MSK-IMPACT",
    "OncoKB_MSK-HEME",
    "OncoKB_FOUNDATION ONE",
    "OncoKB_FOUNDATION ONE HEME",
    "OncoKB_Vogelstein",
    "Is_Oncogene_any"
]

print("✓ Setup complete")


✓ Setup complete


## Data File Paths

Define the paths to the example data files and to the COSMIC and OncoKB annotation tables.


In [9]:
# Data file paths
MAF_FILE = "../../../src/pyMut/data/examples/MAF/tcga_laml_VEP_annotated.maf.gz"
VCF_FILE = "../../../src/pyMut/data/examples/VCF/subset_50k_variants_vep_protein_gene_variant_class.vcf.gz"
COSMIC_ANNOTATION = "../../../src/pyMut/data/resources/COSMIC/Cosmic_CancerGeneCensus_Tsv_v102_GRCh38/Cosmic_CancerGeneCensus_v102_GRCh38.tsv.gz"
ONCOKB_ANNOTATION = "../../../src/pyMut/data/resources/OncoKb/cancerGeneList.tsv"

# Verify files exist
for file_path, name in [(MAF_FILE, "MAF"), (VCF_FILE, "VCF"), (COSMIC_ANNOTATION, "COSMIC"), (ONCOKB_ANNOTATION, "OncoKB")]:
    if Path(file_path).exists():
        print(f"✓ {name} file found: {Path(file_path).name}")
    else:
        print(f"✗ {name} file not found: {file_path}")


✓ MAF file found: tcga_laml_VEP_annotated.maf.gz
✓ VCF file found: subset_50k_variants_vep_protein_gene_variant_class.vcf.gz
✓ COSMIC file found: Cosmic_CancerGeneCensus_v102_GRCh38.tsv.gz
✓ OncoKB file found: cancerGeneList.tsv
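The check above only prints a message when a file is missing, so later cells would still run and fail further down. A minimal sketch of a stricter variant is shown below. The helper name `require_files` is hypothetical and not part of pyMut, and the demonstration uses a temporary directory rather than the notebook's real paths.

```python
from pathlib import Path
import tempfile


def require_files(paths):
    """Raise FileNotFoundError listing every missing path (hypothetical helper)."""
    missing = [f"{name}: {path}" for name, path in paths.items() if not Path(path).exists()]
    if missing:
        raise FileNotFoundError("Missing input files:\n" + "\n".join(missing))


# Demonstration on temporary files so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "example.tsv"
    existing.write_text("gene\n")

    require_files({"example": existing})  # passes silently

    try:
        require_files({"absent": Path(tmp) / "absent.tsv"})
    except FileNotFoundError as err:
        print(err)
```

In this notebook the call could take a dictionary built from `MAF_FILE`, `VCF_FILE`, `COSMIC_ANNOTATION` and `ONCOKB_ANNOTATION`.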


## 1. Load MAF Data

Load the TCGA LAML MAF file using `read_maf`.


In [10]:
# Load MAF data
print("Loading MAF data...")
py_mut_maf = read_maf(MAF_FILE, assembly="37")

print("\n📊 MAF Data Summary:")
print(f"   Shape: {py_mut_maf.data.shape}")
print(f"   Source: {py_mut_maf.metadata.source_format}")
print(f"   Unique genes: {py_mut_maf.data['Hugo_Symbol'].nunique()}")
print(f"   Unique samples: {py_mut_maf.data['Tumor_Sample_Barcode'].nunique()}")

# Show first few rows
print("\n📋 First 3 rows:")
display(py_mut_maf.data[['Hugo_Symbol', 'Variant_Classification', 'Tumor_Sample_Barcode']].head(3))


INFO:pyMut.input:Starting MAF reading: ../../../src/pyMut/data/examples/MAF/tcga_laml_VEP_annotated.maf.gz
INFO:pyMut.input:Reading MAF with 'pyarrow' engine…
INFO:pyMut.input:Detected 193 unique samples.
INFO:pyMut.input:Saving to cache: ../../../src/pyMut/data/examples/MAF/.pymut_cache/tcga_laml_VEP_annotated.maf_f372f4345eeea066.parquet
A suitable version of pyarrow or fastparquet is required for parquet support.
Trying to import the above resulted in these errors:
 - Missing optional dependency 'pyarrow'. pyarrow is required for parquet support. Use pip or conda to install pyarrow.
 - Missing optional dependency 'fastparquet'. fastparquet is required for parquet support. Use pip or conda to install fastparquet.
INFO:pyMut.input:MAF processed successfully: 2207 rows, 237 columns in 0.11 seconds


Loading MAF data...

📊 MAF Data Summary:
   Shape: (2207, 237)
   Source: MAF
   Unique genes: 1611
   Unique samples: 193

📋 First 3 rows:


|   | Hugo_Symbol | Variant_Classification | Tumor_Sample_Barcode |
|---|---|---|---|
| 0 | ABCA10 | SPLICE_SITE | TCGA-AB-2988 |
| 1 | ABCA4 | MISSENSE_MUTATION | TCGA-AB-2869 |
| 2 | ABCB11 | MISSENSE_MUTATION | TCGA-AB-3009 |
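The log for this cell contains a pandas message that neither `pyarrow` nor `fastparquet` is installed, which suggests the Parquet cache could not be written in this environment. The sketch below is one way to check for a Parquet engine before loading data. It uses only the standard library, and the function name `parquet_engine_available` is hypothetical, not a pyMut API.

```python
import importlib.util


def parquet_engine_available() -> bool:
    """Return True if pyarrow or fastparquet can be imported (hypothetical helper)."""
    return any(
        importlib.util.find_spec(name) is not None
        for name in ("pyarrow", "fastparquet")
    )


if parquet_engine_available():
    print("Parquet engine found")
else:
    print("No Parquet engine found; install pyarrow or fastparquet")
```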


## 2. Load VCF Data

Load the 1000 Genomes VCF file using `read_vcf`.


In [None]:
# Load VCF data
print("Loading VCF data...")
py_mut_vcf = read_vcf(VCF_FILE, assembly="38")

print("\n📊 VCF Data Summary:")
print(f"   Shape: {py_mut_vcf.data.shape}")
print(f"   Source: {py_mut_vcf.metadata.source_format}")
print(f"   Unique genes: {py_mut_vcf.data['Hugo_Symbol'].nunique()}")
print(f"   Chromosome: {py_mut_vcf.data['CHROM'].unique()[0]}")

# Show first few rows
print("\n📋 First 3 rows:")
display(py_mut_vcf.data[['Hugo_Symbol', 'CHROM', 'POS', 'REF', 'ALT', 'Variant_Classification']].head(3))


INFO:pyMut.input:Starting optimized VCF reading: ../../../src/pyMut/data/examples/VCF/subset_50k_variants_vep_protein_gene_variant_class.vcf.gz


Loading VCF data...


INFO:pyMut.input:Reading VCF with pandas + pyarrow optimization...
INFO:pyMut.input:Starting vectorized genotype conversion before INFO expansion...


## 3. Apply COSMIC Cancer Annotation to MAF Data

Apply the `knownCancer` method to the MAF data.


In [5]:
# Apply COSMIC annotation to MAF data
print("🔬 Applying COSMIC cancer annotation to MAF data...")

# Apply annotation (in_place=False to get returned DataFrame)
maf_annotated = py_mut_maf.knownCancer(
    annotation_table=COSMIC_ANNOTATION,
    oncokb_table=ONCOKB_ANNOTATION,
    in_place=False
)

print("\n✅ MAF Annotation Complete!")
print(f"   Original shape: {py_mut_maf.data.shape}")
print(f"   Annotated shape: {maf_annotated.shape}")

# Show new annotation columns
original_cols = set(py_mut_maf.data.columns)
new_cols = [col for col in maf_annotated.columns if col not in original_cols]
print(f"\n🏷️  New annotation columns ({len(new_cols)}):")
for col in new_cols:
    print(f"   • {col}")

# Show annotation results for genes with annotations
annotated_genes = maf_annotated[maf_annotated['Is_Oncogene_any'] == True]
if len(annotated_genes) > 0:
    print(f"\n🎯 Genes with cancer annotations ({len(annotated_genes)} variants):")
    # Use specific knowncancer_columns that are available in the data
    available_cols = ['Hugo_Symbol'] + [col for col in knowncancer_columns if col in maf_annotated.columns]
    display(annotated_genes[available_cols].drop_duplicates('Hugo_Symbol').head(10))
else:
    print("\n⚠️  No genes found with cancer annotations in this dataset")


INFO:pyMut.annotate.cosmic_cancer_annotate:DataFrame memory usage: 0.03 GB
INFO:pyMut.annotate.cosmic_cancer_annotate:Using pandas backend for annotation
INFO:pyMut.annotate.cosmic_cancer_annotate:Starting pandas annotation for DataFrame: 2207 rows, 237 columns
INFO:pyMut.annotate.cosmic_cancer_annotate:Using join column: Hugo_Symbol
INFO:pyMut.annotate.cosmic_cancer_annotate:Reading annotation table: ../../../src/pyMut/data/resources/COSMIC/Cosmic_CancerGeneCensus_Tsv_v102_GRCh38/Cosmic_CancerGeneCensus_v102_GRCh38.tsv.gz
INFO:pyMut.annotate.cosmic_cancer_annotate:Annotation table loaded: 758 rows, 21 columns
INFO:pyMut.annotate.cosmic_cancer_annotate:Creating synonyms dictionary from column 'SYNONYMS'...
INFO:pyMut.annotate.cosmic_cancer_annotate:Created synonyms dictionary with 4710 mappings
INFO:pyMut.annotate.cosmic_cancer_annotate:Applying synonyms mapping to PyMutation data...
INFO:pyMut.annotate.cosmic_cancer_annotate:Gene mapping results: 2202 direct matches, 5 synonym matches

🔬 Applying COSMIC cancer annotation to MAF data...


INFO:pyMut.annotate.cosmic_cancer_annotate:COSMIC annotation completed: 2207 rows, 257 columns
INFO:pyMut.annotate.cosmic_cancer_annotate:Added 20 COSMIC annotation columns
INFO:pyMut.annotate.cosmic_cancer_annotate:Reading OncoKB table: ../../../src/pyMut/data/resources/OncoKb/cancerGeneList.tsv
INFO:pyMut.annotate.cosmic_cancer_annotate:OncoKB table loaded: 1195 rows, 17 columns
INFO:pyMut.annotate.cosmic_cancer_annotate:Creating OncoKB synonyms dictionary from column 'Gene Aliases'...
INFO:pyMut.annotate.cosmic_cancer_annotate:Created OncoKB synonyms dictionary with 3291 mappings
INFO:pyMut.annotate.cosmic_cancer_annotate:Applying synonyms mapping to PyMutation data...
INFO:pyMut.annotate.cosmic_cancer_annotate:Gene mapping results: 2190 direct matches, 17 synonym matches
INFO:pyMut.annotate.cosmic_cancer_annotate:Performing OncoKB annotation merge...
INFO:pyMut.annotate.cosmic_cancer_annotate:OncoKB annotation completed: 2207 rows, 273 columns
INFO:pyMut.annotate.cosmic_cancer_anno


✅ MAF Annotation Complete!
   Original shape: (2207, 237)
   Annotated shape: (2207, 248)

🏷️  New annotation columns (11):
   • COSMIC_ROLE_IN_CANCER
   • COSMIC_TIER
   • OncoKB_Is Oncogene
   • OncoKB_Is Tumor Suppressor Gene
   • OncoKB_OncoKB Annotated
   • OncoKB_MSK-IMPACT
   • OncoKB_MSK-HEME
   • OncoKB_FOUNDATION ONE
   • OncoKB_FOUNDATION ONE HEME
   • OncoKB_Vogelstein
   • Is_Oncogene_any

🎯 Genes with cancer annotations (536 variants):


|   | Hugo_Symbol | COSMIC_ROLE_IN_CANCER | COSMIC_TIER | OncoKB_Is Oncogene | OncoKB_Is Tumor Suppressor Gene | OncoKB_OncoKB Annotated | OncoKB_MSK-IMPACT | OncoKB_MSK-HEME | OncoKB_FOUNDATION ONE | OncoKB_FOUNDATION ONE HEME | OncoKB_Vogelstein | Is_Oncogene_any |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | ABL1 | oncogene, fusion | 1.0 | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | True |
| 40 | AFF4 | oncogene, fusion | 1.0 | Yes | No | Yes | No | No | No | Yes | No | True |
| 57 | ALOX5 |  |  | Yes | Yes | Yes | No | No | No | No | No | True |
| 69 | ANKRD26 |  |  | Yes | No | Yes | No | No | No | No | No | True |
| 88 | ARHGAP5 | oncogene | 2.0 |  |  |  |  |  |  |  |  | True |
| 90 | ARHGEF10L | TSG | 2.0 |  |  |  |  |  |  |  |  | True |
| 96 | ARID1A | TSG, fusion | 1.0 | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | True |
| 97 | ARID2 | TSG | 1.0 | No | Yes | Yes | Yes | Yes | No | Yes | Yes | True |
| 107 | ASXL1 | TSG | 1.0 | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | True |
| 112 | ASXL2 | TSG | 2.0 | No | Yes | Yes | Yes | Yes | No | No | No | True |
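The log for this step reports that a synonyms dictionary is built from the annotation table's `SYNONYMS` column and that a few gene symbols matched only through a synonym. The sketch below illustrates that general idea on toy data. It is not pyMut's implementation: the gene names, the `GENE_SYMBOL` column name, the comma separator, and the helper `build_synonym_map` are all assumptions made for illustration.

```python
import pandas as pd

# Toy annotation table; gene names and synonyms are illustrative only.
annotation = pd.DataFrame({
    "GENE_SYMBOL": ["GENE_A", "GENE_B"],
    "SYNONYMS": ["ALIAS_A1,ALIAS_A2", "ALIAS_B1"],
})


def build_synonym_map(table, symbol_col, synonyms_col, sep=","):
    """Map each alias to its canonical symbol (hypothetical helper)."""
    mapping = {}
    for symbol, synonyms in zip(table[symbol_col], table[synonyms_col]):
        if pd.isna(synonyms):
            continue
        for alias in str(synonyms).split(sep):
            alias = alias.strip()
            if alias:
                # Keep the first canonical symbol seen for an alias.
                mapping.setdefault(alias, symbol)
    return mapping


synonym_map = build_synonym_map(annotation, "GENE_SYMBOL", "SYNONYMS")

variants = pd.DataFrame({"Hugo_Symbol": ["GENE_A", "ALIAS_B1", "GENE_C"]})
canonical = set(annotation["GENE_SYMBOL"])

# Canonical symbols pass through, known aliases are translated,
# and everything else is left unchanged.
variants["join_symbol"] = [
    s if s in canonical else synonym_map.get(s, s)
    for s in variants["Hugo_Symbol"]
]
print(variants)
```

Joining on a translated column rather than overwriting `Hugo_Symbol` keeps the original symbols available for inspection.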


## 4. Apply COSMIC Cancer Annotation to VCF Data

Apply the `knownCancer` method to the VCF data.


In [6]:
# Apply COSMIC annotation to VCF data
print("🔬 Applying COSMIC cancer annotation to VCF data...")

# Apply annotation (in_place=False to get returned DataFrame)
vcf_annotated = py_mut_vcf.knownCancer(
    annotation_table=COSMIC_ANNOTATION,
    oncokb_table=ONCOKB_ANNOTATION,
    in_place=False
)

print("\n✅ VCF Annotation Complete!")
print(f"   Original shape: {py_mut_vcf.data.shape}")
print(f"   Annotated shape: {vcf_annotated.shape}")

# Show new annotation columns
original_cols = set(py_mut_vcf.data.columns)
new_cols = [col for col in vcf_annotated.columns if col not in original_cols]
print(f"\n🏷️  New annotation columns ({len(new_cols)}):")
for col in new_cols:
    print(f"   • {col}")

# Show annotation results for genes with annotations
annotated_genes = vcf_annotated[vcf_annotated['Is_Oncogene_any'] == True]
if len(annotated_genes) > 0:
    print(f"\n🎯 Genes with cancer annotations ({len(annotated_genes)} variants):")
    # Use specific knowncancer_columns that are available in the data, plus VCF-specific columns
    available_annotation_cols = [col for col in knowncancer_columns if col in vcf_annotated.columns]
    annotation_cols = ['Hugo_Symbol', 'CHROM', 'POS'] + available_annotation_cols
    display(annotated_genes[annotation_cols].drop_duplicates('Hugo_Symbol').head(10))
else:
    print("\n⚠️  No genes found with cancer annotations in this dataset")


🔬 Applying COSMIC cancer annotation to VCF data...


INFO:pyMut.annotate.cosmic_cancer_annotate:DataFrame memory usage: 7.29 GB
INFO:pyMut.annotate.cosmic_cancer_annotate:Using pandas backend for annotation
INFO:pyMut.annotate.cosmic_cancer_annotate:Starting pandas annotation for DataFrame: 50000 rows, 2602 columns
INFO:pyMut.annotate.cosmic_cancer_annotate:Using join column: Hugo_Symbol
INFO:pyMut.annotate.cosmic_cancer_annotate:Reading annotation table: ../../../src/pyMut/data/resources/COSMIC/Cosmic_CancerGeneCensus_Tsv_v102_GRCh38/Cosmic_CancerGeneCensus_v102_GRCh38.tsv.gz
INFO:pyMut.annotate.cosmic_cancer_annotate:Annotation table loaded: 758 rows, 21 columns
INFO:pyMut.annotate.cosmic_cancer_annotate:Creating synonyms dictionary from column 'SYNONYMS'...
INFO:pyMut.annotate.cosmic_cancer_annotate:Created synonyms dictionary with 4710 mappings
INFO:pyMut.annotate.cosmic_cancer_annotate:Applying synonyms mapping to PyMutation data...
INFO:pyMut.annotate.cosmic_cancer_annotate:Gene mapping results: 50000 direct matches, 0 synonym matc


✅ VCF Annotation Complete!
   Original shape: (50000, 2602)
   Annotated shape: (50000, 2613)

🏷️  New annotation columns (11):
   • COSMIC_ROLE_IN_CANCER
   • COSMIC_TIER
   • OncoKB_Is Oncogene
   • OncoKB_Is Tumor Suppressor Gene
   • OncoKB_OncoKB Annotated
   • OncoKB_MSK-IMPACT
   • OncoKB_MSK-HEME
   • OncoKB_FOUNDATION ONE
   • OncoKB_FOUNDATION ONE HEME
   • OncoKB_Vogelstein
   • Is_Oncogene_any

🎯 Genes with cancer annotations (17072 variants):


|   | Hugo_Symbol | CHROM | POS | COSMIC_ROLE_IN_CANCER | COSMIC_TIER | OncoKB_Is Oncogene | OncoKB_Is Tumor Suppressor Gene | OncoKB_OncoKB Annotated | OncoKB_MSK-IMPACT | OncoKB_MSK-HEME | OncoKB_FOUNDATION ONE | OncoKB_FOUNDATION ONE HEME | OncoKB_Vogelstein | Is_Oncogene_any |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25959 | LARP4B | chr10 | 753787 | TSG | 2.0 | No | No | Yes | No | No | No | No | No | True |
| 38840 | ADARB2 | chr10 | 1151506 |  |  | Yes | No | Yes | No | No | No | No | No | True |
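In the tables shown in this notebook, `Is_Oncogene_any` is `True` for genes that have a COSMIC role, an OncoKB record, or both. The sketch below shows one reading that is consistent with those rows: a flag that is true when a gene appears in at least one resource. This is an assumption based on the displayed output, not a description of how pyMut computes the column. The toy data and the column name `in_any_resource` are hypothetical.

```python
import pandas as pd

# Toy gene-level table; values are illustrative only.
genes = pd.DataFrame({
    "Hugo_Symbol": ["GENE_A", "GENE_B", "GENE_C"],
    "COSMIC_ROLE_IN_CANCER": ["TSG", None, None],
    "OncoKB_OncoKB Annotated": [None, "Yes", None],
})

# A gene counts as present in COSMIC if it has a non-empty role,
# and as present in OncoKB if it is marked as annotated there.
in_cosmic = genes["COSMIC_ROLE_IN_CANCER"].fillna("").ne("")
in_oncokb = genes["OncoKB_OncoKB Annotated"].fillna("").eq("Yes")

# Assumed combined flag: present in at least one resource.
genes["in_any_resource"] = in_cosmic | in_oncokb
print(genes)
```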


## 5. Summary and Comparison

Compare the annotation results between MAF and VCF data.


In [7]:
# Summary comparison
print("📊 COSMIC Cancer Annotation Summary")
print("=" * 50)

# MAF results
maf_oncogenes = maf_annotated[maf_annotated['Is_Oncogene_any'] == True]['Hugo_Symbol'].nunique()
maf_total_genes = maf_annotated['Hugo_Symbol'].nunique()
maf_cosmic_role = maf_annotated['COSMIC_ROLE_IN_CANCER'].value_counts().to_dict() if 'COSMIC_ROLE_IN_CANCER' in maf_annotated.columns else {}

print("\n🧬 MAF Data Results:")
print(f"   Total unique genes: {maf_total_genes}")
print(f"   Genes with cancer annotations: {maf_oncogenes}")
print(f"   Annotation rate: {maf_oncogenes/maf_total_genes*100:.1f}%")
if maf_cosmic_role:
    print(f"   COSMIC roles found: {list(maf_cosmic_role.keys())}")

# VCF results
vcf_oncogenes = vcf_annotated[vcf_annotated['Is_Oncogene_any'] == True]['Hugo_Symbol'].nunique()
vcf_total_genes = vcf_annotated['Hugo_Symbol'].nunique()
vcf_cosmic_role = vcf_annotated['COSMIC_ROLE_IN_CANCER'].value_counts().to_dict() if 'COSMIC_ROLE_IN_CANCER' in vcf_annotated.columns else {}

print("\n🧬 VCF Data Results:")
print(f"   Total unique genes: {vcf_total_genes}")
print(f"   Genes with cancer annotations: {vcf_oncogenes}")
print(f"   Annotation rate: {vcf_oncogenes/vcf_total_genes*100:.1f}%")
if vcf_cosmic_role:
    print(f"   COSMIC roles found: {list(vcf_cosmic_role.keys())}")

print("\n✅ Annotation process completed successfully for both datasets!")


📊 COSMIC Cancer Annotation Summary

🧬 MAF Data Results:
   Total unique genes: 1611
   Genes with cancer annotations: 158
   Annotation rate: 9.8%
   COSMIC roles found: ['', 'oncogene', 'TSG', 'oncogene, TSG, fusion', 'oncogene, fusion', 'fusion', 'oncogene, TSG', 'TSG, fusion']

🧬 VCF Data Results:
   Total unique genes: 13
   Genes with cancer annotations: 2
   Annotation rate: 15.4%
   COSMIC roles found: ['', 'TSG']

✅ Annotation process completed successfully for both datasets!
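The role lists above contain combined values such as `'oncogene, TSG, fusion'` as well as an empty string, and `value_counts` on the variant-level table counts variants rather than genes. The sketch below shows one way to count individual roles per unique gene with pandas. The toy data is illustrative and the variable names are hypothetical.

```python
import pandas as pd

# Toy variant-level table: GENE_A carries two variants.
annotated = pd.DataFrame({
    "Hugo_Symbol": ["GENE_A", "GENE_A", "GENE_B", "GENE_C"],
    "COSMIC_ROLE_IN_CANCER": ["oncogene, fusion", "oncogene, fusion", "TSG", ""],
})

# Reduce to one row per gene so each gene is counted once.
per_gene = annotated.drop_duplicates("Hugo_Symbol")

# Split combined roles into one row per role and trim whitespace.
roles = (
    per_gene["COSMIC_ROLE_IN_CANCER"]
    .fillna("")
    .str.split(",")
    .explode()
    .str.strip()
)

# Drop the empty role that marks genes without a COSMIC entry.
role_counts = roles[roles != ""].value_counts().to_dict()
print(role_counts)
```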


## 6. Detailed Annotation Results

Show the COSMIC and OncoKB annotation columns for genes flagged by `Is_Oncogene_any`, one row per gene, limited to the first 10 genes of each dataset.


In [8]:
# Show detailed annotation results
print("🔍 Detailed Annotation Results")
print("=" * 40)

# Function to show annotation details
def show_annotation_details(data, dataset_name):
    print(f"\n📋 {dataset_name} - Genes with COSMIC annotations:")
    
    # Get genes with annotations
    annotated = data[data['Is_Oncogene_any'] == True]
    
    if len(annotated) == 0:
        print("   No genes with COSMIC annotations found.")
        return
    
    # Show specific knowncancer annotation columns
    available_annotation_cols = [col for col in knowncancer_columns if col in data.columns]
    
    if available_annotation_cols:
        gene_annotations = annotated[['Hugo_Symbol'] + available_annotation_cols].drop_duplicates('Hugo_Symbol')
        
        print(f"   Found {len(gene_annotations)} unique genes with annotations:")
        print(f"   Available annotation columns: {', '.join(available_annotation_cols)}")
        
        # Show detailed table with all available annotation columns
        if len(gene_annotations) > 0:
            print("\n   📋 Detailed annotation table:")
            display(gene_annotations.head(10))
        
        if len(gene_annotations) > 10:
            print(f"   ... and {len(gene_annotations) - 10} more genes")

# Show details for both datasets
show_annotation_details(maf_annotated, "MAF Dataset")
show_annotation_details(vcf_annotated, "VCF Dataset")


🔍 Detailed Annotation Results

📋 MAF Dataset - Genes with COSMIC annotations:
   Found 158 unique genes with annotations:
   Available annotation columns: COSMIC_ROLE_IN_CANCER, COSMIC_TIER, OncoKB_Is Oncogene, OncoKB_Is Tumor Suppressor Gene, OncoKB_OncoKB Annotated, OncoKB_MSK-IMPACT, OncoKB_MSK-HEME, OncoKB_FOUNDATION ONE, OncoKB_FOUNDATION ONE HEME, OncoKB_Vogelstein, Is_Oncogene_any

   📋 Detailed annotation table:


|   | Hugo_Symbol | COSMIC_ROLE_IN_CANCER | COSMIC_TIER | OncoKB_Is Oncogene | OncoKB_Is Tumor Suppressor Gene | OncoKB_OncoKB Annotated | OncoKB_MSK-IMPACT | OncoKB_MSK-HEME | OncoKB_FOUNDATION ONE | OncoKB_FOUNDATION ONE HEME | OncoKB_Vogelstein | Is_Oncogene_any |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | ABL1 | oncogene, fusion | 1.0 | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | True |
| 40 | AFF4 | oncogene, fusion | 1.0 | Yes | No | Yes | No | No | No | Yes | No | True |
| 57 | ALOX5 |  |  | Yes | Yes | Yes | No | No | No | No | No | True |
| 69 | ANKRD26 |  |  | Yes | No | Yes | No | No | No | No | No | True |
| 88 | ARHGAP5 | oncogene | 2.0 |  |  |  |  |  |  |  |  | True |
| 90 | ARHGEF10L | TSG | 2.0 |  |  |  |  |  |  |  |  | True |
| 96 | ARID1A | TSG, fusion | 1.0 | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | True |
| 97 | ARID2 | TSG | 1.0 | No | Yes | Yes | Yes | Yes | No | Yes | Yes | True |
| 107 | ASXL1 | TSG | 1.0 | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | True |
| 112 | ASXL2 | TSG | 2.0 | No | Yes | Yes | Yes | Yes | No | No | No | True |


   ... and 148 more genes

📋 VCF Dataset - Genes with COSMIC annotations:
   Found 2 unique genes with annotations:
   Available annotation columns: COSMIC_ROLE_IN_CANCER, COSMIC_TIER, OncoKB_Is Oncogene, OncoKB_Is Tumor Suppressor Gene, OncoKB_OncoKB Annotated, OncoKB_MSK-IMPACT, OncoKB_MSK-HEME, OncoKB_FOUNDATION ONE, OncoKB_FOUNDATION ONE HEME, OncoKB_Vogelstein, Is_Oncogene_any

   📋 Detailed annotation table:


|   | Hugo_Symbol | COSMIC_ROLE_IN_CANCER | COSMIC_TIER | OncoKB_Is Oncogene | OncoKB_Is Tumor Suppressor Gene | OncoKB_OncoKB Annotated | OncoKB_MSK-IMPACT | OncoKB_MSK-HEME | OncoKB_FOUNDATION ONE | OncoKB_FOUNDATION ONE HEME | OncoKB_Vogelstein | Is_Oncogene_any |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25959 | LARP4B | TSG | 2.0 | No | No | Yes | No | No | No | No | No | True |
| 38840 | ADARB2 |  |  | Yes | No | Yes | No | No | No | No | No | True |
