# COSMIC Cancer Annotation Example

This notebook demonstrates `cosmic_cancer_annotate` on data loaded from both a MAF file and a VCF file.

## Overview
- Load MAF data using `read_maf`
- Load VCF data using `read_vcf`
- Apply COSMIC cancer annotation to both loaded objects
- Display the new annotation columns and the annotated genes


In [1]:
# Setup and imports
import sys
import logging
from pathlib import Path
import pandas as pd

# Add src to path
sys.path.insert(0, str(Path.cwd() / "src"))

# Show only warnings and above by default, but keep INFO messages from pyMut
logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger('pyMut')
logger.setLevel(logging.INFO)

# Import pyMut functions
from pyMut.input import read_maf, read_vcf

print("✓ Setup complete")


✓ Setup complete


## Data File Paths

Define the paths to the example MAF and VCF files and to the COSMIC Cancer Gene Census annotation table, then check that each file exists.


In [2]:
# Data file paths
MAF_FILE = "../../../src/pyMut/data/examples/MAF/tcga_laml_VEP_annotated.maf.gz"
VCF_FILE = "../../../src/pyMut/data/examples/VCF/subset_1k_variants_ALL.chr10.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased_vep_protein_gene_variant_class.vcf"
COSMIC_ANNOTATION = "../../../src/pyMut/data/resources/COSMIC/Cosmic_CancerGeneCensus_Tsv_v102_GRCh38/Cosmic_CancerGeneCensus_v102_GRCh38.tsv.gz"

# Verify files exist
for file_path, name in [(MAF_FILE, "MAF"), (VCF_FILE, "VCF"), (COSMIC_ANNOTATION, "COSMIC")]:
    if Path(file_path).exists():
        print(f"✓ {name} file found: {Path(file_path).name}")
    else:
        print(f"✗ {name} file not found: {file_path}")


✓ MAF file found: tcga_laml_VEP_annotated.maf.gz
✓ VCF file found: subset_1k_variants_ALL.chr10.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased_vep_protein_gene_variant_class.vcf
✓ COSMIC file found: Cosmic_CancerGeneCensus_v102_GRCh38.tsv.gz
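
The check above only prints a message when a file is missing, so a later cell would still fail, possibly with a less obvious error. A minimal sketch of a stricter variant follows. It assumes a hypothetical helper named `require_files`, which is not part of pyMut, and raises as soon as any path is absent.

```python
from pathlib import Path


def require_files(paths):
    """Return the inputs as Path objects, or raise if any file is missing.

    Hypothetical notebook helper (not a pyMut function).
    `paths` maps a short label such as "MAF" to a file path.
    """
    missing = {name: str(p) for name, p in paths.items() if not Path(p).exists()}
    if missing:
        details = ", ".join(f"{name}: {p}" for name, p in missing.items())
        raise FileNotFoundError(f"Missing input file(s): {details}")
    return {name: Path(p) for name, p in paths.items()}
```

In this notebook it could be called as `require_files({"MAF": MAF_FILE, "VCF": VCF_FILE, "COSMIC": COSMIC_ANNOTATION})` before any data is loaded.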


## 1. Load MAF Data

Load the TCGA LAML MAF file using `read_maf`.


In [3]:
# Load MAF data
print("Loading MAF data...")
py_mut_maf = read_maf(MAF_FILE)

print(f"\n📊 MAF Data Summary:")
print(f"   Shape: {py_mut_maf.data.shape}")
print(f"   Source: {py_mut_maf.metadata.source_format}")
print(f"   Unique genes: {py_mut_maf.data['Hugo_Symbol'].nunique()}")
print(f"   Unique samples: {py_mut_maf.data['Tumor_Sample_Barcode'].nunique()}")

# Show first few rows
print(f"\n📋 First 3 rows:")
display(py_mut_maf.data[['Hugo_Symbol', 'Variant_Classification', 'Tumor_Sample_Barcode']].head(3))


INFO:pyMut.input:Starting MAF reading: ../../../src/pyMut/data/examples/MAF/tcga_laml_VEP_annotated.maf.gz
INFO:pyMut.input:Reading MAF with 'pyarrow' engine…
INFO:pyMut.input:Reading with 'pyarrow' completed.
INFO:pyMut.input:Detected 193 unique samples.


Loading MAF data...


INFO:pyMut.input:Saving to cache: ../../../src/pyMut/data/examples/MAF/.pymut_cache/tcga_laml_VEP_annotated.maf_53cd8e2f7bdc9e4b.parquet
INFO:pyMut.input:MAF processed successfully: 2207 rows, 237 columns in 0.20 seconds



📊 MAF Data Summary:
   Shape: (2207, 237)
   Source: MAF
   Unique genes: 1611
   Unique samples: 193

📋 First 3 rows:


|   | Hugo_Symbol | Variant_Classification | Tumor_Sample_Barcode |
|---|---|---|---|
| 0 | ABCA10 | SPLICE_SITE | TCGA-AB-2988 |
| 1 | ABCA4 | MISSENSE_MUTATION | TCGA-AB-2869 |
| 2 | ABCB11 | MISSENSE_MUTATION | TCGA-AB-3009 |


## 2. Load VCF Data

Load the 1000 Genomes VCF file using `read_vcf`.


In [4]:
# Load VCF data
print("Loading VCF data...")
py_mut_vcf = read_vcf(VCF_FILE)

print(f"\n📊 VCF Data Summary:")
print(f"   Shape: {py_mut_vcf.data.shape}")
print(f"   Source: {py_mut_vcf.metadata.source_format}")
print(f"   Unique genes: {py_mut_vcf.data['Hugo_Symbol'].nunique()}")
print(f"   Chromosome: {py_mut_vcf.data['CHROM'].unique()[0]}")

# Show first few rows
print(f"\n📋 First 3 rows:")
display(py_mut_vcf.data[['Hugo_Symbol', 'CHROM', 'POS', 'REF', 'ALT', 'Variant_Classification']].head(3))


INFO:pyMut.input:Starting optimized VCF reading: ../../../src/pyMut/data/examples/VCF/subset_1k_variants_ALL.chr10.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased_vep_protein_gene_variant_class.vcf
INFO:pyMut.input:Reading VCF with pandas + pyarrow optimization...


Loading VCF data...


INFO:pyMut.input:Pandas reading completed.
INFO:pyMut.input:Expanding INFO column with vectorized operations...
INFO:pyMut.input:Expanding VEP CSQ annotations into individual columns...
INFO:pyMut.input:CSQ expanded into 31 VEP annotation columns in 0.10 s
INFO:pyMut.input:Generating Hugo_Symbol column from VEP_SYMBOL and VEP_NEAREST...
INFO:pyMut.input:Hugo_Symbol column generated in 0.00 s
INFO:pyMut.input:Generating Variant_Classification from VEP_Consequence and VEP_VARIANT_CLASS...
INFO:pyMut.input:Variant_Classification generated in 0.22 s
INFO:pyMut.input:Generating Variant_Type from VEP_VARIANT_CLASS...
INFO:pyMut.input:Variant_Type generated in 0.21 s
INFO:pyMut.input:Detected 2555 sample columns. Starting vectorized genotype conversion...
INFO:pyMut.input:GT conversion: 3.55 s
INFO:pyMut.input:Saving to cache: ../../../src/pyMut/data/examples/VCF/.pymut_cache/subset_1k_variants_ALL.chr10.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased_vep_protein_gene_variant_class_c


📊 VCF Data Summary:
   Shape: (1000, 2602)
   Source: VCF
   Unique genes: 2
   Chromosome: chr10

📋 First 3 rows:


|   | Hugo_Symbol | CHROM | POS | REF | ALT | Variant_Classification |
|---|---|---|---|---|---|---|
| 0 | TUBB8 | chr10 | 11501 | C | A | INTRON |
| 1 | TUBB8 | chr10 | 36097 | G | A | INTRON |
| 2 | TUBB8 | chr10 | 45900 | C | T | 3'FLANK |
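
The log for this cell mentions that the VEP `CSQ` annotation is expanded into individual columns. As a rough illustration of what that step involves (not pyMut's actual implementation): VEP writes each `CSQ` entry as pipe-separated values, declares the field order in the VCF header after `Format:`, and separates multiple entries with commas. The header line, the four-field subset, and the `CSQ` value below are made up for the example, and both function names are hypothetical.

```python
def parse_csq_format(header_line):
    """Extract the CSQ field names from a VEP ##INFO header line."""
    # VEP declares the field order after the literal text "Format: ".
    fields = header_line.split("Format: ", 1)[1]
    # Drop the closing quote and angle bracket of the header line.
    return fields.rstrip('">').split("|")


def parse_csq(csq_value, field_names):
    """Turn one CSQ value into a list of dicts, one per comma-separated entry."""
    records = []
    for entry in csq_value.split(","):
        values = entry.split("|")
        records.append(dict(zip(field_names, values)))
    return records


# Illustrative header and value; real files carry many more fields.
header = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations '
          'from Ensembl VEP. Format: Allele|Consequence|SYMBOL|VARIANT_CLASS">')
names = parse_csq_format(header)
records = parse_csq("A|intron_variant|TUBB8|SNV,A|upstream_gene_variant||SNV", names)
print(records[0]["SYMBOL"])
```

An empty field, such as the missing `SYMBOL` in the second entry, comes back as an empty string. That may be why the log also reports a fallback column (`VEP_NEAREST`) when `Hugo_Symbol` is generated.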


## 3. Apply COSMIC Cancer Annotation to MAF Data

Call `cosmic_cancer_annotate` on the MAF object. With `in_place=False`, the method returns the annotated DataFrame and leaves `py_mut_maf.data` unchanged.


In [5]:
# Apply COSMIC annotation to MAF data
print("🔬 Applying COSMIC cancer annotation to MAF data...")

# Apply annotation (in_place=False to get returned DataFrame)
maf_annotated = py_mut_maf.cosmic_cancer_annotate(
    annotation_table=COSMIC_ANNOTATION,
    in_place=False
)

print(f"\n✅ MAF Annotation Complete!")
print(f"   Original shape: {py_mut_maf.data.shape}")
print(f"   Annotated shape: {maf_annotated.shape}")

# Show new annotation columns
original_cols = set(py_mut_maf.data.columns)
new_cols = [col for col in maf_annotated.columns if col not in original_cols]
print(f"\n🏷️  New annotation columns ({len(new_cols)}):")
for col in new_cols:
    print(f"   • {col}")

# Show annotation results for genes with annotations
annotated_genes = maf_annotated[maf_annotated['Is_Oncogene_any'] == True]
if len(annotated_genes) > 0:
    print(f"\n🎯 Genes with cancer annotations ({len(annotated_genes)} variants):")
    annotation_cols = ['Hugo_Symbol'] + [col for col in new_cols if col in maf_annotated.columns]
    display(annotated_genes[annotation_cols].drop_duplicates('Hugo_Symbol').head(10))
else:
    print("\n⚠️  No genes found with cancer annotations in this dataset")


INFO:pyMut.annotate.cosmic_cancer_annotate:DataFrame memory usage: 0.03 GB
INFO:pyMut.annotate.cosmic_cancer_annotate:Using pandas backend for annotation
INFO:pyMut.annotate.cosmic_cancer_annotate:Starting pandas annotation for DataFrame: 2207 rows, 237 columns
INFO:pyMut.annotate.cosmic_cancer_annotate:Using MAF join column: Hugo_Symbol
INFO:pyMut.annotate.cosmic_cancer_annotate:Reading annotation table: ../../../src/pyMut/data/resources/COSMIC/Cosmic_CancerGeneCensus_Tsv_v102_GRCh38/Cosmic_CancerGeneCensus_v102_GRCh38.tsv.gz
INFO:pyMut.annotate.cosmic_cancer_annotate:Annotation table loaded: 758 rows, 21 columns
INFO:pyMut.annotate.cosmic_cancer_annotate:Creating synonyms dictionary from column 'SYNONYMS'...
INFO:pyMut.annotate.cosmic_cancer_annotate:Created synonyms dictionary with 4710 mappings
INFO:pyMut.annotate.cosmic_cancer_annotate:Applying synonyms mapping to MAF data...
INFO:pyMut.annotate.cosmic_cancer_annotate:Gene mapping results: 2202 direct matches, 5 synonym matches

🔬 Applying COSMIC cancer annotation to MAF data...

✅ MAF Annotation Complete!
   Original shape: (2207, 237)
   Annotated shape: (2207, 240)

🏷️  New annotation columns (3):
   • COSMIC_ROLE_IN_CANCER
   • COSMIC_TIER
   • Is_Oncogene_any

🎯 Genes with cancer annotations (513 variants):


|   | Hugo_Symbol | COSMIC_ROLE_IN_CANCER | COSMIC_TIER | Is_Oncogene_any |
|---|---|---|---|---|
| 8 | ABL1 | oncogene, fusion | 1.0 | True |
| 40 | AFF4 | oncogene, fusion | 1.0 | True |
| 88 | ARHGAP5 | oncogene | 2.0 | True |
| 90 | ARHGEF10L | TSG | 2.0 | True |
| 96 | ARID1A | TSG, fusion | 1.0 | True |
| 97 | ARID2 | TSG | 1.0 | True |
| 107 | ASXL1 | TSG | 1.0 | True |
| 112 | ASXL2 | TSG | 2.0 | True |
| 125 | ATP2B3 | TSG | 1.0 | True |
| 141 | BACH1 | TSG | 1.0 | True |
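
The log for this cell reports a join on `Hugo_Symbol`, a synonyms dictionary built from the `SYNONYMS` column, and a split between direct and synonym matches. The sketch below illustrates that general idea in plain Python; it is not pyMut's implementation. The toy census rows, the `GENE_SYMBOL`, `ROLE_IN_CANCER` and `TIER` keys, the placeholder values for unmatched genes, and both helper names are assumptions for illustration.

```python
def build_synonym_map(census_rows):
    """Map each synonym to its canonical gene symbol."""
    synonym_map = {}
    for row in census_rows:
        for synonym in row["SYNONYMS"].split(","):
            synonym = synonym.strip()
            if synonym:
                # setdefault keeps the first gene seen if a synonym is ambiguous.
                synonym_map.setdefault(synonym, row["GENE_SYMBOL"])
    return synonym_map


def annotate_symbols(symbols, census_rows):
    """Left-join gene symbols against the census, falling back to synonyms."""
    by_symbol = {row["GENE_SYMBOL"]: row for row in census_rows}
    synonym_map = build_synonym_map(census_rows)
    annotated = []
    for symbol in symbols:
        # Prefer a direct match; otherwise try to resolve the symbol as a synonym.
        canonical = symbol if symbol in by_symbol else synonym_map.get(symbol)
        hit = by_symbol.get(canonical)
        annotated.append({
            "Hugo_Symbol": symbol,
            "COSMIC_ROLE_IN_CANCER": hit["ROLE_IN_CANCER"] if hit else "",
            "COSMIC_TIER": hit["TIER"] if hit else None,
        })
    return annotated


# Toy census rows, not read from the real COSMIC file.
census = [
    {"GENE_SYMBOL": "ABL1", "SYNONYMS": "ABL,JTK7,c-ABL",
     "ROLE_IN_CANCER": "oncogene, fusion", "TIER": 1},
    {"GENE_SYMBOL": "ARID2", "SYNONYMS": "BAF200,KIAA1557",
     "ROLE_IN_CANCER": "TSG", "TIER": 1},
]
for row in annotate_symbols(["ABL1", "BAF200", "TUBB8"], census):
    print(row)
```

Because this is a left join, every input symbol yields exactly one output row, which is consistent with the row count staying at 2207 after annotation.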


## 4. Apply COSMIC Cancer Annotation to VCF Data

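Apply the `cosmic_cancer_annotate` method to the VCF data with the same arguments. This VCF subset contains only two unique genes on chr10, so few or no matches against the COSMIC table are expected.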


In [6]:
# Apply COSMIC annotation to VCF data
print("🔬 Applying COSMIC cancer annotation to VCF data...")

# Apply annotation (in_place=False to get returned DataFrame)
vcf_annotated = py_mut_vcf.cosmic_cancer_annotate(
    annotation_table=COSMIC_ANNOTATION,
    in_place=False
)

print(f"\n✅ VCF Annotation Complete!")
print(f"   Original shape: {py_mut_vcf.data.shape}")
print(f"   Annotated shape: {vcf_annotated.shape}")

# Show new annotation columns
original_cols = set(py_mut_vcf.data.columns)
new_cols = [col for col in vcf_annotated.columns if col not in original_cols]
print(f"\n🏷️  New annotation columns ({len(new_cols)}):")
for col in new_cols:
    print(f"   • {col}")

# Show annotation results for genes with annotations
annotated_genes = vcf_annotated[vcf_annotated['Is_Oncogene_any'] == True]
if len(annotated_genes) > 0:
    print(f"\n🎯 Genes with cancer annotations ({len(annotated_genes)} variants):")
    annotation_cols = ['Hugo_Symbol', 'CHROM', 'POS'] + [col for col in new_cols if col in vcf_annotated.columns]
    display(annotated_genes[annotation_cols].drop_duplicates('Hugo_Symbol').head(10))
else:
    print("\n⚠️  No genes found with cancer annotations in this dataset")


🔬 Applying COSMIC cancer annotation to VCF data...


INFO:pyMut.annotate.cosmic_cancer_annotate:DataFrame memory usage: 0.15 GB
INFO:pyMut.annotate.cosmic_cancer_annotate:Using pandas backend for annotation
INFO:pyMut.annotate.cosmic_cancer_annotate:Starting pandas annotation for DataFrame: 1000 rows, 2602 columns
INFO:pyMut.annotate.cosmic_cancer_annotate:Using MAF join column: Hugo_Symbol
INFO:pyMut.annotate.cosmic_cancer_annotate:Reading annotation table: ../../../src/pyMut/data/resources/COSMIC/Cosmic_CancerGeneCensus_Tsv_v102_GRCh38/Cosmic_CancerGeneCensus_v102_GRCh38.tsv.gz
INFO:pyMut.annotate.cosmic_cancer_annotate:Annotation table loaded: 758 rows, 21 columns
INFO:pyMut.annotate.cosmic_cancer_annotate:Creating synonyms dictionary from column 'SYNONYMS'...
INFO:pyMut.annotate.cosmic_cancer_annotate:Created synonyms dictionary with 4710 mappings
INFO:pyMut.annotate.cosmic_cancer_annotate:Applying synonyms mapping to MAF data...
INFO:pyMut.annotate.cosmic_cancer_annotate:Gene mapping results: 1000 direct matches, 0 synonym matches


✅ VCF Annotation Complete!
   Original shape: (1000, 2602)
   Annotated shape: (1000, 2605)

🏷️  New annotation columns (3):
   • COSMIC_ROLE_IN_CANCER
   • COSMIC_TIER
   • Is_Oncogene_any

⚠️  No genes found with cancer annotations in this dataset


## 5. Summary and Comparison

Compare how many unique genes in each dataset received a COSMIC annotation, and which COSMIC roles appear.


In [7]:
# Summary comparison
print("📊 COSMIC Cancer Annotation Summary")
print("=" * 50)

# MAF results
maf_oncogenes = maf_annotated[maf_annotated['Is_Oncogene_any'] == True]['Hugo_Symbol'].nunique()
maf_total_genes = maf_annotated['Hugo_Symbol'].nunique()
maf_cosmic_role = maf_annotated['COSMIC_ROLE_IN_CANCER'].value_counts().to_dict() if 'COSMIC_ROLE_IN_CANCER' in maf_annotated.columns else {}

print(f"\n🧬 MAF Data Results:")
print(f"   Total unique genes: {maf_total_genes}")
print(f"   Genes with cancer annotations: {maf_oncogenes}")
print(f"   Annotation rate: {maf_oncogenes/maf_total_genes*100:.1f}%")
if maf_cosmic_role:
    print(f"   COSMIC roles found: {list(maf_cosmic_role.keys())}")

# VCF results
vcf_oncogenes = vcf_annotated[vcf_annotated['Is_Oncogene_any'] == True]['Hugo_Symbol'].nunique()
vcf_total_genes = vcf_annotated['Hugo_Symbol'].nunique()
vcf_cosmic_role = vcf_annotated['COSMIC_ROLE_IN_CANCER'].value_counts().to_dict() if 'COSMIC_ROLE_IN_CANCER' in vcf_annotated.columns else {}

print(f"\n🧬 VCF Data Results:")
print(f"   Total unique genes: {vcf_total_genes}")
print(f"   Genes with cancer annotations: {vcf_oncogenes}")
print(f"   Annotation rate: {vcf_oncogenes/vcf_total_genes*100:.1f}%")
if vcf_cosmic_role:
    print(f"   COSMIC roles found: {list(vcf_cosmic_role.keys())}")

print(f"\n✅ Annotation process completed successfully for both datasets!")


📊 COSMIC Cancer Annotation Summary

🧬 MAF Data Results:
   Total unique genes: 1611
   Genes with cancer annotations: 136
   Annotation rate: 8.4%
   COSMIC roles found: ['', 'oncogene', 'TSG', 'oncogene, TSG, fusion', 'oncogene, fusion', 'fusion', 'oncogene, TSG', 'TSG, fusion']

🧬 VCF Data Results:
   Total unique genes: 2
   Genes with cancer annotations: 0
   Annotation rate: 0.0%
   COSMIC roles found: ['']

✅ Annotation process completed successfully for both datasets!
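
The rate printed above is the number of annotated unique genes divided by the total number of unique genes, so for the MAF data 136 / 1611 ≈ 8.4%. The list of roles also contains an empty string, which presumably comes from variants in genes without a COSMIC role. The sketch below guards the division against an empty dataset and drops the empty label. Both helper names are hypothetical, and the role counts in the example are made up.

```python
def annotation_rate(n_annotated, n_total):
    """Percentage of unique genes that are annotated; 0.0 for an empty dataset."""
    if n_total == 0:
        return 0.0
    return n_annotated / n_total * 100


def non_empty_roles(role_counts):
    """Drop the empty-string label from a {role: count} mapping."""
    return {role: count for role, count in role_counts.items() if role}


print(f"{annotation_rate(136, 1611):.1f}%")  # 8.4%
print(non_empty_roles({"": 10, "TSG": 3, "oncogene": 2}))  # illustrative counts
```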


## 6. Detailed Annotation Results

List the COSMIC role, tier, and `Is_Oncogene_any` flag for up to ten annotated genes in each dataset.


In [8]:
# Show detailed annotation results
print("🔍 Detailed Annotation Results")
print("=" * 40)

# Function to show annotation details
def show_annotation_details(data, dataset_name):
    print(f"\n📋 {dataset_name} - Genes with COSMIC annotations:")
    
    # Get genes with annotations
    annotated = data[data['Is_Oncogene_any'] == True]
    
    if len(annotated) == 0:
        print("   No genes with COSMIC annotations found.")
        return
    
    # Show annotation columns
    annotation_cols = [col for col in data.columns if col.startswith('COSMIC_') or col == 'Is_Oncogene_any']
    
    if annotation_cols:
        gene_annotations = annotated[['Hugo_Symbol'] + annotation_cols].drop_duplicates('Hugo_Symbol')
        
        print(f"   Found {len(gene_annotations)} unique genes with annotations:")
        
        # Show each annotated gene
        for _, row in gene_annotations.head(10).iterrows():
            gene = row['Hugo_Symbol']
            role = row.get('COSMIC_ROLE_IN_CANCER', 'N/A')
            tier = row.get('COSMIC_TIER', 'N/A')
            oncogene = row.get('Is_Oncogene_any', False)
            
            print(f"   • {gene}: Role={role}, Tier={tier}, Oncogene={oncogene}")
        
        if len(gene_annotations) > 10:
            print(f"   ... and {len(gene_annotations) - 10} more genes")

# Show details for both datasets
show_annotation_details(maf_annotated, "MAF Dataset")
show_annotation_details(vcf_annotated, "VCF Dataset")


🔍 Detailed Annotation Results

📋 MAF Dataset - Genes with COSMIC annotations:
   Found 136 unique genes with annotations:
   • ABL1: Role=oncogene, fusion, Tier=1.0, Oncogene=True
   • AFF4: Role=oncogene, fusion, Tier=1.0, Oncogene=True
   • ARHGAP5: Role=oncogene, Tier=2.0, Oncogene=True
   • ARHGEF10L: Role=TSG, Tier=2.0, Oncogene=True
   • ARID1A: Role=TSG, fusion, Tier=1.0, Oncogene=True
   • ARID2: Role=TSG, Tier=1.0, Oncogene=True
   • ASXL1: Role=TSG, Tier=1.0, Oncogene=True
   • ASXL2: Role=TSG, Tier=2.0, Oncogene=True
   • ATP2B3: Role=TSG, Tier=1.0, Oncogene=True
   • BACH1: Role=TSG, Tier=1.0, Oncogene=True
   ... and 126 more genes

📋 VCF Dataset - Genes with COSMIC annotations:
   No genes with COSMIC annotations found.
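
In the MAF output above, genes whose role is only `TSG` (for example ARID2) still have `Is_Oncogene_any=True`, so the flag appears to mark any annotated gene rather than oncogenes specifically. To tell the roles apart, the comma-separated `COSMIC_ROLE_IN_CANCER` string can be split into individual labels. A minimal sketch, with `split_roles` as a hypothetical helper name:

```python
def split_roles(role_string):
    """Split a COSMIC role string such as 'oncogene, TSG, fusion' into a set."""
    return {part.strip() for part in role_string.split(",") if part.strip()}


roles = split_roles("oncogene, fusion")
print("oncogene" in roles)  # True
print("TSG" in roles)       # False
```

Assuming the column holds plain strings, as the empty-string role in the summary suggests, something like `maf_annotated['COSMIC_ROLE_IN_CANCER'].map(split_roles)` would give one set of roles per variant.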
