# VCF Statistics Analysis - P2374372 Dataset

This notebook analyzes VCF statistics from the P2374372 project.
The analysis code has been refactored into modules for better organization and reusability.
This notebook is adapted for the new data structure with DT (DNA Tumor), DN (DNA Normal), and RT (RNA Tumor) naming conventions.

In [1]:
# Import VCF statistics modules
import sys
from pathlib import Path
import pandas as pd

# Add the vcf_stats directory to the path
vcf_stats_path = Path.cwd() / "vcf_stats"
if str(vcf_stats_path) not in sys.path:
    sys.path.insert(0, str(vcf_stats_path))

# Force complete module reload
for module_name in list(sys.modules.keys()):
    if module_name.startswith("vcf_stats"):
        del sys.modules[module_name]

# Now import all required modules (includes new constants)
from vcf_stats import (
    VCFFileDiscovery,
    VCFStatisticsExtractor,
    VCFVisualizer,
    BAMValidator,
    process_all_vcfs,
    analyze_rescue_vcf,
    export_rescue_analysis,
    StatisticsAggregator,
    TOOLS,
    MODALITIES,
    CATEGORY_ORDER,
    VCF_STAGE_ORDER,
    CATEGORY_COLORS,
)

print("✓ VCF statistics modules imported successfully (refactored version)")
print(f"  - Category order: {CATEGORY_ORDER}")
print(f"  - VCF processing stages: {VCF_STAGE_ORDER}")

✓ Variant classification functions defined
✓ VCF Statistics Extractor (Notebook Version) loaded successfully
✓ Enhanced Statistics Aggregator imported successfully
✓ VCF statistics core module initialized
✓ VCF statistics modules imported successfully (refactored version)
  - Category order: ['Somatic', 'Germline', 'Reference', 'Artifact', 'RNA_Edit', 'NoConsensus']
  - VCF processing stages: ['rescue', 'cosmic_gnomad', 'rna_editing', 'filtered_rescue']


## Setup and Configuration

Define paths and parameters for the analysis.
Sample naming convention: DT (DNA Tumor), DN (DNA Normal), RT (RNA Tumor)

In [2]:
# Configuration for P2374372 dataset
# Updated pipeline base directory
BASE_DIR = Path("/t9k/mnt/WorkSpace/data/ngs/xuzhenyu/work/aim_11/output/P2374372")
OUTPUT_DIR = Path("P2374372_statistics_output")

# Create output directory
OUTPUT_DIR.mkdir(exist_ok=True)

# Sample naming mapping (DT, DN, RT to modality names)
SAMPLE_SUFFIX_MAP = {"DT": "DNA_TUMOR", "DN": "DNA_NORMAL", "RT": "RNA_TUMOR"}

print(f"Base directory: {BASE_DIR}")
print(f"Output directory: {OUTPUT_DIR}")
print(f"Available tools: {TOOLS}")
print(f"Available modalities: {MODALITIES}")
print(f"\nSample naming convention:")
for suffix, modality in SAMPLE_SUFFIX_MAP.items():
    print(f"  {suffix} -> {modality}")

print(f"\n✓ VCF processing stages (in order):")
for i, stage in enumerate(VCF_STAGE_ORDER, 1):
    print(f"  {i}. {stage}")

Base directory: /t9k/mnt/WorkSpace/data/ngs/xuzhenyu/work/aim_11/output/P2374372
Output directory: P2374372_statistics_output
Available tools: ['strelka', 'deepsomatic', 'mutect2']
Available modalities: ['DNA_TUMOR_vs_DNA_NORMAL', 'RNA_TUMOR_vs_DNA_NORMAL']

Sample naming convention:
  DT -> DNA_TUMOR
  DN -> DNA_NORMAL
  RT -> RNA_TUMOR

✓ VCF processing stages (in order):
  1. rescue
  2. cosmic_gnomad
  3. rna_editing
  4. filtered_rescue


## Helper Functions for Sample Naming

Functions to map DT/DN/RT naming to standard modality names for compatibility with existing code.

**Note:** If you've updated the vcf_stats modules, run cell 2 again to reload all modules before proceeding with discovery.

In [3]:
def extract_sample_suffix(sample_name):
    """Extract the sample suffix (DT, DN, or RT) from a sample name."""
    for suffix in ["DT", "DN", "RT"]:
        if sample_name.endswith(suffix):
            return suffix
    return None


def map_suffix_to_modality(suffix):
    """Map sample suffix to modality name."""
    return SAMPLE_SUFFIX_MAP.get(suffix, suffix)


def normalize_bam_key(sample_name):
    """Normalize BAM file sample names to modality names."""
    suffix = extract_sample_suffix(sample_name)
    if suffix:
        return map_suffix_to_modality(suffix)
    return sample_name


def normalize_bam_files(bam_dict):
    """Normalize all BAM file keys to modality names."""
    normalized = {}
    for key, path in bam_dict.items():
        modality = normalize_bam_key(key)
        normalized[modality] = path
    return normalized


print("✓ Helper functions for sample naming defined")

✓ Helper functions for sample naming defined


## VCF File Discovery

Discover all VCF files in the pipeline output directory.

In [4]:
# Discover VCF files (refactored to skip variant_calling, focus on normalized and annotation stages)
print("\\n" + "=" * 80)
print("DISCOVERING VCF FILES (REFACTORED DISCOVERY)")
print("=" * 80)

discovery = VCFFileDiscovery(BASE_DIR)
vcf_files = discovery.discover_vcfs()
bam_files = discovery.discover_alignments()

# Normalize BAM file keys to modality names
bam_files = normalize_bam_files(bam_files)
print("\\n✓ BAM files normalized to modality names")

# Print discovery summary
discovery.print_summary()

print(f"\\n✓ Discovered VCF categories:")
for category, files in vcf_files.items():
    if files:
        print(f"  - {category}: {len(files)} file(s)")

print(f"\\n✓ Available annotation stages:")
annotation_stages = discovery.get_all_annotation_stages()
for stage, files in annotation_stages.items():
    print(f"  - {stage}: {len(files)} file(s)")

print(f"\\n✓ Discovered {len(bam_files)} BAM/CRAM files with normalized names")
print(f"  Available modalities: {list(bam_files.keys())}")

DISCOVERING VCF FILES (REFACTORED DISCOVERY)
\n✓ BAM files normalized to modality names
VCF FILE DISCOVERY SUMMARY

NORMALIZED VCFs (6 files):
  strelka_RNA_TUMOR_vs_DNA_NORMAL: 2374372RT_vs_2374372DN.strelka.variants.dec.norm.vcf.gz
  strelka_DNA_TUMOR_vs_DNA_NORMAL: 2374372DT_vs_2374372DN.strelka.variants.dec.norm.vcf.gz
  deepsomatic_RNA_TUMOR_vs_DNA_NORMAL: 2374372RT_vs_2374372DN.deepsomatic.variants.dec.norm.vcf.gz
  deepsomatic_DNA_TUMOR_vs_DNA_NORMAL: 2374372DT_vs_2374372DN.deepsomatic.variants.dec.norm.vcf.gz
  mutect2_RNA_TUMOR_vs_DNA_NORMAL: 2374372RT_vs_2374372DN.mutect2.variants.dec.norm.vcf.gz
  mutect2_DNA_TUMOR_vs_DNA_NORMAL: 2374372DT_vs_2374372DN.mutect2.variants.dec.norm.vcf.gz

CONSENSUS VCFs (2 files):
  RNA_TUMOR_vs_DNA_NORMAL: 2374372RT_vs_2374372DN.consensus.vcf.gz
  DNA_TUMOR_vs_DNA_NORMAL: 2374372DT_vs_2374372DN.consensus.vcf.gz

RESCUE VCFs (1 files):
  2374372DT_vs_2374372DN_rescued_2374372RT_vs_2374372DN: 2374372DT_vs_2374372DN_rescued_2374372RT_vs_2374372DN

## VCF Statistics Processing & Visualization

Extract comprehensive statistics from all VCF files and generate visualizations.

In [5]:
# Process all VCF files and extract statistics
print("\\n" + "=" * 80)
print("PROCESSING ALL VCF FILES")
print("=" * 80)

all_vcf_stats = process_all_vcfs(vcf_files)
print(f"\\n✓ Processed {len(all_vcf_stats)} categories")

# Create aggregator and visualizer
aggregator = StatisticsAggregator(all_vcf_stats)
visualizer = VCFVisualizer(all_vcf_stats)

print("✓ Statistics aggregator created")
print("✓ Visualizer created. Ready to generate plots.")

PROCESSING ALL VCF FILES

PROCESSING: NORMALIZED

Processing: 2374372RT_vs_2374372DN.strelka.variants.dec.norm.vcf.gz


  [DEBUG] Starting header parsing...
  [DEBUG] Found 25 INFO fields in header
  [DEBUG] Processed 10001 variants, calculating statistics...
  [DEBUG] Calculated statistics for 21 INFO fields
  ✓ Total variants: 78,048
  ✓ SNPs: 74,343
  ✓ INDELs: 3,705
  ✓ Classification: {'Reference': 68745, 'Artifact': 917, 'Germline': 1187, 'Somatic': 7199}
  ✓ Chromosomes: 24

Processing: 2374372DT_vs_2374372DN.strelka.variants.dec.norm.vcf.gz
  [DEBUG] Starting header parsing...
  [DEBUG] Found 25 INFO fields in header
  [DEBUG] Processed 8562 variants, calculating statistics...
  [DEBUG] Calculated statistics for 21 INFO fields
  ✓ Total variants: 8,562
  ✓ SNPs: 8,458
  ✓ INDELs: 104
  ✓ Classification: {'Somatic': 182, 'Reference': 7464, 'Germline': 861, 'Artifact': 55}
  ✓ Chromosomes: 24

Processing: 2374372RT_vs_2374372DN.deepsomatic.variants.dec.norm.vcf.gz
  [DEBUG] Starting header parsing...
  [DEBUG] Found 2 INFO fields in header
  [DEBUG] Processed 10001 variants, calculating statistics

In [6]:
# Display detailed breakdown of all discovered and processed VCF categories
print("\n" + "=" * 80)
print("DETAILED VCF CATEGORY BREAKDOWN")
print("=" * 80)

for category, files in all_vcf_stats.items():
    if files:
        print(f"\n{category.upper()} ({len(files)} files):")
        for name, data in files.items():
            total_vars = data["stats"]["basic"]["total_variants"]
            print(f"  {name}: {total_vars:,} variants")

print("\n" + "=" * 80)
print("This includes:")
print("  • Individual caller VCFs (Strelka, DeepSomatic, Mutect2) per modality")
print("  • DNA consensus VCF (combined DNA callers)")
print("  • RNA consensus VCF (combined RNA callers)")
print("  • Rescue VCFs (DNA + RNA combined)")
print("  • All normalized and annotated variants")
print("=" * 80)


DETAILED VCF CATEGORY BREAKDOWN

NORMALIZED (6 files):
  strelka_RNA_TUMOR_vs_DNA_NORMAL: 78,048 variants
  strelka_DNA_TUMOR_vs_DNA_NORMAL: 8,562 variants
  deepsomatic_RNA_TUMOR_vs_DNA_NORMAL: 213,444 variants
  deepsomatic_DNA_TUMOR_vs_DNA_NORMAL: 124,928 variants
  mutect2_RNA_TUMOR_vs_DNA_NORMAL: 23,595 variants
  mutect2_DNA_TUMOR_vs_DNA_NORMAL: 4,354 variants

CONSENSUS (2 files):
  RNA_TUMOR_vs_DNA_NORMAL: 288,691 variants
  DNA_TUMOR_vs_DNA_NORMAL: 129,289 variants

RESCUE (1 files):
  2374372DT_vs_2374372DN_rescued_2374372RT_vs_2374372DN: 402,511 variants

COSMIC_GNOMAD (1 files):
  2374372DT_vs_2374372DN_rescued_2374372RT_vs_2374372DN: 402,510 variants

RNA_EDITING (1 files):
  2374372DT_vs_2374372DN_rescued_2374372RT_vs_2374372DN: 402,510 variants

FILTERED_RESCUE (3 files):
  2374372RT_vs_2374372DN: 288,691 variants
  2374372DT_vs_2374372DN: 129,289 variants
  2374372DT_vs_2374372DN_rescued_2374372RT_vs_2374372DN: 402,510 variants

This includes:
  • Individual caller VCF

## Statistics Aggregation

Create summary tables and aggregated statistics from all VCF categories:
- Individual caller VCFs (Strelka, DeepSomatic, Mutect2) for each modality
- DNA consensus VCF (combined DNA callers)
- RNA consensus VCF (combined RNA callers)
- Rescue VCFs (DNA + RNA combined)

In [7]:
# Generate summary tables
try:
    variant_summary = aggregator.create_variant_count_summary()
    print("✓ Variant count summary created")
except Exception as e:
    print(f"✗ Error creating variant count summary: {e}")
    variant_summary = pd.DataFrame()

try:
    summary_report = aggregator.create_summary_report()
    print("✓ Summary report created")
except Exception as e:
    print(f"✗ Error creating summary report: {e}")
    summary_report = {}

print("✓ Statistics aggregation completed")

✓ Variant count summary created
✓ Summary report created
✓ Statistics aggregation completed


In [8]:
# Display variant count summary (refactored: shows count distribution, not pass/filtered rates)
if not variant_summary.empty:
    print("\\n" + "=" * 80)
    print("VARIANT COUNT SUMMARY WITH CATEGORY DISTRIBUTION")
    print("=" * 80)

    # Select key columns for display
    display_cols = ["Category", "Tool", "Modality", "Total_Variants", "SNPs", "Indels"]

    # Add category columns from CATEGORY_ORDER
    for cat in CATEGORY_ORDER:
        if cat in variant_summary.columns:
            display_cols.append(cat)

    # Add percentage columns for categories
    for cat in CATEGORY_ORDER:
        pct_col = f"{cat}_pct"
        if pct_col in variant_summary.columns:
            display_cols.append(pct_col)

    # Filter to available columns and show
    available_cols = [c for c in display_cols if c in variant_summary.columns]
    display_df = variant_summary[available_cols]
    print(display_df.to_string(index=False))

    print("\\n" + "=" * 80)
    print("Legend:")
    print("  Total_Variants: Total number of variants in VCF")
    print("  SNPs/Indels: Count of SNPs and insertions/deletions")
    print(f"  {', '.join(CATEGORY_ORDER)}: Variant count for each category")
    print("  *_pct: Percentage of variants in each category")
    print("=" * 80)
else:
    print("No variant count summary data available")

VARIANT COUNT SUMMARY WITH CATEGORY DISTRIBUTION
       Category        Tool                                    Modality  Total_Variants   SNPs  Indels  Somatic  Germline  Reference  Artifact  RNA_Edit  NoConsensus  Somatic_pct  Germline_pct  Reference_pct  Artifact_pct  RNA_Edit_pct  NoConsensus_pct
     normalized     strelka                     RNA_TUMOR_vs_DNA_NORMAL           78048  74343    3705   7199.0    1187.0    68745.0     917.0       NaN          NaN     9.223811      1.520859      88.080412      1.174918           NaN              NaN
     normalized     strelka                     DNA_TUMOR_vs_DNA_NORMAL            8562   8458     104    182.0     861.0     7464.0      55.0       NaN          NaN     2.125672     10.056062      87.175893      0.642373           NaN              NaN
     normalized deepsomatic                     RNA_TUMOR_vs_DNA_NORMAL          213444 127274   86170    610.0    6724.0   206110.0       NaN       NaN          NaN     0.285789      3.150241

In [9]:
# Display information from summary report
if summary_report:
    print("\n" + "=" * 80)
    print("VARIANT BIOLOGICAL CLASSIFICATION FROM SUMMARY REPORT")
    print("=" * 80)

    # Check what's available in the summary report
    for name, df in summary_report.items():
        print(f"\n{name}:")
        print(df.head(10))
else:
    print("No summary report data available")


VARIANT BIOLOGICAL CLASSIFICATION FROM SUMMARY REPORT

variant_count_summary:
        Category         Tool                                     Modality  \
0     normalized      strelka                      RNA_TUMOR_vs_DNA_NORMAL   
1     normalized      strelka                      DNA_TUMOR_vs_DNA_NORMAL   
2     normalized  deepsomatic                      RNA_TUMOR_vs_DNA_NORMAL   
3     normalized  deepsomatic                      DNA_TUMOR_vs_DNA_NORMAL   
4     normalized      mutect2                      RNA_TUMOR_vs_DNA_NORMAL   
5     normalized      mutect2                      DNA_TUMOR_vs_DNA_NORMAL   
6      consensus          RNA                          TUMOR_vs_DNA_NORMAL   
7      consensus          DNA                          TUMOR_vs_DNA_NORMAL   
8         rescue    2374372DT  vs_2374372DN_rescued_2374372RT_vs_2374372DN   
9  cosmic_gnomad    2374372DT  vs_2374372DN_rescued_2374372RT_vs_2374372DN   

                                                File  Total_Va

## Visualization

Create visualizations for the VCF statistics.
All plots include data from:
- Individual variant callers (Strelka, DeepSomatic, Mutect2)
- DNA and RNA consensus VCFs
- Rescue VCFs (if available)

### Plot 1: Variant Counts by Tool

In [10]:
visualizer.plot_variant_counts_by_tool()

No data available for plotting


### Plot 2: Variant Type Distribution

In [11]:
visualizer.plot_variant_type_distribution()

### Plot 3: Consensus vs Individual Tools

In [12]:
visualizer.plot_consensus_comparison()

### Plot 4: Filter Status

In [13]:
visualizer.plot_filter_status()

### Plot 5: Annotation Progression (NEW)

Shows variant count and category changes through annotation pipeline stages.

In [14]:
try:
    visualizer.plot_annotation_progression()
except Exception as e:
    print(f"Note: Annotation progression plot not available: {e}")

## Advanced Analysis - Rescue VCF Statistics

Analyze the rescue VCFs that combine DNA and RNA modality variants.

In [15]:
analyze_rescue_vcf(all_vcf_stats)


Category        DNA Consensus   RNA Consensus   Rescued        
------------------------------------------------------------
Somatic         0               0               0              
Germline        588             551             1,058          
Reference       5,234           6,955           12,115         
Artifact        2,261           13,352          15,453         
RNA_Edit        0               0               0              
NoConsensus     121,206         267,833         373,885        
PASS            0               0               0              
LowQual         0               0               0              
StrandBias      0               0               0              
Clustered       0               0               0              
Other           0               0               0              
------------------------------------------------------------
TOTAL           129,289         288,691         402,511        

Detailed breakdown:

Germline Category:
  DN

{'dna_consensus': {'total_variants': 129289,
  'snps': 125702,
  'indels': 3587,
  'mnps': 0,
  'complex': 0,
  'chromosomes': ['chr1',
   'chr10',
   'chr11',
   'chr12',
   'chr13',
   'chr14',
   'chr15',
   'chr16',
   'chr17',
   'chr18',
   'chr19',
   'chr2',
   'chr20',
   'chr21',
   'chr22',
   'chr3',
   'chr4',
   'chr5',
   'chr6',
   'chr7',
   'chr8',
   'chr9',
   'chrM',
   'chrX'],
  'qualities': [0.10000000149011612,
   37.20000076293945,
   99.0,
   49.900001525878906,
   50.70000076293945,
   68.30000305175781,
   46.20000076293945,
   1.399999976158142,
   63.79999923706055,
   40.79999923706055,
   48.900001525878906,
   50.79999923706055,
   65.30000305175781,
   47.599998474121094,
   55.599998474121094,
   44.0,
   6.300000190734863,
   37.099998474121094,
   49.79999923706055,
   99.0,
   52.5,
   46.400001525878906,
   76.69999694824219,
   77.0,
   39.20000076293945,
   88.9000015258789,
   47.599998474121094,
   47.5,
   32.900001525878906,
   61.299999237

## BAM Validation (Optional)

Validate variants using BAM/CRAM alignment files if available.

In [16]:
# Optional: BAM validation using final filtered rescue VCF
if bam_files:
    print("\\n" + "=" * 80)
    print("BAM VALIDATION USING FINAL FILTERED RESCUE VCF")
    print("=" * 80)

    validator = BAMValidator()

    # Use filtered_rescue VCF if available, otherwise fall back to rescue VCF
    sample_vcf = None
    vcf_source = "unknown"

    # First try to find filtered_rescue VCF (final pipeline output)
    if "filtered_rescue" in vcf_files and vcf_files["filtered_rescue"]:
        sample_vcf = next(iter(vcf_files["filtered_rescue"].values()))
        vcf_source = "filtered_rescue"
        print(f"\\n✓ Using final filtered rescue VCF for validation")
    # Fallback to rescue VCF
    elif "rescue" in vcf_files and vcf_files["rescue"]:
        sample_vcf = next(iter(vcf_files["rescue"].values()))
        vcf_source = "rescue"
        print(f"\\n⚠ Filtered VCF not found, using rescue VCF for validation")
    # Last resort: use any consensus VCF
    else:
        for category, files in vcf_files.items():
            if files:
                sample_vcf = next(iter(files.values()))
                vcf_source = category
                break

    if sample_vcf and any(bam_files.values()):
        print(f"Validating variants from: {sample_vcf.name}")
        print(f"Source VCF category: {vcf_source}")

        # BAM files are already normalized to modality names
        bam_paths = bam_files

        if bam_paths:
            print(f"Using {len(bam_paths)} BAM/CRAM file(s) for validation")
            validation_results = validator.validate_variants(
                sample_vcf, bam_paths, max_variants=50
            )
            validation_df = validator.summarize_validation(validation_results)

            if not validation_df.empty:
                print("\\nValidation Summary:")
                print(f"Total variants validated: {len(validation_df)}")

                support_counts = validation_df["support"].value_counts()
                print(f"\\nSupport distribution:")
                for support_type, count in support_counts.items():
                    pct = count / len(validation_df) * 100
                    print(f"  {support_type}: {count} ({pct:.1f}%)")
        else:
            print("No BAM files available for validation")
    else:
        print("No suitable VCF file found for validation")
else:
    print("No BAM files found for validation")

BAM VALIDATION USING FINAL FILTERED RESCUE VCF
\n✓ Using final filtered rescue VCF for validation
Validating variants from: 2374372RT_vs_2374372DN.filtered.vcf.gz
Source VCF category: filtered_rescue
Using 3 BAM/CRAM file(s) for validation
\nValidation Summary:
Total variants validated: 150
\nSupport distribution:
  error: 141 (94.0%)
  unsupported: 9 (6.0%)


## Export Results

Export all analysis results to files.

In [17]:
# Export all results
print("Exporting results...")

try:
    # Export aggregated statistics
    if hasattr(aggregator, "export_report"):
        aggregator.export_report(OUTPUT_DIR, format="both")
        print(f"✓ Aggregated statistics exported to {OUTPUT_DIR}")
except Exception as e:
    print(f"✗ Error exporting aggregated statistics: {e}")

try:
    # Export rescue analysis if it exists
    if "rescue_analysis" in locals() and rescue_analysis:
        rescue_dir = OUTPUT_DIR / "rescue_analysis"
        export_rescue_analysis(rescue_analysis, rescue_dir, format="both")
        print(f"✓ Rescue analysis exported to {rescue_dir}")
except Exception as e:
    print(f"✗ Error exporting rescue analysis: {e}")

# Create plots directory
plot_dir = OUTPUT_DIR / "plots"
plot_dir.mkdir(exist_ok=True)
print(f"✓ Created plots directory at {plot_dir}")

print(f"\n✓ Export operations completed")

# List exported files
try:
    print("\nExported files:")
    for file_path in OUTPUT_DIR.rglob("*"):
        if file_path.is_file():
            print(f"  - {file_path.relative_to(OUTPUT_DIR)}")
except Exception as e:
    print(f"✗ Error listing exported files: {e}")

Exporting results...
✓ Report exported to Excel: P2374372_statistics_output/vcf_statistics_report.xlsx
✓ Report exported to CSV files in: P2374372_statistics_output/csv_reports
✓ Aggregated statistics exported to P2374372_statistics_output
✓ Created plots directory at P2374372_statistics_output/plots

✓ Export operations completed

Exported files:
  - vcf_statistics_report.xlsx
  - csv_reports/variant_count_summary.csv
  - csv_reports/category_distribution.csv
  - csv_reports/stage_progression.csv
  - tier_visualizations/2374372DT_vs_2374372DN_rescued_2374372RT_vs_2374372DN/caller_support.html
  - tier_visualizations/2374372DT_vs_2374372DN_rescued_2374372RT_vs_2374372DN/tier_distribution.html
  - tier_visualizations/2374372DT_vs_2374372DN_rescued_2374372RT_vs_2374372DN/tier_composition.html
  - tier_visualizations/2374372DT_vs_2374372DN_rescued_2374372RT_vs_2374372DN/category_tier_heatmap.html
  - tier_visualizations/2374372DT_vs_2374372DN_rescued_2374372RT_vs_2374372DN/statistics_summ

## Rescue Tiering & IGV-Reports Visualization

Compute variant tiers for the final rescue VCF based on number of callers and modality support, sample representatives, and render IGV-like BAM views for manual review.

In [18]:
# Tiering and IGV-Reports Visualization for Rescue Variants
# Using correct command-line create_report API with category-tiered organization

from pathlib import Path
from vcf_stats import (
    tier_rescue_variants,
    check_igv_reports_available,
    get_alignment_index_path,
    organize_by_category_tier,
)
import subprocess

# Discover rescue VCFs
rescue_files = discovery.get_rescue_files() if "discovery" in locals() else {}
if not rescue_files:
    print("No rescue VCFs discovered. Skipping tiering and IGV-reports visualization.")
else:
    print(f"\n" + "=" * 80)
    print("RESCUE VARIANT TIERING & IGV-REPORTS VISUALIZATION")
    print("=" * 80)
    print(f"Found {len(rescue_files)} rescue VCF(s)\n")

    # Prepare BAM files - use normalized modality names
    chosen_bams = bam_files if "bam_files" in locals() else {}
    vis_bams = {}
    for key in ["DNA_TUMOR", "DNA_NORMAL", "RNA_TUMOR"]:
        if key in chosen_bams:
            vis_bams[key] = chosen_bams[key]

    if not set(["DNA_TUMOR", "DNA_NORMAL", "RNA_TUMOR"]).issubset(set(vis_bams.keys())):
        print(
            "⚠ Warning: Missing one or more modality BAMs (DNA_TUMOR, DNA_NORMAL, RNA_TUMOR)"
        )
        print(f"  Available: {list(vis_bams.keys())}\n")

    # Reference FASTA (required for IGV-reports)
    REF_FASTA = Path(
        "/t9k/mnt/joey/bio_db/references/Homo_sapiens/GATK/GRCh38/Sequence/WholeGenomeFasta/Homo_sapiens_assembly38.fasta"
    )

    if not REF_FASTA.exists():
        print(f"⚠ Reference FASTA not found at {REF_FASTA}")
        print("  IGV-reports visualization requires indexed reference FASTA.\n")

    # Output directory
    igvreports_dir = OUTPUT_DIR / "igv_reports"
    igvreports_dir.mkdir(exist_ok=True)

    K = 3  # representatives per tier (for sampling)

    for rescue_name, rescue_path in rescue_files.items():
        print(f"\nProcessing rescue: {rescue_name}")
        print("-" * 80)

        # Step 1: Compute fine-grained tiers based on per-modality caller counts
        print("\n1. Computing variant tiers...")
        tiered = tier_rescue_variants(rescue_path)
        if tiered.empty:
            print(f"   No variants parsed from {rescue_path}")
            continue

        # Show tier distribution PER CATEGORY
        print("\n   Tier distribution by category:")
        categories = sorted(tiered["filter_category"].unique())
        for cat in categories:
            cat_data = tiered[tiered["filter_category"] == cat]
            tier_counts = cat_data["tier"].value_counts().sort_index()
            pct_total = (len(cat_data) / len(tiered)) * 100
            print(
                f"\n     {cat:12s} ({len(cat_data):6d} variants, {pct_total:5.1f}% of total):"
            )
            for tier in [f"T{i}" for i in range(1, 9)]:
                if tier in tier_counts.index:
                    count = tier_counts[tier]
                    pct = (count / len(cat_data)) * 100
                    print(f"       {tier}: {count:6d} ({pct:5.1f}%)")

        # Step 2: Check IGV-reports availability
        print("\n2. Checking IGV-reports installation...")
        if not check_igv_reports_available():
            print("   ✗ igv-reports is not installed")
            print("   To enable IGV visualization, install: pip install igv-reports")
            print("   Skipping visualization generation.")
            continue
        else:
            print("   ✓ igv-reports is installed")

        # Step 3: Verify prerequisites
        if not REF_FASTA.exists():
            print(f"\n   ✗ Reference FASTA not found: {REF_FASTA}")
            print("   Skipping visualization (reference required)")
            continue

        # Check reference is indexed
        ref_index = Path(str(REF_FASTA) + ".fai")
        if not ref_index.exists():
            print(f"\n   ⚠ Reference FASTA index not found: {ref_index}")
            print("   Attempting to create index...")
            try:
                subprocess.run(
                    ["samtools", "faidx", str(REF_FASTA)],
                    capture_output=True,
                    check=True,
                )
                print("   ✓ Reference index created")
            except Exception as e:
                print(f"   ✗ Failed to index reference: {e}")
                continue

        # Check VCF is indexed
        vcf_index = Path(str(rescue_path) + ".tbi")
        if not vcf_index.exists():
            print(f"\n   ⚠ VCF index not found: {vcf_index}")
            print("   Attempting to create index...")
            try:
                subprocess.run(
                    ["tabix", str(rescue_path)],
                    capture_output=True,
                    check=True,
                )
                print("   ✓ VCF index created")
            except Exception as e:
                print(f"   ✗ Failed to index VCF: {e}")
                continue

        # Check BAM/CRAM files are indexed (using proper index extensions)
        for sample, bam_path in vis_bams.items():
            bam_index = get_alignment_index_path(bam_path)
            if not bam_index.exists():
                print(f"\n   ⚠ {sample} alignment index not found: {bam_index}")
                print(f"     Attempting to create index...")
                try:
                    subprocess.run(
                        ["samtools", "index", str(bam_path)],
                        capture_output=True,
                        check=True,
                    )
                    print(f"     ✓ Index created")
                except Exception as e:
                    print(f"     ✗ Failed: {e}")

        # Step 4: Generate IGV-reports visualization
        print("\n3. Generating category-tiered IGV-reports...")
        tier_output_dir = igvreports_dir / rescue_name

        try:
            tier_reports = organize_by_category_tier(
                tiered, rescue_path, vis_bams, REF_FASTA, tier_output_dir, k_per_tier=K
            )

            print(
                f"\n   ✓ Generated reports for {len(tier_reports)} category-tier combinations"
            )
            print(f"   📊 Reports directory: {tier_output_dir}")
            print(f"   🌐 Landing page: {tier_output_dir}/index.html")
            print(f"\n   Structure (Category/Tier):")
            for key, (count, path) in sorted(tier_reports.items()):
                print(f"     {key}: {count:6d} variants")

        except RuntimeError as e:
            print(f"\n   ✗ Prerequisites not met: {e}")
        except Exception as e:
            print(f"\n   ✗ Error generating IGV-reports: {e}")
            import traceback

            traceback.print_exc()

print("\n" + "=" * 80)
print("TIERING & VISUALIZATION COMPLETE")
print("=" * 80)


RESCUE VARIANT TIERING & IGV-REPORTS VISUALIZATION
Found 1 rescue VCF(s)


Processing rescue: 2374372DT_vs_2374372DN_rescued_2374372RT_vs_2374372DN
--------------------------------------------------------------------------------

1. Computing variant tiers...

   Tier distribution by category:

     Artifact     ( 11414 variants,   2.8% of total):
       T1:     70 (  0.6%)
       T2:    746 (  6.5%)
       T3:   1265 ( 11.1%)
       T4:    201 (  1.8%)
       T6:   9132 ( 80.0%)

     Germline     (  1058 variants,   0.3% of total):
       T1:     82 (  7.8%)
       T2:    226 ( 21.4%)
       T3:    281 ( 26.6%)
       T4:    468 ( 44.2%)
       T6:      1 (  0.1%)

     Other        (373885 variants,  92.9% of total):
       T4:  12731 (  3.4%)
       T5: 107454 ( 28.7%)
       T7: 253700 ( 67.9%)

     Reference    ( 12115 variants,   3.0% of total):
       T1:    111 (  0.9%)
       T2:    423 (  3.5%)
       T3:   4720 ( 39.0%)
       T4:    341 (  2.8%)
       T6:   6520 ( 53.8%

## Rescue VCF Tiering & Visualizations

Generate comprehensive visualizations for tiered rescue variants.

In [19]:
from vcf_stats import (
    TierVisualizer,
    create_tier_visualization_report,
    tier_rescue_variants,
)

print("\\n" + "=" * 80)
print("GENERATING TIER VISUALIZATION REPORTS")
print("=" * 80)

# Output directory for visualizations
tier_viz_dir = OUTPUT_DIR / "tier_visualizations"
tier_viz_dir.mkdir(exist_ok=True)

# Get rescue files
rescue_files = discovery.get_rescue_files() if "discovery" in locals() else {}

# Fallback discovery if needed
if not rescue_files:
    rescue_dir = Path(BASE_DIR) / "rescue"
    if rescue_dir.exists():
        manual_rescue_files = {}
        for subdir in rescue_dir.iterdir():
            if subdir.is_dir():
                vcf_files_found = list(subdir.glob("*.rescued.vcf.gz"))
                if vcf_files_found:
                    manual_rescue_files[subdir.name] = vcf_files_found[0]
        if manual_rescue_files:
            rescue_files = manual_rescue_files
            print(f"✓ Manually discovered {len(rescue_files)} rescue VCF(s)")

# Process rescue VCFs
if rescue_files:
    from IPython.display import display

    for rescue_name, rescue_path in rescue_files.items():
        print(f"\\nProcessing: {rescue_name}")
        tiered_df = tier_rescue_variants(rescue_path)
        if not tiered_df.empty:
            print(f"  ✓ Loaded {len(tiered_df):,} variants")
            rescue_viz_dir = tier_viz_dir / rescue_name
            rescue_viz_dir.mkdir(parents=True, exist_ok=True)

            report = create_tier_visualization_report(
                tiered_df, output_dir=rescue_viz_dir
            )
            print(f"  ✓ Generated visualizations")
            print(f"  📊 Saved to: {rescue_viz_dir}")
else:
    print("✗ No rescue VCFs available for tier visualization")

GENERATING TIER VISUALIZATION REPORTS
\nProcessing: 2374372DT_vs_2374372DN_rescued_2374372RT_vs_2374372DN
  ✓ Loaded 402,511 variants
✓ Saved: P2374372_statistics_output/tier_visualizations/2374372DT_vs_2374372DN_rescued_2374372RT_vs_2374372DN/tier_distribution.html
✓ Saved: P2374372_statistics_output/tier_visualizations/2374372DT_vs_2374372DN_rescued_2374372RT_vs_2374372DN/category_tier_heatmap.html
✓ Saved: P2374372_statistics_output/tier_visualizations/2374372DT_vs_2374372DN_rescued_2374372RT_vs_2374372DN/statistics_summary.html
✓ Saved: P2374372_statistics_output/tier_visualizations/2374372DT_vs_2374372DN_rescued_2374372RT_vs_2374372DN/caller_support.html
✓ Saved: P2374372_statistics_output/tier_visualizations/2374372DT_vs_2374372DN_rescued_2374372RT_vs_2374372DN/tier_composition.html
  ✓ Generated visualizations
  📊 Saved to: P2374372_statistics_output/tier_visualizations/2374372DT_vs_2374372DN_rescued_2374372RT_vs_2374372DN


## Analysis Summary

Summary of the VCF statistics analysis for P2374372 dataset.

In [20]:
print("\\n" + "=" * 80)
print("VCF STATISTICS ANALYSIS - P2374372 DATASET")
print("=" * 80)

print("\\n✓ Analysis Complete!")
print("\\n✓ Output Generated:")
print(f"  Base output directory: {OUTPUT_DIR}")
print(f"  Tier visualizations: {OUTPUT_DIR}/tier_visualizations/")
print(f"  IGV reports: {OUTPUT_DIR}/igv_reports/")

print("\\n✓ Sample Naming Conventions:")
print("  2374372DT (DNA Tumor) -> DNA_TUMOR")
print("  2374372DN (DNA Normal) -> DNA_NORMAL")
print("  2374372RT (RNA Tumor) -> RNA_TUMOR")

print("\\n✓ Implementation Status:")
print("  ✓ Notebook adapted for P2374372 dataset")
print("  ✓ Sample naming properly mapped")
print("  ✓ All modules working with new structure")

print("\\n" + "=" * 80)

VCF STATISTICS ANALYSIS - P2374372 DATASET
\n✓ Analysis Complete!
\n✓ Output Generated:
  Base output directory: P2374372_statistics_output
  Tier visualizations: P2374372_statistics_output/tier_visualizations/
  IGV reports: P2374372_statistics_output/igv_reports/
\n✓ Sample Naming Conventions:
  2374372DT (DNA Tumor) -> DNA_TUMOR
  2374372DN (DNA Normal) -> DNA_NORMAL
  2374372RT (RNA Tumor) -> RNA_TUMOR
\n✓ Implementation Status:
  ✓ Notebook adapted for P2374372 dataset
  ✓ Sample naming properly mapped
  ✓ All modules working with new structure
