# Deltect: Genomic Deletion Pathogenicity Classifier
This notebook demonstrates how to use Deltect, from fetching ClinVar deletions to training the pathogenicity classifier and querying it.

## 1. Setup and Installation
First, install the required dependencies.

In [21]:
# Install dependencies using uv
!uv pip install -r ../requirements.txt

# or use pip
# !pip install -r ../requirements.txt

Using Python 3.13.9 environment at: /home/brandon/Deltect/.venv
Audited 26 packages in 189ms


In [22]:
# verify that the installation worked:

import sys
print(f"python version: {sys.version}")

from pathlib import Path

parent_dir = Path.cwd().parent

# Add to Python path if not already there
if str(parent_dir) not in sys.path:
    sys.path.insert(0, str(parent_dir))

print(f"Added to path: {parent_dir}")

# check required packages
import pysam
import sklearn
import pandas as pd   
import numpy as np 

print(f"pysam: {pysam.__version__}")
print(f"scikit-learn: {sklearn.__version__}")
print(f"pandas: {pd.__version__}")
print(f"numpy: {np.__version__}")

print("All dependencies installed!")

python version: 3.10.12 (main, Nov  4 2025, 08:48:33) [GCC 11.4.0]
Added to path: /home/brandon/Deltect
pysam: 0.23.3
scikit-learn: 1.7.2
pandas: 2.3.3
numpy: 2.2.6
All dependencies installed!


## 2. Import modules

In [23]:
from data.api import fetch_clinvar_deletions_entrez
from data.data_processor import pass_through_variants
from data.preprocessing import summarize_variants
from extraction.deletion_extraction import DeletionExtractor
from training.model import DeletionPathogenicityPredictor

import json
import logging
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## Required Files

### 1. Reference Genome (Required)
- **File**: `hs37d5.fa` + `hs37d5.fa.fai`
- **Size**: ~3GB
- **Source**: 1000 Genomes Project
- **Purpose**: Reference genome for alignment and sequence extraction

### 2. Gene Annotation (Required)
- **File**: `gencode.v19.annotation.gtf`
- **Size**: ~1GB uncompressed
- **Source**: GENCODE
- **Purpose**: Gene boundaries and consequence prediction

### 3. GIAB Benchmark VCF (Required for validation)
- **File**: `HG002_GRCh37_1_22_v4.2.1_benchmark.vcf.gz` + `.tbi`
- **Size**: ~200MB
- **Source**: NIST GIAB
- **Purpose**: Validation truth set

Run the following command to download the genomic files used for training and prediction.
```bash
./download_references.sh
```
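Before moving on, it can help to confirm that the download produced every file listed above. The following is a minimal sketch, not part of Deltect: `find_missing_files` is a hypothetical helper, and the directory passed to it (`"."` here) is a placeholder for wherever `download_references.sh` actually writes its output.

```python
from pathlib import Path

# File names taken from the "Required Files" list above.
REQUIRED_FILES = [
    "hs37d5.fa",
    "hs37d5.fa.fai",
    "gencode.v19.annotation.gtf",
    "HG002_GRCh37_1_22_v4.2.1_benchmark.vcf.gz",
    "HG002_GRCh37_1_22_v4.2.1_benchmark.vcf.gz.tbi",
]


def find_missing_files(reference_dir, required=REQUIRED_FILES):
    """Return the required file names that are not present in reference_dir."""
    base = Path(reference_dir)
    return [name for name in required if not (base / name).is_file()]


# "." is a placeholder; point this at the directory the download script uses.
missing = find_missing_files(".")
if missing:
    print("Missing reference files:")
    for name in missing:
        print(f"  - {name}")
else:
    print("All reference files found.")
```

Checking for the `.fai` and `.tbi` index files alongside the data files catches the common case where a download finished but indexing did not.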

In [24]:
# Inspect the preset configuration values

import config  

print(f"Chromosomes: {config.CHROMOSOMES}")
print(f"Max Variants Extracted: {config.MAX_VARIANTS_PER_CHR}")
print(f"Test Size: {config.TEST_SIZE * 100}%")
print(f"CV Folds: {config.CV_FOLDS}")
print(f"Reference Fasta: {config.REFERENCE_FASTA}")

Chromosomes: ['17']
Max Variants Extracted: 20000
Test Size: 20.0%
CV Folds: 10
Reference Fasta: hs37d5.fa


## 3. Manual Pipeline Implementation (For Demo)

In [25]:
from data.api import ClinVarClient

client = ClinVarClient("../.env") # instantiate a clinvar client

variants = []

for chrom in config.CHROMOSOMES:
    # Use the loop variable instead of a hard-coded 17 so every configured chromosome is fetched
    variants.extend(client.fetch_deletion_variants(chrom, config.MAX_VARIANTS_PER_CHR))

print(f"There are {len(variants)} variants")


INFO:data.api:Initialized ClinVarClient with email: tangbrandonk@gmail.com
INFO:data.api:Found 9543 variant IDs, fetching in batches of 200...
INFO:data.api:Fetched 9543 pathogenic variants
INFO:data.api:Found 1659 variant IDs, fetching in batches of 200...
INFO:data.api:Fetched 1659 non-pathogenic variants
INFO:data.api:Total variants fetched: 11202 (pathogenic: 9543, non-pathogenic: 1659)


There are 11202 variants


In [26]:
import json

# What the data looks like:
print(json.dumps(variants[0], indent=2))

{
  "obj_type": "Deletion",
  "accession": "VCV004530604",
  "accession_version": "VCV004530604.1",
  "title": "NM_000342.4(SLC4A1):c.111del (p.His37fs)",
  "variation_set": [
    {
      "measure_id": "4642013",
      "variation_xrefs": [],
      "variation_name": "NM_000342.4(SLC4A1):c.111del (p.His37fs)",
      "cdna_change": "c.111del",
      "aliases": [],
      "variation_loc": [
        {
          "status": "current",
          "assembly_name": "GRCh38",
          "chr": "17",
          "band": "",
          "start": "44261632",
          "stop": "44261632",
          "inner_start": "",
          "inner_stop": "",
          "outer_start": "",
          "outer_stop": "",
          "display_start": "44261632",
          "display_stop": "44261632",
          "assembly_acc_ver": "GCF_000001405.38",
          "annotation_release": "",
          "alt": "",
          "ref": ""
        },
        {
          "status": "previous",
          "assembly_name": "GRCh37",
          "chr": "1

## 4. Filter the variants

In [27]:
processed_variants = pass_through_variants(variants)

print(json.dumps(processed_variants[0], indent=2))

{
  "uid": "VCV004530604",
  "gene": "SLC4A1",
  "title": "NM_000342.4(SLC4A1):c.111del (p.His37fs)",
  "chr": "17",
  "start": "42339000",
  "end": "42339000",
  "assembly": "GRCh37",
  "variant_type": "Deletion",
  "clinical_significance": "Pathogenic",
  "review_status": "criteria provided, single submitter",
  "condition": "Hereditary spherocytosis type 4",
  "consequence": "frameshift variant"
}
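The processed record carries GRCh37 coordinates, while the raw ClinVar record lists one location per assembly under `variation_set[].variation_loc[]`. The sketch below shows one way such a location could be selected. It is an illustration only and not the implementation of `pass_through_variants`; `pick_location` is a hypothetical helper, and the GRCh37 values in the example are taken from the processed record above because the raw output is cut off.

```python
def pick_location(record, assembly="GRCh37"):
    """Return (chr, start, stop) for the first location on `assembly`, or None."""
    for variation in record.get("variation_set", []):
        for loc in variation.get("variation_loc", []):
            # Skip entries on other assemblies or with empty coordinates.
            if (
                loc.get("assembly_name") == assembly
                and loc.get("start")
                and loc.get("stop")
            ):
                return loc["chr"], int(loc["start"]), int(loc["stop"])
    return None


# Trimmed-down record with the same nesting as the raw ClinVar output.
example = {
    "accession": "VCV004530604",
    "variation_set": [
        {
            "variation_loc": [
                {"status": "current", "assembly_name": "GRCh38",
                 "chr": "17", "start": "44261632", "stop": "44261632"},
                {"status": "previous", "assembly_name": "GRCh37",
                 "chr": "17", "start": "42339000", "stop": "42339000"},
            ]
        }
    ],
}

print(pick_location(example))            # GRCh37 location
print(pick_location(example, "GRCh38"))  # GRCh38 location
```

Selecting by `assembly_name` rather than by list position matters because the GRCh37 entry is not the first one in the record.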


### 4.1 Analyze the distribution of the data

In [28]:
summarize_variants(processed_variants)

  - Pathogenic: 7462
  - Likely pathogenic: 1617
  - Pathogenic/Likely pathogenic: 461
  - Pathogenic/Likely risk allele: 1
  - Likely pathogenic/Pathogenic, low penetrance: 1
  - Pathogenic; Affects: 1
  - Likely benign: 986
  - Benign: 604
  - Benign/Likely benign: 69


In [29]:
pathogenicity_predictor = DeletionPathogenicityPredictor(threshold=0.6)

try:
    path_results = pathogenicity_predictor.train(
        processed_variants,
        test_size=config.TEST_SIZE,
        cv_folds=config.CV_FOLDS,
    )
    
    print("\n" + "="*70)
    print("PATHOGENICITY PREDICTOR RESULTS")
    print("="*70)
    
    # Cross-validation results
    print("\nCross-Validation Performance (Mean ± Std):")
    print(f"  Precision:    {path_results.get('cv_precision', 0):.4f}")
    print(f"  Recall:       {path_results.get('cv_recall', 0):.4f}")
    print(f"  F1 Score:     {path_results.get('cv_f1', 0):.4f}")
    print(f"  Specificity:  {path_results.get('cv_specificity', 0):.4f}")
    print(f"  AUC-ROC:      {path_results.get('cv_auc', 0):.4f}")
    
    # Test set results
    print("\nTest Set Performance:")
    print(f"  Precision:    {path_results.get('test_precision', 0):.4f}")
    print(f"  Recall:       {path_results.get('test_recall', 0):.4f}")
    print(f"  F1 Score:     {path_results.get('test_f1', 0):.4f}")
    print(f"  AUC-ROC:      {path_results.get('test_auc', 0):.4f}")
    
    # Confusion matrix
    print("\nTest Set Confusion Matrix:")
    print(f"  True Positives:  {path_results.get('test_tp', 0)}")
    print(f"  True Negatives:  {path_results.get('test_tn', 0)}")
    print(f"  False Positives: {path_results.get('test_fp', 0)}")
    print(f"  False Negatives: {path_results.get('test_fn', 0)}")
    
    # Dataset statistics
    print("\nDataset Information:")
    print(f"  Total variants:     {len(processed_variants)}")
    print(f"  Training samples:   {path_results.get('n_train', 0)}")
    print(f"  Test samples:       {path_results.get('n_test', 0)}")
    print(f"  Number of features: {path_results.get('n_features', 0)}")
except ValueError as e:
    print("\n" + "="*70)
    print("ERROR: Model Training Failed")
    print("="*70)
    print(f"Error: {e}")
    # logger.exception records the message together with the full traceback
    logger.exception("Training error")
except Exception as e:
    print("\n" + "="*70)
    print("ERROR: Unexpected Training Error")
    print("="*70)
    print(f"Error: {e}")
    logger.exception("Unexpected training error")

INFO:training.model:=== Training Deletion Pathogenicity Predictor ===
INFO:training.model:Handling class imbalance with weighted loss (no resampling)
INFO:training.model:Using 30 features
INFO:training.model:Dataset: 9543 pathogenic, 1659 benign
INFO:training.model:Imbalance ratio: 5.75:1 (pathogenic:benign)
INFO:training.model:Computed class weights: benign=3.376, pathogenic=0.587
INFO:training.model:Train set: 7634 pathogenic, 1327 benign
INFO:training.model:Test set: 1909 pathogenic, 332 benign
INFO:training.model:Building weighted ensemble (RF + GB + XGB)
INFO:training.model:Fitting ensemble with sample weights...
INFO:training.model:Running 10-fold weighted cross-validation...
INFO:training.model:Evaluating on held-out test set...
INFO:training.model:Top 15 feature importances (from Random Forest):
INFO:training.model:  normalized_chr_position: 0.1985
INFO:training.model:  high_confidence_benign: 0.1368
INFO:training.model:  log_deletion_length: 0.1291
INFO:training.model:  deleti


PATHOGENICITY PREDICTOR RESULTS

Cross-Validation Performance (Mean ± Std):
  Precision:    0.9475
  Recall:       0.8753
  F1 Score:     0.9100
  Specificity:  0.7212
  AUC-ROC:      0.9025

Test Set Performance:
  Precision:    0.9514
  Recall:       0.9015
  F1 Score:     0.9258
  AUC-ROC:      0.9175

Test Set Confusion Matrix:
  True Positives:  1721
  True Negatives:  244
  False Positives: 88
  False Negatives: 188

Dataset Information:
  Total variants:     11202
  Training samples:   8961
  Test samples:       2241
  Number of features: 30
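The reported test-set metrics can be cross-checked against the confusion matrix. The sketch below recomputes them from the four counts using the standard definitions; it does not call into Deltect, and the variable names are illustrative.

```python
# Test-set confusion matrix from the output above (positive class = pathogenic).
tp, tn, fp, fn = 1721, 244, 88, 188

precision = tp / (tp + fp)    # share of pathogenic calls that are correct
recall = tp / (tp + fn)       # share of pathogenic variants that are found
specificity = tn / (tn + fp)  # share of benign variants that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Precision:   {precision:.4f}")
print(f"Recall:      {recall:.4f}")
print(f"F1 Score:    {f1:.4f}")
print(f"Specificity: {specificity:.4f}")
print(f"Accuracy:    {accuracy:.4f}")
```

Precision, recall and F1 come out at about 0.9514, 0.9015 and 0.9258, matching the reported test-set values. Specificity, which the cell does not print for the test set, is about 0.7349: roughly one in four benign deletions in the test set is called pathogenic.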


## Demonstration Using Synthesized Pathogenic and Benign Deletions

In [30]:


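# NOTE: BRCA1 lies near 41.2 Mb on chr17 in GRCh37 and near 43.1 Mb in GRCh38, so the
# start/end values in both records look like GRCh38 coordinates despite the "GRCh37" label.
pathogenic_variant = {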
  "uid": "CA10584575",
  "gene": "BRCA1",
  "title": "NM_007294.4(BRCA1):c.442-22_442-13del",
  "chr": "17",
  "start": "43099895",
  "end": "43099904",
  "assembly": "GRCh37",
  "variant_type": "Deletion",
  "clinical_significance": "Pathogenic",
  "review_status": "Expert panel",
  "condition": "BRCA1-related cancer predisposition",
  "consequence": "intronic variant, splice-altering"
}

benign_variant = {
  "uid": "CA003895",
  "gene": "BRCA1",
  "title": "NM_007294.4(BRCA1):c.81-11del",
  "chr": "17",
  "start": "43115790",
  "end": "43115790",
  "assembly": "GRCh37",
  "variant_type": "Deletion",
  "clinical_significance": "Benign",
  "review_status": "Expert panel",
  "condition": "BRCA1-related cancer predisposition",
  "consequence": "intronic variant"
}


pathogenicity_predictor.predict_proba([pathogenic_variant, benign_variant])

array([0.66458978, 0.61477141])
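To read these scores as calls, they have to be compared with the decision threshold of 0.6 passed to the predictor earlier. The sketch below does that by hand. It assumes the returned values are probabilities of the pathogenic class and that a score at or above the threshold counts as pathogenic; the predictor's own comparison may differ.

```python
THRESHOLD = 0.6  # same value passed to DeletionPathogenicityPredictor above

# Scores copied from the predict_proba output, in input order.
scores = [0.66458978, 0.61477141]
expected = ["Pathogenic", "Benign"]  # clinical_significance of the two inputs

# Assumed rule: score >= threshold means "Pathogenic".
calls = ["Pathogenic" if s >= THRESHOLD else "Benign" for s in scores]

for score, call, truth in zip(scores, calls, expected):
    status = "match" if call == truth else "MISMATCH"
    print(f"score={score:.4f}  call={call:<10}  expected={truth:<10}  {status}")
```

Under this rule both deletions are called pathogenic, so the benign example is misclassified. Its score sits only about 0.015 above the threshold.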