# Deltect: Genomic Deletion Pathogenicity Classifier
This notebook demonstrates how to use Deltect, from fetching ClinVar deletion variants to training the classifier and scoring individual deletions.

## 1. Setup and Installation
First, install the required dependencies.

In [1]:
# Install dependencies using uv
!uv pip install -r ../requirements.txt

# or use pip
# !pip install -r ../requirements.txt

Using Python 3.13.9 environment at: /home/tangb/Deltect/.venv
Audited 26 packages in 3ms


In [2]:
# verify that the installation worked:

import sys
print(f"python version: {sys.version}")

from pathlib import Path

parent_dir = Path.cwd().parent

# Add to Python path if not already there
if str(parent_dir) not in sys.path:
    sys.path.insert(0, str(parent_dir))

print(f"Added to path: {parent_dir}")

# check required packages
import pysam
import sklearn
import pandas as pd
import numpy as np

print(f"pysam: {pysam.__version__}")
print(f"scikit-learn: {sklearn.__version__}")
print(f"pandas: {pd.__version__}")
print(f"numpy: {np.__version__}")

print("All dependencies installed!")

python version: 3.13.9 (main, Nov 19 2025, 22:47:49) [Clang 21.1.4 ]
Added to path: /home/tangb/Deltect


pysam: 0.23.3
scikit-learn: 1.7.2
pandas: 2.3.3
numpy: 2.3.4
All dependencies installed!


## 2. Import modules

In [3]:
from data.api import fetch_clinvar_deletions_entrez
from data.data_processor import pass_through_variants
from data.preprocessing import summarize_variants
from extraction.deletion_extraction import DeletionExtractor
from training.model import DeletionPathogenicityPredictor

# import the config vars
import config  

import json
import logging
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

Run the following command to download the genomic files used for training and prediction.
```bash
./download_references.sh
```
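
Before running the later cells, it can help to confirm that the download produced the reference FASTA they expect. The sketch below is a minimal check, assuming the script leaves `hs37d5.fa` one directory above the notebook (the path the sampling cell uses) and that the index follows the usual samtools convention of appending `.fai`. The helper name `check_reference` is hypothetical and not part of Deltect.

```python
from pathlib import Path


def check_reference(fasta_path):
    """Report whether a FASTA file and its .fai index are present.

    Hypothetical helper for illustration; not part of the Deltect package.
    """
    fasta = Path(fasta_path)
    # samtools-style indexes sit next to the FASTA with ".fai" appended
    index = Path(str(fasta) + ".fai")
    return {"fasta": fasta.exists(), "index": index.exists()}


status = check_reference("../hs37d5.fa")
print(status)
```

If `fasta` is `False`, re-run the download script before continuing. A missing index is usually less serious, since tools such as pysam can typically build it on first use.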

## 3. Manual Pipeline Implementation (For Demo)

In [4]:
from data.api import ClinVarClient

client = ClinVarClient("../.env") # instantiate a clinvar client

variants = []

# Fetch deletions for each configured chromosome; `chrom` avoids shadowing the built-in `chr`
for chrom in config.CHROMOSOMES:
    variants.extend(client.fetch_deletion_variants(chrom, config.MAX_VARIANTS_PER_CHR))

print(f"There are {len(variants)} variants")


INFO:data.api:Initialized ClinVarClient with email: tangbrandonk@gmail.com
INFO:data.api:Found 9559 variant IDs, fetching in batches of 200...
INFO:data.api:Fetched 9559 pathogenic variants
INFO:data.api:Found 1659 variant IDs, fetching in batches of 200...
INFO:data.api:Fetched 1659 non-pathogenic variants
INFO:data.api:Total variants fetched: 11218 (pathogenic: 9559, non-pathogenic: 1659)


There are 11218 variants
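
The fetch log shows a strong class imbalance in the raw ClinVar data. The short calculation below uses only the counts printed in the log to quantify it and to work out how many extra benign examples would be needed to balance the classes. It is plain arithmetic and does not call any Deltect code.

```python
# Counts copied from the ClinVar fetch log above
n_pathogenic = 9559
n_non_pathogenic = 1659

total = n_pathogenic + n_non_pathogenic
ratio = n_pathogenic / n_non_pathogenic
# Extra benign examples needed for a 1:1 class balance
shortfall = n_pathogenic - n_non_pathogenic

print(f"total variants: {total}")
print(f"pathogenic:non-pathogenic = {ratio:.2f}:1")
print(f"benign examples needed to balance: {shortfall}")
```

The shortfall of 7900 is about 0.7 times the total variant count, which is one plausible reason to supplement the dataset with reference sequences at roughly that ratio.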


## 4. Filter the variants

In [5]:
processed_variants = pass_through_variants(variants)

In [6]:
from config import CHROMOSOMES
from data.ref_genome_data import ReferenceGenomeSampler


print("\n[4/5] Sampling normal reference sequences...")
print("Reference genome: ../hs37d5.fa")

all_training_variants = []

try:
    if not Path("../hs37d5.fa").exists():
        print(f"WARNING: Reference genome not found: ../hs37d5.fa")
        print("Skipping reference genome sampling.")
    else:
        # Initialize reference sampler
        sampler = ReferenceGenomeSampler(
            reference_fasta="../hs37d5.fa",
            chromosomes=CHROMOSOMES
        )
        
        # Match deletion distribution
        num_normal_samples = len(processed_variants)
        print(f"Sampling {num_normal_samples} normal sequences to match deletion distribution...")
        
        # ratio is the number of reference samples drawn per processed variant (0.7 here)
        normal_sequences = sampler.match_str_distribution(
            str_variants=processed_variants,
            ratio=0.7
        )
        
        print(f"Sampled {len(normal_sequences)} normal reference sequences")
        
        
        # Combine with processed variants
        all_training_variants = processed_variants + normal_sequences
        print(f"\nCombined dataset: {len(all_training_variants)} total samples")
        print(f"  ClinVar variants: {len(processed_variants)}")
        print(f"  Reference benign: {len(normal_sequences)}")
except Exception as e:
    print(f"ERROR sampling reference genome: {e}")
    print("Continuing without normal sequences...")
    all_training_variants = processed_variants

INFO:data.ref_genome_data:Loading reference genome: ../hs37d5.fa
INFO:data.ref_genome_data:Using FASTA index file
INFO:data.ref_genome_data:Loaded 1 chromosomes
INFO:data.ref_genome_data:  chr17: 81,195,210 bp
INFO:data.ref_genome_data:Matching 7852 reference samples to STR distribution
INFO:data.ref_genome_data:STR length range: 1-15541116 bp
INFO:data.ref_genome_data:Matched 500/7852 samples
INFO:data.ref_genome_data:Matched 1000/7852 samples
INFO:data.ref_genome_data:Matched 1500/7852 samples
INFO:data.ref_genome_data:Matched 2000/7852 samples



[4/5] Sampling normal reference sequences...
Reference genome: ../hs37d5.fa
Sampling 11218 normal sequences to match deletion distribution...


INFO:data.ref_genome_data:Matched 2500/7852 samples
INFO:data.ref_genome_data:Matched 3000/7852 samples
INFO:data.ref_genome_data:Matched 3500/7852 samples
INFO:data.ref_genome_data:Matched 4000/7852 samples
INFO:data.ref_genome_data:Matched 4500/7852 samples
INFO:data.ref_genome_data:Matched 5000/7852 samples
INFO:data.ref_genome_data:Matched 5500/7852 samples
INFO:data.ref_genome_data:Matched 6000/7852 samples
INFO:data.ref_genome_data:Matched 6500/7852 samples
INFO:data.ref_genome_data:Matched 7000/7852 samples
INFO:data.ref_genome_data:Matched 7500/7852 samples
INFO:data.ref_genome_data:Sampled 7852 regions in 8192 attempts


Sampled 7852 normal reference sequences

Combined dataset: 19070 total samples
  ClinVar variants: 11218
  Reference benign: 7852
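
The sample count in the log follows from the ratio: `int(11218 * 0.7)` is 7852. The sketch below illustrates one way length-matched sampling could work, assuming each reference region borrows its length from a randomly chosen deletion and is placed uniformly on the chromosome. The function `sample_matched_regions` is hypothetical; the actual logic of `match_str_distribution` is not shown in this notebook. The log reports 8192 attempts for 7852 regions, which suggests some candidates are rejected, and that step is omitted here because the criteria are unknown.

```python
import random


def sample_matched_regions(deletion_lengths, chrom_length, ratio, seed=0):
    """Draw random regions whose lengths follow the observed deletion lengths.

    Hypothetical illustration; not the Deltect implementation.
    """
    rng = random.Random(seed)
    n_samples = int(len(deletion_lengths) * ratio)
    regions = []
    for _ in range(n_samples):
        # Reuse an observed deletion length so the length distributions match
        length = rng.choice(deletion_lengths)
        # Pick a start so the region fits entirely on the chromosome
        start = rng.randrange(0, chrom_length - length + 1)
        regions.append((start, start + length))
    return regions


# Toy deletion lengths; chr17 length taken from the log above
lengths = [1, 10, 250, 4000]
regions = sample_matched_regions(lengths, chrom_length=81_195_210, ratio=0.7)
print(len(regions), regions)
```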


## 5. Training the Model

In [7]:
pathogenicity_predictor = DeletionPathogenicityPredictor(threshold=0.6)

try:
    path_results = pathogenicity_predictor.train(
        all_training_variants,
        test_size=config.TEST_SIZE,
        cv_folds=config.CV_FOLDS,
    )
    
    # Cross-validation results
    print("\nCross-Validation Performance (Mean + Std):")
    print(f"  Precision:    {path_results.get('cv_precision', 0):.4f}")
    print(f"  Recall:       {path_results.get('cv_recall', 0):.4f}")
    print(f"  F1 Score:     {path_results.get('cv_f1', 0):.4f}")
    print(f"  Specificity:  {path_results.get('cv_specificity', 0):.4f}")
    print(f"  AUC-ROC:      {path_results.get('cv_auc', 0):.4f}")
    
    # Test set results
    print("\nTest Set Performance:")
    print(f"  Precision:    {path_results.get('test_precision', 0):.4f}")
    print(f"  Recall:       {path_results.get('test_recall', 0):.4f}")
    print(f"  F1 Score:     {path_results.get('test_f1', 0):.4f}")
    print(f"  AUC-ROC:      {path_results.get('test_auc', 0):.4f}")
    
    # Confusion matrix
    print("\nTest Set Confusion Matrix:")
    print(f"  True Positives:  {path_results.get('test_tp', 0)}")
    print(f"  True Negatives:  {path_results.get('test_tn', 0)}")
    print(f"  False Positives: {path_results.get('test_fp', 0)}")
    print(f"  False Negatives: {path_results.get('test_fn', 0)}")
    
    # Dataset statistics
    print("\nDataset Information:")
    print(f"  Total variants:     {len(processed_variants)}")
    print(f"  Training samples:   {path_results.get('n_train', 0)}")
    print(f"  Test samples:       {path_results.get('n_test', 0)}")
    print(f"  Number of features: {path_results.get('n_features', 0)}")
except ValueError as e:
    print("\n" + "="*70)
    print("ERROR: Model Training Failed")
    print("="*70)
    print(f"Error: {e}")
    # exc_info=True logs the full traceback, so no separate traceback call is needed
    logger.error(f"Training error: {e}", exc_info=True)
except Exception as e:
    print("\n" + "="*70)
    print("ERROR: Unexpected Training Error")
    print("="*70)
    print(f"Error: {e}")
    logger.error(f"Unexpected training error: {e}", exc_info=True)

INFO:training.model:=== Training Deletion Pathogenicity Predictor ===
INFO:training.model:Dataset: 9559 pathogenic, 9511 benign
INFO:training.model:Imbalance ratio: 1.01:1 (pathogenic:benign)
INFO:training.model:Computed class weights: benign=1.003, pathogenic=0.997
INFO:training.model:Train set: 7648 pathogenic, 7608 benign
INFO:training.model:Test set: 1911 pathogenic, 1903 benign
INFO:training.model:Building weighted ensemble (RF + GB + XGB)
INFO:training.model:XGBoost scale_pos_weight: 0.99
INFO:training.model:XGBoost included in ensemble
INFO:training.model:Fitting ensemble with sample weights...
INFO:training.model:Running 10-fold weighted cross-validation...
INFO:training.model:Evaluating on held-out test set...
INFO:training.model:Top 15 feature importances (Random Forest):
INFO:training.model:  has_gene: 0.2729
INFO:training.model:  gene_encoded: 0.0970
INFO:training.model:  homopolymer_run: 0.0961
INFO:training.model:  gene_length: 0.0893
INFO:training.model:  complexity_scor
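
The class weights, imbalance ratio, and `scale_pos_weight` in the log are consistent with standard formulas applied to the logged class counts. The sketch below reproduces them under two assumptions: that the weights use the common "balanced" heuristic `n_samples / (n_classes * class_count)`, and that `scale_pos_weight` is the negative-to-positive count ratio. Whether `training.model` computes them exactly this way, or on the training split instead of the full dataset, is not shown in this notebook.

```python
def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Class counts copied from the training log above
counts = {"benign": 9511, "pathogenic": 9559}

weights = balanced_class_weights(counts)
imbalance = counts["pathogenic"] / counts["benign"]
# Assumed convention: negative count divided by positive count
scale_pos_weight = counts["benign"] / counts["pathogenic"]

print(f"benign={weights['benign']:.3f}, pathogenic={weights['pathogenic']:.3f}")
print(f"imbalance ratio: {imbalance:.2f}:1")
print(f"scale_pos_weight: {scale_pos_weight:.2f}")
```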


Cross-Validation Performance (Mean + Std):
  Precision:    0.8981
  Recall:       0.9688
  F1 Score:     0.9321
  Specificity:  0.8895
  AUC-ROC:      0.9742

Test Set Performance:
  Precision:    0.8949
  Recall:       0.9628
  F1 Score:     0.9277
  AUC-ROC:      0.9751

Test Set Confusion Matrix:
  True Positives:  1840
  True Negatives:  1687
  False Positives: 216
  False Negatives: 71

Dataset Information:
  Total variants:     11218
  Training samples:   15256
  Test samples:       3814
  Number of features: 18
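
As a sanity check, the test-set metrics can be recomputed from the confusion matrix printed above. The snippet below uses only those four counts and the standard definitions. It also derives test-set specificity, which the training cell reports only for cross-validation.

```python
# Confusion-matrix counts copied from the test-set output above
tp, tn, fp, fn = 1840, 1687, 216, 71

precision = tp / (tp + fp)      # share of predicted pathogenic that are truly pathogenic
recall = tp / (tp + fn)         # share of true pathogenic that were found
f1 = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)    # share of true benign correctly cleared

print(f"Precision:   {precision:.4f}")
print(f"Recall:      {recall:.4f}")
print(f"F1 Score:    {f1:.4f}")
print(f"Specificity: {specificity:.4f}")
```

Precision, recall, and F1 come out to 0.8949, 0.9628, and 0.9277, matching the reported values. The 216 false positives against 71 false negatives show that the model errs toward flagging benign deletions as pathogenic.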


## 6. Demonstration on a Known Pathogenic and a Known Benign Deletion (Validation)

In [8]:

# See: https://www.ncbi.nlm.nih.gov/clinvar/variation/246362/
pathogenic_variant = {
  "uid": "CA10584575",
  "gene": "BRCA1",
  "title": "NM_007294.4(BRCA1):c.442-22_442-13del",
  "chr": "17",
  "start": "43099895",
  "end": "43099904",
  "assembly": "GRCh38",  # these BRCA1 coordinates fall in the GRCh38 gene range, not GRCh37
  "variant_type": "Deletion",
  "consequence": "intronic variant, splice-altering"
}
# See: https://www.ncbi.nlm.nih.gov/clinvar/variation/1584690/
benign_variant = {
  "uid": "VCV001584690",
  "gene": "NXN",
  "title": "NM_022463.5(NXN):c.360+14del",
  "chr": "17",
  "start": "979305",
  "end": "979305",
  "assembly": "GRCh38",
  "variant_type": "Deletion",
  "consequence": "intron variant"
}
pathogenicity_predictor.predict_proba([pathogenic_variant, benign_variant])

array([0.92512812, 0.21463536])
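
To turn these probabilities into class labels, they can be compared against the predictor's threshold of 0.6 set at construction time. The sketch below assumes the returned values are probabilities of pathogenicity and that the cutoff is inclusive (`>=`). Neither detail is confirmed by the notebook, and the predictor may expose its own method for this.

```python
THRESHOLD = 0.6  # same value passed to DeletionPathogenicityPredictor above

# Probabilities copied from the predict_proba output above
probabilities = [0.92512812, 0.21463536]

# Assumed rule: label as pathogenic when the probability meets the threshold
labels = ["pathogenic" if p >= THRESHOLD else "benign" for p in probabilities]
print(labels)  # ['pathogenic', 'benign']
```

Both predictions agree with the ClinVar classifications linked in the cell above.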