# Deltect: Genomic Deletion Pathogenicity Classifier
This notebook demonstrates how to use Deltect, from installing its dependencies to fetching, filtering, and summarizing ClinVar deletion variants.

## 1. Setup and Installation
First, install the required dependencies.

In [62]:
# Install dependencies using uv
!uv pip install -r ../requirements.txt

# or use pip
# !pip install -r ../requirements.txt

Using Python 3.13.5 environment at: /home/brandon/Deltect/.venv
Audited 26 packages in 3ms


In [63]:
# verify that the installation worked:

import sys
print(f"python version: {sys.version}")

from pathlib import Path

parent_dir = Path.cwd().parent

# Add to Python path if not already there
if str(parent_dir) not in sys.path:
    sys.path.insert(0, str(parent_dir))

print(f"Added to path: {parent_dir}")

# check required packages
import pysam
import sklearn
import pandas as pd   
import numpy as np 

print(f"pysam: {pysam.__version__}")
print(f"scikit-learn: {sklearn.__version__}")
print(f"pandas: {pd.__version__}")
print(f"numpy: {np.__version__}")

print("All dependencies installed!")

python version: 3.13.5 | packaged by Anaconda, Inc. | (main, Jun 12 2025, 16:09:02) [GCC 11.2.0]
Added to path: /home/brandon/Deltect
pysam: 0.23.3
scikit-learn: 1.7.2
pandas: 2.3.3
numpy: 2.3.4
All dependencies installed!


## 2. Import modules

In [64]:
from data.api import fetch_clinvar_deletions_entrez
from data.data_processor import pass_through_variants
from data.preprocessing import summarize_variants
from extraction.deletion_extraction import DeletionExtractor
from training.model import DeletionPathogenicityPredictor

import json
import logging
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## Required Files

### 1. Reference Genome (Required)
- **File**: `hs37d5.fa` + `hs37d5.fa.fai`
- **Size**: ~3GB
- **Source**: 1000 Genomes Project
- **Purpose**: Reference genome for alignment and sequence extraction

### 2. Gene Annotation (Required)
- **File**: `gencode.v19.annotation.gtf`
- **Size**: ~1GB uncompressed
- **Source**: GENCODE
- **Purpose**: Gene boundaries and consequence prediction

### 3. GIAB Benchmark VCF (Required for validation)
- **File**: `HG002_GRCh37_1_22_v4.2.1_benchmark.vcf.gz` + `.tbi`
- **Size**: ~200MB
- **Source**: NIST GIAB
- **Purpose**: Validation truth set

Run the following command to download the genomic files used for training and prediction.
```bash
./download_references.sh
```
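
Before running the rest of the notebook, it can help to confirm that the downloads landed where expected. The following is a minimal sketch and not part of Deltect: the helper `missing_reference_files` and its `reference_dir` argument are hypothetical, and the file names are taken from the list above. It assumes the `.fai` and `.tbi` indexes sit next to their parent files with the suffix appended.

```python
from pathlib import Path

# File names from the "Required Files" list above. Where they are stored is
# an assumption, so the directory is supplied by the caller.
REQUIRED_FILES = [
    "hs37d5.fa",
    "hs37d5.fa.fai",
    "gencode.v19.annotation.gtf",
    "HG002_GRCh37_1_22_v4.2.1_benchmark.vcf.gz",
    "HG002_GRCh37_1_22_v4.2.1_benchmark.vcf.gz.tbi",
]

def missing_reference_files(reference_dir):
    """Return the required file names that are absent from reference_dir."""
    base = Path(reference_dir)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]

# Check the current directory; adjust the path to wherever the script saves files.
missing = missing_reference_files(".")
if missing:
    print("Missing reference files:", ", ".join(missing))
else:
    print("All reference files found.")
```

An empty list means every file in the list is present. The check only tests for existence and does not validate file contents or sizes.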

In [65]:
# Inspect the preset configuration values

import config  

print(f"Chromosomes: {config.CHROMOSOMES}")
print(f"Max Variants Extracted: {config.MAX_VARIANTS_PER_CHR}")
print(f"Test Size: {config.TEST_SIZE * 100}%")
print(f"CV Folds: {config.CV_FOLDS}")
print(f"Reference Fasta: {config.REFERENCE_FASTA}")

Chromosomes: ['3']
Max Variants Extracted: 10000
Test Size: 20.0%
CV Folds: 5
Reference Fasta: hs37d5.fa


## 3. Manual Pipeline Implementation (For Demo)

In [66]:
from data.api import ClinVarClient

client = ClinVarClient("../.env") # instantiate a clinvar client

variants = []

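# NOTE: chromosome 17 is hardcoded here, and the output below reflects it.
# Pass `chrom` instead of 17 to honor config.CHROMOSOMES.
for chrom in config.CHROMOSOMES:
    variants.extend(client.fetch_deletion_variants(17, config.MAX_VARIANTS_PER_CHR))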

print(f"There are {len(variants)} variants")


2025-11-27 12:21:13,211 - data.api - INFO - Initialized ClinVarClient with email: tangbrandonk@gmail.com
2025-11-27 12:21:14,087 - data.api - INFO - Found 5000 variant IDs, fetching in batches of 200...
2025-11-27 12:21:32,518 - data.api - INFO - Fetched 5000 pathogenic variants
2025-11-27 12:21:32,856 - data.api - INFO - Found 1551 variant IDs, fetching in batches of 200...
2025-11-27 12:21:38,787 - data.api - INFO - Fetched 1551 non-pathogenic variants
2025-11-27 12:21:38,788 - data.api - INFO - Total variants fetched: 6551 (pathogenic: 5000, non-pathogenic: 1551)


There are 6551 variants
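
The log lines above report variant IDs being fetched "in batches of 200". A minimal sketch of that batching arithmetic, assuming simple fixed-size chunking; the helper `chunked` is hypothetical and is not the `ClinVarClient` implementation:

```python
def chunked(items, batch_size):
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Counts taken from the log output above; the IDs themselves are placeholders.
pathogenic_ids = list(range(5000))
non_pathogenic_ids = list(range(1551))

pathogenic_batches = list(chunked(pathogenic_ids, 200))
non_pathogenic_batches = list(chunked(non_pathogenic_ids, 200))

print(len(pathogenic_batches))          # 25 batches
print(len(non_pathogenic_batches))      # 8 batches
print(len(non_pathogenic_batches[-1]))  # the last batch holds 151 IDs
```

Under this assumption, the 6551 variants would take 33 requests in total, which is consistent with the roughly 25 seconds spanned by the timestamps in the log.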


In [67]:
import json

# What the data looks like:
print(json.dumps(variants[0], indent=2))

{
  "obj_type": "Deletion",
  "accession": "VCV004530604",
  "accession_version": "VCV004530604.1",
  "title": "NM_000342.4(SLC4A1):c.111del (p.His37fs)",
  "variation_set": [
    {
      "measure_id": "4642013",
      "variation_xrefs": [],
      "variation_name": "NM_000342.4(SLC4A1):c.111del (p.His37fs)",
      "cdna_change": "c.111del",
      "aliases": [],
      "variation_loc": [
        {
          "status": "current",
          "assembly_name": "GRCh38",
          "chr": "17",
          "band": "",
          "start": "44261632",
          "stop": "44261632",
          "inner_start": "",
          "inner_stop": "",
          "outer_start": "",
          "outer_stop": "",
          "display_start": "44261632",
          "display_stop": "44261632",
          "assembly_acc_ver": "GCF_000001405.38",
          "annotation_release": "",
          "alt": "",
          "ref": ""
        },
        {
          "status": "previous",
          "assembly_name": "GRCh37",
          "chr": "1

## 4. Filter the variants

In [68]:
processed_variants = pass_through_variants(variants)

print(json.dumps(processed_variants[0], indent=2))

{
  "uid": "VCV004530604",
  "gene": "SLC4A1",
  "title": "NM_000342.4(SLC4A1):c.111del (p.His37fs)",
  "chr": "17",
  "start": "42339000",
  "end": "42339000",
  "assembly": "GRCh37",
  "variant_type": "Deletion",
  "clinical_significance": "Pathogenic",
  "review_status": "criteria provided, single submitter",
  "condition": "Hereditary spherocytosis type 4",
  "consequence": "frameshift variant"
}
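
Comparing the raw record with the processed one suggests that processing keeps the GRCh37 coordinates and lifts the gene symbol out of the title. A rough sketch of that idea, assuming the raw `variation_loc` layout printed earlier; the helpers `pick_location` and `gene_from_title` are hypothetical and do not necessarily reflect how `pass_through_variants` works:

```python
import re

def pick_location(record, assembly="GRCh37"):
    """Return the first variation_loc entry for the requested assembly, or None."""
    for variation in record.get("variation_set", []):
        for loc in variation.get("variation_loc", []):
            if loc.get("assembly_name") == assembly:
                return loc
    return None

def gene_from_title(title):
    """Extract the gene symbol from a title shaped like 'NM_...(GENE):c....'."""
    match = re.search(r"\(([^()]+)\):", title)
    return match.group(1) if match else None

# Trimmed example with the same shape as the raw record printed earlier.
# The GRCh37 coordinates are taken from the processed output above.
raw = {
    "title": "NM_000342.4(SLC4A1):c.111del (p.His37fs)",
    "variation_set": [{
        "variation_loc": [
            {"assembly_name": "GRCh38", "chr": "17",
             "start": "44261632", "stop": "44261632"},
            {"assembly_name": "GRCh37", "chr": "17",
             "start": "42339000", "stop": "42339000"},
        ]
    }],
}

loc = pick_location(raw)
print(gene_from_title(raw["title"]), loc["chr"], loc["start"], loc["stop"])
# SLC4A1 17 42339000 42339000
```

Selecting by `assembly_name` rather than by list position matters here because the GRCh38 entry comes first in the raw record, while the reference files used in this notebook are GRCh37-based.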


### 4.1 Analyze the distribution of the data

In [69]:
summarize_variants(processed_variants)

  - Pathogenic: 3816
  - Likely pathogenic: 1008
  - Pathogenic/Likely pathogenic: 176
  - Likely benign: 927
  - Benign: 555
  - Benign/Likely benign: 69
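
The six categories above collapse into the two classes reported in the fetch log. The sketch below shows that arithmetic under an assumed binary mapping; the `PATHOGENIC` and `BENIGN` sets and the helper `to_binary_label` are illustrative and may not match how Deltect assigns training labels.

```python
from collections import Counter

# Counts copied from the summary above.
counts = {
    "Pathogenic": 3816,
    "Likely pathogenic": 1008,
    "Pathogenic/Likely pathogenic": 176,
    "Likely benign": 927,
    "Benign": 555,
    "Benign/Likely benign": 69,
}

# Assumed grouping of ClinVar significance strings into two classes.
PATHOGENIC = {"Pathogenic", "Likely pathogenic", "Pathogenic/Likely pathogenic"}
BENIGN = {"Benign", "Likely benign", "Benign/Likely benign"}

def to_binary_label(clinical_significance):
    """Map a significance string to 1 (pathogenic) or 0 (benign)."""
    if clinical_significance in PATHOGENIC:
        return 1
    if clinical_significance in BENIGN:
        return 0
    raise ValueError(f"Unrecognized significance: {clinical_significance!r}")

binary = Counter()
for label, n in counts.items():
    binary[to_binary_label(label)] += n

print(binary[1], binary[0])             # 5000 1551
print(round(binary[1] / binary[0], 2))  # 3.22
```

The totals match the 5000 pathogenic and 1551 non-pathogenic variants in the fetch log, which is roughly a 3.2:1 class imbalance. An imbalance of that size is usually worth accounting for when splitting and evaluating, for example with stratified splits.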
