# WGS QUALITY CONTROL & PREPARATION PIPELINE

### Overview
This notebook performs extraction, quality control, and formatting of UK Biobank Whole Genome Sequencing (WGS) data for specific genes of interest. It is optimized for the **DRAGEN 500k Release** (pVCF format).

### VEP is gone
VEP functionality has been removed from this notebook because it was brittle and heavy. For now, `.annotation` and `.setlist` files have to be built manually from the exported `_variants.tsv`, until someone writes a dedicated notebook for that step.

### Workflow Logic
1.  **Target Selection:** Loads pre-computed gene coordinates and identifies the specific VCF data blocks containing these genes.
2.  **Data Extraction:** Imports only the relevant VCF blocks and immediately filters to the gene interval to minimize memory usage.
3.  **Sample Quality Control:** Removes samples identified as "poor quality" or "related" from the upstream QC notebook (*01_QC_Samples*).
4.  **Variant Quality Control:**
    * Filters for `FT == PASS` (or missing, as per DRAGEN standards).
    * Removes variants with low call rates (<95%) or no alternate alleles.
5.  **Export:**
    * **Genotypes (.bgen):** Builds Genotype Probabilities (GP) as a one-hot encoding of the hard calls (GT), because the Phred-scaled likelihood fields (PL/LPL) are likely missing or empty, and exports to BGEN format for association testing (e.g., Regenie/SAIGE).
    * **Statistics (.tsv):** Exports a clean summary of passing variants (AF, AC, Call Rate) for record-keeping.

---

### Required Input Files
Before running, ensure the following files are present in the project. If you have just uploaded them, you may need to restart the Jupyter Lab VM to see them.

#### 1. Gene-VCF Overlaps File (`gene_vcf_overlaps.tsv`)
* **Description:** A mapping file that links Gene Symbols to their genomic coordinates and the specific UKB VCF block(s) that contain them.
* **Source:** Generated by mapping the GENCODE GTF against the UKB VCF block coordinates.
* **Columns:** `gene_name`, `chromosome`, `start`, `stop`, `overlapping_vcfs`.
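
To make the role of these columns concrete, here is a minimal pure-Python sketch of how one row turns into a gene interval and a set of block paths. It is an illustration only: the gene names, coordinates, and block file names in `EXAMPLE_TSV` are made up, and `parse_overlaps` is a hypothetical helper, not part of the notebook (which does the same work with Hail further down).

```python
import csv
import io

# Hypothetical two-row example; names and coordinates are illustrative only.
EXAMPLE_TSV = (
    "gene_name\tchromosome\tstart\tstop\toverlapping_vcfs\n"
    "GENE_A\tchr1\t1000\t2000\tblock_001.vcf.gz,block_002.vcf.gz\n"
    "GENE_B\tchr19\t5000\t6000\tblock_101.vcf.gz\n"
)


def parse_overlaps(tsv_text, genes):
    """Return (interval strings, relative VCF paths) for the requested genes."""
    intervals = []
    vcf_files = set()
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        if row["gene_name"] not in genes:
            continue
        # Interval in the "chr:start-stop" form
        intervals.append(f"{row['chromosome']}:{row['start']}-{row['stop']}")
        # A gene may span several blocks, listed comma-separated
        for name in row["overlapping_vcfs"].split(","):
            name = name.strip()
            if name:
                # Blocks live in a per-chromosome folder
                vcf_files.add(f"{row['chromosome']}/{name}")
    return intervals, sorted(vcf_files)


intervals, vcf_files = parse_overlaps(EXAMPLE_TSV, {"GENE_A"})
print(intervals)
print(vcf_files)
```

Collecting the paths in a set means a block shared by two neighbouring genes is only imported once.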

#### 2. Samples to Remove (`samples_to_remove.tsv`)
* **Description:** A list of Sample IDs (EIDs) to exclude from analysis.
* **Source:** Output of `Notebooks/WGS/01_QC_Samples.ipynb`.
* **Content:** Includes withdrawn consent (handled by UKB/DNAx automatically, but good to verify), high-missingness samples, sex discordance, and related individuals (pruned to maximal independent set).
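
The exclusion step itself is a simple anti-join on the sample ID. The sketch below shows the idea in plain Python, assuming a file with an `eid` header column and using made-up IDs; `load_exclusions` and `drop_excluded` are hypothetical helper names. IDs are kept as strings because the VCF sample column is a string.

```python
import csv
import io

# Hypothetical exclusion file with a single "eid" column; the IDs are made up.
EXAMPLE_REMOVE_TSV = "eid\n1000002\n1000005\n"


def load_exclusions(tsv_text):
    """Read the exclusion list into a set of string IDs."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["eid"] for row in reader}


def drop_excluded(sample_ids, excluded):
    """Keep only samples that are not in the exclusion set (an anti-join)."""
    return [s for s in sample_ids if s not in excluded]


samples = ["1000001", "1000002", "1000003", "1000005"]
kept = drop_excluded(samples, load_exclusions(EXAMPLE_REMOVE_TSV))
print(kept)
```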

---

#### Initialization 
##### Load packages

In [1]:
import dxpy
import pyspark

import hail as hl
from pathlib import Path
from datetime import datetime

In [2]:
# Constants
DATABASE = "matrix_tables"
REFERENCE_GENOME = "GRCh38"
PROJ_NAME = "GIPR"

Path("/tmp").resolve().mkdir(parents=True, exist_ok=True)

LOG_FILE = str(
    Path("../hail_logs", f"{PROJ_NAME}_{datetime.now().strftime('%H%M')}.log").resolve()
)

#### Hail and spark configuration

In [3]:
# Spark init
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)

# Create database in DNAX
spark.sql(f"CREATE DATABASE IF NOT EXISTS {DATABASE} LOCATION 'dnax://'")
mt_database = dxpy.find_one_data_object(name=DATABASE)["id"]

# Hail init
hl.init(sc=sc, log=LOG_FILE)
hl.default_reference(REFERENCE_GENOME)

pip-installed Hail requires additional configuration options in Spark referring
  to the path to the Hail Python module directory HAIL_DIR,
  e.g. /path/to/python/site-packages/hail:
    spark.jars=HAIL_DIR/backend/hail-all-spark.jar
    spark.driver.extraClassPath=HAIL_DIR/backend/hail-all-spark.jar
    spark.executor.extraClassPath=./hail-all-spark.jar
Running on Apache Spark version 3.5.2
SparkUI available at http://ip-10-60-122-101.eu-west-2.compute.internal:8081
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.132-678e1f52b999
LOGGING: writing to /opt/hail_logs/GIPR_1917.log
2025-12-19 19:18:23.509 Hail: INFO: Reading table to impute column types
2025-12-19 19:18:24.694 Hail: INFO: Finished type imputation
  Loading field 'gene_id' as type str (imputed)
  Loading field 'gene_name' as type str (imputed)
  Loading field 'chromosome' as type str (imputed)
  Loading field 'start' as type int32 (imputed)
  Loading field 'stop' as 

#### Variables

In [None]:
# RAP
VCF_VERSION = "v1"
FIELD_ID = (
    24311  # ML-corrected DRAGEN population level WGS variants, pVCF format 500k release
)
VCF_DIR = Path(
    "DRAGEN WGS/ML-corrected DRAGEN population level WGS variants, pVCF format 500k release"
)
SAMPLES_REMOVE_PATH = "/mnt/project/TASR/Phenotypes/QC/samples_to_remove.tsv"


# Paths
BULK_DIR = Path("/mnt/project/Bulk")

# Genes
GENES = ["TAS1R2", "GIPR"]

# Downsample for testing
DOWNSAMPLE_FRACTION: float | None = None  # Set to 0.01 for 1% sample, 0.1 for 10%, etc.

### Load

#### Gene intervals and blocks
We use the pre-computed lookup table rather than building the block list here.

In [None]:
# --- GENE INTERVALS & BLOCKS ---
# Load the pre-computed overlap file
OVERLAP_TSV = "/mnt/project/WGS_Javier/WGS_QC/gene_vcf_overlaps.tsv"  # Update path for your project

overlaps_ht = hl.import_table(f"file://{OVERLAP_TSV}", impute=True)

# 1. Filter for the genes of interest
overlaps_ht = overlaps_ht.filter(hl.literal(GENES).contains(overlaps_ht.gene_name))

# 2. Extract Gene Intervals
# We use hl.parse_locus_interval to handle "chr:start-stop" string format automatically
intervals_data = overlaps_ht.select("chromosome", "start", "stop").collect()

gene_intervals = [
    hl.parse_locus_interval(
        f"{row.chromosome}:{row.start}-{row.stop}", reference_genome=REFERENCE_GENOME
    )
    for row in intervals_data
]

print(f"Loaded coordinates for {len(gene_intervals)} genes: {GENES}")

# 3. Construct VCF Paths
# We iterate through the rows to get the correct chromosome folder for each file.
# Path format: {BULK_DIR}/{VCF_DIR}/{chromosome}/{filename}
vcf_data = overlaps_ht.select("chromosome", "overlapping_vcfs").collect()
vcf_paths = set()

# Ensure VCF_DIR is a Path object (if defined as string in constants)
VCF_DIR_PATH = BULK_DIR / Path(VCF_DIR)

for row in vcf_data:
    if not row.overlapping_vcfs:
        continue

    # Split "file1.vcf,file2.vcf" -> ["file1.vcf", "file2.vcf"]
    files = row.overlapping_vcfs.split(",")

    for f in files:
        f = f.strip()
        if not f:
            continue

        # Use the 'chromosome' column from the TSV (e.g., "chr19") for the folder
        full_path = f"file://{VCF_DIR_PATH}/{row.chromosome}/{f}"
        vcf_paths.add(full_path)

vcf_files = list(vcf_paths)
print(f"Identified {len(vcf_files)} VCF files containing the genes.")
print("Sample path:", vcf_files[0] if vcf_files else "None")

Loaded coordinates for 2 genes: ['TAS1R2', 'GIPR']
Identified 4 VCF files containing the genes.
Sample path: file:///mnt/project/Bulk/DRAGEN WGS/ML-corrected DRAGEN population level WGS variants, pVCF format 500k release/chr1/ukb24311_c1_b941_v1.vcf.gz


In [None]:
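# Load the sample exclusion list, keyed by EID so it can be anti-joined against mt.s
local_path = SAMPLES_REMOVE_PATH.removeprefix("file://")
if SAMPLES_REMOVE_PATH.startswith("dnax://"):
    # dnax:// URIs are passed through as-is; a file:// prefix would make them unresolvable
    samples_to_remove = hl.import_table(SAMPLES_REMOVE_PATH, key="eid")
elif Path(local_path).exists():
    samples_to_remove = hl.import_table(f"file://{local_path}", key="eid")
else:
    samples_to_remove = None
    print(
        f"WARNING: QC file not found at {SAMPLES_REMOVE_PATH}. Skipping sample removal."
    )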

In [None]:
# --- IMPORT VCFS ---
mt = hl.import_vcf(
    vcf_files,
    drop_samples=False,
    reference_genome=REFERENCE_GENOME,
    array_elements_required=True,
    force_bgz=True,
)

# Drop excluded samples before the checkpoint so they are never written to disk
if samples_to_remove is not None:
    mt = mt.anti_join_cols(samples_to_remove)


# Optional: subsample columns for quick test runs (leave as None for a full run)
if DOWNSAMPLE_FRACTION is not None:
    print(f"⚠️  DOWNSAMPLING to {DOWNSAMPLE_FRACTION*100}% of samples for testing")
    mt = mt.sample_cols(p=DOWNSAMPLE_FRACTION, seed=42)

# Filter MT to exact gene boundaries (removes flanking regions in the 20kb block)
mt = hl.filter_intervals(mt, gene_intervals)

# 1. Filter Rows (Variant Level)
# Keep variants with an empty FILTER set (Hail imports a site-level PASS as an empty set)
mt = mt.filter_rows(hl.len(mt.filters) == 0)

# 2. Filter Entries (Genotype Level)
# ML-corrected data is cleaner, but we still keep only genotypes with FT == PASS or missing
# TODO: Find some actual documentation here

mt = mt.filter_entries(hl.is_missing(mt.FT) | (mt.FT == "PASS"))


mt.describe()

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        AC: array<int32>, 
        AN: int32, 
        NS: int32, 
        NS_GT: int32, 
        NS_NOGT: int32, 
        NS_NODATA: int32, 
        IC: float64, 
        HWE: array<float64>, 
        ExcHet: array<float64>, 
        HWE_CHISQ: float64, 
        AF: array<float64>
    }
----------------------------------------
Entry fields:
    'GT': call
    'GQ': int32
    'LAD': array<int32>
    'FT': str
    'LPL': array<int32>
    'LAA': array<str>
    'LAF': array<float64>
    'QL': float64
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------


### Checkpoint MT
Checkpoint after coarse filtering

In [9]:
# First checkpoint

stage = "FIRST"
checkpoint_file = f"/tmp/{PROJ_NAME}.{stage}.cp.mt"

mt = mt.checkpoint(checkpoint_file, overwrite=True)
# mt = hl.read_matrix_table(checkpoint_file)

mt.describe()
print(f"Pre-QC count: {mt.count_rows()} variants, {mt.count_cols()} samples")

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        AC: array<int32>, 
        AN: int32, 
        NS: int32, 
        NS_GT: int32, 
        NS_NOGT: int32, 
        NS_NODATA: int32, 
        IC: float64, 
        HWE: array<float64>, 
        ExcHet: array<float64>, 
        HWE_CHISQ: float64, 
        AF: array<float64>
    }
----------------------------------------
Entry fields:
    'GT': call
    'GQ': int32
    'LAD': array<int32>
    'FT': str
    'LPL': array<int32>
    'LAA': array<str>
    'LAF': array<float64>
    'QL': float64
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------
Pre-QC count: 15667 variants, 490541 samples


#### Quality control filtering
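Samples flagged in 01_QC_Samples were already removed at import. Here we additionally drop samples with a low call rate.

Entries are filtered on FT. It is unclear from the documentation whether a passing genotype carries `PASS` or an empty value, so both are accepted.

Many of the entry fields are not populated (GQ and LPL are empty, for example), so check the data before adding filters on them.

GT is, as usual, the field we want.

variant_qc is run last to save a little compute.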
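
The order of these filters matters, and it can be illustrated without Hail. The following is a minimal pure-Python sketch, not the notebook's implementation: the toy matrix, the `LowGQ` FT value, the helper names, and the 0.5 threshold are all assumptions (the threshold is chosen so the toy data shows both outcomes; the notebook uses 0.95). Masked genotypes count against the call rate here, which is how Hail's QC call rates are understood to treat filtered entries.

```python
def passes_ft(ft):
    """A genotype passes if FT is missing or equal to "PASS"."""
    return ft is None or ft == "PASS"


def call_rate(calls):
    """Fraction of calls that are neither missing nor masked."""
    if not calls:
        return 0.0
    return sum(c is not None for c in calls) / len(calls)


def run_qc(matrix, threshold):
    """matrix[v][s] = (alt allele count or None, FT or None).

    Returns (kept sample indices, kept variant indices).
    """
    # Entry filter: mask genotypes that fail FT
    masked = [[gt if passes_ft(ft) else None for gt, ft in row] for row in matrix]

    # Sample filter: per-sample call rate across all variants
    n_samples = len(masked[0])
    kept_samples = [
        s for s in range(n_samples)
        if call_rate([row[s] for row in masked]) > threshold
    ]

    # Variant filter, computed last and only over the samples that survived
    kept_variants = []
    for v, row in enumerate(masked):
        calls = [row[s] for s in kept_samples]
        n_non_ref = sum(1 for c in calls if c)  # calls with at least one alt allele
        if n_non_ref > 0 and call_rate(calls) > threshold:
            kept_variants.append(v)
    return kept_samples, kept_variants


toy = [
    [(0, "PASS"), (1, None), (None, None)],     # one het carrier
    [(0, "PASS"), (0, "PASS"), (None, None)],   # no alternate allele
    [(2, "PASS"), (1, "LowGQ"), (None, None)],  # second genotype fails FT
]
kept_samples, kept_variants = run_qc(toy, threshold=0.5)
print(kept_samples, kept_variants)
```

In the toy data the third sample is dropped for having no calls, the second variant for having no alternate allele, and the third variant because its call rate among the remaining samples is not above the threshold.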

In [None]:
# --- QC & FILTERING ---

# 3. Filter Samples (Columns)
# Keep samples with a call rate above 95%
mt = hl.sample_qc(mt)
mt = mt.filter_cols(mt.sample_qc.call_rate > 0.95)


# 4. Run Variant QC
mt = hl.variant_qc(mt)  # needs this for later too
mt = mt.filter_rows(
    (mt.variant_qc.n_non_ref > 0)  # Must have at least one Alt allele
    & (mt.variant_qc.call_rate > 0.95)  # High call rate
)

Running Sample QC...
Running Variant QC...


In [None]:
# Second checkpoint
# Hail still likes checkpointing
# TODO: consider re-instating but likely not needed here

# stage = "SECOND"
# checkpoint_file = f"/tmp/{PROJ_NAME}.{stage}.cp.mt"

# mt = mt.checkpoint(checkpoint_file, overwrite=True)
# print(f"Post-QC count: {mt.count_rows()} variants, {mt.count_cols()} samples")
# mt.describe()

Post-QC count: 13366 variants, 442704 samples


### Exports

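Again, Regenie reads genotype probabilities (GP) from the BGEN file but effectively wants hard calls (GT), so GP is written as a one-hot encoding of GT.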

Removed all VEP logic from here. 
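
As a plain-Python illustration of the two annotations the export relies on, the sketch below maps a hard call to a one-hot GP triple and builds a `contig:position:ref:alt` variant ID. The function names and the example coordinates are hypothetical, and the lookup only covers biallelic calls (0, 1, or 2 alternate alleles).

```python
# One-hot GP triples for hom-ref, het, and hom-alt hard calls
ONE_HOT_GP = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def hard_call_to_gp(n_alt):
    """Map an alternate-allele count (0, 1, 2) to a GP triple; missing stays missing."""
    if n_alt is None:
        return None
    return ONE_HOT_GP[n_alt]


def make_varid(contig, position, ref, alt):
    """Build a colon-delimited variant ID."""
    return ":".join([contig, str(position), ref, alt])


print(hard_call_to_gp(1))
print(make_varid("chr19", 12345, "A", "G"))  # made-up coordinates
```

Because every GP triple is 0 or 1, the dosage a downstream tool derives from it equals the hard call.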

In [36]:
# --- EXPORTS ---

# Define Paths
LOCAL_TMP = "/tmp"  # Local container storage
HDFS_TMP = "file:///tmp"  # Explicitly pointing to local FS for Hail
DX_OUTPUT_DIR = "/TASR/Genotypes"
BASE_NAME = PROJ_NAME

# 1. Prepare Annotations (GP & VarID)
# Fallback to Hard Calls (since we know PL/LPL are likely missing/empty)
GPs = hl.literal([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
mt = mt.annotate_entries(GP=GPs[mt.GT.n_alt_alleles()])

mt = mt.annotate_rows(
    varid=hl.delimit(
        [mt.locus.contig, hl.str(mt.locus.position), mt.alleles[0], mt.alleles[1]], ":"
    )
)

# 2. Export BGEN (To Local /tmp via "file://")
# "parallel=None" ensures we get a single .bgen file, not a directory of shards.
# We use "file://" prefix to force Hail to write to the local node's /tmp,
# making it instantly accessible for dx upload without hadoop fs -get.
print(f"Exporting BGEN to {LOCAL_TMP}/{BASE_NAME}.bgen ... (takes a little while)")

hl.export_bgen(
    mt=mt,
    output=f"{HDFS_TMP}/{BASE_NAME}",  # file:///tmp/GIPR
    gp=mt.GP,
    varid=mt.varid,
    rsid=mt.varid,
    parallel=None,
)

# 3. Export Variant Stats TSV
rows_ht = mt.rows()

export_ht = rows_ht.select(
    chromosome=rows_ht.locus.contig,
    position=rows_ht.locus.position,
    ref=rows_ht.alleles[0],
    alt=rows_ht.alleles[1],
    rsid=rows_ht.rsid,
    call_rate=rows_ht.variant_qc.call_rate,
    AC=rows_ht.variant_qc.AC[1],
    AF=rows_ht.variant_qc.AF[1],
    AN=rows_ht.variant_qc.AN,
    N_HOM=rows_ht.variant_qc.homozygote_count[1],
    N_HET=rows_ht.variant_qc.n_het,
    N_NC=rows_ht.variant_qc.n_not_called,
)

tsv_name = f"{BASE_NAME}_variants.tsv"
print(f"Exporting Stats to {LOCAL_TMP}/{tsv_name} ...")
export_ht.export(f"{HDFS_TMP}/{tsv_name}")

Exporting BGEN to /tmp/GIPR.bgen ... (takes a little while)
Exporting Stats to /tmp/GIPR_variants.tsv ...


In [None]:
# We now upload directly from /tmp where Hail wrote the files
!dx upload {LOCAL_TMP}/{BASE_NAME}.bgen {LOCAL_TMP}/{BASE_NAME}.sample {LOCAL_TMP}/{tsv_name} --path {DX_OUTPUT_DIR}