# SCRIPT TO OBTAIN VARIANTS OF INTEREST FROM WHOLE-GENOME SEQUENCING DATA

Before running, the following files must be present in the project folder:
- GENCODE GTF: Run Scripts/WGS/01_get_gencode_annotation.sh. Obtain from: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_46/gencode.v46.annotation.gtf.gz (Check for newer versions).

Once the download has completed, start a new Jupyter Notebook session so that the file becomes accessible, or unmount and re-mount the project (see https://community.ukbiobank.ac.uk/hc/en-gb/community/posts/16019592366365-It-seems-that-the-recently-dx-uploaded-files-does-not-show-up-on-mnt-project-until-I-re-start-the-whole-Jupyter-Lab-VM).


- PVCF BLOCKS: Run Notebooks/WGS/DragenBlockProcessing.ipynb. Obtain from: https://biobank.ndph.ox.ac.uk/ukb/ukb/auxdata/dragen_pvcf_coordinates.zip
The raw file needs parsing; an already parsed copy is available at https://github.com/HauserGroup/gogoGPCR2/tree/main/data/misc.
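
Before starting, it can help to confirm that both inputs are visible under the mounted project. The following is a minimal sketch: `missing_inputs` is a hypothetical helper (not part of the project code), and the two paths are the ones this notebook reads later on.

```python
from pathlib import Path

# Paths read later in this notebook; adjust if your project layout differs.
REQUIRED_INPUTS = [
    "/mnt/project/WGS_Lucia/WGS_QC/gencode.v46.annotation.gtf",
    "/mnt/project/WGS_Lucia/WGS_QC/dragen_pvcf_blocks.tsv",
]


def missing_inputs(paths):
    """Return the subset of `paths` that do not exist as regular files."""
    return [p for p in paths if not Path(p).is_file()]


if __name__ == "__main__":
    missing = missing_inputs(REQUIRED_INPUTS)
    if missing:
        print("Missing inputs (re-mount the project or restart the notebook):")
        for p in missing:
            print(" -", p)
    else:
        print("All inputs found.")
```

If a file is reported missing right after an upload, the re-mount step described above is the first thing to try.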

This code needs to be run on a Spark cluster. The run logged below used Apache Spark version 3.2.3 with Hail 0.2.116.

The outputs of this code are OPRM1_missense_variants.bed, OPRM1_missense_variants.bim, OPRM1_missense_variants.fam and OPRM1_missense_variants.annotations, which contain the missense variants of the OPRM1 gene.

#### Initialization 
##### Load packages


Import to current directory:
- src/project_permed

In [None]:
import dxpy
import pyspark

import hail as hl
from pathlib import Path
from datetime import datetime

from matrixtables import smart_split_multi_mt

In [5]:
# Constants
DATABASE = "matrix_tables"
REFERENCE_GENOME = "GRCh38"
PROJ_NAME = "OPRM1"

Path("/tmp").resolve().mkdir(parents=True, exist_ok=True)

LOG_FILE = str(
    Path("../hail_logs", f"{PROJ_NAME}_{datetime.now().strftime('%H%M')}.log").resolve()
)

#### Hail and Spark configuration

In [6]:
# Spark init
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)

# Create database in DNAX
spark.sql(f"CREATE DATABASE IF NOT EXISTS {DATABASE} LOCATION 'dnax://'")
mt_database = dxpy.find_one_data_object(name=DATABASE, classname="database")["id"]

# Hail init
hl.init(sc=sc, default_reference=REFERENCE_GENOME, log=LOG_FILE)

pip-installed Hail requires additional configuration options in Spark referring
  to the path to the Hail Python module directory HAIL_DIR,
  e.g. /path/to/python/site-packages/hail:
    spark.jars=HAIL_DIR/backend/hail-all-spark.jar
    spark.driver.extraClassPath=HAIL_DIR/backend/hail-all-spark.jar
    spark.executor.extraClassPath=./hail-all-spark.jar
Running on Apache Spark version 3.2.3
SparkUI available at http://ip-10-60-154-109.eu-west-2.compute.internal:8081
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.116-cd64e0876c94
LOGGING: writing to /opt/hail_logs/OPRM1_1224.log


#### Variables

In [None]:
# RAP
VCF_VERSION = "v1"
FIELD_ID = 24310  # DRAGEN population level WGS variants, pVCF format 500k release

# Paths
BULK_DIR = Path("/mnt/project/Bulk")

# Genes
GENES = ["OPRM1"]

### Quality control

#### Gene intervals and blocks 

In [8]:
# Get gene intervals
gene_interval = hl.experimental.get_gene_intervals(
    gene_symbols=GENES,
    reference_genome="GRCh38",
    gtf_file="file:///mnt/project/WGS_Lucia/WGS_QC/gencode.v46.annotation.gtf",
)
gene_interval

2025-02-13 12:24:33.634 Hail: INFO: Reading table without type imputation
  Loading field 'f0' as type str (not specified)
  Loading field 'f1' as type str (not specified)
  Loading field 'f2' as type str (not specified)
  Loading field 'f3' as type int32 (user-supplied)
  Loading field 'f4' as type int32 (user-supplied)
  Loading field 'f5' as type float64 (user-supplied)
  Loading field 'f6' as type str (not specified)
  Loading field 'f7' as type int32 (user-supplied)
  Loading field 'f8' as type str (not specified)
2025-02-13 12:24:50.608 Hail: INFO: wrote table with 3467156 rows in 12 partitions to /tmp/85ErUHzyFapqJxSSp0VrMo
2025-02-13 12:24:54.875 Hail: INFO: Ordering unsorted dataset with network shuffle
2025-02-13 12:24:59.773 Hail: INFO: get_gene_intervals found 1 entries:
gene: OPRM1 (ENSG00000112038)


[Interval(start=Locus(contig=chr6, position=154010496, reference_genome=GRCh38), end=Locus(contig=chr6, position=154246867, reference_genome=GRCh38), includes_start=True, includes_end=True)]

In [None]:
# Get DRAGEN pVCF blocks
blocks = hl.import_table(
    "file:///mnt/project/WGS_Lucia/WGS_QC/dragen_pvcf_blocks.tsv", no_header=False
)
blocks = blocks.annotate(
    Chromosome=blocks.Chromosome.replace("23", "X").replace("24", "Y")
)
blocks = blocks.annotate(region=hl.str("").join([hl.str("chr"), blocks.Chromosome]))
blocks = blocks.annotate(
    interval=hl.locus_interval(
        blocks.region,
        hl.int32(blocks.Starting_Position),
        hl.int32(blocks.Ending_Position),
        reference_genome="GRCh38",
    )
).key_by("interval")

2025-02-13 12:25:00.953 Hail: INFO: Reading table without type imputation
  Loading field 'Row_Number' as type str (not specified)
  Loading field 'Chromosome' as type str (not specified)
  Loading field 'Block' as type str (not specified)
  Loading field 'Starting_Position' as type str (not specified)
  Loading field 'Ending_Position' as type str (not specified)


In [10]:
# Get blocks for given genes
gb = blocks.filter(hl.any(lambda inter: blocks.interval.overlaps(inter), gene_interval))
gb.show()

Row_Number,Chromosome,Block,Starting_Position,Ending_Position,region,interval
str,str,str,str,str,str,interval<locus<GRCh38>>
"""60766""","""6""","""7700""","""153991433""","""154011429""","""chr6""",[chr6:153991433-chr6:154011429)
"""60767""","""6""","""7701""","""154011430""","""154031420""","""chr6""",[chr6:154011430-chr6:154031420)
"""60768""","""6""","""7702""","""154031421""","""154051414""","""chr6""",[chr6:154031421-chr6:154051414)
"""60769""","""6""","""7703""","""154051415""","""154071408""","""chr6""",[chr6:154051415-chr6:154071408)
"""60770""","""6""","""7704""","""154071409""","""154091406""","""chr6""",[chr6:154071409-chr6:154091406)
"""60771""","""6""","""7705""","""154091407""","""154111396""","""chr6""",[chr6:154091407-chr6:154111396)
"""60772""","""6""","""7706""","""154111397""","""154131395""","""chr6""",[chr6:154111397-chr6:154131395)
"""60773""","""6""","""7707""","""154131396""","""154151386""","""chr6""",[chr6:154131396-chr6:154151386)
"""60774""","""6""","""7708""","""154151387""","""154171378""","""chr6""",[chr6:154151387-chr6:154171378)
"""60775""","""6""","""7709""","""154171379""","""154191374""","""chr6""",[chr6:154171379-chr6:154191374)


#### Import VCF files of specific blocks

In [None]:
VCF_DIR = Path(
    "DRAGEN WGS/DRAGEN population level WGS variants, pVCF format 500k release"
)

vcf_files = [
    f"file://{BULK_DIR / VCF_DIR}/{chromosome}/ukb{FIELD_ID}_c{chromosome.replace('chr', '')}_b{block}_{VCF_VERSION}.vcf.gz"
    for block, chromosome in zip(gb.Block.collect(), gb.region.collect())
]

mt = hl.import_vcf(
    vcf_files,
    drop_samples=False,
    reference_genome="GRCh38",
    array_elements_required=False,
    force_bgz=True,
)

2025-02-13 12:25:04.869 Hail: INFO: Coerced sorted dataset
2025-02-13 12:25:07.149 Hail: INFO: Coerced sorted dataset
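
To make the file-naming scheme explicit, here is a small standalone sketch of how each pVCF path is assembled from the contig and block number. `pvcf_uri` is a hypothetical helper written for illustration; the directory names, field ID and version are the values used in this notebook.

```python
from pathlib import PurePosixPath

BULK_DIR = PurePosixPath("/mnt/project/Bulk")
VCF_DIR = PurePosixPath(
    "DRAGEN WGS/DRAGEN population level WGS variants, pVCF format 500k release"
)


def pvcf_uri(region, block, field_id=24310, version="v1"):
    """Build the file:// URI of one pVCF block, e.g. region='chr6', block='7700'."""
    chrom = region.replace("chr", "")  # file names use '6', folders use 'chr6'
    name = f"ukb{field_id}_c{chrom}_b{block}_{version}.vcf.gz"
    return f"file://{BULK_DIR / VCF_DIR / region / name}"


print(pvcf_uri("chr6", "7700"))
```

Note that the folder is named after the full contig (`chr6`) while the file name carries only the bare chromosome (`c6`).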


In [12]:
# Only genes of interest
mt = hl.filter_intervals(mt, gene_interval)

In [13]:
# Remove singletons: keep only variants with more than one alternate allele
# call across all samples (this also drops sites with no alternate calls)
mt = mt.filter_rows(hl.agg.sum(mt.GT.n_alt_alleles()) > 1)
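
The effect of this filter can be illustrated without Hail. In the sketch below, genotypes are tuples of allele indices and `None` stands for a missing call; the helper names are hypothetical. Missing calls are skipped, on the assumption that this mirrors how Hail's sum aggregator ignores missing values.

```python
def n_alt_alleles(genotype):
    """Count non-reference alleles in one diploid call, e.g. (0, 1) -> 1."""
    return sum(1 for allele in genotype if allele != 0)


def keep_variant(genotypes):
    """Keep a variant only if its total alternate allele count exceeds 1."""
    total = sum(n_alt_alleles(g) for g in genotypes if g is not None)
    return total > 1


print(keep_variant([(0, 0), (0, 1), None]))  # one alt allele in total
print(keep_variant([(0, 1), (0, 1)]))        # two heterozygous carriers
print(keep_variant([(1, 1), (0, 0)]))        # one homozygous carrier
```

Because the test is on allele count rather than carrier count, a variant seen in a single homozygous sample is retained.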

In [14]:
# First checkpoint
stage = "FIRST"
checkpoint_file = f"/tmp/{PROJ_NAME}.{stage}.cp.mt"

mt = mt.checkpoint(checkpoint_file, overwrite=True)

2025-02-13 12:25:33.245 Hail: INFO: scanning VCF for sortedness...
2025-02-13 12:29:24.577 Hail: INFO: Coerced sorted VCF - no additional import work to do
2025-02-13 12:42:54.668 Hail: INFO: wrote matrix table with 46265 rows and 490541 columns in 678 partitions to /tmp/OPRM1.FIRST.cp.mt


#### Multi-allele filtering

In [15]:
# Remove variants with more than 6 alleles (reference allele included)
mt = mt.filter_rows(mt.alleles.length() <= 6)

In [16]:
# Split multi-allele variants into single ones
mt = smart_split_multi_mt(mt)
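
The general idea behind these two steps can be sketched in plain Python. `smart_split_multi_mt` comes from the project's `matrixtables` module and its implementation is not shown here, so the code below only illustrates what splitting a multi-allelic site means; it does not reproduce that helper and omits details such as genotype recoding and allele normalisation. Both function names are hypothetical.

```python
def passes_allele_filter(alleles, max_alleles=6):
    """Mirror of `mt.alleles.length() <= 6`: the reference allele is counted."""
    return len(alleles) <= max_alleles


def split_multi(position, alleles):
    """Turn one site with N alternate alleles into N biallelic records.

    Each record keeps the original alternate-allele index (1-based) so that
    genotypes could later be recoded against it.
    """
    ref, *alts = alleles
    return [(position, [ref, alt], a_index) for a_index, alt in enumerate(alts, start=1)]


site = (100, ["A", "G", "T"])  # made-up site with two alternate alleles
if passes_allele_filter(site[1]):
    for record in split_multi(*site):
        print(record)
```

This is why the row count grows between the first and second checkpoints (46265 to 59311 rows in the logs of this run).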

In [17]:
# Second checkpoint
stage = "SECOND"
checkpoint_file = f"/tmp/{PROJ_NAME}.{stage}.cp.mt"

mt = mt.checkpoint(checkpoint_file, overwrite=True)

2025-02-13 12:53:52.035 Hail: INFO: wrote matrix table with 59311 rows and 490541 columns in 1356 partitions to /tmp/OPRM1.SECOND.cp.mt


#### Quality control filtering

In [18]:
mt = mt.filter_entries(mt.FT == "PASS")

# Then, keep only variants with at least one non-missing genotype
mt = mt.filter_rows(hl.agg.any(hl.is_defined(mt.GT)))

In [None]:
# Compute statistics about the number and fraction of filtered entries.
mt = hl.MatrixTable.compute_entry_filter_stats(
    mt, row_field="entry_stats_row", col_field="entry_stats_col"
)

In [None]:
row_fraction_threshold = 0.95

# Keep variants where more than 95% of genotypes are unfiltered
mt = mt.filter_rows((1 - mt.entry_stats_row.fraction_filtered) > row_fraction_threshold)

In [None]:
col_fraction_threshold = 0.95

# Keep samples where more than 95% of variants are unfiltered
mt = mt.filter_cols((1 - mt.entry_stats_col.fraction_filtered) > col_fraction_threshold)
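
The two call-rate filters can be illustrated on a small matrix. In this sketch rows are variants, columns are samples, and `None` marks an entry removed by the `FT == "PASS"` filter; the function names are hypothetical. Both fractions are computed once from the same matrix, on the assumption that this matches the notebook, where the entry statistics are annotated before either filter is applied.

```python
def pass_fractions(entries):
    """Return (per-row, per-column) fractions of unfiltered entries."""
    n_rows, n_cols = len(entries), len(entries[0])
    row_frac = [sum(e is not None for e in row) / n_cols for row in entries]
    col_frac = [
        sum(entries[i][j] is not None for i in range(n_rows)) / n_rows
        for j in range(n_cols)
    ]
    return row_frac, col_frac


def apply_call_rate_filters(entries, row_threshold=0.95, col_threshold=0.95):
    """Keep rows and columns whose unfiltered fraction is strictly above threshold."""
    row_frac, col_frac = pass_fractions(entries)
    keep_rows = [i for i, f in enumerate(row_frac) if f > row_threshold]
    keep_cols = [j for j, f in enumerate(col_frac) if f > col_threshold]
    return [[entries[i][j] for j in keep_cols] for i in keep_rows]


toy = [
    [1, 1, None],
    [1, None, None],
    [1, 1, 1],
]
# A low threshold is used only so that the toy example keeps something.
print(apply_call_rate_filters(toy, row_threshold=0.5, col_threshold=0.5))
```

The comparison is strict, so a variant or sample sitting exactly at the threshold is removed.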

In [22]:
# Third checkpoint
stage = "THIRD"
checkpoint_file = f"/tmp/{PROJ_NAME}.{stage}.cp.mt"

mt = mt.checkpoint(checkpoint_file, overwrite=True)

2025-02-13 13:05:37.898 Hail: INFO: wrote matrix table with 58863 rows and 490541 columns in 1356 partitions to /tmp/OPRM1.THIRD.cp.mt


#### Variant Effect Predictor (VEP)

In [23]:
VEP_JSON = Path("GRCh38_VEP.json").resolve()

In [None]:
mt = hl.vep(mt, f"file:{VEP_JSON}", block_size=100)

2025-02-13 13:10:55.778 Hail: INFO: wrote table with 58863 rows in 1356 partitions to /tmp/persist_TableIUGw2uRitt


In [None]:
is_MANE = mt.aggregate_rows(
    hl.agg.all(hl.is_defined(mt.vep.transcript_consequences.mane_select))
)
assert is_MANE, "Selected transcript may not be MANE Select. Check manually."

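# protCons: reference amino acid + protein position + alternate amino acid,
# taken from the first transcript consequence.
# varid: variant ID in "contig:position:ref:alt" form.
mt = mt.annotate_rows(
    protCons=mt.vep.transcript_consequences.amino_acids[0].split("/")[0]
    + hl.str(mt.vep.transcript_consequences.protein_end[0])
    + mt.vep.transcript_consequences.amino_acids[0].split("/")[-1],
    varid=hl.variant_str(mt.locus, mt.alleles),
)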
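
The string manipulation behind `protCons` is easy to check in isolation. The sketch below applies the same split-and-join logic to plain Python values; `prot_cons` is a hypothetical helper and the inputs are made-up examples of VEP-style `amino_acids` and `protein_end` fields.

```python
def prot_cons(amino_acids, protein_end):
    """Build a label such as 'N40D' from amino_acids='N/D' and protein_end=40."""
    parts = amino_acids.split("/")
    # parts[0] is the reference residue, parts[-1] the alternate residue.
    return parts[0] + str(protein_end) + parts[-1]


print(prot_cons("N/D", 40))  # prints: N40D
print(prot_cons("A", 6))     # prints: A6A
```

When `amino_acids` holds a single residue (no `/`), the first and last parts are the same element, so the label repeats that residue on both sides.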

In [None]:
# Fourth checkpoint
stage = "FOURTH"
checkpoint_file = f"/tmp/{PROJ_NAME}.{stage}.cp.mt"

mt = mt.checkpoint(checkpoint_file, overwrite=True)
# mt = hl.read_matrix_table(checkpoint_file)

2025-02-13 13:23:13.592 Hail: INFO: wrote matrix table with 58863 rows and 490541 columns in 1356 partitions to /tmp/OPRM1.FOURTH.cp.mt


### Filtering

In [None]:
GENE = "OPRM1"
# gene=mt.vep.transcript_consequences.gene_symbol[0]
mt = mt.filter_rows(
    (mt.vep.transcript_consequences.gene_symbol[0] == GENE)
    & (mt.vep.most_severe_consequence == "missense_variant")
)

In [28]:
print(f"{mt.count_rows()} variants after quality filtering")

229 variants after quality filtering


In [None]:
# Fifth checkpoint
stage = "FITH"
checkpoint_file = f"/tmp/{PROJ_NAME}.{stage}.cp.mt"

mt = mt.checkpoint(checkpoint_file, overwrite=True)
# mt = hl.read_matrix_table(checkpoint_file)

2025-02-13 13:31:18.884 Hail: INFO: wrote matrix table with 229 rows and 490541 columns in 1356 partitions to /tmp/OPRM1.FITH.cp.mt


#### Export 

In [None]:
# PLINK file
PLINK_FILE = "/tmp/OPRM1_missense_variants"


hl.export_plink(mt, varid=mt.varid, output="file:" + PLINK_FILE)

2025-02-13 13:31:25.657 Hail: INFO: merging 1357 files totalling 26.8M...
2025-02-13 13:31:26.273 Hail: INFO: while writing:
    file:/tmp/OPRM1_missense_variants.bed
  merge time: 615.845ms
2025-02-13 13:31:26.455 Hail: INFO: merging 1356 files totalling 9.4K...
2025-02-13 13:31:26.663 Hail: INFO: while writing:
    file:/tmp/OPRM1_missense_variants.bim
  merge time: 208.017ms
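
For a quick sanity check of the exported variant list, a `.bim` file can be read with a few lines of Python. This sketch relies on the standard PLINK 1 `.bim` layout (chromosome, variant ID, genetic distance, base-pair position, allele 1, allele 2); `parse_bim_line` is a hypothetical helper and the example line is made up, not taken from the actual output.

```python
def parse_bim_line(line):
    """Parse one whitespace-delimited .bim record into a dictionary."""
    chrom, varid, cm, pos, a1, a2 = line.split()
    return {
        "chrom": chrom,
        "varid": varid,
        "cm": float(cm),
        "pos": int(pos),
        "a1": a1,
        "a2": a2,
    }


# Made-up record for illustration only.
example = "chr6\tchr6:1000:A:G\t0\t1000\tG\tA\n"
record = parse_bim_line(example)
print(record["varid"], record["pos"])
```

Applied to every line of `OPRM1_missense_variants.bim`, this gives one record per exported variant, which should match the variant count printed before the export.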


In [31]:
bed_file = PLINK_FILE + ".bed"
bim_file = PLINK_FILE + ".bim"
fam_file = PLINK_FILE + ".fam"

!dx upload $bed_file $bim_file $fam_file --path /WGS_Lucia/WGS_QC/Output/Drug_variant_matrix/

ID                                file-Gyfz8GQJb4JF6YG1yz0Ff63P
Class                             file
Project                           project-GfVK998Jb4JJgVBjKXPyxJ9q
Folder                            /WGS_Lucia/WGS_QC/Output
Name                              OPRM1_missense_variants.bed
State                             closing
Visibility                        visible
Types                             -
Properties                        -
Tags                              -
Outgoing links                    -
Created                           Thu Feb 13 13:31:30 2025
Created by                        luciass6
 via the job                      job-Gyfx36QJb4J98f1V04bB0QV9
Last modified                     Thu Feb 13 13:31:31 2025
Media type                        
archivalState                     "live"
cloudAccount                      "cloudaccount-dnanexus"
ID                                file-Gyfz8J0Jb4J0q7Qx1j7y99K2
Class                             file
Project    

In [None]:
# ANNOTATIONS file
ANNOTATIONS_FILE = "/tmp/OPRM1_missense_variants.annotations"

annotations = (
    mt.select_rows(
        varid=mt.varid,
        gene=mt.vep.transcript_consequences.gene_symbol[0],
        annotation=hl.if_else(
            # Check if 'protCons' is missing, if so, use "most_severe_consequence"
            hl.is_missing(mt.protCons),
            mt.vep.most_severe_consequence,
            mt.protCons,
        ),
    )
    .rows()
    .key_by("varid")
    .drop("locus")
    .drop("alleles")
)

annotations.export("file:" + ANNOTATIONS_FILE, header=False)

2025-02-13 13:31:44.178 Hail: INFO: Coerced sorted dataset
2025-02-13 13:31:46.241 Hail: INFO: merging 9 files totalling 6.9K...
2025-02-13 13:31:46.264 Hail: INFO: while writing:
    file:/tmp/OPRM1_missense_variants.annotations
  merge time: 22.238ms
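
The layout of the resulting annotations file can be sketched as follows. Each line holds the variant ID, the gene and the annotation, with no header, and the annotation falls back to the most severe consequence when no protein-level label is available. `annotation_line` is a hypothetical helper, the inputs are made up, and tab separation is assumed here because it is the default delimiter of Hail's table export.

```python
def annotation_line(varid, gene, prot_cons, most_severe_consequence):
    """Format one row of the .annotations file (tab-separated, no header)."""
    annotation = prot_cons if prot_cons is not None else most_severe_consequence
    return "\t".join([varid, gene, annotation])


# Made-up rows for illustration only.
print(annotation_line("chr6:1000:A:G", "OPRM1", "N40D", "missense_variant"))
print(annotation_line("chr6:2000:C:T", "OPRM1", None, "missense_variant"))
```

Downstream tools that read this file alongside the PLINK set can join on the first column, which matches the variant IDs written to the `.bim` file.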


In [33]:
!dx upload $ANNOTATIONS_FILE --path /WGS_Lucia/WGS_QC/Output/Drug_variant_matrix/

ID                                file-Gyfz8V0Jb4JF6YG1yz0Ff63k
Class                             file
Project                           project-GfVK998Jb4JJgVBjKXPyxJ9q
Folder                            /WGS_Lucia/WGS_QC/Output
Name                              OPRM1_missense_variants.annotations
State                             closing
Visibility                        visible
Types                             -
Properties                        -
Tags                              -
Outgoing links                    -
Created                           Thu Feb 13 13:31:48 2025
Created by                        luciass6
 via the job                      job-Gyfx36QJb4J98f1V04bB0QV9
Last modified                     Thu Feb 13 13:31:49 2025
Media type                        
archivalState                     "live"
cloudAccount                      "cloudaccount-dnanexus"
