The goal of this analysis is to get an intuition for how genetic variant rarity changes with (predicted) genic consequence. 

First, this notebook computes count tables of variant rarity category by Ensembl-predicted consequence.

- Load all gnomAD variants (without Malinois predictions)
- Discard low-quality variants:
    - Those that don't pass gnomAD's own filters
    - Those at loci queried in a low number of people
    - Those with a minor allele frequency of 0
- Compute allele-frequency category
    - Compute "rare", "ultra-rare", "common", "singleton", etc...
- Extract Ensembl VEP score categories into their own columns
- Tally the number of alleles falling into each allele frequency category 
- Write the table to disk.
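
The allele-frequency binning step in the list above can be sketched in plain Python for a single variant. This is only an illustration: the notebook does not state the real cut-offs, so `COMMON_MAF` and `RARE_MAF` are hypothetical values, `af_category` is a hypothetical helper, and the label strings (other than `MAF_OR_AC_IS_ZERO`) are taken from the list above and may not match the actual values in the `category` column.

```python
# Hypothetical thresholds: the real cut-offs are not given in this notebook.
COMMON_MAF = 0.01
RARE_MAF = 0.0001


def af_category(allele_count: int, allele_number: int) -> str:
    """Bin one variant by minor allele frequency (illustrative only)."""
    if allele_number <= 0 or allele_count <= 0:
        return "MAF_OR_AC_IS_ZERO"
    af = allele_count / allele_number
    # The minor allele is whichever of ALT/REF is less frequent.
    maf = min(af, 1.0 - af)
    if maf == 0:
        return "MAF_OR_AC_IS_ZERO"
    if allele_count == 1:
        return "singleton"
    if maf >= COMMON_MAF:
        return "common"
    if maf >= RARE_MAF:
        return "rare"
    return "ultra-rare"


print(af_category(50, 100_000))
```

With these assumed thresholds, a variant seen 50 times among 100,000 alleles (MAF 0.0005) falls in the "rare" bin.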

TODO
- Fix effect extraction
- Recompute based on consequence code
    - Assign only one consequence code to each variant: just the worst.
- Remove non-SNPs!

## Import relevant libraries

In [2]:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T
import json

## Create the Spark session

In [3]:
spark = SparkSession.builder \
    .appName("purifying_selection") \
    .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/06/10 17:40:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## Load variants

In [5]:
df = spark.read \
    .option("comment", "#") \
    .option("delimiter", "\t") \
    .option("header", "true") \
    .csv("/gpfs/gibbs/pi/reilly/VariantEffects/scripts/noon_data/2.3.add_transposons/*.csv.gz/*.csv.gz")

24/06/10 17:41:09 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
                                                                                

## Filter out non-SNP variants

In [6]:
df = df.filter(
    df.REF.isin("A", "T", "C", "G") & df.ALT.isin("A", "T", "C", "G")
)
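
To check which records the filter above keeps, the same rule can be written as a plain-Python predicate and tried on a few alleles. `is_snp` and `BASES` are hypothetical names used only for this sketch; like the Spark filter, the check is case-sensitive and does not exclude records where REF equals ALT.

```python
BASES = {"A", "C", "G", "T"}


def is_snp(ref: str, alt: str) -> bool:
    """True when both alleles are a single uppercase base."""
    return ref in BASES and alt in BASES


print(is_snp("A", "G"))   # single-base substitution
print(is_snp("AT", "A"))  # deletion, not a SNP
```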

## Filter out `MAF_OR_AC_IS_ZERO` variants

In [7]:
df=df.filter(F.col("category")!="MAF_OR_AC_IS_ZERO")

## Count occurrences of each consequence code in each VEP string

First, get a list of consequences for each variant. This is a little involved, because of the many layers we have to trawl through:

![schema](./info_field.drawio.png)

In [8]:
#semicolon split
df=df.withColumn("info_split",F.split(df["INFO"],";"))
df=df.withColumn("vep_alone",F.expr("filter(info_split, x -> x LIKE 'vep=%')[0]"))

In [9]:
#comma split
df=df.withColumn("vep_split",F.split(df["vep_alone"],","))

#pipe split & grab the second field (index 1), which holds the consequence codes.
df = df.withColumn(
    "extracted_codes",
    F.transform(F.col("vep_split"), lambda x: F.split(x, "\\|")[1])
)

In [10]:
#break up ampersand-joined consequence codes
df=df.withColumn("consq_codes",F.expr("flatten(transform(extracted_codes,x->split(x,'&')))"))
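
The chain of splits above can be traced on one record without a Spark session. The following is a plain-Python sketch of the same logic, not the notebook's implementation: `consequence_codes` is a hypothetical helper, the INFO string is made up for illustration, and skipping entries with fewer than two pipe-separated fields is a simplification.

```python
from typing import List, Optional


def consequence_codes(info: str) -> Optional[List[str]]:
    """Flat list of consequence codes in an INFO string, or None if no vep= field."""
    # Semicolon split, then keep the first field that starts with "vep=".
    vep_fields = [f for f in info.split(";") if f.startswith("vep=")]
    if not vep_fields:
        return None  # plays the role of the Spark null
    codes: List[str] = []
    # Comma split: one entry per annotation.
    for entry in vep_fields[0].split(","):
        parts = entry.split("|")
        if len(parts) < 2:
            continue  # simplification: skip malformed entries
        # Index 1 is the consequence field; "&" joins multiple codes.
        codes.extend(parts[1].split("&"))
    return codes


# Made-up INFO string, for illustration only.
example = (
    "AC=1;AN=1000;"
    "vep=T|missense_variant&splice_region_variant|MODERATE,T|intron_variant|MODIFIER"
)
print(consequence_codes(example))
```

Because the `vep=` prefix stays attached to the first pipe-separated field (index 0), taking index 1 still yields the consequence field for every entry.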

In [24]:
#Some variants naturally have no predicted consequences, which leaves consq_codes null. Label these NONE.

df=df.withColumn("consq_codes", F.when(F.col("consq_codes").isNull(), F.array(F.lit("NONE"))).otherwise(F.col("consq_codes")))

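Next, compute the worst consequence code for each variant. Each code maps to a severity integer in which lower numbers are more severe, so the worst consequence is the minimum.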

I've retrieved consequences from [here](https://useast.ensembl.org/info/genome/variation/prediction/predicted_data.html) on 2024-06-10. 

In [26]:
#0 : HIGH
#1 : MODERATE
#2 : LOW
#3 : MODIFIER
#4 : NONE

consq_code_lut = {"transcript_ablation":0, 
                  "splice_acceptor_variant":0, 
                  "splice_donor_variant":0, 
                  "stop_gained":0, 
                  "frameshift_variant":0, 
                  "stop_lost":0, 
                  "start_lost":0, 
                  "transcript_amplification":0,
                  "feature_elongation":0,
                  "feature_truncation":0,
                  "inframe_insertion":1,
                  "inframe_deletion":1,
                  "missense_variant":1,
                  "protein_altering_variant":1,
                  "splice_donor_5th_base_variant":2,
                  "splice_region_variant":2,
                  "splice_donor_region_variant":2,
                  "splice_polypyrimidine_tract_variant":2,
                  "incomplete_terminal_codon_variant":2,
                  "start_retained_variant":2,
                  "stop_retained_variant":2,
                  "synonymous_variant":2,
                  "coding_sequence_variant":3,
                  "mature_miRNA_variant":3,
                  "5_prime_UTR_variant":3,
                  "3_prime_UTR_variant":3,
                  "non_coding_transcript_exon_variant":3,
                  "intron_variant":3,
                  "NMD_transcript_variant":3,
                  "non_coding_transcript_variant":3,
                  "coding_transcript_variant":3,
                  "upstream_gene_variant":3,
                  "downstream_gene_variant":3,
                  "TFBS_ablation":3,
                  "TFBS_amplification":3,
                  "TF_binding_site_variant":3,
                  "regulatory_region_ablation":3,
                  "regulatory_region_amplification":3,
                  "regulatory_region_variant":3,
                  "intergenic_variant":3,
                  "sequence_variant":3,
                  "NONE":4
                 }

In [27]:
lookup_broadcast = spark.sparkContext.broadcast(consq_code_lut)

In [29]:
def lookup_transform(inp):
    #turns a list of consequence codes into a list of severity ints
    #codes missing from the lookup become None (null in Spark)
    lookup = lookup_broadcast.value
    return [lookup.get(item) for item in inp]

#register the UDF
lookup_udf = F.udf(lookup_transform, returnType=T.ArrayType(T.IntegerType()))

In [30]:
#Apply the lookup UDF to convert string consequence codes to severity ints
df=df.withColumn("consq_numeric",lookup_udf(df["consq_codes"]))

In [32]:
#get the worst severity score for each variant (lowest number = most severe; array_min skips nulls).
df=df.withColumn("min_consq_numeric", F.array_min(df["consq_numeric"]))
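
The lookup-then-minimum step can be checked on a single list of codes in plain Python. This sketch uses only a small excerpt of the severity table above; `SEVERITY` and `worst_severity` are hypothetical names, and dropping unknown codes mirrors how `array_min` skips nulls.

```python
from typing import Dict, List, Optional

# Small excerpt of the severity table above (0 = HIGH ... 4 = NONE).
SEVERITY: Dict[str, int] = {
    "stop_gained": 0,
    "missense_variant": 1,
    "synonymous_variant": 2,
    "intron_variant": 3,
    "NONE": 4,
}


def worst_severity(codes: List[str]) -> Optional[int]:
    """Lowest (most severe) rank among the known codes, or None if none are known."""
    ranks = [SEVERITY[c] for c in codes if c in SEVERITY]
    return min(ranks) if ranks else None


print(worst_severity(["intron_variant", "missense_variant"]))
```

A variant annotated as both intronic and missense is therefore counted once, under MODERATE (1).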

In [33]:
df=df.select("min_consq_numeric","category")

In [34]:
#count variants in each (category, worst severity) pair
counts=df.groupBy("category","min_consq_numeric").agg(F.count("*").alias("count"))
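
For intuition about the shape of the result, the same tally can be reproduced on a handful of rows with `collections.Counter`. The rows below are made up for illustration and the category labels are assumed; only the grouping logic corresponds to the `groupBy` above.

```python
from collections import Counter

# Made-up (category, min_consq_numeric) rows, for illustration only.
rows = [
    ("rare", 3),
    ("rare", 3),
    ("common", 3),
    ("singleton", 1),
    ("rare", 1),
]

# One count per distinct (category, severity) pair, like groupBy + count.
pair_counts = Counter(rows)

for (category, severity), n in sorted(pair_counts.items()):
    print(category, severity, n)
```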

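Write the counts to disk. Spark writes `counts.csv` as a directory; `coalesce(1)` ensures it contains a single part file.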

In [36]:
counts.coalesce(1).write.csv("counts.csv", mode="overwrite", header=True)

ERROR:root:KeyboardInterrupt while sending command.              (0 + 0) / 2036]
Traceback (most recent call last):
  File "/home/mcn26/.conda/envs/mcn_varef/lib/python3.10/site-packages/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/home/mcn26/.conda/envs/mcn_varef/lib/python3.10/site-packages/py4j/clientserver.py", line 511, in send_command
    answer = smart_decode(self.stream.readline()[:-1])
  File "/home/mcn26/.conda/envs/mcn_varef/lib/python3.10/socket.py", line 705, in readinto
    return self._sock.recv_into(b)
KeyboardInterrupt


KeyboardInterrupt: 

[Stage 12:>            (90 + 10) / 2036][Stage 13:>              (0 + 0) / 2036]