The goal of this analysis is to get an intuition for how genetic variant rarity changes with (predicted) genic consequence. 

This notebook computes count tables of variant rarity category by Ensembl-predicted consequence, together with the sum and sum of squares of the phyloP and Roulette scores in each cell of the table.

## Import relevant libraries

In [2]:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T
import json

## Create the Spark session

In [3]:
spark = SparkSession.builder \
    .appName("purifying_selection") \
    .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/06/21 14:27:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## Load variants

In [23]:
#real data

df = spark.read \
    .option("comment", "#") \
    .option("delimiter", "\t") \
    .option("header", "true") \
    .csv("/gpfs/gibbs/pi/reilly/VariantEffects/scripts/noon_data/2.3.add_transposons/*.csv.gz/*.csv.gz")

#toy data

#df=spark.read \
#    .option("delimiter","\t") \
#    .option("header","true") \
#    .csv("toy_with_missing_vep.tsv")
#OR
#    .csv("toy.tsv")



### Note that since we're tapping off variants at the 2.3 add transposon step, we're missing:
- 2.5 filter: so exonic variants are not filtered out (this is desirable)
- 3.0 pleio_and_filter: so `MAF_OR_AC_IS_ZERO` variants have not been dropped (we drop them below)
- 3.5 add_tf: (no great loss)
- 3.6 remove non-SNP: (we do this below)

## Filter out non-SNP variants

In [24]:
df= df.filter(
     df.REF.isin("A", "T", "C", "G") & df.ALT.isin("A", "T", "C", "G")
)

## Filter out `MAF_OR_AC_IS_ZERO`

In [25]:
df=df.filter(F.col("category")!="MAF_OR_AC_IS_ZERO")

## Count occurrences of each consequence code in each vep string

First, get a list of consequences for each variant. This is a little involved, because of the many layers we have to trawl through:

![schema](./info_field.drawio.png)

In [26]:
#semicolon split
df=df.withColumn("info_split",F.split(df["INFO"],";"))
df=df.withColumn("vep_alone",F.expr("filter(info_split, x -> x LIKE 'vep=%')[0]"))

In [27]:
#comma split
df=df.withColumn("vep_split",F.split(df["vep_alone"],","))

#pipe split & grab the second field (index 1), which holds the consequence codes
df = df.withColumn(
    "extracted_codes",
    F.transform(F.col("vep_split"), lambda x: F.split(x, "\\|")[1])
)

In [28]:
#break up ampersand-joined consequence codes
df=df.withColumn("consq_codes",F.expr("flatten(transform(extracted_codes,x->split(x,'&')))"))

In [29]:
#Some variants will naturally have no predicted consequences (consq_codes is null). For these we use the placeholder code NONE.

df=df.withColumn("consq_codes", F.when(F.col("consq_codes").isNull(), F.array(F.lit("NONE"))).otherwise(F.col("consq_codes")))
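
To make the layers concrete, here is a plain-Python sketch (no Spark needed) that mirrors the steps above on a single INFO string. The INFO value and the helper name `consq_codes_from_info` are made up for illustration, and the sketch assumes every vep entry has at least two pipe-delimited fields; real VEP strings carry many more fields.

```python
def consq_codes_from_info(info):
    """Mirror the Spark steps above for one INFO string (illustrative only)."""
    # Semicolon split, then keep the first entry that starts with 'vep='
    vep_entries = [x for x in info.split(";") if x.startswith("vep=")]
    if not vep_entries:
        # No vep entry at all: fall back to the NONE placeholder
        return ["NONE"]
    codes = []
    # Comma split: one element per annotation in the vep string
    for entry in vep_entries[0].split(","):
        # Pipe split: index 1 holds the consequence field
        fields = entry.split("|")
        # Ampersand split: one field may join several consequence codes
        codes.extend(fields[1].split("&"))
    return codes


# Hypothetical INFO value with two vep annotations
toy_info = "AC=3;vep=T|missense_variant&splice_region_variant|MODERATE,T|intron_variant|MODIFIER"
toy_codes = consq_codes_from_info(toy_info)
print(toy_codes)
# ['missense_variant', 'splice_region_variant', 'intron_variant']
```

Because the `vep=` prefix stays attached to the first pipe-delimited field (index 0), it never contaminates the consequence field at index 1.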

Next, compute the worst consequence code for each variant.

I've retrieved consequences from [here](https://useast.ensembl.org/info/genome/variation/prediction/predicted_data.html) on 2024-06-10. 

In [30]:
#This order is taken from the website linked above, which states that
#the codes are shown in order of severity (though it admits this is subjective).
#I've assigned numbers: the smaller the number, the more severe the consequence.

consq_code_lut = {"transcript_ablation":0, 
                  "splice_acceptor_variant":1, 
                  "splice_donor_variant":2, 
                  "stop_gained":3, 
                  "frameshift_variant":4, 
                  "stop_lost":5, 
                  "start_lost":6, 
                  "transcript_amplification":7,
                  "feature_elongation":8,
                  "feature_truncation":9,
                  "inframe_insertion":10,
                  "inframe_deletion":11,
                  "missense_variant":12,
                  "protein_altering_variant":13,
                  "splice_donor_5th_base_variant":14,
                  "splice_region_variant":15,
                  "splice_donor_region_variant":16,
                  "splice_polypyrimidine_tract_variant":17,
                  "incomplete_terminal_codon_variant":18,
                  "start_retained_variant":19,
                  "stop_retained_variant":20,
                  "synonymous_variant":21,
                  "coding_sequence_variant":22,
                  "mature_miRNA_variant":23,
                  "5_prime_UTR_variant":24,
                  "3_prime_UTR_variant":25,
                  "non_coding_transcript_exon_variant":26,
                  "intron_variant":27,
                  "NMD_transcript_variant":28,
                  "non_coding_transcript_variant":29,
                  "coding_transcript_variant":30,
                  "upstream_gene_variant":31,
                  "downstream_gene_variant":32,
                  "TFBS_ablation":33,
                  "TFBS_amplification":34,
                  "TF_binding_site_variant":35,
                  "regulatory_region_ablation":36,
                  "regulatory_region_amplification":37,
                  "regulatory_region_variant":38,
                  "intergenic_variant":39,
                  "sequence_variant":40,
                  "NONE":41
                 }

In [31]:
lookup_broadcast = spark.sparkContext.broadcast(consq_code_lut)

reverse_consequence_code_lut= {value: key for key, value in consq_code_lut.items()}

#lookup_broadcast_reverse = spark.sparkContext.broadcast(reverse_consequence_code_lut)

In [32]:
def lookup_transform(inp):
    #turns a list of consequence codes into a list of severity ints
    lookup=lookup_broadcast.value
    return [lookup.get(item) for item in inp]

def lookup_transform_reverse(inp):
    #turns a SINGLE severity int into a consequence code
    return reverse_consequence_code_lut.get(inp,"ERR")

#register the UDFs
lookup_transform_udf = F.udf(lookup_transform, returnType=T.ArrayType(T.IntegerType()))

lookup_transform_reverse_udf = F.udf(lookup_transform_reverse, returnType=T.StringType())

In [33]:
#Apply the lookup UDF to convert string consequence codes to severity ints
df=df.withColumn("consq_numeric",lookup_transform_udf(df["consq_codes"]))

In [34]:
#get the worst severity score for each variant.
df=df.withColumn("min_consq_numeric", F.array_min(df["consq_numeric"]))

In [35]:
#convert minimum consequence code back to string
df=df.withColumn("worst_consq_string",
              lookup_transform_reverse_udf(df["min_consq_numeric"])
             )
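
The three steps above (codes to severity ints, take the minimum, map back to a string) can be sketched in plain Python as follows. The table `toy_lut` is an abbreviated stand-in for `consq_code_lut`, and `worst_consequence` is a hypothetical helper, not part of the notebook. The sketch also shows what happens to a code that is missing from the table: it maps to `None`, is skipped by the minimum (as Spark's `array_min` skips nulls), and yields `"ERR"` only if no code is recognised.

```python
# Abbreviated severity table for illustration; smaller means more severe
toy_lut = {"stop_gained": 3, "missense_variant": 12, "intron_variant": 27, "NONE": 41}
toy_reverse = {rank: code for code, rank in toy_lut.items()}


def worst_consequence(codes):
    """Return the most severe known code, or 'ERR' if none is recognised."""
    # Unknown codes map to None, just as dict.get does inside the UDF
    ranks = [toy_lut.get(code) for code in codes]
    known = [r for r in ranks if r is not None]
    # Skip missing values; an all-missing list gives None
    best = min(known) if known else None
    # Same fallback as lookup_transform_reverse
    return toy_reverse.get(best, "ERR")


print(worst_consequence(["intron_variant", "missense_variant"]))  # missense_variant
print(worst_consequence(["not_a_real_code"]))  # ERR
```

Grouping the final table by `worst_consq_string` and checking for an `ERR` row is therefore a quick way to spot consequence codes that the lookup table does not cover.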

In [17]:
#manual verification
#import pandas as pd
#with pd.option_context('display.max_rows', None, 'display.max_columns', None):
#    display(df.limit(3).toPandas())
#df.limit(3).toPandas()["consq_codes"].to_list()

In [36]:
#count variants and accumulate score sums for each (category, worst consequence) group
counts=df.groupBy("category","worst_consq_string").agg(
    
    F.sum("P_ANNO").alias("sum_phylop"),
    F.sum(F.col("P_ANNO") * F.col("P_ANNO")).alias("sum_of_squared_phylop"),
    
    F.sum("roulette_MR").alias("sum_roulette_MR"),
    F.sum(F.col("roulette_MR") * F.col("roulette_MR")).alias("sum_of_squared_roulette_MR"),
    
    
    F.count("*").alias("count")  # Count of elements in each group
)
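
The notebook does not say how these sums are used downstream. One natural use, assumed here, is recovering the per-group mean and variance without revisiting the raw data, since the mean is sum / n and the population variance is sum_of_squares / n - mean². A minimal sketch with made-up numbers; `mean_and_variance` is a hypothetical helper.

```python
def mean_and_variance(count, total, total_sq):
    """Recover mean and population variance from count, sum and sum of squares."""
    mean = total / count
    # E[x^2] - (E[x])^2; clamp at zero to absorb floating-point round-off
    variance = max(total_sq / count - mean * mean, 0.0)
    return mean, variance


# Made-up scores standing in for one (category, worst_consq_string) group
scores = [1.0, 2.0, 3.0, 4.0]
n = len(scores)
s1 = sum(scores)
s2 = sum(x * x for x in scores)

mean, variance = mean_and_variance(n, s1, s2)
print(mean, variance)  # 2.5 1.25
```

One caveat if the table is used this way: `F.sum` skips null values while `F.count("*")` counts every row, so if a score column has missing values, dividing by `count` would understate the mean. A per-column non-null count would be needed in that case.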

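Write the counts to disk. Spark writes `counts.csv` as a directory; `coalesce(1)` ensures it contains a single part file.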

In [37]:
counts.coalesce(1).write.csv("counts.csv", mode="overwrite", header=True)
#counts.coalesce(1).write.csv("counts_toy.csv", mode="overwrite", header=True)
#counts.coalesce(1).write.csv("counts_toy_with_missing.csv", mode="overwrite", header=True)