This notebook will:

- Filter out the gorp
    - Remove low-quality variants that don't pass gnomAD's own filters.
    - Remove low-quality variants that were not queried in a large number of individuals.
    - Remove variants with a MAF of 0 (they don't really "vary" if they don't exist in the population).
    - Remove variants that don't have all of the relevant metrics.
        - For efficiency's sake, this last task is done throughout, at the steps marked with $\dagger$.

- $\dagger$ Remove variants with no PhyloP scores

- Munge the "INFO" field strings into columns
    - Extract Malinois prediction columns
    - Extract Ensembl VEP scores
- Summarize
    - Malinois predictions: "mean skew"
    - $\dagger$ Remove variants with no "mean skew" Malinois predictions
    - Summarize Ensembl VEP
- Compute allele-frequency category
    - Compute "rare", "ultra-rare", "common", "singleton", etc.
- Dump table to disk
    

Import everything we will need and set up the Spark session.

In [2]:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType

conf = SparkConf().setAppName("Filter")

# Stop any Spark session left over from a previous run of this cell
if 'spark' in locals() and spark is not None:
    spark.stop()

# Create a SparkContext with the specified configuration
sc = SparkContext(conf=conf)

# Create a SparkSession from the SparkContext
spark = SparkSession(sc)

Load in the annotated gnomAD variants.

In [3]:
schema = StructType([
    StructField("CHROM", StringType(), True),
    StructField("POS", StringType(), True),
    StructField("ID", StringType(), True),
    StructField("REF", StringType(), True),
    StructField("ALT", StringType(), True),
    StructField("QUAL", StringType(), True),
    StructField("FILTER", StringType(), True),
    StructField("INFO", StringType(), True),
    StructField("P_ANNO", StringType(), True),
])

df = spark.read \
    .option("comment", "#") \
    .option("delimiter", ",") \
    .schema(schema) \
    .csv("/gpfs/gibbs/pi/reilly/VariantEffects/scripts/noon_scripts/1.annotate/annotated.csv/*.csv", header=True)

AnalysisException: [PATH_NOT_FOUND] Path does not exist: file:/gpfs/gibbs/pi/reilly/VariantEffects/scripts/noon_scripts/1.annotate/annotated.csv/*.csv.

Extract relevant columns from the INFO field

In [None]:
# The `INFO` field contains a lot of useful information, but it is all smashed together into one string.
# Let's extract information from that string.

keys_to_extract = [  # none of these may be a substring of another
    "K562__ref", "HepG2__ref", "SKNSH__ref", "K562__alt", "HepG2__alt", "SKNSH__alt",
    "K562__skew", "HepG2__skew", "SKNSH__skew", "AC", "AN", "AF", "cadd_phred", "vep",  # "P_ANNO" is already in its own column
]

# Use regexp_extract to create one new column per key.
# The capture group '([^;]+)' matches a run of characters that are not semicolons;
# the semicolon is assumed to be the delimiter between key=value pairs in the 'INFO' column.
for key in keys_to_extract:
    # regexp_extract returns "" when nothing matches; store None (null) in that case.
    extracted = F.regexp_extract(F.col("INFO"), "{}=([^;]+);?".format(key), 1)
    df = df.withColumn(key, F.when(extracted != "", extracted).otherwise(None))
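
To see what the extraction does to a single record, here is a minimal pure-Python sketch of the same idea, run on a made-up INFO string. The helper name `extract_info_value` and the example values are hypothetical. Unlike the Spark pattern above, this sketch anchors the key at the start of the string or right after a semicolon, which is one way to keep a short key such as `AC` from matching inside a longer key that ends in `AC`.

```python
import re

def extract_info_value(info, key):
    """Return the value for `key` in a ';'-delimited INFO string, or None."""
    # (?:^|;) anchors the key at the start of the string or just after a ';'
    pattern = r"(?:^|;)" + re.escape(key) + r"=([^;]+)"
    match = re.search(pattern, info)
    return match.group(1) if match else None

# Made-up record, for illustration only
info = "controls_AC=3;AC=7;AN=152000;AF=4.6e-05;K562__skew=-0.12"

print(extract_info_value(info, "AC"))          # 7
print(extract_info_value(info, "K562__skew"))  # -0.12
print(extract_info_value(info, "cadd_phred"))  # None
```

The same anchored pattern could be passed to `F.regexp_extract` if key collisions turn out to be a problem in the real data.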

In [None]:
df = df.filter(
    # make sure we have the necessary population stats
    (F.col("AF").isNotNull()) &
    (F.col("AC").isNotNull()) &
    (F.col("AN").isNotNull()) &

    # check that the variant has been queried in a reasonably large number of people:
    # approx. 1/3 of the population size queried in this release of gnomAD,
    # a little less conservative than gnomAD's own warning threshold,
    # which is triggered when a variant is queried in < 1/2 of the population
    (F.col("AN").cast("int") > 25385) &

    # gnomAD filters passed. See the original gnomAD VCF header for the spec.
    (F.col("FILTER") == "PASS")

    # optional, for testing on one chromosome (append with `&` above):
    # (F.col("CHROM") == "chr22")
)

Compute mean Malinois reference activity and mean Malinois skew (each is the mean of the absolute values across the three cell types)

In [None]:
# reference activity: cast the extracted strings to floats
df = df.withColumn("K562__ref", F.col("K562__ref").cast("float"))
df = df.withColumn("HepG2__ref", F.col("HepG2__ref").cast("float"))
df = df.withColumn("SKNSH__ref", F.col("SKNSH__ref").cast("float"))

# mean of the absolute reference activity across the three cell types
# (F.abs is the Spark column function; Python's built-in abs does not work on a Column)
df = df.withColumn("mean_ref", (F.abs(F.col("K562__ref")) + F.abs(F.col("HepG2__ref")) + F.abs(F.col("SKNSH__ref"))) / 3)

# skew: cast the extracted strings to floats
df = df.withColumn("K562__skew", F.col("K562__skew").cast("float"))
df = df.withColumn("HepG2__skew", F.col("HepG2__skew").cast("float"))
df = df.withColumn("SKNSH__skew", F.col("SKNSH__skew").cast("float"))

# mean of the absolute skew across the three cell types
df = df.withColumn("mean_skew", (F.abs(F.col("K562__skew")) + F.abs(F.col("HepG2__skew")) + F.abs(F.col("SKNSH__skew"))) / 3)
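
As a sanity check on the arithmetic, here is a minimal plain-Python sketch of the same summary, using made-up numbers and a hypothetical helper `mean_abs`. It also mirrors the null behaviour one would expect from Spark: if any of the three inputs is missing, the result is missing, which is what lets variants without a "mean skew" be dropped later.

```python
def mean_abs(values):
    """Mean of absolute values; None if any input is None (mirrors Spark null propagation)."""
    if any(v is None for v in values):
        return None
    return sum(abs(v) for v in values) / len(values)

# Made-up skew values for K562, HepG2 and SKNSH
print(mean_abs([0.5, -1.0, 1.5]))   # 1.0
print(mean_abs([0.3, None, -0.1]))  # None
```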

Extract VEP information.

A single variant may have multiple effects, so the VEP column has multiple entries, separated by commas. Within each comma-delimited entry are multiple fields delimited by bars (|). The second field of each entry contains the data we are interested in: the "calculated variant consequence", which is basically a prediction of what the variant is likely to do. I retrieved the list of consequences from [here](https://useast.ensembl.org/info/genome/variation/prediction/predicted_data.html) on 2023-12-24.
If a consequence is in the `protein_coding` list, it's probably important, but it is irrelevant to the present study, which focuses on noncoding elements. If a variant has any `protein_coding` predictions, I will discard it. Otherwise, I simply count the number of occurrences of each code.
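
A minimal sketch of that parsing logic in plain Python, run on a made-up VEP string. The helper `summarize_vep` and the example entries are hypothetical, and the `excluded` set is only a placeholder for whichever codes end up in `protein_coding`. The sketch also assumes that multiple consequences within one entry are joined with `&`, as in standard VEP output.

```python
from collections import Counter

def summarize_vep(vep, excluded):
    """Count consequence codes in a VEP string; return None if any code is in `excluded`."""
    counts = Counter()
    for entry in vep.split(","):            # one entry per predicted effect
        fields = entry.split("|")           # bar-delimited fields within an entry
        if len(fields) < 2:
            continue                        # skip malformed entries
        for code in fields[1].split("&"):   # second field holds the consequence(s)
            if code:
                counts[code] += 1
    if any(code in excluded for code in counts):
        return None                         # variant would be discarded
    return counts

# Made-up example values, for illustration only
vep = ("T|intron_variant|MODIFIER|GENE1,"
       "T|upstream_gene_variant|MODIFIER|GENE2,"
       "T|intron_variant&non_coding_transcript_variant|MODIFIER|GENE3")
excluded = {"missense_variant"}  # placeholder for the protein_coding list

print(summarize_vep(vep, excluded))
print(summarize_vep("A|missense_variant|MODERATE|GENE4", excluded))  # None
```

In Spark, a function like this could be wrapped as a UDF, or the same steps could be expressed with `F.split` and `F.explode`; which is preferable depends on the size of the table.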

In [None]:
all_codes = [
    'transcript_ablation', 'splice_acceptor_variant', 'splice_donor_variant',
    'stop_gained', 'frameshift_variant', 'stop_lost', 'start_lost',
    'transcript_amplification', 'feature_elongation', 'feature_truncation',
    'inframe_insertion', 'inframe_deletion', 'missense_variant',
    'splice_donor_5th_base_variant', 'splice_region_variant',
    'splice_donor_region_variant', 'splice_polypyrimidine_tract_variant',
    'incomplete_terminal_codon_variant', 'start_retained_variant',
    'synonymous_variant', 'coding_sequence_variant', 'mature_miRNA_variant',
    '5_prime_UTR_variant', '3_prime_UTR_variant',
    'non_coding_transcript_exon_variant', 'intron_variant',
    'NMD_transcript_variant', 'non_coding_transcript_variant',
    'coding_transcript_variant', 'upstream_gene_variant',
    'downstream_gene_variant', 'TFBS_ablation', 'TFBS_amplification',
    'TF_binding_site_variant', 'regulatory_region_ablation',
    'regulatory_region_amplification', 'regulatory_region_variant',
    'intergenic_variant', 'sequence_variant',
]
protein_coding = ['protein_altering_variant']

In [None]:
# Drop the 'INFO' column if it's no longer needed.
df = df.drop("INFO")

Let's compute a summary metric of VEP

In [1]:
##write me...

In [None]:
#remove variants ensembl VEP predicts to modify protein coding genes. 