We will generate a number of count tables, described in sections below.

The basic approach is to load the data, then create a bunch of boolean columns corresponding to the various **conditions** (including bins) we would like to count. Then we compute a count table, where each row is a different combination of these boolean values. 

The resulting table has many rows, most of which count combinations of categories we do not care about. So we subsequently groupby+sum to create sub-count-tables over the combinations of criteria we think may be meaningful.
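
As a rough illustration of these two steps, here is a minimal plain-Python sketch (no Spark). The toy records and the helper `roll_up` are hypothetical and not part of this notebook; they only mimic what `groupBy(...).count()` followed by `groupBy(...).agg(sum)` does.

```python
from collections import Counter

# Hypothetical toy records: each variant carries a few condition columns.
variants = [
    {"category": "rare", "pleio": 1, "CADD>=10": True, "phylop_significant": False},
    {"category": "rare", "pleio": 1, "CADD>=10": True, "phylop_significant": True},
    {"category": "common", "pleio": 0, "CADD>=10": False, "phylop_significant": False},
    {"category": "rare", "pleio": 1, "CADD>=10": False, "phylop_significant": False},
]

all_keys = ["category", "pleio", "CADD>=10", "phylop_significant"]

# Step 1: the full count table, one entry per observed combination of values.
full_counts = Counter(tuple(v[k] for k in all_keys) for v in variants)

def roll_up(counts, keys, keep):
    """Sum counts over every column not listed in `keep` (the groupby+sum step)."""
    idx = [keys.index(k) for k in keep]
    out = Counter()
    for combo, n in counts.items():
        out[tuple(combo[i] for i in idx)] += n
    return out

# Step 2: a sub-count-table over just the criteria of interest.
rarity_pleio = roll_up(full_counts, all_keys, ["category", "pleio"])
print(dict(rarity_pleio))
```

Because the roll-up only sums existing counts, adding a new sub-table does not require another pass over the variant-level data.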

At various points, we pickle lists of threshold criteria and dump them to disk for the downstream graphing scripts to use. (This allows changes in criteria to be passed quickly to the graphing scripts.)

(The old approach was more ad hoc, redoing the counting for each meaningful set of criteria. This caused redundant computation and made it harder to add more sets.)

# setup

In [11]:
data_output_base="/home/mcn26/varef/scripts/noon_data/4.count/"

## import 

In [2]:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql import functions as F
import pyspark.sql.types as T
import pickle
import pandas as pd

## create a spark session

In [3]:
conf = SparkConf().setAppName("Count")

# Stop any Spark session left over from a previous run of this cell
if 'spark' in locals() and spark is not None:
    spark.stop()

# Create a SparkContext with the specified configuration
sc = SparkContext(conf=conf)

# Create a SparkSession from the SparkContext
spark = SparkSession(sc)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/02/15 16:54:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/02/15 16:54:08 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


## Load in gnomAD variants filtered in the last script

In [39]:
#loading in all autosomes
#Skipping sex chromosomes, see readme
df = spark.read \
    .option("comment", "#") \
    .option("delimiter", ",") \
    .csv("/home/mcn26/varef/scripts/noon_data/3.pleio_and_filter/chr*/*.csv.gz", header=True)

                                                                                

## cast columns to the appropriate types & drop rows with null values

We could drop only those rows with a null Malinois skew when computing Malinois-skew-based metrics, only those with no PhyloP score when computing PhyloP-based metrics, and so on. However, each graph would then summarize a different set of variants, which could introduce bias if, for example, PhyloP scores are annotated for a nonrandom set of variants. Therefore I will drop rows with null data in any relevant columns prior to subsequent analysis.
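
To make the concern concrete, here is a small plain-Python sketch. The rows and the helper `ids_with` are hypothetical; the point is only that per-metric filtering leaves each analysis with a different variant set, while filtering once on all columns gives a shared set.

```python
# Hypothetical toy rows; None marks a missing annotation.
rows = [
    {"id": "v1", "mean_skew": 0.2, "P_ANNO": 1.1},
    {"id": "v2", "mean_skew": None, "P_ANNO": 3.0},
    {"id": "v3", "mean_skew": -1.4, "P_ANNO": None},
    {"id": "v4", "mean_skew": 0.9, "P_ANNO": 2.5},
]

def ids_with(rows, columns):
    """IDs of rows where every listed column is non-null."""
    return {r["id"] for r in rows if all(r[c] is not None for c in columns)}

# Per-metric filtering: each analysis would see a different set of variants.
skew_only = ids_with(rows, ["mean_skew"])
phylop_only = ids_with(rows, ["P_ANNO"])

# Filtering on all relevant columns at once: every analysis sees the same set.
shared = ids_with(rows, ["mean_skew", "P_ANNO"])
print(sorted(skew_only), sorted(phylop_only), sorted(shared))
```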

In [40]:
int_columns=["POS","AC","AN","pleio"]
float_columns=["AF","K562__ref","HepG2__ref","SKNSH__ref","K562__alt","HepG2__alt","SKNSH__alt","K562__skew","HepG2__skew","SKNSH__skew","cadd_phred","P_ANNO","mean_ref","mean_skew","MAF"]
cre_bool_columns=[]
for column in df.columns:
    if column.startswith("is_in"):
        cre_bool_columns.append(column)

In [41]:
#drop rows with a null in any column
#(to restrict to specific columns, pass subset=["CHROM","POS","cadd_phred","P_ANNO","mean_ref","mean_skew","category"]+cre_bool_columns)
df = df.dropna()

In [42]:

for column in int_columns:
    df = df.withColumn(column, F.col(column).cast(T.IntegerType()))

for column in float_columns:
    df = df.withColumn(column, F.col(column).cast(T.FloatType()))

for column in cre_bool_columns:
    df = df.withColumn(column, F.col(column).cast(T.BooleanType()))

    
df_cre=df

# Add conditions

Pleiotropy (the `pleio` column) has already been added.

## Phylop

In [43]:
df_cre=df_cre.withColumn("phylop_significant",F.col("P_ANNO")>=2.27)

## CADD

In [44]:
#one boolean column per CADD threshold (thresholds are cumulative, not exclusive bins)
cadd_thresholds=[10,20,30,40,50]
cadd_columns=[f"CADD>={t}" for t in cadd_thresholds]

for t,name in zip(cadd_thresholds,cadd_columns):
    df_cre=df_cre.withColumn(name,F.col("cadd_phred")>=t)

In [45]:
with open("cadd_columns.pkl",'wb') as file:
    pickle.dump(cadd_columns,file)
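
For reference, a downstream script could read the list back instead of hard-coding the names. The sketch below is a self-contained round trip; the temporary directory is an assumption standing in for wherever the notebook is actually run.

```python
import os
import pickle
import tempfile

# Same list as `cadd_columns` in this notebook.
cadd_columns = ["CADD>=10", "CADD>=20", "CADD>=30", "CADD>=40", "CADD>=50"]

# A temporary directory stands in for the real working directory (assumption).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "cadd_columns.pkl")

# Counting side: write the list of column names.
with open(path, "wb") as file:
    pickle.dump(cadd_columns, file)

# Graphing side: read the names back.
with open(path, "rb") as file:
    loaded_columns = pickle.load(file)

print(loaded_columns)
```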

## malinois

Add a mean column

In [46]:
df_cre=df_cre.withColumn("mean_alt", (F.col("K562__alt") + F.col("HepG2__alt") + F.col("SKNSH__alt")) / 3)

Some helper functions

In [47]:
def get_column_names(var):
    final_names=[]
    for sub in var:
        final_names.append(sub[0])
    return final_names

def dump_cutoff_names_to_disc(var,name):
    #so we don't have to hard-code the names in multiple files. 
    #It's ugly enough that we're hard-coding the thresholds
    with open(name+'.pkl', 'wb') as file:
        final_names=get_column_names(var)
        pickle.dump(final_names, file)

#Ugly code! Really ought to combine make_reference_cutoffs & make_skew_cutoffs into one function that takes a list of intervals
#then a second function that can make intervals based on start/stop/step
def make_reference_cutoffs(name):
    return [
        [f"{name}_(-Inf,-6)", (F.col(name) < -6)]
    ] + [
        [f"{name}_[{i},{i+1})", (F.col(name) >= i) & (F.col(name) < i+1)] for i in range(-6, 6)
    ] + [
        [f"{name}_[6,Inf)", (F.col(name) >= 6)]
    ]

def make_skew_cutoffs(name):
    #bins are indexed by integers; index i covers [i*0.5, (i+1)*0.5)
    start_int = -9   # first index; its bin is replaced by the open-ended (-Inf, -4.0)
    end_int = 9      # range() stop; the last index (8) is replaced by the open-ended upper bin
    step_int = 1     # one index step = a bin width of 0.5

    cutoffs = []
    for i in range(start_int, end_int, step_int):
        lo, hi = i * 0.5, (i + step_int) * 0.5
        if i == start_int:
            cutoffs.append([f"{name}_(-Inf, -4.0)", (F.col(name) < -4.0)])
        elif i == end_int - step_int:
            #NB: the label is written with "(" but this bin includes 4.0 (>=)
            cutoffs.append([f"{name}_(4.0, Inf)", (F.col(name) >= 4.0)])
        else:
            cutoffs.append([f"{name}_[{lo:.1f}, {hi:.1f})", (F.col(name) >= lo) & (F.col(name) < hi)])
    return cutoffs

def apply_cutoffs(df,cutoffs):
    df_working=df
    for name,cutoff_condition in cutoffs:
        df_working=df_working.withColumn(name,cutoff_condition)
    return df_working
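
The comment above suggests merging the two cutoff builders into one function that takes a list of intervals, plus a second that builds intervals from start/stop/step. One possible shape is sketched below in plain Python, with predicates as ordinary functions rather than Spark `Column` expressions so that it runs without Spark. The names `make_edges` and `make_interval_cutoffs` are hypothetical, and the labels it produces have no space after the comma, so they differ slightly from those of `make_skew_cutoffs`.

```python
import math

def make_edges(start, stop, step):
    """Bin edges from start to stop (inclusive) in increments of step."""
    n_steps = round((stop - start) / step)
    return [start + k * step for k in range(n_steps + 1)]

def make_interval_cutoffs(name, edges):
    """Return [label, predicate] pairs for half-open bins [lo, hi),
    plus an open-ended bin below the first edge and one from the last edge up."""
    bounds = [-math.inf] + list(edges) + [math.inf]
    cutoffs = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        left = "(-Inf" if lo == -math.inf else f"[{lo}"
        right = "Inf)" if hi == math.inf else f"{hi})"
        # Bind lo/hi as defaults so each predicate keeps its own bounds.
        cutoffs.append([f"{name}_{left},{right}", lambda x, lo=lo, hi=hi: lo <= x < hi])
    return cutoffs

ref_cutoffs = make_interval_cutoffs("mean_ref", make_edges(-6, 6, 1))
skew_cutoffs = make_interval_cutoffs("mean_skew", make_edges(-4.0, 4.0, 0.5))

# Each value should fall into exactly one bin.
matching = [label for label, pred in skew_cutoffs if pred(0.25)]
print(matching)
```

In a Spark version, the lambda would presumably be replaced by the corresponding `F.col(name)` comparison, as in the existing helpers.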

Create the thresholds

In [48]:
#list of [skew, ref] column-name pairs we would like to use.
cuts= [["mean_skew" , "mean_ref"],["K562__skew","K562__ref"],["HepG2__skew","HepG2__ref"],["SKNSH__skew","SKNSH__ref"]]

#create the actual cutoffs for each pair
cuts=[{"skew_name":i[0],'skew_cuts':make_skew_cutoffs(i[0]),'ref_name':i[1],'ref_cuts':make_reference_cutoffs(i[1])} for i in cuts]
#dump the cutoff names to disk
#(dump_cutoff_names_to_disc already appends ".pkl", so pass the bare name to avoid ".pkl.pkl")
for i in cuts:
    dump_cutoff_names_to_disc(var=i["skew_cuts"],name=i["skew_name"])
    dump_cutoff_names_to_disc(var=i["ref_cuts"],name=i["ref_name"])

Apply all cuts & save their names for later use.

In [49]:
all_cuts=[]

for i in cuts:
    df_cre=apply_cutoffs(df_cre,i["skew_cuts"])
    df_cre=apply_cutoffs(df_cre,i["ref_cuts"])
    
    #all_cuts=all_cuts+i["ref_cuts"]+i["skew_cuts"]
    all_cuts=all_cuts+[sublist[0] for sublist in i["skew_cuts"]]
    all_cuts=all_cuts+[sublist[0] for sublist in i["ref_cuts"]]

# perform actual count


Replace commas with carets (`^`) and periods with ampersands (`&`) in the names of the columns we group by.

In [50]:
cell_types=["K562","SKNSH","HepG2"]

to_group_by=cadd_columns+cre_bool_columns+["category","pleio","phylop_significant"]+["emVar_"+i for i in cell_types]+all_cuts
renamed_column_map = {col: col.replace(',', '^').replace('.','&') for col in to_group_by}
new_group=[renamed_column_map[col] for col in to_group_by]

for old_name, new_name in renamed_column_map.items():
    df_cre = df_cre.withColumnRenamed(old_name, new_name)
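
As a quick check of what the renaming does to a bin label, here is a plain-Python sketch. `sanitize` applies the same replacements as the cell above; `restore` is a hypothetical inverse that a downstream script might use, and it is only valid as long as the original names contain no `^` or `&`.

```python
def sanitize(name):
    """Same replacement as the cell above: ',' -> '^' and '.' -> '&'."""
    return name.replace(",", "^").replace(".", "&")

def restore(name):
    """Hypothetical inverse; valid only if the original name has no '^' or '&'."""
    return name.replace("^", ",").replace("&", ".")

original = "mean_skew_[-0.5, 0.0)"
safe = sanitize(original)
print(safe)  # mean_skew_[-0&5^ 0&0)
```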

In [51]:
count_table = df_cre.groupBy(new_group).count()

In [52]:
#note: this cell will take substantial time & resources to execute.

count_table.coalesce(1).write.csv(data_output_base+"count_all.csv", mode="overwrite", header=True)

                                                                                

Reload `count_table` from disk to avoid recomputation.

In [53]:
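count_table = spark.read.csv(data_output_base+"count_all.csv/*.csv", header=True)
#without a schema every column is read back as a string; cast count to an integer type before summing
count_table = count_table.withColumn("count", F.col("count").cast(T.LongType()))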

# Subset & write

In [54]:
def dump(name,spark_df):
    spark_df.coalesce(1).write.csv(data_output_base+name, mode="overwrite", header=True)

## pleiotropy vs rarity vs genomic regions

In [55]:
dump("rarity_pleio",
     count_table.groupBy("category", "pleio", *cre_bool_columns)\
                    .agg(F.sum("count").alias("count"))
    )

                                                                                

## phylop vs rarity vs genomic regions

In [56]:
dump("phylop_count_table",
     count_table.groupBy("category", "phylop_significant", *cre_bool_columns)\
                    .agg(F.sum("count").alias("count"))
    )

                                                                                

## phylop vs pleiotropy vs genomic regions

In [57]:
dump("phylop_pleio",
     count_table.groupBy("pleio", "phylop_significant", *cre_bool_columns)\
                    .agg(F.sum("count").alias("count"))
    )

                                                                                

## cadd vs rarity vs genomic regions

In [58]:
dump("CADD_count_table",
     count_table.groupBy("category", *cadd_columns, *cre_bool_columns)\
                    .agg(F.sum("count").alias("count"))
    )

                                                                                

## cadd vs pleiotropy vs genomic regions

In [59]:
dump("CADD_pleio",
     count_table.groupBy("pleio", *cadd_columns, *cre_bool_columns)\
                    .agg(F.sum("count").alias("count"))
    )

                                                                                

## malinois skew vs malinois reference, (malinois both mean & per cell type) vs genomic regions vs rarity category

We'll do different files for different cell-types (+ mean).

In [60]:
#each item of `cuts` is a cell-type (plus mean)
mean_cut=None
mean_thresh=None

for i in cuts:
    celltype=i["skew_name"].split("_")[0]

    #rarity category & genomic regions
    to_group_by=["category"]+cre_bool_columns
    #add skew & ref columns for the current cell type (or mean)
    to_group_by=to_group_by+get_column_names(i["skew_cuts"])+get_column_names(i["ref_cuts"])

    #remove illegal characters
    to_group_by=[item.replace(',', '^').replace('.','&') for item in to_group_by]

    #save mean for later use
    if celltype=="mean":
        mean_cut=i
        mean_thresh=to_group_by

    dump(f"malinois_{celltype}",
         count_table.groupBy(*to_group_by)\
                    .agg(F.sum("count").alias("count"))
    )
    

                                                                                

## malinois skew vs malinois reference (mean only) vs phylop (or cadd) vs genomic regions

In [61]:
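#use the grouping columns saved for the mean in the loop above;
#after the loop, `to_group_by` holds the columns of the last cell type, not the mean
dump("malinois_vs_phylop",
     count_table.groupBy(*mean_thresh,"phylop_significant")\
                .agg(F.sum("count").alias("count"))
)

dump("malinois_vs_cadd",
     count_table.groupBy(*mean_thresh,*cadd_columns)\
                .agg(F.sum("count").alias("count"))
)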

                                                                                