In this notebook we use the Enformer model to check whether the mutations alter transcriptional regulation and thereby affect mRNA expression.
Note: this notebook runs in a separate environment that is compatible with Enformer.

What do I need to do in this notebook? 

1. Create a df for the mutations with the following columns, required to run Enformer: 'mut_id', 'Chromosome', 'Variant_Type', 'TSS', 'Transcript stable ID','ref_allele', 'mut_allele', 'Start_Position', 'End_Position'

2. Do the same for random mutations that are within range of the TSS (100 random synonymous T->C mutations and 100 random non-synonymous T->G mutations).

3. Save the metric results and compare them to see whether our mutations received a significant score.
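
As a small illustration of step 1, the sketch below checks a table's column names against the list of required columns. The helper `missing_columns` and the example column list are hypothetical and are not part of `enformer_utils`; the required names are copied from step 1.

```python
# Columns listed in step 1 as required to run Enformer.
REQUIRED_COLUMNS = [
    "mut_id", "Chromosome", "Variant_Type", "TSS", "Transcript stable ID",
    "ref_allele", "mut_allele", "Start_Position", "End_Position",
]


def missing_columns(columns):
    """Return the required columns that are absent from `columns` (hypothetical helper)."""
    present = set(columns)
    return [col for col in REQUIRED_COLUMNS if col not in present]


# Example: a table that still uses MAF-style column names.
example_columns = ["Chromosome", "Start_Position", "End_Position",
                   "Reference_Allele", "Tumor_Seq_Allele2", "Variant_Type"]
print(missing_columns(example_columns))
```

With a pandas dataframe, the same check could be called as `missing_columns(df.columns)` before handing the dataframe to the Enformer wrapper.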


## Imports

In [None]:
import os
import pathlib
from typing import Callable
import pandas as pd
import numpy as np

import enformer_utils as efut
from Alphabet_seq_enc import Alphabet_N_seq_enc


In [2]:
TSS = 87600884 # location of the TSS
mut1 = 87550285 # T1236C
mut2 = 87531302 # T2677G (same position as Start_Position in the mutation dataframe)
mut3 = 87509329 # T3435C

# distance between each mutation and the TSS
dist1 = TSS - mut1
dist2 = TSS - mut2
dist3 = TSS - mut3


In [3]:
dist1

50599

In [4]:
dist2

69582

In [5]:
dist3 #not in the Enformer range! too far from the TSS...

91555
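
The distance check can be made explicit. Enformer predicts 896 bins of 128 bp each, which is the central 114,688 bp of its input sequence. The sketch below assumes that the input is centred on the variant and that the limiting factor is whether the TSS falls inside this output window; whether `enformer_utils` applies exactly this criterion is an assumption. The variant positions are the `Start_Position` values from the mutation dataframe, and `tss_in_output_window` is a hypothetical helper.

```python
# Enformer output geometry: 896 bins of 128 bp (central 114,688 bp of the input).
BIN_SIZE = 128
N_BINS = 896
HALF_WINDOW = BIN_SIZE * N_BINS // 2  # 57,344 bp on each side of the centre

REAL_TSS = 87600884
variants = {"T1236C": 87550285, "T2677G": 87531302, "T3435C": 87509329}


def tss_in_output_window(variant_pos, tss, half_window=HALF_WINDOW):
    """True if the TSS lies inside the output window of a sequence centred on the variant."""
    return abs(tss - variant_pos) < half_window


for name, pos in variants.items():
    print(name, REAL_TSS - pos, tss_in_output_window(pos, REAL_TSS))
```

Under these assumptions only T1236C has the real TSS inside the output window.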

## Main

In [2]:
'''
Define variables relevant for all MDR1 mutations 
'''

Chromosome = "chr7" 
Variant_Type = "SNP"
TSS = 87550485 #Note: this is not the real TSS. Our Enformer wrapper code requires the mutation to be relatively close
#to the TSS, which is not the case for T2677G and T3435C, so we use this placeholder TSS instead. This does not affect the Enformer
#output at all; it only lets the code run without raising an error.

#TSS = 87600884 #the real location of the TSS

Transcript_stable_ID = "ENST00000622132"
Gene = "MDR1"

''' 
Define variables needed to run Enformer, regardless of the specific df 
'''

chromosome_path = pathlib.Path("../Data/Genomes/Human/human_hg38/Chromosome") #path to obtain the chromosome sequence for the enformer input
df_tracks = pd.read_pickle(efut.Enformer_target_info_file) #CAGE tracks


In [5]:
''' Create the df with our info about the mutations '''

original_df = pd.read_pickle("../Data/MDR1_3_muts_df.pickle")

#change the format of the df to fit the input of Enformer
original_df["Chromosome"] = original_df["Chromosome"].apply(lambda x: "chr"+str(x)) #instead of Chromosome = 7, Chromosome = chr7 
original_df["TSS"] = TSS
original_df = original_df.rename(columns = {'Transcript_ID': 'Transcript stable ID', 'Reference_Allele': 'ref_allele',
                                            'Tumor_Seq_Allele2': 'mut_allele'}) #rename several columns
cols_mut_id = ["Gene", "Chromosome", "Start_Position", "ref_allele", "mut_allele"] #different format of "mut_id"
original_df['mut_id'] = original_df.apply(lambda x: ":".join([str(x[col]) for col in cols_mut_id]), axis = 1)
    
print("The dataframe with the original mutations:")
display(original_df)

The dataframe with the original mutations:


Unnamed: 0,mut_id,gene_affected,Gene,Chromosome,Start_Position,End_Position,ref_allele,mut_allele,Variant_Type,Variant_Classification,Transcript stable ID,is_forward,cds_position,TSS
0,ABCB1:chr7:87509329:A:G,ENSG00000085563,ABCB1,chr7,87509329,87509329,A,G,SNP,Silent,ENST00000622132,False,3435,87550485
1,ABCB1:chr7:87550285:A:G,ENSG00000085563,ABCB1,chr7,87550285,87550285,A,G,SNP,Silent,ENST00000622132,False,1236,87550485
2,ABCB1:chr7:87531302:A:C,ENSG00000085563,ABCB1,chr7,87531302,87531302,A,C,SNP,Missense_Mutation,ENST00000622132,False,2677,87550485
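
The `mut_id` column shown above is a colon-joined string of the columns in `cols_mut_id`. A minimal pure-Python sketch of the same construction for a single row, using a plain dict in place of a dataframe row (`make_mut_id` is a hypothetical helper name):

```python
cols_mut_id = ["Gene", "Chromosome", "Start_Position", "ref_allele", "mut_allele"]


def make_mut_id(row, cols=cols_mut_id, sep=":"):
    """Join the selected fields of one row into a single identifier string."""
    return sep.join(str(row[col]) for col in cols)


# Values copied from the first row of the dataframe above.
row = {"Gene": "ABCB1", "Chromosome": "chr7", "Start_Position": 87509329,
       "ref_allele": "A", "mut_allele": "G"}
print(make_mut_id(row))
```

Converting every field with `str` matters because `Start_Position` is an integer and `str.join` only accepts strings.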


In [3]:
'''
Load the Enformer model
'''

# load the model
enformer = efut.load_enformer()

# one-hot encoder
one_hot_encoder: Callable[[str], np.ndarray] = Alphabet_N_seq_enc(alphabet='ACGT', non_char='N', non_char_val=0.0).one_hot_enc_seq


2023-06-28 22:23:51.280241: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2023-06-28 22:23:51.280309: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2023-06-28 22:23:51.280374: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (compute-0-340.power): /proc/driver/nvidia/version does not exist
2023-06-28 22:23:51.281448: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
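
For readers unfamiliar with the encoder, the sketch below shows what a one-hot encoding over the alphabet `ACGT` with `N` mapped to an all-zero row could look like. It is an illustration only: the real `Alphabet_N_seq_enc` class returns a NumPy array and its internals are not shown in this notebook, so the function `one_hot_encode` and its behaviour for unknown characters are assumptions.

```python
def one_hot_encode(seq, alphabet="ACGT", non_char_val=0.0):
    """Encode a DNA string as a list of rows, one row of len(alphabet) floats per base.

    Bases outside the alphabet (e.g. 'N') get a row filled with `non_char_val`.
    """
    index = {base: i for i, base in enumerate(alphabet)}
    encoded = []
    for base in seq.upper():
        if base in index:
            row = [0.0] * len(alphabet)
            row[index[base]] = 1.0
        else:
            row = [non_char_val] * len(alphabet)
        encoded.append(row)
    return encoded


for base, row in zip("ACGTN", one_hot_encode("ACGTN")):
    print(base, row)
```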


In [7]:
''' Run Enformer on the dataframe of original mutations '''

store_path = pathlib.Path("../Results/Enformer/Original_raw") #path to store the results

#Enformer output is very large. We save only a summary of the output, computed with several different metrics.
efut.run_enformer_on_dataframe(original_df, store_path, chromosome_path, one_hot_encoder, enformer)
#efut.run_enformer_on_dataframe(original_df, store_path, chromosome_path, one_hot_encoder, enformer, metric_funcs = [efut.metric_wrapper_function], df_tracks = df_tracks)


2023-06-28 13:46:36.873259: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2023-06-28 13:46:36.989471: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2300000000 Hz


In [10]:
''' Create the dataframe for random synonymous T->C mutations (to compare to T1236C and T3435C) '''

#get all the possible synonymous T->C mutations in the MDR1 gene (created in "create_MDR1_randomization_dfs.ipynb")
df_syn_T_C = pd.read_pickle("../Data/random_mutations_for_pvals/synonymous_T2C.pickle")

#change the format of the df to fit the input of Enformer
df_syn_T_C["Chromosome"] = Chromosome
df_syn_T_C["Variant_Type"] = Variant_Type
df_syn_T_C["Transcript stable ID"] = Transcript_stable_ID
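#the transcript is on the reverse strand (is_forward = False), so the transcript alleles are reverse-complemented to get the genomic alleles
df_syn_T_C["ref_allele"] = df_syn_T_C["Changed_from"].apply(efut.reverse_complement)
df_syn_T_C["mut_allele"] = df_syn_T_C["Changed_to"].apply(efut.reverse_complement)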
df_syn_T_C["Start_Position"] = df_syn_T_C["Chromosome_position_1_based"]
df_syn_T_C["End_Position"] = df_syn_T_C["Start_Position"]
df_syn_T_C["Gene"] = Gene
df_syn_T_C['mut_id'] = df_syn_T_C.apply(lambda x: ":".join([str(x[col]) for col in cols_mut_id]), axis = 1)
df_syn_T_C['TSS'] = TSS


#sample 100 mutations from the df and predict using Enformer
num_muts =  100
# ==================
df_syn_T_C_sampled = df_syn_T_C.sample(n=num_muts, axis='index')
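
The alleles are reverse-complemented because the mutation names (T->C, T->G) refer to the transcript strand, while the mutation dataframe reports `is_forward = False` and lists the genomic alleles as A->G and A->C. The sketch below is a stand-alone version of a reverse complement; it is assumed, not confirmed, to match what `efut.reverse_complement` does.

```python
# Translation table for complementing DNA bases (upper and lower case, N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


# T->C on the transcript strand corresponds to A->G on the genomic forward strand.
print(reverse_complement("T"), "->", reverse_complement("C"))
# T->G on the transcript strand corresponds to A->C.
print(reverse_complement("T"), "->", reverse_complement("G"))
```

Separately, passing `random_state` to `DataFrame.sample` would make the draw of 100 random mutations reproducible between runs.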


In [None]:
''' Run Enformer on the dataframe of random T->C mutations '''

store_path = pathlib.Path("../Results/Enformer/synT2C_raw") #path to store the results

#Enformer output is very large. We save only a summary of the output, computed with several different metrics.
efut.run_enformer_on_dataframe(df_syn_T_C_sampled, store_path, chromosome_path, one_hot_encoder, enformer)


In [6]:
''' Create the dataframe for random non-synonymous T->G mutations (to compare to T2677G) '''

#get all the possible non-synonymous T->G mutations in the MDR1 gene (created in "create_MDR1_valid_mutations_dfs_new.ipynb")
df_nonsyn_T_G = pd.read_pickle("../Data/random_mutations_for_pvals/nonsynonymous_T2G.pickle")

#change the format of the df to fit the input of Enformer
df_nonsyn_T_G["Chromosome"] = Chromosome
df_nonsyn_T_G["Variant_Type"] = Variant_Type
df_nonsyn_T_G["Transcript stable ID"] = Transcript_stable_ID
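#the transcript is on the reverse strand (is_forward = False), so the transcript alleles are reverse-complemented to get the genomic alleles
df_nonsyn_T_G["ref_allele"] = df_nonsyn_T_G["Changed_from"].apply(efut.reverse_complement)
df_nonsyn_T_G["mut_allele"] = df_nonsyn_T_G["Changed_to"].apply(efut.reverse_complement)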
df_nonsyn_T_G["Start_Position"] = df_nonsyn_T_G["Chromosome_position_1_based"]
df_nonsyn_T_G["End_Position"] = df_nonsyn_T_G["Start_Position"]
df_nonsyn_T_G["Gene"] = Gene
df_nonsyn_T_G['mut_id'] = df_nonsyn_T_G.apply(lambda x: ":".join([str(x[col]) for col in cols_mut_id]), axis = 1)
df_nonsyn_T_G['TSS'] = TSS

#sample 100 mutations from the df and predict using Enformer
num_muts =  100
# ==================
df_nonsyn_T_G_sampled = df_nonsyn_T_G.sample(n=num_muts, axis='index')


In [7]:
''' Run Enformer on the dataframe of random T->G mutations '''

store_path = pathlib.Path("../Results/Enformer/nonsynT2G_raw") #path to store the results

#Enformer output is very large. We save only a summary of the output, computed with several different metrics.
efut.run_enformer_on_dataframe(df_nonsyn_T_G_sampled, store_path, chromosome_path, one_hot_encoder, enformer)


2023-06-28 22:24:42.322785: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2023-06-28 22:24:42.452976: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2300000000 Hz


Processed 50 rows ...


## Compare T1236C, T2677G and T3435C to random variants with similar characteristics

In [33]:
'''
T1236C
'''

'''original results'''

#get the enformer output
original = pd.read_pickle(f"../Results/Enformer/Original_raw/ABCB1_chr7_87550285_A_G__ENST00000622132_87550485.pkl")
tss_bin, _ = efut.pos2bin(original["predt_start"], original["TSS"])
df_tracks = pd.read_pickle("./enformer_targets_df.pkl")

#take the raw output and use several metrics to create a single score from it 
metrics_dict = efut.metric_wrapper_function_mdr1(original["Enformer"], tss_bin, df_tracks = df_tracks)
metrics = metrics_dict.keys()

#map the scores from a dictionary to a df for convenience
original_results_df = pd.DataFrame()
for metric in metrics:
    original_results_df.loc[0,metric] = metrics_dict[metric]
    
'''do the same for random (T->C synonymous mutations) variants'''

#get the enformer output
directory = f"../Results/Enformer/synT2C_raw/" #iterate over the results of the random variants
random_results_df = pd.DataFrame()
    
for i, filename in enumerate(os.listdir(directory)):
    f = os.path.join(directory, filename)
    res = pd.read_pickle(f)
    tss_bin, _ = efut.pos2bin(res["predt_start"], res["TSS"])
    #take the raw output and use several metrics to create a single score from it 
    metrics_dict = efut.metric_wrapper_function_mdr1(res["Enformer"], tss_bin, df_tracks = df_tracks)
    #map the scores from a dictionary to a df for convenience
    for metric in metrics:
        random_results_df.loc[i,metric] = metrics_dict[metric]
        
#compare - for each metric, what fraction of the random variants does T1236C outscore? (this fraction is printed as "P-value")
for metric in metrics:
    pval = np.sum(original_results_df[metric].values > random_results_df[metric].values) / (len(random_results_df[metric].values))
    print(f"P-value of metric {metric} is {pval}")

        

P-value of metric bmean_tmean_smean is 0.51
P-value of metric bmean_tmeanCAGE_smean is 0.56
P-value of metric bmean_tmeanMDR1_smean is 0.41
P-value of metric bmean_tmeanCAGEMDR1_smean is 0.53
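
The value printed as "P-value" in the cell above is the fraction of random variants that the original variant outscores, so values near 1 (not near 0) would indicate an unusually high score. The sketch below contrasts that quantity with a more conventional one-sided empirical p-value that uses the common +1 correction. Both function names are hypothetical, the toy numbers are not Enformer results, and which tail counts as "extreme" for each metric is an assumption the analyst has to make.

```python
def fraction_outscored(observed, null_scores):
    """Fraction of null scores strictly below the observed score (what the cell above prints)."""
    return sum(1 for s in null_scores if observed > s) / len(null_scores)


def empirical_p_value(observed, null_scores, higher_is_extreme=True):
    """One-sided empirical p-value with the +1 correction: (r + 1) / (n + 1)."""
    if higher_is_extreme:
        n_extreme = sum(1 for s in null_scores if s >= observed)
    else:
        n_extreme = sum(1 for s in null_scores if s <= observed)
    return (n_extreme + 1) / (len(null_scores) + 1)


# Toy example with made-up scores.
toy_null = [0.1, 0.2, 0.3, 0.4]
toy_observed = 0.35
print(fraction_outscored(toy_observed, toy_null))
print(empirical_p_value(toy_observed, toy_null))
```

With 100 random variants, the smallest p-value this estimator can return is 1/101, which bounds how strong a claim of significance the comparison can support.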


In [35]:
'''
T2677G
'''

'''original results'''

#get the enformer output
original = pd.read_pickle(f"../Results/Enformer/Original_raw/ABCB1_chr7_87531302_A_C__ENST00000622132_87550485.pkl")
tss_bin, _ = efut.pos2bin(original["predt_start"], original["TSS"])
df_tracks = pd.read_pickle("./enformer_targets_df.pkl")

#take the raw output and use several metrics to create a single score from it 
metrics_dict = efut.metric_wrapper_function_mdr1(original["Enformer"], tss_bin, df_tracks = df_tracks)
metrics = metrics_dict.keys()

#map the scores from a dictionary to a df for convenience
original_results_df = pd.DataFrame()
for metric in metrics:
    original_results_df.loc[0,metric] = metrics_dict[metric]
    
'''do the same for random (T->G non-synonymous mutations) variants'''

#get the enformer output
directory = f"../Results/Enformer/nonsynT2G_raw/" #iterate over the results of the random variants
random_results_df = pd.DataFrame()
    
for i, filename in enumerate(os.listdir(directory)):
    f = os.path.join(directory, filename)
    res = pd.read_pickle(f)
    tss_bin, _ = efut.pos2bin(res["predt_start"], res["TSS"])
    #take the raw output and use several metrics to create a single score from it 
    metrics_dict = efut.metric_wrapper_function_mdr1(res["Enformer"], tss_bin, df_tracks = df_tracks)
    #map the scores from a dictionary to a df for convenience
    for metric in metrics:
        random_results_df.loc[i,metric] = metrics_dict[metric]
        
#compare - for each metric, what fraction of the random variants does T2677G outscore? (this fraction is printed as "P-value")
for metric in metrics:
    pval = np.sum(original_results_df[metric].values > random_results_df[metric].values) / (len(random_results_df[metric].values))
    print(f"P-value of metric {metric} is {pval}")

        

P-value of metric bmean_tmean_smean is 0.44
P-value of metric bmean_tmeanCAGE_smean is 0.4
P-value of metric bmean_tmeanMDR1_smean is 0.36
P-value of metric bmean_tmeanCAGEMDR1_smean is 0.39
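
The three comparison cells repeat the same loop over a results directory. One possible way to factor it out is sketched below: a hypothetical helper `collect_scores` that applies a scoring function to every pickled result file. It uses the standard-library `pickle` module and a toy directory so that it runs on its own; in the notebook, `pd.read_pickle` and `efut.metric_wrapper_function_mdr1` would take the place of `pickle.load` and the toy scoring function.

```python
import os
import pickle
import tempfile


def collect_scores(directory, score_fn, suffix=".pkl"):
    """Apply `score_fn` to every pickled result in `directory`; return {filename: scores}."""
    scores = {}
    for filename in sorted(os.listdir(directory)):  # sorted for a stable order
        if not filename.endswith(suffix):
            continue  # skip anything that is not a result file
        with open(os.path.join(directory, filename), "rb") as handle:
            result = pickle.load(handle)
        scores[filename] = score_fn(result)
    return scores


def toy_score(result):
    """Toy scoring function: mean of the stored values (not an Enformer metric)."""
    values = result["Enformer"]
    return {"mean": sum(values) / len(values)}


# Toy demonstration with fake results (not Enformer output).
with tempfile.TemporaryDirectory() as tmp:
    for name, values in {"a.pkl": [1.0, 3.0], "b.pkl": [2.0, 2.0, 5.0]}.items():
        with open(os.path.join(tmp, name), "wb") as handle:
            pickle.dump({"Enformer": values}, handle)
    open(os.path.join(tmp, "notes.txt"), "w").close()  # ignored by the suffix filter
    toy_scores = collect_scores(tmp, toy_score)

print(toy_scores)
```

Filtering on the file suffix avoids errors from stray files such as notebook checkpoints in the results directory.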


In [36]:
'''
T3435C
'''

'''original results'''

#get the enformer output
original = pd.read_pickle(f"../Results/Enformer/Original_raw/ABCB1_chr7_87509329_A_G__ENST00000622132_87550485.pkl")
tss_bin, _ = efut.pos2bin(original["predt_start"], original["TSS"])
df_tracks = pd.read_pickle("./enformer_targets_df.pkl")

#take the raw output and use several metrics to create a single score from it 
metrics_dict = efut.metric_wrapper_function_mdr1(original["Enformer"], tss_bin, df_tracks = df_tracks)
metrics = metrics_dict.keys()

#map the scores from a dictionary to a df for convenience
original_results_df = pd.DataFrame()
for metric in metrics:
    original_results_df.loc[0,metric] = metrics_dict[metric]
    
'''do the same for random (T->C synonymous mutations) variants'''

#get the enformer output
directory = f"../Results/Enformer/synT2C_raw/" #iterate over the results of the random variants
random_results_df = pd.DataFrame()
    
for i, filename in enumerate(os.listdir(directory)):
    f = os.path.join(directory, filename)
    res = pd.read_pickle(f)
    tss_bin, _ = efut.pos2bin(res["predt_start"], res["TSS"])
    #take the raw output and use several metrics to create a single score from it 
    metrics_dict = efut.metric_wrapper_function_mdr1(res["Enformer"], tss_bin, df_tracks = df_tracks)
    #map the scores from a dictionary to a df for convenience
    for metric in metrics:
        random_results_df.loc[i,metric] = metrics_dict[metric]
        
#compare - for each metric, what fraction of the random variants does T3435C outscore? (this fraction is printed as "P-value")
for metric in metrics:
    pval = np.sum(original_results_df[metric].values > random_results_df[metric].values) / (len(random_results_df[metric].values))
    print(f"P-value of metric {metric} is {pval}")

        

P-value of metric bmean_tmean_smean is 0.61
P-value of metric bmean_tmeanCAGE_smean is 0.59
P-value of metric bmean_tmeanMDR1_smean is 0.59
P-value of metric bmean_tmeanCAGEMDR1_smean is 0.56
