We want to build randomization pools for the p-values of many analyses. Here we create two dfs containing the positions of the "valid" mutations that can be introduced into MDR1.
We will create: 

(1) A df with the positions of all possible T->C synonymous mutations (the pool to check T1236C and T3435C against).
(2) A df with the positions of all possible T->G non-synonymous mutations (the pool to check T2677G against).
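
As a rough illustration of how such a pool is typically used, the sketch below computes a one-sided empirical p-value for an observed variant against scores of random variants drawn from the pool. The function name `empirical_p_value`, the choice of a "greater or equal" tail, and the toy scores are all assumptions; the actual statistics and sidedness depend on each downstream analysis and are not defined in this notebook.

```python
import numpy as np


def empirical_p_value(observed, random_scores):
    """One-sided empirical p-value: fraction of random scores that are
    at least as large as the observed score, with a +1 correction so
    the estimate is never exactly zero."""
    random_scores = np.asarray(random_scores, dtype=float)
    n_as_extreme = int(np.sum(random_scores >= observed))
    return (n_as_extreme + 1) / (random_scores.size + 1)


# Toy scores only; real scores would come from an analysis of each random variant.
p = empirical_p_value(3.0, [1.0, 2.0, 3.0, 4.0])
print(p)
```

The +1 in numerator and denominator treats the observed variant as one member of the null set, which is a common convention for permutation-style tests.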


## Imports

In [20]:
import pandas as pd
import numpy as np
from Utils_MDR1 import get_possib_syn_sub_per_positon, codons_syn_maps_dict, nucs_dict, mutate_cds_sequence, variant_info

## Main

Let's load the information that is shared by all dfs: the MDR1 nucleotide (CDS) sequence, the synonymous substitution matrix of MDR1, and the mapping from CDS-relative to chromosome-relative positions.

In [2]:
''' Get MDR1 CDS sequence '''
gene = 'ENSG00000085563' #MDR1/ABCB1 gene
genes_dict = pd.read_pickle(f"../Data/cdna_{gene}.pickle.gz")
nt_CDS = genes_dict['data'][0]['homologies'][0]['source']['seq'][:-3] #removing the stop codon -> the MSA was done on amino acids and translated back, so there is no info on stop codons

In [None]:
''' Get a binary table of shape (4, gene_length) indicating the possible synonymous substitutions at each
position in the gene. Each row corresponds to a nucleotide in the DNA alphabet, sorted alphabetically (A, C, G, T).
For example, a "1" at [2,300] (0-based row and column) means that changing the nucleotide at position 300 to a
"G" would result in a synonymous substitution. '''

possible_syn_replacements_for_gene = get_possib_syn_sub_per_positon(nt_CDS ,codons_syn_maps_dict)
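
The real implementation of `get_possib_syn_sub_per_positon` lives in `Utils_MDR1` and is not shown here. The sketch below is one way such a table could be built, assuming the standard genetic code and rows ordered A, C, G, T. The names `synonymous_table`, `CODON_TABLE` and `ALPHABET` are hypothetical and only serve the illustration.

```python
import numpy as np

# Standard genetic code, codons enumerated with bases in T, C, A, G order.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: _AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(_BASES)
    for j, b2 in enumerate(_BASES)
    for k, b3 in enumerate(_BASES)
}

ALPHABET = "ACGT"  # row order of the table: A=0, C=1, G=2, T=3


def synonymous_table(cds):
    """Return a (4, len(cds)) 0/1 array; entry [r, p] is 1 when replacing
    position p (0-based) with ALPHABET[r] keeps the encoded amino acid."""
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    table = np.zeros((4, len(cds)), dtype=int)
    for pos, ref in enumerate(cds):
        offset = pos % 3                 # index of this position inside its codon
        start = pos - offset
        codon = cds[start:start + 3]
        for row, alt in enumerate(ALPHABET):
            if alt == ref:
                continue                 # same nucleotide: not a substitution
            mutated = codon[:offset] + alt + codon[offset + 1:]
            if CODON_TABLE[mutated] == CODON_TABLE[codon]:
                table[row, pos] = 1
    return table


# First three codons of the MDR1 CDS (Met, Asp, Leu) as a toy input.
toy_table = synonymous_table("ATGGATCTT")
print(toy_table)
```

In this toy input only position 5 (GAT -> GAC) and position 8 (CTT -> CTA, CTC or CTG) allow synonymous substitutions. Note that in this sketch a 0 also marks the reference nucleotide itself, so a 0 means "non-synonymous" only when combined with a check of the original nucleotide.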

In [7]:
''' Get the dictionary from cds relative position to chromosome relative position '''

mapping_dict = pd.read_pickle("../Data/cds_to_chrom_dict_with_protein_id.pickle")
gene_to_protein_dict = pd.read_pickle("../Data/gene_protein_dict.pickle")
#get the mapping of our specific gene
protein_id = gene_to_protein_dict[gene]
mapping_cur_gene = mapping_dict[gene,protein_id]


## (1) A df with positions of all possible T->C synonymous mutations

In [12]:
''' Get the pool of possible synonymous changes of T->C in MDR1''' 

changed_from = "T"
change_to = "C"

pos_T_nuc = [position for position, nucleotide in enumerate(nt_CDS) if nucleotide == changed_from] #positions of "T" nucleotide
pos_can_change_to_C = np.where(possible_syn_replacements_for_gene[nucs_dict[change_to],:] == 1)[0] #positions that can be *synonymously* changed to C
positions_pool = [pos for pos in pos_T_nuc if pos in pos_can_change_to_C] #the intersection is our pool to choose from
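
The same intersection can also be written with boolean masks, which avoids a membership test against a NumPy array for every "T" position. This is a minimal sketch on toy data: `toy_cds` and `toy_table` are hypothetical stand-ins for `nt_CDS` and `possible_syn_replacements_for_gene`, and the row order A, C, G, T is assumed.

```python
import numpy as np

# Toy stand-ins: a 9-nt CDS and its (4, length) synonymous table (rows A, C, G, T).
toy_cds = "ATGGATCTT"
toy_table = np.zeros((4, len(toy_cds)), dtype=int)
toy_table[1, 5] = 1    # GAT -> GAC keeps Asp
toy_table[:3, 8] = 1   # CTT -> CTA / CTC / CTG keep Leu

is_T = np.array(list(toy_cds)) == "T"    # positions holding a "T"
syn_to_C = toy_table[1, :] == 1          # positions where a change to "C" is synonymous
toy_pool = np.flatnonzero(is_T & syn_to_C)
print(toy_pool)
```

The result is a sorted array of 0-based CDS positions, equivalent to the list built by the comprehension above.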


In [15]:
''' Create a df with the position (CDS-relative and chromosome-relative) and the resulting mutated sequence '''

df_syn_T_C = pd.DataFrame()
df_syn_T_C["CDS_position_0_based"] = positions_pool
df_syn_T_C["Chromosome_position_1_based"] = df_syn_T_C["CDS_position_0_based"].apply(lambda x: mapping_cur_gene[x] + 1)
df_syn_T_C["Sequence"] = df_syn_T_C.apply(lambda x: mutate_cds_sequence(sequence = nt_CDS, position = x.CDS_position_0_based + 1, change_to = "C"), axis = 1)
df_syn_T_C["Changed_from"] = "T" #on the reverse strand!
df_syn_T_C["Changed_to"] = "C" #on the reverse strand!
print(f"There are {df_syn_T_C.shape[0]} possible synonymous T->C substitutions in the MDR1 gene")
display(df_syn_T_C.head())

There are 424 possible synonymous T->C substitutions in the MDR1 gene


| | CDS_position_0_based | Chromosome_position_1_based | Sequence | Changed_from | Changed_to |
|---|---|---|---|---|---|
| 0 | 5 | 87600179 | ATGGACCTTGAAGGGGACCGCAATGGAGGAGCAAAGAAGAAGAACT... | T | C |
| 1 | 8 | 87600176 | ATGGATCTCGAAGGGGACCGCAATGGAGGAGCAAAGAAGAAGAACT... | T | C |
| 2 | 23 | 87600161 | ATGGATCTTGAAGGGGACCGCAACGGAGGAGCAAAGAAGAAGAACT... | T | C |
| 3 | 47 | 87600137 | ATGGATCTTGAAGGGGACCGCAATGGAGGAGCAAAGAAGAAGAACT... | T | C |
| 4 | 50 | 87600134 | ATGGATCTTGAAGGGGACCGCAATGGAGGAGCAAAGAAGAAGAACT... | T | C |
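
The `Sequence` column is produced by `mutate_cds_sequence`, which is called with `CDS_position_0_based + 1` and therefore appears to expect a 1-based position. Its implementation is not shown in this notebook, so the function below, `mutate_sequence`, is a hypothetical stand-in that only illustrates the 0-based to 1-based conversion.

```python
def mutate_sequence(sequence, position_1_based, change_to):
    """Hypothetical stand-in for mutate_cds_sequence: return a copy of
    `sequence` with the nucleotide at a 1-based position replaced."""
    if not 1 <= position_1_based <= len(sequence):
        raise ValueError("position out of range")
    index = position_1_based - 1  # convert to Python's 0-based indexing
    return sequence[:index] + change_to + sequence[index + 1:]


toy_cds = "ATGGATCTT"        # first nine nucleotides of the MDR1 CDS
cds_position_0_based = 5     # first row of the table above
mutated = mutate_sequence(toy_cds, cds_position_0_based + 1, "C")
print(mutated)
```

The mutated toy sequence starts with `ATGGACCTT`, which agrees with the beginning of the `Sequence` value in the first row of the table.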


In [21]:
to_remove = [variant_info[1]["cds_position"] - 1, variant_info[3]["cds_position"] - 1] #remove T1236C and T3435C from the list of random T->C variants
df_syn_T_C = df_syn_T_C[~df_syn_T_C["CDS_position_0_based"].isin(to_remove)]
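
The filter above can be checked on a small example. This sketch assumes that the `cds_position` entries in `variant_info` are 1-based and match the variant names (1236 and 3435), which is what the `- 1` suggests. The DataFrame `toy_df` and the list `observed_1_based` are hypothetical.

```python
import pandas as pd

# Toy pool containing two random positions and the two observed variants (0-based).
toy_df = pd.DataFrame({"CDS_position_0_based": [5, 8, 1235, 3434]})

observed_1_based = [1236, 3435]                      # T1236C and T3435C
to_remove = [pos - 1 for pos in observed_1_based]    # convert to 0-based

# Keep only the rows whose position is not one of the observed variants.
filtered = toy_df[~toy_df["CDS_position_0_based"].isin(to_remove)].reset_index(drop=True)
print(filtered)
```

Resetting the index is optional; it only keeps the row labels contiguous after the observed variants are dropped.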

In [12]:
'''Save to pickle'''
df_syn_T_C.to_pickle("../Data/random_mutations_for_pvals/synonymous_T2C.pickle")

## (2) A df with positions of all possible T->G non-synonymous mutations

In [22]:
''' Get the pool of possible non-synonymous changes of T->G in MDR1''' 

changed_from = "T"
change_to = "G"

pos_T_nuc = [position for position, nucleotide in enumerate(nt_CDS) if nucleotide == changed_from] #positions of "T" nucleotide
pos_can_change_to_G = np.where(possible_syn_replacements_for_gene[nucs_dict[change_to],:] == 0)[0] #positions that can be *non-synonymously* changed to G
positions_pool = [pos for pos in pos_T_nuc if pos in pos_can_change_to_G] #the intersection is our pool to choose from


In [23]:
''' Create a df with the position (CDS-relative and chromosome-relative) and the resulting mutated sequence '''

df_nonsyn_T_G = pd.DataFrame()
df_nonsyn_T_G["CDS_position_0_based"] = positions_pool
df_nonsyn_T_G["Chromosome_position_1_based"] = df_nonsyn_T_G["CDS_position_0_based"].apply(lambda x: mapping_cur_gene[x] + 1)
df_nonsyn_T_G["Changed_from"] = "T" #on the reverse strand!
df_nonsyn_T_G["Changed_to"] = "G" #on the reverse strand!
df_nonsyn_T_G["Sequence"] = df_nonsyn_T_G.apply(lambda x: mutate_cds_sequence(sequence = nt_CDS, position = x.CDS_position_0_based + 1, change_to = "G"), axis = 1)
print(f"There are {df_nonsyn_T_G.shape[0]} possible non-synonymous T->G substitutions in the MDR1 gene")
display(df_nonsyn_T_G.head())

There are 860 possible non-synonymous T->G substitutions in the MDR1 gene


| | CDS_position_0_based | Chromosome_position_1_based | Changed_from | Changed_to | Sequence |
|---|---|---|---|---|---|
| 0 | 1 | 87600183 | T | G | AGGGATCTTGAAGGGGACCGCAATGGAGGAGCAAAGAAGAAGAACT... |
| 1 | 5 | 87600179 | T | G | ATGGAGCTTGAAGGGGACCGCAATGGAGGAGCAAAGAAGAAGAACT... |
| 2 | 7 | 87600177 | T | G | ATGGATCGTGAAGGGGACCGCAATGGAGGAGCAAAGAAGAAGAACT... |
| 3 | 23 | 87600161 | T | G | ATGGATCTTGAAGGGGACCGCAAGGGAGGAGCAAAGAAGAAGAACT... |
| 4 | 45 | 87600139 | T | G | ATGGATCTTGAAGGGGACCGCAATGGAGGAGCAAAGAAGAAGAACG... |


In [24]:
to_remove = [variant_info[2]["cds_position"] - 1] #remove T2677G from the list of random T->G variants
df_nonsyn_T_G = df_nonsyn_T_G[~df_nonsyn_T_G["CDS_position_0_based"].isin(to_remove)]

In [16]:
'''Save to pickle'''
df_nonsyn_T_G.to_pickle("../Data/random_mutations_for_pvals/nonsynonymous_T2G.pickle")