### About this notebook: 
<font color='grey'>Compare the similarity of gene mutations by estimating the ratio of identical mutation sites between two data sets. <br/>In this example, we compare patient-derived somatic mutations from LUAD in the TCGA project with cell-line-derived variants of LUAD in the GDSC project. We first identify the most frequently mutated genes in the patient-derived samples, then check how similar the mutation sites of those genes are in the cell-line-derived variants. 
</font>

In [1]:
import sys
import os
import pandas as pd
import json
sys.path.append('../scripts/')
import Docket_integration


### Define the input directory and the output directory: 
<font color='grey'> 
    "input_dir": directory of the input data<br/>
    "output_dir": directory of the output data<br/>
</font>

In [2]:
directories = {"input_dir":"../Data_input_for_BRCA",
               "output_dir":"../Output_BRCA"}

### About the input file and the output files: 
<font color='grey'> 
    "File1": the mutation file from the cell line data (GDSC);<br/>
    "File2": the mutation file from the patient-derived somatic mutations (TCGA);<br/>
    "File_out": the output file listing the genes that show high similarity between the two files;<br/>
    "Disease_type": the tumor type of the mutation data;<br/>
    "Sample_label": the name of the sample ID column in File2;<br/>
    "Gene_label": the name of the gene symbol column in File2
</font>

In [3]:
input_data = {
    "File1":"Mut_site_GDSC.csv",
    "File2":"Mut_site_TCGA.csv",
    "File_out": "Mut_similarity.csv",
    "Disease_type":"LUAD",
    "Sample_label":"sample_barcode_tumor",
    "Gene_label":"Hugo_Symbol"
}

In [4]:
output_dir = directories['output_dir']
os.makedirs(output_dir, exist_ok=True)  # Create the output directory if it does not exist yet
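
Because a missing input file only surfaces later as a `FileNotFoundError` from `pd.read_csv`, it can help to verify the inputs before processing. The following is a minimal sketch, not part of `Docket_integration`: the helper name `missing_inputs` is hypothetical, and the demo uses a temporary directory rather than the real input directory.

```python
import os
import tempfile

def missing_inputs(input_dir, filenames):
    """Return the paths under input_dir that do not exist (hypothetical helper)."""
    paths = [os.path.join(input_dir, name) for name in filenames]
    return [p for p in paths if not os.path.isfile(p)]

# Demo on a temporary directory so the sketch runs anywhere:
# only the GDSC file is created, so the TCGA file is reported as missing.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "Mut_site_GDSC.csv"), "w").close()
    missing = missing_inputs(tmp, ["Mut_site_GDSC.csv", "Mut_site_TCGA.csv"])
    missing_names = [os.path.basename(p) for p in missing]

print(missing_names)
```

In this notebook the call would presumably be `missing_inputs(directories['input_dir'], [input_data['File1'], input_data['File2']])`, with an empty list meaning both files were found.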

### Processing the data Step 1: 
<font color='grey'> 
    Get the most frequently mutated genes in TCGA (File2)
</font>

In [5]:
mut_GDSC = pd.read_csv(directories['input_dir']+'/'+input_data['File1'])  # Read the GDSC cell line mutations
Mut_TCGA = pd.read_csv(directories['input_dir']+'/'+input_data['File2'])  # Read the TCGA patient mutations

Mut_TCGA_matrix = Docket_integration.generate_mutation_matrix(Mut_TCGA,input_data['Sample_label'],input_data['Gene_label'],'') # Generate gene mutation matrix
x = Mut_TCGA_matrix.sum()/Mut_TCGA_matrix.shape[0] # Calculate the mutation frequency for each gene
x_df = pd.DataFrame({'Gene':x.index.values,'Freq':x.values}) # Create a dataframe with genes and their mutation frequency
x_df = x_df.sort_values(by = 'Freq',ascending = False) # Sort genes from most to least frequently mutated

FileNotFoundError: [Errno 2] File b'../Data_input_for_BRCA/Mut_site_GDSC.csv' does not exist: b'../Data_input_for_BRCA/Mut_site_GDSC.csv'
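
The frequency calculation in Step 1 can be illustrated on toy data. The internals of `Docket_integration.generate_mutation_matrix` are not shown in this notebook, so the sketch below assumes it returns a matrix with one row per sample and one column per gene, where 1 marks a mutated gene. It uses `pd.crosstab` as a stand-in, and the sample IDs and mutation calls are made up.

```python
import pandas as pd

# Toy long-format mutation table: one row per (sample, gene) mutation call.
toy = pd.DataFrame({
    "sample_barcode_tumor": ["S1", "S1", "S2", "S3", "S3", "S4"],
    "Hugo_Symbol":          ["TP53", "KRAS", "TP53", "TP53", "EGFR", "KRAS"],
})

# Assumed matrix layout: rows are samples, columns are genes, 1 = mutated.
# clip(upper=1) collapses multiple mutations of one gene in one sample.
matrix = pd.crosstab(toy["sample_barcode_tumor"], toy["Hugo_Symbol"]).clip(upper=1)

# Fraction of samples carrying at least one mutation in each gene.
freq = matrix.sum() / matrix.shape[0]
freq_df = (
    pd.DataFrame({"Gene": freq.index.values, "Freq": freq.values})
    .sort_values(by="Freq", ascending=False)
    .reset_index(drop=True)
)
print(freq_df)
```

With four samples, TP53 is mutated in three (0.75), KRAS in two (0.5), and EGFR in one (0.25).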

### Processing the data Step 2: 
<font color='grey'> 
    Compare the similarity of genes in their mutation sites between GDSC (File1) and TCGA (File2).<br/>
    For each gene, A is the list of mutation sites in GDSC and B is the list of mutation sites in TCGA.<br/>
    Three similarity scores are calculated:<br/>
    <font color='green'> 
    Sim1_TCGA_GDSC: len(A_and_B) / (len(A)+ len(B)) <br/>
    </font>
    It measures the proportion of shared mutation sites relative to the combined size of the two datasets<br/>
    <font color='green'> 
    Sim2_TCGA_GDSC: len(A_and_B) / len(A)  <br/>
    </font>
    It measures the fraction of mutation sites in the GDSC dataset that are shared with TCGA<br/>
    <font color='green'> 
    Sim3_TCGA_GDSC: len(set(A_and_B)) / len(set(list(A) + list(B)))<br/>
    </font>
    It measures the Jaccard coefficient of unique variants between the GDSC dataset and the TCGA dataset
</font>
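
As a worked example of the three scores, the sketch below applies the formulas to two small lists. The function name `similarity_scores` and the site labels are made up for illustration; the real `new_id` values will look different.

```python
def similarity_scores(A, B):
    """Return (Sim1, Sim2, Sim3) for site lists A (GDSC) and B (TCGA)."""
    B_set = set(B)
    # Items of A that also occur in B; duplicates in A are kept.
    A_and_B = [site for site in A if site in B_set]
    sim1 = len(A_and_B) / (len(A) + len(B))
    sim2 = len(A_and_B) / len(A)
    sim3 = len(set(A_and_B)) / len(set(A) | set(B))  # Jaccard on unique sites
    return sim1, sim2, sim3

# Hypothetical mutation site IDs for one gene.
A = ["site_a", "site_b", "site_b", "site_c"]  # GDSC: site_b seen in two cell lines
B = ["site_b", "site_c", "site_d"]            # TCGA

sim1, sim2, sim3 = similarity_scores(A, B)
print(round(sim1, 3), sim2, sim3)  # → 0.429 0.75 0.5
```

Here `A_and_B` has three entries (`site_b` twice, `site_c` once), so Sim1 = 3/7, Sim2 = 3/4, and Sim3 = 2/4. Note that Sim1 and Sim2 count repeated sites, while Sim3 only counts unique ones.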

In [None]:
def shared_gene(A,B):
    """Return the items of A that also occur in B, keeping duplicates from A."""
    B_set = set(B)  # Set lookup avoids scanning B for every item of A
    return [i for i in A if i in B_set]

In [None]:
Genelist = []
coef = []
coef2 = []
coef3 = []
# Only consider genes mutated in more than 3% of the TCGA samples
for gene in list(x_df.loc[x_df['Freq']>0.03]['Gene'].values):
    A = mut_GDSC.loc[mut_GDSC['HGNC_gene_symbol'] == gene]['new_id'].values  # GDSC mutation sites
    B = Mut_TCGA.loc[Mut_TCGA['Hugo_Symbol'] == gene]['new_id'].values  # TCGA mutation sites
    A_and_B = shared_gene(A,B)
    if len(A) > 0:  # Skip genes without GDSC variants (avoids division by zero)
        similarity1 = len(A_and_B) / (len(A)+ len(B))
        similarity2 = len(A_and_B) / len(A)
        similarity3 = len(set(A_and_B)) / len(set(list(A) + list(B)))

        Genelist.append(gene)
        coef.append(similarity1)
        coef2.append(similarity2)
        coef3.append(similarity3)
    

### Processing the data Step 3: 
<font color='grey'> 
    Select genes whose mutation sites are similar between the two datasets
</font>

In [None]:
result = pd.DataFrame({"Gene":Genelist,"Sim1_TCGA_GDSC":coef,'Sim2_TCGA_GDSC':coef2, "Sim3_TCGA_GDSC":coef3})
result = result.sort_values(by = ['Sim1_TCGA_GDSC'], ascending = False)
result = result.loc[result['Sim2_TCGA_GDSC'] > 0.1]  # Keep genes with more than 10% of GDSC sites shared
result = result.loc[result['Sim3_TCGA_GDSC'] > 0.05]  # Keep genes with a Jaccard coefficient above 0.05
result = x_df.merge(result, on='Gene')  # Add the TCGA mutation frequency of each gene
result['Disease'] = input_data['Disease_type'] 
result_sort = result.sort_values(by = ['Sim2_TCGA_GDSC'], ascending=False)
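
The selection step can be sketched on toy data. The thresholds (0.1 for Sim2 and 0.05 for Sim3) are the ones used in this notebook; the gene frequencies and similarity scores below are made-up values chosen only to show which rows survive the filter.

```python
import pandas as pd

# Made-up TCGA mutation frequencies and similarity scores.
freq = pd.DataFrame({"Gene": ["TP53", "KRAS", "EGFR"], "Freq": [0.50, 0.30, 0.15]})
scores = pd.DataFrame({
    "Gene": ["TP53", "KRAS", "EGFR"],
    "Sim1_TCGA_GDSC": [0.20, 0.02, 0.10],
    "Sim2_TCGA_GDSC": [0.40, 0.05, 0.30],
    "Sim3_TCGA_GDSC": [0.10, 0.01, 0.02],
})

# Apply both thresholds at once; a gene must pass both to be kept.
kept = scores.loc[(scores["Sim2_TCGA_GDSC"] > 0.1) & (scores["Sim3_TCGA_GDSC"] > 0.05)]

# An inner merge keeps only genes present in both tables.
table = freq.merge(kept, on="Gene")
table["Disease"] = "LUAD"
print(table["Gene"].tolist())  # → ['TP53']
```

KRAS fails both thresholds and EGFR fails the Sim3 threshold, so only TP53 remains in the final table.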

### Processing the data Step 4: 
<font color='grey'> 
    Output the knowledge graph table of gene similarity at the mutation level for the disease type
</font>

In [None]:
result_sort.to_csv(directories['output_dir']+'/'+input_data['File_out'])