# General rules, whatever the source of the interaction graph you want to use:

## Nature of the input for BoNesis:

### - BoNesis accepts an interaction graph given as a Python list of pairwise interactions:

In [3]:
interaction_graph = [
("gene1","gene2",dict(sign=-1)),
("gene2","gene1",dict(sign=-1)),
("gene1","gene3",dict(sign=-1)),
("gene2","gene3",dict(sign=1)),
]

Example: `domain = bonesis.InfluenceGraph(interaction_graph)`
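
Before handing such a list to BoNesis, it can help to check that every entry has the expected shape. The sketch below is only an illustration: `check_interactions` is a hypothetical helper, not part of BoNesis. It assumes each entry is a `(source, target, attributes)` tuple whose `sign` is 1 or -1, as in the list above, and returns the set of node names.

```python
def check_interactions(interactions):
    """Hypothetical helper (not part of BoNesis): validate a list of
    (source, target, {"sign": 1 or -1}) tuples and return the node names."""
    nodes = set()
    for entry in interactions:
        if len(entry) != 3:
            raise ValueError(f"expected (source, target, attributes), got {entry!r}")
        source, target, attrs = entry
        if attrs.get("sign") not in (1, -1):
            raise ValueError(f"sign must be 1 or -1 in {entry!r}")
        nodes.update((source, target))
    return nodes


example_graph = [
    ("gene1", "gene2", dict(sign=-1)),
    ("gene2", "gene1", dict(sign=-1)),
    ("gene1", "gene3", dict(sign=-1)),
    ("gene2", "gene3", dict(sign=1)),
]
print(sorted(check_interactions(example_graph)))
```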

### - BoNesis accepts an interaction graph stored in a file in the [SIF format (Simple Interaction File)](http://manual.cytoscape.org/en/stable/Supported_Network_File_Formats.html#sif-format)

Example: `domain = bonesis.InfluenceGraph.from_sif(<path_SIF_file>)`
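
To make the correspondence between the two input forms concrete, here is a minimal sketch that turns SIF lines into the list-of-tuples form shown earlier. It is not how BoNesis reads the file. It assumes tab-separated lines of the form `source<TAB>relation<TAB>target`, where the relation is the sign (1 or -1), as in the files produced from DoRothEA in this notebook. The function name `sif_lines_to_interactions` is hypothetical.

```python
def sif_lines_to_interactions(lines):
    """Hypothetical helper: convert tab-separated SIF lines whose relation
    column holds the sign (1 or -1) into (source, target, dict(sign=...)) tuples."""
    interactions = []
    for line in lines:
        fields = line.strip().split("\t")
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        source, relation, *targets = fields
        # A SIF line may list several targets after the relation.
        for target in targets:
            interactions.append((source, target, dict(sign=int(relation))))
    return interactions


sif_text = "gene1\t-1\tgene2\ngene2\t-1\tgene1\ngene1\t-1\tgene3\ngene2\t1\tgene3\n"
interactions = sif_lines_to_interactions(sif_text.splitlines())
print(interactions)
```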

#### Such a file can be extracted directly from the DoRothEA database (for a given confidence level on the edges) via its R package, as follows:
1. **[R](https://www.r-project.org/) must be installed on the machine to access DoRothEA through its R package [`dorothea`](http://bioconductor.org/packages/release/data/experiment/html/dorothea.html)**, which you can install directly from Python with the following code:


In [None]:
import rpy2.robjects as robjects
robjects.r('''
    if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
    BiocManager::install("dorothea")
''')

2. **Extract the interaction graph from DoRothEA using the function:**  
`dorothea_extraction(organism, confidence, directory_output)`  
Example: `dorothea_extraction(organism="mouse", confidence="ABC")`

 * *INPUT*
     + **organism**: string, either `"human"` (default) or `"mouse"`.
     + **confidence**: string among A, AB (default), ABC, ABCD, ABCDE.
     + **directory_output**: output directory, the current one by default.
 * *OUTPUT* 
     + **SIF file** (in the directory given as argument) named "YYYY-MM-DD_dorotheaX_organism.sif"
         * with X the confidence given as argument, 
         * organism being human or mouse.
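
Since the output file name depends on the extraction date, it can be convenient to compute it rather than type it by hand. The sketch below follows the naming scheme described above. `expected_sif_path` and its `day` parameter are hypothetical and not part of this notebook's functions. It joins the directory and file name with `os.path.join`, which gives the same result as plain concatenation when the directory ends with a separator, as the default `"./"` does.

```python
import datetime
import os


def expected_sif_path(organism="human", confidence="AB", directory_output="./", day=None):
    """Hypothetical helper: path of the SIF file named
    "YYYY-MM-DD_dorotheaX_organism.sif" for a given day (today by default)."""
    day = day or datetime.date.today()
    filename = f"{day:%Y-%m-%d}_dorothea{confidence}_{organism}.sif"
    return os.path.join(directory_output, filename)


print(expected_sif_path("mouse", "ABC", day=datetime.date(2022, 10, 4)))
```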

In [2]:
import os
import datetime

def dorothea_extraction(organism: str="human", confidence: str="AB", directory_output: str="./"):
    ''' Store in a SIF file the subnetwork of the DoRothEA database for the given organism, restricted to the given confidence levels on edges.
    INPUT
        organism: string, either "human" (default) or "mouse"
        confidence: string among A, AB (default), ABC, ABCD, ABCDE
        directory_output: output directory, the current one ("./") by default; it is concatenated as-is with the file name, so it must end with a path separator
    OUTPUT
        SIF file in directory_output, named "YYYY-MM-DD_dorotheaX_organism.sif" with X the confidence given as argument
    '''
    
    assert organism in ("human", "mouse"), "organism must be human or mouse"
    
    date = datetime.datetime.now()
    
    import rpy2.robjects as robjects
    import rpy2.robjects.packages as rpackages

    if confidence == "A":
        confidenceR = '"A"'
    elif confidence == "AB":
        confidenceR = '"A","B"'
    elif confidence == "ABC":
        confidenceR = '"A","B","C"'
    elif confidence == "ABCD":
        confidenceR = '"A","B","C","D"'
    elif confidence == "ABCDE":
        confidenceR = '"A","B","C","D","E"'
    else:
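        # InputError is not a Python built-in; ValueError is the standard exception for a bad argument value.
        raise ValueError("Incorrect argument: confidence for edges can be A, AB, ABC, ABCD, ABCDE")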

    #robjects.r('''
    #    if (!requireNamespace("BiocManager", quietly = TRUE))
    #        install.packages("BiocManager")
    #    BiocManager::install("dorothea")
    #''')
    
    robjects.r('''
        library(dorothea)
        subset_dth = dorothea_{0}[dorothea_{0}$confidence %in% c({1}), ]
        '''.format('hs' if organism=='human' else 'mm', confidenceR))
    robjects.r('''
        df = data.frame(source = subset_dth$tf,
                        sign = subset_dth$mor,
                        target = subset_dth$target)
        write.table(df, file="{0}{1}_dorothea{2}_{3}.sif", sep = "\t", col.names = FALSE, row.names = FALSE, quote = FALSE)
        '''.format(directory_output, date.strftime("%Y-%m-%d"), confidence, organism))

In [None]:
# Example :
dorothea_extraction(organism="mouse", confidence="ABC")

## Data preprocessing before using BoNesis: gene name standardization
To avoid mismatches when combining data from different sources (interaction graph vs. observations), we advise standardizing gene names based on NCBI gene data, as follows:
1. Download the gene data file for the organism you are interested in from https://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Mammalia/. You get a TSV (tab-separated values) file containing, notably, the following columns:

|Column number|Description of data in the column|
|:---:|:---|
|2 | GeneID: an integer used as the unique identifier for a gene in NCBI|
|**3** | **NCBI Symbol**: the default symbol for the gene at NCBI|
|**5** | **Symbol Synonyms**: bar-delimited set of unofficial symbols for the gene|
|**11** | **Official Symbol** for this gene designated by the nomenclature authority if it exists (HGNC for human)| 
|9 | NCBI Named Description: the default full name for this gene at NCBI|
|12 | Full Name for this gene designated by the nomenclature authority if it exists (HGNC for human)|
|14 | Other full names & designations: pipe-delimited set of some alternate descriptions ("-" indicates none is being reported)|
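
As an illustration of this layout, the sketch below reads columns 3, 5 and 11 with Python's `csv` module. It runs on a made-up in-memory sample, not on real NCBI records: the gene names are placeholders, and `make_row` and `read_symbols` are hypothetical helpers. It assumes 16 tab-separated columns, a header line starting with `#`, and `-` for empty fields.

```python
import csv
import io


def make_row(gene_id, symbol, synonyms, official):
    """Build a made-up 16-column row; unused columns are filled with "-"."""
    row = ["-"] * 16
    row[1], row[2], row[4], row[10] = gene_id, symbol, synonyms, official
    return "\t".join(row)


sample = "\n".join([
    "#" + "\t".join(f"col{i}" for i in range(1, 17)),  # header line, skipped below
    make_row("1", "GeneA", "A1|A2", "GeneA"),
    make_row("2", "GeneB", "-", "-"),
])


def read_symbols(handle):
    """Yield (NCBI symbol, synonyms, official symbol) from columns 3, 5 and 11."""
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row or row[0].startswith("#"):
            continue
        synonyms = [] if row[4] == "-" else row[4].split("|")
        official = None if row[10] == "-" else row[10]
        yield row[2], synonyms, official


records = list(read_symbols(io.StringIO(sample)))
print(records)
```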

+ **If your interaction graph is a list of pairwise interactions in Python, use the following function to *standardize* this list before importing it into BoNesis:**
`standardize_genename_in_list_of_interactions(interactions_list, path_NCBIgenedata)`  
Example:  
`standardized_interaction_graph = standardize_genename_in_list_of_interactions(interaction_graph, "Mus_musculus.gene_info")`

+ **If your interaction graph is in the SIF format, use the following function to *standardize* the file before importing it into BoNesis:**
`standardize_genename_in_file(path_input, path_NCBIgenedata, columns_to_standardize, sep)`  
Example:  
`standardize_genename_in_file("2022-10-04_dorotheaABC_mouse.sif", "Mus_musculus.gene_info", (0,2), "\t")`  
This produces an output SIF file containing the *standardized* interaction graph (each gene named by its NCBI symbol), `(0,2)` being the columns that contain the gene names in a SIF file.
   * *INPUT*
       1. **path_input**: path to the input file in which the names must be standardized.
       2. **path_NCBIgenedata**: path to the NCBI gene data.
       3. **columns_to_standardize**: the columns of the input file that contain the gene names to standardize. Column indices start at 0.
       4. **sep**: the field separator used in the input SIF file (the gene data file provided by NCBI is a TSV).
   * *OUTPUT*
       + a copy of the input file in which each gene in the columns_to_standardize is named by its reference gene name (upper-case NCBI symbol). The output file has the same name as the input file, with the suffix "_reference-gene-names" appended.
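
The lookup behind this standardization can be sketched on a toy example. The synonym table and gene names below are made up, and `reference_name` is a hypothetical stand-in that mirrors the behavior described here: names are compared in upper case, and a name with no known synonym is kept, in upper case.

```python
def reference_name(gene_name, synonyms):
    """Return the reference symbol for gene_name, or the upper-cased
    name itself when it is not in the synonym table."""
    key = gene_name.upper()
    return synonyms.get(key, key)


# Made-up synonym table: every known name maps to its reference symbol.
toy_synonyms = {"GENEA": "GENEA", "A1": "GENEA", "A2": "GENEA"}

# One SIF row (source, sign, target); only columns 0 and 2 hold gene names.
row = ["a1", "-1", "GeneX"]
standardized = [
    reference_name(value, toy_synonyms) if index in (0, 2) else value
    for index, value in enumerate(row)
]
print(standardized)
```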

In [None]:
import os
from typing import List, Set, Dict, Tuple


def get_dict_synonyms(path_NCBIgenedata: str) -> Dict:
    """
    Create a dictionary matching each possible gene name to its NCBI symbol.
    
    Note:
    The NCBI gene data file is large, so it is first parsed with awk to speed up the task. This creates a temporary file, which is deleted at the end.
    
    INPUT
        path_NCBIgenedata: path to the NCBI gene data
    OUTPUT
        dictionary (key: gene name, value: reference gene name (being the NCBI symbol))
    """
    
    # Parse the downloaded NCBI gene data (requires the Unix tools awk, tr and sed):
    # keep columns 3, 5 and 11, split the synonyms on "|", then drop the header line.
    path_NCBIgenedata_cut = f"{path_NCBIgenedata}_cut"
    command_parsing = "awk -F'\t' '{print $3 \"\t\" $5 \"\t\" $11}' " + path_NCBIgenedata + " | tr '|' '\t' > " + path_NCBIgenedata_cut + " ; sed -i 1d " + path_NCBIgenedata_cut
    os.system(command_parsing)
    
    # Extract gene data information:    
    gene_synonyms_dict = dict()
    symbols = set()

    with open (path_NCBIgenedata_cut, "r") as file_synonyms:
        for gene in file_synonyms:
            gene = gene.strip().upper()
            gene_symbols_list = gene.split("\t")
            #extract reference gene symbol:
            ncbi_symbol = gene_symbols_list.pop(0)
            #delete non-informative synonyms:
            res = [syn for syn in gene_symbols_list if (syn != "-" and syn != ncbi_symbol)]

            #create the dictionary matching each symbol to its reference gene symbol:
            gene_synonyms_dict[ncbi_symbol] = ncbi_symbol
            symbols.add(ncbi_symbol)

            for synonym in res:
                # Warning about the NCBI list of synonyms:
                # a name can be the synonym of several symbols.
                # Arbitrarily, the first symbol encountered is kept.
                if synonym not in gene_synonyms_dict:
                    gene_synonyms_dict[synonym] = ncbi_symbol
                    
    # Delete the temporary file:
    os.remove(path_NCBIgenedata_cut)
    return gene_synonyms_dict


def get_reference_gene_name(gene_name: str, dict_synonyms: dict) -> str:
    """
    Given a gene name, return its reference name.
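    INPUT
        gene_name: the gene name whose reference name is wanted
        dict_synonyms: dictionary matching each gene name to its reference name
    OUTPUT
        the reference name if the gene name is known, otherwise the gene name itself in upper case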
    """
    gene_name = gene_name.upper()
    if gene_name in dict_synonyms:
        return dict_synonyms[gene_name]
    return gene_name

In [None]:
def standardize_genename_in_list_of_interactions(interactions_list: List[Tuple[str, str, Dict]], path_NCBIgenedata: str):
    """
    Create a copy of the input list of pairwise interactions, with each gene name replaced by its reference name (NCBI symbol).
    
    Require the following functions:
        get_dict_synonyms
        get_reference_gene_name
        
    INPUT
        interactions_list: list of tuples containing string (source) + string (target) + dict (sign = 1 or -1)
        path_NCBIgenedata: path to the NCBI gene data
    OUTPUT
        list of tuples containing string (source) + string (target) + dict (sign = 1 or -1)
    """
    # Get gene data information:
    gene_synonyms_dict = get_dict_synonyms(path_NCBIgenedata)
    
    # Copy the interactions list by replacing each genename by its reference genename into it:
    standardized_interactions_list = list()
    for interaction in interactions_list:
        source = get_reference_gene_name(interaction[0], gene_synonyms_dict)
        target = get_reference_gene_name(interaction[1], gene_synonyms_dict)
        standardized_interactions_list.append((source, target, interaction[2])) 
    
    return standardized_interactions_list

In [None]:
# Example:

interaction_graph = [
("AR","ALPG",dict(sign=-1)),
("UGT1A6","AHR",dict(sign=-1)),
("ZNF217","ACP3",dict(sign=1)),
]

standardize_genename_in_list_of_interactions(interaction_graph, "gene_data")

In [None]:
from typing import List, Set, Tuple, Union

def standardize_genename_in_file(path_input: str, path_NCBIgenedata: str, columns_to_standardize: Union[List[int], Set[int], Tuple[int, ...]] = (0,), sep: str = "\t"):
    """
    Create a copy of the input file, with each gene name replaced by its reference (NCBI symbol) in the columns specified as argument.
    
    Require the following functions:
        get_dict_synonyms
        get_reference_gene_name
       
    INPUT
        path_input: path to the input file in which the names must be standardized.
        path_NCBIgenedata: path to the NCBI gene data.
        columns_to_standardize: the columns containing the gene names to standardize. Column indices start at 0.
        sep: the field separator used in the input SIF file (the gene data file provided by NCBI is a TSV).
    OUTPUT
        copy of the input file in which each gene in the columns_to_standardize is named by its reference gene name (upper-case NCBI symbol). The output file has the same name as the input file, with the suffix "_reference-gene-names" appended.
    """
    
    # Get gene data information:
    gene_synonyms_dict = get_dict_synonyms(path_NCBIgenedata)
    
    # Replace gene name with reference gene name into the columns_to_standardize of the input file:
    cols_check = set(columns_to_standardize)  # a set gives constant-time membership tests
    
    with open(path_input, "r") as inputfile:
        to_write = []
        for ligne in inputfile.read().split("\n"):
            if len(ligne) > 1:
                cols = ligne.split(sep)
                ligne_output = ""
                id_col = 0
                for col in cols[:-1]:
                    if id_col in cols_check:
                        ligne_output += get_reference_gene_name(col, gene_synonyms_dict) + sep
                    else:
                        ligne_output += col + sep
                    id_col += 1
                if id_col in cols_check:
                    ligne_output += get_reference_gene_name(cols[-1], gene_synonyms_dict)
                else:
                    ligne_output += cols[-1]
                ligne_output += "\n"
                to_write.append(ligne_output)
    
    with open(f"{path_input}_reference-gene-names", "x") as outputfile:
        for ligne in to_write:
            outputfile.write(ligne)

In [None]:
# Example:

standardize_genename_in_file("2022-10-04_dorotheaABC_mouse.sif", "Mus_musculus.gene_info", (0,2), "\t")