# General rules, whatever the source of the interaction graph you want to use

------------
## Summary
* [**Nature of the interaction graph**](#h1)
    + [Pairwise interaction list (as a dictionary)](#h11)
    + [SIF file](#h12)
        + [Automatic download from the DoRothEA database](#dorotheatuto)
* [**Gene name standardization**](#h2)
    1. [Download gene info from NCBI](#ncbidownload)
    2. [Standardize your interaction graph](#igstandardization)
    
## Required modules
* [dorothea.py](dorothea.py)
* [gene_name_standardization.py](gene_name_standardization.py)
------------

## Nature of the interaction graph (IG) that BoNesis can consider <a class="anchor" id="h1"></a>

### - IG saved as a list of pairwise interactions in Python: <a class="anchor" id="h11"></a>

In [1]:
interaction_graph = [
    ("gene1", "gene2", dict(sign=-1)),
    ("gene2", "gene1", dict(sign=-1)),
    ("gene1", "gene3", dict(sign=-1)),
    ("gene2", "gene3", dict(sign=1)),
]

Example: `domain = bonesis.InfluenceGraph(interaction_graph)`
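
Before handing such a list to BoNesis, it can help to check that every entry has the expected shape. The helper below is a minimal sketch, not part of BoNesis: `check_interaction_list` is a hypothetical name, and it assumes that only the two signs used above (`1` and `-1`) are wanted.

```python
def check_interaction_list(edges):
    """Return `edges` unchanged, or raise ValueError on a malformed entry.

    Hypothetical helper: each entry must be (source, target, {"sign": 1 or -1}).
    """
    for edge in edges:
        if len(edge) != 3:
            raise ValueError(f"expected (source, target, attributes), got {edge!r}")
        source, target, attributes = edge
        if not isinstance(source, str) or not isinstance(target, str):
            raise ValueError(f"gene names must be strings in {edge!r}")
        # Assumption: only activation (1) and inhibition (-1) are used here.
        if attributes.get("sign") not in (1, -1):
            raise ValueError(f"sign must be 1 or -1 in {edge!r}")
    return edges


checked_graph = check_interaction_list([
    ("gene1", "gene2", dict(sign=-1)),
    ("gene2", "gene3", dict(sign=1)),
])
```

The checked list can then be passed to `bonesis.InfluenceGraph` as in the example above.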

### - IG saved as a file under the [SIF format (Simple Interaction File)](http://manual.cytoscape.org/en/stable/Supported_Network_File_Formats.html#sif-format) <a class="anchor" id="h12"></a>

Example: `domain = bonesis.InfluenceGraph.from_sif(<path_SIF_file>)`
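
If you already hold the graph as a Python list and want a SIF file instead, each edge can be written as one `source <relation> target` line. The sketch below is only an illustration: `to_sif_lines` and `write_sif` are hypothetical helpers, and writing the sign (`1` or `-1`) as the relation type is an assumption, so check it against what `from_sif` accepts before relying on it.

```python
def to_sif_lines(edges, sep="\t"):
    """Turn (source, target, {"sign": s}) triples into SIF-style text lines."""
    lines = []
    for source, target, attributes in edges:
        # Assumption: the relation column carries the sign of the interaction.
        lines.append(f"{source}{sep}{attributes['sign']}{sep}{target}")
    return lines


def write_sif(edges, path, sep="\t"):
    """Write the edges to `path`, one interaction per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in to_sif_lines(edges, sep):
            handle.write(line + "\n")


sif_lines = to_sif_lines([
    ("gene1", "gene2", dict(sign=-1)),
    ("gene2", "gene3", dict(sign=1)),
])
```

Calling `write_sif(interaction_graph, "my_graph.sif")` (the file name is arbitrary) would then produce a file with the genes in columns 0 and 2.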

#### Such a file can be extracted directly from the DoRothEA database (for a given confidence level on the edges) via its R package, as follows:
1. **[R](https://www.r-project.org/) must be installed on the machine to access DoRothEA through its R package [`dorothea`](http://bioconductor.org/packages/release/data/experiment/html/dorothea.html)**, which you can install directly from Python (using the `rpy2` package) with the following code:


In [None]:
import rpy2.robjects as robjects
robjects.r('''
    if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
    BiocManager::install("dorothea")
''')

2. **Extract the interaction graph from DoRothEA using the function:** <a class="anchor" id="dorotheatuto"></a>  
`dorothea_extraction(<organism>, <confidence level of the edges>, <path to the output directory>)`  
Example: `dorothea_extraction(organism="mouse", confidence="ABC")`

 * *INPUT*
     + **organism**: string that can be human or mouse.
     + **confidence**: string that can be A, AB (default), ABC, ABCD, ABCDE.
     + **output directory**: the current one by default.
 * *OUTPUT* 
     + **SIF file** (in the directory given in argument) named under the format "dorothea_*confidence*\_*organism*_YYYYMMDD.sif"
         * with *confidence* the confidence levels given in argument
         * with *organism* the organism given in argument (human or mouse)
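
The naming pattern of the output file can be reproduced as follows, for instance to locate the file afterwards. This is a sketch of the pattern described above, not the code of `dorothea.py`; `expected_sif_name` is a hypothetical helper, and it assumes that the date in the name is the day of the extraction.

```python
from datetime import date

ALLOWED_ORGANISMS = ("human", "mouse")
ALLOWED_CONFIDENCES = ("A", "AB", "ABC", "ABCD", "ABCDE")


def expected_sif_name(organism, confidence="AB", day=None):
    """Build "dorothea_<confidence>_<organism>_YYYYMMDD.sif" (hypothetical helper)."""
    if organism not in ALLOWED_ORGANISMS:
        raise ValueError(f"organism must be one of {ALLOWED_ORGANISMS}")
    if confidence not in ALLOWED_CONFIDENCES:
        raise ValueError(f"confidence must be one of {ALLOWED_CONFIDENCES}")
    # Assumption: the date part is the extraction day; default to today.
    if day is None:
        day = date.today()
    return f"dorothea_{confidence}_{organism}_{day:%Y%m%d}.sif"


example_name = expected_sif_name("mouse", "ABC", date(2023, 2, 14))
```

With these arguments, `example_name` is `dorothea_ABC_mouse_20230214.sif`.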

In [2]:
import dorothea

In [None]:
help(dorothea)

In [None]:
# Example:
dorothea.extraction(organism="mouse", confidence="ABC", directory_output="./data/")

## Data preprocessing before using BoNesis: gene name standardization <a class="anchor" id="h2"></a>
To avoid gene-name mismatches when combining data from different sources (interaction graph vs. observations), we advise standardizing gene names against NCBI gene data, as follows:

**1. Download gene information from NCBI** <a class="anchor" id="ncbidownload"></a>

Depending on the organism you are interested in, download the corresponding gene info file from https://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Mammalia/.

You get a TSV file (Tab Separated Values), with notably the following columns:

|Column number|Description of data in the column|
|:---:|:---|
|2 | GeneID: an integer used as the unique identifier for a gene in NCBI|
|**3** | **NCBI Symbol**: the default symbol for the gene at NCBI|
|**5** | **Symbol Synonyms**: bar-delimited set of unofficial symbols for the gene|
|**11** | **Official Symbol** for this gene designated by the nomenclature authority if it exists (HGNC for human)| 
|9 | NCBI Named Description: the default full name for this gene at NCBI|
|12 | Full Name for this gene designated by the nomenclature authority if it exists (HGNC for human)|
|14 | Other full names & designations: pipe-delimited set of some alternate descriptions (`-` indicates that none is reported)|
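
To give an idea of how these columns can be used, the sketch below builds a lookup table from every symbol and synonym to the NCBI symbol. It is not the code of `gene_name_standardization.py`: `load_symbol_map` is a hypothetical helper, the two data rows are toy values, and resolving a synonym shared by several genes to the first gene encountered is an arbitrary choice.

```python
import csv
import io


def load_symbol_map(handle):
    """Map upper-cased symbols and synonyms to the NCBI symbol (hypothetical helper)."""
    symbol_map = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines, the "#tax_id ..." header and truncated rows.
        if len(row) < 5 or row[0].startswith("#"):
            continue
        symbol = row[2]    # column 3: NCBI Symbol
        synonyms = row[4]  # column 5: bar-delimited synonyms, "-" if none
        # An NCBI symbol always takes precedence over a synonym with the same spelling.
        symbol_map[symbol.upper()] = symbol
        if synonyms != "-":
            for alias in synonyms.split("|"):
                symbol_map.setdefault(alias.upper(), symbol)
    return symbol_map


# Toy data (not real NCBI records), restricted to the first five columns.
TOY_GENE_INFO = (
    "#tax_id\tGeneID\tSymbol\tLocusTag\tSynonyms\n"
    "0\t1\tGeneA\t-\tOldA|AltA\n"
    "0\t2\tGeneB\t-\t-\n"
)
symbol_map = load_symbol_map(io.StringIO(TOY_GENE_INFO))
```

With a real file, the same function could be called on `open("Mus_musculus.gene_info", encoding="utf-8")`.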

**2. Standardize your interaction graph:** <a class="anchor" id="igstandardization"></a>

Requires the module [gene_name_standardization](gene_name_standardization.py).

+ If your interaction graph is **a list of pairwise interactions in Python**, use the following function to *standardize* the list before importing it into BoNesis:  
`interaction_list_standardization(<list of pairwise interactions>, <NCBI gene data TSV file>)`  
*Example:*  
`standardized_interaction_graph = interaction_list_standardization(interaction_graph, "Mus_musculus.gene_info")`  


+ If your interaction graph is stored in **a SIF file**, you can create a *standardized* file (and then import it into BoNesis):  
`file_standardization(<input file>, <NCBI gene data TSV file>, <set of column(s) containing the gene names to standardize>, <field separator>)`  
*Example:*  
`file_standardization("2022-10-04_dorotheaABC.sif", "Mus_musculus.gene_info", (0,2), "\t")`  
This produces an output SIF file containing the *standardized* interaction graph (each gene named by its NCBI symbol); `(0,2)` are the indices of the columns that hold the gene names in a SIF file.
   * *INPUT*
       1. **path_input**: path to the input file in which the names must be standardized.
       2. **path_NCBIgenedata**: path to the NCBI gene info file.
       3. **columns_to_standardize**: the columns of the input file that contain the gene names to standardize. Column indices start at 0.
       4. **sep**: the field separator used in the input file (the gene data file provided by NCBI is a TSV).
   * *OUTPUT*
       + copy of the input file, with the genes in columns_to_standardize replaced by their reference names (upper-cased NCBI symbol). The output file has the same name as the input file, with "_standardized" appended at the end.
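
The two cases above share the same idea: look each gene name up in a table of reference names and leave unknown names untouched. The code below is a minimal sketch of that idea, not the implementation of `gene_name_standardization.py`; the function names and the one-entry lookup table are hypothetical, and keeping unknown names unchanged is an assumption.

```python
def standardize_name(name, reference_names):
    """Return the reference name for `name`, or `name` itself if unknown."""
    return reference_names.get(name.upper(), name)


def standardize_edges(edges, reference_names):
    """Standardize source and target of (source, target, attributes) triples."""
    return [
        (standardize_name(source, reference_names),
         standardize_name(target, reference_names),
         attributes)
        for source, target, attributes in edges
    ]


def standardize_rows(lines, reference_names, columns=(0, 2), sep="\t"):
    """Standardize the given columns (0-based) of delimited text lines."""
    standardized = []
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        for index in columns:
            fields[index] = standardize_name(fields[index], reference_names)
        standardized.append(sep.join(fields))
    return standardized


# Toy lookup table: upper-cased alias -> reference name (illustrative values).
toy_reference_names = {"OLDA": "GENEA"}

toy_edges = standardize_edges([("OldA", "GeneB", dict(sign=1))], toy_reference_names)
toy_rows = standardize_rows(["OldA\t1\tGeneB\n"], toy_reference_names)
```

Here `OldA` is replaced by `GENEA` in both the list and the row, while `GeneB`, absent from the toy table, is kept as it is.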

In [3]:
import gene_name_standardization as gns

In [None]:
help(gns)

In [4]:
# Example:

interaction_graph = [
    ("AR", "ALPG", dict(sign=-1)),
    ("UGT1A6", "AHR", dict(sign=-1)),
    ("ZNF217", "ACP3", dict(sign=1)),
]

gns.interaction_list_standardization(interaction_graph, "data/Mus_musculus.gene_info")

[('AR', 'ALPPL2', {'sign': -1}),
 ('UGT1A6A', 'AHR', {'sign': -1}),
 ('ZFP217', 'ACPP', {'sign': 1})]

In [None]:
# Example:

gns.file_standardization("data/dorothea_ABC_mouse_20230214.sif", "data/Mus_musculus.gene_info", (0,2), "\t")