# General rules whichever the type of data you want to use as observations:

## Nature of the input for BoNesis:

**BoNesis considers boolean observations, an observation being a set of associations component ↔ boolean value.**

Observations are specified by a Python dictionnary linking each observation identifier to a dictionary being the set of associations component ↔ boolean value for this observation:

In [1]:
import pandas as pd

observations = {
    "obs1": {"gene1": 1, "gene2": 0, "gene3": 0, "stimulus1": 0, "phenotype1": 0, "phenotype2": 0},
    "obs2": {"gene1": 1, "gene2": 0, "gene3": 
             1, "stimulus1": 1, "phenotype1": 0, "phenotype2": 0},
    "obs3": {"gene1": 1, "gene3": 1, "stimulus1": 0, "phenotype1": 1, "phenotype2": 0},
    "obs4": {"gene1": 0, "gene2": 1, "gene3": 1, "stimulus1": 0, "phenotype1": 0, "phenotype2": 1},    
}
pd.DataFrame.from_dict(observations, orient="index").fillna('')

Unnamed: 0,gene1,gene2,gene3,stimulus1,phenotype1,phenotype2
obs1,1,0.0,0,0,0,0
obs2,1,0.0,1,1,0,0
obs3,1,,1,0,1,0
obs4,0,1.0,1,0,0,1


You can **directly import observations saved as matrix in a file under the TSV format**:

In [2]:
observations = pd.read_csv("observations.tsv", sep="\t", index_col = [0]).to_dict(orient="index")

In [3]:
observations

{'obs1': {'gene1': 1,
  'gene2': 0.0,
  'gene3': 0,
  'stimulus1': 0,
  'phenotype1': 0,
  'phenotype2': 0},
 'obs2': {'gene1': 1,
  'gene2': 0.0,
  'gene3': 1,
  'stimulus1': 1,
  'phenotype1': 0,
  'phenotype2': 0},
 'obs3': {'gene1': 1,
  'gene2': nan,
  'gene3': 1,
  'stimulus1': 0,
  'phenotype1': 1,
  'phenotype2': 0},
 'obs4': {'gene1': 0,
  'gene2': 1.0,
  'gene3': 1,
  'stimulus1': 0,
  'phenotype1': 0,
  'phenotype2': 1}}

## Data preprocessing before using BoNesis: gene name standardization

For clearing up confusion in order to match data from different sources (interaction graph vs observations), we advise standardization based on [NCBI gene data](https://ncbiinsights.ncbi.nlm.nih.gov/2016/11/07/clearing-up-confusion-with-human-gene-symbols-names-using-ncbi-gene-data/), as follows:

**1. Download gene information from NCBI** <a class="anchor" id="ncbidownload"></a>

Depending on the organism you are interested in, download the corresponding gene info file there: https://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Mammalia/.

You get a TSV file (Tab Separated Values), with notably the following columns:

|Column number|Description of data in the column|  
|:---:|:---|
|2 | GeneID: an integer used as the unique identifier for a gene in NCBI|  
|**3** | **NCBI Symbol**: the default symbol for the gene at NCBI|  
|**5** | **Symbol Synonyms**: bar-delimited set of unofficial symbols for the gene|  
|**11** | **Official Symbol** for this gene designated by the nomenclature authority if it exists (HGNC for human)|   
|9 | NCBI Named Description: the default full name for this gene at NCBI|  
|12 | Full Name for this gene designated by the nomenclature authority if it exists (HGNC for human)|  
|14 | Other full names & designations: pipe-delimited set of some alternate descriptions (‘-‘ indicates none is being reported)|

**2. Use the following function to *standardize* the dictionary containing the observations before importing it in BoNesis:**
`observations_standardization(*dict of observations*, *NCBI gene data TSV file*)`  

*Example:*  
`standardized_observations = observations_standardization(observations, "Mus_musculus.gene_info")`

In [4]:
import gene_name_standardization as gns

In [5]:
# Example:
import pandas as pd

observations = {
    "obs1": {"ALPG": 1, "AR": 0, "ACP3": 0, "ZNF217": 0, "AHR": 0, "UGT1A6": 0},
    "obs2": {"ALPG": 1, "AR": 0, "ACP3": 1, "ZNF217": 1, "AHR": 0, "UGT1A6": 0},
    "obs3": {"ALPG": 1, "AR": 1, "ZNF217": 0, "AHR": 1, "UGT1A6": 0},
    "obs4": {"ALPG": 0, "AR": 1, "ACP3": 1, "ZNF217": 0, "AHR": 0, "UGT1A6": 1},    
}
pd.DataFrame.from_dict(observations, orient="index").fillna('')

Unnamed: 0,ALPG,AR,ACP3,ZNF217,AHR,UGT1A6
obs1,1,0,0.0,0,0,0
obs2,1,0,1.0,1,0,0
obs3,1,1,,0,1,0
obs4,0,1,1.0,0,0,1


In [6]:
# Continuation of the example:

standardized_observations = gns.observations_standardization(observations, "Mus_musculus.gene_info")

pd.DataFrame.from_dict(standardized_observations, orient="index").fillna('')

Unnamed: 0,ALPPL2,AR,ACPP,ZFP217,AHR,UGT1A6A
obs1,1,0,0.0,0,0,0
obs2,1,0,1.0,1,0,0
obs3,1,1,,0,1,0
obs4,0,1,1.0,0,0,1
