# General rules, whatever type of data you use as observations:

## Nature of the input for BoNesis:

**BoNesis works on Boolean observations: an observation is a set of associations component ↔ Boolean value.**

Observations are specified as a Python dictionary that maps each observation identifier to a dictionary holding the component ↔ Boolean value associations of that observation:

In [1]:
import pandas as pd
observations = {
    "obs1": {"gene1": 1, "gene2": 0, "gene3": 0, "stimulus1": 0, "phenotype1": 0, "phenotype2": 0},
    "obs2": {"gene1": 1, "gene2": 0, "gene3": 1, "stimulus1": 1, "phenotype1": 0, "phenotype2": 0},
    "obs3": {"gene1": 1, "gene3": 1, "stimulus1": 0, "phenotype1": 1, "phenotype2": 0},
    "obs4": {"gene1": 0, "gene2": 1, "gene3": 1, "stimulus1": 0, "phenotype1": 0, "phenotype2": 1},    
}
pd.DataFrame.from_dict(observations, orient="index").fillna('')

Unnamed: 0,gene1,gene2,gene3,stimulus1,phenotype1,phenotype2
obs1,1,0.0,0,0,0,0
obs2,1,0.0,1,1,0,0
obs3,1,,1,0,1,0
obs4,0,1.0,1,0,0,1
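
Before passing such a dictionary on, it can help to check that every value really is Boolean. The sketch below is only an illustration: `check_boolean_observations` is a hypothetical helper, not part of BoNesis. It accepts `0`/`1` (and `False`/`True`) and rejects anything else. As in `obs3` above, a component that was not observed is simply absent from the inner dictionary.

```python
def check_boolean_observations(observations):
    """Raise ValueError if a value is not 0/1 (or False/True). Hypothetical helper."""
    for obs_id, assignments in observations.items():
        for component, value in assignments.items():
            if value not in (0, 1):
                raise ValueError(
                    f"{obs_id}: {component}={value!r} is not a Boolean value"
                )
    return True


checked_observations = {
    "obs1": {"gene1": 1, "gene2": 0},
    "obs3": {"gene1": 1},  # gene2 was not observed: the key is simply absent
}
check_boolean_observations(checked_observations)
```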


You can **directly import observations saved as a matrix in a TSV file** with the parsing function  
`get_observations_from_file(<path matrix tsv>, <field separator>)`

In [2]:
from typing import Dict
def get_observations_from_file(path_data: str, sep: str = "\t") -> Dict[str, Dict[str, int]]:
    """
    Parse a matrix of observations into a dict of observations.

    file format input:
     - 1st line: observed components names
     - 1st column: observations identifiers
    Cells that are empty or contain "NA" are treated as missing and skipped.
    """
    observations_dict = dict()
    with open(path_data, 'r') as observations_file:
        for i, line in enumerate(observations_file):
            # Strip only the line ending, so that empty leading or trailing cells are kept.
            fields = line.replace('"', '').rstrip("\r\n").split(sep)
            if i == 0:
                observed_components = fields
            elif not line.strip():
                continue  # Skip blank lines.
            else:
                observation_id = fields[0].strip()
                observations_dict[observation_id] = dict()
                # The first column contains the observation identifier, so skip it.
                for component, value in zip(observed_components[1:], fields[1:]):
                    value = value.strip()
                    if value and "NA" not in value:
                        observations_dict[observation_id][component] = int(value)
    return observations_dict

In [3]:
# Example:

observations_from_file = get_observations_from_file("observations.tsv")
pd.DataFrame.from_dict(observations_from_file, orient="index").fillna('')

Unnamed: 0,gene1,gene2,gene3,stimulus1,phenotype1,phenotype2
obs1,1,0.0,0,0,0,0
obs2,1,0.0,1,1,0,0
obs3,1,,1,0,1,0
obs4,0,1.0,1,0,0,1
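
To make the expected file layout concrete, here is a minimal sketch that reads the same kind of matrix with pandas instead of the hand-written parser. It is an alternative shown for illustration, not the function defined above: `observations_from_tsv` and the in-memory `tsv_text` are hypothetical names, and the sketch relies on `pd.read_csv` reading `NA` cells as missing values by default.

```python
import io

import pandas as pd

# In-memory stand-in for a TSV file: header line with component names,
# first column with observation identifiers, "NA" for missing values.
tsv_text = (
    "\tgene1\tgene2\tgene3\n"
    "obs1\t1\t0\t0\n"
    "obs2\t1\tNA\t1\n"
)


def observations_from_tsv(source, sep="\t"):
    """Load a matrix of observations and drop the missing cells."""
    df = pd.read_csv(source, sep=sep, index_col=0)
    return {
        str(obs_id): {col: int(val) for col, val in row.items() if pd.notna(val)}
        for obs_id, row in df.iterrows()
    }


observations_from_text = observations_from_tsv(io.StringIO(tsv_text))
print(observations_from_text)
# {'obs1': {'gene1': 1, 'gene2': 0, 'gene3': 0}, 'obs2': {'gene1': 1, 'gene3': 1}}
```

The same function accepts a file path in place of the `io.StringIO` object.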


## Data preprocessing before using BoNesis: gene name standardization

To avoid confusion when matching data from different sources (interaction graph vs. observations), we advise standardizing gene names based on [NCBI gene data](https://ncbiinsights.ncbi.nlm.nih.gov/2016/11/07/clearing-up-confusion-with-human-gene-symbols-names-using-ncbi-gene-data/), as follows:

1. **Download the gene data file for the organism you are interested in:** https://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Mammalia/. You get a TSV (tab-separated values) file that contains, among others, the following columns:

|Column number|Description of data in the column|  
|:---:|:---|
|2 | GeneID: an integer used as the unique identifier for a gene in NCBI|  
|**3** | **NCBI Symbol**: the default symbol for the gene at NCBI|  
|**5** | **Symbol Synonyms**: bar-delimited set of unofficial symbols for the gene|  
|**11** | **Official Symbol** for this gene designated by the nomenclature authority if it exists (HGNC for human)|   
|9 | NCBI Named Description: the default full name for this gene at NCBI|  
|12 | Full Name for this gene designated by the nomenclature authority if it exists (HGNC for human)|  
|14 | Other full names & designations: pipe-delimited set of some alternate descriptions ("-" indicates that none is reported)|

2. **Use the following function to *standardize* the dictionary containing the observations before importing it into BoNesis:**
`standardize_genename_in_dict_of_observations(<dict of observations>,<NCBI gene data TSV file>)`  
Example:  
`standardized_observations = standardize_genename_in_dict_of_observations(observations, "Mus_musculus.gene_info")`

In [4]:
import os
import shlex
from typing import List, Set, Dict, Tuple


def get_dict_matching_synonyms_to_refgenename(path_NCBIgenedata: str) -> Dict:
    """
    Create a dictionary matching each possible gene name to its NCBI symbol.
    
    Particularity:
    The NCBI gene data is a large matrix, so it is first parsed with awk to speed up the task.
    The result is written to a temporary file, which is deleted once it has been read.
    
    INPUT
        path_NCBIgenedata: path to the NCBI gene data
    OUTPUT
        dictionary (key: gene name, value: reference gene name (being the NCBI symbol))
    """
    
    # Parse the downloaded NCBI gene data: keep columns 3, 5 and 11, skip the header line (NR > 1),
    # and turn the "|"-delimited synonyms into tab-separated fields.
    path_NCBIgenedata_cut = f"{path_NCBIgenedata}_cut"
    command_parsing = (
        "awk -F'\t' 'NR > 1 {print $3 \"\t\" $5 \"\t\" $11}' " + shlex.quote(path_NCBIgenedata)
        + " | tr '|' '\t' > " + shlex.quote(path_NCBIgenedata_cut)
    )
    os.system(command_parsing)
    
    # Extract gene data information:    
    gene_synonyms_dict = dict()
    symbols = set()

    with open(path_NCBIgenedata_cut, "r") as file_synonyms:
        for line in file_synonyms:
            gene_symbols_list = line.strip().upper().split("\t")
            # Extract the reference gene symbol (NCBI symbol):
            ncbi_symbol = gene_symbols_list.pop(0)
            # Discard non-informative synonyms ("-" means that none is reported):
            synonyms = [syn for syn in gene_symbols_list if (syn != "-" and syn != ncbi_symbol)]

            # Map the reference gene symbol (NCBI symbol) to itself:
            gene_synonyms_dict[ncbi_symbol] = ncbi_symbol
            symbols.add(ncbi_symbol)

            for synonym in synonyms:
                if synonym not in symbols:
                    # Warning about the NCBI list of synonyms:
                    # a name can be a synonym of several symbols.
                    # The choice is arbitrary: a later gene listing the same synonym
                    # overwrites the earlier mapping, and a synonym never replaces
                    # an NCBI symbol that has already been read.
                    gene_synonyms_dict[synonym] = ncbi_symbol
                    
    os.remove(path_NCBIgenedata_cut)
    return gene_synonyms_dict


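def get_reference_gene_name(gene_name: str, dict_synonyms: dict) -> str:
    """
    Given a gene name, return its reference name.
    The lookup is case-insensitive: the name is upper-cased first.
    INPUT
        gene_name: the gene name whose reference name is wanted
        dict_synonyms: dictionary (key: gene name, value: reference gene name)
    OUTPUT
        the reference name, or the upper-cased input name if it is unknown
    """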
    gene_name = gene_name.upper()
    if gene_name in dict_synonyms:
        return dict_synonyms[gene_name]
    return gene_name
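
For readers who cannot rely on `awk`, the same mapping can be sketched in pure Python with the standard `csv` module. This is an assumption-laden illustration, not the function above: `synonyms_to_reference` and `toy_row` are hypothetical names, and the gene symbols in `toy_gene_info` are invented placeholders laid out like the NCBI columns (symbol in column 3, synonyms in column 5, official symbol in column 11).

```python
import csv
import io


def synonyms_to_reference(gene_info_file):
    """Map each upper-cased symbol or synonym to its NCBI symbol (column 3)."""
    reader = csv.reader(gene_info_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader)  # skip the header line
    mapping, symbols = {}, set()
    for row in reader:
        ncbi_symbol = row[2].upper()
        # Column 5 holds "|"-delimited synonyms, column 11 the official symbol.
        synonyms = row[4].upper().split("|") + [row[10].upper()]
        mapping[ncbi_symbol] = ncbi_symbol
        symbols.add(ncbi_symbol)
        for synonym in synonyms:
            if synonym not in ("-", ncbi_symbol) and synonym not in symbols:
                mapping[synonym] = ncbi_symbol
    return mapping


def toy_row(gene_id, symbol, synonyms, official):
    """Build a 14-column line with only the columns used here filled in."""
    row = ["-"] * 14
    row[1], row[2], row[4], row[10] = gene_id, symbol, synonyms, official
    return "\t".join(row)


# Invented placeholder content, not real NCBI records.
toy_gene_info = "\n".join([
    "\t".join(f"col{i}" for i in range(1, 15)),
    toy_row("1", "Genea", "Ga1|Ga2", "Genea"),
    toy_row("2", "Geneb", "-", "-"),
]) + "\n"

mapping = synonyms_to_reference(io.StringIO(toy_gene_info))
print(mapping)
# {'GENEA': 'GENEA', 'GA1': 'GENEA', 'GA2': 'GENEA', 'GENEB': 'GENEB'}
```

On a real `gene_info` file this avoids the temporary file, at the cost of reading all columns in Python.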

In [5]:
def standardize_genename_in_dict_of_observations(observations_dict: Dict, path_NCBIgenedata: str) -> Dict:
    """
    Create a copy of the input dict of observations, with each gene name replaced by its reference name (NCBI symbol).
    
    Require the following functions:
        get_dict_matching_synonyms_to_refgenename
        get_reference_gene_name
        
    INPUT
        observations_dict: dict (key = observation identifier, value = dict (key = genename, value = gene status))
        path_NCBIgenedata: path to the NCBI gene data
    OUTPUT
        dict (key = observation identifier, value = dict (key = reference genename, value = gene status))
    """
    
    # Get gene data information:
    gene_synonyms_dict = get_dict_matching_synonyms_to_refgenename(path_NCBIgenedata)
    
    # Copy the dict of observations, replacing each gene name with its reference gene name (NCBI symbol):
    standardized_observations_dict = dict()
    
    for k,v in observations_dict.items():
        standardized_observations_dict[k] = dict()
        for component, status in v.items():
            standardized_component = get_reference_gene_name(component, gene_synonyms_dict)
            standardized_observations_dict[k][standardized_component] = status
    
    return standardized_observations_dict

In [6]:
# Example:

observations = {
    "obs1": {"ALPG": 1, "AR": 0, "ACP3": 0, "ZNF217": 0, "AHR": 0, "UGT1A6": 0},
    "obs2": {"ALPG": 1, "AR": 0, "ACP3": 1, "ZNF217": 1, "AHR": 0, "UGT1A6": 0},
    "obs3": {"ALPG": 1, "AR": 1, "ZNF217": 0, "AHR": 1, "UGT1A6": 0},
    "obs4": {"ALPG": 0, "AR": 1, "ACP3": 1, "ZNF217": 0, "AHR": 0, "UGT1A6": 1},    
}
pd.DataFrame.from_dict(observations, orient="index").fillna('')

Unnamed: 0,ALPG,AR,ACP3,ZNF217,AHR,UGT1A6
obs1,1,0,0.0,0,0,0
obs2,1,0,1.0,1,0,0
obs3,1,1,,0,1,0
obs4,0,1,1.0,0,0,1


In [7]:
# Continuation of the example:

standardized_observations = standardize_genename_in_dict_of_observations(observations, "Mus_musculus.gene_info")

pd.DataFrame.from_dict(standardized_observations, orient="index").fillna('')

Unnamed: 0,ALPPL2,AR,ACPP,ZFP217,AHR,UGT1A6A
obs1,1,0,0.0,0,0,0
obs2,1,0,1.0,1,0,0
obs3,1,1,,0,1,0
obs4,0,1,1.0,0,0,1
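
One consequence of the standardization is worth checking. Every value is stored under its reference name, so two names in the same observation that share a reference name end up under a single key, and the later value overwrites the earlier one. The sketch below reports such clashes before standardizing. `find_name_collisions`, `toy_reference` and the toy names are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def find_name_collisions(observations, to_reference):
    """Return {observation id: {reference name: [original names]}} for clashes."""
    collisions = {}
    for obs_id, assignments in observations.items():
        by_reference = defaultdict(list)
        for component in assignments:
            by_reference[to_reference(component)].append(component)
        clashes = {ref: names for ref, names in by_reference.items() if len(names) > 1}
        if clashes:
            collisions[obs_id] = clashes
    return collisions


# Invented synonym table standing in for the NCBI-derived dictionary.
toy_synonyms = {"GA1": "GENEA", "GENEA": "GENEA"}


def toy_reference(name):
    """Upper-case the name and look it up, falling back to the name itself."""
    return toy_synonyms.get(name.upper(), name.upper())


toy_observations = {
    "obs1": {"Ga1": 1, "GeneA": 0, "GeneB": 1},  # Ga1 and GeneA share a reference name
    "obs2": {"GeneA": 1},
}
collisions = find_name_collisions(toy_observations, toy_reference)
print(collisions)
# {'obs1': {'GENEA': ['Ga1', 'GeneA']}}
```

With the functions defined above, `to_reference` could be `lambda name: get_reference_gene_name(name, gene_synonyms_dict)`.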
