# Parse MultiXrank results

MultiXrank is a Random Walk with Restart algorithm designed to explore multilayer networks. Starting from a seed node, it assigns a score to every node in the network with respect to the seed. These scores indicate how close each network node is to the seed.
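To make the scoring idea concrete, here is a minimal single-layer sketch of a Random Walk with Restart on a toy graph. It is only an illustration of the principle, not MultiXrank's multilayer implementation: the function name `random_walk_with_restart`, the toy adjacency matrix and the restart probability of 0.7 are all assumptions chosen for the example.

```python
import numpy as np

def random_walk_with_restart(adjacency, seed_index, restart_prob=0.7,
                             tol=1e-10, max_iter=1000):
    """Return RWR scores for every node of a single-layer graph (toy sketch)."""
    adjacency = np.asarray(adjacency, dtype=float)
    # Column-normalise so that each column holds transition probabilities.
    transition = adjacency / adjacency.sum(axis=0, keepdims=True)
    # The walker restarts from the seed node only.
    restart = np.zeros(adjacency.shape[0])
    restart[seed_index] = 1.0
    scores = restart.copy()
    for _ in range(max_iter):
        updated = (1 - restart_prob) * transition @ scores + restart_prob * restart
        # Stop once the scores no longer change between two iterations.
        if np.abs(updated - scores).sum() < tol:
            break
        scores = updated
    return updated

# Toy path graph: 0 - 1 - 2 - 3, with node 0 as the seed
toy_adjacency = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
toy_scores = random_walk_with_restart(toy_adjacency, seed_index=0)
print(toy_scores)
```

On this toy graph the scores sum to 1 and decrease with distance from the seed, which is the behaviour we rely on when ranking nodes.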

Our multilayer network consists of two layers: 
- the **Rare-X layer** contains disease (Rare-X name format), patient, and symptom nodes. The file is named `RARE_X_layer.tsv`.
- the **Orphanet layer** contains disease (Orphanet name format) and Human Phenotype Ontology (HPO) nodes. The file is named `DiseaseDisease_PhenotypeOntology.tsv`.

Our hypothesis is that MultiXrank can uncover previously unknown phenotypes associated with rare diseases. Taking each Rare-X disease in turn as the seed, we look for symptoms that score highly with respect to the seed disease but have no similar HPO term that also scores highly. Such unmatched symptoms might indicate new and unrecognized aspects of the disease's phenotype, potentially leading to valuable insights for diagnosis and treatment.

In [2]:
import pandas as pd
import xml.etree.ElementTree as ET
from pyhpo import Ontology
import numpy as np
import os

## 1) Retrieve HPO ontologies

First, we retrieve the HPO ontologies, stored in the `en_product4.xml` file. 

In [3]:
tree = ET.parse("../data/en_product4.xml")
root = tree.getroot()

# initialize the HPO Ontology
_ = Ontology()

## 2) Create mapping between disease names and seeds

We create a mapping table between disease names and the seed numbers used in multiXrank, using the `create_table_diseases_seeds()` function. 

In [5]:
def create_table_diseases_seeds(mapping_file: str, table_name: str) -> dict:
    """Function that generates a mapping table of disease names and the 
    numbers used in multixrank to identify diseases

    Args:
        mapping_file (str): name of the mapping file
        table_name (str): name of the output table

    Returns:
        dict: a dictionary of correspondences between rare-x disease names
        and the seed numbers used in multixrank
    """
    dico_diseases_seeds = dict()
    df_mapping_file = pd.read_csv(mapping_file, sep=";", header=0)
    df_table = pd.DataFrame(columns=["RARE-X", "ORPHANET", "SEED NUMBER"])
    df_table["RARE-X"] = df_mapping_file["Rx"]
    diseases = df_table["RARE-X"].tolist()
    df_table["ORPHANET"] = df_mapping_file["Orphanet"]
    # one seed number per disease, starting at 1 (27 diseases in our mapping file)
    seed_numbers = list(range(1, len(df_mapping_file) + 1))
    df_table["SEED NUMBER"] = seed_numbers
    df_table.to_csv(table_name, sep="\t", header=True, index=False)
    for disease, seed in zip(diseases, seed_numbers):
        dico_diseases_seeds[disease] = seed
    return dico_diseases_seeds

dico_diseases_seeds = create_table_diseases_seeds(mapping_file="../data/Diseases_Rx_orpha_corres.csv", table_name="../Diseases_names_and_seeds_numbering.tsv")

## 3) Create the results table

The `find_orpha_name()` function returns the Orphanet disease name for a given Orphanet code. 

In [None]:
def find_orpha_name(orpha_code: str) -> str:
    """Function that returns the orphanet
    name of a disease given its orphanet
    code

    Args:
        orpha_code (str): orphanet code of 
        the disease

    Returns:
        str: the orphanet name of the disease
    """
    for disorder in root.iter('Disorder'):
        orpha_code_in_tree = disorder.find('OrphaCode').text
        orpha_name = disorder.find('Name').text
        if orpha_code_in_tree == orpha_code:
            return orpha_name
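`find_orpha_name()` walks the whole XML tree on every call. When many codes have to be resolved, building a dictionary once can be faster. The sketch below illustrates this on a tiny, made-up XML snippet that only mimics the `Disorder` / `OrphaCode` / `Name` layout used above; the helper name `build_orpha_lookup`, the codes and the disorder names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the layout that find_orpha_name() expects.
sample_xml = """
<Root>
  <Disorder><OrphaCode>1001</OrphaCode><Name>Example disorder A</Name></Disorder>
  <Disorder><OrphaCode>1002</OrphaCode><Name>Example disorder B</Name></Disorder>
</Root>
""".strip()

def build_orpha_lookup(xml_root):
    """Map each Orphanet code to its disorder name in a single pass."""
    lookup = {}
    for disorder in xml_root.iter("Disorder"):
        code = disorder.findtext("OrphaCode")
        name = disorder.findtext("Name")
        if code is not None:
            # Keep the first match, as find_orpha_name() does.
            lookup.setdefault(code, name)
    return lookup

sample_root = ET.fromstring(sample_xml)
orpha_lookup = build_orpha_lookup(sample_root)

known_name = orpha_lookup.get("1001")    # name of a code present in the file
missing_name = orpha_lookup.get("9999")  # None when the code is absent
print(known_name, missing_name)
```

With the real file, the same helper could be called once on `root`, and each lookup then becomes a dictionary access.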

The `create_results_table()` function first reads the multiXrank result files for each seed:
- multiXrank results are stored in the `multiplex_Rare_X_layer.tsv` file for the Rare-X layer.
- multiXrank results are stored in the `multiplex_Orpha_layer.tsv` file for the Orphanet layer.

Only the first 1000 lines of each file are loaded: the files are large and there are many of them, so reading them in full can take time. You can change this number using the `input_nrow` parameter.

The function then adds a description to each node of the Orphanet layer: the Orphanet disease name for Orphanet codes and the phenotype name for HPO codes. 

We keep only the top-ranked results of each layer (21 rows in the run below) to simplify the analysis. You can change this number using the `output_top` parameter. 

Finally, the function concatenates the results of both layers into one TSV file per seed. You can set the output directory name using the `resultsdir` parameter. 
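The side-by-side layout of the summary file can be illustrated on two toy tables. This is a sketch only: the column names (`node`, `score`, `description`) and all values are made up for the example and are assumptions about the general shape of the data, not actual multiXrank output.

```python
import pandas as pd

# Toy stand-ins for the two layer outputs (values are invented).
layer1 = pd.DataFrame({
    "node": ["disease_A", "symptom_1", "patient_7"],
    "score": [0.50, 0.20, 0.10],
})
layer2 = pd.DataFrame({
    "node": ["ORPHA:0000", "HP:0000000", "HP:0000001"],
    "score": [0.05, 0.03, 0.01],
    "description": ["toy disease", "toy phenotype", "another toy phenotype"],
})

top = 2  # plays the role of output_top

# Keep the top rows of each layer and align them on a fresh 0..top-1 index.
left = layer1.head(top).reset_index(drop=True).copy()
left["empty"] = ""  # blank separator column between the two layers
right = layer2.head(top).reset_index(drop=True)

# axis=1 places the two tables next to each other, row by row.
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side)
```

Resetting the index on both tables before concatenating is what guarantees that rank 1 of one layer sits next to rank 1 of the other.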

In [121]:
def create_results_table(dico_diseases_seeds: dict, input_nrow: int, output_top: int, outdir: str, resultsdir: str) -> None:
    """Function that generates, for each multixrank output (i.e. each
    rarex disease), a results file that concatenates the scores of
    the 2 layers in a single file

    Args:
        dico_diseases_seeds (dict): a dictionary of the rarex 
        diseases and their seed numbers used in multixrank
        input_nrow (int): number of lines to read in multixrank outputs
        output_top (int): number of top results to select for output
        outdir (str): multixrank output directory
        resultsdir (str): results output directory

    Remark: reading the full output files could take time
    """
    
    os.makedirs(f"../multixrank_RARE_X_diseases/{resultsdir}/", exist_ok=True)
    for disease, seed in dico_diseases_seeds.items():
        # Read layer 1 (rarex) output: no term descriptions to add
        # because this layer contains RARE-X disease names, symptom
        # names and patient IDs
        multiplex_layer1 = pd.read_csv(f"../multixrank_RARE_X_diseases/{outdir}/output_{seed}/multiplex_Rare_X_layer.tsv", header=0, sep="\t", nrows=input_nrow)
        
        # Read layer 2 (orpha-hpo) output
        multiplex_layer2 = pd.read_csv(f"../multixrank_RARE_X_diseases/{outdir}/output_{seed}/multiplex_Orpha_layer.tsv", header=0, sep="\t", nrows=input_nrow)
        # get the nodes into a list
        nodes_layer2 = multiplex_layer2[multiplex_layer2.columns[1]].to_list()
        # initialize empty list to store the descriptions (orpha names and phenotype names) for each node
        list_description_layer2 = list()
        # browse nodes in multiplex 2 to add description
        for term in nodes_layer2:
            if term[:5] == "ORPHA":
                orpha_code = term[6:]
                orpha_name = find_orpha_name(orpha_code=orpha_code)
                list_description_layer2.append(orpha_name)
            elif term[:2] == "HP":
                try:
                    hpo_phenotype = Ontology.get_hpo_object(term)
                    list_description_layer2.extend([str(hpo_phenotype)[13:]])
                # if there is no match of HPO phenotype name
                except RuntimeError:
                    list_description_layer2.append("None")
        # check that the description list and the dataframe have the same length !
        assert len(list_description_layer2) == len(multiplex_layer2.index)
        # create new description columns for the terms
        description_layer2 = pd.DataFrame(list_description_layer2, columns=['description'])
        # create new dataframe for layer 2 containing the ranking of the nodes + their description (orpha names and phenotypes names)
        multiplex_2_with_description = pd.concat([multiplex_layer2.reindex(range(len(multiplex_layer2))), description_layer2.reindex(range(len(multiplex_layer2)))], axis=1)
        
        # Select the top results (output_top rows per layer)
        multiplex_1_head = multiplex_layer1.reset_index(drop=True).head(output_top).copy()
        multiplex_1_head["empty"] = ""
        multiplex_2_head = multiplex_2_with_description.reset_index(drop=True).head(output_top)
        
        # concatenate the two dataframes and generate table output
        table_results = pd.concat([multiplex_1_head, multiplex_2_head], axis=1)
        table_results.to_csv(f"../multixrank_RARE_X_diseases/{resultsdir}/results_disease_{seed}.tsv", sep="\t", header=True, index=False)


## Multixrank on Disease-Disease phenotype with phenotype ontology network

We run multiXrank on two layers: the **Rare-X disease layer** and the **Orphanet disease layer**. 

The **Rare-X disease layer** (`RARE_X_layer.tsv`) contains disease, patient and symptom nodes and two types of edges (disease-patient and patient-symptom). Edges are **not weighted** in this layer.


The **Orphanet disease layer** (`DiseaseDisease_PhenotypeOntology.tsv`) contains disease and phenotype nodes, with three types of connections: disease-disease (if the two diseases share at least one mutated gene), disease-phenotype, and phenotype-phenotype (HPO ontology). Connections are **weighted** in this layer:

- between **diseases**: weight = 1
- between **diseases and phenotypes**: weight is based on the association frequency:
    - obligate (100%) -> weight = 1
    - very frequent (99-80%) -> weight = 4/5
    - frequent (79-30%) -> weight = 3/5
    - occasional (29-5%) -> weight = 2/5
    - very rare (<4-1%) -> weight = 1/5
    - excluded (0%) -> weight = 0
- between **phenotypes**: weight = 0.2
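The weighting scheme above can be written down as a small lookup. This is a sketch for illustration only: the names `FREQUENCY_WEIGHTS` and `edge_weight`, and the edge-type labels passed as strings, are hypothetical and are not taken from the code that built the network file.

```python
# Weights for disease-phenotype edges, keyed by association frequency.
FREQUENCY_WEIGHTS = {
    "obligate": 1.0,          # 100%
    "very frequent": 4 / 5,   # 99-80%
    "frequent": 3 / 5,        # 79-30%
    "occasional": 2 / 5,      # 29-5%
    "very rare": 1 / 5,       # <4-1%
    "excluded": 0.0,          # 0%
}
DISEASE_DISEASE_WEIGHT = 1.0
PHENOTYPE_PHENOTYPE_WEIGHT = 0.2

def edge_weight(edge_type, frequency=None):
    """Return the weight of an edge in the Orphanet layer (toy helper)."""
    if edge_type == "disease-disease":
        return DISEASE_DISEASE_WEIGHT
    if edge_type == "phenotype-phenotype":
        return PHENOTYPE_PHENOTYPE_WEIGHT
    if edge_type == "disease-phenotype":
        # Frequency labels are matched case-insensitively.
        return FREQUENCY_WEIGHTS[frequency.strip().lower()]
    raise ValueError(f"unknown edge type: {edge_type!r}")

example_weights = [
    edge_weight("disease-disease"),
    edge_weight("disease-phenotype", "Very frequent"),
    edge_weight("disease-phenotype", "occasional"),
    edge_weight("phenotype-phenotype"),
]
print(example_weights)
```

Note that an "excluded" association gets a weight of 0, so it contributes nothing to the random walk.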

We run a multiXrank analysis with each of the 27 Rare-X diseases as the seed. You can retrieve the corresponding seed numbers from the `Diseases_names_and_seeds_numbering.tsv` file.

Each seed yields two result files (one for each layer analysed). We create a summary file for each multiXrank analysis. 

In [115]:
## Parameters
outdir = "output_DiseaseDisease_PhenotypeOntology_Weighted"
resultsdir = "results_output_DiseaseDisease_PhenotypeOntology_Weighted"
input_nrow = 1000
output_top = 21

## Results integration
create_results_table(dico_diseases_seeds=dico_diseases_seeds,
                     input_nrow=input_nrow, 
                     output_top=output_top,
                     outdir=outdir, 
                     resultsdir=resultsdir)

Finally, we concatenate results from each seed analysis into one file. 

In [None]:
# To concatenate all files into one, with an empty line between each result
!rm -f results_output_DiseaseDisease_PhenotypeOntology_Weighted.tsv
!for i in `ls -v`; do echo $i >> results_output_DiseaseDisease_PhenotypeOntology_Weighted.tsv; sed -s -e $'$a\\\n' $i >> results_output_DiseaseDisease_PhenotypeOntology_Weighted.tsv ; done
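For readers who prefer to stay in Python, the shell loop above can be approximated as follows. This is a sketch under assumptions: the helper name `concatenate_results` is hypothetical, the per-seed files are assumed to follow the `results_disease_<seed>.tsv` naming used earlier, and the demo runs on throwaway files in a temporary directory rather than on real results.

```python
import re
import tempfile
from pathlib import Path

def concatenate_results(results_dir, combined_path):
    """Write every per-seed file into one file, in seed order (toy helper)."""
    def seed_number(path):
        # Sort numerically on the seed number, like `ls -v` does.
        match = re.search(r"(\d+)", path.stem)
        return int(match.group(1)) if match else -1

    files = sorted(Path(results_dir).glob("results_disease_*.tsv"), key=seed_number)
    with open(combined_path, "w") as combined:
        for path in files:
            combined.write(path.name + "\n")  # file name as a header line
            content = path.read_text()
            combined.write(content if content.endswith("\n") else content + "\n")
            combined.write("\n")              # empty line between results
    return [path.name for path in files]

# Demo on throwaway files
with tempfile.TemporaryDirectory() as tmp:
    for seed in (1, 2, 10):
        (Path(tmp) / f"results_disease_{seed}.tsv").write_text(
            f"node\tscore\nseed_{seed}\t0.5\n"
        )
    combined_file = Path(tmp) / "combined.tsv"
    ordered_names = concatenate_results(tmp, combined_file)
    combined_text = combined_file.read_text()

print(ordered_names)
```

Sorting on the numeric seed keeps `results_disease_10.tsv` after `results_disease_2.tsv`, which a plain alphabetical sort would not.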