# Analyze MultiXrank results

MultiXrank is a Random Walk with Restart algorithm designed to explore multilayer networks. Starting from a seed node, it assigns a score to every node in the network relative to that seed. These scores indicate how close each node is to the seed.
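
To make the scoring idea concrete, here is a minimal sketch of a Random Walk with Restart on a single-layer toy graph. It is not MultiXrank's implementation (which also handles jumps between layers); the function name `random_walk_with_restart`, the restart probability of 0.7 and the toy path graph are assumptions chosen for illustration only.

```python
import numpy as np


def random_walk_with_restart(adjacency, seed_index, restart_prob=0.7,
                             tol=1e-10, max_iter=1000):
    """Return RWR scores of all nodes for one seed (single-layer sketch)."""
    adjacency = np.asarray(adjacency, dtype=float)
    # Column-normalise so that each column holds transition probabilities.
    col_sums = adjacency.sum(axis=0)
    col_sums[col_sums == 0] = 1.0  # avoid dividing by zero for isolated nodes
    transition = adjacency / col_sums
    # The walker restarts from the seed with probability restart_prob.
    restart = np.zeros(adjacency.shape[0])
    restart[seed_index] = 1.0
    scores = restart.copy()
    for _ in range(max_iter):
        new_scores = (1 - restart_prob) * transition @ scores + restart_prob * restart
        converged = np.abs(new_scores - scores).sum() < tol
        scores = new_scores
        if converged:
            break
    return scores


# Toy path graph 0 - 1 - 2 - 3, seeded at node 0 (hypothetical data).
toy_adjacency = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
toy_scores = random_walk_with_restart(toy_adjacency, seed_index=0)
print(toy_scores)
```

On this toy graph the scores sum to 1 and decrease with distance from the seed, which is the sense in which a score measures closeness to the seed.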

Our multilayer network consists of two layers: the Rare-X layer containing disease (Rare-X name format), patient, and symptom nodes, and the Orphanet layer containing disease (Orphanet name format) and Human Phenotype Ontology (HPO) nodes.

Our hypothesis is that MultiXrank can uncover previously unknown phenotypes associated with rare diseases. Taking each Rare-X disease in turn as the seed, we look for symptoms that score highly with respect to the seed disease but whose similar HPO terms do not also score highly. These unmatched symptoms/HPO terms might indicate new and unrecognized aspects of the disease's phenotype, potentially leading to valuable insights for diagnosis and treatment.
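
The selection described above could be sketched as follows. This is only an illustration of the idea, not the analysis performed in this notebook: the function `flag_unmatched_symptoms`, the `similar_hpo` mapping (symptom to similar HPO terms) and all toy values are hypothetical.

```python
def flag_unmatched_symptoms(symptom_ranking, hpo_ranking, similar_hpo, top_k):
    """Return top-ranked symptoms with no similar HPO term in the HPO top-k.

    symptom_ranking / hpo_ranking: node names sorted by decreasing score.
    similar_hpo: hypothetical mapping {symptom: set of similar HPO terms}.
    """
    top_hpo = set(hpo_ranking[:top_k])
    flagged = []
    for symptom in symptom_ranking[:top_k]:
        matches = similar_hpo.get(symptom, set())
        # Keep the symptom only if none of its similar HPO terms ranks highly.
        if not (matches & top_hpo):
            flagged.append(symptom)
    return flagged


# Toy rankings and mapping, made up for illustration.
symptom_ranking = ["seizures", "sleep disturbance", "hypotonia"]
hpo_ranking = ["HP:0001250", "HP:0001252", "HP:0000707"]
similar_hpo = {
    "seizures": {"HP:0001250"},
    "sleep disturbance": {"HP:0002360"},
    "hypotonia": {"HP:0001252"},
}
flagged = flag_unmatched_symptoms(symptom_ranking, hpo_ranking, similar_hpo, top_k=3)
print(flagged)
```

In this toy case only the symptom whose similar HPO term is absent from the top of the HPO ranking is flagged as a candidate.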

In [116]:
import pandas as pd
import xml.etree.ElementTree as ET
from pyhpo import Ontology
import numpy as np
import os

In [117]:
tree = ET.parse("../data/en_product4.xml")
root = tree.getroot()

# initialize the Ontology
_ = Ontology()

In [118]:
def find_orpha_name(orpha_code: str) -> str:
    """Function that returns the Orphanet name of a disease
    given its Orphanet code

    Args:
        orpha_code (str): Orphanet code of the disease

    Returns:
        str: the Orphanet name of the disease (None is returned
        implicitly if the code is not found in the XML tree)
    """
    for disorder in root.iter('Disorder'):
        orpha_code_in_tree = disorder.find('OrphaCode').text
        orpha_name = disorder.find('Name').text
        if orpha_code_in_tree == orpha_code:
            return orpha_name
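
# The lookup above walks the whole XML tree on every call. A possible
# alternative, shown here only as a sketch, is to build a code -> name
# dictionary once and query it afterwards. The helper `build_orpha_lookup`
# and the toy XML are assumptions; the toy XML only mimics the
# Disorder/OrphaCode/Name elements used above.

```python
import xml.etree.ElementTree as ET


def build_orpha_lookup(xml_root):
    """Map each Orphanet code found under xml_root to its disease name."""
    lookup = {}
    for disorder in xml_root.iter("Disorder"):
        code = disorder.findtext("OrphaCode")
        name = disorder.findtext("Name")
        if code is not None:
            # setdefault keeps the first match, like the loop-based lookup.
            lookup.setdefault(code, name)
    return lookup


# Toy document with made-up disorders (hypothetical data).
toy_xml = """<Root><DisorderList>
<Disorder><OrphaCode>1</OrphaCode><Name>Toy disorder A</Name></Disorder>
<Disorder><OrphaCode>2</OrphaCode><Name>Toy disorder B</Name></Disorder>
</DisorderList></Root>"""
toy_lookup = build_orpha_lookup(ET.fromstring(toy_xml))
print(toy_lookup.get("1"))
print(toy_lookup.get("999"))  # unknown codes give None
```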

In [119]:
def create_table_diseases_seeds(mapping_file: str, table_name: str) -> dict:
    """Function that generates a mapping table of disease names and the 
    numbers used in MultiXrank to identify diseases

    Args:
        mapping_file (str): name of the mapping file
        table_name (str): name of the output table

    Returns:
        dict: a dictionary mapping each Rare-X disease name to the
        seed number used in MultiXrank
    """
    dico_diseases_seeds = dict()
    df_mapping_file = pd.read_csv(mapping_file, sep=";", header=0)
    df_table = pd.DataFrame(columns=["RARE-X", "ORPHANET", "SEED NUMBER"])
    df_table["RARE-X"] = df_mapping_file["Rx"]
    diseases = df_table["RARE-X"].tolist()
    df_table["ORPHANET"] = df_mapping_file["Orphanet"]
    # one seed number per disease, starting at 1 (27 diseases in the current mapping file)
    seed_numbers = list(range(1, len(diseases) + 1))
    df_table["SEED NUMBER"] = seed_numbers
    df_table.to_csv(table_name, sep="\t", header=True, index=False)
    for disease, seed in zip(diseases, seed_numbers):
        dico_diseases_seeds[disease] = seed
    return dico_diseases_seeds

dico_diseases_seeds = create_table_diseases_seeds(mapping_file="../data/Diseases_Rx_orpha_corres.csv", table_name="../Diseases_names_and_seeds_numbering.tsv")

In [120]:
# Make a dictionary from the correspondence table (Orphanet vs Rare-X disease names)
dico_mapping = dict()
mapping = pd.read_csv("../network/bipartite/bipartite_RARE_X_orpha_diseases.tsv", header=None, sep="\t")
for index, row in mapping.iterrows():
    dico_mapping[str(row[0])] = row[1]

In [121]:
def create_results_table(dico_diseases_seeds: dict, input_nrow: int, output_top: int, outdir: str, resultsdir: str) -> None:
    """Function that generates, for each MultiXrank output (one per
    Rare-X disease), a results file that concatenates the scores
    of the 2 layers in a single file

    Args:
        dico_diseases_seeds (dict): a dictionary of the Rare-X 
        diseases and their seed numbers used in MultiXrank
        input_nrow (int): number of lines to read in MultiXrank outputs
        output_top (int): number of top results to select for output
        outdir (str): MultiXrank output directory
        resultsdir (str): results output directory

    Remark: reading the full output files could take time
    """
    
    os.makedirs(f"../multixrank_RARE_X_diseases/{resultsdir}/", exist_ok=True)
    for disease, seed in dico_diseases_seeds.items():
        # Read layer 1 (Rare-X) output: no term descriptions to add
        # because this layer contains Rare-X disease names, symptom
        # names and patient IDs
        multiplex_layer1 = pd.read_csv(f"../multixrank_RARE_X_diseases/{outdir}/output_{seed}/multiplex_Rare_X_layer.tsv", header=0, sep="\t", nrows=input_nrow)
        ## Remove patient nodes
        multiplex_1_selected = multiplex_layer1[multiplex_layer1['node'].str.contains('[A-Za-z]+', regex=True)]        
        
        # Read layer 2 (orpha-hpo) output
        multiplex_layer2 = pd.read_csv(f"../multixrank_RARE_X_diseases/{outdir}/output_{seed}/multiplex_Orpha_layer.tsv", header=0, sep="\t", nrows=input_nrow)
        # get the nodes into a list
        nodes_layer2 = multiplex_layer2[multiplex_layer2.columns[1]].to_list()
        # initialize empty list to store the descriptions (Orphanet names and phenotype names) for each node
        list_description_layer2 = list()
        # browse nodes in multiplex 2 to add descriptions
        for term in nodes_layer2:
            if term[:5] == "ORPHA":
                orpha_code = term[6:]
                orpha_name = find_orpha_name(orpha_code=orpha_code)
                list_description_layer2.append(orpha_name)
            elif term[:2] == "HP":
                try:
                    hpo_phenotype = Ontology.get_hpo_object(term)
                    list_description_layer2.append(str(hpo_phenotype)[13:])
                # if there is no match of HPO phenotype name
                except RuntimeError:
                    list_description_layer2.append("None")
        # check that the description list and the dataframe have the same length !
        assert len(list_description_layer2) == len(multiplex_layer2.index)
        # create new description columns for the terms
        description_layer2 = pd.DataFrame(list_description_layer2, columns=['description'])
        # create new dataframe for layer 2 containing the ranking of the nodes + their description (orpha names and phenotypes names)
        multiplex_2_with_description = pd.concat([multiplex_layer2.reindex(range(len(multiplex_layer2))), description_layer2.reindex(range(len(multiplex_layer2)))], axis=1)
        
        # Select top of results
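        # use the output_top parameter instead of a hard-coded 21;
        # .copy() avoids modifying a view when adding the separator column
        multiplex_1_head = multiplex_1_selected.reset_index(drop=True).head(output_top).copy()
        multiplex_1_head["empty"] = ""
        multiplex_2_head = multiplex_2_with_description.reset_index(drop=True).head(output_top)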
        
        # concatenate the two dataframes and generate table output
        table_results = pd.concat([multiplex_1_head, multiplex_2_head], axis=1)
        table_results.to_csv(f"../multixrank_RARE_X_diseases/{resultsdir}/results_disease_{seed}.tsv", sep="\t", header=True, index=False)


# MultiXrank on the Disease-Disease phenotype network

## Without weights

In [113]:
## Parameters
outdir = "output_DiseaseDisease_Phenotype"
resultsdir = "results_output_DiseaseDisease_Phenotype"
input_nrow = 1000
output_top = 21

## Results integration
create_results_table(dico_diseases_seeds=dico_diseases_seeds,
                     input_nrow=input_nrow, 
                     output_top=output_top,
                     outdir=outdir, 
                     resultsdir=resultsdir)

In [None]:
# To concatenate all files into one, with an empty line between each result
# rm results_output_DiseaseDisease_Phenotype.tsv
# for i in `ls -v`; do echo $i >> results_output_DiseaseDisease_Phenotype.tsv; sed -s -e $'$a\\\n' $i >> results_output_DiseaseDisease_Phenotype.tsv ; done
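
The shell loop above can also be written in Python. The sketch below is one possible equivalent, not part of the original pipeline: the helpers `natural_key` and `concatenate_results` are hypothetical, the file pattern `results_disease_*.tsv` follows the names written by `create_results_table`, and the demonstration runs on throwaway files in a temporary directory.

```python
from pathlib import Path
import re
import tempfile


def natural_key(path):
    """Sort key that orders numbers numerically, similar to `ls -v`."""
    return [int(tok) if tok.isdigit() else tok
            for tok in re.split(r"(\d+)", path.name)]


def concatenate_results(results_dir, output_file):
    """Write each file name, its content, then an empty line, into output_file."""
    files = sorted(Path(results_dir).glob("results_disease_*.tsv"), key=natural_key)
    with open(output_file, "w") as out:
        for path in files:
            out.write(path.name + "\n")
            content = path.read_text()
            if content and not content.endswith("\n"):
                content += "\n"
            out.write(content + "\n")  # the extra newline is the empty separator line
    return [path.name for path in files]


# Demonstration on toy files (made-up content).
with tempfile.TemporaryDirectory() as tmp:
    for seed in (10, 2, 1):
        (Path(tmp) / f"results_disease_{seed}.tsv").write_text(f"node\tscore\nn{seed}\t0.1\n")
    merged_path = Path(tmp) / "merged.tsv"
    ordered_names = concatenate_results(tmp, merged_path)
    merged_text = merged_path.read_text()
print(ordered_names)
```

Restricting the glob to `results_disease_*.tsv` keeps the merged file itself out of the input list when it is written to the same directory.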

## With weights between Orphanet diseases and HPO phenotypes

Association between disease and phenotype:

- obligate (100%) -> weight = 1
- very frequent (99-80%) -> weight = 4/5
- frequent (79-30%) -> weight = 3/5
- occasional (29-5%) -> weight = 2/5
- very rare (<4-1%) -> weight = 1/5
- excluded (0%) -> weight = 0

Association between disease and disease (two diseases are associated if they share one mutated gene) -> weight = 1
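
As an illustration, the frequency-to-weight scheme listed above can be expressed as a lookup table applied to disease-phenotype associations. This is a sketch under assumptions: the names `FREQUENCY_WEIGHTS` and `weight_edges`, the tuple layout of the associations, the placeholder node names, and the choice to skip zero-weight edges are all hypothetical and do not describe how the network files were actually built.

```python
from fractions import Fraction

# Weights as listed in the text, keyed by lower-cased frequency label.
FREQUENCY_WEIGHTS = {
    "obligate": Fraction(1),
    "very frequent": Fraction(4, 5),
    "frequent": Fraction(3, 5),
    "occasional": Fraction(2, 5),
    "very rare": Fraction(1, 5),
    "excluded": Fraction(0),
}


def weight_edges(associations, weights=FREQUENCY_WEIGHTS):
    """Turn (disease, phenotype, frequency label) triples into weighted edges."""
    weighted = []
    for disease, phenotype, label in associations:
        weight = weights[label.strip().lower()]
        if weight > 0:  # assumption: an edge of weight 0 is simply not written
            weighted.append((disease, phenotype, float(weight)))
    return weighted


# Toy associations with placeholder node names (hypothetical data).
toy_associations = [
    ("ORPHA:X", "HP:A", "Frequent"),
    ("ORPHA:X", "HP:B", "Very rare"),
    ("ORPHA:X", "HP:C", "Excluded"),
]
toy_edges = weight_edges(toy_associations)
print(toy_edges)
```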

In [115]:
## Parameters
outdir = "output_DiseaseDisease_Phenotype_Weighted"
resultsdir = "results_output_DiseaseDisease_Phenotype_Weighted"
input_nrow = 1000
output_top = 21

## Results integration
create_results_table(dico_diseases_seeds=dico_diseases_seeds,
                     input_nrow=input_nrow, 
                     output_top=output_top,
                     outdir=outdir, 
                     resultsdir=resultsdir)

In [None]:
# To concatenate all files into one, with an empty line between each result
# rm results_output_DiseaseDisease_Phenotype_Weighted.tsv
# for i in `ls -v`; do echo $i >> results_output_DiseaseDisease_Phenotype_Weighted.tsv; sed -s -e $'$a\\\n' $i >> results_output_DiseaseDisease_Phenotype_Weighted.tsv ; done


## With inverse weights between Orphanet diseases and HPO phenotypes

Association between disease and phenotype:

- obligate (100%) -> weight = 1/5
- very frequent (99-80%) -> weight = 2/5
- frequent (79-30%) -> weight = 3/5
- occasional (29-5%) -> weight = 4/5
- very rare (<4-1%) -> weight = 1
- excluded (0%) -> weight = 0

/!\ Weights are inverted compared to the previous analysis

Association between disease and disease (two diseases are associated if they share one mutated gene) -> weight = 1
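
The inverted scheme mirrors the weights of the previous analysis: every non-zero weight w becomes 6/5 - w, while excluded associations stay at 0. The sketch below checks this relation; the names `DIRECT_WEIGHTS`, `invert_weight` and `INVERSE_WEIGHTS` are hypothetical and only serve the illustration.

```python
from fractions import Fraction

# Weights of the previous (non-inverted) analysis, as listed in the text.
DIRECT_WEIGHTS = {
    "obligate": Fraction(1),
    "very frequent": Fraction(4, 5),
    "frequent": Fraction(3, 5),
    "occasional": Fraction(2, 5),
    "very rare": Fraction(1, 5),
    "excluded": Fraction(0),
}


def invert_weight(weight, n_levels=5):
    """Mirror a non-zero weight on the 1/5 .. 1 scale; keep 0 unchanged."""
    if weight == 0:
        return Fraction(0)
    return Fraction(n_levels + 1, n_levels) - weight


INVERSE_WEIGHTS = {label: invert_weight(w) for label, w in DIRECT_WEIGHTS.items()}
for label, weight in INVERSE_WEIGHTS.items():
    print(f"{label}: {weight}")
```

With this scheme the rarest phenotypes carry the heaviest edges, so the walker is steered toward them instead of toward the most frequent ones.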

In [122]:
## Parameters
outdir = "output_DiseaseDisease_Phenotype_WeightedInverse"
resultsdir = "results_output_DiseaseDisease_Phenotype_WeightedInverse"
input_nrow = 1000
output_top = 21

## Results integration
create_results_table(dico_diseases_seeds=dico_diseases_seeds,
                     input_nrow=input_nrow, 
                     output_top=output_top,
                     outdir=outdir, 
                     resultsdir=resultsdir)

In [None]:
# To concatenate all files into one, with an empty line between each result
# rm results_output_DiseaseDisease_Phenotype_WeightedInverse.tsv
# for i in `ls -v`; do echo $i >> results_output_DiseaseDisease_Phenotype_WeightedInverse.tsv; sed -s -e $'$a\\\n' $i >> results_output_DiseaseDisease_Phenotype_WeightedInverse.tsv ; done

# MultiXrank on the Disease-Disease phenotype network with phenotype ontology

## Without weights

In [123]:
## Parameters
outdir = "output_DiseaseDisease_PhenotypeOntology"
resultsdir = "results_output_DiseaseDisease_PhenotypeOntology"
input_nrow = 1000
output_top = 21

## Results integration
create_results_table(dico_diseases_seeds=dico_diseases_seeds,
                     input_nrow=input_nrow, 
                     output_top=output_top,
                     outdir=outdir, 
                     resultsdir=resultsdir)

In [None]:
# To concatenate all files into one, with an empty line between each result
# rm results_output_DiseaseDisease_PhenotypeOntology.tsv
# for i in `ls -v`; do echo $i >> results_output_DiseaseDisease_PhenotypeOntology.tsv; sed -s -e $'$a\\\n' $i >> results_output_DiseaseDisease_PhenotypeOntology.tsv ; done

## With weights between Orphanet diseases and HPO phenotypes

Association between disease and phenotype:

- obligate (100%) -> weight = 1
- very frequent (99-80%) -> weight = 4/5
- frequent (79-30%) -> weight = 3/5
- occasional (29-5%) -> weight = 2/5
- very rare (<4-1%) -> weight = 1/5
- excluded (0%) -> weight = 0

Association between phenotype and phenotype (phenotype hierarchy) -> weight = 0.2

Association between disease and disease (two diseases are associated if they share one mutated gene) -> weight = 1

In [124]:
## Parameters
outdir = "output_DiseaseDisease_PhenotypeOntology_Weighted"
resultsdir = "results_output_DiseaseDisease_PhenotypeOntology_Weighted"
input_nrow = 1000
output_top = 21

## Results integration
create_results_table(dico_diseases_seeds=dico_diseases_seeds,
                     input_nrow=input_nrow, 
                     output_top=output_top,
                     outdir=outdir, 
                     resultsdir=resultsdir)

In [None]:
# To concatenate all files into one, with an empty line between each result
# rm results_output_DiseaseDisease_PhenotypeOntology_Weighted.tsv
# for i in `ls -v`; do echo $i >> results_output_DiseaseDisease_PhenotypeOntology_Weighted.tsv; sed -s -e $'$a\\\n' $i >> results_output_DiseaseDisease_PhenotypeOntology_Weighted.tsv ; done

## With inverse weights between Orphanet diseases and HPO phenotypes

Association between disease and phenotype:

- obligate (100%) -> weight = 1/5
- very frequent (99-80%) -> weight = 2/5
- frequent (79-30%) -> weight = 3/5
- occasional (29-5%) -> weight = 4/5
- very rare (<4-1%) -> weight = 1
- excluded (0%) -> weight = 0

/!\ Weights are inverted compared to the previous analysis

Association between phenotype and phenotype (phenotype hierarchy) -> weight = 0.2

Association between disease and disease (two diseases are associated if they share one mutated gene) -> weight = 1

In [125]:
## Parameters
outdir = "output_DiseaseDisease_PhenotypeOntology_WeightedInverse"
resultsdir = "results_output_DiseaseDisease_PhenotypeOntology_WeightedInverse"
input_nrow = 1000
output_top = 21

## Results integration
create_results_table(dico_diseases_seeds=dico_diseases_seeds,
                     input_nrow=input_nrow, 
                     output_top=output_top,
                     outdir=outdir, 
                     resultsdir=resultsdir)

In [None]:
# To concatenate all files into one, with an empty line between each result
# rm results_output_DiseaseDisease_PhenotypeOntology_WeightedInverse.tsv
# for i in `ls -v`; do echo $i >> results_output_DiseaseDisease_PhenotypeOntology_WeightedInverse.tsv; sed -s -e $'$a\\\n' $i >> results_output_DiseaseDisease_PhenotypeOntology_WeightedInverse.tsv ; done
