# Build the Orphanet/HPO layer

The following code extracts information from the Orphanet and Human Phenotype Ontology (HPO) databases:
* **disease-phenotype** associations from Orphanet (weights ranging from 0 to 1, according to the reported phenotype frequency in Orphanet)
* **disease-disease** associations from Orphanet (two diseases are connected if they share at least one causal gene, all weights set to 1)
* **phenotype-phenotype** associations from HPO (full ontology, all weights set to 1/5)

These associations are then stored in a network and saved in file **`network/multiplex/Orpha/DiseaseDisease_PhenotypeOntology.tsv`**.
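
Because the three association types end up in one edge list, it can help to tell them apart again when checking the result. The snippet below is a minimal sketch, assuming that disease nodes always start with `ORPHA:` and that every other node is an HPO term. The helper `edge_type` and the toy rows are illustrative and not part of the notebook.

```python
import csv
import io
from collections import Counter

# Toy rows in the same three-column layout as the final TSV (source, target, weight).
# The two disease ids on the last row are made-up placeholders.
TOY_NETWORK = (
    "ORPHA:58\tHP:0000256\t0.8\n"
    "HP:0000118\tHP:0000001\t0.2\n"
    "ORPHA:0001\tORPHA:0002\t1\n"
)

def edge_type(source, target):
    """Name the association type from the node id prefixes (hypothetical helper)."""
    kinds = sorted(
        "disease" if node.startswith("ORPHA:") else "phenotype"
        for node in (source, target)
    )
    return "-".join(kinds)

reader = csv.reader(io.StringIO(TOY_NETWORK), delimiter="\t")
counts = Counter(edge_type(source, target) for source, target, _weight in reader)
print(dict(counts))
```

Pointing the reader at the real file instead of `io.StringIO` would give the number of edges of each type in the final network.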

All Orphanet and HPO ids are listed in the following files:
* `data/OrphaDisease_HPO_extract.tsv` contains the OrphaCode, disease name, HPO id, HPO term and frequency weight of each association
* `data/hpo_terms.tsv` contains all HPO ids and terms


## 1) Build disease-phenotype network from Orphanet Data

This code extracts disease-phenotype associations from Orphanet. These are used to create a weighted network in which diseases and phenotypes are represented as nodes. The associations are represented as edges, weighted according to the reported phenotype frequency:
* Obligate: 1
* Very frequent: 4/5
* Frequent: 3/5
* Occasional: 2/5
* Very rare: 1/5
* Excluded: 0

The Orphanet data is downloaded from: https://www.orphadata.com/data/xml/en_product4.xml

The computed network of Disease-Phenotype associations is stored in file `data/Disease_Phenotype.tsv`.
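
The extraction depends on how the Orphanet XML nests each phenotype with its frequency. The sketch below parses a simplified, hand-written fragment to show that nesting. The element names `HPODisorderAssociationList` and `HPODisorderAssociation` are assumptions about the file layout, the disorder name is a placeholder, and the real file carries more elements and attributes. The sketch reads one association element at a time, so each HPO id is paired with the frequency stored next to it.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical fragment; not copied from en_product4.xml.
TOY_XML = """
<JDBOR>
  <Disorder>
    <OrphaCode>58</OrphaCode>
    <Name>Example disorder</Name>
    <HPODisorderAssociationList>
      <HPODisorderAssociation>
        <HPO><HPOId>HP:0000256</HPOId><HPOTerm>Macrocephaly</HPOTerm></HPO>
        <HPOFrequency><Name>Very frequent (99-80%)</Name></HPOFrequency>
      </HPODisorderAssociation>
      <HPODisorderAssociation>
        <HPO><HPOId>HP:0001250</HPOId><HPOTerm>Seizure</HPOTerm></HPO>
        <HPOFrequency><Name>Occasional (29-5%)</Name></HPOFrequency>
      </HPODisorderAssociation>
    </HPODisorderAssociationList>
  </Disorder>
</JDBOR>
"""

toy_root = ET.fromstring(TOY_XML)

triples = []
for disorder in toy_root.iter("Disorder"):
    orpha_code = "ORPHA:" + disorder.find("OrphaCode").text
    # Each association element holds one HPO term and its frequency label
    for assoc in disorder.iter("HPODisorderAssociation"):
        hpo_id = assoc.find("HPO/HPOId").text
        freq_label = assoc.find("HPOFrequency/Name").text
        triples.append((orpha_code, hpo_id, freq_label))

for triple in triples:
    print(triple)
```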

In [1]:
import xml.etree.ElementTree as ET
import csv

In [2]:
tree = ET.parse("../data/en_product4.xml")
root = tree.getroot()

In [3]:
# Weight assigned to each Orphanet frequency label
freq_weights = {
    "Obligate (100%)": 1,
    "Very frequent (99-80%)": 4/5,
    "Frequent (79-30%)": 3/5,
    "Occasional (29-5%)": 2/5,
    "Very rare (<4-1%)": 1/5,
    "Excluded (0%)": 0,
}

# Open the TSV file storing the extracted data and the network file for MultiXrank;
# both files are closed automatically at the end of the block
with open('../data/OrphaDisease_HPO_extract.tsv', 'w', newline='') as info_file, \
     open('../data/Disease_Phenotype.tsv', 'w', newline='') as net_file:
    info_writer = csv.writer(info_file, delimiter='\t')
    net_writer = csv.writer(net_file, delimiter='\t')

    # Iterate over diseases and HPOs
    for disorder in root.iter('Disorder'):
        orpha_code = "ORPHA:" + disorder.find('OrphaCode').text
        orpha_name = disorder.find('Name').text
        for hpo, freq in zip(disorder.iter("HPO"), disorder.iter("HPOFrequency")):
            hpo_id = hpo.find("HPOId").text
            hpo_term = hpo.find("HPOTerm").text
            # An unknown label raises a KeyError instead of silently reusing the previous weight
            hpo_freq = freq_weights[freq.find("Name").text]
            info_writer.writerow([orpha_code, orpha_name, hpo_id, hpo_term, hpo_freq])
            net_writer.writerow([orpha_code, hpo_id, hpo_freq])

## 2) Add the HPO ontology

This code completes the previously computed network of Disease-Phenotype associations (`data/Disease_Phenotype.tsv`) with the Phenotype-Phenotype associations contained in the HPO.
We set the weights for those associations as 1/5.

HPO data is downloaded from: https://hpo.jax.org/app/data/ontology

The computed network of Disease-Phenotype and Phenotype-Phenotype associations is stored in file `data/Disease_PhenotypeOntology.tsv`.
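
To make the Phenotype-Phenotype edges concrete, the sketch below pulls child-to-parent `is_a` links out of a tiny hand-written OBO fragment and gives each the weight 1/5. It is a standard-library stand-in for illustration only: the notebook uses `obonet` for this step, and the function `is_a_edges` is hypothetical. It assumes that only `is_a` relations matter and that each term stanza starts with `[Term]`.

```python
# Tiny hand-written fragment in OBO syntax (three real HPO terms).
TOY_OBO = """\
[Term]
id: HP:0000001
name: All

[Term]
id: HP:0000118
name: Phenotypic abnormality
is_a: HP:0000001 ! All

[Term]
id: HP:0000707
name: Abnormality of the nervous system
is_a: HP:0000118 ! Phenotypic abnormality
"""

def is_a_edges(obo_text, weight=1/5):
    """Return (child, parent, weight) tuples for every is_a line (hypothetical helper)."""
    edges = []
    current_id = None
    for line in obo_text.splitlines():
        line = line.strip()
        if line == "[Term]":
            # A new stanza starts: forget the previous term id
            current_id = None
        elif line.startswith("id: "):
            current_id = line[len("id: "):]
        elif line.startswith("is_a: ") and current_id is not None:
            # Drop the trailing "! comment" that repeats the parent name
            parent = line[len("is_a: "):].split(" ! ")[0].strip()
            edges.append((current_id, parent, weight))
    return edges

toy_edges = is_a_edges(TOY_OBO)
for edge in toy_edges:
    print(edge)
```

Each edge points from a term to its parent, which matches the orientation of the `HP:` rows in the combined table printed further down.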

In [4]:
import obonet
import networkx
import pandas as pd

In [5]:
# Read previously computed Disease-HPO associations
dis_hpo_net = pd.read_csv('../data/Disease_Phenotype.tsv', sep = '\t', header=None)
dis_hpo_net

                   0           1    2
0           ORPHA:58  HP:0000256  0.8
1           ORPHA:58  HP:0001249  0.8
2           ORPHA:58  HP:0001250  0.8
3           ORPHA:58  HP:0001257  0.8
4           ORPHA:58  HP:0001274  0.8
...              ...         ...  ...
111760  ORPHA:397596  HP:0011110  0.4
111761  ORPHA:397596  HP:0012758  0.4
111762  ORPHA:397596  HP:0031692  0.4
111763  ORPHA:397596  HP:0031693  0.4


Load the HPO ontology, append its edges to `dis_hpo_net`, and store the result in file `data/Disease_PhenotypeOntology.tsv`.

In [6]:
# Read obo file
graph = obonet.read_obo("../data/hp.obo")

# Extract edges
ontology = networkx.to_pandas_edgelist(graph)

# Add a weight column (all set to 1/5)
ontology["weight"] = 1/5
ontology.columns = [0,1,2]

# Append to dis_hpo_net
full_net = pd.concat([dis_hpo_net, ontology])
print(full_net)

# Write to tsv
full_net.to_csv("../data/Disease_PhenotypeOntology.tsv", sep = '\t', header=None, index=False)

                0           1    2
0        ORPHA:58  HP:0000256  0.8
1        ORPHA:58  HP:0001249  0.8
2        ORPHA:58  HP:0001250  0.8
3        ORPHA:58  HP:0001257  0.8
4        ORPHA:58  HP:0001274  0.8
...           ...         ...  ...
21551  HP:5201010  HP:0000204  0.2
21552  HP:5201011  HP:0100336  0.2
21553  HP:5201012  HP:0100336  0.2
21554  HP:5201013  HP:0100336  0.2
21555  HP:5201014  HP:0100336  0.2

[133321 rows x 3 columns]


Store the correspondence between HPO ids and HPO terms in the TSV file `data/hpo_terms.tsv`.

In [7]:
id_to_name = {id_: data.get('name') for id_, data in graph.nodes(data=True)}

# Write one (HPO id, HPO term) row per ontology node
with open('../data/hpo_terms.tsv', 'w', newline='') as hpo_file:
    hpo_writer = csv.writer(hpo_file, delimiter='\t')
    for hpo, name in id_to_name.items():
        hpo_writer.writerow([hpo, name])

## 3) Build and add Disease-Disease associations

Finally, we add Disease-Disease associations to the network by looking for causative genes shared between diseases in Orphanet. In the final network, two diseases share an edge if they have at least one causative gene in common according to Orphanet. All Disease-Disease associations have weight 1.

The data is available here: https://www.orphadata.com/data/xml/en_product6.xml

The final network of Disease-Phenotype, Phenotype-Phenotype, and Disease-Disease associations is stored in file **`network/multiplex/Orpha/DiseaseDisease_PhenotypeOntology.tsv`**.

In [None]:
tree = ET.parse("../data/en_product6.xml")
root = tree.getroot()

The function below generates a dictionary containing the causal genes for each disease in Orphanet. We store this information in `dico_diseases_genes`.

In [None]:
def generate_dico_diseases_genes(root) -> dict:
    """Function that creates a dictionary of 
    Orphanet diseases and their associated genes

    Args:
        root : the root of the xml tree extracted
        from an xml file describing the gene-diseases
        associations from Orphanet

    Returns:
        dict: a dictionary with the keys being the
        Orphanet code of diseases and the values being
        their associated genes
    """
    dico_diseases_genes = dict()
    # for each Orphanet disease, find its Orphanet code 
    for disorder in root.iter('Disorder'):
        orpha_code = "ORPHA:"+disorder.find('OrphaCode').text
        dico_diseases_genes[orpha_code] = list()
        # then find the genes associated to the disease
        for gda in disorder.iter('DisorderGeneAssociation'):
            for gene in gda.iter('Gene'):
                dico_diseases_genes[orpha_code].append(gene.find('Symbol').text)
    return dico_diseases_genes

dico_diseases_genes = generate_dico_diseases_genes(root=root)

The two functions below compare the causal genes associated with each disease and create a dictionary of diseases sharing at least one causal gene. This information is stored in `dico_diseases_similarity`.

In [None]:
def compare_two_gene_sets(gene_set_1: set, gene_set_2: set) -> bool:
    """Function that compares the composition of two sets

    Args:
        gene_set_1 (set): the first set
        gene_set_2 (set): the second set

    Returns:
        bool: True if there is at least one common element between the
        two sets, False otherwise.
    """
    return len(gene_set_1.intersection(gene_set_2)) > 0

In [None]:
def compare_gene_sets_in_dict(dico_diseases_genes: dict) -> dict:
    """Function that compares the sets of genes between each pair
    of Orphanet diseases and generates a dictionary of diseases that
    are similar if they possess at least one common mutated gene

    Args:
        dico_diseases_genes (dict): dictionary containing as keys
        the Orphanet codes of Orphanet diseases and as values their
        associated genes

    Returns:
        dict: a dictionary with the keys being the Orphanet diseases 
        codes and the values being the list of other Orphanet diseases 
        that are considered similar to the key disease because they
        share at least one mutated gene.
    """
    dico_diseases_similarity = dict()
    # extract all Orphanet diseases
    diseases = list(dico_diseases_genes.keys())
    # compare the associated genes for all pairs of Orphanet diseases
    for i in range(len(diseases)):
        disease_1 = diseases[i]
        dico_diseases_similarity[disease_1] = list()
        for j in range(i + 1, len(diseases)):
            disease_2 = diseases[j]
            list_genes_1 = dico_diseases_genes[disease_1]
            list_genes_2 = dico_diseases_genes[disease_2]
            # if the two diseases possess at least one common mutated gene,
            # we associate them in the dictionary
            if compare_two_gene_sets(set(list_genes_1), set(list_genes_2)):
                dico_diseases_similarity[disease_1] += [disease_2]
    return dico_diseases_similarity

dico_diseases_similarity = compare_gene_sets_in_dict(dico_diseases_genes=dico_diseases_genes)
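
The pairwise comparison above tests every pair of diseases, which grows quadratically with the number of diseases. As a possible alternative, the sketch below first groups diseases by gene and then pairs up only the diseases listed under the same gene. This swaps in a different technique from the notebook's functions: `shared_gene_pairs` and the toy dictionary are hypothetical, and pairs are returned in sorted order rather than in file order.

```python
from collections import defaultdict
from itertools import combinations

def shared_gene_pairs(disease_genes):
    """Return sorted (disease_1, disease_2) pairs sharing at least one gene."""
    # Invert the mapping: gene -> set of diseases associated with it
    gene_to_diseases = defaultdict(set)
    for disease, genes in disease_genes.items():
        for gene in genes:
            gene_to_diseases[gene].add(disease)

    # Diseases listed under the same gene are connected; the set removes
    # pairs that share several genes and would otherwise appear twice
    pairs = set()
    for diseases in gene_to_diseases.values():
        for disease_1, disease_2 in combinations(sorted(diseases), 2):
            pairs.add((disease_1, disease_2))
    return sorted(pairs)

# Made-up diseases and genes, in the same shape as dico_diseases_genes
toy_diseases_genes = {
    "ORPHA:A": ["GENE1", "GENE2"],
    "ORPHA:B": ["GENE2"],
    "ORPHA:C": ["GENE3"],
    "ORPHA:D": ["GENE1", "GENE3"],
    "ORPHA:E": [],
}

toy_pairs = shared_gene_pairs(toy_diseases_genes)
print(toy_pairs)
```

Diseases without any associated gene, such as `ORPHA:E` here, never appear in a pair, which matches the behaviour of the set intersection test.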

The function below creates the disease-disease network. We execute it and store the results in the variable `dis_dis_net`.

In [None]:
def generate_disease_sim_network(dico_diseases_similarity: dict) -> pd.DataFrame:
    """Function that generates the disease-disease similarity network
    from gene-diseases associations of Orphanet. In the resulting 
    network, Orphanet diseases are connected by an edge if they share
    at least one common mutated gene.

    Args:
        dico_diseases_similarity (dict): dictionary containing as keys
        the Orphanet diseases, and as values the list of other Orphanet
        diseases that are considered similar to the key disease because
        they have at least one common associated gene.
        
    Returns:
        pd.DataFrame: the disease-disease similarity network as
        a pandas dataframe
    """
    # collect each disease-disease association once, whatever its orientation
    associations = list()
    seen = set()
    for disease, associated_diseases in dico_diseases_similarity.items():
        for associated_disease in associated_diseases:
            # a frozenset ignores order, so (A, B) and (B, A) count as the same pair
            pair = frozenset((disease, associated_disease))
            if pair not in seen:
                seen.add(pair)
                associations.append((disease, associated_disease))
    # build the network dataframe in one step
    network = pd.DataFrame(associations, columns=["source", "target"])
    return network

dis_dis_net = generate_disease_sim_network(dico_diseases_similarity=dico_diseases_similarity)
print(dis_dis_net)

We load the Disease-Phenotype and Phenotype-Phenotype associations and add the Disease-Disease associations to create the final network, stored in file **`network/multiplex/Orpha/DiseaseDisease_PhenotypeOntology.tsv`**.

In [None]:
# Read previously computed Disease-HPOntology monoplex
dis_hpo_net = pd.read_csv('../data/Disease_PhenotypeOntology.tsv', sep = '\t', header=None)

# Set weight 1 for all disease-disease associations and rename the columns of dis_dis_net so that they match dis_hpo_net
dis_dis_net['weight'] = 1
dis_dis_net.rename(columns={'source': 0, 'target': 1, 'weight': 2}, inplace=True)
# Append the disease-disease associations to dis_hpo_net
full_net = pd.concat([dis_hpo_net, dis_dis_net])
print(full_net)

# Write to tsv
full_net.to_csv("../network/multiplex/Orpha/DiseaseDisease_PhenotypeOntology.tsv", sep = '\t', header=None, index=False)