# Build the Orphanet/HPO layer

The following code extracts information from the Orphanet and Human Phenotype Ontology (HPO) databases:
* **disease-phenotype** associations from Orphanet (weighted according to their reported frequency in Orphanet)
* **disease-disease** associations from Orphanet (diseases sharing a common causal gene)
* **phenotype-phenotype** associations from HPO

Several monoplex networks are built from combinations of these three types of associations:
* a monoplex network with disease-phenotype associations: **`network/multiplex/Orpha/Disease_Phenotype.tsv`**
* a monoplex network with disease-phenotype and phenotype-phenotype associations: **`network/multiplex/Orpha/Disease_PhenotypeOntology.tsv`**
* a monoplex network with disease-phenotype and disease-disease associations: **`network/multiplex/Orpha/DiseaseDisease_Phenotype.tsv`**
* a monoplex network with disease-phenotype, phenotype-phenotype and disease-disease associations: **`network/multiplex/Orpha/DiseaseDisease_PhenotypeOntology.tsv`**

All identifiers are described in the following files:
* **`data/OrphaDisease_HPO_extract.tsv`** contains OrphaCode, disease name, HPO ids and HPO terms
* **`data/hpo_terms.tsv`** contains all HPO ids and terms


## 1) Build disease-phenotype network from Orphanet Data

Download the Orphanet phenotype annotation file from https://www.orphadata.com/data/xml/en_product4.xml and save it as `data/en_product4.xml`.
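
To make the parsing step easier to follow, here is a minimal sketch run on a hand-written XML fragment. The fragment is an assumption: it imitates a simplified layout of `en_product4.xml`, where the element names `Disorder`, `OrphaCode`, `Name`, `HPO`, `HPOId`, `HPOTerm` and `HPOFrequency` come from the code in this section, while the wrapper elements `HPODisorderAssociationList` and `HPODisorderAssociation` are assumed. The helper `extract_annotations` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment imitating a simplified en_product4.xml layout (assumption)
SAMPLE_XML = """
<JDBOR>
  <Disorder>
    <OrphaCode>58</OrphaCode>
    <Name>Alexander disease</Name>
    <HPODisorderAssociationList>
      <HPODisorderAssociation>
        <HPO><HPOId>HP:0000256</HPOId><HPOTerm>Macrocephaly</HPOTerm></HPO>
        <HPOFrequency><Name>Very frequent (99-80%)</Name></HPOFrequency>
      </HPODisorderAssociation>
      <HPODisorderAssociation>
        <HPO><HPOId>HP:0001250</HPOId><HPOTerm>Seizure</HPOTerm></HPO>
        <HPOFrequency><Name>Very frequent (99-80%)</Name></HPOFrequency>
      </HPODisorderAssociation>
    </HPODisorderAssociationList>
  </Disorder>
</JDBOR>
"""


def extract_annotations(root):
    """Return (orpha_code, disease_name, hpo_id, hpo_term, frequency_label) tuples."""
    rows = []
    for disorder in root.iter("Disorder"):
        orpha_code = "ORPHA:" + disorder.find("OrphaCode").text
        orpha_name = disorder.find("Name").text
        # Iterating over the association element keeps each HPO term
        # paired with its own frequency label
        for assoc in disorder.iter("HPODisorderAssociation"):
            hpo = assoc.find("HPO")
            freq = assoc.find("HPOFrequency")
            rows.append((
                orpha_code,
                orpha_name,
                hpo.find("HPOId").text,
                hpo.find("HPOTerm").text,
                freq.find("Name").text,
            ))
    return rows


sample_root = ET.fromstring(SAMPLE_XML)
sample_rows = extract_annotations(sample_root)
for row in sample_rows:
    print(row)
```

Each tuple corresponds to one line of the extraction file, before the frequency label is converted to a numeric weight.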

In [54]:
import xml.etree.ElementTree as ET
import csv

In [55]:
tree = ET.parse("../data/en_product4.xml")
root = tree.getroot()

In [56]:
# Map Orphanet frequency labels to edge weights
freq_weights = {
    "Obligate (100%)": 1,
    "Very frequent (99-80%)": 4/5,
    "Frequent (79-30%)": 3/5,
    "Occasional (29-5%)": 2/5,
    "Very rare (<4-1%)": 1/5,
    "Excluded (0%)": 0,
}

# Open the TSV file storing the extracted data and the network file for MultiXrank.
# The with statement closes both files, even if an error occurs.
with open('../data/OrphaDisease_HPO_extract.tsv', 'w', newline='') as info_file, \
     open('../network/multiplex/Orpha/Disease_Phenotype.tsv', 'w', newline='') as net_file:
    info_writer = csv.writer(info_file, delimiter='\t')
    net_writer = csv.writer(net_file, delimiter='\t')

    # Iterate over disorders and HPOs
    for disorder in root.iter('Disorder'):
        orpha_code = "ORPHA:" + disorder.find('OrphaCode').text
        orpha_name = disorder.find('Name').text
        for hpo, freq in zip(disorder.iter("HPO"), disorder.iter("HPOFrequency")):
            hpo_id = hpo.find("HPOId").text
            hpo_term = hpo.find("HPOTerm").text
            hpo_freq_name = freq.find("Name").text
            # A KeyError here signals a frequency label missing from freq_weights,
            # instead of silently reusing the weight of the previous annotation
            hpo_freq = freq_weights[hpo_freq_name]
            info_writer.writerow([orpha_code, orpha_name, hpo_id, hpo_term, hpo_freq])
            net_writer.writerow([orpha_code, hpo_id, hpo_freq])

## 2) Add HP ontology

Create a new tsv file containing previously extracted Disease-HPO associations AND the full HP ontology.

Download HPO data in obo format from https://hpo.jax.org/app/data/ontology

In [57]:
import obonet
import networkx
import pandas as pd

In [58]:
# Read previously computed Disease-HPO monoplex
dis_hpo_net = pd.read_csv('../network/multiplex/Orpha/Disease_Phenotype.tsv', sep = '\t', header=None)
dis_hpo_net

                   0           1    2
0           ORPHA:58  HP:0000256  0.8
1           ORPHA:58  HP:0001249  0.8
2           ORPHA:58  HP:0001250  0.8
3           ORPHA:58  HP:0001257  0.8
4           ORPHA:58  HP:0001274  0.8
...              ...         ...  ...
111760  ORPHA:397596  HP:0011110  0.4
111761  ORPHA:397596  HP:0012758  0.4
111762  ORPHA:397596  HP:0031692  0.4
111763  ORPHA:397596  HP:0031693  0.4


Load the HPO ontology, append it to `dis_hpo_net`, and store in **`network/multiplex/Orpha/Disease_PhenotypeOntology.tsv`**

In [59]:
# Read obo file
graph = obonet.read_obo("../data/hp.obo")

# Extract edges
ontology = networkx.to_pandas_edgelist(graph)

# Add a weight column (all set to 1)
ontology["weight"] = 1
ontology.columns = [0,1,2]

# Append to dis_hpo_net
full_net = pd.concat([dis_hpo_net, ontology])
print(full_net)

# Write to tsv
full_net.to_csv("../network/multiplex/Orpha/Disease_PhenotypeOntology.tsv", sep = '\t', header=None, index=False)

                0           1    2
0        ORPHA:58  HP:0000256  0.8
1        ORPHA:58  HP:0001249  0.8
2        ORPHA:58  HP:0001250  0.8
3        ORPHA:58  HP:0001257  0.8
4        ORPHA:58  HP:0001274  0.8
...           ...         ...  ...
21551  HP:5201010  HP:0000204  1.0
21552  HP:5201011  HP:0100336  1.0
21553  HP:5201012  HP:0100336  1.0
21554  HP:5201013  HP:0100336  1.0
21555  HP:5201014  HP:0100336  1.0

[133321 rows x 3 columns]
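
As an illustration of what the ontology edge list contains, the sketch below extracts `is_a` links from a hand-written OBO fragment using only the standard library. It is not how `obonet` is implemented; the fragment and the helper `parse_is_a_edges` are assumptions made for this example. Each edge points from the more specific term to its parent, which is, as far as we know, also the direction used by `obonet`.

```python
# Hand-written OBO fragment with three terms (sample data, not the full hp.obo)
SAMPLE_OBO = """\
[Term]
id: HP:0000001
name: All

[Term]
id: HP:0000118
name: Phenotypic abnormality
is_a: HP:0000001 ! All

[Term]
id: HP:0000707
name: Abnormality of the nervous system
is_a: HP:0000118 ! Phenotypic abnormality
"""


def parse_is_a_edges(obo_text):
    """Return (child_id, parent_id, weight) rows for every is_a line."""
    edges = []
    current_id = None
    for line in obo_text.splitlines():
        line = line.strip()
        if line == "[Term]":
            # A new stanza starts: forget the previous term id
            current_id = None
        elif line.startswith("id: "):
            current_id = line[len("id: "):]
        elif line.startswith("is_a: ") and current_id is not None:
            # Drop the trailing "! parent name" comment
            parent_id = line[len("is_a: "):].split("!")[0].strip()
            # All ontology edges get weight 1, as in the cell above
            edges.append((current_id, parent_id, 1))
    return edges


sample_edges = parse_is_a_edges(SAMPLE_OBO)
print(sample_edges)
```

The root term `HP:0000001` has no `is_a` line, so it only appears as a target.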


Store the correspondence between HPO ids and HPO terms in the tsv file **`data/hpo_terms.tsv`**.

In [60]:
id_to_name = {id_: data.get('name') for id_, data in graph.nodes(data=True)}

# The with statement closes the file once all rows are written
with open('../data/hpo_terms.tsv', 'w', newline='') as hpo_file:
    hpo_writer = csv.writer(hpo_file, delimiter='\t')
    for hpo, name in id_to_name.items():
        hpo_writer.writerow([hpo, name])
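
The id-to-term file can later be used to make network edges readable. The following sketch assumes the two-column layout written above and uses in-memory sample rows instead of the real file; `load_id_to_name` is a hypothetical helper.

```python
import csv
import io


def load_id_to_name(handle):
    """Read a two-column TSV (HPO id, HPO term) into a dictionary."""
    return {row[0]: row[1] for row in csv.reader(handle, delimiter="\t") if row}


# In-memory stand-in for data/hpo_terms.tsv (sample rows only)
terms_tsv = io.StringIO("HP:0000256\tMacrocephaly\nHP:0001250\tSeizure\n")
id_to_name_sample = load_id_to_name(terms_tsv)

# Annotate one disease-phenotype edge with the phenotype label
edge = ("ORPHA:58", "HP:0000256", 0.8)
label = id_to_name_sample.get(edge[1], "unknown term")
print(edge[0], edge[1], label, edge[2])
```

With the real file, the `io.StringIO` object would be replaced by a handle opened on `../data/hpo_terms.tsv`.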

## 3) Add Disease-Disease associations

This code builds a disease-disease similarity network that connects two Orphanet diseases if they share at least one causative gene.

The data can be downloaded here: https://www.orphadata.com/data/xml/en_product6.xml

In [61]:
tree = ET.parse("../data/en_product6.xml")
root = tree.getroot()

The function below builds a dictionary mapping each Orphanet disease to its causal genes. We store this information in `dico_diseases_genes`.

In [62]:
def generate_dico_diseases_genes(root) -> dict:
    """Map each ORPHA code to the list of its associated gene symbols."""
    dico_diseases_genes = dict()
    for disorder in root.iter('Disorder'):
        orpha_code = "ORPHA:" + disorder.find('OrphaCode').text
        dico_diseases_genes[orpha_code] = list()
        for gda in disorder.iter('DisorderGeneAssociation'):
            for gene in gda.iter('Gene'):
                dico_diseases_genes[orpha_code].append(gene.find('Symbol').text)
    return dico_diseases_genes

dico_diseases_genes = generate_dico_diseases_genes(root=root)

The two functions below compare the causal genes of each pair of diseases and build a dictionary of diseases sharing at least one causal gene. This information is stored in `dico_diseases_similarity`.

In [63]:
def compare_two_gene_sets(gene_set_1: set, gene_set_2: set) -> bool:
    # True if the two sets have at least one gene in common
    return len(gene_set_1.intersection(gene_set_2)) > 0

In [64]:
def compare_gene_sets_in_dict(dico_diseases_genes: dict) -> dict:
    dico_diseases_similarity = dict()
    diseases = list(dico_diseases_genes.keys())
    # Convert each gene list to a set once, instead of once per comparison
    gene_sets = {disease: set(genes) for disease, genes in dico_diseases_genes.items()}
    for i in range(len(diseases)):
        disease_1 = diseases[i]
        dico_diseases_similarity[disease_1] = list()
        # Only compare with diseases further down the list, so each pair is tested once
        for j in range(i + 1, len(diseases)):
            disease_2 = diseases[j]
            if compare_two_gene_sets(gene_sets[disease_1], gene_sets[disease_2]):
                dico_diseases_similarity[disease_1] += [disease_2]
    return dico_diseases_similarity

dico_diseases_similarity = compare_gene_sets_in_dict(dico_diseases_genes=dico_diseases_genes)
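
The pairwise comparison above tests every pair of diseases. As a possible alternative, the sketch below first groups diseases by gene and then only pairs diseases listed under the same gene. It is an illustration on toy data, not the method used to produce the files in this section; `diseases_sharing_genes` and the sample codes and gene symbols are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def diseases_sharing_genes(dico_diseases_genes):
    """Return the sorted list of disease pairs that share at least one gene."""
    # Invert the mapping: gene symbol -> set of diseases associated with it
    gene_to_diseases = defaultdict(set)
    for disease, genes in dico_diseases_genes.items():
        for gene in genes:
            gene_to_diseases[gene].add(disease)

    # Pair up diseases listed under the same gene; the set removes
    # pairs that are found through more than one shared gene
    pairs = set()
    for diseases in gene_to_diseases.values():
        for disease_1, disease_2 in combinations(sorted(diseases), 2):
            pairs.add((disease_1, disease_2))
    return sorted(pairs)


# Toy input with made-up codes and gene symbols
toy_genes = {
    "ORPHA:A": ["GENE1", "GENE2"],
    "ORPHA:B": ["GENE2"],
    "ORPHA:C": ["GENE3"],
    "ORPHA:D": ["GENE1", "GENE2"],
}
toy_pairs = diseases_sharing_genes(toy_genes)
print(toy_pairs)
```

Here `ORPHA:A` and `ORPHA:D` share two genes but are reported once, and `ORPHA:C` shares no gene with any other disease, so it does not appear in the result.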

The function below creates the disease-disease network. We execute it and store the results in the variable `dis_dis_net`.

In [65]:
def generate_disease_sim_network(dico_diseases_similarity: dict) -> pd.DataFrame:
    associations = list()
    seen = set()
    for disease, associated_diseases in dico_diseases_similarity.items():
        for associated_disease in associated_diseases:
            # Keep each undirected pair once, whatever the order of its two diseases
            pair = frozenset((disease, associated_disease))
            if pair not in seen:
                seen.add(pair)
                associations.append((disease, associated_disease))
    # Build the DataFrame in one step instead of setting values cell by cell
    return pd.DataFrame(associations, columns=["source", "target"])

dis_dis_net = generate_disease_sim_network(dico_diseases_similarity=dico_diseases_similarity)
print(dis_dis_net)

            source        target
0     ORPHA:166024      ORPHA:36
1     ORPHA:166024    ORPHA:2189
2     ORPHA:166024    ORPHA:2754
3     ORPHA:166063    ORPHA:2524
4     ORPHA:166078  ORPHA:166084
...            ...           ...
7980  ORPHA:631076  ORPHA:641353
7981  ORPHA:631106  ORPHA:619367
7982  ORPHA:636970  ORPHA:636965
7983  ORPHA:642976  ORPHA:642945
7984  ORPHA:633021  ORPHA:633024

[7985 rows x 2 columns]


Now we can load the previously computed Disease-HPO monoplex and add the Disease-Disease associations to it. The result is stored in **`network/multiplex/Orpha/DiseaseDisease_Phenotype.tsv`**.

In [66]:
# Read previously computed Disease-HPO monoplex
dis_hpo_net = pd.read_csv('../network/multiplex/Orpha/Disease_Phenotype.tsv', sep = '\t', header=None)
dis_hpo_net

                   0           1    2
0           ORPHA:58  HP:0000256  0.8
1           ORPHA:58  HP:0001249  0.8
2           ORPHA:58  HP:0001250  0.8
3           ORPHA:58  HP:0001257  0.8
4           ORPHA:58  HP:0001274  0.8
...              ...         ...  ...
111760  ORPHA:397596  HP:0011110  0.4
111761  ORPHA:397596  HP:0012758  0.4
111762  ORPHA:397596  HP:0031692  0.4
111763  ORPHA:397596  HP:0031693  0.4


In [67]:
# Set weight 1 for all disease-disease associations and rename the columns of dis_dis_net so that they match dis_hpo_net
dis_dis_net['weight'] = 1
dis_dis_net.rename(columns={'source': 0, 'target': 1, 'weight': 2}, inplace=True)

# Append disease-disease associations to dis_hpo_net
full_net = pd.concat([dis_hpo_net, dis_dis_net])
print(full_net)

# Write to tsv
full_net.to_csv("../network/multiplex/Orpha/DiseaseDisease_Phenotype.tsv", sep = '\t', header=None, index=False)

                 0             1    2
0         ORPHA:58    HP:0000256  0.8
1         ORPHA:58    HP:0001249  0.8
2         ORPHA:58    HP:0001250  0.8
3         ORPHA:58    HP:0001257  0.8
4         ORPHA:58    HP:0001274  0.8
...            ...           ...  ...
7980  ORPHA:631076  ORPHA:641353  1.0
7981  ORPHA:631106  ORPHA:619367  1.0
7982  ORPHA:636970  ORPHA:636965  1.0
7983  ORPHA:642976  ORPHA:642945  1.0
7984  ORPHA:633021  ORPHA:633024  1.0

[119750 rows x 3 columns]


We do the same to create the network combining Disease-Disease, Disease-Phenotype and Phenotype-Phenotype associations, which we store in **`network/multiplex/Orpha/DiseaseDisease_PhenotypeOntology.tsv`**.

In [68]:
# Read previously computed Disease-PhenotypeOntology monoplex
dis_hpo_net = pd.read_csv('../network/multiplex/Orpha/Disease_PhenotypeOntology.tsv', sep = '\t', header=None)
dis_hpo_net

                 0           1    2
0         ORPHA:58  HP:0000256  0.8
1         ORPHA:58  HP:0001249  0.8
2         ORPHA:58  HP:0001250  0.8
3         ORPHA:58  HP:0001257  0.8
4         ORPHA:58  HP:0001274  0.8
...            ...         ...  ...
133316  HP:5201010  HP:0000204  1.0
133317  HP:5201011  HP:0100336  1.0
133318  HP:5201012  HP:0100336  1.0
133319  HP:5201013  HP:0100336  1.0


In [69]:
# Append disease-disease associations to dis_hpo_net
full_net = pd.concat([dis_hpo_net, dis_dis_net])
print(full_net)

# Write to tsv
full_net.to_csv("../network/multiplex/Orpha/DiseaseDisease_PhenotypeOntology.tsv", sep = '\t', header=None, index=False)

                 0             1    2
0         ORPHA:58    HP:0000256  0.8
1         ORPHA:58    HP:0001249  0.8
2         ORPHA:58    HP:0001250  0.8
3         ORPHA:58    HP:0001257  0.8
4         ORPHA:58    HP:0001274  0.8
...            ...           ...  ...
7980  ORPHA:631076  ORPHA:641353  1.0
7981  ORPHA:631106  ORPHA:619367  1.0
7982  ORPHA:636970  ORPHA:636965  1.0
7983  ORPHA:642976  ORPHA:642945  1.0
7984  ORPHA:633021  ORPHA:633024  1.0

[141306 rows x 3 columns]
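
As a final sanity check, the edges of a combined network can be counted by type using the identifier prefixes (`ORPHA:` for diseases, `HP:` for phenotypes). The sketch below runs on a few in-memory sample rows; `edge_type` is a hypothetical helper, and with the real data one would pass a handle opened on `DiseaseDisease_PhenotypeOntology.tsv` to `csv.reader` instead.

```python
import csv
import io
from collections import Counter


def edge_type(source, target):
    """Classify an edge from the prefixes of its two node identifiers."""
    kinds = sorted(
        "disease" if node.startswith("ORPHA:") else "phenotype"
        for node in (source, target)
    )
    return "-".join(kinds)


# In-memory stand-in for a combined network file (sample rows only)
sample_tsv = io.StringIO(
    "ORPHA:58\tHP:0000256\t0.8\n"
    "HP:5201010\tHP:0000204\t1.0\n"
    "ORPHA:631076\tORPHA:641353\t1.0\n"
)
edge_counts = Counter(
    edge_type(row[0], row[1]) for row in csv.reader(sample_tsv, delimiter="\t")
)
print(edge_counts)
```

Going by the row counts printed in this section, the full file would be expected to contain 111765 disease-phenotype, 21556 phenotype-phenotype and 7985 disease-disease edges, which add up to the 141306 rows reported above.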
