In [1]:
import pandas as pd
import numpy as np
from scipy.spatial import distance
import requests
import json

pd.set_option('display.float_format', lambda x: '%.3f' % x)

## An example of a method to fix disease terms missing MeSH codes

One problem found when generating the health knowledge graph datasets is that some of the disease terms (names) could not be matched to a MeSH code, or were matched to the wrong MeSH code.  

For the project these were fixed by hand. To fix these terms in an automated way, the machine would need to understand what each disease name "means".  

One proposed method is to represent each disease as a vector of its symptoms, and then calculate its similarity to the other diseases in the dataset.
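
As a minimal sketch of the idea, using made-up symptom weights rather than values from the dataset, two diseases with the same symptom profile have a cosine distance of 0, while diseases that share no symptoms have a distance of 1. The symptom axes and weights below are hypothetical.

```python
import numpy as np
from scipy.spatial import distance

# Hypothetical symptom axes: [fever, cough, insomnia]
disease_a = np.array([2.0, 1.0, 0.0])
disease_b = np.array([4.0, 2.0, 0.0])  # same profile as disease_a, larger weights
disease_c = np.array([0.0, 0.0, 3.0])  # no symptoms in common with disease_a

# Cosine distance = 1 - cosine similarity
dist_ab = distance.cosine(disease_a, disease_b)  # approximately 0
dist_ac = distance.cosine(disease_a, disease_c)  # approximately 1
```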

## Representing the diseases in the PubMed dataset as vectors

Importing both the original full PubMed dataset, and the diseases that failed to be mapped to a MeSH code by the API.

In [2]:
pm_data = pd.read_csv("data/pubmed_data.csv", sep='\t')

In [4]:
unmatched = pd.read_csv("code_maps/terms_to_mesh_pm_fail.csv")["MeSH Term"]

Creating a dataframe that will store each disease, represented as a vector of its symptoms, with the value taken from the weight between disease and symptom.

In [5]:
symptom_vectors_pm = pd.DataFrame(pm_data.sort_values("MeSH Symptom Term")["MeSH Symptom Term"].unique(), columns=["MeSH Symptom Term"]).set_index("MeSH Symptom Term")

In [6]:
symptom_vectors_pm

"Abdomen, Acute"
Abdominal Pain
Acute Coronary Syndrome
Aerophagy
Ageusia
...
Vomiting
"Vomiting, Anticipatory"
Waterhouse-Friderichsen Syndrome
Weight Gain
Weight Loss


Taking each disease in the dataset by itself, and representing it as a vector:

In [7]:
disease_groups_pm = pm_data.sort_values("MeSH Disease Term").groupby("MeSH Disease Term")

In [8]:
symptom_vectors_pm = pd.concat([symptom_vectors_pm] 
                            + [disease_groups_pm.get_group(disease).set_index("MeSH Symptom Term")["TFIDF score"] for disease, _ in disease_groups_pm], axis=1)

Forming the vectors dataset, and filling in unconnected diseases and symptoms with 0:

In [9]:
symptom_vectors_pm.columns = list(disease_groups_pm.groups.keys())

In [10]:
symptom_vectors_pm = symptom_vectors_pm.fillna(0)
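
The concatenation and `fillna` steps above might also be expressed with `pivot_table`. This is only a sketch on a toy frame with invented scores, not the notebook's method. It assumes the same column names as `pm_data`, sums any duplicate disease/symptom pairs, and produces diseases as rows directly (the orientation that transposing the table gives).

```python
import pandas as pd

# Toy stand-in for pm_data; the scores are made up for illustration
toy = pd.DataFrame({
    "MeSH Disease Term": ["Disease A", "Disease A", "Disease B"],
    "MeSH Symptom Term": ["Fever", "Cough", "Fever"],
    "TFIDF score": [1.5, 0.5, 2.0],
})

# One row per disease, one column per symptom, 0 where no association exists
toy_vectors = toy.pivot_table(
    index="MeSH Disease Term",
    columns="MeSH Symptom Term",
    values="TFIDF score",
    aggfunc="sum",
    fill_value=0,
)
```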

Each of the 4219 diseases is represented as a row vector of length 322 (one value for each symptom).

In [11]:
symptom_vectors_pm = symptom_vectors_pm.transpose()
symptom_vectors_pm

MeSH Symptom Term,"Abdomen, Acute",Abdominal Pain,Acute Coronary Syndrome,Aerophagy,Ageusia,"Aging, Premature",Agnosia,Agraphia,"Akathisia, Drug-Induced",Albuminuria,...,Virilism,Vision Disorders,"Vision, Low",Vocal Cord Paralysis,Voice Disorders,Vomiting,"Vomiting, Anticipatory",Waterhouse-Friderichsen Syndrome,Weight Gain,Weight Loss
22q11 Deletion Syndrome,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,...,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000
"46, XX Disorders of Sex Development",0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,...,2.227,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000
"46, XY Disorders of Sex Development",0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,...,4.454,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000
"ACTH Syndrome, Ectopic",0.000,0.848,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,...,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,1.487,1.000
ACTH-Secreting Pituitary Adenoma,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,...,0.000,0.663,0.000,0.000,0.000,0.000,0.000,0.000,1.487,0.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
beta-Mannosidosis,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,...,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000
beta-Thalassemia,2.875,2.545,2.598,0.000,0.000,0.000,0.000,0.000,0.000,6.195,...,0.000,2.654,2.256,0.000,0.000,0.000,0.000,0.000,0.000,0.000
von Hippel-Lindau Disease,0.000,0.848,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,...,0.000,8.625,4.512,0.000,0.000,0.000,0.000,0.000,0.000,1.000
"von Willebrand Disease, Type 2",0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,...,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000


## Example: matching a disease from the PubMed dataset to other diseases

Defining a function to calculate the distances (lower means more similar) between a given disease that could not be mapped to an appropriate MeSH code and all of the other diseases in the given dataset.

In [12]:
def disease_distance(unmatched_disease, symptom_vectors, metric):
    distances = distance.cdist([symptom_vectors.loc[unmatched_disease, :]], symptom_vectors, metric=metric)[0]
    distances_df = pd.DataFrame(distances, columns=[unmatched_disease]).set_index(symptom_vectors.index).sort_values(unmatched_disease)
    return distances_df

In [13]:
disease_distance(unmatched[10], symptom_vectors_pm, metric="cosine").head(10)

Unnamed: 0,Sleep Disorders
Sleep Disorders,0.0
Night Terrors,0.022
"Hypersomnolence, Idiopathic",0.096
Somnambulism,0.103
Aspartylglucosaminuria,0.129
Restless Legs Syndrome,0.133
Chronobiology Disorders,0.149
Disorders of Excessive Somnolence,0.156
Nocturnal Myoclonus Syndrome,0.176
"Epilepsy, Frontal Lobe",0.255


The term "Sleep Disorders", which the MeSH API could not identify, is matched closely to related terms in the PubMed dataset, such as "Night Terrors" and "Hypersomnolence, Idiopathic". This information could be used to help understand the type of disease the term represents and thus place it appropriately in the knowledge graph.

In [14]:
disease_distance(unmatched[42], symptom_vectors_pm, metric="cosine").head(10)

Unnamed: 0,Eating Disorders
Eating Disorders,0.0
Compulsive Personality Disorder,0.244
Bulimia Nervosa,0.271
Anorexia Nervosa,0.281
Kleine-Levin Syndrome,0.313
Tooth Erosion,0.319
Impulse Control Disorders,0.398
Tooth Wear,0.405
Feeding and Eating Disorders of Childhood,0.411
Dependent Personality Disorder,0.414


However, the term "Eating Disorders" matches "Bulimia Nervosa" and "Anorexia Nervosa" only as the second and third most similar terms. The most similar disease, "Compulsive Personality Disorder", would not be an appropriate replacement for "Eating Disorders".

In [15]:
disease_distance(unmatched[42], symptom_vectors_pm, metric="cityblock").head(10)

Unnamed: 0,Eating Disorders
Eating Disorders,0.0
Malnutrition,4686.084
Bulimia Nervosa,4688.087
Insulin Resistance,4693.421
Starvation,4745.361
Prader-Willi Syndrome,4774.456
Glucose Intolerance,4775.955
Protein-Energy Malnutrition,4810.812
Hyperinsulinism,4831.158
Metabolic Diseases,4836.264


Using a different distance metric (here "cityblock", i.e. Manhattan distance) retrieves more appropriate matches for this particular disease.

Therefore, for this approach to become a viable automated method for fixing disease terms that are missing MeSH codes, the methodology, and in particular the choice of distance metric, would need to be refined.
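
One reason the two metrics can rank neighbours differently is that cosine distance compares only the direction of the vectors, whereas cityblock distance also depends on the magnitude of the weights. The sketch below illustrates this with hypothetical vectors; it is not a claim about why the specific results above differ.

```python
import numpy as np
from scipy.spatial import distance

# Hypothetical TF-IDF vectors over three symptoms
query = np.array([[10.0, 10.0, 0.0]])
candidates = np.array([
    [1.0, 1.0, 0.0],  # same direction as the query, much smaller weights
    [9.0, 8.0, 3.0],  # similar magnitudes, slightly different profile
])

cosine_d = distance.cdist(query, candidates, metric="cosine")[0]
city_d = distance.cdist(query, candidates, metric="cityblock")[0]

# Cosine prefers the first candidate, cityblock prefers the second
closest_by_cosine = int(cosine_d.argmin())
closest_by_cityblock = int(city_d.argmin())
```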

## Representing the diseases from the DBpedia dataset as vectors

Importing the dataset as retrieved from DBpedia, before the fixes to the diseases/symptoms have been applied:

In [17]:
unfixed_db_data_codes = pd.read_csv("data/dbpedia_data_unfixed.csv")
unfixed_db_data_codes["disease"] = unfixed_db_data_codes["disease"].str.strip("<>").str.replace("_", " ")
unfixed_db_data_codes["symptom"] = unfixed_db_data_codes["symptom"].str.strip("<>").str.replace("_", " ")
unfixed_db_data = unfixed_db_data_codes.copy()[["disease", "symptom"]]

In [18]:
unfixed_db_data = unfixed_db_data.drop_duplicates().reset_index(drop=True)

The associations between disease and symptom carry no particular weight in this dataset, so each is assigned a weight of 1.

In [19]:
unfixed_db_data["weight"] = 1

In [20]:
unfixed_db_data 

Unnamed: 0,disease,symptom,weight
0,Abscess,Erythema,1
1,Achondroplasia,Macrocephaly,1
2,Acne,Scar,1
3,Acrocallosal syndrome,Psychomotor retardation,1
4,Acrocallosal syndrome,Polydactyly,1
...,...,...,...
1164,Yellow fever,Jaundice,1
1165,Yellow fever,Chills,1
1166,Yellow fever,Myalgia,1
1167,Yellow fever,Headache,1


Creating a dataframe that will store each disease, represented as a vector of its symptoms, with the value taken from the weight between disease and symptom.

In [21]:
symptom_vectors_db = pd.DataFrame(unfixed_db_data.sort_values("symptom")["symptom"].unique(), columns=["symptom"]).set_index("symptom")

In [22]:
symptom_vectors_db

Abdominal pain
Abscess
Acanthosis nigricans
Acne
Acromegaly
...
Ventricular tachycardia
Vertigo
Vomiting
Weakness
Xerostomia


Taking each disease in the dataset by itself, and representing it as a vector:

In [23]:
disease_groups_db = unfixed_db_data.sort_values("disease").groupby("disease")

In [24]:
symptom_vectors_db = pd.concat([symptom_vectors_db] 
                            + [disease_groups_db.get_group(disease).set_index("symptom")["weight"] for disease, _ in disease_groups_db], axis=1)

Forming the vectors dataset, and filling in unconnected diseases and symptoms with 0:

In [25]:
symptom_vectors_db.columns = list(disease_groups_db.groups.keys())

In [26]:
symptom_vectors_db = symptom_vectors_db.fillna(0)

Each of the 473 diseases is represented as a row vector of length 238 (one value for each symptom).

In [27]:
symptom_vectors_db = symptom_vectors_db.transpose()
symptom_vectors_db

symptom,Abdominal pain,Abscess,Acanthosis nigricans,Acne,Acromegaly,Acute liver failure,Albuminuria,Allergic conjunctivitis,Amenorrhea,Amnesia,...,Urinary incontinence,Urinary retention,Urinary tract infection,Vaginal discharge,Vasculitis,Ventricular tachycardia,Vertigo,Vomiting,Weakness,Xerostomia
Abscess,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,...,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000
Achondroplasia,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,...,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000
Acne,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,...,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000
Acrocallosal syndrome,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,...,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000
Acute bronchitis,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,...,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Williams syndrome,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,...,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000
X-linked lymphoproliferative disease,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,...,0.000,0.000,0.000,0.000,1.000,0.000,0.000,0.000,0.000,0.000
XYY syndrome,0.000,0.000,0.000,1.000,0.000,0.000,0.000,0.000,0.000,0.000,...,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000
Xeroderma pigmentosum,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,...,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000


## Example: matching a disease from the DBpedia dataset to other diseases

As the values in the vector are either 0 or 1, the Jaccard distance is used. As diseases in the DBpedia dataset often have few associated symptoms, this metric emphasises the symptoms that two diseases have in common, and ignores the many symptoms that neither of them has.
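
A small worked example of this property, on hypothetical binary vectors: the Jaccard distance counts only symptoms present in at least one of the two diseases, whereas the Hamming distance (shown for contrast) also counts positions where both are 0 as agreement.

```python
import numpy as np
from scipy.spatial import distance

# Hypothetical symptom axes: [dizziness, nausea, headache, fever]
a = np.array([1, 1, 0, 0], dtype=bool)
b = np.array([1, 0, 0, 0], dtype=bool)

# 1 shared symptom out of 2 present in either disease: 1 - 1/2
jaccard_d = distance.jaccard(a, b)

# 1 mismatching position out of 4: shared absences make the pair look closer
hamming_d = distance.hamming(a, b)
```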

In the original dataset, the term "lightheadedness" had been misapplied and had to be manually fixed. By calculating similarity measures, we can successfully match it to other similar conditions, such as vertigo and dizziness.

In [28]:
disease_distance("Lightheadedness", symptom_vectors_db, metric="jaccard").head(5)

Unnamed: 0,Lightheadedness
Benign paroxysmal positional vertigo,0.0
Lightheadedness,0.0
Dizziness,0.5
Orthostatic hypotension,0.5
Air embolism,0.75


However, this method does not always work. Because many diseases in the DBpedia dataset are associated with so few symptoms, different types of diseases cannot be separated as well as in the PubMed dataset.

"Panic attack" is as close to "Idiopathic pulmonary fibrosis" as it is to "Panic disorder".

In [29]:
disease_distance("Panic attack", symptom_vectors_db, metric="jaccard").head(5)

Unnamed: 0,Panic attack
Panic attack,0.0
Idiopathic pulmonary fibrosis,0.5
Ascites,0.5
Mesothelioma,0.5
Panic disorder,0.5


For datasets like that from DBpedia, additional methods may be required to match up the diseases.

### Fixing the MeSH codes

Given the disease similarity results, one way to "fix" the data is by simply assigning an unmatched term the same MeSH code as another disease most similar to it.

For example, "panic attack" could be given the same MeSH code as "panic disorder", which was discovered to be one of the most similar diseases to "panic attack". This approach was previously done manually, but here it could be automated.

In [30]:
unfixed_db_data_codes[unfixed_db_data_codes["disease"]=="Panic disorder"][["disease", "disease_mesh"]]

Unnamed: 0,disease,disease_mesh
850,Panic disorder,D016584
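
A minimal sketch of how this assignment might be automated is shown below. `nearest_code` is a hypothetical helper, not part of the project code; it assumes a distance table sorted in ascending order (as returned by `disease_distance`) and a mapping from disease name to MeSH code. The toy distances are invented; only the "Panic disorder" code comes from the output above.

```python
import pandas as pd

def nearest_code(distances_df, codes, term):
    """Return (neighbour, code) for the closest disease other than `term`
    that has a known code; `codes` maps disease name -> MeSH code."""
    # Drop the query term itself, which always has distance 0
    ranked = distances_df.drop(index=term, errors="ignore")
    for neighbour in ranked.index:
        code = codes.get(neighbour)
        if pd.notna(code):
            return neighbour, code
    return None, None

# Toy, already-sorted distances (hypothetical values)
toy_distances = pd.DataFrame(
    {"Panic attack": [0.0, 0.5, 0.75]},
    index=["Panic attack", "Panic disorder", "Some other disease"],
)
toy_codes = {"Panic disorder": "D016584"}

neighbour, code = nearest_code(toy_distances, toy_codes, "Panic attack")
```

Note that this sketch does not resolve ties: when several diseases share the same distance, it simply takes whichever appears first in the sorted table.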


The previous method presumes that very similar diseases are essentially equivalent. However, this may not always be the case. In the PubMed dataset, "Eating Disorders" falls under the same category as its similar diseases "Bulimia Nervosa" and "Anorexia Nervosa", but the three terms are not equivalent.

Therefore a different approach would be to find the "parent tree" category of the most similar disease. This more general category should then also contain the unmatched term.

In [31]:
api = "https://id.nlm.nih.gov/mesh/lookup/descriptor"
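# Look up the descriptor and keep only its ID (the last path segment of the resource URI)
disease_mesh = requests.get(api, params={"label": "Bulimia Nervosa", "match": "exact", "limit": 1}).json()[0]["resource"].rsplit("/", 1)[-1]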

Using a SPARQL query to request the name of the parent tree:

In [32]:
api = "https://id.nlm.nih.gov/mesh/sparql"

query = """PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
PREFIX mesh: <http://id.nlm.nih.gov/mesh/>

SELECT ?diseaseLabel ?treeNum ?ancestorTreeNum ?ancestorTreeLabel
FROM <http://id.nlm.nih.gov/mesh>

WHERE {
mesh:D052018 rdfs:label ?diseaseLabel .
mesh:D052018 meshv:treeNumber ?treeNum .
?treeNum meshv:parentTreeNumber ?ancestorTreeNum .
?meshCode meshv:treeNumber ?ancestorTreeNum  .
?meshCode rdfs:label ?ancestorTreeLabel
}"""
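
The query above hard-codes the descriptor ID `D052018`. A sketch of building the same query for any descriptor ID follows. `parent_tree_query` is a hypothetical helper, and the check that IDs consist of `D` followed by digits is an assumption made here to avoid interpolating arbitrary text into the query.

```python
import re

def parent_tree_query(descriptor_id):
    """Build the parent-tree SPARQL query for one MeSH descriptor ID."""
    # Assumption: descriptor IDs look like "D" followed by digits, e.g. "D052018"
    if not re.fullmatch(r"D\d+", descriptor_id):
        raise ValueError(f"unexpected descriptor ID: {descriptor_id!r}")
    return f"""PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
PREFIX mesh: <http://id.nlm.nih.gov/mesh/>

SELECT ?diseaseLabel ?treeNum ?ancestorTreeNum ?ancestorTreeLabel
FROM <http://id.nlm.nih.gov/mesh>

WHERE {{
mesh:{descriptor_id} rdfs:label ?diseaseLabel .
mesh:{descriptor_id} meshv:treeNumber ?treeNum .
?treeNum meshv:parentTreeNumber ?ancestorTreeNum .
?meshCode meshv:treeNumber ?ancestorTreeNum .
?meshCode rdfs:label ?ancestorTreeLabel
}}"""

example_query = parent_tree_query("D052018")
```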

In [33]:
response = requests.get(api, params={"query":query, "format":"JSON", "inference":True}).json()

In [34]:
pd.DataFrame(response["results"]["bindings"][0]).loc["value", :]

diseaseLabel                                Bulimia Nervosa
treeNum              http://id.nlm.nih.gov/mesh/F03.400.250
ancestorTreeNum          http://id.nlm.nih.gov/mesh/F03.400
ancestorTreeLabel              Feeding and Eating Disorders
Name: value, dtype: object

"Bulimia Nervosa" falls under the tree heading "Feeding and Eating Disorders", which has the MeSH tree number "F03.400".

The term "Eating Disorders" naturally falls under this category, so it could also be assigned the tree number "F03.400" as its identifier in the graph.
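
The response parsing above reads only the first binding. Since a MeSH descriptor can have more than one tree number, the query may return several rows, so a sketch that keeps all of them is given here. `bindings_to_frame` is a hypothetical helper, and `sample_response` is a hand-written stand-in shaped like a SPARQL JSON result, with its values copied from the output above.

```python
import pandas as pd

# Hand-written sample in the SPARQL JSON results layout
sample_response = {
    "results": {
        "bindings": [
            {
                "diseaseLabel": {"type": "literal", "value": "Bulimia Nervosa"},
                "treeNum": {"type": "uri", "value": "http://id.nlm.nih.gov/mesh/F03.400.250"},
                "ancestorTreeNum": {"type": "uri", "value": "http://id.nlm.nih.gov/mesh/F03.400"},
                "ancestorTreeLabel": {"type": "literal", "value": "Feeding and Eating Disorders"},
            }
        ]
    }
}

def bindings_to_frame(response):
    """Flatten every binding to one row of plain values."""
    rows = [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in response["results"]["bindings"]
    ]
    return pd.DataFrame(rows)

parents = bindings_to_frame(sample_response)
# Keep only the tree number itself, i.e. the last path segment of the URI
parents["ancestorTreeNum"] = parents["ancestorTreeNum"].str.rsplit("/", n=1).str[-1]
```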

## References

Zhou, X., Menche, J., Barabási, A. L., & Sharma, A. (2014). Human symptoms–disease network. Nature communications, 5(1), 1-10. Dataset released under Creative Commons Attribution v4.0 International licence. https://static-content.springer.com/esm/art%3A10.1038%2Fncomms5212/MediaObjects/41467_2014_BFncomms5212_MOESM1045_ESM.txt  
DBpedia dataset (https://dbpedia.org/) released under Creative Commons Attribution-ShareAlike 3.0  

MeSH RDF SPARQL API: https://id.nlm.nih.gov/mesh/swagger/ui  
MeSH RDF API documentation: https://hhs.github.io/meshrdf/sparql-and-uri-requests