In [2]:
import pandas as pd
import numpy as np

### Importing the data

In [3]:
# PubMed data: keep only the columns needed and rename them to match the other datasets
pm_data = pd.read_csv("data/pubmed_data_final.csv")
pm_data = pm_data[["MeSH Disease Term", "MeSH Symptom Term", "UMLS Disease Code", "UMLS Symptom Code", "TFIDF score"]]
pm_data.columns = ["disease", "symptom", "disease_umls", "symptom_umls", "weight"]

db_data = pd.read_csv("data/dbpedia_data.csv")[["disease", "symptom", "disease_umls", "symptom_umls"]]
h_data = pd.read_csv("data/hospital_data.csv")
c_data = pd.read_csv("data/combined_data.csv")

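Building the set of unique disease/symptom associations for each dataset. Each association is stored as a frozenset of the disease and symptom UMLS codes, so duplicate rows collapse and the sets can be compared across datasets.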

In [4]:
pm_ass = {frozenset(entry) for entry in zip(pm_data["disease_umls"], pm_data["symptom_umls"])}
db_ass = {frozenset(entry) for entry in zip(db_data["disease_umls"], db_data["symptom_umls"])}
h_ass = {frozenset(entry) for entry in zip(h_data["disease_umls"], h_data["symptom_umls"])}
c_ass = {frozenset(entry) for entry in zip(c_data["disease_umls"], c_data["symptom_umls"])}
all_ass = [pm_ass, db_ass, h_ass, c_ass]
all_ass_names = ["PubMed", "DBPedia", "Hospital", "Combined"]
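
As a minimal sketch of the set construction above, the same comprehension can be run on toy data. The codes below are made-up placeholders rather than real UMLS identifiers, and plain lists stand in for the dataframe columns.

```python
# Toy columns; "D1", "S1" etc. are hypothetical placeholder codes.
disease_codes = ["D1", "D1", "D2", "D1"]
symptom_codes = ["S1", "S2", "S1", "S1"]  # the last row duplicates the first

toy_ass = {frozenset(entry) for entry in zip(disease_codes, symptom_codes)}

# Duplicate rows collapse into a single association.
print(len(toy_ass))

# A frozenset ignores element order, so the pair matches either way round.
print(frozenset(("S1", "D1")) in toy_ass)
```

One consequence of using frozensets is that a row whose disease and symptom codes happen to be identical would collapse to a single-element set.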

### How many associations are shared between the datasets?

Finding the intersection of the associations between each dataset:

In [100]:
shared_ass = pd.DataFrame(columns=["PubMed", "DBPedia", "Hospital", "Combined"], index=["PubMed", "DBPedia", "Hospital", "Combined"])

In [104]:
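# Count the associations shared by every pair of distinct datasets.
# Diagonal entries (a dataset compared with itself) are left as NaN.
for base_set, base_name in zip(all_ass, all_ass_names):
    for comparison_set, comparison_name in zip(all_ass, all_ass_names):
        if base_name != comparison_name:
            shared_ass.at[comparison_name, base_name] = len(base_set.intersection(comparison_set))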

In [105]:
shared_ass

| | PubMed | DBPedia | Hospital | Combined |
|---|---|---|---|---|
| PubMed | | 663 | 190 | 35430 |
| DBPedia | 663 | | 39 | 1158 |
| Hospital | 190 | 39 | | 1309 |
| Combined | 35430 | 1158 | 1309 | |


The "combined" dataset is dominated by associations taken from the PubMed dataset, due to its much larger size.  
PubMed shares many more associations with DBPedia (663) than with Hospital (190), even though the Hospital dataset is larger than the DBPedia dataset.

### Looking at precision and recall

Creating functions to calculate recall and precision values to compare the datasets.

True positive: association exists in both datasets (set intersection).  
False negative: association exists in the base dataset but not in the comparison dataset (set difference: base set minus comparison set).  
False positive: association exists in the comparison dataset but not in the base dataset (set difference: comparison set minus base set).  

In [158]:
def recall(base, comparison):
    # true positive / (true positive + false negative)
    tp = len(base.intersection(comparison))
    fn = len(base - comparison)
    return tp / (tp + fn)

In [159]:
def precision(base, comparison):
    # true positive / (true positive + false positive)
    tp = len(base.intersection(comparison))
    fp = len(comparison - base)
    return tp / (tp + fp)
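
To make the definitions above concrete, here is a small self-contained sketch on toy sets. The set contents are hypothetical labels, and the two functions are repeated only so that the block runs on its own.

```python
def recall(base, comparison):
    # true positive / (true positive + false negative)
    tp = len(base.intersection(comparison))
    fn = len(base - comparison)
    return tp / (tp + fn)

def precision(base, comparison):
    # true positive / (true positive + false positive)
    tp = len(base.intersection(comparison))
    fp = len(comparison - base)
    return tp / (tp + fp)

# Toy association sets: two shared elements ("c" and "d").
base = {"a", "b", "c", "d"}
comparison = {"c", "d", "e"}

print(recall(base, comparison))     # 2 shared out of 4 in base
print(precision(base, comparison))  # 2 shared out of 3 in comparison

# Swapping the roles of the two sets turns precision into recall.
print(precision(base, comparison) == recall(comparison, base))
```

Both functions divide by zero when the relevant set is empty, so a guard would be needed if empty datasets were possible.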

Setting up a dataframe to store the results, where each column is a base dataset and each index row is a comparison dataset.

In [163]:
recall_prec_df = pd.DataFrame(columns=["PubMed", "DBPedia", "Hospital", "Combined"], index=["PubMed", "DBPedia", "Hospital", "Combined"])

Calculating recall values between all pairs of datasets (the precision values are derived from the same table further down):

In [164]:
for base_set, base_name in zip(all_ass, all_ass_names):
    for comparison_set, comparison_name in zip(all_ass, all_ass_names):
        if base_name != comparison_name:
            # Store recall as a percentage string rounded to two decimals, e.g. "57.25%"
            pct = np.round(recall(base_set, comparison_set) * 100, 2)
            recall_prec_df.at[comparison_name, base_name] = str(pct) + "%"

In [165]:
recall_prec_df.columns = pd.MultiIndex.from_product([["Base"], ["PubMed", "DBPedia", "Hospital", "Combined"]])
recall_prec_df.index = pd.MultiIndex.from_product([["Comparison"], ["PubMed", "DBPedia", "Hospital", "Combined"]])
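
The loop above stores each value as a formatted string, which makes later sorting or numeric comparison awkward. One possible alternative, sketched here with a hypothetical helper `pairwise_metric` and toy sets, keeps the values as floats and formats them only when printing.

```python
def recall(base, comparison):
    tp = len(base.intersection(comparison))
    fn = len(base - comparison)
    return tp / (tp + fn)

def pairwise_metric(sets_by_name, metric):
    """Return {base_name: {comparison_name: value}} for every pair of distinct sets.

    Hypothetical helper, not part of the original notebook.
    """
    return {
        base_name: {
            comparison_name: metric(base_set, comparison_set)
            for comparison_name, comparison_set in sets_by_name.items()
            if comparison_name != base_name
        }
        for base_name, base_set in sets_by_name.items()
    }

# Toy sets standing in for the real association sets.
toy_sets = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}}
recalls = pairwise_metric(toy_sets, recall)

# Format as percentages only at display time.
for base_name, row in recalls.items():
    for comparison_name, value in row.items():
        print(f"base={base_name} comparison={comparison_name} recall={value:.2%}")
```

Passing a nested dictionary of this shape to `pd.DataFrame` uses the outer keys as columns and the inner keys as the index, which matches the base-as-column, comparison-as-row layout used here.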

Recall: of the associations in the "base" health knowledge graph, what percentage also exist in "comparison"?  

In [185]:
recall_df = recall_prec_df.copy()
recall_df

| Comparison (rows) / Base (columns) | PubMed | DBPedia | Hospital | Combined |
|---|---|---|---|---|
| PubMed | | 57.25% | 10.25% | 95.6% |
| DBPedia | 0.49% | | 2.1% | 3.12% |
| Hospital | 0.14% | 3.37% | | 3.53% |
| Combined | 26.09% | 100.0% | 70.6% | |


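Precision: of the associations in the "comparison" health knowledge graph, what percentage also exist in "base"? Because precision(base, comparison) equals recall(comparison, base), the precision table is the transpose of the recall table.  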

In [186]:
prec_df = recall_prec_df.copy().transpose()
prec_df.columns = pd.MultiIndex.from_product([["Base"], ["PubMed", "DBPedia", "Hospital", "Combined"]])
prec_df.index = pd.MultiIndex.from_product([["Comparison"], ["PubMed", "DBPedia", "Hospital", "Combined"]])
prec_df

| Comparison (rows) / Base (columns) | PubMed | DBPedia | Hospital | Combined |
|---|---|---|---|---|
| PubMed | | 0.49% | 0.14% | 26.09% |
| DBPedia | 57.25% | | 3.37% | 100.0% |
| Hospital | 10.25% | 2.1% | | 70.6% |
| Combined | 95.6% | 3.12% | 3.53% | |


The PubMed dataset contains far more associations than either the DBPedia or the Hospital dataset.  
However, more than 40% of the DBPedia associations are not captured by PubMed (recall of 57.25%), and DBPedia shares very few associations (39) with the Hospital dataset.
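
The relative sizes mentioned above can be cross-checked from the tables alone. Since recall with dataset X as base equals the shared count divided by the size of X, the size of X is approximately the shared count divided by the recall. The sketch below uses only figures copied from the tables; because the percentages are rounded, the results are rough estimates rather than exact counts.

```python
# Shared counts and recall values (as fractions) copied from the tables above.
# size(base) is approximately shared(base, comparison) / recall(base, comparison)
estimated_sizes = {
    "PubMed": 35430 / 0.2609,    # base PubMed, comparison Combined
    "DBPedia": 663 / 0.5725,     # base DBPedia, comparison PubMed
    "Hospital": 190 / 0.1025,    # base Hospital, comparison PubMed
    "Combined": 35430 / 0.9560,  # base Combined, comparison PubMed
}

for name, size in estimated_sizes.items():
    print(f"{name}: approximately {size:,.0f} associations")
```

The estimates put PubMed on the order of 10^5 associations against roughly 10^3 for DBPedia and Hospital, with Hospital somewhat larger than DBPedia, which is consistent with the observations above.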

## References

Zhou, X., Menche, J., Barabási, A. L., & Sharma, A. (2014). Human symptoms–disease network. Nature communications, 5(1), 1-10. Dataset released under Creative Commons Attribution v4.0 International licence.  
DBPedia dataset (https://dbpedia.org/), released under Creative Commons Attribution-ShareAlike 3.0.  
Wang X, Chused A, Elhadad N, Friedman C, Markatou M. Automated knowledge acquisition from clinical narrative reports. AMIA Annu Symp Proc. 2008 Nov 6;2008:783-7. PMID: 18999156; PMCID: PMC2656103.