In [1]:
import pandas as pd
# link to datasets to download
#https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/IXA7BM
kg = pd.read_csv('import/kg.csv', low_memory=False)
nodes = pd.read_csv('import/nodes.csv', low_memory=False)
drug_features = pd.read_csv('import/drug_features.csv', low_memory=False)
disease_features = pd.read_csv('import/disease_features.csv', low_memory=False)


In [2]:
print(kg.columns.tolist())
print(nodes.columns.tolist())
print(drug_features.columns.tolist())
print(disease_features.columns.tolist())

['relation', 'display_relation', 'x_index', 'x_id', 'x_type', 'x_name', 'x_source', 'y_index', 'y_id', 'y_type', 'y_name', 'y_source']
['node_index', 'node_id', 'node_type', 'node_name', 'node_source']
['node_index', 'description', 'half_life', 'indication', 'mechanism_of_action', 'protein_binding', 'pharmacodynamics', 'state', 'atc_1', 'atc_2', 'atc_3', 'atc_4', 'category', 'group', 'pathway', 'molecular_weight', 'tpsa', 'clogp']
['node_index', 'mondo_id', 'mondo_name', 'group_id_bert', 'group_name_bert', 'mondo_definition', 'umls_description', 'orphanet_definition', 'orphanet_prevalence', 'orphanet_epidemiology', 'orphanet_clinical_description', 'orphanet_management_and_treatment', 'mayo_symptoms', 'mayo_causes', 'mayo_risk_factors', 'mayo_complications', 'mayo_prevention', 'mayo_see_doc']


#### KG
- **relation**: The technical name of the link (e.g., indication, off-label use).
- **display_relation**: The human-readable version (e.g., "parent-child", "side effect").
- **x_index / y_index**: The unique numerical IDs of the two linked entities; they match `node_index` in the other files.
- **x_id / y_id**: The external database IDs (e.g., a DrugBank ID or a Mondo ID).
- **x_type / y_type**: What kind of things they are (e.g., drug, disease, effect/phenotype).
- **x_name / y_name**: The common names (e.g., "Aspirin", "Headache").
- **x_source / y_source**: The source database or vocabulary of each entity's ID (e.g., DrugBank, MONDO).
#### Nodes
- **node_index**: The primary key used to link to the other files.
- **node_id**: The official medical code (HPO, MONDO, etc.).
- **node_type**: The category (Drug, Disease, etc.).
- **node_name**: The official name of the concept.
- **node_source**: The vocabulary source (e.g., MSH for MeSH).
#### drug_features
- **description**: A general overview of the drug.
- **half_life**: Elimination half-life, i.e. how long the body takes to clear half of the drug (critical for dosage reasoning).
- **indication**: The specific medical condition the drug is approved to treat.
- **mechanism_of_action**: (CRITICAL) The biological "how": which targets (e.g., receptors or enzymes) the drug acts on.
- **protein_binding**: The extent to which the drug binds to plasma proteins in the blood.
- **pharmacodynamics**: The effect the drug has on the body.
- **state**: Physical state (Solid, Liquid, etc.).
- **atc_1 through atc_4**: Levels 1 to 4 of the Anatomical Therapeutic Chemical classification, a hierarchy that runs from the targeted organ system down to finer therapeutic and chemical subgroups.
- **category / group**: Labels like "Analgesic" or "Approved".
- **pathway**: The biological pathway(s) the drug is involved in.
- **molecular_weight / tpsa / clogp**: Chemical properties (less relevant for diagnosis, more for drug discovery).

#### disease_features
- **mondo_id / mondo_name** : Official identifier and name in the Mondo Disease Ontology.

- **group_id_bert / group_name_bert**: AI-clustered disease groups (useful for machine learning tasks).

- **mondo_definition**: (CRITICAL) The textbook definition of the disease.

- **umls_description**: A secondary description from the Unified Medical Language System.

- **orphanet_definition / orphanet_epidemiology / orphanet_clinical_description / orphanet_management_and_treatment**: Deep data for rare diseases (Orphanet focus).

- **orphanet_prevalence**: How common the disease is (essential for calculating probability).

- **mayo_symptoms**: (CRITICAL for DDXPlus) A list of symptoms used to diagnose the disease.

- **mayo_causes**: What triggers the disease (environmental, genetic, etc.).

- **mayo_risk_factors / mayo_complications**: What makes the disease more likely or worse, and what it can lead to.

- **mayo_prevention**: How to avoid it.

- **mayo_see_doc**: Guidance on when to see a doctor, including "red flag" signs that call for emergency care.
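Before relying on any of the text fields listed above, it can help to check how often each one is actually filled in. The following is a minimal sketch on a made-up frame: `toy_features` and `non_null_share` are hypothetical names, and the values are invented. On the real tables one would pass `drug_features` or `disease_features` instead.

```python
import pandas as pd

# Toy stand-in for a features table; the values are invented for illustration.
toy_features = pd.DataFrame({
    "node_index": [1, 2, 3, 4],
    "mondo_definition": ["def a", None, "def c", "def d"],
    "mayo_symptoms": [None, None, "fever", None],
})


def non_null_share(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of non-null values per column, lowest first."""
    return df.notna().mean().sort_values()


coverage = non_null_share(toy_features)
print(coverage)
```

Columns near the top of the printed series are the sparsest, so they are the ones most likely to produce `N/A` cells after a join.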


In [4]:
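# Strip stray whitespace from the column names of every loaded table.
kg.columns = [c.strip() for c in kg.columns]
nodes.columns = [c.strip() for c in nodes.columns]
drug_features.columns = [c.strip() for c in drug_features.columns]
disease_features.columns = [c.strip() for c in disease_features.columns]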

In [7]:
# drop_duplicates() keeps each unique (relation, display_relation) pair; sort_values() is only for readability.
mapping = kg[['relation', 'display_relation']].drop_duplicates().sort_values('relation')
relation_counts = kg['relation'].value_counts()  # Series: relation name -> number of rows with that relation
print("number of relation types: ",len(relation_counts))
print(mapping)

number of relation types:  30
                           relation         display_relation
3834678             anatomy_anatomy             parent-child
5366913      anatomy_protein_absent        expression absent
3848710     anatomy_protein_present       expression present
3413119       bioprocess_bioprocess             parent-child
3637356          bioprocess_protein           interacts with
3479579           cellcomp_cellcomp             parent-child
3553954            cellcomp_protein           interacts with
346728             contraindication         contraindication
3315993             disease_disease             parent-child
3084053  disease_phenotype_negative         phenotype absent
3085246  disease_phenotype_positive        phenotype present
3235582             disease_protein          associated with
389359                    drug_drug  synergistic interaction
3348335                 drug_effect              side effect
321075                 drug_protein                  ca
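The printed mapping shows that several relations share one display label ("parent-child", "interacts with"). A small check can confirm whether each technical relation maps to exactly one label, and which labels are shared. This is a sketch on invented rows; `toy_kg` is a hypothetical stand-in for `kg`.

```python
import pandas as pd

# Toy stand-in for the two relation columns of the knowledge graph.
toy_kg = pd.DataFrame({
    "relation": ["indication", "drug_effect", "anatomy_anatomy", "disease_disease"],
    "display_relation": ["indication", "side effect", "parent-child", "parent-child"],
})

pairs = toy_kg[["relation", "display_relation"]].drop_duplicates()

# Distinct display labels per relation; anything above 1 would be ambiguous.
labels_per_relation = pairs.groupby("relation")["display_relation"].nunique()
ambiguous = labels_per_relation[labels_per_relation > 1]

# Distinct relations per display label; values above 1 are shared labels.
relations_per_label = pairs.groupby("display_relation")["relation"].nunique()
shared = relations_per_label[relations_per_label > 1]

print("ambiguous relations:", list(ambiguous.index))
print("shared labels:", list(shared.index))
```

If `ambiguous` is empty, `relation` can safely be used as the key and `display_relation` treated as a coarser label for presentation only.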

In [8]:
examples = nodes.groupby('node_type').first().reset_index()

print(f"{'NODE TYPE':<25} | {'EXAMPLE NAME (node_name)'}")
print("-" * 60)

for _, row in examples.iterrows():
    print(f"{row['node_type']:<25} | {row['node_name']}")

NODE TYPE                 | EXAMPLE NAME (node_name)
------------------------------------------------------------
anatomy                   | uterine cervix
biological_process        | negative regulation of neurotransmitter uptake
cellular_component        | cellular anatomical entity
disease                   | osteogenesis imperfecta
drug                      | Copper
effect/phenotype          | Growth abnormality
exposure                  | 1-hydroxyphenanthrene
gene/protein              | PHYHIP
molecular_function        | methyltransferase activity
pathway                   | Apoptosis


In [9]:

clinical_relations = ['disease_phenotype_positive', 'indication', 'contraindication']
filtered_df = kg[kg['relation'].isin(clinical_relations)].copy()


In [10]:
nodes_map = filtered_df[['relation', 'x_type', 'y_type']].drop_duplicates().reset_index(drop=True)
print("--- Clinical Relationship Node Map ---")
print(nodes_map)

print("\n--- Connection Type Counts ---")
counts = filtered_df.groupby(['relation', 'x_type', 'y_type']).size().reset_index(name='count')
print(counts)

--- Clinical Relationship Node Map ---
                     relation            x_type            y_type
0            contraindication              drug           disease
1                  indication              drug           disease
2  disease_phenotype_positive           disease  effect/phenotype
3  disease_phenotype_positive  effect/phenotype           disease
4            contraindication           disease              drug
5                  indication           disease              drug

--- Connection Type Counts ---
                     relation            x_type            y_type   count
0            contraindication           disease              drug   30675
1            contraindication              drug           disease   30675
2  disease_phenotype_positive           disease  effect/phenotype  150317
3  disease_phenotype_positive  effect/phenotype           disease  150317
4                  indication           disease              drug    9388
5                  indi
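Each relation appears in both directions with identical counts, which suggests (but does not prove) that every edge is stored together with its mirror. If only one direction is wanted, one option is to fix which node type sits on the x side for each relation. The sketch below assumes that convention; `canonical_x_type` and `toy_edges` are hypothetical, and the rows are invented.

```python
import pandas as pd

# Toy edges stored in both directions, mimicking the pattern in the output above.
toy_edges = pd.DataFrame({
    "relation": ["indication", "indication",
                 "disease_phenotype_positive", "disease_phenotype_positive"],
    "x_index": [10, 20, 20, 30],
    "x_type": ["drug", "disease", "disease", "effect/phenotype"],
    "y_index": [20, 10, 30, 20],
    "y_type": ["disease", "drug", "effect/phenotype", "disease"],
})

# Assumed convention: the node type that should appear on the x side.
canonical_x_type = {
    "indication": "drug",
    "contraindication": "drug",
    "disease_phenotype_positive": "disease",
}

# Keep only rows whose x_type matches the convention for their relation.
keep = toy_edges["x_type"] == toy_edges["relation"].map(canonical_x_type)
directed = toy_edges[keep].reset_index(drop=True)
print(directed)
```

With this filter the x side is always a drug for indication and contraindication rows, so drug features only need to be joined on `x_index` and disease features on `y_index`.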

In [11]:
# Attach drug features to the x side: match x_index in the KG to node_index in drug_features.
enriched = pd.merge(
    filtered_df, 
    drug_features[['node_index', 'mechanism_of_action', 'description']], 
    left_on='x_index', 
    right_on='node_index', 
    how='left'
).rename(columns={'mechanism_of_action': 'x_mechanism', 'description': 'x_description'}).drop(columns=['node_index'])

In [12]:
# Attach drug features to the y side: match y_index in the KG to node_index in drug_features.
enriched = pd.merge(
    enriched, 
    drug_features[['node_index', 'mechanism_of_action', 'description']], 
    left_on='y_index', 
    right_on='node_index', 
    how='left'
).rename(columns={'mechanism_of_action': 'y_mechanism', 'description': 'y_description'}).drop(columns=['node_index'])

In [13]:
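# Attach disease features to the x side: match x_index in the KG to node_index in disease_features.
enriched = pd.merge(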
    enriched, 
    disease_features[['node_index', 'mondo_definition', 'mayo_symptoms']], 
    left_on='x_index', 
    right_on='node_index', 
    how='left'
).rename(columns={'mondo_definition': 'x_definition', 'mayo_symptoms': 'x_symptoms'}).drop(columns=['node_index'])

In [14]:
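# Attach disease features to the y side: match y_index in the KG to node_index in disease_features.
enriched = pd.merge(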
    enriched, 
    disease_features[['node_index', 'mondo_definition', 'mayo_symptoms']], 
    left_on='y_index', 
    right_on='node_index', 
    how='left'
).rename(columns={'mondo_definition': 'y_definition', 'mayo_symptoms': 'y_symptoms'}).drop(columns=['node_index'])
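The four merges above follow one pattern: left-join a features table on one side of the edge, prefix the new columns with that side, and drop the join key. A possible way to factor this out is sketched below. `attach_features` is a hypothetical helper, not part of pandas or of the original notebook, and the toy frames are invented.

```python
import pandas as pd


def attach_features(edges, features, side, rename):
    """Left-join `features` onto one side ('x' or 'y') of `edges`.

    `rename` maps feature column names to base names; each base name gets
    the side prefix, e.g. {'mechanism_of_action': 'mechanism'} with
    side='x' produces a column called 'x_mechanism'.
    """
    cols = ["node_index"] + list(rename)
    prefixed = {old: f"{side}_{new}" for old, new in rename.items()}
    merged = edges.merge(
        features[cols],
        left_on=f"{side}_index",
        right_on="node_index",
        how="left",
    )
    return merged.rename(columns=prefixed).drop(columns=["node_index"])


# Toy data: node 1 is a drug with a known mechanism, node 2 has no drug features.
toy_edges = pd.DataFrame({
    "x_index": [1, 2],
    "y_index": [2, 1],
    "relation": ["indication", "indication"],
})
toy_drugs = pd.DataFrame({
    "node_index": [1],
    "mechanism_of_action": ["inhibits ACE"],
})

out = toy_edges
for side in ("x", "y"):
    out = attach_features(out, toy_drugs, side, {"mechanism_of_action": "mechanism"})
print(out)
```

Calling the helper once per side and per features table reproduces the chain of merges while keeping the prefix logic in one place.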

In [15]:
display_table = enriched[[
    'x_name', 
    'relation', 
    'y_name', 
    'x_mechanism', 
    'x_definition',
    'y_mechanism',
    'y_definition'
]].drop_duplicates().head(10).fillna('N/A')

pd.set_option('display.max_colwidth', None)
display_table

Unnamed: 0,x_name,relation,y_name,x_mechanism,x_definition,y_mechanism,y_definition
0,Rotigotine,contraindication,hypertensive disorder,"Rotigotine, a member of the dopamine agonist class of drugs, is delivered continuously through the skin (transdermal) using a silicone-based patch that is replaced every 24 hours. A dopamine agonist works by activating dopamine receptors in the body, mimicking the effect of the neurotransmitter dopamine. The precise mechanism of action of rotigotine as a treatment for Restless Legs Syndrome is unknown but is thought to be related to its ability to stimulate dopamine",,,"Persistently high systemic arterial blood pressure. Based on multiple readings (blood pressure determination), hypertension is currently defined as when systolic pressure is consistently greater than 140 mm Hg or when diastolic pressure is consistently 90 mm Hg or more."
13,Rotigotine,contraindication,hypertension,"Rotigotine, a member of the dopamine agonist class of drugs, is delivered continuously through the skin (transdermal) using a silicone-based patch that is replaced every 24 hours. A dopamine agonist works by activating dopamine receptors in the body, mimicking the effect of the neurotransmitter dopamine. The precise mechanism of action of rotigotine as a treatment for Restless Legs Syndrome is unknown but is thought to be related to its ability to stimulate dopamine",,,High blood pressure caused by an underlying medical condition.
15,Rotigotine,contraindication,hypertension,"Rotigotine, a member of the dopamine agonist class of drugs, is delivered continuously through the skin (transdermal) using a silicone-based patch that is replaced every 24 hours. A dopamine agonist works by activating dopamine receptors in the body, mimicking the effect of the neurotransmitter dopamine. The precise mechanism of action of rotigotine as a treatment for Restless Legs Syndrome is unknown but is thought to be related to its ability to stimulate dopamine",,,Hypertension that presents without an identifiable cause.
27,Rotigotine,contraindication,hypertension,"Rotigotine, a member of the dopamine agonist class of drugs, is delivered continuously through the skin (transdermal) using a silicone-based patch that is replaced every 24 hours. A dopamine agonist works by activating dopamine receptors in the body, mimicking the effect of the neurotransmitter dopamine. The precise mechanism of action of rotigotine as a treatment for Restless Legs Syndrome is unknown but is thought to be related to its ability to stimulate dopamine",,,An instance of hypertension that is caused by a modification of the individual's genome.
28,Rotigotine,contraindication,hypertension,"Rotigotine, a member of the dopamine agonist class of drugs, is delivered continuously through the skin (transdermal) using a silicone-based patch that is replaced every 24 hours. A dopamine agonist works by activating dopamine receptors in the body, mimicking the effect of the neurotransmitter dopamine. The precise mechanism of action of rotigotine as a treatment for Restless Legs Syndrome is unknown but is thought to be related to its ability to stimulate dopamine",,,"Increased blood pressure in the portal venous system. It is most commonly caused by cirrhosis. Other causes include portal vein thrombosis, Budd-Chiari syndrome, and right heart failure. Complications include ascites, esophageal varices, encephalopathy, and splenomegaly."
36,Rotigotine,contraindication,hypertension,"Rotigotine, a member of the dopamine agonist class of drugs, is delivered continuously through the skin (transdermal) using a silicone-based patch that is replaced every 24 hours. A dopamine agonist works by activating dopamine receptors in the body, mimicking the effect of the neurotransmitter dopamine. The precise mechanism of action of rotigotine as a treatment for Restless Legs Syndrome is unknown but is thought to be related to its ability to stimulate dopamine",,,"A severe medical condition which is estimated to appear in 9-18% of hypertensive patients, in which treatement with 3 or more antihypertensive drugs including diuretics are ineffective."
37,Fosinopril,indication,hypertensive disorder,"There are two isoforms of ACE: the somatic isoform, which exists as a glycoprotein comprised of a single polypeptide chain of 1277; and the testicular isoform, which has a lower molecular mass and is thought to play a role in sperm maturation and binding of sperm to the oviduct epithelium. Somatic ACE has two functionally active domains, N and C, which arise from tandem gene duplication. Although the two domains have high sequence similarity, they play distinct physiological roles. The C-domain is predominantly involved in blood pressure regulation while the N-domain plays a role in hematopoietic stem cell differentiation and proliferation. ACE inhibitors bind to and inhibit the activity of both domains, but have much greater affinity for and inhibitory activity against the C-domain. Fosinoprilat, the active metabolite of fosinopril, competes with ATI for binding to ACE and inhibits and enzymatic proteolysis of ATI to ATII. Decreasing ATII levels in the body decreases blood pressure by inhibiting the pressor effects of ATII as described in the Pharmacology section above. Fosinoprilat also causes an increase in plasma renin activity likely due to a loss of feedback inhibition mediated by ATII on the release of renin and/or stimulation of reflex mechanisms via baroreceptors.",,,"Persistently high systemic arterial blood pressure. Based on multiple readings (blood pressure determination), hypertension is currently defined as when systolic pressure is consistently greater than 140 mm Hg or when diastolic pressure is consistently 90 mm Hg or more."
50,Fosinopril,indication,hypertension,"There are two isoforms of ACE: the somatic isoform, which exists as a glycoprotein comprised of a single polypeptide chain of 1277; and the testicular isoform, which has a lower molecular mass and is thought to play a role in sperm maturation and binding of sperm to the oviduct epithelium. Somatic ACE has two functionally active domains, N and C, which arise from tandem gene duplication. Although the two domains have high sequence similarity, they play distinct physiological roles. The C-domain is predominantly involved in blood pressure regulation while the N-domain plays a role in hematopoietic stem cell differentiation and proliferation. ACE inhibitors bind to and inhibit the activity of both domains, but have much greater affinity for and inhibitory activity against the C-domain. Fosinoprilat, the active metabolite of fosinopril, competes with ATI for binding to ACE and inhibits and enzymatic proteolysis of ATI to ATII. Decreasing ATII levels in the body decreases blood pressure by inhibiting the pressor effects of ATII as described in the Pharmacology section above. Fosinoprilat also causes an increase in plasma renin activity likely due to a loss of feedback inhibition mediated by ATII on the release of renin and/or stimulation of reflex mechanisms via baroreceptors.",,,High blood pressure caused by an underlying medical condition.
52,Fosinopril,indication,hypertension,"There are two isoforms of ACE: the somatic isoform, which exists as a glycoprotein comprised of a single polypeptide chain of 1277; and the testicular isoform, which has a lower molecular mass and is thought to play a role in sperm maturation and binding of sperm to the oviduct epithelium. Somatic ACE has two functionally active domains, N and C, which arise from tandem gene duplication. Although the two domains have high sequence similarity, they play distinct physiological roles. The C-domain is predominantly involved in blood pressure regulation while the N-domain plays a role in hematopoietic stem cell differentiation and proliferation. ACE inhibitors bind to and inhibit the activity of both domains, but have much greater affinity for and inhibitory activity against the C-domain. Fosinoprilat, the active metabolite of fosinopril, competes with ATI for binding to ACE and inhibits and enzymatic proteolysis of ATI to ATII. Decreasing ATII levels in the body decreases blood pressure by inhibiting the pressor effects of ATII as described in the Pharmacology section above. Fosinoprilat also causes an increase in plasma renin activity likely due to a loss of feedback inhibition mediated by ATII on the release of renin and/or stimulation of reflex mechanisms via baroreceptors.",,,Hypertension that presents without an identifiable cause.
64,Fosinopril,indication,hypertension,"There are two isoforms of ACE: the somatic isoform, which exists as a glycoprotein comprised of a single polypeptide chain of 1277; and the testicular isoform, which has a lower molecular mass and is thought to play a role in sperm maturation and binding of sperm to the oviduct epithelium. Somatic ACE has two functionally active domains, N and C, which arise from tandem gene duplication. Although the two domains have high sequence similarity, they play distinct physiological roles. The C-domain is predominantly involved in blood pressure regulation while the N-domain plays a role in hematopoietic stem cell differentiation and proliferation. ACE inhibitors bind to and inhibit the activity of both domains, but have much greater affinity for and inhibitory activity against the C-domain. Fosinoprilat, the active metabolite of fosinopril, competes with ATI for binding to ACE and inhibits and enzymatic proteolysis of ATI to ATII. Decreasing ATII levels in the body decreases blood pressure by inhibiting the pressor effects of ATII as described in the Pharmacology section above. Fosinoprilat also causes an increase in plasma renin activity likely due to a loss of feedback inhibition mediated by ATII on the release of renin and/or stimulation of reflex mechanisms via baroreceptors.",,,An instance of hypertension that is caused by a modification of the individual's genome.
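In the table above the same edge (for example Rotigotine and "hypertension") repeats with different definitions, and the index jumps from 0 to 13 to 15. This pattern suggests that `disease_features` holds more than one row for some `node_index` values, so the left join multiplies edge rows. One possible remedy, sketched on invented data, is to collapse the features to one row per node before merging and let `validate="many_to_one"` raise an error if duplicates remain. Joining the texts with `" | "` is an arbitrary choice made for this sketch.

```python
import pandas as pd

# Toy features: node 7 has two definitions, node 8 has none.
toy_disease = pd.DataFrame({
    "node_index": [7, 7, 8],
    "mondo_definition": ["secondary form", "primary form", None],
})

# One row per node: drop missing texts and exact duplicates, then join the rest.
collapsed = (
    toy_disease.dropna(subset=["mondo_definition"])
    .drop_duplicates()
    .groupby("node_index", as_index=False)["mondo_definition"]
    .agg(" | ".join)
)

toy_edges = pd.DataFrame({"y_index": [7, 8]})

# validate="many_to_one" makes pandas raise if node_index is not unique on the right.
merged = toy_edges.merge(
    collapsed,
    left_on="y_index",
    right_on="node_index",
    how="left",
    validate="many_to_one",
)
print(merged)
```

After collapsing, the merged frame has exactly one row per edge, and nodes without any text simply end up with a missing value.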


In [5]:
drug_nodes = nodes[nodes['node_type'] == 'drug']
disease_nodes = nodes[nodes['node_type'] == 'disease']
print(f"Drug nodes: {len(drug_nodes)}, with features: {drug_nodes['node_index'].isin(drug_features['node_index']).sum()}")
print(f"Disease nodes: {len(disease_nodes)}, with features: {disease_nodes['node_index'].isin(disease_features['node_index']).sum()}")

Drug nodes: 7957, with features: 7957
Disease nodes: 17080, with features: 17080
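The `isin` check confirms that every node has at least one feature row, but it does not say whether a node has several. A complementary check counts feature rows per `node_index`. This is a minimal sketch on a made-up frame (`toy_features` is hypothetical); on the real data one would run it on `drug_features` and `disease_features`.

```python
import pandas as pd

# Toy features table in which node 7 appears three times.
toy_features = pd.DataFrame({"node_index": [7, 7, 7, 8, 9]})

# Number of feature rows per node; values above 1 would multiply rows in a join.
rows_per_node = toy_features["node_index"].value_counts()
repeated = rows_per_node[rows_per_node > 1]

n_rows = len(toy_features)
n_nodes = toy_features["node_index"].nunique()
print(f"rows: {n_rows}, unique nodes: {n_nodes}")
print(repeated)
```

If the row count and the unique-node count differ, a join on `node_index` will return more rows than the edge table had.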
