# Hypertension compounds

We'll use pretrained embedding vectors to find new edges between FDA-approved compounds and the hypertension disease entity.

In [25]:
import pandas as pd
import numpy as np
import torch
import csv
import torch.nn.functional as fn

Load the DRKG knowledge graph:

In [None]:
drkg_file = './drkg.tsv'
df = pd.read_csv(drkg_file, sep="\t", header=None, names=["h", "r", "t"])
triples = np.array(df.values.tolist())

The data consists of (head, relation, tail) triples. Each head and tail label is prefixed with its entity type:

In [7]:
df.head()

Unnamed: 0,h,r,t
0,Gene::2157,bioarx::HumGenHumGen:Gene:Gene,Gene::2157
1,Gene::2157,bioarx::HumGenHumGen:Gene:Gene,Gene::5264
2,Gene::2157,bioarx::HumGenHumGen:Gene:Gene,Gene::2158
3,Gene::2157,bioarx::HumGenHumGen:Gene:Gene,Gene::3309
4,Gene::2157,bioarx::HumGenHumGen:Gene:Gene,Gene::28912
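Because each label follows a `Type::identifier` pattern, the entity type can be recovered by splitting on the first `::`. The following is a minimal sketch; `split_entity` is a hypothetical helper introduced here for illustration and is not part of the notebook.

```python
def split_entity(label):
    """Split a DRKG-style label such as 'Gene::2157' into (type, identifier)."""
    # Split only on the first '::' so identifiers like 'DOID:10763' stay intact.
    entity_type, _, identifier = label.partition("::")
    return entity_type, identifier

print(split_entity("Gene::2157"))
print(split_entity("Disease::DOID:10763"))
```

Splitting once matters because some identifiers contain a single colon of their own.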


The edges we are interested in are the treatment edges between compounds and diseases:

In [23]:
allowed_labels = ['Hetionet::CtD::Compound:Disease','GNBR::T::Compound:Disease']

representing around 54K edges:

In [45]:
print("amount: ", df[df.r.isin(allowed_labels)].shape[0])

amount:  54775


[Hypertension](https://en.wikipedia.org/wiki/Hypertension) has node ID "Disease::DOID:10763". The list can hold several IDs, which is useful because some diseases, such as COVID, have variants with separate IDs in the graph.

In [46]:
what_diseases = ["Disease::DOID:10763"]

We will only include FDA-approved compounds:

In [47]:
allowed_drug = []
with open("./FDAApproved.tsv", newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=['drug','ids'])
    for row_val in reader:
        allowed_drug.append(row_val['drug'])

This represents around 8,000 different substances:

In [18]:
len(allowed_drug)

8104

The pretrained vectors use numeric identifiers rather than the DRKG labels, so we need to load the dictionaries to convert back and forth:

In [48]:
entity_to_id = './entityToId.tsv'
relation_to_id = './relationToId.tsv'

In [49]:
entity_name_to_id = {}
entity_id_to_name = {}
relation_name_to_id = {}

with open(entity_to_id, newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=['name','id'])
    for row_val in reader:
        entity_name_to_id[row_val['name']] = int(row_val['id'])
        entity_id_to_name[int(row_val['id'])] = row_val['name']

with open(relation_to_id, newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=['name','id'])
    for row_val in reader:
        relation_name_to_id[row_val['name']] = int(row_val['id'])


allowed_drug_ids = []
disease_ids = []
for drug in allowed_drug:
    allowed_drug_ids.append(entity_name_to_id[drug])

for disease in what_diseases:
    disease_ids.append(entity_name_to_id[disease])

allowed_relation_ids = [relation_name_to_id[treat]  for treat in allowed_labels]
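The dictionary lookups above raise a `KeyError` if a name from the FDA list is absent from the entity dictionary. Should that happen with a different input file, a tolerant variant could collect the missing names instead. This is only a sketch: `names_to_ids` is a hypothetical helper and the ids in the toy dictionary are made up.

```python
def names_to_ids(names, name_to_id):
    """Map names to numeric ids, collecting names absent from the dictionary."""
    ids, missing = [], []
    for name in names:
        if name in name_to_id:
            ids.append(name_to_id[name])
        else:
            missing.append(name)
    return ids, missing

# Toy dictionary with made-up ids, for illustration only.
toy_name_to_id = {"Compound::DB00584": 0, "Compound::DB00521": 1}
ids, missing = names_to_ids(
    ["Compound::DB00584", "Compound::UNKNOWN", "Compound::DB00521"],
    toy_name_to_id,
)
print(ids, missing)
```

Reporting the missing names, rather than dropping them silently, makes it easier to spot a mismatch between the drug list and the embedding dictionaries.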

In [51]:
entity_emb = np.load('./entity_vectors.npy')
rel_emb = np.load('./relation_vectors.npy')

allowed_drug_ids = torch.tensor(allowed_drug_ids).long()
disease_ids = torch.tensor(disease_ids).long()
allowed_relation_ids = torch.tensor(allowed_relation_ids)

allowed_drug_tensors = torch.tensor(entity_emb[allowed_drug_ids])
allowed_relation_tensors = [torch.tensor(rel_emb[rel]) for rel in allowed_relation_ids]

In [43]:
print("node vectors: ",len(entity_emb),",edge vectors: ",len(rel_emb))

node vectors:  97238 ,edge vectors:  107


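Score every candidate (compound, relation, disease) triple from the embedding vectors. The score is the log-sigmoid of `threshold` minus the L2 distance between h + r and t, so a smaller distance gives a score closer to 0: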

In [62]:
threshold = 20

def score(h, r, t):
    # Log-sigmoid of (threshold - L2 distance between h + r and t).
    return fn.logsigmoid(threshold - torch.norm(h + r - t, p=2, dim=-1))

allowed_drug_scores = []
drug_ids = []
for rel_vector in allowed_relation_tensors:
    for disease_id in disease_ids:
        # Convert explicitly so all operands of score() are torch tensors.
        disease_vector = torch.tensor(entity_emb[disease_id])
        # Score every allowed drug against this relation and disease.
        drug_score = score(allowed_drug_tensors, rel_vector, disease_vector)
        allowed_drug_scores.append(drug_score)
        drug_ids.append(allowed_drug_ids)
scores = torch.cat(allowed_drug_scores)
drug_ids = torch.cat(drug_ids)
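To make the scoring formula concrete, here is a plain-Python restatement on toy 2-dimensional vectors. It assumes the same formula as the `score` function in the cell above. The vectors are invented for illustration, and `toy_score` is a hypothetical name.

```python
import math

def toy_score(h, r, t, threshold=20.0):
    """log(sigmoid(threshold - ||h + r - t||_2)) for plain Python lists."""
    distance = math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))
    x = threshold - distance
    # log(sigmoid(x)) = -log(1 + exp(-x)); fine for small toy values,
    # not hardened against overflow for very negative x.
    return -math.log1p(math.exp(-x))

relation = [1.0, 0.0]
disease = [1.0, 1.0]
near = toy_score([0.0, 1.0], relation, disease)   # h + r equals t, distance 0
far = toy_score([30.0, 1.0], relation, disease)   # h + r is 30 away from t
print(near > far)
```

A head whose translated position h + r lands close to the tail gets a score near 0, while a distant one gets a strongly negative score. This is why the best candidates in the ranking have scores just below zero.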

In [63]:
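# Sort all scores in descending order.
idx = torch.flip(torch.argsort(scores), dims=[0])
scores = scores[idx].numpy()
drug_ids = drug_ids[idx].numpy()
# A drug appears once per relation; np.unique returns the index of its first
# (highest-scoring) occurrence in the sorted array.
_, unique_indices = np.unique(drug_ids, return_index=True)
# Keep the 10 best-ranked unique drugs.
topk_indices = np.sort(unique_indices)[:10]
proposed_dids = drug_ids[topk_indices]
proposed_scores = scores[topk_indices]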
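The de-duplication step can be restated without NumPy: walk the ranked list from best to worst and keep only the first occurrence of each drug id. The sketch below uses toy data, and `top_unique` is a hypothetical helper rather than part of the notebook.

```python
def top_unique(ranked_ids, ranked_scores, k):
    """Keep the first (best-scoring) occurrence of each id, up to k entries."""
    seen = set()
    result = []
    for drug_id, s in zip(ranked_ids, ranked_scores):
        if drug_id in seen:
            continue
        seen.add(drug_id)
        result.append((drug_id, s))
        if len(result) == k:
            break
    return result

# Toy ranking, already sorted by descending score; id 7 appears twice
# because it was scored once per relation.
ranked_ids = [7, 3, 7, 5]
ranked_scores = [-0.1, -0.2, -0.3, -0.4]
top2 = top_unique(ranked_ids, ranked_scores, k=2)
print(top2)
```

Each drug is scored once per relation type, so without this step the same compound could occupy several slots in the top 10.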

In [65]:
for i in range(10):
    drug = int(proposed_dids[i])
    drug_score = proposed_scores[i]  # avoid shadowing the score() function

    print("{}\t{}".format(entity_id_to_name[drug], drug_score))

Compound::DB00584	-1.883488948806189e-05
Compound::DB00521	-2.2053474822314456e-05
Compound::DB00492	-2.586808113846928e-05
Compound::DB00275	-2.6464111215318553e-05
Compound::DB00869	-2.6702524337451905e-05
Compound::DB00524	-2.8609820219571702e-05
Compound::DB00421	-2.8967437174287625e-05
Compound::DB00722	-2.9682672902708873e-05
Compound::DB00528	-3.0397906812140718e-05
Compound::DB00612	-3.0874729418428615e-05


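[Enalapril](https://go.drugbank.com/drugs/DB00584) (DB00584) receives the highest score. It is an ACE inhibitor already used to treat hypertension, so the top of the ranking is consistent with established therapy.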