# COVID-19 knowledge graph

## Abstract

This work extracts relations between chemicals, genes and diseases from research articles to build a biomedical knowledge graph related to COVID-19. Entity names are collected from [BioSNAP](http://snap.stanford.edu/biodata/index.html) and used to identify mentions of chemicals, genes and diseases in abstracts.

To determine which relationship types are possible and to map unstructured natural-language descriptions onto these structured classes, the labeled sentences from [Percha B 2018](https://academic.oup.com/bioinformatics/article/34/15/2614/4911883) are used to train a classifier for relations between entities.
The source code for the classification model is available [here](https://github.com/jxzly/Biomedical-Relation-Classification).

In [None]:
!pip install pyecharts

In [None]:
import numpy as np 
import pandas as pd
from tqdm.notebook import tqdm
from pyecharts import options as opts
from pyecharts.charts import Graph
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset,TensorDataset,DataLoader
from keras.preprocessing import sequence
from transformers import BertTokenizer, BertForSequenceClassification

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings('ignore')

# Load marked sentences

These sentences, each containing more than two biomedical entities, were extracted from abstracts. The entities in each sentence are combined pairwise to build candidate relations.

First, a vocabulary is created from the drug, gene and disease names in SNAP, and some entities related to COVID-19 are added manually. Then every abstract sentence that contains more than two vocabulary entities is marked, once for each pair of entities.

> example:
> vocab: [A,B,C...]
> sentence: A and C are effective for B.
> marked_sentence: 
> 1. start_entity and end_entity are effective for B.
> 2. start_entity and C are effective for end_entity.
> 3. A and start_entity are effective for end_entity.
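
The marking step in the example above could be sketched as follows. This is a minimal illustration, not the code that produced `marked_sentence.csv`: `mark_entity_pairs` is a hypothetical helper, and it assumes entities are matched as whole words and paired in order of first appearance.

```python
import re
from itertools import combinations

def mark_entity_pairs(sentence, vocab):
    """Return one marked copy of `sentence` per pair of vocabulary entities (hypothetical helper)."""
    # Locate each vocabulary entry that occurs as a whole word.
    found = []
    for entity in vocab:
        match = re.search(r'\b' + re.escape(entity) + r'\b', sentence)
        if match:
            found.append((match.start(), entity))
    # Order entities by where they first appear in the sentence.
    entities = [entity for _, entity in sorted(found)]
    marked = []
    for first, second in combinations(entities, 2):
        text = re.sub(r'\b' + re.escape(first) + r'\b', 'start_entity', sentence)
        text = re.sub(r'\b' + re.escape(second) + r'\b', 'end_entity', text)
        marked.append({'start_entity': first, 'end_entity': second, 'marked_sentence': text})
    return marked

for row in mark_entity_pairs('A and C are effective for B.', ['A', 'B', 'C']):
    print(row['marked_sentence'])
```

Each returned row corresponds to one candidate relation; the entity that appears earlier in the sentence is assumed to become `start_entity`.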

In [None]:
marked_sentence_df = pd.read_csv('/kaggle/input/covid19-knowledge-graph/marked_sentence.csv')

In [None]:
marked_sentence_df.sample(5)

In [None]:
marked_sentence_df["marked_sentence"][1]

In [None]:
marked_sentence_df.head(5)

In [None]:
def Build_graph(df,relation=False,repulsion=40,title='COVID-19 knowledge graph',labelShow=False):
    entity_type_dic = dict(df.drop_duplicates(['start_entity']).set_index(['start_entity'])['start_entity_type'])
    entity_type_dic.update(dict(df.drop_duplicates(['end_entity']).set_index(['end_entity'])['end_entity_type']))
    color = {'Disease':'#FF7F50','Gene':'#48D1CC','Chemical':'#B3EE3A'}
    cate =  {'Disease':0,'Gene':1,'Chemical':2}
    categories = [{'name':'Disease','itemStyle': {'normal': {'color': color['Disease']}}},{'name':'Gene','itemStyle': {'normal': {'color': color['Gene']}}},{'name':'Chemical','itemStyle': {'normal': {'color': color['Chemical']}}}]
    nodes = []
    for entity in list(set(df['start_entity'])|set(df['end_entity'])):
        nodes.append({'name': entity, 'symbolSize': max(10,np.log1p(df.loc[(df['start_entity']==entity)|(df['end_entity']==entity)].shape[0])*10//1),
                     'category':cate[entity_type_dic[entity]]})
    links = []
    for i in df.index:
        if not relation:
            links.append({'source': df.loc[i,'start_entity'], 'target': df.loc[i,'end_entity']})
        else:
            links.append({'source': df.loc[i,'start_entity'], 'target': df.loc[i,'end_entity'], 'value':df.loc[i,'pred']})
    g = (
        Graph()
        .add('', nodes, links,categories, repulsion=repulsion,label_opts=opts.LabelOpts(is_show=labelShow))
        .set_global_opts(title_opts=opts.TitleOpts(title=title),legend_opts=opts.LegendOpts(orient='vertical', pos_left='2%', pos_top='40%',legend_icon='circle'))
        .render_notebook()
        )
    return g

In [None]:
g = Build_graph(marked_sentence_df.sample(100),title='subsample of topology graph')
g

# Classify relations

[Percha B 2018](https://academic.oup.com/bioinformatics/article/34/15/2614/4911883) identified 10 broad themes for chemical-gene relations, 7 for chemical-disease, 10 for gene-disease and 9 for gene-gene in Medline abstracts. The labeled sentences from that work were used to train the classifiers applied here. Because the order of the two entities in a sentence is not known in advance, each sentence is classified twice: once as marked (init_pred) and once with the entity markers swapped (reverse_pred). A prediction is kept only if its theme is valid for that entity order; when both are valid, the one with the higher probability is kept. The details can be seen [here](https://github.com/jxzly/Biomedical-Relation-Classification).
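
The selection between the two orderings can be illustrated in isolation. The sketch below is a simplified, standalone restatement of that rule, assuming only the chemical-disease themes listed in the `Conf` class of this notebook; `swap_markers` and `choose_direction` are hypothetical names introduced for illustration.

```python
# Themes valid for each entity order (chemical-disease subset of the notebook's Conf class).
RELATION_TYPE = {
    'chemical-disease': ['T', 'C', 'Sa', 'Pr', 'Pa', 'J'],
    'disease-chemical': ['Mp'],
}

def swap_markers(marked_sentence):
    # Exchange the two placeholders through a temporary token.
    return (marked_sentence.replace('start_entity', '__tmp__')
            .replace('end_entity', 'start_entity')
            .replace('__tmp__', 'end_entity'))

def choose_direction(init_pred, init_prob, reverse_pred, reverse_prob,
                     task_type='chemical-disease'):
    first, second = task_type.split('-')
    init_ok = init_pred in RELATION_TYPE[task_type]
    reverse_ok = reverse_pred in RELATION_TYPE[second + '-' + first]
    if init_ok and reverse_ok:
        # Both orderings give a valid theme: keep the more probable one.
        return 'init_pred' if init_prob >= reverse_prob else 'reverse_pred'
    if init_ok:
        return 'init_pred'
    if reverse_ok:
        return 'reverse_pred'
    return None  # neither ordering gives a valid theme; the sentence is dropped

print(swap_markers('start_entity and C are effective for end_entity.'))
print(choose_direction('T', 0.90, 'Mp', 0.95))
```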


In [None]:
class Args:
    task_type = 'chemical-disease'
    max_seq_len = 64
    bs = 64

class Conf:
    # some information can be found in:
    # Percha B, Altman R B. A global network of biomedical relationships derived from text[J]. Bioinformatics, 2018, 34(15): 2614-2624.
    relation_type = {'chemical-disease':['T', 'C', 'Sa', 'Pr', 'Pa', 'J'],
                     'disease-chemical':['Mp'],
                     'chemical-gene':['A+', 'A-', 'B', 'E+', 'E-', 'E', 'N'],
                     'gene-chemical':['O', 'K', 'Z'],
                     'gene-disease':['U', 'Ud', 'D', 'J', 'Te', 'Y', 'G'],
                     'disease-gene':['Md', 'X', 'L'],
                     'gene-gene':['B', 'W', 'V+', 'E+', 'E', 'I', 'H', 'Rg', 'Q'],
                     }

args = Args()
conf = Conf()
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
device

In [None]:
# load the fine-tuned BERT model, its tokenizer and the label mapping for one task type
def Bert_model(taskType,bertPath):
    label_df = pd.read_csv('/kaggle/input/covid19-knowledge-graph/%s_label.csv'%taskType)
    tokenizer = BertTokenizer.from_pretrained(bertPath,do_lower_case=False)
    model = BertForSequenceClassification.from_pretrained(bertPath, num_labels=label_df['label'].nunique())
    return label_df,tokenizer,model

# build data loader
def Data_loader(x,y=None,bs=128,shuffle=False,numWorkers=0):
    if y is not None:
        data = TensorDataset(x,y)
    else:
        data = TensorDataset(x)
    data_loader = DataLoader(dataset=data,batch_size=bs,shuffle=shuffle,num_workers=numWorkers)
    return data_loader

def Prepare_predict_data(tokenizer,bs):
    marked_sentences = marked_sentence_df.loc[(marked_sentence_df['start_entity_type'].apply(lambda x:x.lower())==args.task_type.split('-')[0])&\
                                              (marked_sentence_df['end_entity_type'].apply(lambda x:x.lower())==args.task_type.split('-')[1]),'marked_sentence']
    # convert tokens to ids
    ids = marked_sentences.apply(lambda x:tokenizer.convert_tokens_to_ids(tokenizer.tokenize(x))).tolist()
    # padding ids
    ids = sequence.pad_sequences(ids,args.max_seq_len, truncating='post', padding='post')
    # we cannot confirm order of entities, so predict two possibilities
    reverse_marked_sentences = marked_sentence_df.loc[(marked_sentence_df['start_entity_type'].apply(lambda x:x.lower())==args.task_type.split('-')[0])&\
                                              (marked_sentence_df['end_entity_type'].apply(lambda x:x.lower())==args.task_type.split('-')[1]),'marked_sentence']\
                                              .apply(lambda x:x.replace('start_entity','init_start_entity').replace('end_entity','start_entity').replace('init_start_entity','end_entity'))
    reverse_ids = reverse_marked_sentences.apply(lambda x:tokenizer.convert_tokens_to_ids(tokenizer.tokenize(x))).tolist()
    reverse_ids = sequence.pad_sequences(reverse_ids,args.max_seq_len, truncating='post', padding='post')
    predict_data_loader = Data_loader(torch.LongTensor(ids),torch.LongTensor(reverse_ids),bs=bs)
    return marked_sentences.values,predict_data_loader

def Predict():
    reverse_task_type = args.task_type.split('-')[1] + '-' + args.task_type.split('-')[0]
    def Filter(x):
        if x['init_pred'] in conf.relation_type[args.task_type]:
            if x['reverse_pred'] not in conf.relation_type[reverse_task_type]:
                # init_pred is a valid relation but reverse_pred is not
                return 'init_pred'
            else:
                # init_pred and reverse_pred are both valid relations
                if x['init_pred_prob'] >= x['reverse_pred_prob']:
                    # init_pred_prob greater than or equal to reverse_pred_prob
                    return 'init_pred'
                else:
                    return 'reverse_pred'
        else:
            if x['reverse_pred'] not in conf.relation_type[reverse_task_type]:
                # neither init_pred nor reverse_pred is a valid relation
                return 'uncorrect'
            else:
                # reverse_pred is a valid relation but init_pred is not
                return 'reverse_pred'
    label_df,tokenizer,model = Bert_model(args.task_type,'/kaggle/input/covid19-knowledge-graph/%s/'%args.task_type)
    marked_sentences,predict_data_loader = Prepare_predict_data(tokenizer,args.bs)
    model = model.to(device)
    preds = []
    preds_prob = []
    reverse_preds = []
    reverse_preds_prob = []
    for data in tqdm(predict_data_loader):
        ids,reverse_ids = [t.to(device) for t in data]
        outputs = model(input_ids=ids)
        logits = outputs[0]
        pred_prob, pred = torch.max(F.softmax(logits.data,1), 1)
        preds.extend(list(pred.cpu().detach().numpy()))
        preds_prob.extend(list(pred_prob.cpu().detach().numpy()))
        reverse_outputs = model(input_ids=reverse_ids)
        reverse_logits = reverse_outputs[0]
        reverse_pred_prob, reverse_pred = torch.max(F.softmax(reverse_logits.data,1), 1)
        reverse_preds.extend(list(reverse_pred.cpu().detach().numpy()))
        reverse_preds_prob.extend(list(reverse_pred_prob.cpu().detach().numpy()))

    pred_df = pd.DataFrame({'marked_sentence':marked_sentences,'init_pred':preds,'init_pred_prob':preds_prob,'reverse_pred':reverse_preds,'reverse_pred_prob':reverse_preds_prob})
    # map label(0, 1, 2...) to raw label(T, C, Sa...) 
    pred_df['init_pred'] = pred_df['init_pred'].replace(dict(label_df.set_index(['label'])['label_raw']))
    pred_df['reverse_pred'] = pred_df['reverse_pred'].replace(dict(label_df.set_index(['label'])['label_raw']))
    # judge the order of a pair of entities
    pred_df['filter'] = pred_df.apply(lambda x:Filter(x), axis=1)
    pred_df['pred'] = pred_df['init_pred']
    pred_df['pred_prob'] = pred_df['init_pred_prob']
    pred_df.loc[pred_df['filter']=='reverse_pred','pred'] = pred_df.loc[pred_df['filter']=='reverse_pred','reverse_pred']
    pred_df.loc[pred_df['filter']=='reverse_pred','pred_prob'] = pred_df.loc[pred_df['filter']=='reverse_pred','reverse_pred_prob']
    pred_df = pred_df.loc[pred_df['filter']!='uncorrect']
    pred_df = marked_sentence_df.merge(pred_df,how='inner',on='marked_sentence')
    pred_df['init_start_entity'] = pred_df['start_entity']
    pred_df['init_start_entity_type'] = pred_df['start_entity_type']
    pred_df.loc[pred_df['filter']=='reverse_pred','start_entity'] = pred_df.loc[pred_df['filter']=='reverse_pred','end_entity']
    pred_df.loc[pred_df['filter']=='reverse_pred','start_entity_type'] = pred_df.loc[pred_df['filter']=='reverse_pred','end_entity_type']
    pred_df.loc[pred_df['filter']=='reverse_pred','end_entity'] = pred_df.loc[pred_df['filter']=='reverse_pred','init_start_entity']
    pred_df.loc[pred_df['filter']=='reverse_pred','end_entity_type'] = pred_df.loc[pred_df['filter']=='reverse_pred','init_start_entity_type']
    pred_df.drop(['init_start_entity','init_start_entity_type'],axis=1,inplace=True)
    torch.cuda.empty_cache()
    return label_df,pred_df

In [None]:
# chemical-disease relation prediction
args.task_type = 'chemical-disease'
c_d_label_df,c_d_pred_df = Predict()

In [None]:
# chemical-disease relation theme
c_d_label_df

In [None]:
# chemical-disease classification results
c_d_pred_df.sample(5)

In [None]:
# chemical-gene relation prediction
args.task_type = 'chemical-gene'
c_g_label_df,c_g_pred_df = Predict()

In [None]:
# chemical-gene relation theme
c_g_label_df

In [None]:
# chemical-gene classification results
c_g_pred_df.sample(5)

In [None]:
# gene-disease relation prediction
args.task_type = 'gene-disease'
g_d_label_df,g_d_pred_df = Predict()

In [None]:
# gene-disease relation theme
g_d_label_df

In [None]:
# gene-disease classification results
g_d_pred_df.sample(5)

In [None]:
# gene-gene relation prediction
args.task_type = 'gene-gene'
g_g_label_df,g_g_pred_df = Predict()

In [None]:
# gene-gene relation theme
g_g_label_df

In [None]:
# gene-gene classification results
g_g_pred_df.sample(5)

# Show relations between COVID-19 and other entities

Pyecharts cannot show multiple relations between two entities, so the graph is incomplete. Complete relations can be seen in ***pred_df***.
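
One possible workaround, which this notebook does not apply, is to merge parallel relations into a single edge label before plotting. The sketch below assumes a frame with the same `start_entity`, `end_entity`, `start_entity_type`, `end_entity_type` and `pred` columns as `pred_df`; `collapse_parallel_edges` is a hypothetical helper and the demo rows are made up.

```python
import pandas as pd

def collapse_parallel_edges(df):
    """Join all distinct themes between the same ordered entity pair into one label."""
    keys = ['start_entity', 'end_entity', 'start_entity_type', 'end_entity_type']
    return (df.groupby(keys, as_index=False)
              .agg(pred=('pred', lambda s: '/'.join(sorted(set(s))))))

# Made-up rows for illustration only.
demo_df = pd.DataFrame({
    'start_entity': ['A', 'A', 'C'],
    'end_entity': ['B', 'B', 'B'],
    'start_entity_type': ['Chemical', 'Chemical', 'Chemical'],
    'end_entity_type': ['Disease', 'Disease', 'Disease'],
    'pred': ['T', 'Pa', 'T'],
})
print(collapse_parallel_edges(demo_df))
```

The collapsed frame keeps the columns that `Build_graph` reads, so it could be passed in with `relation=True`; the per-sentence probabilities are lost in the process.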

In [None]:
# chemical-COVID-19 relations
g = Build_graph(c_d_pred_df.loc[(c_d_pred_df['start_entity']=='COVID-19')|(c_d_pred_df['end_entity']=='COVID-19')],relation=True,repulsion=800,title='Chemical-COVID-19 knowledge graph',labelShow=True)
g

In [None]:
# gene-COVID-19 relations
g = Build_graph(g_d_pred_df.loc[(g_d_pred_df['start_entity']=='COVID-19')|(g_d_pred_df['end_entity']=='COVID-19')],relation=True,repulsion=60,title='Gene-COVID-19 knowledge graph',labelShow=False)
g

In [None]:
# disease-COVID-19 relations (co-occurrence only, no predicted relation type)
g = Build_graph(marked_sentence_df.loc[(marked_sentence_df['start_entity']=='COVID-19')&(marked_sentence_df['end_entity_type']=='Disease')|(marked_sentence_df['start_entity_type']=='Disease')&(marked_sentence_df['end_entity']=='COVID-19')],relation=False,repulsion=60,title='Disease-COVID-19 topology graph',labelShow=False)
g

# Results

In [None]:
# merge all relation prediction and save results
cols = ['start_entity','end_entity','start_entity_type','end_entity_type','marked_sentence','pred','pred_prob']
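# chemical-chemical and disease-disease pairs have no classifier; keep them as co-occurrences
same_type_df = marked_sentence_df.loc[(marked_sentence_df['start_entity_type'].isin(['Chemical','Disease'])&(marked_sentence_df['start_entity_type']==marked_sentence_df['end_entity_type']))]
# DataFrame.append was removed in pandas 2.0, so everything is combined with pd.concat
relation_df = pd.concat([c_d_pred_df[cols],c_g_pred_df[cols],g_d_pred_df[cols],g_g_pred_df[cols],same_type_df]).reset_index(drop=True)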
relation_df.loc[(relation_df['pred'].isna())&(relation_df['start_entity_type']=='Chemical'),'pred'] = 'CC'
relation_df.loc[(relation_df['pred'].isna())&(relation_df['start_entity_type']=='Disease'),'pred'] = 'DD'
relation_df = relation_df[cols]
relation_df.to_csv('relation.csv',index=False)

In [None]:
relation_df

In [None]:
# subsample of knowledge graph
g = Build_graph(relation_df.sample(1000),relation=True,repulsion=15,title='subsample of COVID-19 knowledge graph',labelShow=False)
g

## 1. Which genes have a U (causal mutations), Ud (mutations affect disease course) or Y (polymorphisms alter risk) relation to COVID-19?


In [None]:
# 1. Which genes have a U (causal mutations), Ud (mutations affect disease course) or Y (polymorphisms alter risk) relation to COVID-19?


g_d_pred_df.loc[(g_d_pred_df['end_entity']=='COVID-19')&(g_d_pred_df['pred'].isin(['U','Ud','Y']))]

## 2. Which diseases are associated with COVID-19 (for example, as complications)?

In [None]:
# 2. Which diseases are associated with COVID-19 (for example, as complications)?
d_d_df = relation_df.loc[((relation_df['start_entity']=='COVID-19')&(relation_df['end_entity_type']=='Disease'))|((relation_df['start_entity_type']=='Disease')&(relation_df['end_entity']=='COVID-19'))]
value_counts_dic = dict(d_d_df['start_entity'].value_counts())
end_entity_value_counts_dic = dict(d_d_df['end_entity'].value_counts())
for key in end_entity_value_counts_dic:
    if key in value_counts_dic:
        value_counts_dic[key] += end_entity_value_counts_dic[key]
    else:
        value_counts_dic[key] = end_entity_value_counts_dic[key]
most_relevant_disease = []
for key in value_counts_dic:
    if value_counts_dic[key] > 10:
        most_relevant_disease.append(key)
relation_df.loc[((relation_df['start_entity'].isin(most_relevant_disease))&(relation_df['end_entity']=='COVID-19'))|((relation_df['start_entity']=='COVID-19')&(relation_df['end_entity'].isin(most_relevant_disease)))]
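
The counting loop in the cell above can also be written more compactly with `value_counts`. The following is a sketch under stated assumptions: `frequent_partners` is a hypothetical helper, it expects a frame already restricted to disease-disease rows, and unlike the loop it counts only the partner entity rather than COVID-19 itself. The demo rows are made up.

```python
import pandas as pd

def frequent_partners(df, target='COVID-19', min_count=10):
    """Entities that co-occur with `target` in more than `min_count` rows."""
    partners = pd.concat([
        df.loc[df['start_entity'] == target, 'end_entity'],
        df.loc[df['end_entity'] == target, 'start_entity'],
    ])
    counts = partners.value_counts()
    return counts[counts > min_count].index.tolist()

# Made-up rows for illustration only.
demo_df = pd.DataFrame({
    'start_entity': ['COVID-19', 'pneumonia', 'COVID-19', 'ARDS'],
    'end_entity': ['pneumonia', 'COVID-19', 'ARDS', 'fever'],
})
print(frequent_partners(demo_df, min_count=1))
```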

## What do we know about vaccines and therapeutics?

In [None]:
# 1. Which chemicals have a Pa (alleviates, reduces), Pr (prevents, suppresses) or T (treatment/therapy, incl. investigatory) relation to COVID-19 and relevant diseases?

c_d_pred_df.loc[(c_d_pred_df['end_entity']=='COVID-19')&(c_d_pred_df['pred'].isin(['Pa','Pr','T']))]

In [None]:
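# 2. Which genes have a D (drug targets), G (promotes progression), J (role in pathogenesis), Te (possible therapeutic effect) or X (overexpression in disease) relation to COVID-19 and relevant diseases?
# X is a disease-gene theme, so COVID-19 is the start entity for those rows; match it on either side.
g_d_pred_df.loc[((g_d_pred_df['start_entity']=='COVID-19')|(g_d_pred_df['end_entity']=='COVID-19'))&(g_d_pred_df['pred'].isin(['D','G','J','Te','X']))]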