#Loading of DrugProt dataset

Downloading and unzipping file

In [47]:
!wget https://zenodo.org/record/5042151/files/drugprot-gs-training-development.zip
!unzip drugprot-gs-training-development.zip

--2022-11-11 14:50:55--  https://zenodo.org/record/5042151/files/drugprot-gs-training-development.zip
Resolving zenodo.org (zenodo.org)... 188.184.117.155
Connecting to zenodo.org (zenodo.org)|188.184.117.155|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3796908 (3.6M) [application/octet-stream]
Saving to: ‘drugprot-gs-training-development.zip.1’


2022-11-11 14:50:55 (22.7 MB/s) - ‘drugprot-gs-training-development.zip.1’ saved [3796908/3796908]

Archive:  drugprot-gs-training-development.zip
replace drugprot-gs-training-development/README.pdf? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace drugprot-gs-training-development/development/drugprot_development_relations.tsv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace drugprot-gs-training-development/development/drugprot_development_entities.tsv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace drugprot-gs-training-development/development/drugprot_development_abstracs.tsv? [y]es, [n]o, [A]ll, [N]one, [r]ename

Combining all three files together.

In [48]:
def loadDrugProt(abstracts_filename, entities_filename, relations_filename):

  docs = {}

  # Load the title/abstracts text in as documents
  with open(abstracts_filename, encoding='utf8') as f:
    for lineno,line in enumerate(f):
      split = line.strip('\n').split('\t')
      assert len(split) == 3, f"Expected 3 columns but got {len(split)} on line {lineno+1}"
      pubmed_id, title, abstract = split
      pubmed_id = int(pubmed_id)

      combined_text = title + '\n' + abstract
      docs[pubmed_id] = {'pubmed_id':pubmed_id, 'text':combined_text, 'entities':{}, 'relations':[]}

  # Load the entities and match them up with the documents
  with open(entities_filename, encoding='utf8') as f:
    for lineno,line in enumerate(f):
      split = line.strip('\n').split('\t')
      assert len(split) == 6, f"Expected 6 columns but got {len(split)} on line {lineno+1}"
      pubmed_id, entity_id, entity_type, start_coord, end_coord, entity_text = split

      pubmed_id = int(pubmed_id)
      start_coord = int(start_coord)
      end_coord = int(end_coord)

      assert pubmed_id in docs, f"Did not find matching document for pubmed_id={pubmed_id}"
      doc = docs[pubmed_id]

      assert doc['text'][start_coord:end_coord] == entity_text, f"Text for entity with coords {start_coord}:{end_coord} in document (pubmed_id={pubmed_id} does not match expected. 'f{doc['text'][start_coord:end_coord]}' != '{entity_text}'"

      entity = {'type':entity_type, 'start':start_coord, 'end':end_coord, 'text':entity_text}
      doc['entities'][entity_id] = entity

  if relations_filename is not None:
    # Load the relations and match them up with the entities in the corresponding document
    with open(relations_filename, encoding='utf8') as f:
      for lineno,line in enumerate(f):
        split = line.strip('\n').split('\t')
        assert len(split) == 4, f"Expected 4 columns but got {len(split)} on line {lineno+1}"

        pubmed_id, relation_type, arg1, arg2 = split

        pubmed_id = int(pubmed_id)
        assert arg1.startswith('Arg1:'), f"Relation argument should start with 'Arg1:'. Got: {arg1}"
        assert arg2.startswith('Arg2:'), f"Relation argument should start with 'Arg2:'. Got: {arg2}"

        # Remove arg1/arg2 from text
        arg1 = arg1.split(':')[1]
        arg2 = arg2.split(':')[1]

        assert pubmed_id in docs, f"Did not find matching document for pubmed_id={pubmed_id}"
        doc = docs[pubmed_id]

        assert arg1 in doc['entities'], f"Couldn't find entity with id={arg1} in document with pubmed_id={pubmed_id}"
        assert arg2 in doc['entities'], f"Couldn't find entity with id={arg2} in document with pubmed_id={pubmed_id}"

        relation = {'type':relation_type,'arg1':arg1,'arg2':arg2}
        doc['relations'].append(relation)

  # Convert the dictionary of documents (indexed by pubmed_id) into a simpler dictionary
  docs = sorted(docs.values(), key=lambda x:x['pubmed_id'])

  return docs

Applying function to dataset

In [49]:
docs = loadDrugProt('drugprot-gs-training-development/training/drugprot_training_abstracs.tsv',
                    'drugprot-gs-training-development/training/drugprot_training_entities.tsv',
                    'drugprot-gs-training-development/training/drugprot_training_relations.tsv')

Observation of doc structure


In [50]:
search = [ d for i,d in enumerate(docs) if d['pubmed_id'] == 1280065]
assert len(search) == 1, "Something went wrong and couldn't find the document we want"

doc = search[0]
doc

{'pubmed_id': 1280065,
 'text': 'Analysis of plasmin binding and urokinase activation of plasminogen bound to the Heymann nephritis autoantigen, gp330.\nPreviously, we demonstrated that the Heymann nephritis autoantigen, gp330, can serve as a receptor site for plasminogen. This binding was not significantly inhibited by the lysine analogue epsilon-amino caproic acid (EACA), indicating that plasminogen binding was not just through lysine binding sites as suggested for other plasminogen binding sites. We now report that once plasminogen is bound to gp330, it can be converted to its active form of plasmin by urokinase. This conversion of plasminogen to plasmin proceeds at a faster rate when plasminogen is first prebound to gp330. Although there is a proportional increase in the Vmax of the urokinase-catalyzed reaction with increasing gp330 concentrations, no change in Km was observed. Once activated, plasmin remains bound to gp330 in an active state capable of cleaving the chromogenic tri

Total number of relations

In [51]:
num = 0
for doc in docs:
  num += len(doc['relations'])

print(num)

17288


# Encoding Step

In [52]:
#!pip install torch
!pip install transformers
!pip install segtok

from transformers import BertTokenizer, BertModel
import pandas as pd
import numpy as np
import nltk
import torch
from segtok.segmenter import split_single

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Adding in start and end entity tags to biomedical text, e.g. [E1] CHEMICAL [/E1] and [E2] GENE [/E2]

In [53]:
def isRelation(doc, key):
  relations = doc['relations']
  entities = doc['entities'].keys()
  for r in relations:
    if (key == r['arg1']) or (key == r['arg2']):
      return True
  return False

In [54]:
def addEntityTags(docs):
  for doc in docs:
    str = ''
    index = 0
    entity_dict = {}
    n = 1
    keys = doc['entities'].keys()
    key_list = sorted(keys, key= lambda x: (doc['entities'][x]['start']))
    for k in key_list:
      start = doc['entities'][k]['start']
      end = doc['entities'][k]['end']
      name = doc['entities'][k]['text']
      if isRelation(doc,k):
        if entity_dict.get(name) is not None:
          i = entity_dict[name]
          str += doc['text'][index:start] + f'[E{i}] ' + doc['text'][start:end] + f' [/E{i}]'
          index = end
        else:
          entity_dict[name] = n
          i = entity_dict[name]
          str += doc['text'][index:start] + f'[E{i}] ' + doc['text'][start:end] + f' [/E{i}]'
          n+=1
          index = end
    doc['tagged_text'] = str 

addEntityTags(docs)
print(docs[0]['tagged_text'])
      

Analysis of plasmin binding and urokinase activation of plasminogen bound to the Heymann nephritis autoantigen, gp330.
Previously, we demonstrated that the Heymann nephritis autoantigen, gp330, can serve as a receptor site for plasminogen. This binding was not significantly inhibited by the lysine analogue epsilon-amino caproic acid (EACA), indicating that plasminogen binding was not just through lysine binding sites as suggested for other plasminogen binding sites. We now report that once plasminogen is bound to gp330, it can be converted to its active form of plasmin by urokinase. This conversion of plasminogen to plasmin proceeds at a faster rate when plasminogen is first prebound to gp330. Although there is a proportional increase in the Vmax of the urokinase-catalyzed reaction with increasing gp330 concentrations, no change in Km was observed. Once activated, plasmin remains bound to gp330 in an active state capable of cleaving the chromogenic tripeptide, S-2251. The binding of pl

Reducing the text only to sentences potentially containing relations between chemicals and genes:

In [55]:
def relevantText():
  texts = [doc['tagged_text'] for doc in docs]

  split_text = []
  target_text = []

  for d in texts:
    split_text += split_single(d)

  for s in split_text:
    if '[/E' in s:
      target_text.append(s)

  return target_text

target_text = relevantText()
print(target_text[:5])

['Inhibition of binding of both [E1] plasminogen [/E1] and [E2] plasmin [/E2] to [E3] gp330 [/E3] by [E4] benzamidine [/E4] was similar, although [E5] EACA [/E5] inhibited the binding of [E2] plasmin [/E2] to [E3] gp330 [/E3] slightly more than the binding of [E1] plasminogen [/E1] to [E3] gp330 [/E3]', 'A cDNA encoding the complete [E1] amino acid [/E1] sequence of [E2] aminoacylase 1 [/E2] ([E3] N-acylamino acid aminohydrolase [/E3], [E4] ACY-1 [/E4]) [[E5] EC 3.5.1.14 [/E5]], a dimeric [E6] metalloprotein [/E6] having two Zn2+ in the molecule, which catalyzes the deacylation of [E7] N-acylated L-amino acids [/E7] except L-aspartic acid, has been isolated from porcine kidney lambda gt10 cDNA library and sequenced.', 'From sequence analysis of the cDNA and the [E8] N [/E8]- and [E9] C [/E9]-terminal [E1] amino acid [/E1] analyses of the purified protein, it is deduced that [E10] porcine kidney ACY-1 [/E10] consists of two identical subunits (M(r) 45,260), each of which consists of a s

There are around 17288 total relations, therefore there some sentences that contain entities but don't actually contain a relation

In [56]:
print(len(target_text))

6709


# Embedding Step

Loading pre-trained PubMedBERT model + tokenizer

In [57]:
model = BertModel.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
                                  output_hidden_states = True,
                                  )

tokenizer = BertTokenizer.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext") 

Some weights of the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Adding more tags and tokenising input

In [58]:
def bert_text_prep(text, tokenizer):

  #CLS lets BERT know when start of sentence begins, SEP indicates start of second sentence.
  marked_text = "[CLS]" + text + "[SEP]"

  #Tokenizes text
  tokenized_text = tokenizer.tokenize(marked_text)

  #Converts to ids
  indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

  segments_ids = [1]*len(indexed_tokens)

  # Converts inputs to tensors
  tokens_tensor = torch.tensor([indexed_tokens])
  segments_tensors = torch.tensor([segments_ids])

  return tokenized_text, tokens_tensor, segments_tensors

Converting input into embeddings

In [59]:
def get_bert_embeddings(tokens_tensor, segments_tensors, model):
  with torch.no_grad():
    outputs = model(tokens_tensor, segments_tensors)
    hidden_states = outputs[2][1:]

  token_embeddings = hidden_states[-1]
  #Collapsing tensor into 1 dimension
  token_embeddings = torch.squeeze(token_embeddings, dim=0)
  #converting tensors to lists
  list_token_embeddings = [token_embed.tolist() for token_embed in token_embeddings]

  return list_token_embeddings

Getting CLS embeddings for each sentence



In [60]:
test = target_text[:5]
cls_embeddings = []

for text in test:
    tokenized_text, tokens_tensor, segments_tensors = bert_text_prep(text, tokenizer)
    list_token_embeddings = get_bert_embeddings(tokens_tensor, segments_tensors, model)
    cls_embeddings.append(list_token_embeddings[0])

print(len(cls_embeddings))

5


## Classification Step

Establishing the relation labels

In [61]:
labels = list(set([r['type'] for doc in docs for r in doc['relations']]))