<a href="https://colab.research.google.com/github/ChloeMorgana/Dissertation-Project/blob/main/PubMedBERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Loading of DrugProt dataset

Downloading and unzipping file

In [1]:
!wget https://zenodo.org/record/5042151/files/drugprot-gs-training-development.zip
!unzip drugprot-gs-training-development.zip

--2022-11-28 09:16:21--  https://zenodo.org/record/5042151/files/drugprot-gs-training-development.zip
Resolving zenodo.org (zenodo.org)... 188.185.124.72
Connecting to zenodo.org (zenodo.org)|188.185.124.72|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3796908 (3.6M) [application/octet-stream]
Saving to: ‘drugprot-gs-training-development.zip’


2022-11-28 09:16:32 (399 KB/s) - ‘drugprot-gs-training-development.zip’ saved [3796908/3796908]

Archive:  drugprot-gs-training-development.zip
   creating: drugprot-gs-training-development/
  inflating: drugprot-gs-training-development/README.pdf  
   creating: drugprot-gs-training-development/development/
  inflating: drugprot-gs-training-development/development/drugprot_development_relations.tsv  
  inflating: drugprot-gs-training-development/development/drugprot_development_entities.tsv  
  inflating: drugprot-gs-training-development/development/drugprot_development_abstracs.tsv  
   creating: drugprot-gs-training

Combining all three files together.

In [2]:
def loadDrugProt(abstracts_filename, entities_filename, relations_filename):

  docs = {}

  # Load the title/abstracts text in as documents
  with open(abstracts_filename, encoding='utf8') as f:
    for lineno,line in enumerate(f):
      split = line.strip('\n').split('\t')
      assert len(split) == 3, f"Expected 3 columns but got {len(split)} on line {lineno+1}"
      pubmed_id, title, abstract = split
      pubmed_id = int(pubmed_id)

      combined_text = title + '\n' + abstract
      docs[pubmed_id] = {'pubmed_id':pubmed_id, 'text':combined_text, 'entities':{}, 'relations':[]}

  # Load the entities and match them up with the documents
  with open(entities_filename, encoding='utf8') as f:
    for lineno,line in enumerate(f):
      split = line.strip('\n').split('\t')
      assert len(split) == 6, f"Expected 6 columns but got {len(split)} on line {lineno+1}"
      pubmed_id, entity_id, entity_type, start_coord, end_coord, entity_text = split

      pubmed_id = int(pubmed_id)
      start_coord = int(start_coord)
      end_coord = int(end_coord)

      assert pubmed_id in docs, f"Did not find matching document for pubmed_id={pubmed_id}"
      doc = docs[pubmed_id]

      assert doc['text'][start_coord:end_coord] == entity_text, f"Text for entity with coords {start_coord}:{end_coord} in document (pubmed_id={pubmed_id} does not match expected. 'f{doc['text'][start_coord:end_coord]}' != '{entity_text}'"

      entity = {'type':entity_type, 'start':start_coord, 'end':end_coord, 'text':entity_text}
      doc['entities'][entity_id] = entity

  if relations_filename is not None:
    # Load the relations and match them up with the entities in the corresponding document
    with open(relations_filename, encoding='utf8') as f:
      for lineno,line in enumerate(f):
        split = line.strip('\n').split('\t')
        assert len(split) == 4, f"Expected 4 columns but got {len(split)} on line {lineno+1}"

        pubmed_id, relation_type, arg1, arg2 = split

        pubmed_id = int(pubmed_id)
        assert arg1.startswith('Arg1:'), f"Relation argument should start with 'Arg1:'. Got: {arg1}"
        assert arg2.startswith('Arg2:'), f"Relation argument should start with 'Arg2:'. Got: {arg2}"

        # Remove arg1/arg2 from text
        arg1 = arg1.split(':')[1]
        arg2 = arg2.split(':')[1]

        assert pubmed_id in docs, f"Did not find matching document for pubmed_id={pubmed_id}"
        doc = docs[pubmed_id]

        assert arg1 in doc['entities'], f"Couldn't find entity with id={arg1} in document with pubmed_id={pubmed_id}"
        assert arg2 in doc['entities'], f"Couldn't find entity with id={arg2} in document with pubmed_id={pubmed_id}"

        relation = {'type':relation_type,'arg1':arg1,'arg2':arg2}
        doc['relations'].append(relation)

  # Convert the dictionary of documents (indexed by pubmed_id) into a simpler dictionary
  docs = sorted(docs.values(), key=lambda x:x['pubmed_id'])

  return docs

Applying function to dataset

In [3]:
docs = loadDrugProt('drugprot-gs-training-development/training/drugprot_training_abstracs.tsv',
                    'drugprot-gs-training-development/training/drugprot_training_entities.tsv',
                    'drugprot-gs-training-development/training/drugprot_training_relations.tsv')

Observation of doc structure


In [4]:
search = [ d for i,d in enumerate(docs) if d['pubmed_id'] == 1280065]
assert len(search) == 1, "Something went wrong and couldn't find the document we want"

doc = search[0]
doc

{'pubmed_id': 1280065,
 'text': 'Analysis of plasmin binding and urokinase activation of plasminogen bound to the Heymann nephritis autoantigen, gp330.\nPreviously, we demonstrated that the Heymann nephritis autoantigen, gp330, can serve as a receptor site for plasminogen. This binding was not significantly inhibited by the lysine analogue epsilon-amino caproic acid (EACA), indicating that plasminogen binding was not just through lysine binding sites as suggested for other plasminogen binding sites. We now report that once plasminogen is bound to gp330, it can be converted to its active form of plasmin by urokinase. This conversion of plasminogen to plasmin proceeds at a faster rate when plasminogen is first prebound to gp330. Although there is a proportional increase in the Vmax of the urokinase-catalyzed reaction with increasing gp330 concentrations, no change in Km was observed. Once activated, plasmin remains bound to gp330 in an active state capable of cleaving the chromogenic tri

We can observe the total number of relations across all documents below:

In [5]:
num = 0
for doc in docs:
  num += len(doc['relations'])

print(num)

17288


# Encoding Step

The code below imports certain packages that will be used during this relation extraction task.

In [66]:
#!pip install torch
!pip install transformers
!pip install segtok

#import transformers as ppb
from transformers import BertModel, AutoTokenizer
import pandas as pd
import numpy as np
import nltk
import torch
from segtok.segmenter import split_single
import itertools

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


This is a function that checks if a particular entity tag is part of a relation or not. This is used to form labels for the data so that the classifier can learn which sentences contain relations and which do not.

In [54]:
def isRelation(doc, key):

  relations = doc['relations']
  entities = doc['entities'].keys()
  for r in relations:
    #Checks if the entity is involved in any part of a specified relation
    if (key == r['arg1']) or (key == r['arg2']):
      #Returns the name of the relation and a boolean value
      return True, r['type']
  return False, 0

This function adds in start and end entity tags to biomedical text. This ensures that every chemical mentioned is encased in [E1][/E1] tags, and every gene is encased in [E2][/E2] tags, so that we can extract the embeddings for these later.

In [95]:
def addEntityTags(docs):

  for doc in docs:
    #Contains the labels for each sentence (1 for positive result, 0 for no relation)
    doc['labels'] = []
    #Contains more explicit names for the relations.
    doc['classes'] = []


    str = ''
    index = 0

    #Contains every mention of chemicals and genes for a particular document
    keys = doc['entities'].keys()

    #Makes sure the indices of these mentions are in order
    key_list = sorted(keys, key= lambda x: (doc['entities'][x]['start']))

    for k in key_list:
      #Adds a 1 to the list of labels if it is a positive result, 0 if negative
      b,r = isRelation(doc,k)
      if b:
        doc['labels'].append(1)
      else:
        doc['labels'].append(0)
      doc['classes'].append(r)
      v = doc['entities'][k]

      #Start and end indexes of entity
      start = v['start']
      end = v['end']

      if v['type'] == 'CHEMICAL':
        str += doc['text'][index:start] + f'[E1] ' + doc['text'][start:end] + f' [/E1]'
      else:
        str += doc['text'][index:start] + f'[E2] ' + doc['text'][start:end] + f' [/E2]'
      index = end
    
    #Adds the tagged text to the dictionary of a particular document
    doc['tagged_text'] = str 

addEntityTags(docs)
print(docs[0]['tagged_text'])

      

Analysis of [E2] plasmin [/E2] binding and [E2] urokinase [/E2] activation of [E2] plasminogen [/E2] bound to the [E2] Heymann nephritis autoantigen [/E2], [E2] gp330 [/E2].
Previously, we demonstrated that the [E2] Heymann nephritis autoantigen [/E2], [E2] gp330 [/E2], can serve as a receptor site for [E2] plasminogen [/E2]. This binding was not significantly inhibited by the [E1] lysine [/E1] analogue [E1] epsilon-amino caproic acid [/E1] ([E1] EACA [/E1]), indicating that [E2] plasminogen [/E2] binding was not just through [E1] lysine [/E1] binding sites as suggested for other [E2] plasminogen [/E2] binding sites. We now report that once [E2] plasminogen [/E2] is bound to [E2] gp330 [/E2], it can be converted to its active form of [E2] plasmin [/E2] by [E2] urokinase [/E2]. This conversion of [E2] plasminogen [/E2] to [E2] plasmin [/E2] proceeds at a faster rate when [E2] plasminogen [/E2] is first prebound to [E2] gp330 [/E2]. Although there is a proportional increase in the Vmax o

Next, we can reduce the text to only sentences potentially containing relations between chemicals and genes by splitting the sentences and seeing if there are any entity tags present within the sentence. The findLabels function goes through the list of labels for a particular document and sees if there is a relation present within that sentence.

(Note for later: I didn't take into account relations across sentences in this implementation, will have to change something to ensure that both entities for one relation are in the same sentence)

In [96]:
from collections import Counter

def relevantText():

  target_text = []
  labels = []

  for doc in docs:
    txt = doc['tagged_text']
    lbls = doc['labels']

    split_text = []
    split_text += split_single(txt)
    offset = 0

    for s in split_text:
      if '[/E' in s:
        o, l, new_lbls = findLabels(s,lbls,offset)
        target_text.append(s)
        lbls = new_lbls
        offset += o
        labels.append(l)

  return target_text, labels

def findLabels(text, labels, offset=0):
    words = text.split(" ")
    counter = Counter(words)
    idx = counter['[E2]']+counter['[E1]']
    isRel = False
    for i in range(idx):
      if labels[i]==1:
        isRel = True
    offset+=idx
    if isRel:
      label = 1
    else:
      label = 0
    l = labels[idx:]
    return offset, label, l

target_text, labels= relevantText()
t = target_text[:49]
l = labels[:49]

#Both the target text and the labels must be the same length
print(len(target_text))
print(len(labels))


29234
29234


There are around 17288 total relations, therefore there some sentences that contain entities but don't actually contain a relation. The code below shows a snippet of the target text which should only contain sentences that contain tagged entities. It also contains the first five labels which are all 0 in this case. These seems sensible as there isn't a chemical in the first sentence, and therefore there is no relationship specified between a chemical and gene.

In [97]:
print(target_text[:5])
print(labels[:5])

['Analysis of [E2] plasmin [/E2] binding and [E2] urokinase [/E2] activation of [E2] plasminogen [/E2] bound to the [E2] Heymann nephritis autoantigen [/E2], [E2] gp330 [/E2].', 'Previously, we demonstrated that the [E2] Heymann nephritis autoantigen [/E2], [E2] gp330 [/E2], can serve as a receptor site for [E2] plasminogen [/E2].', 'This binding was not significantly inhibited by the [E1] lysine [/E1] analogue [E1] epsilon-amino caproic acid [/E1] ([E1] EACA [/E1]), indicating that [E2] plasminogen [/E2] binding was not just through [E1] lysine [/E1] binding sites as suggested for other [E2] plasminogen [/E2] binding sites.', 'We now report that once [E2] plasminogen [/E2] is bound to [E2] gp330 [/E2], it can be converted to its active form of [E2] plasmin [/E2] by [E2] urokinase [/E2].', 'This conversion of [E2] plasminogen [/E2] to [E2] plasmin [/E2] proceeds at a faster rate when [E2] plasminogen [/E2] is first prebound to [E2] gp330 [/E2].']
[0, 0, 0, 0, 0]


# Embedding Step

Next, a BERT model is defined that has been pre-trained using PubMed texts. This allows BERT to recognise biomedical terms within the text and understand their context better.

The tokeniser is also defined using pubmed texts, and special tokens are added in so that BERT can recognise them and not break them down when tokenising.

In [47]:
model = BertModel.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
                                  output_hidden_states = True,
                                  )

tokenizer = AutoTokenizer.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext") 

tokenizer.add_tokens(["[E1]", "[/E1]", "[E2]", "[/E2]"])

tokenizer.vocab["[E1]"]

Some weights of the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


30522

This function adds [CLS] and [SEP] tokens to the beginning and end of sentences so that BERT knows when a new sentence is beginning. It then tokenises these sentences and converts all of the inputs into tensors.

In [48]:
def bert_text_prep(text, tokenizer):

  #CLS lets BERT know when start of sentence begins, SEP indicates start of second sentence.
  marked_text = "[CLS]" + text + "[SEP]"

  #Tokenizes text
  tokenized_text = tokenizer.tokenize(marked_text)

  #Converts to ids
  indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

  segments_ids = [1]*len(indexed_tokens)

  # Converts inputs to tensors
  tokens_tensor = torch.tensor([indexed_tokens])
  segments_tensors = torch.tensor([segments_ids])

  return tokenized_text, tokens_tensor, segments_tensors

These tensors are then used to find the embeddings of all of the tokens.

In [49]:
def get_bert_embeddings(tokens_tensor, segments_tensors, model):
  with torch.no_grad():
    outputs = model(tokens_tensor, segments_tensors)
    hidden_states = outputs[2][1:]

  token_embeddings = hidden_states[-1]
  #Collapsing tensor into 1 dimension
  token_embeddings = torch.squeeze(token_embeddings, dim=0)
  #converting tensors to lists
  list_token_embeddings = [token_embed.tolist() for token_embed in token_embeddings]

  return list_token_embeddings

In order to find the relevant embeddings (entity start tag embeddings), findTags looks through the tokenised sentence and adds the index of that particular tag into a list of chemical indices or gene indices.

These are then used after the sentence has been converted into embeddings to find the embedding at these specific indexes. The embeddings are accompanied by the cartesian product of all of the chemical and gene entities found, so that the embeddings for the specified indexes are concatenated.

In [76]:
def findTags(tokens):
  chems = []
  genes = []

  for i in range(len(tokens)):
    #Adds the indexes to the relevant lists
    if tokens[i] == '[e1]':
      chems.append(i)
    elif tokens[i] == '[e2]':
      genes.append(i)
  
  #Returns empty lists if either chemicals or genes aren't present
  if (chems == []) or (genes == []):
    return [], []
  else:
    return chems, genes

def findEmbeddings(prods, embeddings):
  #Concatenates the embeddings
  entity_embeddings = []
  for elt in prods:
    if elt != ():
      chem = embeddings[elt[0]]
      gene = embeddings[elt[1]]
      entity_embeddings.append(chem+gene)
  return entity_embeddings


<class 'itertools.product'>


Getting CLS and entity embeddings for each sentence



In [93]:
test = target_text[:5]
cls_embeddings = []
entity_embeddings = []

for text in test:

    #Finds tokenized text and tensors for a particular sentence in a document
    tokenized_text, tokens_tensor, segments_tensors = bert_text_prep(text, tokenizer)

    #Finds the embeddings
    list_token_embeddings = get_bert_embeddings(tokens_tensor, segments_tensors, model)

    #Adds the [CLS] token embedding to the list
    cls_embeddings.append(list_token_embeddings[0])

    #Finds the indexes for chemicals and genes in the sentence
    chems, genes = findTags(tokenized_text)

    #Finds the cartesian product of all of these chemicals and genes
    prods = itertools.product(chems,genes)

    #Adds the concatenated embeddings of all chemicals and genes to the list
    entity_embeddings += findEmbeddings(prods, list_token_embeddings)

print(entity_embeddings)

[[-0.05108833312988281, 0.3297119438648224, -0.4981868863105774, -0.04462847113609314, 0.020150624215602875, -0.1945430338382721, 0.18260638415813446, 0.28659820556640625, 0.37647363543510437, -0.22916968166828156, -0.1256995052099228, 0.5117439031600952, 0.2666151821613312, -0.43435007333755493, -0.4107559621334076, 0.020583277568221092, -0.06312580406665802, -0.13350477814674377, -0.20826269686222076, -0.07796593010425568, 0.3311344087123871, -0.23943179845809937, -0.06202571466565132, -0.12597085535526276, -0.46859821677207947, 0.24351318180561066, -0.27199575304985046, -0.1680583655834198, -0.07419609278440475, -0.8748524188995361, 0.0008813408203423023, 0.18960852921009064, 0.12297125905752182, 1.4164780378341675, -0.13413837552070618, 0.021959761157631874, -0.07752124965190887, 0.08762495964765549, 0.978164792060852, -0.21565575897693634, 0.32896944880485535, 0.3276010751724243, 0.55742347240448, 0.3144276440143585, -0.027471814304590225, 0.04131792113184929, -0.20241037011146545

## Classification Step

Relevant imports:



In [51]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

Splitting the embeddings into train/test sets

In [52]:
#Splits the sentences into training and test sets
cls_train_features, cls_test_features, cls_train_labels, cls_test_labels = train_test_split(cls_embeddings, l)

Classifying these using logistic regression

In [53]:
#Defines a logistic regression classifier
lr_clf = LogisticRegression()

#Fits the model to the training data
lr_clf.fit(cls_train_features, cls_train_labels)

#Tests the model using the test data
lr_clf.score(cls_test_features, cls_test_labels)

0.6923076923076923

Entity start architecture

In [None]:
start_train_features, start_test_features, start_train_labels, start_test_labels = train_test_split(entity_embeddings, l)

In [None]:
lr_clf = LogisticRegression()
lr_clf.fit(start_train_features, start_train_labels)
lr_clf.score(start_test_features, start_test_labels)

Establishing the relation labels

In [None]:
labels = list(set([r['type'] for doc in docs for r in doc['relations']]))