Trying to implement the architecture in this paper [Enriching Pre-trained Language Model with Entity Information for Relation Classification](https://arxiv.org/pdf/1905.08284.pdf)

## Load Data

We will use a well-established dataset for relation classification so that we can compare our results with the state of the art on it: the Task 8 dataset from SemEval-2010. You can find it here: https://github.com/davidsbatista/Annotated-Semantic-Relationships-Datasets

In [1]:
import os
from urllib.request import urlretrieve
import glob
import tarfile

if not os.path.exists('data'):
    os.makedirs('data')
    
# Download data
url ='https://github.com/davidsbatista/Annotated-Semantic-Relationships-Datasets/raw/master/datasets/SemEval2010_task8_all_data.tar.gz'

urlretrieve(url, 'data/SemEval2010_task8_all_data.tar.gz')


tarf = tarfile.open("data/SemEval2010_task8_all_data.tar.gz")
tarf.extractall(path = 'data/')
    

glob.glob('data/*')

['data/SemEval2010_task8_all_data', 'data/SemEval2010_task8_all_data.tar.gz']

In [2]:
with open("data/SemEval2010_task8_all_data/SemEval2010_task8_training/TRAIN_FILE.TXT") as f:
    train_file = f.readlines()
    
with open("data/SemEval2010_task8_all_data/SemEval2010_task8_testing_keys/TEST_FILE_FULL.TXT") as f:
    test_file = f.readlines()
    
test_file[:10]

['8001\t"The most common <e1>audits</e1> were about <e2>waste</e2> and recycling."\n',
 'Message-Topic(e1,e2)\n',
 'Comment: Assuming an audit = an audit document.\n',
 '\n',
 '8002\t"The <e1>company</e1> fabricates plastic <e2>chairs</e2>."\n',
 'Product-Producer(e2,e1)\n',
 'Comment: (a) is satisfied\n',
 '\n',
 '8003\t"The school <e1>master</e1> teaches the lesson with a <e2>stick</e2>."\n',
 'Instrument-Agency(e2,e1)\n']

The training dataset consists of 8000 sentences with 10 different types of relations (counting `Other` and ignoring the direction of the relation). Each sentence is annotated with a relation between two given nominals. The entities involved in these relations are identified by markers such as `<e1>` in the text. For instance, the following sentence contains an example of the Component-Whole relation between the nominals configuration and elements.

`The system as described above has its greatest application in an arrayed <e1>configuration</e1> of antenna <e2>elements</e2>`

Special tokens like these are a useful way to tell the network which parts of the input it should focus on to answer our question. The main advantage is that we can use a standard text classification architecture to tackle the relation extraction task. The same marker idea can be applied in many other settings.
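As a quick illustration of the marker format, the following minimal sketch pulls the two marked nominals out of a sentence with a regular expression. The helper name `extract_entities` is hypothetical and not part of the notebook.

```python
import re

def extract_entities(sentence):
    """Return the text inside the <e1> and <e2> markers (hypothetical helper)."""
    e1 = re.search(r"<e1>(.*?)</e1>", sentence).group(1)
    e2 = re.search(r"<e2>(.*?)</e2>", sentence).group(1)
    return e1.strip(), e2.strip()

example = ("The system as described above has its greatest application in an "
           "arrayed <e1>configuration</e1> of antenna <e2>elements</e2>")
print(extract_entities(example))
```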

Now we need a function to parse the raw dataset into a format that is easy to use for model training.

In [3]:
def parse_dataset(raw):
    sentences, relations = [], []
    to_replace = [("\"", ""), ("\n", ""), ("<", " <"), (">", "> ")]
    last_was_sentence = False
    for line in raw:
        sl = line.split("\t")
        if last_was_sentence:
            relations.append(sl[0].split("(")[0].replace("\n", ""))
            last_was_sentence = False
        if sl[0].isdigit():
            sent = sl[1]
            for rp in to_replace:
                sent = sent.replace(rp[0], rp[1])
            sentences.append(sent)
            last_was_sentence = True
    print("Found {} sentences".format(len(sentences)))
    return sentences, relations

tr_sentences, tr_relations = parse_dataset(train_file)
te_sentences, te_relations = parse_dataset(test_file)

tr_sentences[0], tr_relations[0], te_sentences[0], te_relations[0]

Found 8000 sentences
Found 2717 sentences


('The system as described above has its greatest application in an arrayed  <e1> configuration </e1>  of antenna  <e2> elements </e2> .',
 'Component-Whole',
 'The most common  <e1> audits </e1>  were about  <e2> waste </e2>  and recycling.',
 'Message-Topic')

In [4]:
n_relations = len(set(tr_relations))
print("Found {} relations\n".format(n_relations))
print("Relations:\n{}".format(list(set(tr_relations))))

Found 10 relations

Relations:
['Member-Collection', 'Entity-Destination', 'Product-Producer', 'Other', 'Content-Container', 'Cause-Effect', 'Component-Whole', 'Instrument-Agency', 'Entity-Origin', 'Message-Topic']
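`parse_dataset` keeps only the relation name and drops the argument order such as `(e2,e1)`, which is why 10 labels are found. If the direction is needed, a sketch along the following lines could retain it. `split_relation` is a hypothetical helper, not part of the notebook, and whether to model the direction is left open here.

```python
def split_relation(label_line):
    """Split 'Name(e1,e2)' into (name, direction); 'Other' has no direction."""
    label = label_line.strip()
    if "(" not in label:
        return label, None
    name, args = label.split("(", 1)
    return name, args.rstrip(")")

print(split_relation("Product-Producer(e2,e1)\n"))
print(split_relation("Other\n"))
```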


### Use BERT tokenizer to tokenize the input string

In [38]:
!pip install bert-tensorflow
!pip install tensorflow-hub
import bert
from bert import tokenization
import tensorflow as tf
import tensorflow_hub as hub

# This is a path to an uncased version of BERT
BERT_MODEL_HUB = "https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1"

addition_tokens = ['[EN1]', '[EN2]']


class MyFullTokenizer(object):
  """Runs end-to-end tokenization."""

  def __init__(self, vocab_file, do_lower_case=True):
    self.vocab = tokenization.load_vocab(vocab_file)
    for i, atk in enumerate(addition_tokens):
        self.vocab[atk] = self.vocab.pop('[unused{}]'.format(i))
    #self.vocab = tokenization.load_vocab(vocab_file)
    self.inv_vocab = {v: k for k, v in self.vocab.items()}
    self.basic_tokenizer = tokenization.BasicTokenizer(do_lower_case=do_lower_case)
    self.wordpiece_tokenizer = tokenization.WordpieceTokenizer(vocab=self.vocab)

  def tokenize(self, text):
    split_tokens = []
    for token in self.basic_tokenizer.tokenize(text):
      for sub_token in self.wordpiece_tokenizer.tokenize(token):
        split_tokens.append(sub_token)

    return split_tokens

  def convert_tokens_to_ids(self, tokens):
    return tokenization.convert_by_vocab(self.vocab, tokens)

  def convert_ids_to_tokens(self, ids):
    return tokenization.convert_by_vocab(self.inv_vocab, ids)

def create_tokenizer_from_hub_module():
    """Get the vocab file and casing info from the Hub module."""
    with tf.Graph().as_default():
        bert_module = hub.Module(BERT_MODEL_HUB)
        tokenization_info = bert_module(signature="tokenization_info", as_dict=True)
        with tf.Session() as sess:
            vocab_file, do_lower_case = sess.run([tokenization_info["vocab_file"],
                                                tokenization_info["do_lower_case"]])

    return MyFullTokenizer(
        vocab_file=vocab_file, do_lower_case=do_lower_case)

tokenizer = create_tokenizer_from_hub_module()

INFO:tensorflow:Saver not created because there are no variables in the graph to restore



In [39]:
# tokenize the sentences

def tokenize(sentences):
  bert_tks = []
  for sentence in sentences:

      # Normalise all four entity markers to one delimiter, then split on it
      for marker in ('</e1>', '<e2>', '</e2>'):
          sentence = sentence.replace(marker, '<e1>')

      trunks = sentence.split('<e1>')

      bert_tokens = []
      bert_tokens.append("[CLS]")
      if len(trunks) != 5:
          raise ValueError('Something is wrong with this sentence: ' + sentence)

      for i, trunk in enumerate(trunks):
          tks = tokenizer.tokenize(trunk)
          bert_tokens.extend(tks)
          if i == 0 or i == 1:
              bert_tokens.append('[EN1]')
          elif i == 2 or i ==3:
              bert_tokens.append('[EN2]')

      bert_tokens.append("[SEP]")

      bert_tks.append(bert_tokens)
  return bert_tks

#tokenize([tr_sentences[0]])
tr_bert_tks = tokenize(tr_sentences)
te_bert_tks = tokenize(te_sentences)

tr_bert_tks[0], te_bert_tks[0]

(['[CLS]',
  'the',
  'system',
  'as',
  'described',
  'above',
  'has',
  'its',
  'greatest',
  'application',
  'in',
  'an',
  'array',
  '##ed',
  '[EN1]',
  'configuration',
  '[EN1]',
  'of',
  'antenna',
  '[EN2]',
  'elements',
  '[EN2]',
  '.',
  '[SEP]'],
 ['[CLS]',
  'the',
  'most',
  'common',
  '[EN1]',
  'audit',
  '##s',
  '[EN1]',
  'were',
  'about',
  '[EN2]',
  'waste',
  '[EN2]',
  'and',
  'recycling',
  '.',
  '[SEP]'])
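The splitting step in `tokenize` can be sketched without BERT. The version below is a minimal illustration that assumes `<e1>` always precedes `<e2>`, as the notebook does, and uses whitespace splitting as a stand-in for the WordPiece tokenizer. The names `split_on_markers` and `interleave_markers` are hypothetical.

```python
import re

def split_on_markers(sentence):
    """Split a sentence on <e1>, </e1>, <e2>, </e2> into text chunks."""
    return [chunk.strip() for chunk in re.split(r"</?e[12]>", sentence)]

def interleave_markers(chunks, tokenize_fn=str.split):
    """Rebuild the token list with [EN1]/[EN2] around each entity.

    tokenize_fn is a stand-in for the WordPiece tokenizer (assumption).
    """
    markers = ["[EN1]", "[EN1]", "[EN2]", "[EN2]"]
    tokens = ["[CLS]"]
    for i, chunk in enumerate(chunks):
        tokens.extend(tokenize_fn(chunk))
        if i < len(markers):
            tokens.append(markers[i])
    tokens.append("[SEP]")
    return tokens

chunks = split_on_markers("The <e1>company</e1> fabricates plastic <e2>chairs</e2>.")
print(chunks)
print(interleave_markers(chunks))
```

A well-formed sentence always yields five chunks, which is the condition the notebook checks before tokenizing.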

In [40]:
# Padding the sequence so that they have the same length
import numpy as np

from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 128
tr_input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt) for txt in tr_bert_tks],
                          maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

te_input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt) for txt in te_bert_tks],
                          maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

# Construct masks for 2 entities
def build_entity_mask(bert_tks):
  marks = [[i for i, x in enumerate(tks) if x == "[EN1]" or x == '[EN2]'] for tks in bert_tks]

  e1_masks = []
  e2_masks = []
  for i, mark in enumerate(marks):
      e1_mask = np.zeros((MAX_LEN, ))
      e2_mask = np.zeros((MAX_LEN, ))

      e1_mask[mark[0] + 1: mark[1]] = 1.
      e2_mask[mark[2] + 1: mark[3]] = 1.

      e1_masks.append(e1_mask)
      e2_masks.append(e2_mask)
      
  return e1_masks, e2_masks

tr_e1masks, tr_e2masks = build_entity_mask(tr_bert_tks)
te_e1masks, te_e2masks = build_entity_mask(te_bert_tks)

tr_input_ids[0], tr_e1masks[0], tr_e2masks[0]

(array([  101,  1996,  2291,  2004,  2649,  2682,  2038,  2049,  4602,
         4646,  1999,  2019,  9140,  2098,     1,  9563,     1,  1997,
        13438,     2,  3787,     2,  1012,   102,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
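To see what `build_entity_mask` produces, here is a small self-contained sketch on a shortened token list. The function name `entity_masks` and the value `max_len=16` are chosen for illustration only.

```python
import numpy as np

def entity_masks(tokens, max_len):
    """Build 0/1 masks covering the tokens between each marker pair."""
    marks = [i for i, tok in enumerate(tokens) if tok in ("[EN1]", "[EN2]")]
    e1_mask = np.zeros(max_len)
    e2_mask = np.zeros(max_len)
    # The slice starts after the opening marker and stops before the closing one
    e1_mask[marks[0] + 1 : marks[1]] = 1.0
    e2_mask[marks[2] + 1 : marks[3]] = 1.0
    return e1_mask, e2_mask

tokens = ["[CLS]", "the", "[EN1]", "audit", "##s", "[EN1]", "were", "about",
          "[EN2]", "waste", "[EN2]", ".", "[SEP]"]
e1_mask, e2_mask = entity_masks(tokens, max_len=16)
print(e1_mask.astype(int))
print(e2_mask.astype(int))
```

The marker positions themselves are excluded, so only the entity's own word pieces contribute to the later average.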
      

In [41]:
# encode the label
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(tr_relations)

tr_y = encoder.transform(tr_relations)
te_y = encoder.transform(te_relations)
tr_y[0], te_y[0]

(1, 7)
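`LabelEncoder` assigns integer ids in sorted class order, which explains the `(1, 7)` above. The mapping can be reproduced in plain Python as a sanity check. The dictionary names below are illustrative, and the list of relations is copied from the output earlier in this notebook.

```python
relations = ['Member-Collection', 'Entity-Destination', 'Product-Producer', 'Other',
             'Content-Container', 'Cause-Effect', 'Component-Whole',
             'Instrument-Agency', 'Entity-Origin', 'Message-Topic']

# Reproduce the sorted-order id assignment with plain dictionaries
label_to_id = {name: idx for idx, name in enumerate(sorted(relations))}
id_to_label = {idx: name for name, idx in label_to_id.items()}

print(label_to_id['Component-Whole'], label_to_id['Message-Topic'])
print(id_to_label[8])
```

With the fitted encoder itself, `encoder.inverse_transform` maps predicted ids back to relation names.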

In [0]:
# attention mask for BERT model
tr_attention_masks = [[float(i>0) for i in ii] for ii in tr_input_ids]
te_attention_masks = [[float(i>0) for i in ii] for ii in te_input_ids]
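Because the sequences are padded with id 0, the attention mask is simply "1 where the id is non-zero". A vectorised sketch of padding plus masking on toy ids follows; `pad_post` is a hypothetical stand-in for `pad_sequences` with `padding="post"` and `truncating="post"`.

```python
import numpy as np

def pad_post(ids, max_len, pad_id=0):
    """Right-pad (or right-truncate) a list of token ids to max_len."""
    ids = list(ids)[:max_len]
    return ids + [pad_id] * (max_len - len(ids))

batch = np.array([pad_post([101, 1996, 2291, 102], 6),
                  pad_post([101, 102], 6)])
# 1.0 for real tokens, 0.0 for padding
attention_mask = (batch > 0).astype(np.float32)
print(attention_mask)
```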

## Build the model

### Wrap the BERT model into a tensorflow keras layer

In [0]:
from tensorflow.keras import backend as K

class BertLayer(tf.keras.layers.Layer):
    def __init__(self, n_fine_tune_layers=10, **kwargs):
        self.n_fine_tune_layers = n_fine_tune_layers
        self.trainable = True
        self.output_size = 768
        super(BertLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        self.bert = hub.Module(
            BERT_MODEL_HUB,
            trainable=self.trainable,
            name="{}_module".format(self.name)
        )
        trainable_vars = self.bert.variables
        
        # Remove unused layers
        trainable_vars = [var for var in trainable_vars if not ("/cls/" in var.name or 'pooler' in var.name)]
        
        # Select how many layers to fine tune
        if self.n_fine_tune_layers == -1:
            trainable_vars = []
        else:
            trainable_vars = trainable_vars[-self.n_fine_tune_layers :]
        
        # Add to trainable weights
        for var in trainable_vars:
            self._trainable_weights.append(var)
        
        # Add non-trainable weights
        for var in self.bert.variables:
            if var not in self._trainable_weights:
                self._non_trainable_weights.append(var)
        
        super(BertLayer, self).build(input_shape)

    def call(self, inputs):
        inputs = [K.cast(x, dtype="int32") for x in inputs]
        input_ids, input_mask, segment_ids = inputs
        bert_inputs = dict(
            input_ids=input_ids, input_mask=input_mask, segment_ids=segment_ids
        )
        
        # Use "pooled_output" for classification tasks on an entire sentence.
        # Use "sequence_output" for token-level output (needed here for the entity averages).
        result = self.bert(inputs=bert_inputs, signature="tokens", as_dict=True)
        return result['sequence_output']

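    def compute_output_shape(self, input_shape):
        # sequence_output has shape (batch, seq_len, hidden)
        batch_size, seq_len = input_shape[0][0], input_shape[0][1]
        return (batch_size, seq_len, self.output_size)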
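The selection logic in `build` has a subtle edge case that is easy to check on a plain list. The sketch below mirrors it with a hypothetical helper, `select_trainable`, and made-up variable names.

```python
def select_trainable(variables, n_fine_tune_layers):
    """Mirror the selection logic in BertLayer.build on a plain list."""
    if n_fine_tune_layers == -1:
        return []
    return variables[-n_fine_tune_layers:]

names = ["layer_9", "layer_10", "layer_11"]
print(select_trainable(names, 2))   # last two entries
print(select_trainable(names, 0))   # -0 == 0, so the whole list is returned
print(select_trainable(names, -1))  # nothing is fine-tuned
```

Note that the slice counts variables rather than transformer layers, and that a value of 0 selects everything instead of nothing.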

The custom layer below averages the hidden vectors of each entity and concatenates them with the [CLS] vector, as described in the paper.

In [0]:
from tensorflow.keras import backend as K
from tensorflow import initializers
from tensorflow.keras.layers import Layer, InputSpec, Dense, concatenate, Dropout

class AverageAndConcat(Layer):

    def __init__(self, **kwargs):
        self.init = tf.initializers.random_uniform()
        self.supports_masking = True
        super(AverageAndConcat, self).__init__(** kwargs)

    def build(self, input_shape):
        bert_input_shape, e1mask_shape, e2mask_shape = input_shape
        self.input_spec = [InputSpec(ndim=3)]
        assert len(bert_input_shape) == 3

        self.fc_0 = Dense(bert_input_shape[2])
        self.fc_1 = Dense(bert_input_shape[2])
        
        #self._trainable_weights = [self.w]
        super(AverageAndConcat, self).build(input_shape)
        
    def call(self, inputs, mask=None):
        bert_input, e1_mask, e2_mask = inputs
        h0 = K.tanh(bert_input[:, 0, :])
        h1 = K.sum(bert_input * K.expand_dims(e1_mask, -1), axis = 1, keepdims=False)
        h2 = K.sum(bert_input * K.expand_dims(e2_mask, -1), axis = 1, keepdims=False)

        e1_len = K.sum(e1_mask, axis = 1, keepdims=True)
        e2_len = K.sum(e2_mask, axis = 1, keepdims=True)

        h1 = h1 / (e1_len + 1e-8)
        h2 = h2 / (e2_len + 1e-8)

        h1 = K.tanh(h1)
        h2 = K.tanh(h2)
        
        # add dropout
        h0 = Dropout(0.1)(h0)
        h1 = Dropout(0.1)(h1)
        h2 = Dropout(0.1)(h2)
        
        # fully connected layer for each vector
        h0 = self.fc_0(h0)
        h1 = self.fc_1(h1)
        h2 = self.fc_1(h2)
        
        return concatenate([h0, h1, h2], -1)

    def get_output_shape_for(self, input_shape):
        return self.compute_output_shape(input_shape)

    def compute_output_shape(self, input_shape):
        # The layer receives three inputs: BERT output, e1 mask and e2 mask
        bert_input_shape, e1mask_shape, e2mask_shape = input_shape

        return (bert_input_shape[0], bert_input_shape[2] * 3)

    def compute_mask(self, input, input_mask=None):
        if isinstance(input_mask, list):
            return [None] * len(input_mask)
        else:
            return None
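The masked averaging done in `call` can be checked on a toy array with NumPy. This is only a sketch of the averaging and concatenation steps: the dense projections and dropout are left out, and `masked_mean` is a hypothetical name.

```python
import numpy as np

def masked_mean(hidden, mask, eps=1e-8):
    """Average hidden states over positions where mask == 1.

    hidden: (batch, seq_len, dim), mask: (batch, seq_len)
    """
    summed = (hidden * mask[:, :, None]).sum(axis=1)
    length = mask.sum(axis=1, keepdims=True)
    return summed / (length + eps)

hidden = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)
mask = np.array([[0, 1, 1, 0],
                 [0, 0, 0, 1]], dtype=float)
h_entity = np.tanh(masked_mean(hidden, mask))
h_cls = np.tanh(hidden[:, 0, :])  # vector at the [CLS] position
features = np.concatenate([h_cls, h_entity], axis=-1)
print(features.shape)
```

The small `eps` only guards against division by zero when a mask happens to be empty.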

### Now set up the main model here

In [45]:
from tensorflow.keras import Input
from tensorflow.keras.models import Model
from tensorflow.keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Conv1D, Lambda, GlobalAveragePooling1D
from tensorflow.keras.layers import Bidirectional, concatenate, SpatialDropout1D, GlobalMaxPooling1D, add
from tensorflow.keras.optimizers import Adam, RMSprop

in_id = Input(shape=(MAX_LEN,), name="input_ids")
in_mask = Input(shape=(MAX_LEN,), name="input_masks")
in_segment = Input(shape=(MAX_LEN,), name="segment_ids")
e1mask_in = Input(shape=(MAX_LEN,), name="e1_masks")
e2mask_in = Input(shape=(MAX_LEN,), name="e2_masks")

bert_inputs = [in_id, in_mask, in_segment]

all_input = [in_id, in_mask, in_segment, e1mask_in, e2mask_in]


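# Instantiate the custom BERT layer defined above.
# Note: with n_fine_tune_layers=0 the slice trainable_vars[-0:] returns the whole list,
# so every BERT variable except the pooler and /cls/ ones is fine-tuned.
bert_output = BertLayer(n_fine_tune_layers=0)(bert_inputs)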

concat_input = [bert_output, e1mask_in, e2mask_in]

x = AverageAndConcat()(concat_input)
x = Dropout(0.1)(x)
out = Dense(units=n_relations, activation="softmax")(x)

model = Model(all_input, out)
model.compile(optimizer=Adam(lr = 2e-5), loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()

INFO:tensorflow:Saver not created because there are no variables in the graph to restore



Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor



Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_ids (InputLayer)          [(None, 128)]        0                                            
__________________________________________________________________________________________________
input_masks (InputLayer)        [(None, 128)]        0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 128)]        0                                            
__________________________________________________________________________________________________
bert_layer (BertLayer)          (None, None, 768)    110104890   input_ids[0][0]                  
                                                                 input_masks[0][0]            

In [46]:
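# Add a trailing axis so the labels have shape (n_samples, 1);
# axis=2 is out of range for a 1-D array and is rejected by recent NumPy versions
tr_y = np.expand_dims(tr_y, -1)
te_y = np.expand_dims(te_y, -1)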
batch_size = 16


with tf.Session() as sess:
    sess.run(tf.compat.v1.global_variables_initializer())
    history = model.fit([tr_input_ids, np.array(tr_attention_masks), np.zeros_like(tr_input_ids), np.array(tr_e1masks), np.array(tr_e2masks)],
                        tr_y,
                        validation_data=([te_input_ids, np.array(te_attention_masks), np.zeros_like(te_input_ids), np.array(te_e1masks), np.array(te_e2masks)], te_y),
                        batch_size=batch_size,
                        epochs=5,
                        verbose=1)
    
    pred = model.predict([te_input_ids, np.array(te_attention_masks), np.zeros_like(te_input_ids), np.array(te_e1masks), np.array(te_e2masks)])



Train on 8000 samples, validate on 2717 samples
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where



Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [47]:
# time for evaluation
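from sklearn.metrics import precision_recall_fscore_support

pred_cl = np.argmax(pred, -1)
# te_y may have a trailing singleton axis, so flatten it before scoring
precision, recall, fscore, support = precision_recall_fscore_support(te_y.ravel(), pred_cl)

# Macro-average the F1 score over every relation except 'Other'
# (a separate name avoids shadowing the imported scoring function)
other_id = encoder.transform(['Other'])[0]
f1_without_other = [f for i, f in enumerate(fscore) if i != other_id]
np.mean(f1_without_other)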

0.881904965271868
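The number above is a macro-averaged F1 over all relations except `Other`. For readers who want to see what that means without scikit-learn, here is a minimal from-scratch sketch on toy labels. `macro_f1_excluding` is a hypothetical helper and the toy data is invented for illustration.

```python
import numpy as np

def macro_f1_excluding(y_true, y_pred, n_classes, exclude):
    """Mean per-class F1 over all classes except `exclude`."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    f1s = []
    for c in range(n_classes):
        if c == exclude:
            continue
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        f1s.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(f1s))

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(macro_f1_excluding(y_true, y_pred, n_classes=3, exclude=2))
```

On the toy data, class 0 scores 0.5 and class 1 scores 0.8, so the average over the two kept classes is 0.65.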