# **Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification**


---

Ayazhan Aman, Aidana Kabdulova \\
Nazarbayev University \\
Nur-Sultan, Kazakhstan \\
ayazhan.aman@nu.edu.kz, aidana.kabdulova@nu.edu.kz


#Introduction


In this project, we deal with relation classification problem, which is one of the essential semantic processing tasks. Current methods which solves such natural language processing (NLP) problems require the high quality of the extracted features. There are several pre-existing natural language processing (NLP) systems like dependency parser and named entity recognizers (NER) or lexical databases such as WordNet which are very useful in order to get high level features. Even so, sometimes main information of the text may not depend on position in  the sentence. Thus, we exploit Attention-Based Bidirectional Long Short-Term Memory Networks(AttBLSTM) in order to obtain key semantic point of sentence. For this purpose we make experiments on SemEval-2010-task8 dataset.

Relation classification problem is an important semantic processing task and based on predicting semantic relations between pairs of nominals. The relation classification task can be defined as follows: given a sentence $S$ which contains a pair of nominals $<e_1,e_2>$, the goal is to find relation between these two nominals $e_1$ and $e_2$.  In this work our task is to classify which of the following nine semantic relations holds between the nominals: Cause-Effect, Instrument-Agency, Product-Producer, ContentContainer, Entity-Origin, Entity-Destination, Component-Whole, Member-Collection, Message-Topic, or Other if it does not belongs to any of the nine annotated relations.


For instance, ''burst'' and ''pressure'' connected in a Cause-Effect relation in the sentence:
```
 "The <e1>burst</e1> has been caused by water hammer <e2>pressure</e2>."
 Cause-Effect(e2,e1)
```
In this example, we obtain the relationship between the words burst and pressure by meaning of two nominals and context words. Thus, the representation and understanding of lexical and contextual meaning is the most important issues of semantic relation classification.

Recently, deep learning has made significant progress in natural language processing. There are lots of the state-of-the-art methods used for relation classification such as CNN, convolutional DNN, BLSTM based approaches. Some approaches use NLP systems like dependency parsers and NER or lexical resources like WordNet.

Our work proposes reproduction of a novel neural network AttBLSTM for relation classification. This model is combination of attention mechanism with Bidirectional LSTM, which can capture the most essential information from the text. Precisely speaking, this model can automatically identify the words which have crucial effect on classification. As it mentioned above, we make experiments on SemEval-2010-task8 dataset and obtain 74% of accuracy on test set and 75% of F1-score. We also use pre-trained vectors Glove.6B.100d to increase our accuracy.


# Dataset

Experiments are conducted on SemEval-2010 Task 8 dataset (Hendrickx et al., 2009). This dataset contains 9 relationships (with two directions) and an undirected Other class. There are 10,717 annotated examples, including 8,000 sentences for training, and 2,717 for testing. We adopt the official evaluation metric to evaluate our systems, which is based on macro-averaged F1-score for the nine actual relations (excluding the Other relation) and takes the directionality into consideration.
In order to compare with the work by Zhang and Wang (2015), we use the same word vectors proposed by Turian et al. (2010) (50-dimensional) to initialize the embedding layer. Additionally, to compare with the work by Zhang et al. (2015), we also use the 100-dimensional word vectors pretrained by Pennington et al. (2014).
 Since there is no official development dataset, we randomly select 1600 sentences (which is 20%) from training set for validation. The hyper-parameters for our model were tuned on the development set for each task. 

In [0]:
from google.colab import drive
drive.mount('/content/drive/')

In [0]:
! git clone https://github.com/SeoSangwoo/Attention-Based-BiLSTM-relation-extraction/tree/master/SemEval2010_task8_all_data

fatal: destination path 'SemEval2010_task8_all_data' already exists and is not an empty directory.


In [0]:
! ls

drive  sample_data  SemEval2010_task8_all_data


In [0]:
import os
import zipfile

local_zip = ''
zip_ref = zipfile.ZipFile(local_zip, 'r')
zip_ref.extractall('/content')
zip_ref.close()

with open("/content/SemEval2010_task8_all_data/SemEval2010_task8_training/TRAIN_FILE.TXT") as f:
  train_file = f.readlines()

#with open("/content/SemEval2010_task8_all_data/SemEval2010_task8_testing/TEST_FILE.txt") as f:
  #test_file = f.readlines()

with open("/content/SemEval2010_task8_all_data/SemEval2010_task8_testing_keys/TEST_FILE_FULL.TXT") as f:
  test_file = f.readlines()

# Data preprocessing


In our dataset, we have 8000 training sentences and 2717 test sentences. After uploading our dataset SemEval-2010 Task 8, we did simple data pre-processing. Firstly, we replace these symbols "<", ">" to symbols with space respectively " <", "> ". Also, we delete all new lines. 
Using the **split()** and **replace()** methods we divide our train and test datasets to sentences and relations.

In our data set we have following 10 relations: ['Instrument-Agency', 'Message-Topic', 'Content-Container', 'Entity-Origin', 'Product-Producer', 'Entity-Destination', 'Cause-Effect', 'Component-Whole', 'Other', 'Member-Collection']. Working with relations in this format is inconvenient. Therefore, we transform these relations into one-hot vector. 

We split our training dataset to train (80%) and validation (20%). After splitting in train set remained 6400 sentences, in validation 1600 sentences.

In [0]:
def prepare_dataset(raw):
    sentences, relations = [], []
    to_replace = [("\"", ""), ("\n", ""), ("<", " <"), (">", "> ")]
    last_was_sentence = False
    for line in raw:
        sl = line.split("\t")
        if last_was_sentence:
            relations.append(sl[0].split("(")[0].replace("\n", ""))
            last_was_sentence = False
        if sl[0].isdigit():
            sent = sl[1]
            for rp in to_replace:
                sent = sent.replace(rp[0], rp[1])
            sentences.append(sent)
            last_was_sentence = True
    print("Found {} sentences".format(len(sentences)))
    return sentences, relations

In [0]:
sentences, relations = prepare_dataset(train_file)
sentences_test, relations_test = prepare_dataset(test_file)

Found 8000 sentences
Found 2717 sentences


In [0]:
for n, line in enumerate(relations):
  if line=='Instrument-Agency':
    relations[n]=0
  elif line== 'Message-Topic':
    relations[n]=1
  elif line=='Content-Container':
    relations[n]=2
  elif line=='Entity-Origin':
    relations[n]=3
  elif line=='Product-Producer':
    relations[n]=4
  elif line=='Entity-Destination':
    relations[n]=5
  elif line=='Cause-Effect':
    relations[n]=6
  elif line=='Component-Whole':
    relations[n]=7
  elif line=='Other':
    relations[n]=8
  elif line=='Member-Collection':
    relations[n]=9 

for n, line in enumerate(relations_test):
  if line=='Instrument-Agency':
    relations_test[n]=0
  elif line== 'Message-Topic':
    relations_test[n]=1
  elif line=='Content-Container':
    relations_test[n]=2
  elif line=='Entity-Origin':
    relations_test[n]=3
  elif line=='Product-Producer':
    relations_test[n]=4
  elif line=='Entity-Destination':
    relations_test[n]=5
  elif line=='Cause-Effect':
    relations_test[n]=6
  elif line=='Component-Whole':
    relations_test[n]=7
  elif line=='Other':
    relations_test[n]=8
  elif line=='Member-Collection':
    relations_test[n]=9 

In [0]:
from keras.utils import to_categorical
val_binary = to_categorical(relations)
#print(val_binary)

from keras.utils import to_categorical
test_binary = to_categorical(relations_test)
#print(test_binary)

In [0]:
vocab_size = 30000
embedding_dim = 100
max_length = 50
trunc_type='post'
padding_type='post'
oov_tok = "<OOV>"
vs = 30001

In [0]:
import numpy as np
from sklearn.model_selection import train_test_split
tr_sent, te_sent, tr_rel, te_rel = train_test_split(sentences, val_binary, test_size=0.2)

##Tokenizer

Tokenization is the first step in many natural language processing tasks. Tokenizing text is the process of splitting a piece of text into words, symbols, punctuation, spaces and other elements, thereby creating “tokens”. And for each tokens Tokenizer gives index. It means, if we split our sentences into words, each word has own index. But in some sentences, there are words that out of vocabulary. In that situation, we use OOV_taken (out-of-vocabulary). For all words which do not exist on vocabulary, OOV_taken gives index 1. 

##Padding

In our dataset sentences has a different length. To make it easier to work we should make all sentences one size. For this, we use padded sequences.
Padding sequences mean adding zeros at the list to make it with the length size of the long sentences. 

In [0]:
import tensorflow as tf

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.optimizers import RMSprop, Adam
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(tr_sent)

word_index = tokenizer.word_index
total_words = len(tokenizer.word_index) 

training_sequences = tokenizer.texts_to_sequences(tr_sent)
training_padded = pad_sequences(training_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

val_sequences = tokenizer.texts_to_sequences(te_sent)
val_padded = pad_sequences(val_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

testing_sequences = tokenizer.texts_to_sequences(sentences_te)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

#Reproduction

##*Model*

We do a reproduction of the article "Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification". In this paper authors use Att-BLSTM model for NLP task. This model have following five components:

1.   Input layer: input sentence to this model;
2.   Embedding layer;
3.   LSTM layer;
4.   Attention layer;
5.   Output layer.

![Fig 1](https://user-images.githubusercontent.com/6512394/41424160-42520358-7038-11e8-8db0-859346a1fa3a.PNG)




**Word embeddings**

Word embedding used for mapping a sentence consisting of $T$ words $S = {x_1, x_2, . . . , x_T }$ to real-valued vectors. We lookup embedding matrix $W^{wrd} \in R^{d_w |V|}$ for each words in sentences $S$. Here, $|V|$ is vocabulary size and $d^w$ size of embedding. The matrix $W^{wrd}$ to be learned, $d^w$ hyper-parameter. Using the matrix-vector product 
$$e_i = W^{wrdv}v^i$$
we convert a word $x_i$ into its word embedding $e_i$, where $v_i$ is a vector of size $|V|$ which has value 1 at index $e_i$ and 0 in all other positions. And after word embedding our sentence will look 
vector $embs = {e_1, e_2, . . . , e_T }$.

**Bidirectional LSTM**

Next layer in our model is LSTM layer. The main idea is to introduce an adaptive gating mechanism, which decides the degree to which LSTM units keep the previous state and memorize the extracted features of the current data input.
The following equations for the forward pass of an LSTM units:
$$i_t = σ(W_{x_ix_t} + W_{hi}h_{t−1} + W_{ci}c_{t−1} + b_i) $$
$$f_t = σ(W_{xf}x_t+W_{hf}h_{t−1}+W_{cf} c_{t−1} + b_f)$$
$$g_t = tanh(W_{xc}x_t+W_{hc}h_{t−1}+W_{cc}c_{t−1}+b_c)$$
$$c_t = i_tg_t + f_tc_{t−1}$$
$$o_t = σ(W_{xo}x_t + W_{ho}h_{t−1} + W_{co}c_t + b_o)$$
$$h_t = o_t tanh(c_t)$$

**Variables:**

$i_t$ - input gate

$f_t$ - forget gate

$o_t$ - output gate

$x_i$ - the current input

$h_{i−1}$ - the state that previous step generated

$c_{i−1}$ - the current state of the cell

$W_{xi}, W_{hi}, W_{ci}, b_i$, $W_{xf}$ , $W_{hf}$ , $W_{cf}$ , $b_f$, $W_{xo}, W_{ho}, W_{co}, b_o $- weight matrix

Following picture shows how to works this equations. 

![LSTM](https://upload.wikimedia.org/wikipedia/commons/thumb/5/53/Peephole_Long_Short-Term_Memory.svg/1920px-Peephole_Long_Short-Term_Memory.svg.png)

In our task, it is beneficial to have access to future and past context. But, standard LSTM networks do not have access to future context. Therefore we use Bidirectional LSTM network.As also shown in figure above, the BiLSTM network contains forward and backward pass. The following equation present the output of the $i$-th word
$$h_i = [h_i^{→} ⊕h_i^{←}]. $$
Here, we use element-wise sum to combine the
forward and backward pass outputs.


**Attention Layer**

In recent years, the so-called attention mechanism demonstrated success in tasks as question answering, machine translations, speech recognition. This attention layer basically learns the input sequence and averages the sequence accordingly to extract the relations between the words, which needs attention. 

 Now we show the mathematical meaning of this layer. 
$$M = tanh(H)$$
$$α = sof tmax(w^T M)$$
$$r = Hα^T$$

**Notation**

$H ∈ R^{d^w ×T}$ - a matrix consisting of output vectors $[h_1, h_2, . . . , h_T]$ that the LSTM layer produced;

$d^w$ - the dimension of the word vectors;

$w$ -  trained parameter vector and

We obtain final sentences representation using the formula below:
$$h^∗ = tanh(r)$$


##Classifying and Regularization

To predict label $yˆ$ from a discrete set of classes Y for a
sentence S we use softmax classifier. The classifier takes $h^*$ as input:
$$pˆ(y|S) = softmax (W^{(S)}h^∗ + b^{(S)})$$

$$yˆ = argmax_y pˆ(y|S).$$

For regulazation we use two types of Dropout (droput, recurrent dropout) in BiLSTM layer.

##Experiments

After loading the data and some pre-processing, we begin our experiments to select the best model and parameters. We use pre-trained vectors Glove.6B.100d to increase our accuracy. Glove is an unsupervised learning algorithm for obtaining vector representations for words. For following models we consider that our vocabulary size = 30000 and maximum number of words for each sentence is 50.

### First approach


Once all the data has been preprocessed and pre-trained vectors Glove.6B.100d has been downloaded we begin to build a model for Bidirectional LSTM networks. Various layers has been applied and trained. The best result was achieved with model consisting Embedding layer with embedding dimension = 100; one BiLSTM layer with 150 units; one Dropout layer with dropout rate = 0,5 and Dense layer with 10 units and "Softmax" activation function. We choose an ”RMSprop” optimizer for optimization and since we have amulti-class classification problem we use ”categorical crossentropy” as a loss function. In order to fit this model we fix 30 epochs.
 As a result we get 68% of accuracy on testing set and 67% of F1-score. 

In [0]:
import numpy as np
embeddings_index = {};
with open('/content/drive/My Drive/glove.6B.100d.txt') as f:
    for line in f:
        values = line.split();
        word = values[0];
        coefs = np.asarray(values[1:], dtype='float32');
        embeddings_index[word] = coefs;

embeddings_matrix = np.zeros((vocab_size+1, embedding_dim));
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word);
    if embedding_vector is not None:
        embeddings_matrix[i] = embedding_vector;

In [0]:
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional, Dropout, SpatialDropout1D
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import RMSprop, Adam

model1 = Sequential()
model1.add(Embedding(vs, embedding_dim, input_length=max_length, weights=[embeddings_matrix]))
model1.add(Bidirectional(LSTM(150)))
model1.add(Dropout(0.5))
model1.add(Dense(10, activation='softmax'))

#rmsprop= RMSprop(lr=0.01)
model1.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr=0.01), metrics=['accuracy'])
#earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=5, verbose=0, mode='auto')
history1 = model1.fit(training_padded, tr_rel, validation_data=(val_padded, te_rel), epochs=30, verbose=1)
#print model.summary()
#print(model)

Train on 6400 samples, validate on 1600 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


In [0]:
result1 = model1.evaluate(testing_padded,test_binary)
print('Test accuracy:',result1[1])

from sklearn.metrics import f1_score, classification_report, accuracy_score
import numpy as np
y1_pred = model1.predict(testing_padded)
y1_pred=np.round(y1_pred, 0)
f1_1=f1_score(test_binary, y1_pred, average='weighted') 
print('f1_score:', f1_1)

Test accuracy: 0.64298856
f1_score: 0.6502470020275697


In [0]:
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional, Dropout, SpatialDropout1D
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import RMSprop, Adam

embedding_dim = 100
model2 = Sequential()
model2.add(Embedding(vs, embedding_dim, input_length=max_length, weights=[embeddings_matrix]))
model2.add(Bidirectional(LSTM(150)))
model2.add(Dropout(0.5))
model2.add(Dense(10, activation='softmax'))

#rmsprop= RMSprop(lr=0.01)
model2.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr=0.01), metrics=['accuracy'])
#earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=5, verbose=0, mode='auto')
history2 = model2.fit(training_padded, tr_rel, validation_data=(val_padded, te_rel), epochs=30, verbose=1)
#print model.summary()
#print(model)

Train on 6400 samples, validate on 1600 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


In [0]:
result2 = model2.evaluate(testing_padded,test_binary)
print('Test accuracy:',result2[1])

from sklearn.metrics import f1_score, classification_report, accuracy_score
import numpy as np
y2_pred = model2.predict(testing_padded)
y2_pred=np.round(y2_pred, 0)
f1_2=f1_score(test_binary, y2_pred, average='weighted') 
print('f1_score:', f1_2)

Test accuracy: 0.6812661
f1_score: 0.678821086488234


### Second approach

For the second approach we added attention mechanism to our network. After testing on different models with different parameters, we settled on a model which consists Embedding layer with embedding dimension = 100; one BiLSTM layer which has 150 units, dropout = 0.2 and reccurent_dropout = 0.2; Attention layer and Dense layer with 10 units and "Softmax" activation function. We again choose an ”RMSprop” optimizer for optimization and since we have a multi-class classification problem we use ”categorical crossentropy” as a loss function. In order to fit this model we fix 30 epochs. As a result we get 73% of accuracy on testing set and 73% of F1-score.

In [0]:
pip install keras-self-attention



In [0]:
from keras.preprocessing import sequence
from keras_self_attention import SeqSelfAttention, SeqWeightedAttention
from keras import models
from keras import layers
from keras.layers import Dense, Embedding, LSTM, Bidirectional, Flatten


model4 = models.Sequential()
# model.add( Embedding(max_features, 32,  mask_zero=True))
model4.add( Embedding(vs, embedding_dim, input_length=max_length, weights=[embeddings_matrix], mask_zero=True))
model4.add(Bidirectional( LSTM(150, return_sequences=True, dropout=0.2, recurrent_dropout=0.2)))
# add an attention layer

# model.add(SeqSelfAttention(attention_activation='sigmoid'))
model4.add(SeqWeightedAttention())

model4.add( Dense(10, activation='softmax') )

# compile and fit
model4.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])
model4.summary()

history4 = model4.fit(training_padded, tr_rel, validation_data=(val_padded, te_rel), epochs=30, verbose=1)

Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_9 (Embedding)      (None, 50, 100)           3000100   
_________________________________________________________________
bidirectional_7 (Bidirection (None, 50, 300)           301200    
_________________________________________________________________
seq_weighted_attention_7 (Se (None, 300)               301       
_________________________________________________________________
dense_7 (Dense)              (None, 10)                3010      
Total params: 3,304,611
Trainable params: 3,304,611
Non-trainable params: 0
_________________________________________________________________
Train on 6400 samples, validate on 1600 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30

In [0]:
result4 = model4.evaluate(testing_padded,test_binary)
print('Test accuracy:',result4[1])

from sklearn.metrics import f1_score, classification_report, accuracy_score
import numpy as np
y4_pred = model4.predict(testing_padded)
y4_pred=np.round(y4_pred, 0)
f1_4=f1_score(test_binary, y4_pred, average='weighted') 
print('f1_score:', f1_4)

Test accuracy: 0.7331615750035864
f1_score: 0.7392195331114566


## Results

**Results in Tabular form**
---

Our model | | Paper's model|  |
--- | --- | --- | --- 
**Model** | **F1-score** | **Model** | **F1-score** |
BLSTM   | 67.8% | BLSTM | 80.7%
Att-BLSTM | 73.9% | Att-BLSTM | 82.5%
**Table-1. F1-score of different models.**

# Colclusion

In this project we experienced with SemEval-2010 Task 8 dataset (Hendrickx et al., 2009). We tried to reproduce "Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification" paper. We constructed same model but also added pre-trained vectors Glove.6B.100d since our accuracy was quite low. Although obtained F1-scores are a little bit different from paper's, from the given results in Table-1 we can conclude that model with Attention mechanism classifies better than simple BiLSTM.