# LSTM on Annotated Corpus for Named Entity Recognition

In this notebook, we perform entity extraction on the ACNER dataset with an LSTM-based neural network. We use `tf.keras.preprocessing.text.Tokenizer` for text preprocessing, pad all sentences to the same length, and load GloVe embeddings to encode the tokens. The model is built with `tensorflow.keras`, and evaluation uses the `seqeval` package.

---

In [1]:
import pandas as pd
import numpy as np
import os
from utils import dataio, kerasutils, modelutils
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
from seqeval.metrics import classification_report

## Load Dataset
The dataset can be found [here](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus). It provides many features for each token, but we keep only the token string and the entity tag.

In [2]:
raw, ner, output_labels = dataio.load_anerd_data(
    os.path.join('data', 'annotated-ner-dataset', 'ner.csv'),
    filter_level='sentence_only'
)

b'Skipping line 281837: expected 25 fields, saw 34\n'


Filter level: sentence_only
Dataset dimension: 35177 sentences
Data read successfully!


In [3]:
print("Labels:", output_labels)

Labels: {'B-gpe', 'B-nat', 'I-geo', 'B-per', 'B-tim', 'I-org', 'B-geo', 'I-art', 'unk', 'O', 'I-gpe', 'B-eve', 'I-eve', 'I-nat', 'I-per', 'B-art', 'B-org', 'I-tim'}


In [4]:
print("Sentence Example:")
for i in range(len(raw[0])):
    print(f'{raw[0][i]:15} | {ner[0][i]}')

Sentence Example:
Thousands       | O
of              | O
demonstrators   | O
have            | O
marched         | O
through         | O
London          | B-geo
to              | O
protest         | O
the             | O
war             | O
in              | O
Iraq            | B-geo
and             | O
demand          | O
the             | O
withdrawal      | O
of              | O
British         | B-gpe
troops          | O
from            | O
that            | O
country         | O
.               | O
Thousands       | O
of              | O
demonstrators   | O
have            | O
marched         | O
through         | O
London          | B-geo
to              | O
protest         | O
the             | O
war             | O
in              | O
Iraq            | B-geo
and             | O
demand          | O
the             | O
withdrawal      | O
of              | O
British         | B-gpe
troops          | O
from            | O
that            | O
country         | O
.               | 

---

## Text preprocessing and token encoding

In [5]:
# Integer encoding of tokens
token_tokenizer = Tokenizer()    # Lowercases tokens by default (lower=True)
token_tokenizer.fit_on_texts(raw)
sequences = token_tokenizer.texts_to_sequences(raw)

# Dictionaries for id <-> string conversion of labels.
# output_labels is a set, so we sort it to get the same ids on every run.
tag2idx = { tag: idx for idx, tag in enumerate(sorted(output_labels)) }
idx2tag = { idx: tag for tag, idx in tag2idx.items() }

# Label encoding
ner_sequences = [[tag2idx[tag] for tag in sentence] for sentence in ner]

In [6]:
print(sequences[0])
for i in sequences[0]:
    print(f'{i:6} : {token_tokenizer.index_word[i]}')

[259, 5, 902, 15, 1950, 245, 482, 6, 492, 1, 134, 4, 59, 8, 640, 1, 799, 5, 182, 91, 21, 14, 54, 2, 259, 5, 902, 15, 1950, 245, 482, 6, 492, 1, 134, 4, 59, 8, 640, 1, 799, 5, 182, 91, 21, 14, 54, 2]
   259 : thousands
     5 : of
   902 : demonstrators
    15 : have
  1950 : marched
   245 : through
   482 : london
     6 : to
   492 : protest
     1 : the
   134 : war
     4 : in
    59 : iraq
     8 : and
   640 : demand
     1 : the
   799 : withdrawal
     5 : of
   182 : british
    91 : troops
    21 : from
    14 : that
    54 : country
     2 : .
   259 : thousands
     5 : of
   902 : demonstrators
    15 : have
  1950 : marched
   245 : through
   482 : london
     6 : to
   492 : protest
     1 : the
   134 : war
     4 : in
    59 : iraq
     8 : and
   640 : demand
     1 : the
   799 : withdrawal
     5 : of
   182 : british
    91 : troops
    21 : from
    14 : that
    54 : country
     2 : .


In [7]:
vocabulary_size = len(token_tokenizer.word_counts)
print('Vocabulary dimension:', vocabulary_size)

Vocabulary dimension: 27419


### Sequence Padding

The model expects input sequences of a fixed length. We choose a sequence length based on the distribution of sentence lengths in the dataset, then pad shorter sentences and truncate longer ones.
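
As a minimal sketch of what post-padding and post-truncation do, the snippet below reimplements the idea with NumPy on toy data. `pad_post` is a hypothetical helper written for illustration only; the notebook itself relies on `pad_sequences`.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Pad at the end with `value` and cut anything beyond `maxlen`.

    Hypothetical helper that mimics padding='post', truncating='post'.
    """
    out = np.full((len(sequences), maxlen), value, dtype='int32')
    for row, seq in enumerate(sequences):
        trimmed = seq[:maxlen]              # truncate long sentences at the end
        out[row, :len(trimmed)] = trimmed   # short sentences keep trailing pads
    return out

toy = [[5, 3], [7, 8, 9, 4, 2]]
padded = pad_post(toy, maxlen=4)
print(padded.tolist())  # [[5, 3, 0, 0], [7, 8, 9, 4]]
```

The id 0 is free to act as padding because the Keras `Tokenizer` starts numbering words at 1.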

In [8]:
sequence_len = np.array([len(s) for s in sequences])
longest_sequence = sequence_len.max()
print(f'Longest sequence: {longest_sequence}')

print([(str(p)+'%', np.percentile(sequence_len, p)) for p in range(75,101, 5)])

Longest sequence: 140
[('75%', 38.0), ('80%', 42.0), ('85%', 47.0), ('90%', 52.0), ('95%', 62.0), ('100%', 140.0)]


In [9]:
n_tags = len(output_labels); n_tags

18

In [10]:
max_len = 60
X = pad_sequences(sequences, maxlen=max_len, padding='post', truncating='post')
y = pad_sequences(ner_sequences, maxlen=max_len, value=tag2idx['O'], 
                  padding='post', truncating='post')

# Convert labels from ids to one-hot vectors
y = to_categorical(y, num_classes=n_tags, dtype='int32')

In [11]:
token_tokenizer.index_word[0] = '_PAD_'

In [12]:
X = np.array(X)
y = np.array(y)

In [13]:
# Final dataset shapes (before the train/test split)
print(X.shape)
print(y.shape)

(35177, 60)
(35177, 60, 18)


## Training

The functions that load the GloVe embeddings and create the LSTM model are in the `utils/kerasutils.py` module. Training stops through early stopping: it halts when the validation loss has not improved for a set number of epochs (the patience).
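
Since `utils/kerasutils.py` is not shown here, the following is only a sketch of how a GloVe embedding matrix is commonly assembled. It assumes the usual GloVe text format (a word followed by its space-separated vector components). `build_embedding_matrix` is a hypothetical name and may differ from what `load_glove_embedding_matrix` actually does.

```python
import numpy as np

def build_embedding_matrix(glove_lines, word_index, embedding_dim):
    """Hypothetical sketch: map tokenizer ids to pretrained vectors.

    glove_lines: iterable of text lines, e.g. an open GloVe file.
    word_index:  dict word -> integer id (ids start at 1).
    """
    vectors = {}
    for line in glove_lines:
        parts = line.rstrip().split(' ')
        word, values = parts[0], parts[1:]
        if len(values) != embedding_dim:
            continue  # skip malformed lines
        vectors[word] = np.asarray(values, dtype='float32')

    # Row 0 is reserved for padding; unknown words keep an all-zero row.
    matrix = np.zeros((len(word_index) + 1, embedding_dim), dtype='float32')
    for word, idx in word_index.items():
        vector = vectors.get(word)
        if vector is not None:
            matrix[idx] = vector
    return matrix

toy_lines = ["the 0.1 0.2 0.3", "war 0.4 0.5 0.6"]
toy_index = {'the': 1, 'war': 2, 'zzz': 3}
toy_matrix = build_embedding_matrix(toy_lines, toy_index, embedding_dim=3)
print(toy_matrix.shape)  # (4, 3)
```

With a real file, one would pass an open file handle instead of `toy_lines`. Words missing from GloVe (here `zzz`) start from a zero vector under this scheme.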

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=42)

In [15]:
USE_GLOVE = True    # True: use GloVe pretrained embeddings;
                    # False: train an Embedding from scratch

In [16]:
glove_matrix=None
if USE_GLOVE:
    glove_embedding_path = os.path.join('embeddings', 'glove.6B.100d.txt')
    embedding_dim = 100
    glove_matrix = kerasutils.load_glove_embedding_matrix(
        glove_embedding_path, 
        token_tokenizer.word_index, 
        embedding_dim
    )

Found 400001 word vectors.


In [17]:
model = kerasutils.create_paper_BiLSTM(vocabulary_size+1, max_len, 
                                       len(output_labels), 
                                       use_glove=USE_GLOVE, 
                                       glove_matrix=glove_matrix)

# Early stopping
early_stopping_callback = EarlyStopping(
    monitor="val_loss",
    min_delta=0.01,
    patience=3,
    verbose=1,
    mode="auto",
    restore_best_weights=True
)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 60, 100)           2742000   
_________________________________________________________________
dropout (Dropout)            (None, 60, 100)           0         
_________________________________________________________________
bidirectional (Bidirectional (None, 60, 400)           481600    
_________________________________________________________________
dense (Dense)                (None, 60, 18)            7218      
Total params: 3,230,818
Trainable params: 3,230,818
Non-trainable params: 0
_________________________________________________________________
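
The parameter counts in the summary can be checked by hand. The arithmetic below assumes a standard LSTM cell with 200 units per direction (inferred from the bidirectional output width of 400, not read from the model code) and the vocabulary size of 27419 printed earlier.

```python
vocab_rows = 27419 + 1      # vocabulary plus the padding id 0
embedding_dim = 100
lstm_units = 200            # assumption: 400 outputs = 2 directions x 200 units
n_tags = 18

# Embedding: one vector per vocabulary row
embedding_params = vocab_rows * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)
bilstm_params = 2 * lstm_params

# Dense layer applied to each time step: weights plus biases
dense_params = (2 * lstm_units) * n_tags + n_tags

total_params = embedding_params + bilstm_params + dense_params
print(embedding_params, bilstm_params, dense_params, total_params)
# 2742000 481600 7218 3230818
```

All parameters are listed as trainable in the summary, so the embedding weights are updated during training as well.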


In [18]:
%%time
batch_size = 10
history = model.fit(
    X_train, y_train, 
    batch_size=batch_size, 
    epochs=20, 
    validation_split=0.2, 
    verbose=1,
    callbacks=[early_stopping_callback]
)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 00006: early stopping
Wall time: 18min 11s
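
The log ends at epoch 6. With `patience=3`, this is consistent with epoch 3 being the last epoch that improved `val_loss` by more than `min_delta=0.01`. The sketch below replays that stopping rule on made-up loss values, since the per-epoch losses are not shown in the log. `simulate_early_stopping` is a hypothetical helper and a simplification, not the Keras implementation.

```python
def simulate_early_stopping(val_losses, patience=3, min_delta=0.01):
    """Return (epoch at which training stops, best epoch), both 1-based.

    Simplified rule: an epoch counts as an improvement only if the loss
    drops below the best value seen so far by more than min_delta.
    """
    best = float('inf')
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch  # never triggered

# Illustrative values only, NOT the losses of the run above
hypothetical_losses = [0.20, 0.12, 0.10, 0.095, 0.097, 0.096]
stopped_at, best_epoch = simulate_early_stopping(hypothetical_losses)
print(stopped_at, best_epoch)  # 6 3
```

Because `restore_best_weights=True` is set, the model kept after training corresponds to the best epoch rather than the last one.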


---

## Evaluation
We evaluate three aspects of the model:
* **Memory consumption**, using the `kerasutils.print_model_memory_usage()` function (found [here](https://stackoverflow.com/questions/43137288/how-to-determine-needed-memory-of-keras-model));
* **Prediction latency**, using the function `time.process_time()`;
* **F1-score** _on entities_ on the test set, using `seqeval`.
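
Scoring "on entities" means a prediction only counts when the whole span and its type match, which is stricter than per-token accuracy. The sketch below illustrates the idea on a toy sentence; `extract_entities` and `entity_f1` are hypothetical helpers and a simplification of what `seqeval` does, not its implementation.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from a B-/I-/O tag list."""
    entities = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ['O']):   # sentinel closes the last span
        prefix, _, label = tag.partition('-')
        # Close the open span unless this tag continues it
        if etype is not None and (prefix != 'I' or label != etype):
            entities.add((etype, start, i - 1))
            start, etype = None, None
        # Open a new span on B-, or on an I- with no open span
        if prefix == 'B' or (prefix == 'I' and etype is None):
            start, etype = i, label
    return entities

def entity_f1(y_true, y_pred):
    """Micro-averaged precision, recall and F1 over exact span matches."""
    true_ents, pred_ents = set(), set()
    for idx, (t, p) in enumerate(zip(y_true, y_pred)):
        true_ents |= {(idx, *e) for e in extract_entities(t)}
        pred_ents |= {(idx, *e) for e in extract_entities(p)}
    correct = len(true_ents & pred_ents)
    precision = correct / len(pred_ents) if pred_ents else 0.0
    recall = correct / len(true_ents) if true_ents else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

y_true = [['O', 'B-geo', 'I-geo', 'O', 'B-per']]
y_pred = [['O', 'B-geo', 'O', 'O', 'B-per']]
print(entity_f1(y_true, y_pred))  # (0.5, 0.5, 0.5)
```

In this toy case four of five tokens are tagged correctly, yet only one of the two entities matches exactly, so the entity-level F1 is 0.5.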

In [19]:
kerasutils.print_model_memory_usage(batch_size, model)

Model size: 13.739 MB


In [20]:
print(f'Model latency in prediction: {modelutils.compute_prediction_latency(X_test, model):.3} s')

Model latency in prediction: 0.00476 s
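
The code of `modelutils.compute_prediction_latency` is not shown, so the following is only a sketch of how a per-sample latency could be measured with `time.process_time()`. `mean_prediction_latency` and its `predict_fn` argument are hypothetical names; any callable that scores one sample can stand in for the model.

```python
import time

def mean_prediction_latency(predict_fn, samples, repeats=3):
    """Average CPU time per sample, taking the best of several repeats."""
    timings = []
    for _ in range(repeats):
        start = time.process_time()
        for sample in samples:
            predict_fn(sample)
        elapsed = time.process_time() - start
        timings.append(elapsed / len(samples))
    return min(timings)

# Stand-in for a model: sums the token ids of one sample
toy_samples = [list(range(60)) for _ in range(100)]
latency = mean_prediction_latency(sum, toy_samples)
print(f'Latency per sample: {latency:.3} s')
```

Note that `time.process_time()` counts CPU time of the current process and excludes time spent sleeping; `time.perf_counter()` would measure wall-clock time instead.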


In [21]:
datasets = [('Training Set', X_train, y_train), ('Test Set', X_test, y_test)]

for title, X, Y in datasets:
    # Get predictions: for each token, the model outputs a vector 
    # of probabilities
    Y_pred = model.predict(X, batch_size=batch_size)
    # Pick the category with the highest probability
    Y_pred = np.array(np.argmax(Y_pred, axis=-1))
    # Convert the one-hot true labels back to label ids
    Y = np.array(np.argmax(Y, axis=-1))
    # Remove padding from predictions and labels
    Y, Y_pred = kerasutils.remove_seq_padding(X, Y, Y_pred)
    # Restore label strings in place of entity ids
    Y, Y_pred = modelutils.from_encode_to_literal_labels(Y, Y_pred, idx2tag)
    
    print(title)
    print(classification_report(Y, Y_pred, digits=3))
    print('\n')

Training Set
           precision    recall  f1-score   support

      per      0.748     0.797     0.772     13596
      geo      0.820     0.895     0.856     29297
      org      0.686     0.549     0.610     15970
      gpe      0.954     0.924     0.939     12914
      tim      0.838     0.880     0.858     15898
      eve      0.374     0.367     0.371       267
      art      0.761     0.099     0.175       355
      nat      0.439     0.307     0.361       176

micro avg      0.809     0.813     0.811     88473
macro avg      0.805     0.813     0.806     88473



Test Set
           precision    recall  f1-score   support

      per      0.728     0.782     0.754      3265
      tim      0.823     0.850     0.836      3987
      geo      0.807     0.879     0.842      7580
      gpe      0.956     0.926     0.941      3260
      org      0.661     0.528     0.587      3950
      eve      0.262     0.250     0.256        68
      nat      0.214     0.122     0.156        49
   
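
The evaluation cell relies on two helpers from `utils` whose code is not shown: one removes padded positions, the other maps label ids back to strings. A minimal sketch of both steps follows, assuming padding positions are those where the token id is 0. `strip_padding` and `ids_to_tags` are hypothetical names and may differ from the actual helpers.

```python
import numpy as np

def strip_padding(X, y_true, y_pred, pad_id=0):
    """Drop positions where the input token is padding (hypothetical helper)."""
    clean_true, clean_pred = [], []
    for tokens, true_row, pred_row in zip(X, y_true, y_pred):
        mask = np.asarray(tokens) != pad_id
        clean_true.append(np.asarray(true_row)[mask].tolist())
        clean_pred.append(np.asarray(pred_row)[mask].tolist())
    return clean_true, clean_pred

def ids_to_tags(sequences, idx2tag):
    """Map every label id to its string tag (hypothetical helper)."""
    return [[idx2tag[i] for i in seq] for seq in sequences]

toy_idx2tag = {0: 'O', 1: 'B-geo'}
toy_X = np.array([[12, 7, 0, 0]])        # two real tokens, two pads
toy_true = np.array([[0, 1, 0, 0]])
toy_pred = np.array([[0, 1, 1, 0]])      # wrong only on a padded position

t, p = strip_padding(toy_X, toy_true, toy_pred)
print(ids_to_tags(t, toy_idx2tag), ids_to_tags(p, toy_idx2tag))
# [['O', 'B-geo']] [['O', 'B-geo']]
```

Removing padding first matters because the labels were padded with the `O` tag, so predictions on padded positions would otherwise be scored as if they were real tokens.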

---