# LSTM for Entity Extraction on WikiNER (English)

WikiNER is a dataset of annotated sentences for entity extraction taken from Wikipedia. In this notebook, we train and evaluate a bidirectional LSTM neural network on the English WikiNER dataset to recognize persons, locations, organizations, and miscellaneous entities.

We use `tf.keras.preprocessing.text.Tokenizer` for text preprocessing, pad all sentences to the same length, and load GloVe embeddings for token encoding. We then build the model with `tensorflow.keras` and evaluate it with the `seqeval` package.

---

In [1]:
import os
import numpy as np
from pprint import pprint
from utils import dataio, kerasutils, modelutils
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical, plot_model
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from seqeval.metrics import classification_report

## Load Dataset

Thanks to the author of [this repo](https://github.com/dice-group/FOX/blob/master/input/Wikiner/aij-wikiner-en-wp3.bz2) for making the WikiNER data easily available.

In [2]:
file_path = os.path.join('data', 'wikiner-en-wp3-raw.txt')
sentences, tags, output_labels = dataio.load_wikiner(file_path, token_only=True)

Read 142153 sentences.


In [3]:
print("Labels:", output_labels)

Labels: {'I-MISC', 'B-ORG', 'B-PER', 'O', 'I-ORG', 'B-MISC', 'I-LOC', 'B-LOC', 'I-PER'}


In [4]:
print("Sentence Example:")
for i in range(len(sentences[1])):
    print(f'{sentences[1][i]:15}  |  {tags[1][i]}')

Sentence Example:
In               |  O
the              |  O
end              |  O
,                |  O
for              |  O
anarchist        |  O
historian        |  O
Daniel           |  I-PER
Guerin           |  I-PER
"                |  O
Some             |  O
anarchists       |  O
are              |  O
more             |  O
individualistic  |  O
than             |  O
social           |  O
,                |  O
some             |  O
more             |  O
social           |  O
than             |  O
individualistic  |  O
.                |  O


---

## Text preprocessing and token encoding

#### Token Ordinal Encoding

In [5]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
X = tokenizer.texts_to_sequences(sentences)

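# output_labels is a set, so the index assigned to each tag can differ between runs;
# iterate over sorted(output_labels) instead if a reproducible mapping is needed.
tag2idx = { tag: idx for idx, tag in enumerate(output_labels) }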
print(tag2idx)

{'I-MISC': 0, 'B-ORG': 1, 'B-PER': 2, 'O': 3, 'I-ORG': 4, 'B-MISC': 5, 'I-LOC': 6, 'B-LOC': 7, 'I-PER': 8}


In [6]:
idx2tag = { idx: tag for tag, idx in tag2idx.items() }
tags = [[tag2idx[tag] for tag in sentence] for sentence in tags]

In [7]:
print(sentences[1])
print(X[1])
for i in X[1]:
    print(f'{i:6} | {tokenizer.index_word[i]}')

['In', 'the', 'end', ',', 'for', 'anarchist', 'historian', 'Daniel', 'Guerin', '"', 'Some', 'anarchists', 'are', 'more', 'individualistic', 'than', 'social', ',', 'some', 'more', 'social', 'than', 'individualistic', '.']
[5, 1, 160, 2, 14, 4838, 2337, 2371, 55954, 10, 56, 7102, 31, 54, 21657, 71, 414, 2, 56, 54, 414, 71, 21657, 3]
     5 | in
     1 | the
   160 | end
     2 | ,
    14 | for
  4838 | anarchist
  2337 | historian
  2371 | daniel
 55954 | guerin
    10 | "
    56 | some
  7102 | anarchists
    31 | are
    54 | more
 21657 | individualistic
    71 | than
   414 | social
     2 | ,
    56 | some
    54 | more
   414 | social
    71 | than
 21657 | individualistic
     3 | .


In [8]:
vocabulary_size = len(tokenizer.word_counts)
print(vocabulary_size)

108276


In [9]:
print(sentences[1])
print(tags[1])
for i in tags[1]:
    print(f'{i} : {idx2tag[i]}')

['In', 'the', 'end', ',', 'for', 'anarchist', 'historian', 'Daniel', 'Guerin', '"', 'Some', 'anarchists', 'are', 'more', 'individualistic', 'than', 'social', ',', 'some', 'more', 'social', 'than', 'individualistic', '.']
[3, 3, 3, 3, 3, 3, 3, 8, 8, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
3 : O
3 : O
3 : O
3 : O
3 : O
3 : O
3 : O
8 : I-PER
8 : I-PER
3 : O
3 : O
3 : O
3 : O
3 : O
3 : O
3 : O
3 : O
3 : O
3 : O
3 : O
3 : O
3 : O
3 : O
3 : O


#### Sequence Padding

The model expects input sequences of a fixed length. We choose a sequence length based on the distribution of sentence lengths in the dataset, then pad shorter sentences and truncate longer ones.
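
As a rough illustration of what post-padding and post-truncation do to the encoded sentences, the sketch below reimplements the idea with NumPy. The helper name `pad_or_truncate` and the toy sequences are ours, introduced only for this example; the notebook itself relies on `pad_sequences`.

```python
import numpy as np

def pad_or_truncate(sequences, maxlen, value=0):
    """Post-pad with `value` and post-truncate every sequence to `maxlen`."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = seq[:maxlen]              # truncating='post': drop the tail
        out[row, :len(trimmed)] = trimmed   # padding='post': the rest keeps `value`
    return out

# Toy token-id sequences (illustrative values only).
toy = [[5, 1, 160], [2, 14, 4838, 2337, 2371, 55954]]
padded = pad_or_truncate(toy, maxlen=4)
print(padded)
```

The first toy sequence is shorter than `maxlen` and gets a trailing `0`; the second is longer and loses its last two ids. The same logic applies to the tag sequences, except that the fill value there is the index of the `O` tag.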

In [10]:
sequence_len = np.array([len(s) for s in sentences])
longest_sequence = sequence_len.max()
print(f'Longest sequence: {longest_sequence}')

print([(str(p) + '%', np.percentile(sequence_len, p)) for p in range(75,101, 5)])

Longest sequence: 173
[('75%', 31.0), ('80%', 33.0), ('85%', 36.0), ('90%', 40.0), ('95%', 46.0), ('100%', 173.0)]


In [11]:
max_sequence_len = 50
X = pad_sequences(X, maxlen=max_sequence_len, padding='post', truncating='post')

y = pad_sequences(tags, maxlen=max_sequence_len, value=tag2idx['O'], padding='post', truncating='post')
y = to_categorical(y, num_classes=len(output_labels), dtype='int32')

In [12]:
tokenizer.index_word[0] = '_PAD_'

In [13]:
X = np.array(X)
y = np.array(y)

In [14]:
print(X.shape)
print(y.shape)

(142153, 50)
(142153, 50, 9)


## Build, train and evaluate an LSTM model

The functions that load the GloVe embeddings and create the LSTM model are in the `utils/kerasutils.py` module. Training uses early stopping: it ends when the validation loss has not improved for a set number of epochs (the patience).
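
To make the embedding step more concrete, here is a minimal sketch of how a GloVe text file can be turned into an embedding matrix indexed by the tokenizer's word ids. It is not the code in `utils/kerasutils.py`, which is not shown here: the helper name `build_embedding_matrix`, the choice to leave unknown words as zero vectors, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def build_embedding_matrix(lines, word_index, embedding_dim):
    """Hypothetical helper: map each tokenizer index to its GloVe vector."""
    vectors = {}
    for line in lines:
        # Each GloVe line is a word followed by space-separated floats.
        word, *values = line.rstrip().split(" ")
        vectors[word] = np.asarray(values, dtype="float32")

    # Row 0 stays all-zero for the padding index (assumption), and so do
    # words that have no pretrained vector.
    matrix = np.zeros((len(word_index) + 1, embedding_dim), dtype="float32")
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None and vec.shape[0] == embedding_dim:
            matrix[idx] = vec
    return matrix

# Toy stand-ins for the GloVe file and for tokenizer.word_index.
toy_glove = ["the 0.1 0.2 0.3", "end 0.4 0.5 0.6"]
toy_index = {"the": 1, "end": 2, "guerin": 3}
matrix = build_embedding_matrix(toy_glove, toy_index, embedding_dim=3)
print(matrix.shape)
```

With a real embedding file, the `lines` argument could be an open file handle, since iterating over it yields one line at a time.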

In [15]:
USE_GLOVE=True

In [16]:
glove_matrix=None
if USE_GLOVE:
    glove_embedding_path = os.path.join('embeddings', 'glove.6B.100d.txt')
    embedding_dim = 100
    glove_matrix = kerasutils.load_glove_embedding_matrix(glove_embedding_path, tokenizer.word_index, embedding_dim)

Found 400001 word vectors.


In [17]:
model = kerasutils.create_paper_BiLSTM(vocabulary_size+1, max_sequence_len, len(output_labels), 
                                 use_glove=USE_GLOVE, glove_matrix=glove_matrix)

# Early Stopping on validation loss
early_stopping_callback = EarlyStopping(
    monitor="val_loss", 
    min_delta=0.01, 
    patience=3, 
    verbose=1, 
    mode="auto", 
    restore_best_weights=True
)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 50, 100)           10827700  
_________________________________________________________________
dropout (Dropout)            (None, 50, 100)           0         
_________________________________________________________________
bidirectional (Bidirectional (None, 50, 400)           481600    
_________________________________________________________________
dense (Dense)                (None, 50, 9)             3609      
Total params: 11,312,909
Trainable params: 11,312,909
Non-trainable params: 0
_________________________________________________________________


In [18]:
# Create train/validation/test split (64% / 16% / 20% of the data)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3791)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=3791)
batch_size = 10

In [19]:
kerasutils.print_model_memory_usage(batch_size, model)

Model size: 44.317 MB


In [20]:
%%time
history = model.fit(X_train, 
          y_train, 
          batch_size=batch_size, 
          epochs=50,
          verbose=1,
          callbacks=[early_stopping_callback],
          validation_data=(X_valid, y_valid)
         )

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 00007: early stopping
Wall time: 2h 34min 54s
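
The log above ends with `Epoch 00007: early stopping`. The sketch below shows, in simplified form, how a patience-based criterion with `min_delta=0.01` and `patience=3` decides when to stop. It is a simplification written for this notebook, not the Keras implementation, and the validation losses are made up for illustration; they are not the values from the run above.

```python
def early_stopping_epoch(val_losses, min_delta=0.01, patience=3):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:   # only a drop larger than min_delta counts
            best = loss
            wait = 0
        else:
            wait += 1                 # one more epoch without improvement
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses, for illustration only.
toy_losses = [0.30, 0.20, 0.15, 0.12, 0.115, 0.118, 0.113, 0.11]
print(early_stopping_epoch(toy_losses))
```

In this toy series the loss keeps decreasing slightly after epoch 4, but never by more than `min_delta`, so three epochs pass without a counted improvement and the function returns 7. With `restore_best_weights=True`, Keras then restores the weights from the best epoch.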


---

## Evaluation
We evaluate three aspects of the model:
* **Memory consumption**, using the `kerasutils.print_model_memory_usage()` function (found [here](https://stackoverflow.com/questions/43137288/how-to-determine-needed-memory-of-keras-model)).
* **Prediction latency**, using the function `time.process_time()`.
* **F1-score** _on entities_ on the test set, using `seqeval`.
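
Scoring _on entities_ means that a prediction counts as correct only if the whole span and its type match, which is stricter than per-token accuracy. The sketch below illustrates the idea with a simplified micro-averaged F1. The helper names `extract_entities` and `entity_f1` are ours, and `seqeval` handles tagging schemes and edge cases with more care, so treat this as an illustration rather than a replacement.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from one IOB-tagged sentence."""
    entities, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):    # sentinel closes the last span
        prefix, _, etype = tag.partition("-")
        continues = prefix == "I" and etype == current
        if current is not None and not continues:
            entities.add((current, start, i - 1))   # close the open span
            current = None
        if prefix in ("B", "I") and not continues:
            start, current = i, etype               # open a new span
    return entities

def entity_f1(y_true, y_pred):
    """Micro-averaged F1 over exact (sentence, type, start, end) matches."""
    true_spans = {(s, *e) for s, sent in enumerate(y_true) for e in extract_entities(sent)}
    pred_spans = {(s, *e) for s, sent in enumerate(y_pred) for e in extract_entities(sent)}
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: the person span is only partially predicted.
y_true = [["O", "I-PER", "I-PER", "O", "I-LOC"]]
y_pred = [["O", "I-PER", "O", "O", "I-LOC"]]
print(round(entity_f1(y_true, y_pred), 3))
```

In the toy example four of the five tokens are tagged correctly, yet only one of the two entities matches exactly, so the entity-level F1 is 0.5.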

In [21]:
print(f'Model latency in prediction: {modelutils.compute_prediction_latency(X_test, model):.3} s')

Model latency in prediction: 0.00389 s
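
For reference, a per-sample latency of this kind can be measured along the following lines. This is a hypothetical sketch, not the body of `modelutils.compute_prediction_latency`, which is not shown in this notebook; the helper name `mean_latency`, the `repeats` parameter, and the stand-in prediction function are assumptions.

```python
import time

def mean_latency(predict_fn, samples, repeats=3):
    """Average CPU time per sample across `repeats` passes over `samples`."""
    start = time.process_time()
    for _ in range(repeats):
        for sample in samples:
            predict_fn(sample)
    elapsed = time.process_time() - start
    return elapsed / (repeats * len(samples))

# Stand-in for a model's predict call, used only to make the sketch runnable.
latency = mean_latency(lambda x: sum(v * v for v in x), [[1, 2, 3]] * 100)
print(f"{latency:.3} s")
```

Note that `time.process_time()` reports the CPU time of the current process and excludes time spent sleeping, so it can differ from wall-clock latency; `time.perf_counter()` may be the better choice when wall-clock time is what matters.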


In [22]:
datasets = [('Training Set', X_train, y_train), ('Test Set', X_test, y_test)]


In [23]:
for title, X, Y in datasets:
    Y_pred = model.predict(X, batch_size=128)
    Y_pred = np.array(np.argmax(Y_pred, axis=-1))
    Y = np.array(np.argmax(Y, axis=-1))
    Y, Y_pred = kerasutils.remove_seq_padding(X, Y, Y_pred)
    Y, Y_pred = modelutils.from_encode_to_literal_labels(Y, Y_pred, idx2tag)
    print(title)
    print(classification_report(Y, Y_pred, digits=3))
    print('\n')

Training Set
           precision    recall  f1-score   support

      LOC      0.816     0.878     0.846     54367
      ORG      0.752     0.733     0.743     31449
      PER      0.943     0.951     0.947     61086
     MISC      0.747     0.695     0.720     46733

micro avg      0.831     0.833     0.832    193635
macro avg      0.829     0.833     0.831    193635



Test Set
           precision    recall  f1-score   support

     MISC      0.685     0.626     0.654     14427
      PER      0.913     0.931     0.922     19192
      LOC      0.776     0.841     0.807     17119
      ORG      0.695     0.676     0.685      9760

micro avg      0.788     0.792     0.790     60498
macro avg      0.785     0.792     0.787     60498





---