# LSTM for Entity Extraction on CoNLL2003

In this notebook, we perform entity extraction on the CoNLL03 dataset using an LSTM-based neural network. We use `tf.keras.preprocessing.text.Tokenizer` for text preprocessing, pad all sentences to the same length, and load GloVe embeddings for token encoding; we then build the model with `tensorflow.keras`. Evaluation uses the `seqeval` package.

---

In [1]:
import urllib
import sklearn
import logging
import os
import numpy as np
from pprint import pprint
from utils import dataio, kerasutils, modelutils
from seqeval.metrics import classification_report
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical, plot_model
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

## Load Dataset
We load the CoNLL2003 dataset from [this GitHub repo](https://github.com/davidsbatista/NER-datasets/tree/master/CONLL2003). 
For each token we keep only the word string and the entity tag (in BIO notation), and we discard the PoS and syntactic chunk tags. The files contain one token per line, with features separated by whitespace; sentences are separated by an empty line.
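
To make the file layout concrete, here is a minimal sketch of a reader for this format. It is not the `dataio.load_conll_data` implementation (which is not shown in this notebook); `read_conll` is a hypothetical name, the `_` columns in the sample are placeholders for the discarded features, and skipping `-DOCSTART-` marker lines is an assumption about how document boundaries are handled.

```python
from typing import Iterable, List, Tuple


def read_conll(lines: Iterable[str]) -> Tuple[List[List[str]], List[List[str]]]:
    """Parse CoNLL-style lines into parallel lists of tokens and entity tags."""
    sentences, tags = [], []
    tokens, labels = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # A blank line (or a document marker) closes the current sentence.
            if tokens:
                sentences.append(tokens)
                tags.append(labels)
                tokens, labels = [], []
            continue
        fields = line.split()
        tokens.append(fields[0])    # first column: the word
        labels.append(fields[-1])   # last column: the entity tag
    if tokens:
        # Flush the last sentence when the input has no trailing blank line.
        sentences.append(tokens)
        tags.append(labels)
    return sentences, tags


# Hypothetical sample: '_' stands in for the columns we discard.
sample = """German _ _ B-MISC
call _ _ O
to _ _ O
boycott _ _ O
British _ _ B-MISC
lamb _ _ O
. _ _ O

Peter _ _ B-PER
Blackburn _ _ I-PER
"""

sentences, tags = read_conll(sample.splitlines())
print(sentences[0])
print(tags[0])
```

In practice the same function could be fed an open file handle instead of `sample.splitlines()`.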

In [2]:
data_dir = os.path.join('data', 'conll03')
raw_train, ner_train, output_labels = dataio.load_conll_data('train.txt', dir_path=data_dir, only_tokens=True)
raw_valid, ner_valid, _ = dataio.load_conll_data('valid.txt', dir_path=data_dir, only_tokens=True)
raw_test, ner_test, _ = dataio.load_conll_data('test.txt', dir_path=data_dir, only_tokens=True)

Reading file data\conll03\train.txt
Read 14027 sentences
Reading file data\conll03\valid.txt
Read 3249 sentences
Reading file data\conll03\test.txt
Read 3452 sentences


In [3]:
print("Labels:", output_labels)

Labels: {'I-PER', 'B-MISC', 'O', 'B-ORG', 'I-LOC', 'B-LOC', 'I-ORG', 'I-MISC', 'B-PER'}


In [4]:
print("Sentence Example:")
for i in range(len(raw_train[0])):
    print(f'{raw_train[0][i]:7}  |  {ner_train[0][i]}')

Sentence Example:
German   |  B-MISC
call     |  O
to       |  O
boycott  |  O
British  |  B-MISC
lamb     |  O
.        |  O


---

## Text preprocessing and token encoding

#### Token Ordinal Encoding

In [5]:
# integer encode sequences of words
token_tokenizer = Tokenizer()    # Automatically lowers tokens
token_tokenizer.fit_on_texts(raw_train + raw_valid + raw_test)
train_sequences = token_tokenizer.texts_to_sequences(raw_train)
test_sequences = token_tokenizer.texts_to_sequences(raw_test)
valid_sequences = token_tokenizer.texts_to_sequences(raw_valid)

# Dictionaries for id <-> string conversion of labels
tag2idx = { tag: idx for idx, tag in enumerate(output_labels) }
idx2tag = { idx: tag for tag, idx in tag2idx.items() }

# Label encoding
ner_train_sequences = [[tag2idx[tag] for tag in sentence] for sentence in ner_train]
ner_test_sequences  = [[tag2idx[tag] for tag in sentence] for sentence in ner_test ]
ner_valid_sequences = [[tag2idx[tag] for tag in sentence] for sentence in ner_valid]

In [6]:
print(raw_test[0])
print(test_sequences[0])
for i in test_sequences[0]:
    print(f'{i:6} | {token_tokenizer.index_word[i]}')

['JAPAN', 'GET', 'LUCKY', 'WIN', ',', 'CHINA', 'IN', 'SURPRISE', 'DEFEAT', '.']
[210, 481, 4284, 161, 2, 175, 5, 2047, 946, 3]
   210 | japan
   481 | get
  4284 | lucky
   161 | win
     2 | ,
   175 | china
     5 | in
  2047 | surprise
   946 | defeat
     3 | .


In [7]:
vocabulary_size = len(token_tokenizer.word_counts)
print('Vocabulary dimension:', vocabulary_size)

Vocabulary dimension: 26861


In [8]:
print(raw_train[0])
print(ner_train_sequences[0])
for i in ner_train_sequences[0]:
    print(f'{i} : {idx2tag[i]}')

['German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
[1, 2, 2, 2, 1, 2, 2]
1 : B-MISC
2 : O
2 : O
2 : O
1 : B-MISC
2 : O
2 : O


#### Sequence Padding

To batch the sentences into a single array for the LSTM model, all input sequences must have the same length. We choose a sequence length based on the distribution of sentence lengths in the dataset, then pad the shorter sentences and truncate the longer ones. 
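
The trade-off behind the choice of length can be sketched in plain Python: a larger length truncates fewer sentences but adds more padding. The helper names `coverage` and `pad_or_truncate` and the toy lengths below are illustrative assumptions, not part of the notebook's `utils` package; the notebook itself relies on `pad_sequences` for the actual padding.

```python
from typing import List, Sequence


def coverage(lengths: Sequence[int], max_len: int) -> float:
    """Fraction of sequences that fit within max_len without truncation."""
    return sum(1 for n in lengths if n <= max_len) / len(lengths)


def pad_or_truncate(seq: Sequence[int], max_len: int, pad_value: int = 0) -> List[int]:
    """Post-truncate, then post-pad with pad_value, to exactly max_len items."""
    clipped = list(seq[:max_len])
    return clipped + [pad_value] * (max_len - len(clipped))


toy_lengths = [7, 10, 22, 37, 113]              # made-up sentence lengths
print(coverage(toy_lengths, max_len=50))        # 0.8
print(pad_or_truncate([207, 709, 6], 5))        # [207, 709, 6, 0, 0]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 5))   # [1, 2, 3, 4, 5]
```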

In [9]:
sequence_len = np.array([len(s) for s in train_sequences])
longest_sequence = sequence_len.max()
print(f'Longest sequence: {longest_sequence}')

print([(str(p) + '%', np.percentile(sequence_len, p)) for p in range(75,101, 5)])

Longest sequence: 113
[('75%', 22.0), ('80%', 26.0), ('85%', 29.0), ('90%', 32.0), ('95%', 37.69999999999891), ('100%', 113.0)]


In [10]:
max_sequence_len = 50
X_train = pad_sequences(train_sequences, maxlen=max_sequence_len, padding='post', truncating='post')
X_test = pad_sequences(test_sequences, maxlen=max_sequence_len, padding='post', truncating='post')
X_valid = pad_sequences(valid_sequences, maxlen=max_sequence_len, padding='post', truncating='post')

Y_train = pad_sequences(ner_train_sequences, maxlen=max_sequence_len, value=tag2idx['O'], padding='post', truncating='post')
Y_test = pad_sequences(ner_test_sequences, maxlen=max_sequence_len, value=tag2idx['O'], padding='post', truncating='post')
Y_valid = pad_sequences(ner_valid_sequences, maxlen=max_sequence_len, value=tag2idx['O'], padding='post', truncating='post')

# Convert labels from ids to one-hot vectors
Y_train = to_categorical(Y_train, num_classes=len(output_labels), dtype='int32')
Y_test = to_categorical(Y_test, num_classes=len(output_labels), dtype='int32')
Y_valid = to_categorical(Y_valid, num_classes=len(output_labels), dtype='int32')

In [11]:
for i in range(len(X_train[0])):
    print(f'{X_train[0][i]:6} | {Y_train[0][i]}')

   207 | [0 1 0 0 0 0 0 0 0]
   709 | [0 0 1 0 0 0 0 0 0]
     6 | [0 0 1 0 0 0 0 0 0]
  3628 | [0 0 1 0 0 0 0 0 0]
   228 | [0 1 0 0 0 0 0 0 0]
  7656 | [0 0 1 0 0 0 0 0 0]
     3 | [0 0 1 0 0 0 0 0 0]
     0 | [0 0 1 0 0 0 0 0 0]
     0 | [0 0 1 0 0 0 0 0 0]
     0 | [0 0 1 0 0 0 0 0 0]
     0 | [0 0 1 0 0 0 0 0 0]
     0 | [0 0 1 0 0 0 0 0 0]
     0 | [0 0 1 0 0 0 0 0 0]
     0 | [0 0 1 0 0 0 0 0 0]
     0 | [0 0 1 0 0 0 0 0 0]
     0 | [0 0 1 0 0 0 0 0 0]
     0 | [0 0 1 0 0 0 0 0 0]
     0 | [0 0 1 0 0 0 0 0 0]
     0 | [0 0 1 0 0 0 0 0 0]
     0 | [0 0 1 0 0 0 0 0 0]
     0 | [0 0 1 0 0 0 0 0 0]
     0 | [0 0 1 0 0 0 0 0 0]
     0 | [0 0 1 0 0 0 0 0 0]
     0 | [0 0 1 0 0 0 0 0 0]
     0 | [0 0 1 0 0 0 0 0 0]
     0 | [0 0 1 0 0 0 0 0 0]
     0 | [0 0 1 0 0 0 0 0 0]
     0 | [0 0 1 0 0 0 0 0 0]
     0 | [0 0 1 0 0 0 0 0 0]
     0 | [0 0 1 0 0 0 0 0 0]
     0 | [0 0 1 0 0 0 0 0 0]
     0 | [0 0 1 0 0 0 0 0 0]
     0 | [0 0 1 0 0 0 0 0 0]
     0 | [0 0 1 0 0 0 0 0 0]
     0 | [0 0 

In [12]:
token_tokenizer.index_word[0] = '_PAD_'

In [13]:
X_train = np.array(X_train)
Y_train = np.array(Y_train)
X_test = np.array(X_test)
Y_test = np.array(Y_test)
X_valid = np.array(X_valid)
Y_valid = np.array(Y_valid)

In [14]:
# Final training set dimensionalities
print(X_train.shape)
print(Y_train.shape)

(14027, 50)
(14027, 50, 9)


## Build, train and evaluate an LSTM model

The function that loads the GloVe embeddings and the function that creates the LSTM model are in the `utils/kerasutils.py` module. The stopping criterion is early stopping with patience on the validation loss.
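
As a rough illustration of the stopping rule, the sketch below replays a list of validation losses and reports the epoch at which training would halt. It is a simplified approximation of patience-based early stopping, not the Keras `EarlyStopping` source; the function name `early_stopping_epoch` and the loss values are made up for the example.

```python
from typing import List, Optional


def early_stopping_epoch(val_losses: List[float], patience: int = 3,
                         min_delta: float = 0.01) -> Optional[int]:
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # Improvement larger than min_delta: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses, not taken from the training run in this notebook.
losses = [0.30, 0.20, 0.15, 0.149, 0.16, 0.155, 0.14]
print(early_stopping_epoch(losses))  # 6
```

With `min_delta=0.01`, the drop from 0.15 to 0.149 at epoch 4 does not count as an improvement, so the patience counter starts there and runs out at epoch 6.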

In [15]:
USE_GLOVE=True

In [16]:
glove_matrix=None
if USE_GLOVE:
    glove_embedding_path = os.path.join('embeddings', 'glove.6B.100d.txt')
    embedding_dim = 100
    glove_matrix = kerasutils.load_glove_embedding_matrix(glove_embedding_path, token_tokenizer.word_index, embedding_dim)

Found 400001 word vectors.
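
For readers without access to `utils/kerasutils.py`, this is one way such an embedding matrix could be assembled. It is a sketch under assumptions, not the actual `load_glove_embedding_matrix` code: `build_embedding_matrix` is a hypothetical name, it takes an iterable of text lines rather than a file path, and it leaves words missing from GloVe as all-zero rows.

```python
from typing import Dict, Iterable, List


def build_embedding_matrix(glove_lines: Iterable[str],
                           word_index: Dict[str, int],
                           embedding_dim: int) -> List[List[float]]:
    """Build a (len(word_index) + 1) x embedding_dim matrix from GloVe text lines."""
    vectors = {}
    for line in glove_lines:
        # Each GloVe line is a word followed by space-separated floats.
        parts = line.rstrip().split(" ")
        if len(parts) != embedding_dim + 1:
            continue  # skip lines that do not match the expected dimension
        vectors[parts[0]] = [float(x) for x in parts[1:]]

    # Row 0 is reserved for padding; words missing from GloVe stay all-zero.
    matrix = [[0.0] * embedding_dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix


# Toy 3-dimensional "GloVe" lines and a toy tokenizer index (both made up).
toy_glove = ["japan 0.1 0.2 0.3", "china 0.4 0.5 0.6"]
toy_index = {"japan": 1, "lucky": 2, "china": 3}
toy_matrix = build_embedding_matrix(toy_glove, toy_index, embedding_dim=3)
print(toy_matrix)
```

The result can be wrapped with `np.asarray(...)` before being handed to an `Embedding` layer. The extra row at index 0 is the reason the model below is created with `vocabulary_size+1`.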


The code that builds the model is the following:
```python
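from tensorflow.keras.layers import Bidirectional, Dense, Dropout, Embedding, LSTM
from tensorflow.keras.metrics import Precision, Recall
from tensorflow.keras.models import Sequential


def create_paper_BiLSTM(vocabulary_size, seq_len, n_classes, hidden_cells=200, 
                        embed_dim=100, drop=0.4, use_glove=False, glove_matrix=None):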
    """Create a BiLSTM model using keras, given its parameters"""
    
    model = Sequential()
    if use_glove:
        model.add(Embedding(vocabulary_size, embed_dim, 
                            weights=[glove_matrix], input_length=seq_len,
                            mask_zero=True, trainable=True))
    else:
        model.add(Embedding(vocabulary_size, embed_dim, input_length=seq_len, 
                            mask_zero=True))
    model.add(Dropout(drop))
    
    model.add(Bidirectional(LSTM(hidden_cells, return_sequences=True, 
                                 dropout=drop)))

    model.add(Dense(n_classes, activation='softmax'))
    
    model.compile(loss='categorical_crossentropy', 
                  optimizer='adam',
                  metrics=['accuracy',
                           Precision(),
                           Recall()])
    model.summary()
    return model
```

In [17]:
model = kerasutils.create_paper_BiLSTM(vocabulary_size+1, max_sequence_len, 
                                       len(output_labels),
                                       use_glove=USE_GLOVE, 
                                       glove_matrix=glove_matrix)

# Early Stopping on validation loss
early_stopping_callback = EarlyStopping(
    monitor="val_loss", 
    min_delta=0.01, 
    patience=3, 
    verbose=1, 
    mode="auto", 
    restore_best_weights=True
)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 50, 100)           2686200   
_________________________________________________________________
dropout (Dropout)            (None, 50, 100)           0         
_________________________________________________________________
bidirectional (Bidirectional (None, 50, 400)           481600    
_________________________________________________________________
dense (Dense)                (None, 50, 9)             3609      
Total params: 3,171,409
Trainable params: 3,171,409
Non-trainable params: 0
_________________________________________________________________
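
The parameter counts in the summary can be checked by hand. The short calculation below reproduces them from the hyperparameters used in this notebook (vocabulary of 26861 tokens plus the padding row, 100-dimensional embeddings, 200 hidden cells per direction, 9 labels); the variable names are chosen for illustration only.

```python
vocab_rows = 26861 + 1   # vocabulary_size + 1 (row 0 is the padding index)
embed_dim = 100
hidden = 200
n_classes = 9

# Embedding: one embed_dim-sized vector per vocabulary row.
embedding_params = vocab_rows * embed_dim

# One LSTM direction: 4 gates, each with input weights, recurrent weights and a bias.
lstm_one_direction = 4 * ((embed_dim + hidden) * hidden + hidden)
bilstm_params = 2 * lstm_one_direction

# Dense layer applied to the concatenated forward and backward outputs.
dense_params = (2 * hidden) * n_classes + n_classes

total_params = embedding_params + bilstm_params + dense_params
print(embedding_params, bilstm_params, dense_params, total_params)
# 2686200 481600 3609 3171409
```

The embedding layer accounts for most of the parameters, which is typical when the vocabulary is much larger than the hidden size.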


In [18]:
%%time
batch_size = 10
history = model.fit(X_train, 
          Y_train, 
          batch_size=batch_size, 
          epochs=50,
          verbose=1,
          callbacks=[early_stopping_callback],
          validation_data=(X_valid, Y_valid)
         )

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 00006: early stopping
Wall time: 10min 23s


---

## Evaluation
We evaluate three aspects of the model:
* **Memory consumption** using the `kerasutils.print_model_memory_usage()` function (found [here](https://stackoverflow.com/questions/43137288/how-to-determine-needed-memory-of-keras-model));
* **Latency in prediction** using the function `time.process_time()`;
* **F1-score** _on entities_ on the test set using `seqeval`;
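
Scoring "on entities" means that a prediction counts as correct only when the whole span and its type match, not token by token. The sketch below illustrates the idea with a micro-averaged F1 over BIO spans. It is a simplified stand-in, not the `seqeval` implementation (which handles more tagging schemes and edge cases); `extract_spans` and `entity_f1` are hypothetical names.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, int, str]   # (sentence index, start, end exclusive, type)


def extract_spans(tags: List[str], sent_id: int = 0) -> Set[Span]:
    """Collect entity spans from one sentence of BIO tags."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):   # the sentinel closes a trailing span
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == etype and start is not None
        if not continues:
            if start is not None:
                spans.add((sent_id, start, i, etype))
            # A 'B-' tag (or a stray 'I-' tag) opens a new span.
            start, etype = (i, label) if prefix in ("B", "I") else (None, None)
    return spans


def entity_f1(y_true: List[List[str]], y_pred: List[List[str]]) -> float:
    """Micro-averaged F1 over exactly matching entity spans."""
    true_spans, pred_spans = set(), set()
    for k, (t, p) in enumerate(zip(y_true, y_pred)):
        true_spans |= extract_spans(t, k)
        pred_spans |= extract_spans(p, k)
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [["B-PER", "I-PER", "O", "B-LOC"]]
missed_loc = [["B-PER", "I-PER", "O", "O"]]        # one entity not found
partial_per = [["B-PER", "O", "O", "B-LOC"]]       # PER span cut short
print(round(entity_f1(gold, missed_loc), 3))   # 0.667
print(entity_f1(gold, partial_per))            # 0.5
```

In the second case three of the four tokens are tagged correctly, yet the truncated `PER` span earns no credit, which is why entity-level F1 is usually lower than token accuracy.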

In [19]:
kerasutils.print_model_memory_usage(batch_size, model)

Model size: 13.26 MB


In [20]:
print(f'Model latency in predictions: {modelutils.compute_prediction_latency(X_test, model):.3} s')

Model latency in predictions: 0.00489 s
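
The body of `modelutils.compute_prediction_latency` is not shown here, so the following is only a guess at the general approach: time repeated predictions with `time.process_time()` and divide by the number of calls. The name `mean_latency` and the dummy predictor are assumptions for illustration.

```python
import time
from typing import Any, Callable, Sequence


def mean_latency(predict: Callable[[Any], Any], samples: Sequence[Any],
                 repeats: int = 3) -> float:
    """Average CPU seconds per prediction over `repeats` passes on `samples`."""
    start = time.process_time()
    for _ in range(repeats):
        for sample in samples:
            predict(sample)
    elapsed = time.process_time() - start
    return elapsed / (repeats * len(samples))


def dummy_predict(tokens):
    # Stand-in for a model call: just does a little arithmetic.
    return sum(t * t for t in tokens)


latency = mean_latency(dummy_predict, [[1, 2, 3]] * 100)
print(f"Mean latency: {latency:.3} s")
```

Note that `time.process_time()` counts CPU time of the current process and excludes time spent sleeping, so it may differ from wall-clock latency; `time.perf_counter()` is the usual alternative when wall-clock time matters.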


In [21]:
datasets = [('Training Set', X_train, Y_train), 
            ('Test Set', X_test, Y_test), 
            ('Validation Set', X_valid, Y_valid)]

for title, X, Y in datasets:
    # Get predictions: for each token the model outputs a vector 
    # of class probabilities
    Y_pred = model.predict(X, batch_size=batch_size)
    # We choose as category the one with the highest probability
    Y_pred = np.array(np.argmax(Y_pred, axis=-1))
    # Also flatten true labels
    Y = np.array(np.argmax(Y, axis=-1))
    # Remove padding from predictions and labels
    Y, Y_pred = kerasutils.remove_seq_padding(X, Y, Y_pred)
    # Convert entity ids back to their string labels
    let_y_true, let_y_pred = modelutils.from_encode_to_literal_labels(Y, Y_pred, idx2tag)
    
    print(title)
    print(classification_report(let_y_true, let_y_pred, digits=3))
    print('\n')

Training Set
           precision    recall  f1-score   support

      PER      0.971     0.973     0.972      6589
      ORG      0.890     0.834     0.861      6312
     MISC      0.870     0.813     0.841      3435
      LOC      0.935     0.949     0.942      7134

micro avg      0.925     0.905     0.915     23470
macro avg      0.924     0.905     0.914     23470



Test Set
           precision    recall  f1-score   support

      ORG      0.826     0.715     0.766      1657
      PER      0.932     0.904     0.918      1579
      LOC      0.831     0.886     0.858      1655
     MISC      0.734     0.693     0.713       700

micro avg      0.846     0.816     0.831      5591
macro avg      0.846     0.816     0.829      5591



Validation Set
           precision    recall  f1-score   support

      LOC      0.904     0.929     0.916      1834
      ORG      0.842     0.743     0.790      1338
      PER      0.940     0.937     0.938      1796
     MISC      0.827     0.754    

---

### Bonus: visualize results

In [24]:
i = 5
sentence = X_test[i]
y_pred = model.predict(X_test)
y_pred = np.argmax(y_pred, axis=-1)
y_pred = y_pred[i]
y_true = np.argmax(Y_test, axis=-1)[i]

print('      TOKEN      TRUE Y | PRED Y')
print('='*34)
for idx in range(len(sentence)):
    print(f'{token_tokenizer.index_word[sentence[idx]]:15}  {idx2tag[y_true[idx]]:6} | {idx2tag[y_pred[idx]]}')


      TOKEN      TRUE Y | PRED Y
china            B-LOC  | B-LOC
controlled       O      | O
most             O      | O
of               O      | O
the              O      | O
match            O      | O
and              O      | O
saw              O      | O
several          O      | O
chances          O      | O
missed           O      | O
until            O      | O
the              O      | O
78th             O      | O
minute           O      | O
when             O      | O
uzbek            B-MISC | B-MISC
striker          O      | O
igor             B-PER  | B-PER
shkvyrin         I-PER  | I-PER
took             O      | O
advantage        O      | O
of               O      | O
a                O      | O
misdirected      O      | O
defensive        O      | O
header           O      | O
to               O      | O
lob              O      | O
the              O      | O
ball             O      | O
over             O      | O
the              O      | O
advancing        O      | 

---