# CNN + GloVe + BiLSTM + CRF model for Entity Extraction on CoNLL 03

In this notebook, we implement the neural network model described in [this paper](https://www.aclweb.org/anthology/P16-1101.pdf). This model is composed of:
* A CNN that extracts morphological character-level features;
* GloVe 6B 100-dimensional embeddings for word-level information;
* A BiLSTM layer followed by a CRF layer for predictions.

Preprocessing consists of encoding tokens as integers, padding sentences to a fixed length, and padding the character sequence of each word to a fixed length. We then implement the model with `tensorflow.keras`, using the `tf2crf` package for a TensorFlow-compatible CRF layer. We test it on the CoNLL03 English dataset, using the `seqeval` package for F1-score evaluation.

---

In [34]:
import os
import string
import numpy as np
from pprint import pprint
from utils import dataio, kerasutils, modelutils
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from seqeval.metrics import classification_report

## Load Dataset
We load the CoNLL 2003 dataset from [this GitHub repo](https://github.com/davidsbatista/NER-datasets/tree/master/CONLL2003). 
For each token we keep only the word string and the entity tag (in BIO notation), and we discard the other annotation columns (PoS and syntactic chunk tags). The files contain one token per line with whitespace-separated features, and sentences are separated by an empty line.
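
The file layout described above can be read with a few lines of Python. The sketch below is not the `dataio.load_conll_data` helper used in this notebook (its source is not shown here). `read_conll` is a hypothetical function, and it assumes that the word is the first column, that the entity tag is the last column, and that `-DOCSTART-` marker lines should be skipped.

```python
def read_conll(lines):
    """Parse CoNLL-style lines into parallel lists of tokens and entity tags.

    Assumes one token per line, whitespace-separated columns with the word
    first and the entity tag last, and an empty line between sentences.
    """
    sentences, tags = [], []
    cur_tokens, cur_tags = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith('-DOCSTART-'):
            # Sentence boundary (or document marker): flush the current sentence
            if cur_tokens:
                sentences.append(cur_tokens)
                tags.append(cur_tags)
                cur_tokens, cur_tags = [], []
            continue
        columns = line.split()
        cur_tokens.append(columns[0])
        cur_tags.append(columns[-1])
    if cur_tokens:  # the input may not end with an empty line
        sentences.append(cur_tokens)
        tags.append(cur_tags)
    return sentences, tags


# Sample lines that follow the four-column layout (word, PoS, chunk, entity)
sample = [
    'German JJ B-NP B-MISC',
    'call NN I-NP O',
    '',
    'Peter NNP B-NP B-PER',
]
tokens, labels = read_conll(sample)
print(tokens)
print(labels)
```

With a real file, the same function could be called on an open file handle, since iterating over a file yields its lines.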

In [2]:
data_dir = os.path.join('data', 'conll03')
raw_train, ner_train, output_labels = dataio.load_conll_data('train.txt', dir_path=data_dir, only_tokens=True)
raw_valid, ner_valid, _ = dataio.load_conll_data('valid.txt', dir_path=data_dir, only_tokens=True)
raw_test, ner_test, _ = dataio.load_conll_data('test.txt', dir_path=data_dir, only_tokens=True)

Reading file data\conll03\train.txt
Read 14027 sentences
Reading file data\conll03\valid.txt
Read 3249 sentences
Reading file data\conll03\test.txt
Read 3452 sentences


In [3]:
print('Labels:', output_labels)

Labels: {'B-PER', 'I-MISC', 'I-PER', 'B-LOC', 'B-MISC', 'I-LOC', 'I-ORG', 'B-ORG', 'O'}


In [4]:
print("Sentence Example:")
for i in range(len(raw_train[0])):
    print(f'{raw_train[0][i]:7}  |  {ner_train[0][i]}')

Sentence Example:
German   |  B-MISC
call     |  O
to       |  O
boycott  |  O
British  |  B-MISC
lamb     |  O
.        |  O


---

# Data Preparation
Prepare character- and word-level input for the model.

## Sentence encoding and padding
We use a Keras `Tokenizer` to extract the vocabulary and encode words as integers. We then pad (or truncate) sentences to a fixed length so that they can be stacked into the fixed-shape arrays that the model takes as input.
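
To make the two steps concrete, here is a minimal pure-Python sketch of what encoding and padding amount to. It does not reproduce the Keras implementation: `build_vocab`, `encode` and `pad_post` are hypothetical helpers, and the indexing scheme (most frequent word first, index 0 reserved for padding, unknown words dropped) is an assumption meant to mirror the usual `Tokenizer` behaviour.

```python
from collections import Counter

def build_vocab(sentences):
    """Map each lowercased word to an integer >= 1 (0 is kept for padding)."""
    counts = Counter(word.lower() for sent in sentences for word in sent)
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(sentence, vocab):
    """Replace each word with its index, dropping out-of-vocabulary words."""
    return [vocab[w.lower()] for w in sentence if w.lower() in vocab]

def pad_post(seq, maxlen, value=0):
    """Truncate or right-pad a sequence to exactly maxlen items."""
    return seq[:maxlen] + [value] * max(0, maxlen - len(seq))

sentences = [['German', 'call', 'to', 'boycott', 'British', 'lamb', '.'],
             ['Peter', 'Blackburn', 'to', 'call', '.']]
vocab = build_vocab(sentences)
padded = [pad_post(encode(s, vocab), maxlen=10) for s in sentences]
for row in padded:
    print(row)
```

Every row has the same length, and the trailing zeros mark the positions that do not correspond to real tokens.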

In [5]:
# integer encode sequences of words
token_tokenizer = Tokenizer()    # Lowercases tokens by default (lower=True)
token_tokenizer.fit_on_texts(raw_train + raw_valid + raw_test)
train_sequences = token_tokenizer.texts_to_sequences(raw_train)
test_sequences = token_tokenizer.texts_to_sequences(raw_test)
valid_sequences = token_tokenizer.texts_to_sequences(raw_valid)

# Label encoding (the label set is sorted so that tag ids are the same on every run)
tag2idx = { tag: idx for idx, tag in enumerate(sorted(output_labels)) }
idx2tag = { idx: tag for tag, idx in tag2idx.items() }
ner_train_sequences = [[tag2idx[tag] for tag in sentence] for sentence in ner_train]
ner_test_sequences  = [[tag2idx[tag] for tag in sentence] for sentence in ner_test ]
ner_valid_sequences = [[tag2idx[tag] for tag in sentence] for sentence in ner_valid]

In [6]:
vocabulary_size = len(token_tokenizer.word_counts)
print(vocabulary_size)

26861


In [7]:
max_sentence_len = 50

# Sentence padding
X_sent_train = pad_sequences(train_sequences, maxlen=max_sentence_len, padding='post', truncating='post')
X_sent_test = pad_sequences(test_sequences, maxlen=max_sentence_len, padding='post', truncating='post')
X_sent_valid = pad_sequences(valid_sequences, maxlen=max_sentence_len, padding='post', truncating='post')

Y_train = pad_sequences(ner_train_sequences, maxlen=max_sentence_len, value=tag2idx['O'], padding='post', truncating='post')
Y_test = pad_sequences(ner_test_sequences, maxlen=max_sentence_len, value=tag2idx['O'], padding='post', truncating='post')
Y_valid = pad_sequences(ner_valid_sequences, maxlen=max_sentence_len, value=tag2idx['O'], padding='post', truncating='post')

X_sent_train = np.array(X_sent_train)
Y_train = np.array(Y_train)
X_sent_test = np.array(X_sent_test)
Y_test = np.array(Y_test)
X_sent_valid = np.array(X_sent_valid)
Y_valid = np.array(Y_valid)

In [8]:
token_tokenizer.index_word[0] = '_PAD_'
token_tokenizer.word_index['_PAD_'] = 0

In [22]:
print('Encoded and padded sentence:')
for i in range(len(X_sent_train[0][:20])):
    print(f'{X_sent_train[0][i]:6} | {token_tokenizer.index_word[X_sent_train[0][i]]}')

Encoded and padded sentence:
   207 | german
   709 | call
     6 | to
  3628 | boycott
   228 | british
  7656 | lamb
     3 | .
     0 | _PAD_
     0 | _PAD_
     0 | _PAD_
     0 | _PAD_
     0 | _PAD_
     0 | _PAD_
     0 | _PAD_
     0 | _PAD_
     0 | _PAD_
     0 | _PAD_
     0 | _PAD_
     0 | _PAD_
     0 | _PAD_


---

## Character encoding and padding
To extract character-level information, we have to:
* Encode characters as integers;
* Pad words to a fixed length;
* Use 0 as the padding code for both sentence padding and word padding.

We don't want to truncate words, because prefixes and suffixes carry valuable information, so we pad every word to the length of the longest word.
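
The target of these steps is a grid of character codes per sentence: one row per word, one column per character position. The following sketch shows one way to build it in plain Python. The helper names (`encode_word`, `pad_center`, `encode_sentence`) are hypothetical, and the choices made here are assumptions for illustration: an ASCII-only charset, characters outside the charset skipped, and padding split between both sides of the word with the extra pad on the left.

```python
import string

CHARSET = string.ascii_letters + string.digits + string.punctuation
# Index 0 is reserved for padding, so real characters start at 1
CHAR2IDX = {ch: idx for idx, ch in enumerate(CHARSET, start=1)}

def encode_word(word):
    """Integer-encode a word, skipping characters outside the charset."""
    return [CHAR2IDX[ch] for ch in word if ch in CHAR2IDX]

def pad_center(codes, maxlen, pad=0):
    """Pad on both sides so that the characters sit in the middle."""
    codes = codes[:maxlen]
    missing = maxlen - len(codes)
    left = (missing + 1) // 2      # the extra pad, if any, goes on the left
    return [pad] * left + codes + [pad] * (missing - left)

def encode_sentence(words, max_sentence_len, max_word_len):
    """Return a max_sentence_len x max_word_len grid of character codes."""
    rows = [pad_center(encode_word(w), max_word_len) for w in words[:max_sentence_len]]
    # Sentence padding: missing words become rows made only of padding codes
    rows += [[0] * max_word_len for _ in range(max_sentence_len - len(rows))]
    return rows

grid = encode_sentence(['German', 'call', '.'], max_sentence_len=5, max_word_len=8)
for row in grid:
    print(row)
```

Small lengths are used here only to keep the printed grid readable.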

In [10]:
def to_char_list(data):
    '''Transform all the words of a dataset into lists of characters'''
    
    char_data = []
    for sentence in data:
        char_sent = []
        for word in sentence:
            char_sent.append(list(word))
        char_data.append(char_sent)
    return char_data

In [11]:
raw_char_train = to_char_list(raw_train)
raw_char_test = to_char_list(raw_test)
raw_char_valid = to_char_list(raw_valid)

In [12]:
print('Sentence as char lists:')
for token in raw_char_train[0]:
    print(token)
print('='*30)
print(len(raw_char_train))

Sentence as char lists:
['G', 'e', 'r', 'm', 'a', 'n']
['c', 'a', 'l', 'l']
['t', 'o']
['b', 'o', 'y', 'c', 'o', 't', 't']
['B', 'r', 'i', 't', 'i', 's', 'h']
['l', 'a', 'm', 'b']
['.']
14027


In [13]:
# Sanity check of preprocessed data dimensions. If it does not output anything,
# everything is fine.
for sent_idx in range(len(raw_train)):
    if len(raw_char_train[sent_idx]) != len(train_sequences[sent_idx]):
        print('sequence len error')
        print(raw_char_train[sent_idx])
        print(train_sequences[sent_idx])
    for word_idx in range(len(raw_train[sent_idx])):
        if len(raw_char_train[sent_idx][word_idx]) != len(raw_train[sent_idx][word_idx]):
            print('word len error')

Character vocabulary:

In [14]:
# NOTE: Tokenizer also accepts char_level=True, which might make this code
# cleaner. We do not use it here because that approach would not give us a
# fixed length for words.
char_tokenizer = Tokenizer(lower=False, filters='')
# Build a list with all the characters
charset = string.ascii_letters + string.digits + string.punctuation
print(f'Charset dimension: {len(charset)}')
print(f'Charset: {charset}')
char_tokenizer.fit_on_texts(list(charset))

Charset dimension: 94
Charset: abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [15]:
# Add padding to the tokenizer with the 0 integer encoding
char_tokenizer.index_word[0] = '_PAD_'
char_tokenizer.word_index['_PAD_'] = 0

#### Pad sentences
Pad or truncate every sentence to `max_sentence_len` (50) words.

In [16]:
for dataset in [raw_char_train, raw_char_test, raw_char_valid]:
    for sent_idx in range(len(dataset)):
        if len(dataset[sent_idx]) > max_sentence_len:
            # Truncate long sentences
            dataset[sent_idx] = dataset[sent_idx][:max_sentence_len]
        while len(dataset[sent_idx]) < max_sentence_len:
            # Pad sentences with '_PAD_' words.
            # ATTENTION: these are not the integer padding tokens used for the
            # word-level input. Each padding word is a list that holds a single
            # '_PAD_' character, which is later encoded as 0.
            pad_word = []
            pad_word.append(char_tokenizer.index_word[0])
            dataset[sent_idx].append(pad_word)

In [21]:
print('Padded sentence:')
for token in raw_char_train[0][:20]:
    print(token)

Padded sentence:
['G', 'e', 'r', 'm', 'a', 'n']
['c', 'a', 'l', 'l']
['t', 'o']
['b', 'o', 'y', 'c', 'o', 't', 't']
['B', 'r', 'i', 't', 'i', 's', 'h']
['l', 'a', 'm', 'b']
['.']
['_PAD_']
['_PAD_']
['_PAD_']
['_PAD_']
['_PAD_']
['_PAD_']
['_PAD_']
['_PAD_']
['_PAD_']
['_PAD_']
['_PAD_']
['_PAD_']
['_PAD_']


In [17]:
len(char_tokenizer.word_index)

95

#### Encode characters with integers

In [18]:
char_seq_train = []
for sentence in raw_char_train:
    char_seq_train.append(char_tokenizer.texts_to_sequences(sentence))

char_seq_test = []
for sentence in raw_char_test:
    char_seq_test.append(char_tokenizer.texts_to_sequences(sentence))

char_seq_valid = []
for sentence in raw_char_valid:
    char_seq_valid.append(char_tokenizer.texts_to_sequences(sentence))


In [52]:
print('A human unfriendly encoded sentence:')
print(char_seq_train[0][:20])

A human unfriendly encoded sentence:
[[33, 5, 18, 13, 1, 14], [3, 1, 12, 12], [20, 15], [2, 15, 25, 3, 15, 20, 20], [28, 18, 9, 20, 9, 19, 8], [12, 1, 13, 2], [76], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0]]


#### Pad words 
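Pad every word to `max_word_len` characters. Padding is added on both sides so that the characters end up in the middle. Longer words would be truncated, but this should not happen because `max_word_len` is the length of the longest word in the vocabulary: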

In [25]:
def pad_words(sentence, maxlen, pad=0):
    '''Pad each encoded word of a sentence to maxlen, adding the padding on both sides'''
    padded_sentence = []
    for word in sentence:
        new_word = word.copy()
        if len(word) > maxlen:
            new_word = word[:maxlen]
        else:
            while maxlen - len(new_word) > 1:
                new_word.append(pad)
                new_word.insert(0, pad)
            if maxlen - len(new_word) == 1:
                new_word.insert(0, pad)
        padded_sentence.append(new_word)
    
    return padded_sentence

In [27]:
max_word_len = max([len(word) for word in token_tokenizer.word_index.keys()])
print('Word length:', max_word_len)

Word length: 52


In [28]:
X_char_train = np.array([pad_words(sentence, maxlen=max_word_len) for sentence in char_seq_train])
X_char_test  = np.array([pad_words(sentence, maxlen=max_word_len) for sentence in char_seq_test ])
X_char_valid = np.array([pad_words(sentence, maxlen=max_word_len) for sentence in char_seq_valid])

In [36]:
for word in X_char_train[0][:10]:
    print(word[20:32])
    print([char_tokenizer.index_word[char] for char in word[20:32]])
    print('='*30)

[ 0  0  0 33  5 18 13  1 14  0  0  0]
['_PAD_', '_PAD_', '_PAD_', 'G', 'e', 'r', 'm', 'a', 'n', '_PAD_', '_PAD_', '_PAD_']
[ 0  0  0  0  3  1 12 12  0  0  0  0]
['_PAD_', '_PAD_', '_PAD_', '_PAD_', 'c', 'a', 'l', 'l', '_PAD_', '_PAD_', '_PAD_', '_PAD_']
[ 0  0  0  0  0 20 15  0  0  0  0  0]
['_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', 't', 'o', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_']
[ 0  0  0  2 15 25  3 15 20 20  0  0]
['_PAD_', '_PAD_', '_PAD_', 'b', 'o', 'y', 'c', 'o', 't', 't', '_PAD_', '_PAD_']
[ 0  0  0 28 18  9 20  9 19  8  0  0]
['_PAD_', '_PAD_', '_PAD_', 'B', 'r', 'i', 't', 'i', 's', 'h', '_PAD_', '_PAD_']
[ 0  0  0  0 12  1 13  2  0  0  0  0]
['_PAD_', '_PAD_', '_PAD_', '_PAD_', 'l', 'a', 'm', 'b', '_PAD_', '_PAD_', '_PAD_', '_PAD_']
[ 0  0  0  0  0  0 76  0  0  0  0  0]
['_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '.', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_']
[0 0 0 0 0 0 0 0 0 0 0 0]
['_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD

In [31]:
# Sanity check of preprocessed data dimensions. If it does not output anything,
# everything is fine.
for sentence in X_char_train:
    if len(sentence) != max_sentence_len:
        print('sentence error')
    for word in sentence:
        if len(word) != max_word_len:
            print(f'word error: {len(word)}')

---

# Model implementation

In [37]:
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Embedding, Conv1D, TimeDistributed, Dropout, Input, \
    MaxPooling1D, Flatten, concatenate, Bidirectional, LSTM, Dense
from tensorflow.keras.utils import plot_model
from tensorflow.keras.metrics import Precision, Recall
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from tf2crf import CRF

#### Hyperparameters of the model
You can choose between the hyperparameters of two published models:
* Ma, Xuezhe, and Eduard Hovy. "End-to-end sequence labeling via bi-directional lstm-cnns-crf." *arXiv preprint arXiv:1603.01354* (2016).
* Chiu, Jason PC, and Eric Nichols. "Named entity recognition with bidirectional LSTM-CNNs." *Transactions of the Association for Computational Linguistics 4* (2016): 357-370.

The first works better here, but this may be because the second originally relied on additional word features that we do not use.

In [38]:
USE_CHIU_CONFIG = False

In [39]:
if USE_CHIU_CONFIG:
    char_embedding_dim = 25
    cnn_window_size = 3
    cnn_filters_number = 53

    word_embedding_dim = 100
    hidden_cells = 275
    drop=0.68

    batch_size = 9
    epochs = 80
else:
    char_embedding_dim = 30
    cnn_window_size = 3
    cnn_filters_number = 30

    word_embedding_dim = 100
    hidden_cells = 200
    drop=0.5

    batch_size = 10
    epochs = 20

In [41]:
print('Sentence token length:', max_sentence_len)
print('Word character length:', max_word_len)

Sentence token length: 50
Word character length: 52


## CNN
We use a Convolutional Neural Network to extract pattern information from the characters of each word. The CNN is composed of:
* A `keras.layers.Embedding` layer, which is a lookup table that associates a vector with each character;
* A 1-dimensional convolution over the embedding vectors, which captures character-level information;
* A `MaxPooling1D` layer that reduces the sequence of vectors to a single vector summarizing the characters of the word. 

Credits to the author of [this repo](https://github.com/kamalkraj/Named-Entity-Recognition-with-Bidirectional-LSTM-CNNs/blob/master/nn.py).
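
To show what the three layers compute for a single word, here is a small NumPy sketch of the embedding lookup, a "same"-padded 1D convolution and max-over-time pooling. It is an illustration only, not the Keras code of this notebook: the weights are random stand-ins for trained parameters, `char_cnn` is a hypothetical name, and the sizes are assumptions chosen to resemble the configuration above.

```python
import numpy as np

rng = np.random.default_rng(0)

n_chars, embedding_dim = 95, 30     # character vocabulary and embedding size (assumed)
n_filters, window = 30, 3           # number of filters and kernel width (assumed)

# Randomly initialised parameters stand in for trained weights
embedding = rng.normal(size=(n_chars, embedding_dim))
kernel = rng.normal(size=(window, embedding_dim, n_filters))
bias = np.zeros(n_filters)

def char_cnn(char_ids):
    """Embed the characters, convolve with 'same' padding, max-pool over time."""
    x = embedding[char_ids]                        # (word_len, embedding_dim)
    half = window // 2
    x = np.pad(x, ((half, half), (0, 0)))          # zero-pad the character axis
    conv = np.stack([
        np.tensordot(x[t:t + window], kernel, axes=([0, 1], [0, 1])) + bias
        for t in range(len(char_ids))
    ])                                             # (word_len, n_filters)
    return conv.max(axis=0)                        # (n_filters,)

word = np.array([0, 33, 5, 18, 13, 1, 14, 0])      # a padded word as character codes
features = char_cnn(word)
print(features.shape)
```

Whatever the padded word length, pooling over the character axis leaves one value per filter, so every word is summarized by a vector of `n_filters` numbers.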

In [42]:
cnn_input = Input(shape=(max_sentence_len, max_word_len,), name='char_encoding')
# We use TimeDistributed layers because we have two levels of sequences:
# * The sentence is a sequence of words;
# * The word is a sequence of characters.
# We want to work on the lowest level, the sequence of characters, so the
# TimeDistributed layer allows us to apply the same layers to each word. We
# cannot use mask_zero=True in the Embedding because Conv1D does not support masking.
cnn = TimeDistributed(Embedding(len(char_tokenizer.word_index), char_embedding_dim), name='cnn_Embedding')(cnn_input)
cnn = Dropout(drop)(cnn)
cnn = TimeDistributed(Conv1D(filters=cnn_filters_number, kernel_size=cnn_window_size, padding='same'), name='cnn_Convolution1d')(cnn)
cnn = TimeDistributed(MaxPooling1D(max_word_len), name='cnn_MaxPooling1d')(cnn)
# We finally obtain one vector per word, with cnn_filters_number dimensions
# (30 with the default configuration), which holds the char-level information.
cnn_out = TimeDistributed(Flatten(), name='cnn_Flatten')(cnn)

## GloVe
We load the GloVe embeddings to embed tokens and capture word-level information:

In [43]:
glove_embedding_path = os.path.join('embeddings', 'glove.6B.100d.txt')
embedding_matrix = kerasutils.load_glove_embedding_matrix(glove_embedding_path, token_tokenizer.word_index, word_embedding_dim)

Found 400001 word vectors.
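
The `kerasutils.load_glove_embedding_matrix` helper is not shown in this notebook. As a rough idea of what such a loader does, the sketch below builds a matrix whose row `i` holds the GloVe vector of the word with index `i`. `build_embedding_matrix` is a hypothetical function, and leaving all-zero rows for words that are missing from GloVe is an assumption (other initialisations are possible).

```python
import numpy as np

def build_embedding_matrix(glove_lines, word_index, dim):
    """Build a matrix aligned with word_index from GloVe text lines.

    glove_lines: iterable of 'word v1 v2 ... v_dim' strings (GloVe text format).
    Words that are missing from GloVe keep an all-zero row.
    """
    vectors = {}
    for line in glove_lines:
        parts = line.rstrip().split(' ')
        vectors[parts[0]] = np.asarray(parts[1:], dtype='float32')

    # One row per index, including index 0 used for padding
    matrix = np.zeros((max(word_index.values()) + 1, dim), dtype='float32')
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None and vec.shape[0] == dim:
            matrix[idx] = vec
    return matrix

# Tiny made-up vectors, only to show the alignment between indices and rows
glove_lines = ['the 0.1 0.2 0.3', 'lamb 0.4 0.5 0.6']
word_index = {'_PAD_': 0, 'the': 1, 'lamb': 2, 'boycott': 3}
matrix = build_embedding_matrix(glove_lines, word_index, dim=3)
print(matrix)
```

With the real embeddings, `glove_lines` could be an open file handle on `glove.6B.100d.txt`.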


In [44]:
word_input = Input(shape=(max_sentence_len,), name='word_encoding')
word_embed = Embedding(len(token_tokenizer.word_index)+1, word_embedding_dim, 
                       weights=[embedding_matrix], input_length=max_sentence_len,
                       trainable=True, mask_zero=True, 
                       name='Glove_Embedding')(word_input)

## BiLSTM + CRF
We concatenate the character- and word-level representations and pass them to a bidirectional LSTM:

In [45]:
x = concatenate([word_embed, cnn_out], axis=-1)
x = Dropout(drop)(x)
x = Bidirectional(LSTM(hidden_cells, return_sequences=True, dropout=drop))(x)
x = Dense(len(output_labels), activation='relu', name='Dense_Layer')(x)
crf = CRF(len(output_labels), dtype='float32', name='CRF_Layer')
out = crf(x)
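
A CRF layer scores whole tag sequences instead of individual tokens: it combines the per-token scores coming from the previous layer with learned tag-to-tag transition scores, and the best sequence is typically found with Viterbi decoding. The sketch below is a generic NumPy version of that algorithm, not the `tf2crf` code. The emission and transition values are made up, `viterbi_decode` is a hypothetical name, and start/end transition scores are left out for brevity.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence.

    emissions:   (seq_len, n_tags) per-token tag scores
    transitions: (n_tags, n_tags) score of moving from tag i to tag j
    """
    seq_len, n_tags = emissions.shape
    score = emissions[0].copy()                 # best score of a path ending in each tag
    backpointers = np.zeros((seq_len, n_tags), dtype=int)

    for t in range(1, seq_len):
        # candidate[i, j] = best path ending in tag i, followed by transition i -> j
        candidate = score[:, None] + transitions
        backpointers[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0) + emissions[t]

    # Follow the back-pointers from the best final tag
    best = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):
        best.append(int(backpointers[t, best[-1]]))
    return best[::-1]

emissions = np.array([[3.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
transitions = np.array([[0.0, -5.0], [0.0, 0.0]])   # moving from tag 0 to tag 1 is penalised
print(viterbi_decode(emissions, transitions))       # sequence-level decision
print(emissions.argmax(axis=1).tolist())            # independent per-token decision
```

In this toy example the per-token argmax picks the tags `[0, 1, 1]`, while the decoder prefers `[0, 0, 0]` because the transition from tag 0 to tag 1 is heavily penalised. This is how a CRF can discourage invalid BIO sequences such as an `I-PER` right after an `O`.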

In [46]:
model = Model(
    inputs=[cnn_input, word_input],
    outputs=out
)

In [48]:
model.compile(
    loss=crf.loss, 
    optimizer='adam',
    metrics=[crf.accuracy]
)

model.summary()

Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
char_encoding (InputLayer)      [(None, 50, 52)]     0                                            
__________________________________________________________________________________________________
cnn_Embedding (TimeDistributed) (None, 50, 52, 30)   2850        char_encoding[0][0]              
__________________________________________________________________________________________________
dropout (Dropout)               (None, 50, 52, 30)   0           cnn_Embedding[0][0]              
__________________________________________________________________________________________________
cnn_Convolution1d (TimeDistribu (None, 50, 52, 30)   2730        dropout[0][0]                    
_______________________________________________________________________________________

In [49]:
# Early stopping
early_stopping_callback = EarlyStopping(monitor="val_loss",
                                        patience=3, min_delta=0.001, verbose=1, 
                                        restore_best_weights=True)

## Training

In [51]:
history = model.fit([X_char_train, X_sent_train],
    Y_train, 
    batch_size=batch_size, 
    epochs=epochs,
    verbose=1,
    callbacks=[early_stopping_callback],
    validation_data=({'char_encoding': X_char_valid, 'word_encoding': X_sent_valid}, Y_valid)
)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 00007: early stopping


---

## Evaluation
We evaluate three aspects of the model:
* **Memory consumption** using the `kerasutils.print_model_memory_usage()` function (found [here](https://stackoverflow.com/questions/43137288/how-to-determine-needed-memory-of-keras-model));
* **Latency in prediction** using the function `time.process_time()`;
* **F1-score** _on entities_ on the test set using `seqeval`.
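
Scoring "on entities" means that a prediction counts as correct only when the whole span and its type match the gold annotation, which is stricter than token accuracy. The following sketch illustrates the idea in plain Python. It is a simplified stand-in for what `seqeval` reports, not its implementation: `extract_entities` and `entity_f1` are hypothetical helpers, only the micro average is computed, and treating a stray `I-` tag as the start of a new entity is an assumption.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in a BIO-tagged sentence."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ['O']):   # the sentinel closes the last span
        prefix, _, label = tag.partition('-')
        continues = prefix == 'I' and label == etype
        if start is not None and not continues:
            entities.add((etype, start, i - 1))
            start, etype = None, None
        if prefix == 'B' or (prefix == 'I' and start is None):
            start, etype = i, label
    return entities

def entity_f1(y_true, y_pred):
    """Micro-averaged precision, recall and F1 over exact entity matches."""
    true_spans, pred_spans = set(), set()
    for n, (t, p) in enumerate(zip(y_true, y_pred)):
        true_spans |= {(n,) + e for e in extract_entities(t)}
        pred_spans |= {(n,) + e for e in extract_entities(p)}
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

y_true = [['B-PER', 'I-PER', 'O', 'B-LOC']]
y_pred = [['B-PER', 'I-PER', 'O', 'B-ORG']]
print(entity_f1(y_true, y_pred))
```

Here three tokens out of four are tagged correctly, but only one of the two entities is found with the right type, so precision, recall and F1 are all 0.5.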

In [53]:
kerasutils.print_model_memory_usage(batch_size, model)

Model size: 22.914 MB


In [54]:
lat = modelutils.compute_prediction_latency(
    {'char_encoding': X_char_test, 'word_encoding': X_sent_test}, 
    model, 
    n_instances=len(X_sent_test)
)
print(f'Model latency in predictions: {lat:.3} s')

Model latency in predictions: 0.00679 s
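
The `modelutils.compute_prediction_latency` helper is not shown here. A minimal way to obtain a comparable number with `time.process_time()` is sketched below, under the assumption that latency means the CPU time of one prediction call divided by the number of instances. `mean_latency` and `dummy_predict` are hypothetical names, and the dummy function only stands in for `model.predict`.

```python
import time

def mean_latency(predict_fn, inputs, n_instances):
    """Average CPU time per instance for one call to predict_fn(inputs)."""
    start = time.process_time()
    predict_fn(inputs)
    elapsed = time.process_time() - start
    return elapsed / n_instances

# Stand-in for model.predict: any callable that consumes the batch works
def dummy_predict(batch):
    return [sum(row) for row in batch]

batch = [[1, 2, 3]] * 1000
lat = mean_latency(dummy_predict, batch, n_instances=len(batch))
print(f'Latency per instance: {lat:.3} s')
```

Note that `time.process_time()` counts the CPU time of the current process and ignores time spent sleeping or waiting, so for GPU-bound predictions a wall-clock timer such as `time.perf_counter()` may be more representative.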


In [55]:
datasets = [('Training Set', X_char_train, X_sent_train, Y_train), 
            ('Test Set', X_char_test, X_sent_test, Y_test), 
            ('Validation Set', X_char_valid, X_sent_valid, Y_valid)]

for title, X_char, X_sent, Y in datasets:
    Y_pred = model.predict({'char_encoding': X_char, 'word_encoding': X_sent}, batch_size=batch_size)
    # Remove padding
    Y, Y_pred = kerasutils.remove_seq_padding(X_sent, Y, Y_pred)
    # Transform label ids into label strings
    Y, Y_pred = modelutils.from_encode_to_literal_labels(Y, Y_pred, idx2tag)
    print(title)
    print(classification_report(Y, Y_pred, digits=3))
    print('\n')

Training Set
           precision    recall  f1-score   support

      PER      0.960     0.986     0.973      6589
      LOC      0.934     0.965     0.949      7134
     MISC      0.948     0.848     0.896      3435
      ORG      0.916     0.930     0.923      6312

micro avg      0.938     0.944     0.941     23470
macro avg      0.939     0.944     0.941     23470



Test Set
           precision    recall  f1-score   support

      LOC      0.846     0.929     0.885      1655
      PER      0.933     0.949     0.941      1579
     MISC      0.796     0.707     0.749       700
      ORG      0.856     0.847     0.851      1657

micro avg      0.868     0.882     0.875      5591
macro avg      0.867     0.882     0.874      5591



Validation Set
           precision    recall  f1-score   support

      LOC      0.913     0.959     0.936      1834
     MISC      0.920     0.799     0.855       919
      PER      0.931     0.968     0.949      1796
      ORG      0.883     0.886    

---

## Visualize Results

In [58]:
i = 0
sentence = X_sent_test[i]
y_pred = model.predict({'char_encoding': X_char_test, 'word_encoding': X_sent_test})
y_pred = y_pred[i]
y_true = Y_test[i]

print('      TOKEN      TRUE Y | PRED Y')
print('='*34)
for idx in range(len(sentence[:15])):
    print(f'{token_tokenizer.index_word[sentence[idx]]:15}  {idx2tag[y_true[idx]]:6} | {idx2tag[y_pred[idx]]}')


      TOKEN      TRUE Y | PRED Y
japan            B-LOC  | B-LOC
get              O      | O
lucky            O      | O
win              O      | O
,                O      | O
china            B-PER  | B-LOC
in               O      | O
surprise         O      | O
defeat           O      | O
.                O      | O
_PAD_            O      | O
_PAD_            O      | O
_PAD_            O      | O
_PAD_            O      | O
_PAD_            O      | O


---