# CNN + GloVe + BiLSTM + CRF model for Entity Extraction on ACNER

In this notebook, I implement the neural model described in [this paper](https://www.aclweb.org/anthology/P16-1101.pdf). The model uses:
* Character-level information extracted with a CNN;
* Word-level information from the 100-dimensional GloVe 6B embeddings;
* A BiLSTM and a CRF layer for making predictions.

Preprocessing consists of encoding tokens as integers, padding sentences to a fixed length, and padding the character sequence of each word to a fixed length. We then implement the model with `tensorflow.keras` and the `tf2crf` package, which provides a CRF layer compatible with TensorFlow. We test it on the Annotated Corpus for NER dataset, using the `seqeval` package to compute the F1-score.

---

In [1]:
import os
import string
import numpy as np
from utils import dataio, kerasutils, modelutils
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

## Load Dataset
The dataset can be found [here](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus). It reports a lot of features for each token, but we only keep the token string and the entity tag.

In [2]:
raw, ner, output_labels = dataio.load_anerd_data(os.path.join('data', 'annotated-ner-dataset', 'ner.csv'),
                                           filter_level='sentence_only')

b'Skipping line 281837: expected 25 fields, saw 34\n'


Filter level: sentence_only
Dataset dimension: 35177 sentences
Data read successfully!


In [3]:
print("Labels:", output_labels)

Labels: {'I-nat', 'B-gpe', 'B-org', 'I-org', 'I-eve', 'I-gpe', 'B-art', 'I-geo', 'B-tim', 'B-per', 'B-geo', 'I-art', 'I-per', 'I-tim', 'O', 'unk', 'B-nat', 'B-eve'}


In [4]:
print("Sentence Example:")
for i in range(len(raw[0])):
    print(f'{raw[0][i]:15} | {ner[0][i]}')

Sentence Example:
Thousands       | O
of              | O
demonstrators   | O
have            | O
marched         | O
through         | O
London          | B-geo
to              | O
protest         | O
the             | O
war             | O
in              | O
Iraq            | B-geo
and             | O
demand          | O
the             | O
withdrawal      | O
of              | O
British         | B-gpe
troops          | O
from            | O
that            | O
country         | O
.               | O
Thousands       | O
of              | O
demonstrators   | O
have            | O
marched         | O
through         | O
London          | B-geo
to              | O
protest         | O
the             | O
war             | O
in              | O
Iraq            | B-geo
and             | O
demand          | O
the             | O
withdrawal      | O
of              | O
British         | B-gpe
troops          | O
from            | O
that            | O
country         | O
.               | 

---

# Data Preparation
Prepare character- and word-level input for the model.

## Sentence encoding and padding
We use a Keras `Tokenizer` to extract the vocabulary and encode words as integers. We pad sentences to a fixed length because the model expects input tensors of uniform shape.

In [5]:
# integer encode sequences of words
token_tokenizer = Tokenizer()    # Lowercases tokens by default
token_tokenizer.fit_on_texts(raw)
sequences = token_tokenizer.texts_to_sequences(raw)

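# Label encoding (labels are sorted because `output_labels` is a set, whose
# iteration order is not guaranteed to be the same across runs)
tag2idx = { tag: idx for idx, tag in enumerate(sorted(output_labels)) }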
idx2tag = { idx: tag for tag, idx in tag2idx.items() }
ner_sequences = [[tag2idx[tag] for tag in sentence] for sentence in ner]

In [6]:
vocabulary_size = len(token_tokenizer.word_counts)
print(vocabulary_size)

27419


In [7]:
max_sentence_len = 60
X_sent = pad_sequences(sequences, maxlen=max_sentence_len, padding='post', truncating='post')
Y = pad_sequences(ner_sequences, maxlen=max_sentence_len, value=tag2idx['O'], padding='post', truncating='post')

X_sent = np.array(X_sent)
Y = np.array(Y)

In [8]:
token_tokenizer.index_word[0] = '_PAD_'
token_tokenizer.word_index['_PAD_'] = 0

In [9]:
print('Encoded and padded sentence:')
for i in range(len(X_sent[0][:20])):
    print(f'{X_sent[0][i]:6} | {token_tokenizer.index_word[X_sent[0][i]]}')

Encoded and padded sentence:
   259 | thousands
     5 | of
   902 | demonstrators
    15 | have
  1950 | marched
   245 | through
   482 | london
     6 | to
   492 | protest
     1 | the
   134 | war
     4 | in
    59 | iraq
     8 | and
   640 | demand
     1 | the
   799 | withdrawal
     5 | of
   182 | british
    91 | troops
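
As a rough illustration of the two steps above, the following plain-Python sketch builds a frequency-ranked vocabulary and pads the encoded sentences at the end with 0. It only mimics the general behaviour of `Tokenizer` and `pad_sequences` as used here (lowercasing, indices starting at 1, padding and truncating at the end); the helper names `build_vocab` and `encode_and_pad` and the toy sentences are made up for this example.

```python
from collections import Counter

def build_vocab(sentences):
    """Map each lowercased token to an integer, most frequent first, starting at 1."""
    counts = Counter(tok.lower() for sent in sentences for tok in sent)
    # Index 0 is left free so that it can be used for padding.
    return {tok: idx for idx, (tok, _) in enumerate(counts.most_common(), start=1)}

def encode_and_pad(sentences, vocab, maxlen, pad=0):
    """Encode tokens as integers, then truncate or pad each sentence at the end."""
    encoded = []
    for sent in sentences:
        ids = [vocab[tok.lower()] for tok in sent][:maxlen]
        encoded.append(ids + [pad] * (maxlen - len(ids)))
    return encoded

toy = [["The", "war", "in", "Iraq"], ["The", "troops"]]
vocab = build_vocab(toy)
padded = encode_and_pad(toy, vocab, maxlen=5)
print(vocab)
print(padded)
```

The most frequent token (`the`) receives index 1, and the shorter sentence is filled with zeros up to `maxlen`.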


---

## Character encoding and padding
To extract character-level information, we have to:
* Encode characters as integers;
* Pad words to a fixed length;
* Use 0 as the padding integer for both sentence padding and word padding.

We don't want to truncate words, because prefixes and suffixes carry valuable information, so we pad every word to the length of the longest word.

In [10]:
def to_char_list(data):
    '''Transform all the words of a dataset into lists of characters'''
    
    char_data = []
    for sentence in data:
        char_sent = []
        for word in sentence:
            char_sent.append(list(word))
        char_data.append(char_sent)
    return char_data

In [11]:
raw_char = to_char_list(raw)

for token in raw_char[0]:
    print(token)
print('='*30)
print(len(raw_char))

['T', 'h', 'o', 'u', 's', 'a', 'n', 'd', 's']
['o', 'f']
['d', 'e', 'm', 'o', 'n', 's', 't', 'r', 'a', 't', 'o', 'r', 's']
['h', 'a', 'v', 'e']
['m', 'a', 'r', 'c', 'h', 'e', 'd']
['t', 'h', 'r', 'o', 'u', 'g', 'h']
['L', 'o', 'n', 'd', 'o', 'n']
['t', 'o']
['p', 'r', 'o', 't', 'e', 's', 't']
['t', 'h', 'e']
['w', 'a', 'r']
['i', 'n']
['I', 'r', 'a', 'q']
['a', 'n', 'd']
['d', 'e', 'm', 'a', 'n', 'd']
['t', 'h', 'e']
['w', 'i', 't', 'h', 'd', 'r', 'a', 'w', 'a', 'l']
['o', 'f']
['B', 'r', 'i', 't', 'i', 's', 'h']
['t', 'r', 'o', 'o', 'p', 's']
['f', 'r', 'o', 'm']
['t', 'h', 'a', 't']
['c', 'o', 'u', 'n', 't', 'r', 'y']
['.']
['T', 'h', 'o', 'u', 's', 'a', 'n', 'd', 's']
['o', 'f']
['d', 'e', 'm', 'o', 'n', 's', 't', 'r', 'a', 't', 'o', 'r', 's']
['h', 'a', 'v', 'e']
['m', 'a', 'r', 'c', 'h', 'e', 'd']
['t', 'h', 'r', 'o', 'u', 'g', 'h']
['L', 'o', 'n', 'd', 'o', 'n']
['t', 'o']
['p', 'r', 'o', 't', 'e', 's', 't']
['t', 'h', 'e']
['w', 'a', 'r']
['i', 'n']
['I', 'r', 'a', 'q']
['a', 'n

In [12]:
# Sanity check of preprocessed data dimensions. If it does not output anything,
# everything is fine.
for sent_idx in range(len(raw)):
    if len(raw_char[sent_idx]) != len(sequences[sent_idx]):
        print('sequence len error')
        print(raw_char[sent_idx])
        print(sequences[sent_idx])
    for word_idx in range(len(raw[sent_idx])):
        if len(raw_char[sent_idx][word_idx]) != len(raw[sent_idx][word_idx]):
            print('word len error')

Character vocabulary:

In [13]:
# NOTE: Tokenizer accepts a char_level=True argument. It could make this code
# cleaner, but that way we would not get a fixed length for words.
char_tokenizer = Tokenizer(lower=False, filters='')
# Build a list with all the characters
charset = string.ascii_letters + string.digits + string.punctuation
print(f'Charset dimension: {len(charset)}')
print(f'Charset: {charset}')
char_tokenizer.fit_on_texts(list(charset))

Charset dimension: 94
Charset: abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [14]:
# Add padding to the tokenizer with the 0 integer encoding
char_tokenizer.index_word[0] = '_PAD_'
char_tokenizer.word_index['_PAD_'] = 0
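
Since every character of `charset` is seen exactly once, the encoded output printed later in this notebook is consistent with characters simply being numbered in `charset` order, starting from 1. The sketch below reproduces that mapping without Keras; `char2idx` and `encode_word` are names introduced only for this illustration, and characters outside the charset are skipped on the assumption that the tokenizer has no out-of-vocabulary token.

```python
import string

charset = string.ascii_letters + string.digits + string.punctuation
# 0 is reserved for padding, so real characters start at 1.
char2idx = {ch: idx for idx, ch in enumerate(charset, start=1)}
char2idx["_PAD_"] = 0

def encode_word(word):
    """Encode a word character by character, dropping characters outside the charset."""
    return [char2idx[ch] for ch in word if ch in char2idx]

print(len(char2idx))
print(encode_word("London"))
```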

#### Pad sentences
Pad or truncate every sentence to `max_sentence_len` (60) tokens.

In [15]:
for sent_idx in range(len(raw_char)):
    if len(raw_char[sent_idx]) > max_sentence_len:
        # Truncate long sentences
        raw_char[sent_idx] = raw_char[sent_idx][:max_sentence_len]
    while len(raw_char[sent_idx]) < max_sentence_len:
        # Pad sentences with '_PAD_' characters
        pad_word = []
        pad_word.append(char_tokenizer.index_word[0])
        raw_char[sent_idx].append(pad_word)

In [16]:
print('Padded sentence:')
for token in raw_char[0]:
    print(token)

Padded sentence:
['T', 'h', 'o', 'u', 's', 'a', 'n', 'd', 's']
['o', 'f']
['d', 'e', 'm', 'o', 'n', 's', 't', 'r', 'a', 't', 'o', 'r', 's']
['h', 'a', 'v', 'e']
['m', 'a', 'r', 'c', 'h', 'e', 'd']
['t', 'h', 'r', 'o', 'u', 'g', 'h']
['L', 'o', 'n', 'd', 'o', 'n']
['t', 'o']
['p', 'r', 'o', 't', 'e', 's', 't']
['t', 'h', 'e']
['w', 'a', 'r']
['i', 'n']
['I', 'r', 'a', 'q']
['a', 'n', 'd']
['d', 'e', 'm', 'a', 'n', 'd']
['t', 'h', 'e']
['w', 'i', 't', 'h', 'd', 'r', 'a', 'w', 'a', 'l']
['o', 'f']
['B', 'r', 'i', 't', 'i', 's', 'h']
['t', 'r', 'o', 'o', 'p', 's']
['f', 'r', 'o', 'm']
['t', 'h', 'a', 't']
['c', 'o', 'u', 'n', 't', 'r', 'y']
['.']
['T', 'h', 'o', 'u', 's', 'a', 'n', 'd', 's']
['o', 'f']
['d', 'e', 'm', 'o', 'n', 's', 't', 'r', 'a', 't', 'o', 'r', 's']
['h', 'a', 'v', 'e']
['m', 'a', 'r', 'c', 'h', 'e', 'd']
['t', 'h', 'r', 'o', 'u', 'g', 'h']
['L', 'o', 'n', 'd', 'o', 'n']
['t', 'o']
['p', 'r', 'o', 't', 'e', 's', 't']
['t', 'h', 'e']
['w', 'a', 'r']
['i', 'n']
['I', 'r', '

In [17]:
len(char_tokenizer.word_index)

95

#### Encode characters with integers

In [18]:
char_seq = []
for sentence in raw_char:
    char_seq.append(char_tokenizer.texts_to_sequences(sentence))

In [19]:
for i in range(len(char_seq[0])):
    w = [char_tokenizer.index_word[letter] for letter in char_seq[0][i]]
    print(char_seq[0][i], '=>', w)

[46, 8, 15, 21, 19, 1, 14, 4, 19] => ['T', 'h', 'o', 'u', 's', 'a', 'n', 'd', 's']
[15, 6] => ['o', 'f']
[4, 5, 13, 15, 14, 19, 20, 18, 1, 20, 15, 18, 19] => ['d', 'e', 'm', 'o', 'n', 's', 't', 'r', 'a', 't', 'o', 'r', 's']
[8, 1, 22, 5] => ['h', 'a', 'v', 'e']
[13, 1, 18, 3, 8, 5, 4] => ['m', 'a', 'r', 'c', 'h', 'e', 'd']
[20, 8, 18, 15, 21, 7, 8] => ['t', 'h', 'r', 'o', 'u', 'g', 'h']
[38, 15, 14, 4, 15, 14] => ['L', 'o', 'n', 'd', 'o', 'n']
[20, 15] => ['t', 'o']
[16, 18, 15, 20, 5, 19, 20] => ['p', 'r', 'o', 't', 'e', 's', 't']
[20, 8, 5] => ['t', 'h', 'e']
[23, 1, 18] => ['w', 'a', 'r']
[9, 14] => ['i', 'n']
[35, 18, 1, 17] => ['I', 'r', 'a', 'q']
[1, 14, 4] => ['a', 'n', 'd']
[4, 5, 13, 1, 14, 4] => ['d', 'e', 'm', 'a', 'n', 'd']
[20, 8, 5] => ['t', 'h', 'e']
[23, 9, 20, 8, 4, 18, 1, 23, 1, 12] => ['w', 'i', 't', 'h', 'd', 'r', 'a', 'w', 'a', 'l']
[15, 6] => ['o', 'f']
[28, 18, 9, 20, 9, 19, 8] => ['B', 'r', 'i', 't', 'i', 's', 'h']
[20, 18, 15, 15, 16, 19] => ['t', 'r', 'o', 'o'

#### Pad words 
Pad every word to `maxlen`. The function can also truncate, but we set `maxlen` to the length of the longest word so that no word is truncated:

In [20]:
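def pad_words(sentence, maxlen, pad=0):
    '''Pad each encoded word on both sides up to maxlen (truncate if longer).

    When the padding cannot be split evenly, the extra pad goes on the left.
    '''
    padded_sentence = []
    for word in sentence:
        new_word = word.copy()
        if len(word) > maxlen:
            new_word = word[:maxlen]
        else:
            # Add one pad on each side until at most one position is missing
            while maxlen - len(new_word) > 1:
                new_word.append(pad)
                new_word.insert(0, pad)
            if maxlen - len(new_word) == 1:
                new_word.insert(0, pad)
        padded_sentence.append(new_word)
    
    return padded_sentence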
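
The loop in `pad_words` centres each word inside the padded window. An equivalent closed-form version for a single word is sketched below; `center_pad` is a hypothetical name used only here, and the split of the padding (extra pad on the left when the total is odd) is derived from reading the function above.

```python
def center_pad(word, maxlen, pad=0):
    """Centre an encoded word in a window of length maxlen, truncating if needed."""
    if len(word) >= maxlen:
        return list(word[:maxlen])
    total = maxlen - len(word)
    left = (total + 1) // 2   # the left side gets the extra pad when total is odd
    right = total // 2
    return [pad] * left + list(word) + [pad] * right

print(center_pad([1, 2, 3], maxlen=6))
print(center_pad([1, 2, 3, 4], maxlen=8))
```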

In [21]:
max_word_len = max([len(word) for word in token_tokenizer.word_index.keys()])
print('Word length:', max_word_len)

Word length: 64


In [22]:
X_char = np.array([pad_words(sentence, maxlen=max_word_len) for sentence in char_seq])

In [23]:
for word in X_char[0][:10]:
    print(word[20:32])
    print([char_tokenizer.index_word[char] for char in word[20:32]])
    print('='*30)
print('[...]')

[ 0  0  0  0  0  0  0  0 46  8 15 21]
['_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', 'T', 'h', 'o', 'u']
[ 0  0  0  0  0  0  0  0  0  0  0 15]
['_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', 'o']
[ 0  0  0  0  0  0  4  5 13 15 14 19]
['_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', 'd', 'e', 'm', 'o', 'n', 's']
[0 0 0 0 0 0 0 0 0 0 8 1]
['_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', 'h', 'a']
[ 0  0  0  0  0  0  0  0  0 13  1 18]
['_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', 'm', 'a', 'r']
[ 0  0  0  0  0  0  0  0  0 20  8 18]
['_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', 't', 'h', 'r']
[ 0  0  0  0  0  0  0  0  0 38 15 14]
['_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', '_PAD_', 'L', 'o', 'n']
[ 0  0  0  0  0  0  0  0  0  0  0 20]
['_PAD_', '_PAD_', '_PAD_', '_PAD

In [24]:
# Sanity check of preprocessed data dimensions. If it does not output anything,
# everything is fine.
for sentence in X_char:
    if len(sentence) != max_sentence_len:
        print('sentence error')
    for word in sentence:
        if len(word) != max_word_len:
            print(f'word error: {len(word)}')

---

# Model implementation

In [25]:
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Embedding, Conv1D, TimeDistributed, Dropout, Input, \
    MaxPooling1D, Flatten, concatenate, Bidirectional, LSTM, Dense
from tensorflow.keras.utils import plot_model
from tensorflow.keras.metrics import Precision, Recall
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from tf2crf import CRF

#### Hyperparameters of the model
You can choose between the parametrizations proposed in two papers:
* Ma, Xuezhe, and Eduard Hovy. "End-to-end sequence labeling via bi-directional lstm-cnns-crf." *arXiv preprint arXiv:1603.01354* (2016).
* Chiu, Jason PC, and Eric Nichols. "Named entity recognition with bidirectional LSTM-CNNs." *Transactions of the Association for Computational Linguistics 4* (2016): 357-370.

The first works better here, possibly because the second originally relied on additional word features that we do not use.

In [26]:
USE_CHIU_CONFIG = False

In [27]:
if USE_CHIU_CONFIG:
    char_embedding_dim = 25
    cnn_window_size = 3
    cnn_filters_number = 53

    word_embedding_dim = 100
    hidden_cells = 275
    drop = 0.68

    batch_size = 9
    epochs = 80
else:
    char_embedding_dim = 30
    cnn_window_size = 3
    cnn_filters_number = 30

    word_embedding_dim = 100
    hidden_cells = 200
    drop = 0.5

    batch_size = 10
    epochs = 20

In [28]:
print('Sentence token length:', max_sentence_len)
print('Word character length:', max_word_len)

Sentence token length: 60
Word character length: 64


## CNN
We use a Convolutional Neural Network to extract patterns from the characters of each word. The character-level embedding is composed of:
* A `keras.layers.Embedding` layer, a lookup table that associates a vector with each character;
* A 1-dimensional convolution over the embedding vectors, to capture patterns in the character sequence;
* A `MaxPooling1D` layer that collapses the sequence of vectors into a single vector summarizing the characters of the word.

Credits to the author of [this repo](https://github.com/kamalkraj/Named-Entity-Recognition-with-Bidirectional-LSTM-CNNs/blob/master/nn.py).

In [29]:
cnn_input = Input(shape=(max_sentence_len, max_word_len,), name='char_encoding')
# We use a TimeDistributed layer because we have two levels of sequences:
# * The sentence is a sequence of words;
# * The word is a sequence of characters.
# We want to work on the lowest level, the sequence of characters, so the
# TimeDistributed layer lets us apply the same layers to each word.
cnn = TimeDistributed(Embedding(len(char_tokenizer.word_index), char_embedding_dim), name='cnn_Embedding')(cnn_input)
cnn = Dropout(drop)(cnn)
cnn = TimeDistributed(Conv1D(filters=cnn_filters_number, kernel_size=cnn_window_size, padding='same'), name='cnn_Convolution1d')(cnn)
cnn = TimeDistributed(MaxPooling1D(max_word_len), name='cnn_MaxPooling1d')(cnn)
# We finally obtain, for each word, a vector with one value per filter
# (30 with the default configuration) that carries char-level information.
cnn_out = TimeDistributed(Flatten(), name='cnn_Flatten')(cnn)
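
To make the convolution and pooling steps concrete, here is a small plain-Python sketch of a 1-dimensional convolution with `'same'` zero padding followed by max-pooling over the whole word. It is only meant to show the arithmetic on toy numbers: the function names and the hand-written kernels are assumptions for this example, whereas the Keras layers above learn their kernels during training.

```python
def conv1d_same(seq, kernels, biases):
    """Convolve a sequence of vectors with zero padding so the length is preserved.

    seq:     list of T vectors, each of length D (one vector per character)
    kernels: list of F kernels, each a list of K vectors of length D
    biases:  list of F floats
    Returns a list of T vectors, each of length F.
    """
    k, d = len(kernels[0]), len(seq[0])
    left = (k - 1) // 2
    right = k - 1 - left
    zero = [0.0] * d
    padded = [zero] * left + list(seq) + [zero] * right
    out = []
    for t in range(len(seq)):
        window = padded[t:t + k]
        out.append([
            b + sum(w * x
                    for w_vec, x_vec in zip(kernel, window)
                    for w, x in zip(w_vec, x_vec))
            for kernel, b in zip(kernels, biases)
        ])
    return out

def max_over_time(features):
    """Keep, for every filter, its largest response along the sequence."""
    return [max(column) for column in zip(*features)]

# Toy word of 3 characters with 1-dimensional embeddings
chars = [[1.0], [2.0], [3.0]]
kernels = [
    [[1.0], [1.0], [1.0]],   # sums each character with its two neighbours
    [[0.0], [1.0], [0.0]],   # copies the centre character
]
features = conv1d_same(chars, kernels, biases=[0.0, 0.0])
word_vector = max_over_time(features)
print(features)
print(word_vector)
```

Whatever the length of the word, the pooled vector has one entry per filter, which is why every word ends up with a fixed-size representation.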

## GloVe
We load the GloVe embeddings to embed tokens and capture word-level information:

In [30]:
glove_embedding_path = os.path.join('embeddings', 'glove.6B.100d.txt')
embedding_dim = 100
embedding_matrix = kerasutils.load_glove_embedding_matrix(glove_embedding_path, token_tokenizer.word_index, embedding_dim)

Found 400001 word vectors.


In [31]:
word_input = Input(shape=(max_sentence_len,), name='word_encoding')
word_embed = Embedding(len(token_tokenizer.word_index)+1, word_embedding_dim, 
                       weights=[embedding_matrix], input_length=max_sentence_len,
                       trainable=True, mask_zero=True, 
                       name='Glove_Embedding')(word_input)
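
The helper `kerasutils.load_glove_embedding_matrix` is not shown in this notebook. A function of that kind could look roughly like the sketch below, which assumes the usual GloVe text format (a word followed by its space-separated vector components on each line). The function name, the zero rows for words without a pretrained vector, and the choice of one row per index up to the largest index are all assumptions, not the author's implementation.

```python
def build_embedding_matrix(lines, word_index, dim):
    """Return a list of rows where row i holds the vector of the word with index i."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(v) for v in parts[1:]]

    # Assumption: one row per integer index, including index 0 for padding
    n_rows = max(word_index.values()) + 1
    matrix = [[0.0] * dim for _ in range(n_rows)]
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix

# Tiny in-memory stand-in for a GloVe file with 3-dimensional vectors
fake_glove = ["the 0.1 0.2 0.3", "war 0.4 0.5 0.6"]
word_index = {"_PAD_": 0, "the": 1, "war": 2, "zzz": 3}
matrix = build_embedding_matrix(fake_glove, word_index, dim=3)
print(matrix)
```

With a real embedding file, `lines` would be an open file handle instead of a list, and the resulting rows could be converted with `np.array` before being passed as initial weights.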

## BiLSTM + CRF
We concatenate the character- and word-level representations and pass them to a bidirectional LSTM:

In [32]:
x = concatenate([word_embed, cnn_out], axis=-1)
x = Dropout(drop)(x)
x = Bidirectional(LSTM(hidden_cells, return_sequences=True, dropout=drop))(x)
x = Dense(len(output_labels), activation='relu', name='Dense_Layer')(x)
crf = CRF(len(output_labels), dtype='float32', name='CRF_Layer')
out = crf(x)
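
At prediction time, a linear-chain CRF picks the highest-scoring tag sequence as a whole instead of choosing each tag independently, and this is typically done with the Viterbi algorithm. The sketch below is a textbook plain-Python version on toy scores, not the code used inside `tf2crf`; the emission and transition values are invented to show how a penalty on the `O` to `I` transition changes the decoded path.

```python
def viterbi_decode(emissions, transitions):
    """Find the best tag path for one sentence.

    emissions:   list of T rows, one score per tag for each token
    transitions: transitions[i][j] is the score of moving from tag i to tag j
    Returns (best_path, best_score).
    """
    n_tags = len(emissions[0])
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_tags):
            # Best previous tag for ending in tag j at this position
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)

    # Follow the back-pointers from the best final tag
    best_last = max(range(n_tags), key=lambda j: score[j])
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[best_last]

tags = ["O", "B", "I"]
emissions = [[2.0, 1.5, 0.0], [0.0, 0.0, 2.0], [1.0, 0.0, 0.0]]   # toy scores
transitions = [[0.0, 0.0, -10.0],   # O -> I is heavily penalised
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]

greedy = [max(range(3), key=lambda j: row[j]) for row in emissions]
path, best = viterbi_decode(emissions, transitions)
print([tags[i] for i in greedy])
print([tags[i] for i in path], best)
```

Choosing the best tag for each token separately yields `O I O`, which starts an entity with an `I` tag, while the joint decoding prefers `B I O`.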

In [33]:
model = Model(
    inputs=[cnn_input, word_input],
    outputs=out
)

In [34]:
model.compile(
    loss=crf.loss, 
    optimizer='adam',
    metrics=[crf.accuracy]
)

model.summary()

Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
char_encoding (InputLayer)      [(None, 60, 64)]     0                                            
__________________________________________________________________________________________________
cnn_Embedding (TimeDistributed) (None, 60, 64, 30)   2850        char_encoding[0][0]              
__________________________________________________________________________________________________
dropout (Dropout)               (None, 60, 64, 30)   0           cnn_Embedding[0][0]              
__________________________________________________________________________________________________
cnn_Convolution1d (TimeDistribu (None, 60, 64, 30)   2730        dropout[0][0]                    
_______________________________________________________________________________________

In [35]:
# Early stopping
early_stopping_callback = EarlyStopping(monitor="val_loss",
                                        patience=3, min_delta=0.001, verbose=1, 
                                        restore_best_weights=True)
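
The callback stops training once `val_loss` has failed to improve by more than `min_delta` for `patience` consecutive epochs, and `restore_best_weights=True` then rolls the model back to the weights of the best epoch. The following simplified simulation illustrates that rule; it is not the Keras implementation, and the loss values are invented for the example.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.001):
    """Return (stop_epoch, best_epoch), both 1-based, for a list of validation losses.

    stop_epoch is None if training would run through every epoch.
    """
    best, best_epoch, wait = float("inf"), None, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:      # improvement larger than min_delta
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

losses = [1.0, 0.8, 0.7, 0.6, 0.5, 0.4, 0.41, 0.40, 0.3995]  # made-up values
print(early_stopping_epoch(losses))
```

In this toy run the last improvement happens at epoch 6, so training stops three epochs later, at epoch 9, and the weights of epoch 6 would be restored.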

# Training

In [36]:
from sklearn.model_selection import train_test_split


# Train-test split (a single call keeps the three arrays aligned)
X_sent_train, X_sent_test, X_char_train, X_char_test, Y_train, Y_test = train_test_split(
    X_sent, X_char, Y, test_size=0.2, random_state=42)

In [37]:
history = model.fit([X_char_train, X_sent_train],
    Y_train, 
    batch_size=batch_size, 
    epochs=epochs,
    verbose=1,
    callbacks=[early_stopping_callback],
    validation_split=0.2
)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 00009: early stopping


---

## Evaluation
We evaluate three aspects of the model:
* **Memory consumption** using the `kerasutils.print_model_memory_usage()` function (found [here](https://stackoverflow.com/questions/43137288/how-to-determine-needed-memory-of-keras-model));
* **Latency in prediction** using the function `time.process_time()`;
* **F1-score** _on entities_ on the test set using `seqeval`.
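
The F1-score is computed on whole entities rather than on single tokens: a predicted entity counts as correct only if its type and both of its boundaries match the gold annotation. The sketch below illustrates this idea on BIO tags in plain Python. It is a simplified stand-in and not the implementation used by `seqeval`; `extract_entities` and `entity_f1` are names introduced for this example.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans found in a BIO-tagged sentence."""
    entities = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # the extra "O" closes a final span
        prefix, _, label = tag.partition("-")
        if prefix == "I" and label == etype:
            continue                               # the current entity goes on
        if etype is not None:
            entities.add((etype, start, i - 1))
        if prefix in ("B", "I"):
            start, etype = i, label                # a new entity starts here
        else:
            start, etype = None, None
    return entities

def entity_f1(true_sentences, pred_sentences):
    """Micro-averaged precision, recall and F1 over exact entity matches."""
    n_true = n_pred = n_correct = 0
    for y_true, y_pred in zip(true_sentences, pred_sentences):
        gold, guess = extract_entities(y_true), extract_entities(y_pred)
        n_true += len(gold)
        n_pred += len(guess)
        n_correct += len(gold & guess)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if n_correct else 0.0
    return precision, recall, f1

y_true = [["O", "B-geo", "O", "B-per", "I-per"]]
y_pred = [["O", "B-geo", "O", "B-per", "O"]]
print(entity_f1(y_true, y_pred))
```

Here four of the five tokens are tagged correctly, yet only one of the two entities matches exactly, so precision, recall and F1 are all 0.5.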

In [38]:
kerasutils.print_model_memory_usage(batch_size, model)

Model size: 27.783 MB


In [39]:
print(f'Model latency in predictions: {modelutils.compute_prediction_latency([X_char_test, X_sent_test], model, n_instances=len(X_sent_test)):.3} s')

Model latency in predictions: 0.00779 s
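
`modelutils.compute_prediction_latency` is not shown here. Assuming it reports the average CPU time per instance, a helper of that kind might be sketched as follows; `mean_prediction_latency` and the dummy prediction function are hypothetical and only illustrate the use of `time.process_time()`.

```python
import time

def mean_prediction_latency(predict_fn, inputs, n_instances):
    """Average CPU time, in seconds, spent by predict_fn per input instance."""
    start = time.process_time()
    predict_fn(inputs)
    elapsed = time.process_time() - start
    return elapsed / n_instances

# Dummy "model" that just sums each row, standing in for model.predict
batch = [[float(i)] * 64 for i in range(1000)]
latency = mean_prediction_latency(lambda rows: [sum(r) for r in rows], batch, len(batch))
print(f"{latency:.3} s")
```

Note that `time.process_time()` measures the CPU time of the current process and excludes time spent sleeping, so it can differ from the wall-clock time a user would observe.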


In [40]:
from seqeval.metrics import classification_report


datasets = [('Training Set', X_char_train, X_sent_train, Y_train), 
            ('Test Set', X_char_test, X_sent_test, Y_test)]

# Distinct loop names avoid overwriting the full X_char, X_sent and Y arrays
for title, X_c, X_s, Y_true in datasets:
    Y_pred = model.predict({'char_encoding': X_c, 'word_encoding': X_s}, batch_size=batch_size)
    Y_true, Y_pred = kerasutils.remove_seq_padding(X_s, Y_true, Y_pred)
    Y_true, Y_pred = modelutils.from_encode_to_literal_labels(Y_true, Y_pred, idx2tag)
    print(title)
    print(classification_report(Y_true, Y_pred, digits=3))
    print('\n')

Training Set
           precision    recall  f1-score   support

      org      0.793     0.723     0.757     15970
      gpe      0.968     0.928     0.948     12914
      tim      0.832     0.888     0.859     15898
      per      0.807     0.851     0.828     13596
      geo      0.871     0.912     0.891     29297
      nat      0.670     0.335     0.447       176
      art      0.000     0.000     0.000       355
      eve      0.482     0.303     0.372       267

micro avg      0.853     0.860     0.856     88473
macro avg      0.849     0.860     0.854     88473



Test Set
           precision    recall  f1-score   support

      gpe      0.960     0.930     0.945      3260
      tim      0.812     0.860     0.835      3987
      org      0.752     0.682     0.715      3950
      geo      0.857     0.896     0.876      7580
      nat      0.500     0.224     0.310        49
      per      0.771     0.817     0.794      3265
      eve      0.452     0.206     0.283        68
   

---