# LSTM for Entity Extraction on WikiNER (Italian)

WikiNER is a dataset of annotated sentences for Entity Extraction taken from Wikipedia. In this notebook, we train and evaluate a Bidirectional LSTM neural network model on the Italian WikiNER dataset to recognize Person, Location, Organization and Miscellaneous (MISC) entities.

We use `tf.keras.preprocessing.text.Tokenizer` for text preprocessing, pad all the sentences to the same length, and load [this word2vec embedding](http://www.italianlp.it/resources/italian-word-embeddings/) for token encoding. We then build the model with `tensorflow.keras` and evaluate it with the `seqeval` package.

---

In [1]:
import os
import numpy as np
from utils import dataio, kerasutils, modelutils
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical, plot_model
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from seqeval.metrics import classification_report

## Load Dataset

Thanks to [this repo](https://github.com/dice-group/FOX/blob/master/input/Wikiner/aij-wikiner-it-wp3.bz2), which makes the WikiNER data easily available.

In [2]:
file_path = os.path.join('data', 'wikiner-it-wp3-raw.txt')
sentences, tags, output_labels = dataio.load_wikiner(file_path, token_only=True)

# The itWac Italian word embeddings need specific text preprocessing
# to be used effectively
sentences = dataio.itwac_preprocess_data(sentences)

Read 127940 sentences.
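
The source of `dataio.load_wikiner` is not shown in this notebook. As a rough illustration of what such a loader has to do, here is a minimal sketch that assumes the raw file holds one sentence per line, with space-separated items of the form `word|POS|tag`. The function name `parse_wikiner_line` and the `X` part-of-speech placeholder are hypothetical.

```python
def parse_wikiner_line(line):
    """Split one raw line into parallel token and tag lists.

    Assumes every whitespace-separated item looks like word|POS|TAG.
    """
    tokens, ner_tags = [], []
    for item in line.split():
        # rsplit keeps any '|' that might occur inside the word itself.
        word, _pos, tag = item.rsplit('|', 2)
        tokens.append(word)
        ner_tags.append(tag)
    return tokens, ner_tags


# 'X' stands in for the part-of-speech field, which is ignored here.
sample_line = 'Paul|X|I-PER Broca|X|I-PER con|X|O'
sample_tokens, sample_tags = parse_wikiner_line(sample_line)
print(sample_tokens)
print(sample_tags)
```

Only the first and last fields are kept, which would match a token-only load such as `token_only=True`, if that is what the flag means.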


In [3]:
print(output_labels)
print(sentences[1])
print(tags[1])

{'B-LOC', 'I-MISC', 'I-LOC', 'I-ORG', 'B-MISC', 'B-ORG', 'O', 'I-PER', 'B-PER'}
['Seguirono', 'Lamarck', '(', '1744', '--', '1829', ')', ',', 'Blumenbach', '(', '1752', '--', '1840', ')', ',', 'con', 'le', 'sue', 'norme', 'descrittive', 'del', 'cranio', ',', 'Paul', 'Broca', 'con', 'la', 'focalizzazione', 'dei', 'rapporti', 'tra', 'morfologia', 'e', 'funzionalità', '.']
['O', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'I-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']


---

## Text preprocessing and token encoding

#### Token Ordinal Encoding

In [4]:
tokenizer = Tokenizer(lower=False)
tokenizer.fit_on_texts(sentences)
X = tokenizer.texts_to_sequences(sentences)

tag2idx = { tag: idx for idx, tag in enumerate(output_labels) }
print(tag2idx)

{'B-LOC': 0, 'I-MISC': 1, 'I-LOC': 2, 'I-ORG': 3, 'B-MISC': 4, 'B-ORG': 5, 'O': 6, 'I-PER': 7, 'B-PER': 8}


In [5]:
idx2tag = { idx: tag for tag, idx in tag2idx.items() }
tags = [[tag2idx[tag] for tag in sentence] for sentence in tags]
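
`output_labels` is printed with braces, so it appears to be a Python `set`. Iterating over a set of strings can give a different order in a different Python process, so the `tag2idx` mapping above may change between runs. This matters if a trained model is saved and reloaded later. A minimal sketch of an alternative that sorts the labels first (not what this notebook does) could look like this:

```python
# Label set copied from the output of cell In [3].
labels = {'B-LOC', 'I-MISC', 'I-LOC', 'I-ORG', 'B-MISC', 'B-ORG', 'O', 'I-PER', 'B-PER'}

# Sorting first makes the mapping identical on every run.
stable_tag2idx = {tag: idx for idx, tag in enumerate(sorted(labels))}
stable_idx2tag = {idx: tag for tag, idx in stable_tag2idx.items()}

print(stable_tag2idx)
```

With this variant the indices would differ from the ones printed in this notebook, so the outputs below would not match exactly.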

In [6]:
print(sentences[1])
print(X[1])
for i in X[1]:
    print(f'{i:6} | {tokenizer.index_word[i]}')

['Seguirono', 'Lamarck', '(', '1744', '--', '1829', ')', ',', 'Blumenbach', '(', '1752', '--', '1840', ')', ',', 'con', 'le', 'sue', 'norme', 'descrittive', 'del', 'cranio', ',', 'Paul', 'Broca', 'con', 'la', 'focalizzazione', 'dei', 'rapporti', 'tra', 'morfologia', 'e', 'funzionalità', '.']
[9039, 26311, 21, 22394, 73, 9233, 20, 1, 68504, 21, 16395, 73, 9040, 20, 1, 16, 24, 139, 4538, 42627, 10, 19505, 1, 1837, 42628, 16, 6, 51767, 28, 974, 56, 10484, 4, 4599, 3]
  9039 | Seguirono
 26311 | Lamarck
    21 | (
 22394 | 1744
    73 | --
  9233 | 1829
    20 | )
     1 | ,
 68504 | Blumenbach
    21 | (
 16395 | 1752
    73 | --
  9040 | 1840
    20 | )
     1 | ,
    16 | con
    24 | le
   139 | sue
  4538 | norme
 42627 | descrittive
    10 | del
 19505 | cranio
     1 | ,
  1837 | Paul
 42628 | Broca
    16 | con
     6 | la
 51767 | focalizzazione
    28 | dei
   974 | rapporti
    56 | tra
 10484 | morfologia
     4 | e
  4599 | funzionalità
     3 | .


In [7]:
vocabulary_size = len(tokenizer.word_counts)
print(vocabulary_size)

117893
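
The indices printed above follow token frequency: very common tokens such as `,` and `.` get small numbers, rare ones get large numbers, and index 0 is never assigned. The following pure-Python sketch illustrates that ranking idea on a toy corpus. It is a simplified stand-in, not the Keras `Tokenizer` implementation, and `build_word_index` is a hypothetical name.

```python
from collections import Counter


def build_word_index(tokenized_sentences):
    """Map each token to its frequency rank: 1 for the most frequent token."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    # Index 0 is left unused so it can serve as the padding value.
    return {tok: rank for rank, (tok, _) in enumerate(counts.most_common(), start=1)}


toy_sentences = [['la', 'casa', ','], ['la', 'norma', ','], ['la', 'casa']]
toy_index = build_word_index(toy_sentences)
toy_encoded = [[toy_index[tok] for tok in sent] for sent in toy_sentences]
print(toy_index)
print(toy_encoded)
```

Because real indices run from 1 to `vocabulary_size` and 0 is reserved for padding, the embedding layer later needs `vocabulary_size + 1` rows.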


In [8]:
print(sentences[1])
print(tags[1])
for i in tags[1]:
    print(f'{i} : {idx2tag[i]}')

['Seguirono', 'Lamarck', '(', '1744', '--', '1829', ')', ',', 'Blumenbach', '(', '1752', '--', '1840', ')', ',', 'con', 'le', 'sue', 'norme', 'descrittive', 'del', 'cranio', ',', 'Paul', 'Broca', 'con', 'la', 'focalizzazione', 'dei', 'rapporti', 'tra', 'morfologia', 'e', 'funzionalità', '.']
[6, 7, 6, 6, 6, 6, 6, 6, 7, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6]
6 : O
7 : I-PER
6 : O
6 : O
6 : O
6 : O
6 : O
6 : O
7 : I-PER
6 : O
6 : O
6 : O
6 : O
6 : O
6 : O
6 : O
6 : O
6 : O
6 : O
6 : O
6 : O
6 : O
6 : O
7 : I-PER
7 : I-PER
6 : O
6 : O
6 : O
6 : O
6 : O
6 : O
6 : O
6 : O
6 : O
6 : O


#### Sequence Padding

The input sequences of this LSTM model must have a fixed length. We choose an appropriate sequence length based on the distribution of sentence lengths in the dataset, then we pad the shorter sentences and truncate the longer ones.
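
To make the two operations concrete, here is a pure-Python sketch of padding and truncating at the end of a sequence, which is the behaviour requested with `padding='post'` and `truncating='post'`. It is an illustration only, not the Keras code, and `pad_or_truncate` is a hypothetical helper.

```python
def pad_or_truncate(seq, maxlen, pad_value=0):
    """Cut seq to maxlen items, then append pad_value until it has exactly maxlen."""
    clipped = list(seq[:maxlen])  # drop items past maxlen ('post' truncation)
    return clipped + [pad_value] * (maxlen - len(clipped))  # pad at the end ('post')


print(pad_or_truncate([5, 8, 2], maxlen=5))           # [5, 8, 2, 0, 0]
print(pad_or_truncate([5, 8, 2, 9, 4, 7], maxlen=5))  # [5, 8, 2, 9, 4]
```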

In [9]:
sequence_len = np.array([len(s) for s in sentences])
longest_sequence = sequence_len.max()
print(f'Longest sequence: {longest_sequence}')

print([(str(p) + '%', np.percentile(sequence_len, p)) for p in range(75,101, 5)])

Longest sequence: 206
[('75%', 35.0), ('80%', 38.0), ('85%', 42.0), ('90%', 47.0), ('95%', 56.0), ('100%', 206.0)]
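
The percentiles show how many sentences a given maximum length would cut. Since the 95th percentile is 56 tokens, a limit of 60 should truncate fewer than about 5% of the sentences. A small sketch for checking a candidate length directly follows. The helper name and the toy lengths are made up for illustration.

```python
def truncated_fraction(lengths, maxlen):
    """Share of sequences longer than maxlen, i.e. those that would lose tokens."""
    return sum(1 for n in lengths if n > maxlen) / len(lengths)


# Made-up lengths, not taken from the dataset.
toy_lengths = [12, 35, 47, 56, 61, 206, 20, 18, 40, 33]
print(truncated_fraction(toy_lengths, maxlen=60))  # 0.2
```

On the real data the same function could be called as `truncated_fraction(sequence_len, 60)`.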


In [10]:
max_sequence_len = 60
X = pad_sequences(X, maxlen=max_sequence_len, padding='post', truncating='post')

y = pad_sequences(tags, maxlen=max_sequence_len, value=tag2idx['O'], padding='post', truncating='post')
y = to_categorical(y, num_classes=len(output_labels), dtype='int32')

In [11]:
tokenizer.index_word[0] = '_PAD_'

In [12]:
X = np.array(X)
y = np.array(y)

In [13]:
print(X.shape)
print(y.shape)

(127940, 60)
(127940, 60, 9)
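
The last axis of `y` has size 9 because every tag index is expanded into a one-hot vector, one position per label. A minimal NumPy sketch of that expansion on toy values is shown below. It mimics the effect of `to_categorical` but is not its implementation.

```python
import numpy as np

toy_tags = np.array([[6, 7, 6], [7, 7, 6]])  # 2 sentences, 3 positions each
num_classes = 9

# Row i of the identity matrix is the one-hot vector for class i.
toy_one_hot = np.eye(num_classes, dtype='int32')[toy_tags]

print(toy_one_hot.shape)
print(toy_one_hot[0, 1])
```

Taking `argmax` over the last axis recovers the original tag indices, which is what the evaluation cell does with the model predictions.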


## Build, train and evaluate an LSTM model

The itWac embedding file uses the same format as GloVe embeddings, so we load it with the `load_glove_embedding_matrix()` function. This function and the one that creates the LSTM model are in the `utils/kerasutils.py` module. The stopping criterion for training is early stopping with patience on the validation loss.

In [14]:
USE_W2V=True

In [15]:
w2v_matrix=None
if USE_W2V:
    w2v_embedding_path = os.path.join('embeddings', 'w2v.itWac.128d.txt')
    embedding_dim = 128
    w2v_matrix = kerasutils.load_glove_embedding_matrix(w2v_embedding_path, tokenizer.word_index, embedding_dim)

Found 1247492 word vectors.
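
The code of `kerasutils.load_glove_embedding_matrix` is not shown here. As a hedged sketch of what such a loader typically does, the function below assumes a plain-text file with one word per line followed by space-separated floats, and fills one matrix row per vocabulary index. `build_embedding_matrix` and the toy data are hypothetical.

```python
import io

import numpy as np


def build_embedding_matrix(lines, word_index, dim):
    """Fill row i with the vector of the word whose index is i; unknown words stay zero."""
    matrix = np.zeros((len(word_index) + 1, dim), dtype='float32')  # row 0 = padding
    for line in lines:
        parts = line.rstrip().split(' ')
        word, values = parts[0], parts[1:]
        # Skip words outside the vocabulary and malformed lines.
        if word in word_index and len(values) == dim:
            matrix[word_index[word]] = np.asarray(values, dtype='float32')
    return matrix


toy_file = io.StringIO('casa 0.1 0.2 0.3\nnorma 0.4 0.5 0.6\n')
toy_matrix = build_embedding_matrix(toy_file, {'casa': 1, 'cranio': 2}, dim=3)
print(toy_matrix)
```

In this sketch, words in the vocabulary that have no pretrained vector keep an all-zero row. Whether the notebook's helper does the same is an assumption.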


In [16]:
model = kerasutils.create_paper_BiLSTM(vocabulary_size+1, max_sequence_len, len(output_labels), 
                                 use_glove=USE_W2V, glove_matrix=w2v_matrix, embed_dim = 128)

# Early stopping with patience on validation loss
early_stopping_callback = EarlyStopping(monitor="val_loss", min_delta=0.01, patience=3, verbose=1, mode="auto", restore_best_weights=True)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 60, 128)           15090432  
_________________________________________________________________
dropout (Dropout)            (None, 60, 128)           0         
_________________________________________________________________
bidirectional (Bidirectional (None, 60, 400)           526400    
_________________________________________________________________
dense (Dense)                (None, 60, 9)             3609      
Total params: 15,620,441
Trainable params: 15,620,441
Non-trainable params: 0
_________________________________________________________________
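
The parameter counts in the summary can be checked by hand. The sketch below assumes 200 LSTM units per direction, which is inferred from the 400-wide output of the bidirectional layer and not confirmed by the model code.

```python
vocab_rows = 117893 + 1  # vocabulary_size + 1 (index 0 is padding)
embed_dim = 128
lstm_units = 200         # assumption: inferred from the 400-wide bidirectional output
n_labels = 9

embedding_params = vocab_rows * embed_dim
# Each LSTM direction has 4 gates, each with input weights, recurrent weights and a bias.
lstm_params = 2 * 4 * lstm_units * (embed_dim + lstm_units + 1)
# The dense layer maps the 400 concatenated LSTM outputs to 9 labels, plus biases.
dense_params = 2 * lstm_units * n_labels + n_labels

print(embedding_params, lstm_params, dense_params)        # 15090432 526400 3609
print(embedding_params + lstm_params + dense_params)      # 15620441
```

The embedding table accounts for almost all of the parameters. Since the summary reports zero non-trainable parameters, the embedding weights are updated during training as well.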


In [17]:
# Create train/validation/test split. Fix random_state to improve repeatability.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3791)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=3791)
batch_size = 10
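
Applying `test_size=0.2` twice does not give three equal-looking shares: the second split takes 20% of the remaining 80%. A short arithmetic sketch of the resulting proportions follows. The sentence counts are approximate, since the exact numbers depend on how scikit-learn rounds.

```python
test_share = 0.2
valid_share_of_rest = 0.2

valid_share = (1 - test_share) * valid_share_of_rest
train_share = (1 - test_share) * (1 - valid_share_of_rest)

n_sentences = 127940
for name, share in [('train', train_share), ('valid', valid_share), ('test', test_share)]:
    print(f'{name}: {share:.2f} of the data, about {round(share * n_sentences)} sentences')
```

So the model is trained on roughly 64% of the sentences, validated on 16% and tested on 20%.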

In [18]:
%%time
history = model.fit(X_train, 
          y_train, 
          batch_size=batch_size, 
          epochs=50,
          verbose=1,
          callbacks=[early_stopping_callback],
          validation_data=(X_valid, y_valid)
         )

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 00004: early stopping
Wall time: 1h 50min 55s
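
With `patience=3` and `min_delta=0.01`, an epoch only counts as an improvement if the validation loss drops by more than 0.01 below the best value so far. The following pure-Python sketch illustrates that rule in simplified form. It is not the Keras source, and the loss values are made up because the per-epoch losses are not shown in the log above.

```python
def stopping_epoch(val_losses, patience=3, min_delta=0.01):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = float('inf')
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:  # counts as an improvement
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None


toy_losses = [0.120, 0.115, 0.118, 0.121, 0.119]  # made-up values, not from this run
print(stopping_epoch(toy_losses))  # 4
```

In this toy run the small gain at epoch 2 is below `min_delta`, so three epochs pass without a counted improvement and training stops at epoch 4. This is one way a run can end as early as the one above. With `restore_best_weights=True`, the weights from the best epoch are then restored.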


---

## Evaluation
We evaluate three aspects of the model:
* **Memory consumption**, using the `kerasutils.print_model_memory_usage()` function (found [here](https://stackoverflow.com/questions/43137288/how-to-determine-needed-memory-of-keras-model));
* **Prediction latency**, using the function `time.process_time()`;
* **F1-score** _on entities_ on the test set, using `seqeval`.
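
Scoring "on entities" means that a prediction counts only when the whole entity span and its type match, not when single tokens match. The sketch below illustrates the idea for one sentence. It is a simplified stand-in for what `seqeval` computes over the whole dataset and per entity type, and both function names are hypothetical.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from one sentence's IOB tags."""
    entities, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ['O']):  # sentinel closes a trailing span
        prefix, _, etype = tag.partition('-')
        continues = prefix == 'I' and etype == current
        if current is not None and not continues:
            entities.add((current, start, i - 1))
            current = None
        if prefix in ('B', 'I') and current is None:
            start, current = i, etype
    return entities


def entity_f1(true_tags, pred_tags):
    """F1 where a prediction counts only if type and both boundaries match."""
    gold, pred = extract_entities(true_tags), extract_entities(pred_tags)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision, recall = correct / len(pred), correct / len(gold)
    return 2 * precision * recall / (precision + recall)


gold_tags = ['O', 'I-PER', 'I-PER', 'O', 'I-LOC']
pred_tags = ['O', 'I-PER', 'O', 'O', 'I-LOC']
print(entity_f1(gold_tags, pred_tags))  # 0.5
```

Here four of the five tokens are tagged correctly, yet the two-token person entity is only half recovered, so it counts as an error and the entity-level F1 is 0.5.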

In [19]:
kerasutils.print_model_memory_usage(batch_size, model)

Model size: 61.109 MB


In [20]:
print(f'Model latency in prediction: {modelutils.compute_prediction_latency(X_test, model):.3} s')

Model latency in prediction: 0.00499 s
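
The implementation of `modelutils.compute_prediction_latency` is not shown, and the notebook does not say whether the figure above is per sample or per batch. As a hedged sketch of one way to time predictions with `time.process_time()`, the hypothetical helper below averages CPU time per sample. `dummy_predict` stands in for `model.predict` so the example runs without a trained model.

```python
import time


def average_latency(predict_fn, samples, repeats=3):
    """Average CPU seconds per sample over a few repeated full passes."""
    start = time.process_time()
    for _ in range(repeats):
        predict_fn(samples)
    elapsed = time.process_time() - start
    return elapsed / (repeats * len(samples))


# Stand-in for model.predict, so the sketch needs no trained model.
def dummy_predict(batch):
    return [sum(row) for row in batch]


toy_batch = [[1, 2, 3]] * 1000
print(f'{average_latency(dummy_predict, toy_batch):.3} s per sample')
```

Note that `time.process_time()` measures CPU time of the current process and excludes time spent sleeping, so it may not reflect wall-clock latency. `time.perf_counter()` is the usual choice when elapsed time is what matters.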


In [21]:
datasets = [('Training Set', X_train, y_train), ('Valid Set', X_valid, y_valid), ('Test Set', X_test, y_test)]


In [22]:
for title, X, Y in datasets:
    Y_pred = model.predict(X, batch_size=batch_size)
    Y_pred = np.array(np.argmax(Y_pred, axis=-1))
    Y = np.array(np.argmax(Y, axis=-1))
    Y, Y_pred = kerasutils.remove_seq_padding(X, Y, Y_pred)
    Y, Y_pred = modelutils.from_encode_to_literal_labels(Y, Y_pred, idx2tag)
    print(title)
    print(classification_report(Y, Y_pred, digits=3))
    print('\n')

Training Set
           precision    recall  f1-score   support

      LOC      0.905     0.926     0.916     81845
     MISC      0.781     0.768     0.774     24096
      PER      0.929     0.952     0.940     45352
      ORG      0.868     0.794     0.829     13541

micro avg      0.891     0.899     0.895    164834
macro avg      0.891     0.899     0.895    164834



Valid Set
           precision    recall  f1-score   support

      LOC      0.877     0.897     0.887     20070
      ORG      0.826     0.730     0.775      3412
     MISC      0.719     0.720     0.720      6037
      PER      0.899     0.925     0.911     11251

micro avg      0.856     0.864     0.860     40770
macro avg      0.855     0.864     0.859     40770



Test Set
           precision    recall  f1-score   support

      ORG      0.828     0.746     0.785      4175
     MISC      0.722     0.717     0.720      7317
      LOC      0.883     0.901     0.892     25648
      PER      0.905     0.930     0.91

---