We will use a residual LSTM network together with ELMo embeddings [1], developed at AllenNLP. You will learn how to wrap a pre-trained TensorFlow Hub model so that it works with Keras. The resulting model will give you state-of-the-art performance on the named entity recognition task.

### What are ELMo embeddings?
ELMo embeddings are embeddings from a language model trained on the 1 Billion Word Benchmark; a pretrained version is available on TensorFlow Hub. Unlike most widely used word embeddings, ELMo word representations are functions of the entire input sentence. They are computed on top of a two-layer bidirectional language model (biLM) with character convolutions, as a linear function of the internal network states. Concretely, ELMo uses a pre-trained, multi-layer, bidirectional, LSTM-based language model and extracts the hidden state of each layer for the input sequence of words. It then computes a weighted sum of those hidden states to obtain an embedding for each word. The weight of each hidden state is task-dependent and is learned.

ELMo improves the performance of models across a wide range of tasks, spanning from question answering and sentiment analysis to named entity recognition. This setup allows us to do semi-supervised learning, where the biLM is pre-trained at a large scale and then easily incorporated into a wide range of existing neural NLP architectures.
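The weighted sum over layer states can be illustrated with a few lines of NumPy. This is only a sketch of the idea, not the code that runs inside the TensorFlow Hub module: the function name `elmo_weighted_sum`, the toy shapes, and the random stand-in states are assumptions made for illustration. The per-layer weights are softmax-normalized and the result is rescaled by a scalar `gamma`, as described in the paper.

```python
import numpy as np

def elmo_weighted_sum(layer_states, raw_weights, gamma=1.0):
    """Collapse biLM layer states into one vector per token.

    layer_states: array of shape (n_layers, n_tokens, dim)
    raw_weights:  array of shape (n_layers,), unnormalized task weights
    gamma:        scalar that rescales the whole embedding
    """
    # Softmax-normalize the per-layer weights so they sum to one.
    shifted = raw_weights - np.max(raw_weights)
    s = np.exp(shifted) / np.sum(np.exp(shifted))
    # Weighted sum over the layer axis -> (n_tokens, dim).
    return gamma * np.tensordot(s, layer_states, axes=1)

# Toy example: 3 layers, 4 tokens, 5-dimensional states (random stand-ins).
rng = np.random.default_rng(0)
states = rng.normal(size=(3, 4, 5))
embedding = elmo_weighted_sum(states, raw_weights=np.zeros(3))
print(embedding.shape)  # (4, 5)
```

With all raw weights equal, the softmax assigns each layer the same weight, so the embedding is simply the average of the layer states. During training on a downstream task, the raw weights and `gamma` would be learned.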

I suggest having a look at the great paper “Deep contextualized word representations” https://arxiv.org/pdf/1802.05365.pdf.

### Data preparation

In [1]:
import pandas as pd    
import numpy as np

data = pd.read_csv("data/ner_dataset.csv", encoding="latin1")
# Forward-fill missing values so that every token row carries its sentence id
data = data.ffill()

class SentenceGetter(object):
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(),
                                                           s["POS"].values.tolist(),
                                                           s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]
    
    def get_next(self):
        # Return the next sentence, or None once all sentences are consumed
        try:
            s = self.grouped["Sentence: {}".format(self.n_sent)]
            self.n_sent += 1
            return s
        except KeyError:
            return None
        
getter = SentenceGetter(data)
sentences = getter.sentences

max_len = 50
max_len_char = 10

words = list(set(data["Word"].values))
words.append("ENDPAD")
n_words = len(words)

tags = list(set(data["Tag"].values))
n_tags = len(tags)
tag2idx = {t: i for i, t in enumerate(tags)}

To apply the ELMo embedding from TensorFlow Hub, we have to use strings as input. So we take the tokenized sentences and pad them to the desired length.

In [2]:
X = [[w[0] for w in s] for s in sentences]

new_X = []
for seq in X:
    new_seq = []
    for i in range(max_len):
        try:
            new_seq.append(seq[i])
        except IndexError:
            # sentence is shorter than max_len: fill up with the padding token
            new_seq.append("__PAD__")
    new_X.append(new_seq)
X = new_X
X[1]

['Iranian',
 'officials',
 'say',
 'they',
 'expect',
 'to',
 'get',
 'access',
 'to',
 'sealed',
 'sensitive',
 'parts',
 'of',
 'the',
 'plant',
 'Wednesday',
 ',',
 'after',
 'an',
 'IAEA',
 'surveillance',
 'system',
 'begins',
 'functioning',
 '.',
 '__PAD__',
 '__PAD__',
 '__PAD__',
 '__PAD__',
 '__PAD__',
 '__PAD__',
 '__PAD__',
 '__PAD__',
 '__PAD__',
 '__PAD__',
 '__PAD__',
 '__PAD__',
 '__PAD__',
 '__PAD__',
 '__PAD__',
 '__PAD__',
 '__PAD__',
 '__PAD__',
 '__PAD__',
 '__PAD__',
 '__PAD__',
 '__PAD__',
 '__PAD__',
 '__PAD__',
 '__PAD__']
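The padding rule used above (cut at `max_len`, fill the rest with `"__PAD__"`) can also be written as a small helper, together with the boolean mask that separates real tokens from padding. The helper name `pad_tokens` is hypothetical and not part of the original notebook. This is a minimal sketch in plain Python.

```python
def pad_tokens(tokens, max_len, pad_token="__PAD__"):
    """Truncate or right-pad a token list to exactly max_len entries."""
    clipped = tokens[:max_len]
    return clipped + [pad_token] * (max_len - len(clipped))

# A sentence shorter than max_len gets padded ...
short = pad_tokens(["Iranian", "officials", "say"], max_len=5)
# ... and a longer one gets truncated.
long = pad_tokens(list("abcdefg"), max_len=5)

# True for real tokens, False for padding.
mask = [tok != "__PAD__" for tok in short]
print(short)  # ['Iranian', 'officials', 'say', '__PAD__', '__PAD__']
print(mask)   # [True, True, True, False, False]
```

A mask of this kind is what lets downstream layers ignore the padded positions.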

In [3]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

y = [[tag2idx[w[2]] for w in s] for s in sentences]
y = pad_sequences(maxlen=max_len, sequences=y, padding="post", value=tag2idx["O"])
y[1]

array([ 5,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  3,  4,
        4,  4, 14,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,
        4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4],
      dtype=int32)
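To read such an index array back as tag strings, an inverse mapping is handy. The sketch below uses a hypothetical miniature tag set and a helper called `decode_tags`; neither appears in the original notebook. Note that `tags` is built from a `set`, so the integer assigned to each tag can differ between Python sessions. It is therefore worth keeping `tag2idx` together with any trained model.

```python
import numpy as np

# Hypothetical miniature tag set; in the tutorial, `tags` comes from the dataset.
tags = ["O", "B-geo", "B-org", "B-tim"]
tag2idx = {t: i for i, t in enumerate(tags)}
idx2tag = {i: t for t, i in tag2idx.items()}

def decode_tags(label_ids, n_tokens):
    """Map the first n_tokens label ids back to tag strings, dropping padding."""
    return [idx2tag[int(i)] for i in label_ids[:n_tokens]]

padded = np.array([1, 0, 0, 3, 0, 0, 0, 0])  # last three entries are padding
decoded = decode_tags(padded, n_tokens=5)
print(decoded)  # ['B-geo', 'O', 'O', 'B-tim', 'O']
```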

In [4]:
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=2018)

### The ELMo residual LSTM model

In [5]:
# !pip install 'tensorflow_hub==0.4.0'

import tensorflow as tf
from tensorflow.python.framework.ops import disable_eager_execution
import tensorflow_hub as hub

disable_eager_execution()

batch_size = 32

sess = tf.compat.v1.Session()

In [6]:
from tensorflow.keras import backend as K

class ElmoEmbeddingLayer(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        self.dimensions = 1024
        self.trainable = False
        super(ElmoEmbeddingLayer, self).__init__(**kwargs)
        
    def build(self, input_shape):
        self.elmo = hub.Module('https://tfhub.dev/google/elmo/2', trainable=self.trainable, name="{}_module".format(self.name))
        self._trainable_weights += tf.compat.v1.trainable_variables(scope="^{}_module/.*".format(self.name))
        super(ElmoEmbeddingLayer, self).build(input_shape)
            
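    def call(self, x, mask=None):
        # The "tokens" signature expects pre-tokenized, padded sentences plus their lengths.
        # sequence_len is hard-coded, so every batch must contain exactly batch_size sentences.
        result = self.elmo(inputs={
                            "tokens": tf.squeeze(tf.cast(x, tf.string)),
                            "sequence_len": tf.constant(batch_size*[max_len])
                      },
                      signature="tokens",
                      as_dict=True)["elmo"]
        return result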
    
    def compute_mask(self, inputs, mask=None):
        return K.not_equal(inputs, '__PAD__')
    
    def compute_output_shape(self, input_shape):
        # One 1024-dimensional vector per token: (batch, max_len, dimensions)
        return (input_shape[0], input_shape[1], self.dimensions)

In [7]:
from tensorflow.keras import Input
from tensorflow.keras.models import Model
from tensorflow.keras.layers import LSTM, Dense, TimeDistributed, Bidirectional, add

input_text = Input(shape=(max_len,), dtype=tf.string)
embedding = ElmoEmbeddingLayer()(input_text)
x = Bidirectional(LSTM(units=128, return_sequences=True,
                       recurrent_dropout=0.2, dropout=0.2))(embedding)
x_rnn = Bidirectional(LSTM(units=128, return_sequences=True,
                           recurrent_dropout=0.2, dropout=0.2))(x)
x = add([x, x_rnn])  # residual connection to the first biLSTM
out = TimeDistributed(Dense(n_tags, activation="softmax"))(x)

model = Model(input_text, out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

model.summary()

INFO:tensorflow:Saver not created because there are no variables in the graph to restore


Instructions for updating:
If using Keras pass *_constraint arguments to layers.


Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 50)]         0                                            
__________________________________________________________________________________________________
elmo_embedding_layer (ElmoEmbed (32, None, 1024)     4           input_1[0][0]                    
__________________________________________________________________________________________________
bidirectional (Bidirectional)   (32, None, 256)      1180672     elmo_embedding_layer[0][0]       
__________________________________________________________________________________________________
bidirectional_1 (Bidirectional) (32, None, 256)      394240      bidirectional[0][0]              
______________________________________________________________________________________________
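The parameter counts of the two bidirectional layers in the summary can be checked by hand. The helper `bilstm_param_count` below is a hypothetical name used only for this check. It assumes the standard Keras LSTM with bias terms: four gates per direction, each with an input kernel, a recurrent kernel, and a bias.

```python
def bilstm_param_count(input_dim, units):
    """Trainable parameters of a bidirectional LSTM layer (standard LSTM with biases)."""
    # Each direction has 4 gates; each gate has an input kernel,
    # a recurrent kernel and a bias vector.
    per_direction = 4 * units * (input_dim + units + 1)
    return 2 * per_direction

first = bilstm_param_count(input_dim=1024, units=128)   # fed by the ELMo vectors
second = bilstm_param_count(input_dim=256, units=128)   # fed by the first biLSTM
print(first)   # 1180672
print(second)  # 394240
```

Both layers output 2 × 128 = 256 features per token, which is why their outputs can be summed in the residual connection.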

In [8]:
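# Keep only full batches: the ELMo layer above assumes exactly batch_size sentences per batch.
# 1213 and 135 are the numbers of training and validation batches.
X_tr, X_val = X_tr[:1213*batch_size], X_tr[-135*batch_size:]
y_tr, y_val = y_tr[:1213*batch_size], y_tr[-135*batch_size:]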
y_tr = y_tr.reshape(y_tr.shape[0], y_tr.shape[1], 1)
y_val = y_val.reshape(y_val.shape[0], y_val.shape[1], 1)
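The batch counts in the slices above are hard-coded. A more general way to obtain training and validation sizes that are both multiples of the batch size is sketched below. The function `split_in_full_batches` and its `val_fraction` parameter are assumptions for illustration, not part of the original notebook.

```python
def split_in_full_batches(n_samples, batch_size, val_fraction=0.1):
    """Return (n_train, n_val), both multiples of batch_size."""
    n_batches = n_samples // batch_size  # drop the incomplete last batch
    n_val_batches = max(1, round(n_batches * val_fraction))
    n_train_batches = n_batches - n_val_batches
    return n_train_batches * batch_size, n_val_batches * batch_size

n_train, n_val = split_in_full_batches(n_samples=1000, batch_size=32)
print(n_train, n_val)  # 896 96
```

The training set would then be the first `n_train` samples and the validation set the last `n_val` samples, following the same slicing pattern as above.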

In [9]:
with sess:
    sess.run(tf.compat.v1.global_variables_initializer())
    sess.run(tf.compat.v1.tables_initializer())
    history = model.fit(np.array(X_tr), y_tr, validation_data=(np.array(X_val), y_val),
                        batch_size=batch_size, epochs=5, verbose=1)

Train on 38816 samples, validate on 4320 samples
Epoch 1/5

KeyboardInterrupt: 