## Neural Part Of Speech Tagging

We're now going to solve the same problem of POS tagging with neural networks.
<img src=https://i.stack.imgur.com/6pdIT.png width=320>

From a deep learning perspective, this is the task of predicting a sequence of outputs aligned to a sequence of inputs. Several problems match this formulation:
* Part Of Speech Tagging - an auxiliary task for many NLP problems
* Named Entity Recognition - for chat bots and web crawlers
* Protein structure prediction - for bioinformatics

In [1]:
import tensorflow as tf
#import sklearn
#from tqdm import tqdm_notebook
import nltk
import sys
import numpy as np

In [None]:
#%tensorflow_version 2.x

In [2]:
nltk.download('brown')
nltk.download('universal_tagset') # download the universal part-of-speech tagset
data = nltk.corpus.brown.tagged_sents(tagset='universal') # Brown corpus sentences with universal POS tags attached to each word
all_tags = ['#EOS#','#UNK#','ADV', 'NOUN', 'ADP', 'PRON', 'DET', '.', 'PRT', 'VERB', 'X', 'NUM', 'CONJ', 'ADJ']

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\grebe\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     C:\Users\grebe\AppData\Roaming\nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


In [3]:
data = np.array([[(word.lower(),tag) for word,tag in sentence] for sentence in data ], dtype = object)

In [4]:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(data,test_size=0.25,random_state=42)

In [5]:
from IPython.display import HTML, display
def draw(sentence):
    words,tags = zip(*sentence)
    #display(HTML('<table><tr>{tags}</tr>{words}<tr></table>'.format(
    display(HTML('<table><tr>{tags}</tr><tr>{words}</tr></table>'.format(
                words = '<td>{}</td>'.format('</td><td>'.join(words)),
                tags = '<td>{}</td>'.format('</td><td>'.join(tags)))))


draw(data[0])
#draw(data[10])
#draw(data[7])

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24
DET,NOUN,NOUN,ADJ,NOUN,VERB,NOUN,DET,NOUN,ADP,NOUN,ADJ,NOUN,NOUN,VERB,.,DET,NOUN,.,ADP,DET,NOUN,VERB,NOUN,.
the,fulton,county,grand,jury,said,friday,an,investigation,of,atlanta's,recent,primary,election,produced,``,no,evidence,'',that,any,irregularities,took,place,.


### Building vocabularies

Just like before, we have to build a mapping from tokens to integer ids. This time around, our model operates on a word level, processing one word per RNN step. This means we'll have to deal with a far larger vocabulary.

Luckily for us, we only receive those words as input i.e. we don't have to predict them. This means we can have a large vocabulary for free by using word embeddings.
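
To make the "large vocabulary for free" point concrete, here is a minimal NumPy sketch of what an embedding layer does on the input side: it keeps one row per word id and looks rows up by index, so the per-word cost depends on the embedding size rather than on the vocabulary size. The names `vocab_size`, `emb_dim` and `embedding_matrix` are illustrative, and the random initialization is an assumption rather than what Keras does internally.

```python
import numpy as np

vocab_size, emb_dim = 10002, 50   # illustrative: 10000 words + 2 special tokens
rng = np.random.default_rng(0)

# One (trainable, in a real model) row per word id.
embedding_matrix = rng.normal(scale=0.1, size=(vocab_size, emb_dim))

# A toy batch of 2 sentences with 4 word ids each.
word_ids = np.array([[2, 3057, 5, 0],
                     [45, 12, 8, 511]])

# Lookup is plain row indexing: [batch, time] -> [batch, time, emb_dim].
embedded = embedding_matrix[word_ids]
print(embedded.shape)  # (2, 4, 50)
```

Predicting words would instead require a softmax over all `vocab_size` classes at every step, which is where a large vocabulary becomes expensive.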

In [6]:
from collections import Counter # helps to count objects and create dictionaries {'object':'quantity'}
word_counts = Counter()
for sentence in data:
    words,tags = zip(*sentence)
    word_counts.update(words)

all_words = ['#EOS#','#UNK#'] + list(list(zip(*word_counts.most_common(10000)))[0]) #List the n most common elements and their counts

#let's measure what fraction of data words are in the dictionary
print("Coverage = %.5f" % (float(sum(word_counts[w] for w in all_words)) / sum(word_counts.values())))

Coverage = 0.92876


In [7]:
from collections import defaultdict
word_to_id = defaultdict(lambda:1, { word: i for i, word in enumerate(all_words) })
tag_to_id = { tag: i for i, tag in enumerate(all_tags)}
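
As a quick illustration of the `defaultdict` trick above, the following sketch uses a hypothetical toy vocabulary (`toy_words` is not part of the notebook) to show that any out-of-vocabulary word falls back to id 1, the `#UNK#` token.

```python
from collections import defaultdict

# Hypothetical toy vocabulary; ids 0 and 1 are reserved as in the cell above.
toy_words = ['#EOS#', '#UNK#', 'the', 'cat', 'sat']
toy_word_to_id = defaultdict(lambda: 1, {w: i for i, w in enumerate(toy_words)})

sentence = ['the', 'cat', 'meowed']
ids = [toy_word_to_id[w] for w in sentence]
print(ids)  # [2, 3, 1]  ('meowed' is out of vocabulary, so it maps to #UNK#)
```

One side effect worth knowing: indexing a `defaultdict` with a missing key also stores that key, so the dictionary grows as unknown words are looked up.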

Convert words and tags into a fixed-size matrix:

In [8]:
def to_matrix(lines, token_to_id, max_len=None, pad=0, dtype='int32', time_major=False):
    """Converts a list of token sequences into an RNN-digestible matrix, padding after the end of each sequence"""

    max_len = max_len or max(map(len,lines))
    matrix = np.empty([len(lines), max_len],dtype)
    matrix.fill(pad)

    for i in range(len(lines)):
        line_ix = list(map(token_to_id.__getitem__,lines[i]))[:max_len]
        matrix[i,:len(line_ix)] = line_ix

    return matrix.T if time_major else matrix
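
The padding behaviour can be checked on a toy input. The sketch below is a standalone, simplified variant (the helper name `pad_to_matrix` is hypothetical and it takes ready-made integer ids instead of tokens), intended only to show the right-padding and truncation logic.

```python
import numpy as np

def pad_to_matrix(sequences, pad=0, max_len=None):
    """Right-pad integer sequences to a common length (illustrative helper)."""
    max_len = max_len or max(len(s) for s in sequences)
    matrix = np.full((len(sequences), max_len), pad, dtype='int32')
    for i, seq in enumerate(sequences):
        trimmed = seq[:max_len]          # truncate sequences longer than max_len
        matrix[i, :len(trimmed)] = trimmed
    return matrix

toy_ids = [[2, 3, 4], [5], [6, 7]]
padded = pad_to_matrix(toy_ids)
print(padded)
# [[2 3 4]
#  [5 0 0]
#  [6 7 0]]
```

Because the pad value 0 coincides with the `#EOS#` id, positions equal to 0 can later be treated as padding when computing metrics.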

In [9]:
batch_words, batch_tags = zip(*[zip(*sentence) for sentence in data[-3:]])

print("Word ids:")
print(to_matrix(batch_words, word_to_id))
print("Tag ids:")
print(to_matrix(batch_tags, tag_to_id))

Word ids:
[[   2 3057    5    2 2238 1334 4238 2454    3    6   19   26 1070   69
     8 2088    6    3    1    3  266   65  342    2    1    3    2  315
     1    9   87  216 3322   69 1558    4    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0]
 [  45   12    8  511 8419    6   60 3246   39    2    1    1    3    2
   845    1    3    1    3   10 9910    2    1 3470    9   43    1    1
     3    6    2 1046  385   73 4562    3    9    2    1    1 3250    3
    12   10    2  861 5240   12    8 8936  121    1    4]
 [  33   64   26   12  445    7 7346    9    8 3337    3    1 2811    3
     2  463  572    2    1    1 1649   12    1    4    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0]]
Tag ids:
[[ 6  3  4  6  3  3  9  9  7 12  4  5  9  4  6  3 12  7  9  7  9  8  4  6
   3  7  6 13  3  4  6  3  9  4  3  7  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0

### Build model

Unlike our previous lab, this time we'll focus on the high-level Keras interface to recurrent neural networks. It is as simple as you can get with RNNs, albeit somewhat constraining for complex tasks like seq2seq.

All Keras RNN layers apply to a whole sequence of inputs and produce either a sequence of hidden states (`return_sequences=True`) or just the last hidden state (`return_sequences=False`, which is the default). All the recurrence happens under the hood.
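
The two output modes can be sketched in plain NumPy. This is a simplified single-sequence forward pass with a `tanh` activation and hypothetical weight names (`W_x`, `W_h`, `b`); it is meant to illustrate the idea, not to reproduce the Keras implementation.

```python
import numpy as np

def simple_rnn_forward(x, W_x, W_h, b, return_sequences=True):
    """x: [time, input_dim]. Returns [time, units] or just the last state [units]."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in x:                                   # one word (time step) at a time
        h = np.tanh(x_t @ W_x + h @ W_h + b)
        states.append(h)
    return np.stack(states) if return_sequences else h

rng = np.random.default_rng(0)
time_steps, input_dim, units = 5, 3, 4
x = rng.normal(size=(time_steps, input_dim))
W_x = rng.normal(size=(input_dim, units))
W_h = rng.normal(size=(units, units))
b = np.zeros(units)

all_states = simple_rnn_forward(x, W_x, W_h, b, return_sequences=True)
last_state = simple_rnn_forward(x, W_x, W_h, b, return_sequences=False)
print(all_states.shape, last_state.shape)  # (5, 4) (4,)
```

For tagging we need one prediction per word, hence one hidden state per time step.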

At the top of our model we need to apply a Dense layer to each time step independently. We wrap it in __keras.layers.TimeDistributed__, which applies the wrapped layer to every time step with shared weights, i.e. across both batch and time axes. (In recent Keras versions a plain `Dense` layer also acts on the last axis of a 3D input, so the wrapper mainly makes the intent explicit.)
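
What "applying a Dense layer to each time step" means can be sketched in NumPy: one weight matrix shared across all positions. The shapes and names (`hidden`, `W`, `b`, `n_tags`) are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
batch, time_steps, units, n_tags = 2, 5, 4, 3
hidden = rng.normal(size=(batch, time_steps, units))   # stand-in for RNN outputs
W = rng.normal(size=(units, n_tags))
b = np.zeros(n_tags)

# The same weights are used for every (sentence, time step) pair.
tag_probs = softmax(hidden @ W + b)                     # [batch, time, n_tags]

# Explicit loop over sentences and time steps, for comparison.
looped = np.array([[softmax(h_t @ W + b) for h_t in sent] for sent in hidden])
print(tag_probs.shape, np.allclose(tag_probs, looped))  # (2, 5, 3) True
```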

In [10]:
import keras
import keras.layers as L

model = keras.models.Sequential()
model.add(L.InputLayer([None],dtype='int32'))
model.add(L.Embedding(len(all_words),50))
model.add(L.SimpleRNN(64,return_sequences=True))

#add top layer that predicts tag probabilities
stepwise_dense = L.Dense(len(all_tags),activation='softmax')
stepwise_dense = L.TimeDistributed(stepwise_dense)
model.add(stepwise_dense)

__Training:__ in this case we don't want to prepare the whole training dataset in advance. The main reason is that the length of every batch depends on the maximum sentence length within the batch. This leaves us two options: use custom training code as in the previous seminar, or use generators.
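
To see why per-batch padding matters, here is a small pure-Python sketch. The helper `iterate_padded_batches` and the toy sentence lengths are hypothetical. It compares the number of matrix cells needed when each batch is padded to its own longest sentence against padding everything to the global maximum.

```python
import random

def iterate_padded_batches(sequences, batch_size, pad=0, seed=0):
    """Yield batches padded only to the longest sequence in each batch."""
    order = list(range(len(sequences)))
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        batch = [sequences[i] for i in order[start:start + batch_size]]
        width = max(len(seq) for seq in batch)
        yield [list(seq) + [pad] * (width - len(seq)) for seq in batch]

# Hypothetical sentences: mostly short, with one long outlier.
toy = [[1] * n for n in (3, 50, 4, 5, 2, 6)]

per_batch_cells = sum(len(b) * len(b[0]) for b in iterate_padded_batches(toy, 2))
global_cells = len(toy) * max(len(s) for s in toy)
print(per_batch_cells, global_cells)
```

With a single long outlier, only the batch containing it pays for the extra padding.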

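Keras models have a __`model.fit_generator`__ method that accepts a Python generator yielding one batch at a time. (Recent TensorFlow releases deprecate it in favour of `model.fit`, which accepts generators directly.) But first we need to implement such a generator: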

In [11]:
import tensorflow
from tensorflow.keras.utils import to_categorical

BATCH_SIZE=32
def generate_batches(sentences,batch_size=BATCH_SIZE,max_len=None,pad=0):
    assert isinstance(sentences,np.ndarray),"Make sure sentences is a numpy array"

    while True:
        indices = np.random.permutation(np.arange(len(sentences)))
        for start in range(0,len(indices),batch_size):  # cover every index, including a trailing single-sentence batch
            batch_indices = indices[start:start+batch_size]
            batch_words,batch_tags = [],[]
            for sent in sentences[batch_indices]:
                words,tags = zip(*sent)
                batch_words.append(words)
                batch_tags.append(tags)

            batch_words = to_matrix(batch_words,word_to_id,max_len,pad)
            batch_tags = to_matrix(batch_tags,tag_to_id,max_len,pad)

            batch_tags_1hot = to_categorical(batch_tags,len(all_tags)).reshape(batch_tags.shape+(-1,))
            yield batch_words,batch_tags_1hot


__Callbacks:__ Another thing we need is to measure model performance. The tricky part is to avoid counting accuracy on padding (after a sentence ends) and to make sure we count all the validation data exactly once.

While it isn't impossible to persuade Keras to do all of that, we may as well write our own callback that does it.
Keras callbacks allow you to write custom code to be run once every epoch or every minibatch. We'll define one by subclassing `keras.callbacks.Callback`.
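
The effect of excluding padding can be seen on a toy example. The arrays below are made up for illustration, and `masked_accuracy` is a hypothetical helper that assumes padding is marked by word id 0.

```python
import numpy as np

def masked_accuracy(predicted_tags, true_tags, word_ids, pad_id=0):
    """Accuracy over real tokens only; padded positions are ignored."""
    mask = word_ids != pad_id
    correct = (predicted_tags == true_tags) & mask
    return correct.sum() / mask.sum()

word_ids  = np.array([[5, 9, 4, 0, 0], [7, 3, 0, 0, 0]])
true_tags = np.array([[6, 3, 7, 0, 0], [5, 9, 0, 0, 0]])
pred_tags = np.array([[6, 3, 9, 0, 0], [5, 9, 0, 0, 0]])

print(masked_accuracy(pred_tags, true_tags, word_ids))  # 0.8  (4 of 5 real tokens)
print((pred_tags == true_tags).mean())                  # 0.9  (inflated by padding)
```

The naive score is inflated because predicting the pad tag on padded positions is trivially correct.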

In [12]:
def compute_test_accuracy(model):
    test_words,test_tags = zip(*[zip(*sentence) for sentence in test_data])
    test_words,test_tags = to_matrix(test_words,word_to_id),to_matrix(test_tags,tag_to_id)

    #predict tag probabilities of shape [batch,time,n_tags]
    predicted_tag_probabilities = model.predict(test_words,verbose=1)
    predicted_tags = predicted_tag_probabilities.argmax(axis=-1)

    #compute accuracy excluding padding
    numerator = np.sum(np.logical_and((predicted_tags == test_tags),(test_words != 0)))
    denominator = np.sum(test_words != 0)
    return float(numerator)/denominator


class EvaluateAccuracy(keras.callbacks.Callback):
    def on_epoch_end(self,epoch,logs=None):
        sys.stdout.flush()
        print("\nMeasuring validation accuracy...")
        acc = compute_test_accuracy(self.model)
        print("\nValidation accuracy: %.5f\n"%acc)
        sys.stdout.flush()

In [13]:
model.compile('adam','categorical_crossentropy')

model.fit_generator(generate_batches(train_data),len(train_data)/BATCH_SIZE,
                    callbacks=[EvaluateAccuracy()], epochs=5,)

Epoch 1/5
Measuring validation accuracy...

Validation accuracy: 0.93995

Epoch 2/5
Measuring validation accuracy...

Validation accuracy: 0.94483

Epoch 3/5
Measuring validation accuracy...

Validation accuracy: 0.94617

Epoch 4/5
Measuring validation accuracy...

Validation accuracy: 0.94630

Epoch 5/5
Measuring validation accuracy...

Validation accuracy: 0.94550



<keras.src.callbacks.History at 0x1b9c6d820d0>

Measure final accuracy on the whole test set.

In [14]:
acc = compute_test_accuracy(model)
print("Final accuracy: %.5f"%acc)

assert acc>0.94, "Keras has gone on a rampage again, please contact course staff."

Final accuracy: 0.94550


### Going bidirectional

Since we're analyzing a full sequence, it's legal for us to look into future data.

A simple way to achieve that is to go both directions at once, making a __bidirectional RNN__.

In Keras you can achieve that both manually (using two LSTMs and Concatenate) and by using __`keras.layers.Bidirectional`__.

This one works just like the `TimeDistributed` wrapper we saw before: you wrap it around a recurrent layer (SimpleRNN now and LSTM/GRU later) and it creates two layers under the hood, one per direction.
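
The manual route can be sketched in NumPy: run one RNN left to right, another on the reversed sequence, flip the second result back into the original order, and concatenate the features. All names and weight scales below are illustrative assumptions; this mirrors the concatenating merge, not the Keras internals.

```python
import numpy as np

def rnn_states(x, W_x, W_h):
    """Plain tanh RNN over x: [time, input_dim] -> [time, units]."""
    h = np.zeros(W_h.shape[0])
    out = []
    for x_t in x:
        h = np.tanh(x_t @ W_x + h @ W_h)
        out.append(h)
    return np.stack(out)

def bidirectional_states(x, fwd, bwd):
    forward = rnn_states(x, *fwd)
    backward = rnn_states(x[::-1], *bwd)[::-1]   # re-align to the original time order
    return np.concatenate([forward, backward], axis=-1)

rng = np.random.default_rng(0)
T, D, H = 6, 3, 4
x = rng.normal(size=(T, D))
fwd = (0.5 * rng.normal(size=(D, H)), 0.5 * rng.normal(size=(H, H)))
bwd = (0.5 * rng.normal(size=(D, H)), 0.5 * rng.normal(size=(H, H)))

states = bidirectional_states(x, fwd, bwd)
print(states.shape)  # (6, 8)

# Perturb only the last input and compare the states one step earlier.
x_changed = x.copy()
x_changed[-1] += 1.0
changed = bidirectional_states(x_changed, fwd, bwd)
print(np.abs(states[-2, :H] - changed[-2, :H]).max())  # forward half: unaffected by the future
print(np.abs(states[-2, H:] - changed[-2, H:]).max())  # backward half: sees the future word
```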

Your first task is to use such a layer in our POS tagger.

In [15]:
#Define a model that utilizes bidirectional SimpleRNN

model = keras.models.Sequential()
model.add(L.InputLayer([None],dtype='int32'))
model.add(L.Embedding(len(all_words),50))
model.add(L.Bidirectional(L.SimpleRNN(64,return_sequences=True)))
model.add(L.TimeDistributed(L.Dense(len(all_tags),activation='softmax')))


#<Your code here!>


In [16]:
model.compile('adam','categorical_crossentropy')

model.fit_generator(generate_batches(train_data),len(train_data)/BATCH_SIZE,
                    callbacks=[EvaluateAccuracy()], epochs=5,)

Epoch 1/5
Measuring validation accuracy...

Validation accuracy: 0.95513

Epoch 2/5
Measuring validation accuracy...

Validation accuracy: 0.96061

Epoch 3/5
Measuring validation accuracy...

Validation accuracy: 0.96131

Epoch 4/5
Measuring validation accuracy...

Validation accuracy: 0.96224

Epoch 5/5
Measuring validation accuracy...

Validation accuracy: 0.96157



<keras.src.callbacks.History at 0x1b9cc1aa110>

In [17]:
acc = compute_test_accuracy(model)
print("\nFinal accuracy: %.5f"%acc)

assert acc>0.96, "Bidirectional RNNs are better than this!"
print("Well done!")


Final accuracy: 0.96157
Well done!


Create **at least one experiment** from the list below; choose the most interesting and promising options to improve the performance of the bidirectional tagger:

* __Go beyond SimpleRNN__: there's `keras.layers.LSTM` and `keras.layers.GRU`
  * If you want to use a custom recurrent Cell, read [this](https://keras.io/layers/recurrent/#rnn)
  * You can also use 1D Convolutions (`keras.layers.Conv1D`). They are often as good as recurrent layers but with less overfitting.
* __Stack more layers__: if there is a common motif to this course it's about stacking layers
  * You can just add recurrent and 1dconv layers on top of one another and keras will understand it
  * Just remember that bigger networks may need more epochs to train
* __Regularization__: you can apply dropouts as usual but also in an RNN-specific way
  * `keras.layers.Dropout` works in between RNN layers
  * Recurrent layers also have a `recurrent_dropout` parameter
* __Gradient clipping__: If your training isn't as stable as you'd like, set `clipnorm` in your optimizer.
  * Which is to say, it's a good idea to watch over your loss curve at each minibatch. Try tensorboard callback or something similar.
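
As a rough illustration of the gradient clipping item, the sketch below rescales a gradient tensor whenever its L2 norm exceeds a threshold. It assumes per-tensor clipping, which is how the `clipnorm` optimizer argument is commonly described; the helper `clip_by_norm` and the toy gradients are hypothetical.

```python
import numpy as np

def clip_by_norm(grad, clipnorm):
    """Rescale one gradient tensor so that its L2 norm is at most `clipnorm`."""
    norm = np.linalg.norm(grad)
    if norm <= clipnorm:
        return grad                      # small gradients pass through unchanged
    return grad * (clipnorm / norm)      # large gradients keep direction, lose magnitude

grads = [np.array([3.0, 4.0]), np.array([0.1, 0.2])]   # L2 norms: 5.0 and about 0.22
clipped = [clip_by_norm(g, clipnorm=1.0) for g in grads]
print(clipped[0])  # [0.6 0.8]
print(clipped[1])  # [0.1 0.2]
```

Clipping changes only the step size, not the direction, which is why it tends to tame occasional exploding gradients in RNNs without altering ordinary updates.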

### **Stack more Layers**

In [20]:
#<Your code here!>

model = keras.models.Sequential()
model.add(L.InputLayer([None],dtype='int32'))
model.add(L.Embedding(len(all_words),50))

model.add(L.Bidirectional(L.SimpleRNN(64,return_sequences=True), merge_mode='concat'))
model.add(L.Bidirectional(L.SimpleRNN(64,return_sequences=True), merge_mode='concat'))

model.add(L.Conv1D(64, kernel_size=3, padding='same', activation='relu'))

model.add(L.TimeDistributed(L.Dense(len(all_tags),activation='softmax')))
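
The convolution can sit between the recurrent layers and the per-step Dense layer because `padding='same'` keeps the time axis unchanged. A minimal NumPy sketch of that property, for a single sequence with an odd kernel width and no bias term (both simplifying assumptions), with hypothetical names throughout:

```python
import numpy as np

def conv1d_same(x, kernel):
    """x: [time, in_channels]; kernel: [width, in_channels, out_channels], odd width."""
    width = kernel.shape[0]
    half = width // 2
    padded = np.pad(x, ((half, half), (0, 0)))       # zero-pad the time axis only
    out = np.stack([
        np.tensordot(padded[t:t + width], kernel, axes=([0, 1], [0, 1]))
        for t in range(x.shape[0])
    ])
    return np.maximum(out, 0.0)                      # ReLU

rng = np.random.default_rng(0)
x = rng.normal(size=(7, 128))            # 7 time steps, 128 = 2 * 64 bidirectional features
kernel = rng.normal(size=(3, 128, 64))
y = conv1d_same(x, kernel)
print(y.shape)  # (7, 64)
```

Each output position mixes the features of a word with those of its immediate neighbours, so there is still exactly one output vector per input word.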

In [21]:
model.compile('adam','categorical_crossentropy')

model.fit_generator(generate_batches(train_data),len(train_data)/BATCH_SIZE,
                    callbacks=[EvaluateAccuracy()], epochs=4,)

Epoch 1/4
Measuring validation accuracy...

Validation accuracy: 0.95858

Epoch 2/4
Measuring validation accuracy...

Validation accuracy: 0.96043

Epoch 3/4
Measuring validation accuracy...

Validation accuracy: 0.96213

Epoch 4/4
Measuring validation accuracy...

Validation accuracy: 0.96152



<keras.src.callbacks.History at 0x1b9d05efb90>

In [22]:
acc = compute_test_accuracy(model)
print("\nFinal accuracy: %.5f"%acc)

assert acc>0.96, "Bidirectional RNNs are better than this!"
print("Well done!")


Final accuracy: 0.96152
Well done!


### **Stack more Layers + Regularization**

In [23]:
#<Your code here!>

model = keras.models.Sequential()
model.add(L.InputLayer([None],dtype='int32'))
model.add(L.Embedding(len(all_words),50))

model.add(L.Bidirectional(L.SimpleRNN(64,return_sequences=True, dropout=0.2, recurrent_dropout=0.2), merge_mode='concat'))
model.add(L.Bidirectional(L.SimpleRNN(64,return_sequences=True,dropout=0.2, recurrent_dropout=0.2), merge_mode='concat'))

model.add(L.Conv1D(64, kernel_size=3, padding='same', activation='relu'))

model.add(L.TimeDistributed(L.Dense(len(all_tags),activation='softmax')))
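
In the recurrent layers above, `dropout` acts on the layer inputs and `recurrent_dropout` on the recurrent state. The elementwise operation behind both can be sketched as "inverted" dropout: zero a random fraction of values and rescale the rest so the expected value stays the same. The helper below is a hypothetical illustration, not the Keras implementation.

```python
import numpy as np

def inverted_dropout(x, rate, rng):
    """Zero each element with probability `rate`, rescale survivors by 1 / (1 - rate)."""
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones((1000, 64))
dropped = inverted_dropout(x, rate=0.2, rng=rng)

print(np.unique(dropped))   # only two values occur: 0 and 1.25
print(dropped.mean())       # close to 1.0, so the expected activation is preserved
```

At inference time dropout is switched off, and thanks to the rescaling no further correction is needed.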

In [24]:
model.compile('adam','categorical_crossentropy')

model.fit_generator(generate_batches(train_data),len(train_data)/BATCH_SIZE,
                    callbacks=[EvaluateAccuracy()], epochs=7,)

Epoch 1/7
Measuring validation accuracy...

Validation accuracy: 0.95424

Epoch 2/7
Measuring validation accuracy...

Validation accuracy: 0.96152

Epoch 3/7
Measuring validation accuracy...

Validation accuracy: 0.96339

Epoch 4/7
Measuring validation accuracy...

Validation accuracy: 0.96539

Epoch 5/7
Measuring validation accuracy...

Validation accuracy: 0.96519

Epoch 6/7
Measuring validation accuracy...

Validation accuracy: 0.96609

Epoch 7/7
Measuring validation accuracy...

Validation accuracy: 0.96647



<keras.src.callbacks.History at 0x1b9ded07850>

In [25]:
acc = compute_test_accuracy(model)
print("\nFinal accuracy: %.5f"%acc)

assert acc>0.96, "Bidirectional RNNs are better than this!"
print("Well done!")


Final accuracy: 0.96647
Well done!
