# Recurrent Neural Networks: Part-of-Speech Tagging with a Bidirectional LSTM

In [1]:
import nltk
import numpy as np
import requests
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.callbacks import ModelCheckpoint
from nltk.corpus import treebank, brown, conll2000
from sklearn.model_selection import train_test_split

### Part-of-Speech Tagging with a Bidirectional LSTM

It's difficult to find free sequence labelling datasets because they're so labour-intensive to create.
<br><br>
Fortunately, the **Natural Language Toolkit (NLTK)** includes enough free labelled corpora for our purposes, and provides them in a convenient uniform format.<br>
https://www.nltk.org/index.html<br>
https://www.nltk.org/nltk_data/<br>
<br>
We'll use the Treebank, Brown, and CONLL-2000 datasets.

In [2]:
nltk.download('treebank')
nltk.download('brown')
nltk.download('conll2000')

[nltk_data] Downloading package treebank to
[nltk_data]     C:\Users\Sawan\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\treebank.zip.
[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\Sawan\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\brown.zip.
[nltk_data] Downloading package conll2000 to
[nltk_data]     C:\Users\Sawan\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\conll2000.zip.


True

In their original form, the datasets use different part-of-speech (PoS) tagsets. We need to ensure they all use the same one, so we'll download a simplified tagset called the *universal_tagset* from NLTK.<br>

See Section 2.3 here for a list of tags: https://www.nltk.org/book/ch05.html

In [3]:
nltk.download('universal_tagset')

[nltk_data] Downloading package universal_tagset to
[nltk_data]     C:\Users\Sawan\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\universal_tagset.zip.


True

Next, we'll retrieve the tagged sentences from each dataset, taking care to specify the *universal tagset* we just downloaded, and combine them into one collection.

In [8]:
# Load all PoS-tagged sentences and concatenate them into one collection.
tagged_sentences = treebank.tagged_sents(tagset='universal') +\
                   brown.tagged_sents(tagset='universal') +\
                   conll2000.tagged_sents(tagset='universal')

print(tagged_sentences[0])
print("\n")
print(f"Dataset size: {len(tagged_sentences)}")

[('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ('61', 'NUM'), ('years', 'NOUN'), ('old', 'ADJ'), (',', '.'), ('will', 'VERB'), ('join', 'VERB'), ('the', 'DET'), ('board', 'NOUN'), ('as', 'ADP'), ('a', 'DET'), ('nonexecutive', 'ADJ'), ('director', 'NOUN'), ('Nov.', 'NOUN'), ('29', 'NUM'), ('.', '.')]


Dataset size: 72202


Each tagged sentence is actually a list of word-tag tuples (bear in mind that NLTK's universal tagset is a reduced tagset so items such as *proper nouns* are simply tagged as *nouns*).<br>

Our model is going to take in a sequence of words, and output a sequence of PoS tags, so we need to separate the words from the tags in our dataset. The tag sequences will serve as our training labels.

In [9]:
sentences, sentence_tags = [], []

for s in tagged_sentences:
  sentence, tags = zip(*s)
  sentences.append(list(sentence))
  sentence_tags.append(list(tags))

The sentences and their respective tags are now in separate lists.

In [6]:
print(sentences[0])
print(sentence_tags[0])

['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
['NOUN', 'NOUN', '.', 'NUM', 'NOUN', 'ADJ', '.', 'VERB', 'VERB', 'DET', 'NOUN', 'ADP', 'DET', 'ADJ', 'NOUN', 'NOUN', 'NUM', '.']


In [7]:
print(len(sentences), len(sentence_tags))

72202 72202


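Create train/validation/test splits. *train_test_split* only produces two partitions, so we'll call it twice: first to hold out 25% of the data, then to divide that held-out portion into a validation set (15% of the total) and a test set (10% of the total).<br>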
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [10]:
train_ratio = 0.75
validation_ratio = 0.15
test_ratio = 0.10

x_train, x_test, y_train, y_test = train_test_split(sentences, sentence_tags,
                                                    test_size=1 - train_ratio,
                                                    random_state=1)

x_val, x_test, y_val, y_test = train_test_split(x_test, y_test,
                                                test_size=test_ratio/(test_ratio + validation_ratio),
                                                random_state=1)

In [11]:
print(len(x_train), len(y_train))
print(len(x_val), len(y_val))
print(len(x_test), len(y_test))

54151 54151
10830 10830
7221 7221
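The split sizes printed above follow from simple arithmetic. The sketch below reproduces them without scikit-learn, assuming the held-out portion of each split is rounded up to a whole number of samples and the remainder goes to the other partition. The helper name `two_stage_split_sizes` is made up for this illustration.

```python
import math

def two_stage_split_sizes(n, train_ratio, validation_ratio, test_ratio):
    """Return (train, validation, test) sizes for a two-stage split.

    Assumes the held-out portion of each split is rounded up, which is
    consistent with the sizes printed above.
    """
    # Stage 1: hold out everything that isn't training data.
    holdout = math.ceil(n * (1 - train_ratio))
    train = n - holdout

    # Stage 2: carve the test set out of the held-out portion.
    test_fraction = test_ratio / (test_ratio + validation_ratio)
    test = math.ceil(holdout * test_fraction)
    validation = holdout - test
    return train, validation, test

sizes = two_stage_split_sizes(72202, 0.75, 0.15, 0.10)
print(sizes)  # (54151, 10830, 7221)
```

The second call's `test_size` is 0.10 / 0.25 = 0.4 because it is a fraction of the held-out 25%, not of the whole dataset.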


Now that we have our datasets preprocessed, the next step is to vectorize. 

First, we need to create a tokenizer for the sentences and *fit* it to the training dataset to create a vocabulary. We'll just use the default tokenizer settings. Because our sentences are already lists of tokens, the tokenizer only lowercases them; the default punctuation filtering and splitting on spaces apply to plain strings, not to pre-tokenized lists. We'll also supply an out-of-vocabulary token (\<OOV\>) in case the tokenizer encounters words during testing/inference which it didn't see during training.<br>
https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer

In [13]:
sentence_tokenizer = keras.preprocessing.text.Tokenizer(oov_token='<OOV>')
sentence_tokenizer

<keras.src.legacy.preprocessing.text.Tokenizer at 0x13275cc25d0>

In [14]:
sentence_tokenizer.fit_on_texts(x_train)

In [15]:
print(f"Vocabulary size: {len(sentence_tokenizer.word_index)}")

Vocabulary size: 52041
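To make the vocabulary-building step less of a black box, here is a minimal pure-Python sketch of the idea, not the Keras implementation. It assumes tokens are lowercased, ranked by frequency, and that index 1 is given to the OOV token while index 0 stays free for padding. The names `build_word_index` and `encode` are hypothetical.

```python
from collections import Counter

def build_word_index(token_lists, oov_token="<OOV>"):
    """Map each lowercased token to an integer, most frequent first."""
    counts = Counter(tok.lower() for toks in token_lists for tok in toks)
    # Index 0 is left free for padding; index 1 goes to the OOV token.
    word_index = {oov_token: 1}
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index

def encode(tokens, word_index, oov_token="<OOV>"):
    """Convert tokens to integers, falling back to the OOV index."""
    return [word_index.get(tok.lower(), word_index[oov_token]) for tok in tokens]

toy_index = build_word_index([["The", "cat", "sat"], ["the", "dog"]])
toy_encoded = encode(["the", "bird"], toy_index)
print(toy_index)    # {'<OOV>': 1, 'the': 2, 'cat': 3, 'sat': 4, 'dog': 5}
print(toy_encoded)  # [2, 1]
```

"bird" never appeared when the toy vocabulary was built, so it maps to the OOV index instead of raising an error.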


We also need to create *another* tokenizer for the tags since our labels are also sequences. This time, we won't need an OOV token because there are only a handful of tags and, in this case, they'll all be encountered during training.

In [17]:
tag_tokenizer = keras.preprocessing.text.Tokenizer()
tag_tokenizer.fit_on_texts(y_train)
tag_tokenizer

<keras.src.legacy.preprocessing.text.Tokenizer at 0x1327b8e25d0>

In [18]:
print(f"Number of PoS tags: {len(tag_tokenizer.word_index)}\n")
tag_tokenizer.get_config()

Number of PoS tags: 12



{'num_words': None,
 'filters': '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
 'lower': True,
 'split': ' ',
 'char_level': False,
 'oov_token': None,
 'document_count': 54151,
 'word_counts': '{"det": 126968, "verb": 174593, "adj": 80523, "adp": 136453, "noun": 286676, "adv": 51205, ".": 142935, "pron": 44684, "conj": 35060, "num": 21461, "prt": 31229, "x": 6090}',
 'word_docs': '{"verb": 50837, "adj": 36344, "noun": 51171, "adp": 43855, ".": 53332, "det": 44747, "adv": 29531, "conj": 24383, "num": 11964, "pron": 26965, "prt": 21777, "x": 2682}',
 'index_docs': '{"2": 50837, "6": 36344, "1": 51171, "4": 43855, "3": 53332, "5": 44747, "7": 29531, "9": 24383, "11": 11964, "8": 26965, "10": 21777, "12": 2682}',
 'index_word': '{"1": "noun", "2": "verb", "3": ".", "4": "adp", "5": "det", "6": "adj", "7": "adv", "8": "pron", "9": "conj", "10": "prt", "11": "num", "12": "x"}',
 'word_index': '{"noun": 1, "verb": 2, ".": 3, "adp": 4, "det": 5, "adj": 6, "adv": 7, "pron": 8, "conj": 9, "prt": 10, "

In [19]:
# The set of universal PoS tags.
tag_tokenizer.word_index

{'noun': 1,
 'verb': 2,
 '.': 3,
 'adp': 4,
 'det': 5,
 'adj': 6,
 'adv': 7,
 'pron': 8,
 'conj': 9,
 'prt': 10,
 'num': 11,
 'x': 12}

Next, we need to vectorize our sentences and corresponding tags. We'll use the tokenizer's *texts_to_sequences* method to convert each sentence to a sequence of integers where each integer maps to a particular token.<br>
https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#texts_to_sequences

In [21]:
x_train_seqs = sentence_tokenizer.texts_to_sequences(x_train)
x_train_seqs

[[27, 86, 21, 479, 7, 2, 920, 10903, 20547, 3327, 5644, 337, 4],
 [52,
  28,
  14029,
  1496,
  10,
  4802,
  12,
  8,
  516,
  38,
  3,
  9,
  2358,
  5,
  242,
  4198,
  6,
  104,
  20548,
  3,
  43,
  834,
  379,
  75,
  42,
  5,
  2,
  7843,
  678,
  5,
  2,
  141,
  4],
 [12314, 1444, 10, 802, 7, 21, 706, 207, 99, 4],
 [30, 222, 44, 162, 140, 9, 20, 8, 581, 1497, 13, 4],
 [3088,
  3,
  17,
  1892,
  65,
  10,
  906,
  19,
  2,
  160,
  158,
  3,
  2,
  3238,
  43,
  224,
  2454,
  1362,
  4336,
  4992,
  14030,
  4],
 [17,
  253,
  2,
  14031,
  123,
  667,
  2,
  1156,
  6,
  364,
  80,
  2811,
  3,
  6,
  16,
  15,
  22,
  505,
  7,
  2455,
  93,
  4],
 [22,
  87,
  635,
  3,
  110,
  2,
  4199,
  67,
  9901,
  66,
  3,
  12,
  8,
  1778,
  5,
  2065,
  10,
  28,
  255,
  46,
  20549,
  3,
  31,
  98,
  2517,
  5,
  27,
  635,
  17,
  2748,
  2181,
  3,
  6,
  12315,
  8,
  28206,
  12316,
  3,
  229,
  7844,
  22,
  5645,
  9902,
  4],
 [2022, 321, 72, 372, 1383, 81, 4],
 [2,
 

In [22]:
print(x_train_seqs[0])

[27, 86, 21, 479, 7, 2, 920, 10903, 20547, 3327, 5644, 337, 4]


We can use the *sequences_to_texts* method to convert a vectorized sentence back to its preprocessed form.<br>
https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#sequences_to_texts

In [23]:
print(f"Original: {x_train[0]}")
print(f"Reconstructed: {sentence_tokenizer.sequences_to_texts([x_train_seqs[0]])}")

Original: ['This', 'may', 'be', 'due', 'to', 'the', 'heavy', 'interlobular', 'connective', 'tissue', 'barriers', 'present', '.']
Reconstructed: ['this may be due to the heavy interlobular connective tissue barriers present .']


Next, we'll vectorize the labels (i.e. sequences of PoS tags) using the tag tokenizer.

In [25]:
y_train_seqs = tag_tokenizer.texts_to_sequences(y_train)
y_train_seqs

[[5, 2, 2, 6, 4, 5, 6, 6, 6, 1, 1, 7, 3],
 [8,
  2,
  7,
  6,
  4,
  8,
  2,
  5,
  1,
  5,
  3,
  4,
  1,
  4,
  6,
  1,
  9,
  6,
  1,
  3,
  2,
  2,
  8,
  4,
  11,
  4,
  5,
  6,
  1,
  4,
  5,
  1,
  3],
 [1, 2, 5, 1, 10, 2, 6, 6, 1, 3],
 [8, 2, 8, 2, 2, 10, 4, 5, 1, 1, 3, 3],
 [7, 3, 8, 2, 4, 5, 2, 4, 5, 1, 1, 3, 5, 1, 2, 7, 7, 7, 6, 1, 1, 3],
 [8, 2, 5, 1, 7, 4, 5, 1, 9, 2, 8, 7, 3, 9, 8, 2, 5, 1, 10, 2, 7, 3],
 [5,
  6,
  1,
  3,
  4,
  5,
  1,
  3,
  11,
  3,
  3,
  2,
  5,
  1,
  4,
  1,
  8,
  2,
  7,
  10,
  6,
  3,
  9,
  4,
  1,
  4,
  5,
  1,
  8,
  2,
  1,
  3,
  9,
  2,
  5,
  6,
  1,
  3,
  4,
  2,
  5,
  6,
  1,
  3],
 [1, 1, 3, 11, 11, 1, 3],
 [5,
  1,
  4,
  8,
  2,
  2,
  4,
  5,
  1,
  5,
  1,
  4,
  5,
  6,
  1,
  2,
  7,
  2,
  8,
  5,
  5,
  7,
  6,
  3,
  8,
  7,
  2,
  3,
  4,
  5,
  1,
  2,
  2,
  7,
  7,
  6,
  4,
  6,
  1,
  4,
  4,
  5,
  6,
  9,
  2,
  1,
  7,
  1,
  9,
  1,
  9,
  1,
  9,
  1,
  2,
  2,
  5,
  1,
  1,
  3],
 [5, 1, 7, 2, 1, 3, 1, 4, 5,

In [26]:
tag_tokenizer.sequences_to_texts([y_train_seqs[0]])

['det verb verb adj adp det adj adj adj noun noun adv .']

Finally, we'll do the same with the validation inputs and labels.

In [27]:
x_val_seqs = sentence_tokenizer.texts_to_sequences(x_val)
x_val_seqs

[[7,
  173,
  27,
  3,
  5,
  243,
  3,
  12,
  7,
  165,
  64,
  8,
  424,
  20,
  42,
  289,
  5,
  8,
  4192,
  249,
  20,
  93,
  11,
  74,
  84,
  807,
  103,
  3,
  32,
  3,
  23,
  88,
  321,
  3,
  152,
  2,
  740,
  5,
  2,
  47852,
  681,
  346,
  9,
  4907,
  222,
  4],
 [719, 17, 230, 75, 2, 3396, 6, 1001, 115, 7, 165, 151, 22, 2932, 4],
 [27, 15, 2, 214, 108, 19, 2, 1, 390, 6, 2, 36627, 3, 38863, 10659, 4],
 [31,
  5751,
  9,
  10,
  15031,
  1184,
  51,
  15,
  2,
  7174,
  437,
  5,
  8829,
  2432,
  5,
  1206,
  4],
 [2,
  2209,
  836,
  149,
  67,
  2119,
  986,
  66,
  12,
  8,
  31698,
  12044,
  5,
  9700,
  2329,
  3,
  1172,
  19,
  23,
  295,
  720,
  1804,
  5,
  1,
  28051,
  4],
 [14, 783, 552, 130, 620, 18, 580, 18, 30, 140, 198, 4],
 [16,
  12,
  59,
  30,
  824,
  1281,
  57,
  15,
  8,
  406,
  1012,
  5,
  2,
  314,
  3,
  6,
  654,
  5,
  1909,
  3,
  6,
  37,
  8434,
  7,
  109,
  71,
  26302,
  6,
  1424,
  238,
  3,
  10,
  16,
  16471,
  4],
 [182,
 

In [28]:
y_val_seqs = tag_tokenizer.texts_to_sequences(y_val)
y_val_seqs

[[10,
  2,
  5,
  3,
  4,
  1,
  3,
  2,
  10,
  2,
  10,
  5,
  1,
  4,
  11,
  1,
  4,
  5,
  1,
  2,
  10,
  7,
  4,
  5,
  11,
  11,
  1,
  3,
  9,
  3,
  4,
  5,
  1,
  3,
  4,
  5,
  1,
  4,
  5,
  7,
  6,
  1,
  4,
  6,
  1,
  3],
 [7, 8, 2, 4, 5, 1, 9, 2, 10, 10, 2, 10, 5, 1, 3],
 [5, 2, 5, 6, 1, 4, 5, 6, 1, 9, 5, 6, 3, 6, 1, 3],
 [9, 2, 4, 5, 6, 1, 10, 2, 5, 6, 1, 4, 11, 1, 4, 1, 3],
 [5,
  1,
  2,
  7,
  3,
  1,
  11,
  3,
  2,
  5,
  6,
  1,
  4,
  2,
  1,
  3,
  2,
  4,
  4,
  6,
  11,
  1,
  4,
  6,
  1,
  3],
 [3, 10, 2, 5, 1, 7, 7, 4, 8, 2, 1, 3],
 [8,
  2,
  7,
  8,
  2,
  1,
  8,
  2,
  5,
  6,
  1,
  4,
  5,
  1,
  3,
  9,
  7,
  4,
  8,
  3,
  9,
  8,
  2,
  4,
  8,
  7,
  7,
  9,
  2,
  7,
  3,
  4,
  8,
  2,
  3],
 [6, 1, 2, 4, 11, 1, 4, 6, 6, 1, 2, 2, 11, 1, 1, 4, 11, 1, 3],
 [5, 1, 4, 1, 2, 5, 6, 1, 8, 2, 11, 1, 2, 11, 4, 5, 6, 3],
 [6, 1, 2, 2, 4, 6, 1, 4, 1, 4, 5, 1, 4, 5, 6, 1, 8, 2, 5, 1, 3],
 [1,
  1,
  3,
  5,
  1,
  1,
  1,
  4,
  6,
  1,
  1,
  9,
  6,
  

**Recurrent Neural Networks** are capable of handling variable length sequences.<br><br>
Despite that, it's still best to pad or truncate sequences to a uniform length for one or both of these reasons:<br>
1. Performance. The longer a sequence, the higher the computation cost. One may want to truncate long sequences to a shorter length if that's feasible and doesn't cost too much accuracy.

2. When processing datasets in batches, each sequence *in a batch* usually has to be of uniform length.<br>

For simplicity, we'll make *every* sequence be as long as the longest sequence. In other words, we'll determine how long the longest sequence is, then pad out the rest of the sequences to be the same length.<br>

A more optimized solution would be to make each sequence as long as the longest sequence in each *batch* to avoid unnecessary processing.
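The difference between padding to the global maximum and padding per batch can be sketched in a few lines of plain Python. This is an illustration only: `pad_post` and `pad_per_batch` are hypothetical helpers, and unlike *pad_sequences* they never truncate.

```python
def pad_post(sequences, maxlen=None, value=0):
    """Right-pad every sequence with `value` up to `maxlen` (no truncation)."""
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)
    return [list(s) + [value] * (maxlen - len(s)) for s in sequences]

def pad_per_batch(sequences, batch_size):
    """Pad each batch only to the length of its own longest sequence."""
    return [pad_post(sequences[i:i + batch_size])
            for i in range(0, len(sequences), batch_size)]

toy_seqs = [[7, 8], [3], [5, 6, 9, 2, 4], [1, 1, 1]]

global_padded = pad_post(toy_seqs)
batch_padded = pad_per_batch(toy_seqs, batch_size=2)

print(global_padded)  # every row has length 5
print(batch_padded)   # first batch has length 2, second has length 5
```

With per-batch padding the first batch carries one padding value instead of seven, which is where the savings come from on a dataset whose longest sentence is much longer than the typical one.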

In [29]:
MAX_LENGTH = len(max(x_train_seqs, key=len))
print(f"Length of longest input sequence: {MAX_LENGTH}")

Length of longest input sequence: 161


We can pad the sequences with the *pad_sequences* method:<br>
https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences

In [30]:
x_train_padded = keras.preprocessing.sequence.pad_sequences(x_train_seqs, padding='post',
                                                            maxlen=MAX_LENGTH)
type(x_train_padded)

numpy.ndarray

In [31]:
print(x_train_padded[0])

[   27    86    21   479     7     2   920 10903 20547  3327  5644   337
     4     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0]


In [32]:
print(x_train_padded[1])

[   52    28 14029  1496    10  4802    12     8   516    38     3     9
  2358     5   242  4198     6   104 20548     3    43   834   379    75
    42     5     2  7843   678     5     2   141     4     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0]


We'll do the same with the training labels (PoS sequences)...

In [33]:
y_train_padded = keras.preprocessing.sequence.pad_sequences(y_train_seqs, padding='post',
                                                            maxlen=MAX_LENGTH)

In [34]:
y_train_padded[0]

array([5, 2, 2, 6, 4, 5, 6, 6, 6, 1, 1, 7, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0])

...and the validation dataset.

In [35]:
x_val_padded = keras.preprocessing.sequence.pad_sequences(x_val_seqs, padding='post', maxlen=MAX_LENGTH)
y_val_padded = keras.preprocessing.sequence.pad_sequences(y_val_seqs, padding='post', maxlen=MAX_LENGTH)
y_val_padded[0]

array([10,  2,  5,  3,  4,  1,  3,  2, 10,  2, 10,  5,  1,  4, 11,  1,  4,
        5,  1,  2, 10,  7,  4,  5, 11, 11,  1,  3,  9,  3,  4,  5,  1,  3,
        4,  5,  1,  4,  5,  7,  6,  1,  4,  6,  1,  3,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0])

In [41]:
len(y_train_padded)

54151

PoS tagging is a multiclass classification task done at each timestep, so we need to convert every tag for every sentence into a one-hot encoding.<br>
https://www.tensorflow.org/api_docs/python/tf/keras/utils/to_categorical<br>

In [37]:
y_train_categoricals = keras.utils.to_categorical(y_train_padded)
# The one hot encodings for the first label.
y_train_categoricals[0]

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.]])

The label (PoS tag sequence) for a single sentence is now a **sequence of one-hot encodings**. These will serve as our training targets.

In [38]:
# One-hot encoding for a single tag.
print(y_train_categoricals[0][0])

[0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
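For intuition, one-hot encoding a padded array of tag indices amounts to picking rows out of an identity matrix. The NumPy sketch below is an equivalent illustration rather than what *to_categorical* does internally, and the helper name `one_hot` is made up.

```python
import numpy as np

def one_hot(tag_ids, num_classes):
    """One-hot encode an integer array by indexing rows of an identity matrix."""
    return np.eye(num_classes, dtype="float32")[np.asarray(tag_ids)]

# Two toy tag sequences, already padded with 0.
toy_tags = np.array([[5, 2, 0],
                     [1, 0, 0]])
toy_one_hot = one_hot(toy_tags, num_classes=13)

print(toy_one_hot.shape)  # (2, 3, 13)
print(toy_one_hot[0, 0])  # 1.0 at index 5, zeros elsewhere
```

Note that the sketch takes the number of classes explicitly. *to_categorical* also accepts a `num_classes` argument; when it is omitted the width is inferred from the largest value present, so passing it explicitly is a safer habit if a split might be missing the highest-numbered tag.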


In [40]:
len(y_train_categoricals)

54151

We can determine the PoS tag from a one-hot encoding by seeing which index is set to 1, then using that to query the tag tokenizer's *index_word* dictionary. (Index 0 is the padding value and has no entry in that dictionary.)

In [42]:
idx = np.argmax(y_train_categoricals[0][0])
print(f"Index: {idx}")

print(f"Tag: {tag_tokenizer.index_word[idx]}")

Index: 5
Tag: det


We'll one-hot encode the validation labels as well.

In [43]:
y_val_categoricals = keras.utils.to_categorical(y_val_padded)
y_val_categoricals[0]

array([[0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.]])

At this point, we're ready to build our model. We'll train word embeddings concurrently with our model (though you can use pretrained word vectors as well). If you're unfamiliar with this idea, refer to the [Word Vectors](https://www.nlpdemystified.org/course/word-vectors) module.<br><br>
There are several new things here:<br>
1. The embedding layer has a *mask_zero* parameter. We added padding to make every sequence the same length, but we don't want the model to make PoS predictions on padding. Setting *mask_zero* to *True* tells the layers following the embedding layer to skip padded timesteps (index 0).<br>
https://www.tensorflow.org/guide/keras/masking_and_padding<br>
https://stackoverflow.com/questions/47485216/how-does-mask-zero-in-keras-embedding-layer-work<br><br>
2. We're using a **bidirectional LSTM**. The *Bidirectional* layer is a wrapper to which we pass an *LSTM* layer. The first parameter to the *LSTM* layer is the number of units in the cell. The second parameter, *return_sequences*, controls whether the RNN returns an output for each timestep or only the last output. Since we're doing PoS-tagging, we want an output for each timestep and so *return_sequences* is set to *True*.<br>
https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM<br>
https://www.tensorflow.org/api_docs/python/tf/keras/layers/Bidirectional
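Conceptually, the mask produced by *mask_zero* is just a boolean array marking which positions hold real tokens. The NumPy sketch below illustrates the idea on a toy batch; it is not how Keras propagates masks internally.

```python
import numpy as np

# A toy batch of two padded sentences (0 is the padding index).
toy_batch = np.array([[27, 86, 4, 0, 0],
                      [52, 28, 10, 12, 4]])

# True where there is a real token, False where there is padding.
mask = toy_batch != 0

# Number of real tokens in each sentence.
lengths = mask.sum(axis=1)

print(mask)
print(lengths)  # [3 5]
```

Downstream layers that honour the mask can skip the `False` positions, so the padded timesteps of the first sentence don't influence its LSTM state.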


In [44]:
# For the embedding layer. "+ 1" to account for the padding token.
num_tokens = len(sentence_tokenizer.word_index) + 1
print(num_tokens)

embedding_dim = 128

# For the output layer. One class per possible tag, "+ 1" because index 0 is reserved for padding.
num_classes = len(tag_tokenizer.word_index) + 1
num_classes

52042


13

In [45]:
# The set_seed call and kernel_initializer parameters are used here to ensure you and 
# I get the same results. To get random weight initializations, remove them.
tf.random.set_seed(0)

model = keras.Sequential()

model.add(layers.Embedding(input_dim=num_tokens,
                           output_dim=embedding_dim,
                           input_length=MAX_LENGTH,
                           mask_zero=True))

model.add(layers.Bidirectional(layers.LSTM(128, return_sequences=True,
                                           kernel_initializer=tf.keras.initializers.random_normal(seed=1))))

model.add(layers.Dense(num_classes, activation='softmax',
                       kernel_initializer=tf.keras.initializers.random_normal(seed=1)))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])



A few notes about the model summary:<br>

The embedding layer **output** has three dimensions:
- Batch size (it's showing as "None" because we didn't specify it upfront. We'll do it when we call *model.fit*).
- Sequence length (the sequences are all the same length now after our padding step).
- Embedding dimension.
<br><br>

The LSTM outputs a vector *twice* the size of what we specified because it's bidirectional. Recall from the slides that the outputs from the two LSTMs will be concatenated before going to the output layer.
<br><br>

The final layer's **output** also has three dimensions:
- Batch size
- Sequence length
- Output dimension (the number of possible tags, plus one for the padding index).

The output will be a **sequence of probability distributions** for each input sequence: one probability distribution per timestep (token), each spanning the possible tags.
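The shapes described in these notes can be traced with NumPy alone. In the sketch below, random arrays stand in for the outputs of the forward and backward LSTMs, the dense layer is reduced to a single weight matrix with no bias, and the sizes are deliberately tiny. All of those are simplifying assumptions for illustration.

```python
import numpy as np

batch, timesteps, units, num_tags = 2, 5, 4, 13
rng = np.random.default_rng(0)

# Stand-ins for the per-timestep outputs of the forward and backward LSTMs.
forward_out = rng.normal(size=(batch, timesteps, units))
backward_out = rng.normal(size=(batch, timesteps, units))

# Concatenating along the last axis doubles the feature size.
bi_out = np.concatenate([forward_out, backward_out], axis=-1)

# A dense layer applies the same weights at every timestep.
weights = rng.normal(size=(2 * units, num_tags))
logits = bi_out @ weights

# Softmax over the last axis: one distribution per timestep.
exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs = exp / exp.sum(axis=-1, keepdims=True)

print(bi_out.shape)  # (2, 5, 8)
print(probs.shape)   # (2, 5, 13)
```

Each of the 2 x 5 timesteps ends up with its own 13-way distribution that sums to 1, which is what gets compared against the one-hot target at that position.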



In [46]:
model.summary()

As we did in previous demos, we'll use *early stopping* with some *patience* to halt training once validation loss stops improving.<br>

The model will compare its output (sequences of softmax-generated probability distributions) against the one-hot encoded targets.
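The patience logic can be sketched in plain Python. This is a simplified model of early stopping (improvement means a strictly lower validation loss, with no minimum delta), not the Keras callback itself, and `early_stopping_epoch` is a hypothetical helper. The first six losses are copied from the training log that follows; the seventh value is made up, because the log is cut off at that point.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Validation losses for epochs 1-6, copied from the training log.
logged = [0.1450, 0.1021, 0.0924, 0.0892, 0.0896, 0.0932]
stop_after_six = early_stopping_epoch(logged, patience=3)

# Hypothetical epoch 7 that fails to beat 0.0892 (this value is invented).
stop_with_seventh = early_stopping_epoch(logged + [0.0950], patience=3)

print(stop_after_six)     # None
print(stop_with_seventh)  # 7
```

The best loss occurs at epoch 4, so with a patience of 3 training continues through epochs 5 and 6 and would halt at epoch 7 only if that epoch also failed to improve.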



In [47]:
es_callback = keras.callbacks.EarlyStopping(monitor='val_loss', patience=3)

history = model.fit(x_train_padded, y_train_categoricals, epochs=20,
                    batch_size=256, validation_data=(x_val_padded, y_val_categoricals),
                    callbacks=[es_callback])

history

Epoch 1/20
212/212 ━━━━━━━━━━━━━━━━━━━━ 230s 1s/step - accuracy: 0.1011 - loss: 1.3656 - val_accuracy: 0.1256 - val_loss: 0.1450
Epoch 2/20
212/212 ━━━━━━━━━━━━━━━━━━━━ 224s 1s/step - accuracy: 0.1254 - loss: 0.1238 - val_accuracy: 0.1271 - val_loss: 0.1021
Epoch 3/20
212/212 ━━━━━━━━━━━━━━━━━━━━ 211s 994ms/step - accuracy: 0.1272 - loss: 0.0776 - val_accuracy: 0.1276 - val_loss: 0.0924
Epoch 4/20
212/212 ━━━━━━━━━━━━━━━━━━━━ 245s 1s/step - accuracy: 0.1279 - loss: 0.0608 - val_accuracy: 0.1277 - val_loss: 0.0892
Epoch 5/20
212/212 ━━━━━━━━━━━━━━━━━━━━ 244s 1s/step - accuracy: 0.1284 - loss: 0.0500 - val_accuracy: 0.1278 - val_loss: 0.0896
Epoch 6/20
212/212 ━━━━━━━━━━━━━━━━━━━━ 279s 1s/step - accuracy: 0.1288 - loss: 0.0418 - val_accuracy: 0.1277 - val_loss: 0.0932
Epoch 7/20
212/

<keras.src.callbacks.history.History at 0x1326feccf10>

Once our model is trained, we'll vectorize and pad the testing dataset. In the case of the labels, we'll also one-hot encode them.

In [48]:
# Preprocess the test data and test the model.
x_test_seqs = sentence_tokenizer.texts_to_sequences(x_test)
x_test_padded = keras.preprocessing.sequence.pad_sequences(x_test_seqs, padding='post', maxlen=MAX_LENGTH)

y_test_seqs = tag_tokenizer.texts_to_sequences(y_test)
y_test_padded = keras.preprocessing.sequence.pad_sequences(y_test_seqs, padding='post', maxlen=MAX_LENGTH)
y_test_categoricals = keras.utils.to_categorical(y_test_padded)

In [49]:
model.evaluate(x_test_padded, y_test_categoricals)

226/226 ━━━━━━━━━━━━━━━━━━━━ 7s 31ms/step - accuracy: 0.1291 - loss: 0.0974


[0.09869097173213959, 0.12809516489505768]
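The accuracy of about 0.13 reported above looks alarming, but it is probably an artifact of padding rather than a sign of a poor tagger. The training sentences average roughly 21 tokens (1,137,877 tags across 54,151 sentences, from the tag tokenizer's word counts), while every sequence is padded to 161 timesteps, so only about 13% of positions hold real tokens. A figure near 0.128 is what one would expect if nearly every real token were tagged correctly while padded positions still counted in the denominator. That reading is an assumption about how the metric is averaged in this Keras version, not something the notebook confirms. The sketch below, with a hypothetical helper `masked_accuracy`, shows how a per-token accuracy that ignores padding could be computed, and how the diluted figure arises.

```python
import numpy as np

def masked_accuracy(y_true_ids, y_pred_ids, pad_id=0):
    """Fraction of non-padding positions where the predicted tag is correct."""
    y_true_ids = np.asarray(y_true_ids)
    y_pred_ids = np.asarray(y_pred_ids)
    mask = y_true_ids != pad_id
    return float((y_true_ids[mask] == y_pred_ids[mask]).mean())

# One toy sentence with three real tags, padded to eight timesteps.
toy_true = np.array([[1, 2, 3, 0, 0, 0, 0, 0]])
toy_pred = np.array([[1, 2, 5, 0, 0, 0, 0, 0]])  # one of three real tags is wrong

toy_masked = masked_accuracy(toy_true, toy_pred)

# If padded positions never count as correct but still count in the
# denominator, the same predictions score far lower.
toy_diluted = float(((toy_true == toy_pred) & (toy_true != 0)).sum() / toy_true.size)

print(toy_masked)   # two of three real tokens correct
print(toy_diluted)  # 0.25
```

With the trained model, something along the lines of `masked_accuracy(y_test_padded, model.predict(x_test_padded).argmax(axis=-1))` should give the per-token figure.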

We can now use our model to tag sentences.

In [50]:
samples = [
    "Brown refused to testify.",
    "Brown sofas are on sale.",
]

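The function below takes a list of strings, tokenizes and pads them, then has the model tag them. Because these inputs are plain strings rather than token lists, the tokenizer's default filters strip punctuation before tagging. Note that if a sentence is longer than MAX_LENGTH, it'll be truncated.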

In [51]:
def tag_sentences(sentences):
  sentences_seqs = sentence_tokenizer.texts_to_sequences(sentences)
  sentences_padded = keras.preprocessing.sequence.pad_sequences(sentences_seqs,
                                                                maxlen=MAX_LENGTH,
                                                                padding='post')

  # The model returns a LIST of PROBABILITY DISTRIBUTIONS (due to the softmax)
  # for EACH sentence. There is one probability distribution for each token position.
  tag_preds = model.predict(sentences_padded)

  sentence_tags = []

  # For EACH LIST of probability distributions...
  for i, preds in enumerate(tag_preds):

    # Extract the most probable tag from EACH probability distribution.
    # Note how we're extracting tags for only the non-padding tokens.
    tags_seq = [np.argmax(p) for p in preds[:len(sentences_seqs[i])]]

    # Convert the sentence and tag sequences back to their token counterparts.
    words = [sentence_tokenizer.index_word[w] for w in sentences_seqs[i]]
    tags = [tag_tokenizer.index_word[t] for t in tags_seq]
    sentence_tags.append(list(zip(words, tags)))

  return sentence_tags
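One caveat about truncation: *pad_sequences* has a `truncating` parameter that defaults to `'pre'`, meaning over-long sequences lose tokens from the front, not the end. The plain-Python sketch below mimics that behaviour for a single sequence; `fit_to_length` is a hypothetical helper, not part of Keras.

```python
def fit_to_length(seq, maxlen, truncating="pre", value=0):
    """Truncate or right-pad a sequence to exactly `maxlen` items."""
    if len(seq) > maxlen:
        # "pre" drops items from the front, "post" drops them from the end.
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    return list(seq) + [value] * (maxlen - len(seq))

long_seq = [11, 12, 13, 14, 15]

kept_pre = fit_to_length(long_seq, maxlen=3)
kept_post = fit_to_length(long_seq, maxlen=3, truncating="post")

print(kept_pre)   # [13, 14, 15]
print(kept_post)  # [11, 12, 13]
```

Since the function above pairs predictions with the untruncated `sentences_seqs`, a sentence longer than MAX_LENGTH would likely end up with words and tags misaligned. For inputs of that length, passing `truncating='post'` or splitting the sentence beforehand would be worth considering.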


In [52]:
tagged_sample_sentences = tag_sentences(samples)

1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 616ms/step


In [53]:
print(tagged_sample_sentences[0])

[('brown', 'noun'), ('refused', 'verb'), ('to', 'prt'), ('testify', 'verb')]


In [54]:
print(tagged_sample_sentences[1])

[('brown', 'adj'), ('sofas', 'noun'), ('are', 'verb'), ('on', 'adp'), ('sale', 'noun')]


So that's one way to build a PoS tagger. Industrial-strength taggers use a lot more data and, these days, are powered by more sophisticated models (*transformers*).

# Further Exploration
1. Check out *Sunspring*, a sci-fi short written by an LSTM. The director and actors played the script straight and the result is hilarious.<br>
https://www.youtube.com/watch?v=LY7x2Ihqjmc<br>
https://en.wikipedia.org/wiki/Sunspring<br><br>