# Natural Language Processing | Recurrent Neural Networks

In [1]:
import nltk
import numpy as np
import requests
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.callbacks import ModelCheckpoint
from nltk.corpus import treebank, brown, conll2000
from sklearn.model_selection import train_test_split

# Part-of-Speech Tagging with a Bidirectional LSTM

In [2]:
nltk.download('treebank')
nltk.download('brown')
nltk.download('conll2000')

[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.


True

In their original form, the datasets use different part-of-speech (PoS) tag sets. We need to ensure they all use the same tagset, so we'll download a simplified set called the *universal_tagset* from NLTK.<br>

See Section 2.3 here for a list of tags: https://www.nltk.org/book/ch05.html

In [3]:
nltk.download('universal_tagset')

[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.


True

Next, we'll retrieve the tagged sentences from each dataset, specifying that they should use the *universal tagset* we just downloaded, and combine them into one collection.

In [4]:
# Download all PoS-tagged sentences and place them in one list.
tagged_sentences = treebank.tagged_sents(tagset='universal') +\
                   brown.tagged_sents(tagset='universal') +\
                   conll2000.tagged_sents(tagset='universal')

print(tagged_sentences[0])
print(f"Dataset size: {len(tagged_sentences)}")

[('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ('61', 'NUM'), ('years', 'NOUN'), ('old', 'ADJ'), (',', '.'), ('will', 'VERB'), ('join', 'VERB'), ('the', 'DET'), ('board', 'NOUN'), ('as', 'ADP'), ('a', 'DET'), ('nonexecutive', 'ADJ'), ('director', 'NOUN'), ('Nov.', 'NOUN'), ('29', 'NUM'), ('.', '.')]
Dataset size: 72202


Each tagged sentence is actually a list of word-tag tuples (bear in mind that NLTK's universal tagset is a reduced tagset so items such as *proper nouns* are simply tagged as *nouns*).<br>

Our model is going to take in a sequence of words, and output a sequence of PoS tags, so we need to separate the words from the tags in our dataset. The tag sequences will serve as our training labels.

In [5]:
sentences, sentence_tags = [], []

for s in tagged_sentences:
  sentence, tags = zip(*s)
  sentences.append(list(sentence))
  sentence_tags.append(list(tags))

The sentences and their respective tags are now in separate lists.

In [8]:
print(sentences[0], len(sentences[0]))
print(sentence_tags[0], len(sentence_tags[0]))

['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.'] 18
['NOUN', 'NOUN', '.', 'NUM', 'NOUN', 'ADJ', '.', 'VERB', 'VERB', 'DET', 'NOUN', 'ADP', 'DET', 'ADJ', 'NOUN', 'NOUN', 'NUM', '.'] 18


In [9]:
print(len(sentences), len(sentence_tags))

72202 72202

Next, we'll split the data into training (75%), validation (15%), and test (10%) sets using two successive calls to *train_test_split*.


In [10]:
train_ratio = 0.75
validation_ratio = 0.15
test_ratio = 0.10

x_train, x_test, y_train, y_test = train_test_split(sentences, sentence_tags, 
                                                    test_size=1 - train_ratio, 
                                                    random_state=1)

x_val, x_test, y_val, y_test = train_test_split(x_test, y_test, 
                                                test_size=test_ratio/(test_ratio + validation_ratio), 
                                                random_state=1)

In [11]:
print(len(x_train), len(y_train))
print(len(x_val), len(y_val))
print(len(x_test), len(y_test))

54151 54151
10830 10830
7221 7221
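
The `test_size` in the second split can look odd at first glance. The sketch below walks through the arithmetic. It assumes scikit-learn rounds the held-out portion up to a whole number of samples, an assumption that is consistent with the sizes printed above.

```python
import math

n_samples = 72202
train_ratio, validation_ratio, test_ratio = 0.75, 0.15, 0.10

# First split: hold out everything that isn't training data (25%).
held_out = math.ceil(n_samples * (1 - train_ratio))
n_train = n_samples - held_out

# Second split: the test set is 10% of the full dataset, which is
# 0.10 / (0.10 + 0.15) = 40% of the held-out portion.
test_fraction = test_ratio / (test_ratio + validation_ratio)
n_test = math.ceil(held_out * test_fraction)
n_val = held_out - n_test

print(n_train, n_val, n_test)
```

Dividing by `test_ratio + validation_ratio` is what converts "10% of everything" into the right fraction of the smaller held-out set.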


Now that we have our datasets preprocessed, the next step is to vectorize. If you watched the demo on **Word Vectors**, the next few steps should look familiar.<br>
https://www.nlpdemystified.org/course/word-vectors<br>
https://colab.research.google.com/github/nitinpunjabi/nlp-demystified/blob/main/notebooks/nlpdemystified_word_vectors.ipynb

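First, we need to create a tokenizer for the sentences and *fit* it to the training dataset to create a vocabulary. We'll use the default tokenizer settings, which apply some light filtering, lowercase the text, and split on spaces. Because our training sentences are already lists of tokens, only the lowercasing takes effect here (note how punctuation such as "." survives below); the filtering and splitting come into play when the tokenizer is given raw strings, as it will be at inference time. We'll also supply an out-of-vocabulary token (\<OOV\>) in case the tokenizer encounters words during testing/inference which it didn't see during training.<br>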
https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer

In [12]:
sentence_tokenizer = keras.preprocessing.text.Tokenizer(oov_token='<OOV>')

In [13]:
sentence_tokenizer.fit_on_texts(x_train)

In [14]:
print(f"Vocabulary size: {len(sentence_tokenizer.word_index)}")

Vocabulary size: 52041


We also need to create *another* tokenizer for the tags since our labels are also sequences. This time, we won't need an OOV token because there are only a handful of tags and, in this case, they'll all be encountered during training.

In [15]:
tag_tokenizer = keras.preprocessing.text.Tokenizer()
tag_tokenizer.fit_on_texts(y_train)

In [16]:
print(f"Number of PoS tags: {len(tag_tokenizer.word_index)}\n")
tag_tokenizer.get_config()

Number of PoS tags: 12



{'num_words': None,
 'filters': '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
 'lower': True,
 'split': ' ',
 'char_level': False,
 'oov_token': None,
 'document_count': 54151,
 'word_counts': '{"det": 126968, "verb": 174593, "adj": 80523, "adp": 136453, "noun": 286676, "adv": 51205, ".": 142935, "pron": 44684, "conj": 35060, "num": 21461, "prt": 31229, "x": 6090}',
 'word_docs': '{"adj": 36344, "det": 44747, "adv": 29531, "adp": 43855, ".": 53332, "verb": 50837, "noun": 51171, "pron": 26965, "conj": 24383, "num": 11964, "prt": 21777, "x": 2682}',
 'index_docs': '{"6": 36344, "5": 44747, "7": 29531, "4": 43855, "3": 53332, "2": 50837, "1": 51171, "8": 26965, "9": 24383, "11": 11964, "10": 21777, "12": 2682}',
 'index_word': '{"1": "noun", "2": "verb", "3": ".", "4": "adp", "5": "det", "6": "adj", "7": "adv", "8": "pron", "9": "conj", "10": "prt", "11": "num", "12": "x"}',
 'word_index': '{"noun": 1, "verb": 2, ".": 3, "adp": 4, "det": 5, "adj": 6, "adv": 7, "pron": 8, "conj": 9, "prt": 10, "

In [17]:
# The set of universal PoS tags.
tag_tokenizer.word_index

{'noun': 1,
 'verb': 2,
 '.': 3,
 'adp': 4,
 'det': 5,
 'adj': 6,
 'adv': 7,
 'pron': 8,
 'conj': 9,
 'prt': 10,
 'num': 11,
 'x': 12}

Next, we need to vectorize our sentences and corresponding tags. As we did in the demo on [Word Vectors](https://github.com/nitinpunjabi/nlp-demystified/blob/main/notebooks/nlpdemystified_word_vectors.ipynb), we'll use the tokenizer's *texts_to_sequences* method to convert each sentence to a sequence of integers where each integer maps to a particular token.<br>
https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#texts_to_sequences

In [18]:
x_train_seqs = sentence_tokenizer.texts_to_sequences(x_train)

In [19]:
print(x_train_seqs[0])

[27, 86, 21, 479, 7, 2, 920, 10903, 20547, 3327, 5644, 337, 4]


We can use the *sequences_to_texts* method to convert a vectorized sentence back to its preprocessed form.<br>
https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#sequences_to_texts

In [20]:
print(f"Original: {x_train[0]}")
print(f"Reconstructed: {sentence_tokenizer.sequences_to_texts([x_train_seqs[0]])}")

Original: ['This', 'may', 'be', 'due', 'to', 'the', 'heavy', 'interlobular', 'connective', 'tissue', 'barriers', 'present', '.']
Reconstructed: ['this may be due to the heavy interlobular connective tissue barriers present .']
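
To make the vocabulary and the OOV token concrete, here is a minimal pure-Python sketch of the idea. The helpers `fit_vocab` and `encode` are hypothetical names for illustration, not the Keras implementation; the sketch assumes index 0 is left free for padding, index 1 goes to the OOV token, and the remaining words are numbered by descending frequency.

```python
from collections import Counter

def fit_vocab(token_lists, oov_token="<OOV>"):
    """Build a word -> index map: 0 is left for padding, 1 is the OOV token,
    and the remaining words are numbered from most to least frequent."""
    counts = Counter(tok.lower() for toks in token_lists for tok in toks)
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def encode(tokens, word_index, oov_token="<OOV>"):
    """Map each token to its index, falling back to the OOV index."""
    oov_index = word_index[oov_token]
    return [word_index.get(tok.lower(), oov_index) for tok in tokens]

toy_train = [["The", "cat", "sat", "."], ["The", "dog", "sat", "."]]
toy_index = fit_vocab(toy_train)
print(toy_index)

# "bird" never appeared in the toy training data, so it maps to index 1.
print(encode(["The", "bird", "sat", "."], toy_index))
```

An unseen word still gets a valid integer, so the model can process the sentence instead of failing or silently dropping the token.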


Next, we'll vectorize the labels (i.e. sequences of PoS tags) using the tag tokenizer.

In [21]:
y_train_seqs = tag_tokenizer.texts_to_sequences(y_train)

In [22]:
tag_tokenizer.sequences_to_texts([y_train_seqs[0]])

['det verb verb adj adp det adj adj adj noun noun adv .']

Finally, we'll do the same with the validation inputs and labels.

In [23]:
x_val_seqs = sentence_tokenizer.texts_to_sequences(x_val)
y_val_seqs = tag_tokenizer.texts_to_sequences(y_val)

In [24]:
MAX_LENGTH = len(max(x_train_seqs, key=len))
print(f"Length of longest input sequence: {MAX_LENGTH}")

Length of longest input sequence: 161


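The sequences vary in length, but the model expects every input to be the same length. We'll pad each sequence with zeros up to the length of the longest training sequence using the *pad_sequences* method:<br>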
https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences

In [25]:
x_train_padded = keras.preprocessing.sequence.pad_sequences(x_train_seqs, padding='post', 
                                                            maxlen=MAX_LENGTH)

In [26]:
print(x_train_padded[0])

[   27    86    21   479     7     2   920 10903 20547  3327  5644   337
     4     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0]
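
As a rough illustration of what post-padding does, here is a small pure-Python sketch. `pad_post` is a hypothetical helper, not the Keras function; it assumes over-long sequences keep their last `maxlen` items, which mirrors the default truncation behaviour of *pad_sequences*.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value` up to `maxlen` (a positive int).
    Sequences longer than `maxlen` keep only their last `maxlen` items."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]  # no-op when the sequence is short enough
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded

toy_seqs = [[27, 86, 21], [4], [1, 2, 3, 4, 5, 6]]
for row in pad_post(toy_seqs, maxlen=5):
    print(row)
```

Every row comes out with the same length, which is what lets the sentences be stacked into a single array.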


We'll do the same with the training labels (PoS sequences)...

In [27]:
y_train_padded = keras.preprocessing.sequence.pad_sequences(y_train_seqs, padding='post', 
                                                            maxlen=MAX_LENGTH)

...and the validation dataset.

In [28]:
x_val_padded = keras.preprocessing.sequence.pad_sequences(x_val_seqs, padding='post', maxlen=MAX_LENGTH)
y_val_padded = keras.preprocessing.sequence.pad_sequences(y_val_seqs, padding='post', maxlen=MAX_LENGTH)

PoS tagging is a multiclass classification task done at each timestep, so we need to convert every tag for every sentence into a one-hot encoding (we'll look at an alternative approach when we build a language model later in this demo).<br>
https://www.tensorflow.org/api_docs/python/tf/keras/utils/to_categorical<br>

In [29]:
y_train_categoricals = keras.utils.to_categorical(y_train_padded)

The label (PoS tag sequence) for a single sentence is now a **sequence of one-hot encodings**. These will serve as our training targets.

In [30]:
# The one hot encodings for the first label.
print(y_train_categoricals[0])

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 1. ... 0. 0. 0.]
 [0. 0. 1. ... 0. 0. 0.]
 ...
 [1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]]


In [31]:
# One-hot encoding for a single tag.
print(y_train_categoricals[0][0])

[0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]


We can determine the PoS tag from a one-hot encoding by seeing which index is set to 1, then using that to query the tag tokenizer's *index_word* dictionary.

In [32]:
idx = np.argmax(y_train_categoricals[0][0])
print(f"Index: {idx}")

print(f"Tag: {tag_tokenizer.index_word[idx]}")

Index: 5
Tag: det
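
The encoding and decoding steps above can be sketched without Keras. The helpers below are hypothetical stand-ins for illustration, and the toy tag sequence is made up (using the tag indices from the tag tokenizer shown earlier).

```python
def one_hot(index, num_classes):
    """Return a vector of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec

def to_one_hot_sequence(tag_ids, num_classes):
    """One-hot encode every tag index in a (padded) tag sequence."""
    return [one_hot(i, num_classes) for i in tag_ids]

# A padded tag sequence: det (5), noun (1), verb (2), then two padding zeros.
toy_tags = [5, 1, 2, 0, 0]
encoded = to_one_hot_sequence(toy_tags, num_classes=13)

# Recover the indices by finding the position of the 1 in each row.
decoded = [row.index(1.0) for row in encoded]
print(encoded[0])
print(decoded)
```

Note that padding positions (index 0) are encoded too, with their 1 in column 0. That is why the last rows of the one-hot output above begin with a 1.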


We'll one-hot encode the validation labels as well.

In [33]:
y_val_categoricals = keras.utils.to_categorical(y_val_padded)

In [35]:
# For the embedding layer. "+ 1" to account for the padding token.
num_tokens = len(sentence_tokenizer.word_index) + 1
embedding_dim = 128

# For the output layer. The number of classes is the number of
# possible tags, "+ 1" to account for the padding index (0).
num_classes = len(tag_tokenizer.word_index) + 1

In [42]:
# The set_seed call and kernel_initializer parameters are used here to
# ensure you and I get the same results. To get random weight initializations,
# remove them.
tf.random.set_seed(0)

model = keras.Sequential()

model.add(layers.Embedding(input_dim=num_tokens, 
                           output_dim=embedding_dim, 
                           input_length=MAX_LENGTH,
                           mask_zero=True))

model.add(layers.Bidirectional(layers.LSTM(128, return_sequences=True, 
                                           kernel_initializer=tf.keras.initializers.random_normal(seed=1))))

model.add(layers.Dense(num_classes, activation='softmax', 
                       kernel_initializer=tf.keras.initializers.random_normal(seed=1)))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])


A few notes about the model summary:<br>

The embedding layer **output** has three dimensions:
- Batch size (it's showing as "None" because we didn't specify it upfront. We'll do it when we call *model.fit*).
- Sequence length (the sequences are all the same length now after our padding step).
- Embedding dimension.
<br><br>

The LSTM outputs a vector *twice* the size of what we specified because it's bidirectional. Recall from the slides that the outputs from the two LSTMs will be concatenated before going to the output layer.
<br><br>

The final layer's **output** also has three dimensions:
- Batch size
- Sequence length
- Output dimension (the number of possible tags).

The output will be a **sequence of probability distributions** for each input sequence: one probability distribution per token (timestep), each spanning the possible tags.



In [43]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 161, 128)          6661376   
                                                                 
 bidirectional_1 (Bidirectio  (None, 161, 256)         263168    
 nal)                                                            
                                                                 
 dense_2 (Dense)             (None, 161, 13)           3341      
                                                                 
Total params: 6,927,885
Trainable params: 6,927,885
Non-trainable params: 0
_________________________________________________________________
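
The parameter counts in the summary above can be reproduced by hand. This is a sketch of the standard counting formulas for these layer types, using the vocabulary size and tag count printed earlier.

```python
vocab_size = 52041 + 1   # vocabulary plus the padding index
embedding_dim = 128
lstm_units = 128
num_classes = 12 + 1     # tags plus the padding index

# One embedding vector per token index.
embedding_params = vocab_size * embedding_dim

# An LSTM has four weight sets (three gates plus the candidate cell state),
# each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# The bidirectional wrapper holds two independent LSTMs.
bidirectional_params = 2 * lstm_params

# The dense layer sees the concatenated forward and backward outputs.
dense_params = (2 * lstm_units) * num_classes + num_classes

print(embedding_params, bidirectional_params, dense_params)
print(embedding_params + bidirectional_params + dense_params)
```

Almost all of the model's parameters sit in the embedding layer, which is typical when the vocabulary is large.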


As we did in previous demos, we'll use *early stopping* with some *patience* to halt training once validation loss stops improving.<br>

The model will compare its output (sequences of softmax-generated probability distributions) against the one-hot encoded targets.



In [44]:
es_callback = keras.callbacks.EarlyStopping(monitor='val_loss', patience=3)

history = model.fit(x_train_padded, y_train_categoricals, epochs=20, 
                    batch_size=256, validation_data=(x_val_padded, y_val_categoricals), 
                    callbacks=[es_callback])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
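
For intuition, the patience logic behind early stopping can be sketched in a few lines. This is a simplified illustration, not the Keras callback itself: `epochs_run` is a hypothetical helper, and the validation losses are made-up numbers.

```python
def epochs_run(val_losses, patience):
    """Return how many epochs run before training halts, given the
    validation loss observed after each epoch."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)

# Made-up losses: the best value arrives at epoch 4, followed by three
# epochs without improvement, so training halts after epoch 7.
toy_losses = [0.30, 0.20, 0.15, 0.14, 0.16, 0.15, 0.17, 0.10]
print(epochs_run(toy_losses, patience=3))
```

In this toy run, the lower loss at epoch 8 is never reached. That is the trade-off with patience: a larger value tolerates longer plateaus at the cost of extra training time.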


Once our model is trained, we'll vectorize and pad the testing dataset. In the case of the labels, we'll also one-hot encode them.

In [45]:
# Preprocess the test data and test the model.
x_test_seqs = sentence_tokenizer.texts_to_sequences(x_test)
x_test_padded = keras.preprocessing.sequence.pad_sequences(x_test_seqs, padding='post', maxlen=MAX_LENGTH)

y_test_seqs = tag_tokenizer.texts_to_sequences(y_test)
y_test_padded = keras.preprocessing.sequence.pad_sequences(y_test_seqs, padding='post', maxlen=MAX_LENGTH)
y_test_categoricals = keras.utils.to_categorical(y_test_padded)

In [46]:
model.evaluate(x_test_padded, y_test_categoricals)



[0.10138235986232758, 0.9698208570480347]

We can now use our model to tag sentences.

In [47]:
samples = [
    "Brown refused to testify.",
    "Brown sofas are on sale.",
]

The function below takes a list of strings, vectorizes and pads them, then has the model tag them. Note that if a sentence is longer than MAX_LENGTH, it'll be truncated (by default, *pad_sequences* drops tokens from the start of the sequence).

In [53]:
def tag_sentences(sentences):
  sentences_seqs = sentence_tokenizer.texts_to_sequences(sentences)
  sentences_padded = keras.preprocessing.sequence.pad_sequences(sentences_seqs, 
                                                                maxlen=MAX_LENGTH, 
                                                                padding='post')

  # The model returns a LIST of PROBABILITY DISTRIBUTIONS (due to the softmax)
  # for EACH sentence. There is one probability distribution per token
  # position, each spanning the possible PoS tags.
  tag_preds = model.predict(sentences_padded)
  sentence_tags = []

  # For EACH LIST of probability distributions...
  for i, preds in enumerate(tag_preds):

    # Extract the most probable tag from EACH probability distribution.
    # Note how we're extracting tags for only the non-padding tokens.
    tags_seq = [np.argmax(p) for p in preds[:len(sentences_seqs[i])]]

    # Convert the sentence and tag sequences back to their token counterparts.
    words = [sentence_tokenizer.index_word[w] for w in sentences_seqs[i]]
    tags = [tag_tokenizer.index_word[t] for t in tags_seq]
    sentence_tags.append(list(zip(words, tags)))

  return sentence_tags


In [54]:
tagged_sample_sentences = tag_sentences(samples)



In [55]:
print(tagged_sample_sentences[0])

[('brown', 'noun'), ('refused', 'verb'), ('to', 'prt'), ('testify', 'verb')]


In [56]:
print(tagged_sample_sentences[1])

[('brown', 'adj'), ('sofas', 'noun'), ('are', 'verb'), ('on', 'adp'), ('sale', 'noun')]
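
The decoding step inside *tag_sentences* can be seen in isolation with a toy example. The probabilities and the shortened tag dictionary below are made up for illustration, and `decode_tags` is a hypothetical helper.

```python
toy_index_word = {1: "noun", 2: "verb", 3: ".", 4: "adp", 5: "det"}

def decode_tags(prob_rows, num_tokens, index_word):
    """Pick the most probable tag for each real (non-padding) token."""
    tags = []
    for row in prob_rows[:num_tokens]:
        best = max(range(len(row)), key=row.__getitem__)  # argmax
        tags.append(index_word[best])
    return tags

# Made-up probabilities over indices 0..5 for a 2-token sentence
# padded to length 4.
toy_preds = [
    [0.01, 0.10, 0.04, 0.02, 0.03, 0.80],  # most likely index 5 -> det
    [0.01, 0.85, 0.05, 0.03, 0.03, 0.03],  # most likely index 1 -> noun
    [0.20, 0.20, 0.20, 0.20, 0.10, 0.10],  # padding position, ignored
    [0.20, 0.20, 0.20, 0.20, 0.10, 0.10],  # padding position, ignored
]
print(decode_tags(toy_preds, num_tokens=2, index_word=toy_index_word))
```

Slicing to the original token count matters because the model still emits a distribution at every padded position, and those rows carry no meaningful tag.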


So that's one way to build a PoS tagger. Industrial-strength taggers use a lot more data and, these days, are powered by more sophisticated models, which we'll learn about when we cover *transformers*.