Category 4 of the TF certification exam is NLP text classification with real-world text datasets. This notebook covers how to prepare text data for training and which layers are available for handling it. It also covers how text data can be used to predict the following word, which allows us to generate our own text.

In [0]:
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf

import tensorflow_datasets as tfds
import numpy as np
import io

from tensorflow import keras as k

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

import matplotlib.pyplot as plt


The IMDB movie review dataset is used for these notes.




In [2]:
imdb, info = tfds.load('imdb_reviews', with_info=True,
                       as_supervised=True)

Downloading and preparing dataset imdb_reviews/plain_text/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...



Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


In [0]:
train_data, test_data = imdb['train'], imdb['test']

In [0]:
# Store training data
x_train = []
y_train = []

x_test = []
y_test = []

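# Add train data
# Note: str() on a bytes object keeps the b'...' wrapper (visible in the
# decoded output further down); sent.numpy().decode('utf-8') gives clean text.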
for sent, l in train_data:
  x_train.append(str(sent.numpy()))
  y_train.append(l.numpy())
  
# Add test data
for sent, l in test_data:
  x_test.append(str(sent.numpy()))
  y_test.append(l.numpy())

# labels must be cast to numpy arrays
y_train = np.array(y_train)
y_test = np.array(y_test)

Since neural networks operate on numbers, we need a numerical representation of the words in our text. This is handled by the `tensorflow.keras.preprocessing.text.Tokenizer` class, which maps each word to an integer index and stores the mapping in its `word_index` attribute. By default all punctuation is stripped and the text is lowercased; this can be changed by passing the `filters` and `lower` arguments to the [tokenizer](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer).

`vocab_size` - number of unique words to store in the vocab.

`embed_dim` - size of embedding dimension

`max_length` - max allowed length of sequences

`trunc` - whether to truncate at the beginning or end if the sequence exceeds `max_length`

`pad` - pre or post padding of sequences smaller than `max_length`

`oov` - token to use for out of vocabulary words

In [0]:
vocab_size = 10000
embed_dim = 16
max_length = 120
trunc = 'post'  # 'pre' or 'post'
pad = 'pre' # 'pre' or 'post'
oov = '<OOV>'

# Initialize tokenizer
tokenizer = Tokenizer(num_words = vocab_size,
                      oov_token=oov)

# Generate the vocab from our text data
tokenizer.fit_on_texts(x_train)

# Get word2idx for decoding
word2idx = tokenizer.word_index

# Transform train/test texts to sequences
sequences = tokenizer.texts_to_sequences(x_train)
test_seq = tokenizer.texts_to_sequences(x_test)

# Pad and truncate train/test sequences with the same settings
padded = pad_sequences(sequences, padding=pad, maxlen=max_length, truncating=trunc)
test_pad = pad_sequences(test_seq, padding=pad, maxlen=max_length, truncating=trunc)
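
To make the tokenizing and padding steps more concrete, here is a minimal pure-Python sketch of roughly what happens. It is a simplified stand-in, not the Keras implementation: the helper names (`simple_tokenize`, `fit_word_index`, `to_sequence`, `pad`) and the two toy sentences are assumptions made for illustration. It follows the Keras conventions that the OOV token takes index 1, remaining words are numbered by descending frequency, and index 0 is reserved for padding.

```python
from collections import Counter

# Characters stripped by default (mirrors the Tokenizer's default `filters`)
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_tokenize(text):
    # Lowercase, replace filtered characters with spaces, split on whitespace
    table = str.maketrans({c: ' ' for c in FILTERS})
    return text.lower().translate(table).split()

def fit_word_index(texts, oov_token='<OOV>'):
    # OOV gets index 1; other words are numbered from 2 by descending frequency
    counts = Counter(w for t in texts for w in simple_tokenize(t))
    word_index = {oov_token: 1}
    for i, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = i
    return word_index

def to_sequence(text, word_index, num_words, oov_token='<OOV>'):
    # Unknown words, and words ranked at or beyond num_words, map to OOV
    oov = word_index[oov_token]
    seq = []
    for w in simple_tokenize(text):
        idx = word_index.get(w)
        seq.append(idx if idx is not None and idx < num_words else oov)
    return seq

def pad(seq, maxlen, padding='pre', truncating='pre'):
    # Truncate first, then fill with zeros on the chosen side
    if len(seq) > maxlen:
        seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
    fill = [0] * (maxlen - len(seq))
    return fill + seq if padding == 'pre' else seq + fill

texts = ['The film was great, great fun!', 'The film was rubbish.']
word_index = fit_word_index(texts)
seq = to_sequence('The film was boring', word_index, num_words=100)
print(seq)                 # [2, 3, 4, 1]  ('boring' is out of vocabulary)
print(pad(seq, maxlen=6))  # [0, 0, 2, 3, 4, 1]
```

With `padding='pre'` the zeros go in front, which is why decoded training sequences start with a run of padding placeholders.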

We may be asked to decode sentences from our sequences. To do so, simply reverse the key/value pairs of `word2idx` to build an `idx2word` dictionary.

In [6]:
idx2word = {idx: word for word, idx in word2idx.items()}

def decode(text):
  # Decodes a sequence using the idx2word dict; index 0 (padding) has no
  # entry, so it is rendered as '?'
  return ' '.join([idx2word.get(i, '?') for i in text])

# Check it works - the initial ?'s correspond to padding
print(decode(padded[1]))
print(x_train[1])

? ? ? ? ? ? ? b'i have been known to fall asleep during films but this is usually due to a combination of things including really tired being warm and comfortable on the <OOV> and having just eaten a lot however on this occasion i fell asleep because the film was rubbish the plot development was constant constantly slow and boring things seemed to happen but with no explanation of what was causing them or why i admit i may have missed part of the film but i watched the majority of it and everything just seemed to happen of its own <OOV> without any real concern for anything else i cant recommend this film at all '
b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. The plot development was constant. Constantly slow and boring. Things seemed to happen, but with no explanation of 

After passing the text sequences to the embedding layer (this simply learns [embeddings](https://en.wikipedia.org/wiki/Word_embedding) for each word), we have a few options for building the rest of the network:

*   `layers.LSTM` - Long short-term memory
*   `layers.GRU` - Gated recurrent units
*   `layers.Bidirectional` wrapper - makes LSTM/GRU bidirectional
*   `layers.Conv1D` - temporal convolutions

Note: Temporal convolutions are followed by `GlobalAveragePooling1D` or `GlobalMaxPooling1D` before passing to a Dense layer.
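
The reason pooling is needed is that a temporal convolution still produces one feature vector per time step, while a Dense head expects one vector per example. The NumPy sketch below imitates the two pooling options on random data. The shapes are assumptions derived from the settings used in this notebook: with `max_length = 120`, a kernel size of 5 and the default `'valid'` padding, the convolution output has 120 - 5 + 1 = 116 steps, each with 128 channels.

```python
import numpy as np

# Stand-in for a Conv1D output: (batch, steps, channels)
batch, steps, channels = 2, 116, 128
features = np.random.default_rng(0).normal(size=(batch, steps, channels))

# Global pooling collapses the time axis, leaving one vector per example
avg_pooled = features.mean(axis=1)  # what GlobalAveragePooling1D computes
max_pooled = features.max(axis=1)   # what GlobalMaxPooling1D computes

print(avg_pooled.shape, max_pooled.shape)  # (2, 128) (2, 128)
```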


The `build_model` function below simply enumerates the options listed above.

In [0]:
def build_model(units = 32, layer_type = 'lstm', bidirectional = False, rec_count = 1):
  
  # units - units in recurrent layers
  # layer_type - one of {'lstm', 'gru', 'conv'}
  # bidirectional - bidirectional recurrent layers
  # rec_count - number of recurrent layers

  layer_types = ['lstm', 'gru', 'conv']

  if layer_type not in layer_types:
    raise ValueError("Invalid layer type. Expected one of: {}".format(repr(layer_types)))
  
  model = k.Sequential()
  # Text modelling begins with embeddings
  model.add(k.layers.Embedding(vocab_size, embed_dim, input_length=max_length))

  if layer_type == 'lstm':

    if bidirectional:
      for _ in range(rec_count-1):
        # Recurrent layers which pass their outputs to another recurrent layer 
        # need the `return_sequences` argument set to True.
        model.add(k.layers.Bidirectional(k.layers.LSTM(units, return_sequences=True, activation='relu')))
      # Final recurrent layer does not need to return sequences
      model.add(k.layers.Bidirectional(k.layers.LSTM(units, activation='relu')))

    else:
      for _ in range(rec_count-1):
        model.add(k.layers.LSTM(units, activation='relu', return_sequences=True))
      model.add(k.layers.LSTM(units, activation='relu'))

  elif layer_type == 'gru':

    if bidirectional:
      for _ in range(rec_count-1):
        model.add(k.layers.Bidirectional(k.layers.GRU(units, activation='relu', return_sequences=True)))
      model.add(k.layers.Bidirectional(k.layers.GRU(units, activation='relu')))

    else:
      for _ in range(rec_count-1):
        model.add(k.layers.GRU(units, activation='relu', return_sequences=True))
      model.add(k.layers.GRU(units, activation='relu'))

  else:

    model.add(k.layers.Conv1D(128, 5, activation='relu'))
    
    # Either average or max pooling may be used - you could even use both
    # if using the functional API
    model.add(k.layers.GlobalAveragePooling1D())
    #model.add(k.layers.GlobalMaxPooling1D())

  # Dense layers form the head of our network as usual
  model.add(k.layers.Dense(32, activation='relu'))
  model.add(k.layers.Dense(1, activation='sigmoid'))

  return model

In [0]:
# For 'conv' the bidirectional and rec_count arguments are ignored
model = build_model(32, 'conv', True, 2)

In [0]:
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])

We can now train the model as usual. It overfits very quickly; this can be mitigated with dropout or stronger weight decay.

In [10]:
epochs = 10
model.fit(padded, y_train, epochs=epochs, validation_data=(test_pad, y_test))

Train on 25000 samples, validate on 25000 samples


<tensorflow.python.keras.callbacks.History at 0x7f04f46ea668>

Word embeddings can be visualized using the [Embedding Projector](https://projector.tensorflow.org/) provided by TF. We simply need to save our words and their embeddings into .tsv files and then upload them onto the Projector site. Whilst this is unlikely to be tested it is still very cool to play around with.

In [0]:
# Grab the embedding layer
e = model.layers[0]
# Grab the embeddings
weights = e.get_weights()[0]
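
The embedding weights form a matrix with one row per word index, and the layer's forward pass amounts to looking up rows. A small NumPy sketch of that lookup follows; the toy sizes and the random matrix are assumptions standing in for the trained `weights`.

```python
import numpy as np

vocab_size, embed_dim = 10, 4  # toy sizes, not the notebook's 10000 and 16
rng = np.random.default_rng(0)
# Stand-in for e.get_weights()[0]: shape (vocab_size, embed_dim)
weights = rng.normal(size=(vocab_size, embed_dim))

# Two padded sequences of word indices (0 is the padding index)
batch = np.array([[0, 0, 3, 7],
                  [0, 5, 2, 9]])

# Row lookup: every index is replaced by its embedding vector
embedded = weights[batch]
print(embedded.shape)  # (2, 4, 4) -> (batch, sequence length, embed_dim)
```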

In [0]:
# out_v stores our embeddings (vectors)
# out_m stores the words (metadata)

with io.open('vecs.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('meta.tsv', 'w', encoding='utf-8') as out_m:
  for idx in range(1, vocab_size):
    word = idx2word[idx]
    embeddings = weights[idx]
    out_m.write(word + "\n")
    out_v.write('\t'.join([str(x) for x in embeddings]) + "\n")

In [0]:
# Use this to download them from colab

try:
  from google.colab import files
except ImportError:
  pass
else:
  files.download('vecs.tsv')
  files.download('meta.tsv')

We have now seen how to classify text, so we move on to generation. Generation can be achieved by using a sequence of words as the input and the following word as the label. This turns generation into a form of prediction: the model attempts to predict the next word from some sequence of words. A collection of Irish songs is used for this section.

In [15]:
!wget --no-check-certificate \
    https://storage.googleapis.com/laurencemoroney-blog.appspot.com/irish-lyrics-eof.txt \
    -O /tmp/irish-lyrics-eof.txt

--2020-03-18 11:01:55--  https://storage.googleapis.com/laurencemoroney-blog.appspot.com/irish-lyrics-eof.txt
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.71.128, 2a00:1450:400c:c0a::80
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.71.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 68970 (67K) [text/plain]
Saving to: ‘/tmp/irish-lyrics-eof.txt’


2020-03-18 11:01:56 (139 MB/s) - ‘/tmp/irish-lyrics-eof.txt’ saved [68970/68970]



In [0]:
tokenizer = Tokenizer() 

data = open('/tmp/irish-lyrics-eof.txt').read()

# lowercase and split sentences
text = data.lower().split('\n')

tokenizer.fit_on_texts(text)

# Add 1 because word indices start at 1 and index 0 is reserved for padding
word_count = len(tokenizer.word_index) + 1 

With our data separated into sentences, we take the sentences one at a time and generate n-grams from them. For example, if a sentence has been numericalized as:

[1, 2, 3, 4, 5, 6, 7]

Then we form the following n-gram sequences:

[1, 2]

[1, 2, 3]

[1, 2, 3, 4]

[1, 2, 3, 4, 5]

etc.

In [0]:
sequences = []

for line in text:
  # Returns a list in a list, so index at 0 to grab the list
  token_list = tokenizer.texts_to_sequences([line])[0]
  for i in range(1, len(token_list)):
    # Create incremental n-gram sequences
    n_gram_sequence = token_list[:i+1]
    sequences.append(n_gram_sequence)

We now take our n-gram sequences and pad them to the length of the longest sequence. Here we use pre padding since it allows us to easily grab the label from a sequence (remember, the label will be the last number in the sequence, so post padding would make this complicated).

In [0]:
# Get length of longest sequence
max_sequence_len = max([len(x) for x in sequences])
# Pad all sequences to this length
padded_sequences = np.array(pad_sequences(sequences,
                                         maxlen=max_sequence_len, 
                                         padding='pre'))

We now take each sequence (suppose they are all length `X`) and use the first `X-1` numbers as input and the number at `X` as the label. For example, given a sequence:

[1, 2, 3, 4, 5]

The input becomes:

[1, 2, 3, 4]

and the label becomes:

[5]

In [0]:
# Set input/label pairs
x_train, y_train = padded_sequences[:,:-1], padded_sequences[:,-1]
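
As a worked example of the n-gram, padding and splitting steps, the sketch below runs them on the toy sentence `[1, 2, 3, 4, 5, 6, 7]` used earlier. It uses plain Python and NumPy instead of `pad_sequences`, and the variable names are illustrative only.

```python
import numpy as np

token_list = [1, 2, 3, 4, 5, 6, 7]

# Incremental n-gram prefixes: [1, 2], [1, 2, 3], ..., [1, ..., 7]
prefixes = [token_list[:i + 1] for i in range(1, len(token_list))]

# Pre-pad with zeros so every row has the length of the longest prefix
max_len = max(len(p) for p in prefixes)
padded = np.array([[0] * (max_len - len(p)) + p for p in prefixes])

# The last column is the label, everything before it is the input
inputs, labels = padded[:, :-1], padded[:, -1]

print(padded[0])  # [0 0 0 0 0 1 2]
print(inputs[0])  # [0 0 0 0 0 1]
print(labels)     # [2 3 4 5 6 7]
```

Because the padding sits at the front, the label is always in the final column regardless of how long the original prefix was.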

Since the model aims to predict the following word (represented as a number), we need to 1-hot encode our labels. This can be done easily using `keras.utils.to_categorical`.

In [0]:
# 1-hot encode the labels
y_train_1hot = tf.keras.utils.to_categorical(y_train, num_classes=word_count)
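
For intuition, 1-hot encoding turns each integer label into a row that is all zeros except for a one at the label's position. The NumPy sketch below produces the same kind of matrix for three made-up labels; it is an illustration, not what `to_categorical` does internally.

```python
import numpy as np

labels = np.array([2, 0, 3])  # made-up word indices
num_classes = 5               # plays the role of word_count

# Row i of the identity matrix is the 1-hot vector for class i
one_hot = np.eye(num_classes, dtype='float32')[labels]

print(one_hot)
# [[0. 0. 1. 0. 0.]
#  [1. 0. 0. 0. 0.]
#  [0. 0. 0. 1. 0.]]
```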

Now that we have prepared our data, we are ready to define and train the model.

In [0]:
model = k.Sequential()
# Subtract 1 since we used the last number in each seq as label
model.add(k.layers.Embedding(word_count, 100, input_length=max_sequence_len-1))
model.add(k.layers.Bidirectional(k.layers.LSTM(150)))
model.add(k.layers.Dense(word_count, activation='softmax'))

In [0]:
# Categorical since it is aiming to predict a word out of many possibilities
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

In [45]:
history = model.fit(x_train, y_train_1hot, epochs=100, verbose=1)

Train on 12038 samples

Now that our model has been trained to predict a word from a given sequence of words, we are ready to generate some text.

Below, we pass our model some input text and get it to predict the next word. We then append this word to the input text and pass it back to the model to predict another word. We continue in this manner until the desired number of words has been generated.

In [47]:
# Initial sequence of words to feed to model
seed_text = 'Irish songs are'
# Generate this many new words
gen_to = 100

for _ in range(gen_to):
  # Numericalize seed
  token_list = tokenizer.texts_to_sequences([seed_text])[0]
  # Pad
  token_list = pad_sequences([token_list], maxlen=max_sequence_len-1,
                             padding='pre')
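  # Get prediction from input. predict_classes is not available in later
  # TensorFlow releases, so take the argmax of the softmax output instead.
  prediction = np.argmax(model.predict(token_list, verbose=0), axis=-1)[0]
  output_word = ""

  # Run through words until we find the prediction index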
  for word, index in tokenizer.word_index.items():
    if index == prediction:
      # Set the word corresponding to the prediction
      output_word = word
      break
  # Add to our current text
  seed_text += " " + output_word

print(seed_text)

Irish songs are made him a man again has mother been in her hand no more love while proud old narrow warm casey doneen affray skibbereen jenny warm time jenny captain havent dance sweethearts tears bent mythology slower casey alas fingers and margin parlour a score grand loud rags arms and i was side my fathers stone seen he had noise field no affection i had cheeks in a neat locality id died famine arms comes then time jenny get them jenny take me sweethearts one fought for brings started mavrone side sweethearts sweethearts tears but tears sir ivy boreen stand find sir
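
Each step of the generation loop boils down to two operations: take the index with the highest predicted probability, then map that index back to a word. The sketch below isolates that step with a made-up four-word vocabulary and a hand-written probability row standing in for the model's softmax output, both of which are assumptions for illustration. It also uses a reversed dictionary for the lookup instead of scanning `word_index` on every step.

```python
import numpy as np

# Hypothetical vocabulary; index 0 is the padding slot
word_index = {'the': 1, 'rose': 2, 'of': 3, 'tralee': 4}
index_word = {i: w for w, i in word_index.items()}

# Stand-in for model.predict(...): one row of probabilities over 5 classes
probs = np.array([[0.05, 0.10, 0.15, 0.10, 0.60]])

# Greedy choice: the most probable index becomes the next word
predicted = int(np.argmax(probs, axis=-1)[0])
next_word = index_word.get(predicted, '')

print(predicted, next_word)  # 4 tralee
```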


It is worth noting that word repetition is more likely to occur when only forward (unidirectional) recurrent layers are used. For better generation we should use bidirectional variants.

Finally, the above method may not be suited to a large corpus, since a large amount of RAM is required to store the 1-hot encodings for a large vocabulary. This can be mitigated using character sequencing, that is, giving the model a sequence of characters and asking it to predict the following character.
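
A rough back-of-the-envelope calculation shows how quickly the 1-hot label matrix grows. The numbers below are hypothetical (one million training sequences, a 10,000-word vocabulary versus a 100-character alphabet) and assume 4 bytes per value, as with `float32` labels.

```python
def one_hot_bytes(num_samples, num_classes, bytes_per_value=4):
    # One row per sample, one column per class
    return num_samples * num_classes * bytes_per_value

# Hypothetical corpus sizes, chosen only to illustrate the scaling
word_level = one_hot_bytes(1_000_000, 10_000)
char_level = one_hot_bytes(1_000_000, 100)

print(word_level / 1e9, 'GB')  # 40.0 GB
print(char_level / 1e9, 'GB')  # 0.4 GB
```

The label matrix scales with the number of classes, so shrinking the output space from words to characters reduces its size by the same factor.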