### Subword Tokenization with the IMDB Reviews Dataset

##### Importing libraries

In [1]:
import tensorflow as tf
import tensorflow_datasets as tfds
import matplotlib.pyplot as plt
import keras_nlp

In [2]:
imdb = tfds.load("imdb_reviews", as_supervised=True)


In [3]:
#extracting reviews and labels

train_reviews = imdb['train'].map(lambda review, label: review)
test_reviews = imdb['test'].map(lambda review, label: review)
train_labels = imdb['train'].map(lambda review, label: label)
test_labels = imdb['test'].map(lambda review, label: label)

In [4]:
list(train_reviews.take(2))[0]

<tf.Tensor: shape=(), dtype=string, numpy=b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.">

##### Subword Tokenization

In [5]:
#parameters for tokenization and padding

vocab_size = 10000
max_length = 120
padding_type = "pre"
truncating_type = "post"

In [6]:
#instantiating vectorization layer
vectorize_layer = tf.keras.layers.TextVectorization(max_tokens=vocab_size)

#generating vocabulary based on training reviews
vectorize_layer.adapt(train_reviews)


In [9]:
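def padding_func(sequences):
    # Collect every variable-length sequence into a single ragged batch
    sequences = sequences.ragged_batch(batch_size=sequences.cardinality())

    # Extract that single batch as one RaggedTensor
    sequences = sequences.get_single_element()

    # Pad or truncate each sequence to max_length
    padded_sequences = tf.keras.utils.pad_sequences(
        sequences.numpy(),
        maxlen=max_length,
        padding=padding_type,
        truncating=truncating_type,
    )

    # Convert the padded array back into a tf.data.Dataset
    padded_sequences = tf.data.Dataset.from_tensor_slices(padded_sequences)

    return padded_sequences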
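
To make the effect of the "pre" padding and "post" truncating settings concrete, here is a minimal pure-Python sketch of the same idea on a single list. The helper name `pad_or_truncate` is hypothetical and is not part of TensorFlow; it only mirrors the behavior assumed of `pad_sequences` for these two settings.

```python
def pad_or_truncate(sequence, maxlen, padding="pre", truncating="post", value=0):
    """Illustrative stand-in (hypothetical helper) for padding one sequence."""
    trimmed = list(sequence)
    if len(trimmed) > maxlen:
        if truncating == "post":
            # "post" truncation drops tokens from the end
            trimmed = trimmed[:maxlen]
        else:
            # "pre" truncation drops tokens from the start
            trimmed = trimmed[len(trimmed) - maxlen:]
    pad = [value] * (maxlen - len(trimmed))
    # "pre" padding puts the filler values in front of the tokens
    return pad + trimmed if padding == "pre" else trimmed + pad


short_example = pad_or_truncate([5, 6, 7], maxlen=5)
long_example = pad_or_truncate([1, 2, 3, 4, 5, 6], maxlen=4)
print(short_example)
print(long_example)
```

With the settings used in this notebook, short reviews get zeros added at the front, while long reviews keep their first `max_length` tokens.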

In [10]:
train_sequences = train_reviews.map(vectorize_layer).apply(padding_func)

The vectorization layer above was built with a vocab_size of 10000, so it is easy to find out-of-vocabulary (OOV) tokens when you decode a sequence using the vocabulary it created:

In [16]:
#get the vocabulary

imdb_vocab_fillword = vectorize_layer.get_vocabulary()

#get a sample integer sequence
sample_sequence = train_sequences.take(1).get_single_element()

#lookup each token in the vocabulary
decoded_text = [imdb_vocab_fillword[index] for index in sample_sequence]
decoded_text = " ".join(decoded_text)
print(decoded_text)

    this was an absolutely terrible movie dont be [UNK] in by christopher walken or michael [UNK] both are great actors but this must simply be their worst role in history even their great acting could not redeem this movies ridiculous storyline this movie is an early nineties us propaganda piece the most pathetic scenes were those when the [UNK] rebels were making their cases for [UNK] maria [UNK] [UNK] appeared phony and her [UNK] affair with walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning i am disappointed that there are movies like this ruining actors like christopher [UNK] good name i could barely sit through it
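
One rough way to quantify how often this happens is to count the share of `[UNK]` tokens in a decoded string. The sketch below is only an illustration: the helper `oov_rate` is a hypothetical name, and the sample string is a short excerpt of the decoded output above rather than the full review.

```python
def oov_rate(decoded_text, oov_token="[UNK]"):
    """Return the fraction of whitespace-separated tokens equal to oov_token."""
    tokens = decoded_text.split()
    if not tokens:
        return 0.0
    return sum(token == oov_token for token in tokens) / len(tokens)


# Excerpt of the decoded review shown above
sample = (
    "this was an absolutely terrible movie dont be [UNK] in by "
    "christopher walken or michael [UNK]"
)
print(f"OOV rate: {oov_rate(sample):.2%}")
```

Passing the full `decoded_text` from the previous cell instead of the excerpt would give the rate for the whole review, padding tokens excluded because `split()` discards empty strings.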


* For binary classifiers, this might not have a big impact, but other applications (e.g. text generation) benefit from avoiding OOV tokens when training the model. To get rid of OOVs with the tokenizer above, you would have to increase the vocabulary size from the current 10k to more than 88k. A vocabulary that large can slow down training and bloat the model size. The encoder also won't be robust when used on other datasets, which may contain new words and so produce OOVs again.

* Subword text encoding gets around this problem by composing whole words from smaller word parts. This makes it more flexible when it encounters uncommon words.
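
As a rough illustration of the subword idea, the sketch below splits a word into the longest pieces it can find in a vocabulary, in the style of WordPiece greedy matching. It is a toy example under stated assumptions: the function `greedy_subword_tokenize`, the `##` continuation prefix, and the tiny `toy_vocab` are all hypothetical and are not taken from this notebook or from any library's actual implementation.

```python
def greedy_subword_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split a word into subwords using greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the continuation prefix
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position, so the whole word is unknown
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary of subword pieces
toy_vocab = {"iron", "##side", "story", "##line", "##s"}

print(greedy_subword_tokenize("ironside", toy_vocab))
print(greedy_subword_tokenize("storylines", toy_vocab))
print(greedy_subword_tokenize("zebra", toy_vocab))
```

A word such as "ironside", which the word-level vocabulary above maps to `[UNK]`, can be represented here as two known pieces, while a word with no matching pieces still falls back to the unknown token.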
