What's in a (sub)word?
In this colab, we'll work with subwords, the smaller pieces that larger words are built from, and see how they affect our network and its embeddings.
We’ve worked with full words before for our sentiment models, and found some issues right at the start of the lesson when using character-based tokenization. Subwords are another approach, where individual words are broken up into their more commonly appearing pieces. This helps avoid marking very rare words as out-of-vocabulary (OOV) when you keep only the most common words in a corpus.
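
To make the idea concrete, here is a minimal sketch of splitting words into subwords. It is a toy: the vocabulary `toy_vocab` and the helper `greedy_subwords` are made up for illustration, and the greedy longest-match strategy is an assumption chosen for simplicity, not a description of how any particular library works.

```python
# Toy illustration only: a hand-picked (hypothetical) subword vocabulary,
# not one learned from a real corpus.
toy_vocab = {"un", "believ", "able", "watch", "ing", "ed"}


def greedy_subwords(word, vocab):
    """Split `word` into the longest matching vocabulary pieces, left to right.

    Characters that match no piece are emitted one at a time, so a rare
    word is broken into smaller units instead of becoming one OOV token.
    """
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


print(greedy_subwords("unbelievable", toy_vocab))
print(greedy_subwords("unwatchable", toy_vocab))
```

Both words are rare, but each one decomposes into pieces that also appear in many other words, which is why a small subword vocabulary can cover much more text than a full-word vocabulary of the same size.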

As shown in the video, this can further expose an issue affecting all of our models up to this point, in that they don’t understand the full context of the sequence of words in an input. The next lesson on recurrent neural networks will help address this issue.
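
One way to see this limitation: the model built later in this colab averages the embeddings of a sequence with `GlobalAveragePooling1D`, and an average does not depend on order. The NumPy sketch below uses a made-up embedding table (the values and token ids are hypothetical) to show that two orderings of the same tokens pool to the same vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding table: 10 token ids, 4 dimensions each.
embedding_table = rng.normal(size=(10, 4))

sequence = [3, 7, 1, 5]   # stand-in for an encoded sentence
shuffled = [5, 1, 7, 3]   # the same tokens in a different order

# Averaging over the sequence axis discards position information.
pooled_original = embedding_table[sequence].mean(axis=0)
pooled_shuffled = embedding_table[shuffled].mean(axis=0)

print(np.allclose(pooled_original, pooled_shuffled))
```

With subwords, a single word is spread over several tokens, so a model that ignores order loses even more of the original meaning.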

https://video.udacity-data.com/topher/2020/March/5e6fb669_subwords/subwords.png

Subword Datasets

A number of ready-made subword datasets are available online. If you check out the IMDB dataset on TFDS (https://www.tensorflow.org/datasets/catalog/imdb_reviews), for instance, and scroll down, you can see versions with 8,000 subwords and with 32,000 subwords (along with the regular full-word dataset).

We can also build our own. The cells below use TensorFlow Datasets' `SubwordTextEncoder` and its `build_from_corpus` function to create a subword vocabulary from the reviews dataset we used previously:


In [None]:
import tensorflow as tf

from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
!wget --no-check-certificate \
    https://drive.google.com/uc?id=13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P \
    -O /tmp/sentiment.csv
    
    # https://www.kaggle.com/marklvl/sentiment-labelled-sentences-data-set

In [None]:
path = tf.keras.utils.get_file('sentiment.csv', 'https://drive.google.com/uc?id=13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P')

In [None]:
import pandas as pd

#dataset = pd.read_csv('/tmp/sentiment.csv')
dataset = pd.read_csv(path)

# Just extract out sentences and labels first - we will create subwords here
sentences = dataset['text'].tolist()
labels = dataset['sentiment'].tolist()

We can use the existing Amazon and Yelp reviews dataset with `tensorflow_datasets`'s `SubwordTextEncoder` functionality. `SubwordTextEncoder.build_from_corpus()` creates a tokenizer for us, and the same function can be used to get subwords from a much larger corpus of text as well.

The Amazon and Yelp dataset we are using isn't very large, so we'll build a subword vocabulary with a target `vocab_size` of only 1,000 subwords, and limit each subword to at most 5 characters with `max_subword_length`. Documentation [here](https://www.tensorflow.org/datasets/api_docs/python/tfds/features/text/SubwordTextEncoder#build_from_corpus).

In [None]:
# note this is the code in the past examples without using subwords
#tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
#tokenizer.fit_on_texts(training_sentences)

In [None]:
import tensorflow_datasets as tfds

In [None]:
vocab_size = 1000
tokenizer = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(sentences, vocab_size, max_subword_length=5)

In [None]:
# Check that the tokenizer works appropriately
num = 5
print(sentences[num])
encoded = tokenizer.encode(sentences[num])
print(encoded)

In [None]:
# Separately print out each subword, decoded
for i in encoded:
  print(tokenizer.decode([i]))

In [None]:
# Replace sentence data with encoded subwords.
# Now, we'll re-create the dataset to be used for training by actually
# encoding each of the individual sentences.
# This is equivalent to texts_to_sequences with the Tokenizer we used in earlier exercises.
for i, sentence in enumerate(sentences):
  sentences[i] = tokenizer.encode(sentence)

In [None]:
# Check the sentences are appropriately replaced
print(sentences[1])

In [None]:
# Before training, we still need to pad the sequences, as well as split into training and test sets.

import numpy as np

max_length = 50
trunc_type='post'
padding_type='post'

# Pad all sentences
sentences_padded = pad_sequences(sentences, maxlen=max_length, 
                                 padding=padding_type, truncating=trunc_type)

# Separate out the sentences and labels into training and test sets
training_size = int(len(sentences) * 0.8)

training_sentences = sentences_padded[0:training_size]
testing_sentences = sentences_padded[training_size:]
training_labels = labels[0:training_size]
testing_labels = labels[training_size:]

# Make labels into numpy arrays for use with the network later
training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)
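
For intuition about what `pad_sequences` does with `padding='post'` and `truncating='post'`, here is a plain-Python sketch. The helper `pad_post` is hypothetical and only mimics those two settings with a padding value of 0. It is not the Keras implementation, and it returns a list where `pad_sequences` returns a NumPy array.

```python
def pad_post(sequence, maxlen, value=0):
    """Mimic 'post' truncation and 'post' padding for one sequence."""
    # 'post' truncation keeps the first maxlen tokens and drops the rest.
    truncated = list(sequence[:maxlen])
    # 'post' padding appends the fill value until the length is maxlen.
    return truncated + [value] * (maxlen - len(truncated))


short_example = [12, 7, 301]            # shorter than maxlen: gets padded
long_example = [5, 9, 2, 44, 8, 61, 3]  # longer than maxlen: gets cut

print(pad_post(short_example, maxlen=5))
print(pad_post(long_example, maxlen=5))
```

Because subword encoding produces more tokens per sentence than full-word encoding, a given `max_length` truncates more of each review than it would with full words.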

In [None]:
embedding_dim = 16
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.summary()

In [None]:
num_epochs = 30
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
history = model.fit(training_sentences, 
                    training_labels_final, 
                    epochs=num_epochs, 
                    validation_data=(testing_sentences, testing_labels_final))

In [None]:
# Does validation accuracy or loss appear to trend differently here than it did with full words?
import matplotlib.pyplot as plt


def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()
  
plot_graphs(history, "accuracy")
plot_graphs(history, "loss")

In [None]:
# First get the weights of the embedding layer
e = model.layers[0]
weights = e.get_weights()[0]
print(weights.shape) # shape: (vocab_size, embedding_dim)

Note that the code below has a few small changes to handle the different way text is encoded in our dataset, compared to the built-in `Tokenizer` we used before. You may get an error like "Number of tensors (999) do not match the number of lines in metadata (992)." As long as you load the vectors first without error and wait a few seconds after this pops up, you will be able to click outside the file load menu and still view the visualization.

In [None]:
import io

# Write out the embedding vectors and metadata
# (the with-statement closes both files even if a write fails)
with io.open('vecs.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('meta.tsv', 'w', encoding='utf-8') as out_m:
  for word_num in range(0, vocab_size - 1):
    word = tokenizer.decode([word_num])
    embeddings = weights[word_num]
    out_m.write(word + "\n")
    out_v.write('\t'.join([str(x) for x in embeddings]) + "\n")

In [None]:
# Download the files
try:
  from google.colab import files
except ImportError:
  pass
else:
  files.download('vecs.tsv')
  files.download('meta.tsv')

You’ve already learned an amazing amount of material on Natural Language Processing with TensorFlow in this lesson.

You started with Tokenization by:

- Tokenizing input text
- Creating and padding sequences
- Incorporating out-of-vocabulary words
- Generalizing tokenization and sequence methods to real-world datasets

From there, you moved on to Embeddings, where you:

- Transformed tokenized sequences into embeddings
- Developed a basic sentiment analysis model
- Visualized the embedding vectors
- Tweaked hyperparameters of the model to improve it
- Diagnosed potential issues with using pre-trained subword tokenizers when the network doesn’t have sequence context

In the next lesson, you’ll dive into Recurrent Neural Networks, which will be able to understand the sequence of inputs, and you'll learn how to generate new text.