In this tutorial we will train word embeddings using a simple Keras model for a sentiment classification task, and then visualize them in the Embedding Projector.

## Representing text as numbers

ML models take vectors as input. When working with text, the first thing you must do is come up with a strategy to convert strings to numbers (or to vectorize the text) before feeding it to the model. In this section, we will look at three strategies for doing so.

### One-hot encodings

As a first idea, you might "one-hot" encode each word in your vocabulary. To represent each word, you will create a zero vector with length equal to the vocabulary size, then place a one in the index that corresponds to the word.

This approach is inefficient. A one-hot encoded vector is sparse (meaning, most indices are zero).
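To make the sparsity concrete, here is a minimal plain-Python sketch of one-hot encoding. The sentence, the `vocab` ordering, and the `one_hot` helper are illustrative assumptions, not part of the tutorial's pipeline.

```python
# Hypothetical toy sentence; any text would do.
sentence = "the cat sat on the mat"

# Build a vocabulary: sorted unique words, each with a fixed index.
vocab = sorted(set(sentence.split()))
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

for word in sentence.split():
    print(f"{word:>4} -> {one_hot(word)}")
```

Each vector here has five entries, four of which are zero. With a 10,000-word vocabulary, each vector would hold 9,999 zeros and a single one.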

### Encode each word with a unique number

A second approach you might try is to encode each word using a unique number. Assign an integer to each word and construct a vector from the integers of the words.

This approach is efficient. Instead of a sparse vector, you now have a dense one (where all elements are full).
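A minimal sketch of this integer encoding, again in plain Python. The toy sentence is hypothetical, and starting the indices at 1 is an assumption made here so that 0 stays free for padding.

```python
# Hypothetical toy sentence, as an illustration only.
sentence = "the cat sat on the mat"

vocab = sorted(set(sentence.split()))

# Assumption: indices start at 1 so that 0 can be reserved for padding.
word_to_index = {word: i + 1 for i, word in enumerate(vocab)}

# The sentence becomes one short, dense list of integers.
encoded = [word_to_index[word] for word in sentence.split()]
print(encoded)
```

This prints `[5, 1, 4, 3, 5, 2]`. Note that the integers are arbitrary: nothing about 1 and 2 implies that "cat" and "mat" are related, which is part of what motivates learned embeddings.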

### Word embeddings

Word embeddings give us a way to use an efficient, dense representation in which similar words have a similar encoding. Importantly, you don't have to specify this encoding by hand. An embedding is a dense vector of floating point values (the length of the vector is a parameter you specify). Instead of specifying the values for the embedding manually, they are trainable parameters (weights learned by the model during training, in the same way a model learns weights for a dense layer).

It is common to see word embeddings that are 8-dimensional (for small datasets), up to 1024-dimensional when working with large datasets. A higher dimensional embedding can capture fine-grained relationships between words, but takes more data to learn.

## Setup

In [None]:
import io
import os
import re
import shutil
import string
import tensorflow as tf

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.layers import TextVectorization

## Download the IMDb Dataset

You will use the Large Movie Review Dataset throughout the tutorial. You will train a sentiment classifier model on this dataset and in the process learn embeddings from scratch.

In [None]:
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset = tf.keras.utils.get_file("aclImdb_v1.tar.gz", url, untar=True, cache_dir='.', cache_subdir='')

dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
os.listdir(dataset_dir)

Take a look at the train/ directory. It has pos and neg folders with movie reviews labelled as positive and negative respectively. You will use reviews from pos and neg folders to train a binary classification model.

In [None]:
train_dir = os.path.join(dataset_dir, 'train')
os.listdir(train_dir)

The train directory also has additional folders which should be removed before creating the training dataset.

In [None]:
remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir)

Next, create a tf.data.Dataset using tf.keras.utils.text_dataset_from_directory.

In [None]:
batch_size = 1024
seed = 123

train_ds = tf.keras.utils.text_dataset_from_directory('aclImdb/train', batch_size=batch_size, validation_split=0.2, subset='training', seed=seed)
val_ds = tf.keras.utils.text_dataset_from_directory('aclImdb/train', batch_size=batch_size, validation_split=0.2, subset='validation', seed=seed)


Take a look at a few movie reviews and their labels (1: positive, 0: negative) from the train dataset.

In [None]:
for text_batch, label_batch in train_ds.take(1):
  for i in range(5):
    print(label_batch[i].numpy(), text_batch.numpy()[i])

## Configure the dataset for performance

These are two important methods you should use when loading data to make sure that I/O does not become blocking.
* .cache() keeps data in memory after it's loaded off disk. This will ensure the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache, which is more efficient to read than many small files.
* .prefetch() overlaps data preprocessing and model execution while training.

In [None]:
AUTOTUNE = tf.data.AUTOTUNE

train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

## Using the Embedding layer

The Embedding layer can be understood as a lookup table that maps from integer indices (which stand for specific words) to dense vectors (their embeddings). The dimensionality (or width) of the embedding is a parameter you can experiment with to see what works well for your problem, much in the same way you would experiment with the number of neurons in a Dense layer.

In [None]:
# Embed a 1000 word vocabulary into 5 dimensions.
embedding_layer = tf.keras.layers.Embedding(1000, 5)

When you create an Embedding layer, the weights for the embedding are randomly initialized (just like any other layer). During training, they are gradually adjusted via backpropagation. Once trained, the learned word embeddings will roughly encode similarities between words (as they were learned for the specific problem your model is trained on).

If you pass an integer to an embedding layer, the result replaces each integer with the vector from the embedding table:

In [None]:
result = embedding_layer(tf.constant([1,2,3]))
result.numpy()

For text or sequence problems, the Embedding layer takes a 2D tensor of integers, of shape (samples, sequence_length), where each entry is a sequence of integers. It can embed sequences of variable lengths. You could feed into the embedding layer above batches with shapes (32, 10) (batch of 32 sequences of length 10) or (64, 15) (batch of 64 sequences of length 15).

The returned tensor has one more axis than the input; the embedding vectors are aligned along the new last axis. Pass it a (2, 3) input batch and the output is (2, 3, N).

In [None]:
result = embedding_layer(tf.constant([[0, 1, 2], [3, 4, 5]]))
result.shape

When given a batch of sequences as input, an embedding layer returns a 3D floating point tensor, of shape (samples, sequence_length, embedding_dimensionality). To convert this sequence of variable length to a fixed representation there are a variety of standard approaches. You could use an RNN, Attention, or pooling layer before passing it to a Dense layer.
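The lookup and the pooling option can be sketched without TensorFlow. The snippet below is a simplified plain-Python illustration, not the Keras implementation: the table sizes, the random initialization range, and the helper names `embed` and `average_pool` are assumptions chosen for the example.

```python
import random

random.seed(0)  # fixed seed so the toy table is reproducible

vocab_size, embedding_dim = 6, 4  # toy sizes chosen for illustration

# The "layer" is a table with one row of small random floats per integer index.
table = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
         for _ in range(vocab_size)]

def embed(batch):
    """Replace every integer in a (samples, sequence_length) batch with its row."""
    return [[table[index] for index in sequence] for sequence in batch]

def average_pool(embedded):
    """Average over the sequence axis, leaving (samples, embedding_dim)."""
    return [[sum(column) / len(sequence) for column in zip(*sequence)]
            for sequence in embedded]

batch = [[0, 1, 2], [3, 4, 5]]    # shape (2, 3)
embedded = embed(batch)           # shape (2, 3, 4)
pooled = average_pool(embedded)   # shape (2, 4)

print(len(embedded), len(embedded[0]), len(embedded[0][0]))
print(len(pooled), len(pooled[0]))
```

The first line printed is `2 3 4` and the second is `2 4`: the lookup adds an axis, and averaging over the sequence axis removes it again, whatever the sequence length.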

## Text preprocessing

Next, define the dataset preprocessing steps required for your sentiment classification model. Initialize a TextVectorization layer with the desired parameters to vectorize movie reviews.
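As a rough mental model of what such a layer does (standardize, split on whitespace, map tokens to integers, then pad or truncate), here is a plain-Python sketch. The helper names, the example reviews, and the alphabetical tie-breaking are assumptions for illustration. The only conventions borrowed from the real layer are that index 0 stands for padding, index 1 for out-of-vocabulary tokens, and more frequent tokens get lower indices.

```python
import re
import string
from collections import Counter

def standardize(text):
    """Lowercase, replace '<br />' with a space, and drop punctuation."""
    text = text.lower().replace("<br />", " ")
    return re.sub("[%s]" % re.escape(string.punctuation), "", text)

def build_vocabulary(texts, max_tokens):
    """Most frequent tokens first; index 0 is padding, index 1 is unknown."""
    counts = Counter(token for text in texts for token in standardize(text).split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda token: (-counts[token], token))
    return ["", "[UNK]"] + ranked[: max_tokens - 2]

def vectorize(text, vocabulary, sequence_length):
    """Map tokens to indices, then truncate or zero-pad to a fixed length."""
    index_of = {token: i for i, token in enumerate(vocabulary)}
    ids = [index_of.get(token, 1) for token in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

reviews = ["A great film!<br />Great cast.", "A dull film."]  # hypothetical reviews
vocabulary = build_vocabulary(reviews, max_tokens=6)
print(vocabulary)
print(vectorize("Great film, terrible ending", vocabulary, sequence_length=6))
```

With these toy inputs the vocabulary is `['', '[UNK]', 'a', 'film', 'great', 'cast']` and the new review becomes `[4, 3, 1, 1, 0, 0]`: two known words, two unknown words, and two padding positions.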

In [None]:
# Create a custom standardization function to strip HTML break tags '<br />'
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
  return tf.strings.regex_replace(stripped_html, '[%s]' % re.escape(string.punctuation), '')

In [None]:
# Vocabulary size and number of words in a sequence
vocab_size = 10000
sequence_length = 100

In [None]:
# Use the text vectorization layer to normalize, split, and map strings to integers.
# Note that the layer uses the custom standardization defined above.
# Set a maximum sequence length, since not all samples have the same length.
vectorized_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=sequence_length,
)

# Make a text-only dataset (no labels) and call adapt to build the vocabulary.
text_ds = train_ds.map(lambda x, y: x)
vectorized_layer.adapt(text_ds)

## Create a classification model

Use the Keras Sequential API to define the sentiment classification model. In this case it is a "Continuous bag of words" style model.
* The TextVectorization layer transforms strings into vocabulary indices. You have already initialized vectorized_layer as a TextVectorization layer and built its vocabulary by calling adapt on text_ds. Now vectorized_layer can be used as the first layer of your end-to-end classification model, feeding transformed strings into the Embedding layer.
* The Embedding layer takes the integer-encoded vocabulary and looks up the embedding vector for each word-index. These vectors are learned as the model trains. The vectors add a dimension to the output array. The resulting dimensions are: (batch, sequence, embedding).
* The GlobalAveragePooling1D layer returns a fixed-length output vector for each example by averaging over the sequence dimension. This allows the model to handle input of variable length, in the simplest way possible.
* The fixed-length output vector is piped through a fully-connected (Dense) layer with 16 hidden units.
* The last layer is densely connected with a single output node.


Caution: This model doesn't use masking, so the zero-padding is used as part of the input and hence the padding length may affect the output.
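The effect is easy to see on a toy example. The sketch below is plain Python with a hypothetical 2-dimensional embedding table. It compares an unmasked average (every position counts, as in the model described above) with a masked average that ignores padding. The function names and the values are assumptions for illustration.

```python
def mean_over_sequence(vectors):
    """Unmasked average: every position counts, padding included."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def masked_mean(vectors, token_ids, pad_id=0):
    """Average only the positions whose token id is not the padding id."""
    kept = [vec for vec, tid in zip(vectors, token_ids) if tid != pad_id]
    return [sum(column) / len(kept) for column in zip(*kept)]

# Hypothetical embedding table; row 0 is the vector learned for the padding index.
table = {0: [0.5, -0.5], 1: [1.0, 2.0], 2: [3.0, 4.0]}

short_pad = [1, 2, 0]            # two words, one padding position
long_pad = [1, 2, 0, 0, 0, 0]    # the same two words, four padding positions

for ids in (short_pad, long_pad):
    vectors = [table[i] for i in ids]
    print(ids, mean_over_sequence(vectors), masked_mean(vectors, ids))
```

The masked average is `[2.0, 3.0]` in both cases, while the unmasked average shifts as more padding is appended. In Keras, one way to enable masking is the `mask_zero=True` argument of the Embedding layer.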

In [None]:
embedding_dim = 16

model = Sequential([
    vectorized_layer,
    Embedding(vocab_size, embedding_dim, name="embedding"),
    GlobalAveragePooling1D(),
    Dense(16, activation='relu'),
    Dense(1),
])

### Compile and train the model

You will use TensorBoard to visualize metrics including loss and accuracy. Create a tf.keras.callbacks.TensorBoard callback.

In [None]:
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir = "logs")

Compile and train the model using the Adam optimizer and BinaryCrossentropy loss.

In [None]:
model.compile(
    optimizer = 'adam',
    loss = tf.keras.losses.BinaryCrossentropy(from_logits = True),
    metrics = ['accuracy'],
)

In [None]:
train_epochs = 15

model.fit(
    train_ds,
    validation_data = val_ds,
    epochs = train_epochs,
    callbacks = [tensorboard_callback]
)

With this approach the model reaches a validation accuracy of around 78% (note that the model is overfitting since training accuracy is higher).
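The final Dense(1) layer has no activation, so the model emits raw scores (logits), which is why the loss is constructed with from_logits=True. To read a prediction as a probability you would pass the logit through a sigmoid. The sketch below shows that relationship in plain Python. The logit values are hypothetical, and the loss function is a textbook formulation rather than the Keras implementation.

```python
import math

def sigmoid(logit):
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

def binary_cross_entropy_from_logit(label, logit):
    """Per-example loss for a 0/1 label, computed via the probability."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

for logit in (-2.0, 0.0, 3.0):  # hypothetical model outputs
    p = sigmoid(logit)
    predicted = int(logit > 0)  # logit > 0 is equivalent to probability > 0.5
    print(f"logit={logit:+.1f} probability={p:.3f} predicted_label={predicted}")
```

A logit of 0 corresponds to a probability of exactly 0.5, so positive logits read as "positive review" and negative logits as "negative review".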

You can look into the model summary to learn more about each layer of the model.

In [None]:
model.summary()

Visualize the model metrics in TensorBoard.

In [None]:
%load_ext tensorboard
%tensorboard --logdir logs

## Retrieve the trained word embeddings and save them to disk

Next, retrieve the word embeddings learned during training. The embeddings are weights of the Embedding layer in the model. The weights matrix is of shape (vocab_size, embedding_dimension).

Obtain the weights from the model using get_layer() and get_weights(). The get_vocabulary() function provides the vocabulary to build a metadata file with one token per line.

In [None]:
weights = model.get_layer('embedding').get_weights()[0]
vocab = vectorized_layer.get_vocabulary()

In [None]:
# Write one embedding vector per line and the matching token per line.
# The with statement closes both files even if a write fails.
with io.open('vectors.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('metadata.tsv', 'w', encoding='utf-8') as out_m:
  for index, word in enumerate(vocab):
    if index == 0:
      continue  # skip 0, it's padding.
    vec = weights[index]
    out_v.write('\t'.join([str(x) for x in vec]) + "\n")
    out_m.write(word + "\n")

If you are running this tutorial in Colab, you can use the following snippet to download these files to your local machine (or use the file browser, View -> Table of contents -> File browser).



In [None]:
try:
  from google.colab import files
  files.download('vectors.tsv')
  files.download('metadata.tsv')
except Exception:
  pass

## Visualize the embeddings

To visualize the embeddings, upload them to the Embedding Projector.

Open the [Embedding Projector](http://projector.tensorflow.org/) (this can also run in a local TensorBoard instance).

* Click on "Load data".

* Upload the two files you created above: vectors.tsv and metadata.tsv.

The embeddings you have trained will now be displayed. You can search for words to find their closest neighbors. For example, try searching for "beautiful". You may see neighbors like "wonderful".
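A neighbor search of this kind can be sketched in a few lines. The example below uses cosine similarity, one common way to compare embedding vectors, on hypothetical 3-dimensional vectors. The values are made up for illustration and are not taken from a trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query, vectors, k=2):
    """Return the k words whose vectors are most similar to the query word's."""
    scores = [(cosine_similarity(vectors[query], vec), word)
              for word, vec in vectors.items() if word != query]
    return [word for _, word in sorted(scores, reverse=True)[:k]]

# Hypothetical vectors, not the output of the model trained above.
vectors = {
    "beautiful": [0.9, 0.8, 0.1],
    "wonderful": [0.8, 0.9, 0.2],
    "boring": [-0.7, -0.9, 0.1],
    "film": [0.1, 0.0, 0.9],
}
print(nearest_neighbors("beautiful", vectors))
```

With these toy vectors the result is `['wonderful', 'film']`. To try the same idea on your own embeddings, you could load the rows of vectors.tsv and the tokens of metadata.tsv into a dictionary of the same shape.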