# Introduction to Static Word Embeddings

Code from the [TensorFlow word embeddings guide](https://www.tensorflow.org/text/guide/word_embeddings).

This tutorial contains an introduction to word embeddings. <br>
You will train your own word embeddings using a simple Keras model for a **sentiment classification** task, and then visualize them.

In [None]:
# import libraries
import io
import os
import re
import shutil
import string
import tensorflow as tf

from tensorboard.plugins import projector

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.layers import TextVectorization


## Download the IMDb Dataset

The dataset contains 25,000 highly polar movie reviews for training and 25,000 for testing. There are two classes: positive reviews and negative reviews.

In [None]:
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset = tf.keras.utils.get_file("aclImdb_v1.tar.gz", url,
                                  untar=True, cache_dir='.',
                                  cache_subdir='')

dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
os.listdir(dataset_dir)

The `train/` directory has `pos` and `neg` folders with movie reviews labelled as positive and negative respectively. You will use reviews from the `pos` and `neg` folders to train a binary classification model.

In [None]:
train_dir = os.path.join(dataset_dir, 'train')
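# Remove the unlabelled 'unsup' folder so that only 'pos' and 'neg' remain;
# otherwise text_dataset_from_directory would treat it as a third class.
remove_dir = os.path.join(train_dir, 'unsup')
if os.path.isdir(remove_dir):   # guard so the cell can be re-run safely
  shutil.rmtree(remove_dir)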
os.listdir(train_dir)

Use the `train` directory to create both train and validation datasets with a split of 20% for validation.

In [None]:
batch_size = 1024
seed = 123
# Both calls must use the same seed and validation_split so that the
# training and validation subsets do not overlap.
train_ds = tf.keras.utils.text_dataset_from_directory(
    train_dir, batch_size=batch_size, validation_split=0.2,
    subset='training', seed=seed)
val_ds = tf.keras.utils.text_dataset_from_directory(
    train_dir, batch_size=batch_size, validation_split=0.2,
    subset='validation', seed=seed)
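
Why the same `seed` has to be passed to both calls can be illustrated with a minimal sketch in plain Python. It assumes that the split shuffles the file list deterministically and then slices it; the helper `split_indices` is hypothetical, and the actual Keras implementation may differ in detail.

```python
import random

def split_indices(n_samples, validation_split, seed, subset):
    """Return the sample indices that belong to the requested subset (illustrative helper)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # same seed -> same order in both calls
    n_val = int(validation_split * n_samples)
    if subset == "training":
        return indices[:n_samples - n_val]
    if subset == "validation":
        return indices[n_samples - n_val:]
    raise ValueError("subset must be 'training' or 'validation'")

train_idx = split_indices(25000, 0.2, seed=123, subset="training")
val_idx = split_indices(25000, 0.2, seed=123, subset="validation")
print(len(train_idx), len(val_idx))            # 20000 5000
print(set(train_idx).isdisjoint(val_idx))      # True
```

With different seeds, the two calls would shuffle differently, and some reviews could end up in both subsets.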


Take a look at a few movie reviews and their labels (1: positive, 0: negative) from the train dataset.

In [None]:
for text_batch, label_batch in train_ds.take(1):
  for i in range(5):
    print(label_batch[i].numpy(), text_batch.numpy()[i])


### Configure the dataset for performance

These are two important methods you should use when loading data to make sure that I/O does not become blocking.

* `.cache()` keeps data in memory after it's loaded off disk. This will ensure the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache, which is more efficient to read than many small files.

* `.prefetch()` overlaps data preprocessing and model execution while training.

You can learn more about both methods, as well as how to cache data to disk in the [data performance guide](https://www.tensorflow.org/guide/data_performance).

In [None]:
AUTOTUNE = tf.data.AUTOTUNE

train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
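
The idea behind `.cache()` can be sketched in plain Python: the first pass reads from the slow source and remembers every element, later passes replay them from memory. `CachedDataset` and `slow_reader` are hypothetical names for this illustration and are not part of `tf.data`.

```python
class CachedDataset:
    """Replays elements from memory after the first full pass over the source."""

    def __init__(self, source_fn):
        self._source_fn = source_fn   # callable returning a fresh iterator (the slow read)
        self._cache = None            # filled once the first pass has completed

    def __iter__(self):
        if self._cache is not None:
            yield from self._cache    # later epochs: served from memory
            return
        seen = []
        for element in self._source_fn():
            seen.append(element)
            yield element
        self._cache = seen            # keep the cache only after a complete pass


read_log = []                         # records every access to the slow source

def slow_reader():
    """Stand-in for reading review files from disk."""
    for i in range(3):
        read_log.append(i)
        yield f"review {i}"

ds = CachedDataset(slow_reader)
first_epoch = list(ds)
second_epoch = list(ds)
print(first_epoch == second_epoch, len(read_log))
```

The second epoch returns the same elements without touching the source again, which is the effect the cached dataset has across training epochs.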


### Using the Embedding layer

Keras makes it easy to use word embeddings. The `Embedding` layer can be understood as a lookup table that maps from integer indices (which stand for specific words) to dense vectors (their embeddings). The dimensionality (or width) of the embedding is a parameter you can experiment with to see what works well for your problem, much in the same way you would experiment with the number of neurons in a Dense layer.

In [None]:
# Embed a 1,000 word vocabulary into 5 dimensions.
embedding_layer = tf.keras.layers.Embedding(1000, 5)

When you create an Embedding layer, the weights for the embedding are **randomly initialized** (just like any other layer). During training, they are gradually adjusted via **backpropagation**. Once trained, the learned word embeddings will roughly encode similarities between words (as they were learned for the specific problem your model is trained on).

In [None]:
result = embedding_layer(tf.constant([1, 2, 3]))  # extract embeddings for words with indices 1, 2, 3
print('--- initial embeddings for words with indices 1, 2, 3 ---')
result.numpy()
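
The lookup-table view of the `Embedding` layer can be made concrete with a small sketch in plain Python. It is not the Keras implementation: the table is a nested list, and the initialization range is an assumption chosen for illustration.

```python
import random

vocab_size, embedding_dim = 1000, 5
rng = random.Random(0)

# One row of `embedding_dim` small random floats per vocabulary index.
embedding_table = [
    [rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

def embed(indices):
    """Look up the vector stored for each integer index."""
    return [embedding_table[i] for i in indices]

vectors = embed([1, 2, 3])
print(len(vectors), len(vectors[0]))   # 3 5
```

Training only changes the numbers stored in the rows; the lookup itself stays the same.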

## Preprocess Text Data

In [None]:
# Create a custom standardization function to strip HTML break tags '<br />'.
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)                                    # convert to lower case
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')          # replace break tags with a space
  return tf.strings.regex_replace(stripped_html,
                                  '[%s]' % re.escape(string.punctuation), '') # remove punctuation


# Vocabulary size and number of words in a sequence.
vocab_size = 10000
sequence_length = 100

# Use the text vectorization layer to normalize, split, and map strings to
# integers. Note that the layer uses the custom standardization defined above.
# Set output_sequence_length because the reviews differ in length.
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,                    # keep only the most frequent tokens; all others map to [UNK]
    output_mode='int',
    output_sequence_length=sequence_length)   # pad or truncate every example to this length

# Make a text-only dataset (no labels) and call adapt to build the vocabulary.
text_ds = train_ds.map(lambda x, y: x)
vectorize_layer.adapt(text_ds)   # Computes a vocabulary of string terms from tokens in a dataset.
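
What standardization, `adapt`, and the integer mapping do can be sketched in plain Python. This is a simplified illustration, not the `TextVectorization` implementation: the helper names are hypothetical, and it assumes index 0 is reserved for padding and index 1 for unknown tokens, with the remaining slots filled by frequency.

```python
import re
import string
from collections import Counter

def standardize(text):
    """Lowercase, replace '<br />' with a space, and drop ASCII punctuation."""
    text = text.lower().replace("<br />", " ")
    return re.sub("[%s]" % re.escape(string.punctuation), "", text)

def build_vocabulary(texts, max_tokens):
    """Most frequent tokens first; slots 0 and 1 are reserved."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(text, vocabulary, sequence_length):
    """Map tokens to indices, then truncate or zero-pad to a fixed length."""
    index = {tok: i for i, tok in enumerate(vocabulary)}
    ids = [index.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

reviews = ["Great movie!<br />Great cast.", "Terrible movie..."]
toy_vocab = build_vocabulary(reviews, max_tokens=5)
print(toy_vocab)                                              # ['', '[UNK]', 'great', 'movie', 'cast']
print(vectorize("A great, GREAT movie", toy_vocab, sequence_length=6))   # [1, 2, 2, 3, 0, 0]
```

The word "a" is not in the toy vocabulary and therefore maps to index 1, and the two trailing zeros are padding.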


In [None]:
vectorize_layer.adapt?

Print the original data and the vectorized data.

In [None]:
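for x, y in train_ds.take(3):           # look at the first three batches
  print('len(x)', len(x))
  print(x)                              # original review strings
  print(vectorize_layer(x))             # the same reviews as integer token indices
  print('len(y)', len(y))
  print(y)                              # labels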

## Define the Classification Model

Use the Keras Sequential API to define the sentiment classification model. In this case it is a "Continuous bag of words" style model.

*  The `TextVectorization` layer transforms strings into vocabulary indices. <br> You have already initialized `vectorize_layer` as a TextVectorization layer and built its vocabulary by calling `adapt` on the text data `text_ds`. <br>
Now vectorize_layer can be used as the first layer of your end-to-end classification model, feeding transformed strings into the Embedding layer.

*  The `Embedding` layer takes the integer-encoded vocabulary and looks up the embedding vector for each word-index. These vectors are learned as the model trains. The vectors add a dimension to the output array. The resulting dimensions are:
  * element index in batch
  * index of word in sequence
  * index inside embedding

* The `GlobalAveragePooling1D` layer returns a fixed-length output vector for each example by averaging over the sequence dimension (i.e. average over all word embeddings of a sequence). This allows the model to handle input of variable length, in the simplest way possible.

* The fixed-length output vector is piped through a fully-connected (Dense) layer with 32 hidden units.

* The last layer is densely connected with a single output node.
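
The averaging step performed by `GlobalAveragePooling1D` can be sketched in plain Python on a nested list with the three dimensions listed above (batch, sequence, embedding). The function name is hypothetical and the sketch ignores masking.

```python
def global_average_pooling_1d(batch):
    """Average over the sequence axis of a (batch, sequence, embedding) nested list."""
    pooled = []
    for sequence in batch:
        dim = len(sequence[0])
        pooled.append(
            [sum(vec[d] for vec in sequence) / len(sequence) for d in range(dim)]
        )
    return pooled

# One example with three word vectors of width 2 -> one vector of width 2.
batch = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]
print(global_average_pooling_1d(batch))   # [[3.0, 4.0]]
```

Because the model below does not set `mask_zero` on the `Embedding` layer, the vectors of padded positions are part of the average as well.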


In [None]:
embedding_dim=16

model = Sequential([
  vectorize_layer,                                         # vectorization on the fly
  Embedding(vocab_size, embedding_dim, name="embedding"),  # get embedding for each word
  GlobalAveragePooling1D(),                                # get the average embeddings of a text
  Dense(32, activation='relu'),                            # apply nonlinear layer
  Dense(1)                                                 # single output: a raw logit (no activation);
                                                           # the loss function applies the sigmoid
])


In [None]:
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

The loss function `BinaryCrossentropy(from_logits=True)` receives the model output (logit) $x$ and the observed class label $z\in\{0,1\}$. It computes
$$ -z \log(\operatorname{sigmoid}(x)) - (1 - z) \log(1 - \operatorname{sigmoid}(x)) $$
yielding
* $-\log(\operatorname{sigmoid}(x))$ for $z=1$ and
* $-\log(1 - \operatorname{sigmoid}(x))$ for $z=0$.

Note that $\operatorname{sigmoid}(x) = \exp(x)/(1+\exp(x))$ always outputs values in $(0,1)$.
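
The formula above can be checked numerically with a short sketch in plain Python. The second function is an algebraically equivalent form that avoids taking the logarithm of a value close to zero; it is shown here as an illustration of how such a loss can be computed stably, not as the exact Keras code path.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_naive(x, z):
    """Direct transcription of the formula above."""
    p = sigmoid(x)
    return -z * math.log(p) - (1 - z) * math.log(1 - p)

def bce_from_logits(x, z):
    """Equivalent form that stays finite for large |x|."""
    return max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))

for x, z in [(2.0, 1), (2.0, 0), (-3.0, 1)]:
    print(x, z, round(bce_naive(x, z), 6), round(bce_from_logits(x, z), 6))
```

A confident correct prediction (large positive $x$ with $z=1$) gives a loss near zero, while a confident wrong prediction gives a loss that grows roughly linearly in $|x|$.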

In [None]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),  # need high probability for correct class
              metrics=['accuracy'])
model.summary()

In [None]:
model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=15,
    callbacks=[tensorboard_callback])

0.7898 for 16 <br>
0.8002 for 32

In [None]:
#docs_infra: no_execute
%load_ext tensorboard
%tensorboard --logdir logs

Extract the weights and the vocabulary and save them to files

In [None]:
weights = model.get_layer('embedding').get_weights()[0]
vocab = vectorize_layer.get_vocabulary()
# Write one embedding vector per line (tab-separated) and the matching word
# on the same line number of the metadata file.
with io.open('vectors.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('metadata.tsv', 'w', encoding='utf-8') as out_m:
  for index, word in enumerate(vocab):
    if index == 0:
      continue  # skip 0, it's padding.
    vec = weights[index]
    out_v.write('\t'.join([str(x) for x in vec]) + "\n")
    out_m.write(word + "\n")
!ls
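
The two files can also be inspected without the projector. The following sketch reads them back and lists the nearest neighbours of a word by cosine similarity; it assumes the line-aligned TSV layout written above. The helper names and the tiny made-up files are hypothetical and only serve to make the sketch runnable without a trained model.

```python
import io
import math
import os
import tempfile

def load_embeddings(vectors_path, metadata_path):
    """Return {word: vector} from the two line-aligned TSV files."""
    with io.open(vectors_path, encoding="utf-8") as f_vec, \
         io.open(metadata_path, encoding="utf-8") as f_meta:
        return {
            word.rstrip("\n"): [float(x) for x in line.split("\t")]
            for word, line in zip(f_meta, f_vec)
        }

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, embeddings, k=2):
    """The k words whose vectors point in the most similar direction."""
    query = embeddings[word]
    others = [(cosine(query, vec), w) for w, vec in embeddings.items() if w != word]
    return [w for _, w in sorted(others, reverse=True)[:k]]

# Tiny made-up files so that the sketch runs on its own.
tmp = tempfile.mkdtemp()
vec_path = os.path.join(tmp, "vectors.tsv")
meta_path = os.path.join(tmp, "metadata.tsv")
with io.open(vec_path, "w", encoding="utf-8") as f:
    f.write("1.0\t0.0\n0.9\t0.1\n-1.0\t0.0\n")
with io.open(meta_path, "w", encoding="utf-8") as f:
    f.write("good\ngreat\nbad\n")

embeddings = load_embeddings(vec_path, meta_path)
print(nearest("good", embeddings, k=1))   # ['great']
```

To try it on the trained embeddings, pass `'vectors.tsv'` and `'metadata.tsv'` to `load_embeddings` instead of the temporary files.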

Download the files to your local machine (this only works when running in Colab).

In [None]:
try:
  from google.colab import files
  files.download('vectors.tsv')   # Downloads the file to the user's local disk via a browser download action.
  files.download('metadata.tsv')
except Exception:
  pass


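Open the [Embedding Projector](http://projector.tensorflow.org/) and load `vectors.tsv` and `metadata.tsv` to visualize the embeddings.

This can also be done inside the notebook with the [TensorBoard projector plugin](https://www.tensorflow.org/tensorboard/tensorboard_projector_plugin).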