# DL Lab 3.2 - Semantic Word Embeddings

Welcome to the DL Lab! In this lab, you will train a **word embedding** from scratch and investigate its interesting properties.

## Today's Learning Objectives

- Use **Tokenizers** for **word vectorization**.
- Train models containing **word embeddings**.
- Compute **word similarities** and retrieve similar and analogous words.

***

**Note**: Training DNNs is computationally expensive. Most of the computations can be parallelized very efficiently, making them a perfect fit for GPU acceleration. To enable a GPU for your Colab session, follow these steps:
- Click '*Runtime*' -> '*Change runtime type*'
- In the pop-up window for '*Hardware accelerator*', select '*GPU*'
- Click '*Save*'

# 1 - Word Embeddings

Word embeddings give us a way to use an efficient, **dense representation** in which **similar words** have a **similar encoding**. Importantly, we do not have to specify this encoding by hand!

An embedding is a dense vector of floating point values. The length of the vector is a parameter you specify. The values of the embedding are trainable parameters, i.e., weights learned by the model during training. It is common to see word embeddings that are 8-dimensional for small datasets and up to 1024-dimensional for large datasets. A higher-dimensional embedding can capture more fine-grained relationships between words, but requires more data to learn.

An intuitive way to think of an embedding is as a lookup table. After the embedding weights have been learned, we can encode each word by looking up the dense vector it corresponds to in the table.
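
The lookup-table view can be sketched in a few lines of NumPy. This is only an illustration: the names `toy_vocab`, `toy_table`, and `lookup` are hypothetical, and the values are hand-picked rather than learned.

```python
import numpy as np

# Hypothetical toy vocabulary and embedding table (hand-picked, not learned):
# one row per word, one column per embedding dimension.
toy_vocab = ["cat", "dog", "car"]
toy_table = np.array([
    [0.9, 0.1],   # "cat"
    [0.8, 0.2],   # "dog"
    [0.1, 0.9],   # "car"
])

def lookup(word):
    """Return the row of the table that belongs to `word`."""
    return toy_table[toy_vocab.index(word)]

print(lookup("dog"))  # [0.8 0.2]
```

In this toy table, "cat" and "dog" have similar rows while "car" differs, which is the kind of structure training is expected to produce.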

# 2 - Using the Embedding Layer

So much for the motivation. Let's get started and use embeddings!

Keras makes it easy to use word embeddings by means of its [`Embedding`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) layer.

The first argument `input_dim` specifies the size of the vocabulary. The second argument `output_dim` is the dimensionality of the embeddings, hence the length of the dense vectors. The `output_dim` is a parameter you can tune and experiment with in the same way you would experiment with the number of neurons in a dense layer.

**Task**: Initialize a layer for embedding words of a vocabulary of 1000 words into 5 dimensions.

In [None]:
import tensorflow as tf
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

In [None]:
### START YOUR CODE HERE ###  (≈1 LOC)
embedding_layer =
### END YOUR CODE HERE ###

When you create an `Embedding` layer, the weights for the embedding are randomly initialized just as for any other layer. During training, they are gradually adjusted via backpropagation. Once trained, the learned word embeddings will roughly encode similarities between words as learned for the specific problem your model was trained on.

The `Embedding` layer can be understood as a lookup table that maps from integer indices (denoting specific words) to dense vectors (their embeddings). Hence, when you pass a list of integers to an embedding layer, each integer is replaced by its corresponding vector from the embedding table.

**Task**: Pass a 1D numpy array of integers to the `embedding_layer` and print the result.

In [None]:
### START YOUR CODE HERE ###  (≈2 LOC)
input_word_indices =
result =
### END YOUR CODE HERE ###

print("input.shape:", input_word_indices.shape, "\n")
print("result:", result.numpy(), "\n")
print("result.shape:", result.shape, "\n")

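The returned tensor has one more axis than the input, and the embedding vectors are aligned along the new last axis. A 1D input of shape `(sequence_length,)` therefore yields a tensor of shape `(sequence_length, embedding_size)`, and a batched 2D input of shape `(samples, sequence_length)` yields `(samples, sequence_length, embedding_size)`.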
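
The shape rule can be checked without TensorFlow by indexing a plain NumPy matrix. This is a sketch under the assumption that the layer behaves like row indexing into its weight matrix; the random `table` is a stand-in for untrained weights.

```python
import numpy as np

vocab_size, embedding_size = 1000, 5
rng = np.random.default_rng(0)
# Stand-in for the layer's weight matrix (random, like an untrained layer).
table = rng.normal(size=(vocab_size, embedding_size))

single_sequence = np.array([3, 14, 15])        # shape (3,)
batch = np.array([[3, 14, 15], [9, 2, 6]])     # shape (2, 3)

print(table[single_sequence].shape)  # (3, 5)
print(table[batch].shape)            # (2, 3, 5)
```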

# 3 - Training Embeddings from Scratch

You can train word embeddings as part of a model that solves a specific task, such as next-word prediction or, as in this lab, text classification. You will use the same text data as in DL Lab 3.1, i.e., the **[AG_NEWS](http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html)** dataset, which contains news articles from 4 different categories.

Execute the cell below for downloading and preprocessing the text data.

In [None]:
import tensorflow_datasets as tfds

BATCHSIZE = 128

dataset = tfds.load('ag_news_subset')
train_ds = dataset['train']
val_ds = dataset['test']

classes = ['World', 'Sports', 'Business', 'Sci/Tech']
num_classes = len(classes)

def extract_text(x):
    return x['title'] + ' ' + x['description']

def tupelize(x):
    return (extract_text(x), x['label'])

AUTOTUNE = tf.data.AUTOTUNE

train_ds_opt = train_ds.map(tupelize).cache().shuffle(1000).batch(BATCHSIZE).prefetch(AUTOTUNE)
val_ds_opt = val_ds.map(tupelize).cache().batch(1000).prefetch(AUTOTUNE)

**Task**: Initialize a word-level [`TextVectorization` layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization). Use a maximum vocabulary size of 10'000 words and a max sequence length of 100. Then `adapt` the vectorizer on 10'000 samples of the `train_ds`.

In [None]:
from tensorflow.keras import layers

max_vocab_size = 10000
max_sequence_length = 100

### START YOUR CODE HERE ###  (≈2 LOC)
vectorizer =
vectorizer.adapt(train_ds.take( ).map(extract_text))
### END YOUR CODE HERE ###

The `TextVectorization.get_vocabulary` method returns the vocabulary, sorted from the most to the least frequent token:

In [None]:
# Get the unique words in the vocabulary
vocab = vectorizer.get_vocabulary()

# Length of the vocabulary
vocab_size = len(vocab)
print(f"Number of words in vocab: {vocab_size}")

# most common tokens (notice the [UNK] token for "unknown" words)
top_5_words = vocab[:5]
print(f"Top 5 most common words: {top_5_words}")

# least common tokens
bottom_5_words = vocab[-5:]
print(f"Bottom 5 least common words: {bottom_5_words}")
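
To build intuition for what word-level vectorization does, here is a simplified pure-Python sketch. It is not the Keras implementation: `build_vocab` and `vectorize` are hypothetical helpers, punctuation stripping is omitted, and only the conventions of index 0 for padding and index 1 for `[UNK]` are mirrored.

```python
from collections import Counter

def build_vocab(texts, max_tokens):
    """Most frequent lowercase words first; index 0 = padding, 1 = unknown."""
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(text, vocab, sequence_length):
    """Map words to indices, then truncate or zero-pad to a fixed length."""
    index = {word: i for i, word in enumerate(vocab)}
    ids = [index.get(word, 1) for word in text.lower().split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

demo_texts = ["the cat sat", "the dog sat", "the cat ran"]
demo_vocab = build_vocab(demo_texts, max_tokens=5)
print(demo_vocab)
print(vectorize("the bird sat down", demo_vocab, sequence_length=6))
```

Words that did not make it into the vocabulary ("bird", "down") are mapped to index 1, and the sequence is padded with zeros up to the requested length.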

The function `build_embedding_bag_model` returns a simple classification model based on the average of the embedding vectors:

In [None]:
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import SparseCategoricalAccuracy

def build_embedding_bag_model(vectorizer, vocab_size, embedding_size, num_classes):

  input = layers.Input(shape=(1,), dtype=tf.string)
  x = vectorizer(input)
  x = layers.Embedding(vocab_size, embedding_size)(x)
  x = layers.GlobalAveragePooling1D()(x)
  x = layers.Dense(16, activation='relu')(x)
  output = layers.Dense(num_classes, activation='softmax')(x)

  model = tf.keras.models.Model(input, output)
  model.compile(
      loss=SparseCategoricalCrossentropy(),
      optimizer=Adam(),
      metrics=[SparseCategoricalAccuracy()]
  )

  model.summary()

  return model
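
The "average of the embedding vectors" step can be illustrated with NumPy alone. This is a minimal sketch with a hand-picked matrix (`demo_embeddings` and `demo_indices` are hypothetical names): each word index is replaced by its vector, and the vectors of one sequence are averaged along the sequence axis.

```python
import numpy as np

# Hand-picked stand-in for a learned embedding matrix: 4 words, 2 dimensions.
demo_embeddings = np.array([
    [0.0, 0.0],
    [1.0, 3.0],
    [3.0, 1.0],
    [2.0, 2.0],
])

# A batch of two sequences of word indices, each of length 3.
demo_indices = np.array([[1, 2, 3], [1, 1, 2]])

embedded = demo_embeddings[demo_indices]   # shape (2, 3, 2)
pooled = embedded.mean(axis=1)             # shape (2, 2): one vector per sequence
print(pooled)
```

Because the average discards word order, the model treats each text as a "bag" of words, hence the name of the function.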


Let's create one model and train it on the AG news dataset:

In [None]:
# @title define `plot_history()`
from matplotlib import pyplot as plt

def plot_history(history):
  fig, (ax1, ax2) = plt.subplots(2,1, sharex=True, dpi=150)
  ax1.plot(history.history['loss'], label='training')
  ax1.plot(history.history['val_loss'], label='validation')
  ax1.set_ylabel('Loss')
  ax1.set_yscale('log')
  if 'lr' in history.history:
    ax1b = ax1.twinx()
    ax1b.plot(history.history['lr'], 'g-', linewidth=1)
    ax1b.set_yscale('log')
    ax1b.set_ylabel('Learning Rate', color='g')
  ax1.legend()

  key = None
  for k in sorted(history.history.keys()):
    if 'acc' in k and 'val_' not in k:
      key = k
      break
  if key:
    ax2.plot(history.history[key], label='training')
    ax2.plot(history.history['val_'+key], label='validation')
    ax2.set_ylabel('Accuracy')
    ax2.set_xlabel('Epochs')
  plt.show()

In [None]:
embedding_size = 16

bag_model = build_embedding_bag_model(vectorizer, max_vocab_size, embedding_size, num_classes)

bag_history = bag_model.fit(
    train_ds_opt,
    validation_data=val_ds_opt,
    epochs=10
)

plot_history(bag_history)

# 4 - Operations on Embeddings

Let's retrieve the learned word embedding.
This will be a matrix of shape `(vocab_size, embedding_size)`.

**Task**: Use the `get_weights()` method on the correct layer of the `bag_model` to obtain the embedding matrix.

In [None]:
### START YOUR CODE HERE ###
embedding_layer_index =
### END YOUR CODE HERE ###

emb_matrix = bag_model.layers[embedding_layer_index].get_weights()[0]
print(emb_matrix.shape)

Using the word indices along with the vocabulary defined by the vectorizer, we can now retrieve word embedding vectors from the embedding matrix.

**Task**: Complete the function `get_embedding_vector` for returning the embedding vector of a given word `word_in`.

In [None]:
def get_embedding_vector( word_in, vocabulary, embedding_matrix ):

  if word_in not in vocabulary:
    print(f'WARNING: "{word_in}" not in vocabulary. Falling back to "[UNK]" token.')
    word_in = "[UNK]"

  # Get the word index in the vocabulary
  word_idx = vocabulary.index(word_in)

  ### START YOUR CODE HERE ###  (≈1 LOC)
  # Lookup the embedding vector
  emb_vec =
  ### END YOUR CODE HERE ###

  return emb_vec

In [None]:
vocab = vectorizer.get_vocabulary()

get_embedding_vector('news', vocab, emb_matrix)

## 4.1 - Word Similarity

To measure how similar two words are, we need a way to measure the degree of similarity between their two embedding vectors. Given two word vectors $u$ and $v$, cosine similarity is defined as the cosine of the angle $\theta$ between the two vectors:

$$\text{CosineSimilarity}(u, v) = \cos(\theta) = \frac{u \cdot v}{\Vert u \Vert \, \Vert v \Vert}$$

where $u \cdot v$ is the dot product of two vectors, $\Vert u \Vert$ is the norm (or length) of the vector $u$, and $\theta$ is the angle between $u$ and $v$.

If $u$ and $v$ are very similar, their cosine similarity will be close to 1. If they are dissimilar, the cosine similarity will take a smaller value down to -1.

**Note**: The norm of $u$ is defined as $ \Vert u \Vert = \sqrt{\sum_{i=1}^{n} u_i^2}$.
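
As a quick sanity check of the formula, with vectors chosen purely for illustration, take $u = (1, 0)$ and $v = (1, 1)$:

$$\cos(\theta) = \frac{1 \cdot 1 + 0 \cdot 1}{\sqrt{1^2 + 0^2} \, \sqrt{1^2 + 1^2}} = \frac{1}{\sqrt{2}} \approx 0.707$$

This corresponds to an angle of $45^\circ$ between the two vectors. Scaling either vector by a positive factor leaves the result unchanged, since cosine similarity depends only on direction.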

**Task**: Complete the function `cosine_similarity`.

In [None]:
def cosine_similarity(u, v):

  ### START YOUR CODE HERE ###  (≈3 LOC)

  ### END YOUR CODE HERE ###

  return cos_similarity

Let's test some combinations:

In [None]:
def print_pair_similarity(word_a, word_b):
  print(cosine_similarity(
      get_embedding_vector( word_a, vocab, emb_matrix ),
      get_embedding_vector( word_b, vocab, emb_matrix )
  ))

print_pair_similarity("queen", "woman")
print_pair_similarity("queen", "man")
print_pair_similarity("queen", "king")

## 4.2 - Similar Word Retrieval

We can also use the word embedding for retrieval of $k$ semantically close words:

In [None]:
def closest_word(word_in, vocabulary, emb_matrix, top_k=5):

  # Get embedding vector of input word
  word_in_emb = get_embedding_vector(word_in, vocabulary, emb_matrix)

  # Compute similarities
  similarity = [ cosine_similarity(w_emb, word_in_emb) for w_emb in emb_matrix ]

  # Top-k words having largest similarity
  idxs = np.argsort( similarity )[::-1][1:top_k+1]

  return [ [vocabulary[i], similarity[i]] for i in idxs ]
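
The Python loop above computes one similarity at a time. For larger vocabularies, a vectorized variant may be considerably faster. The following is a sketch, not a drop-in replacement: `closest_rows` is a hypothetical helper that works on row indices instead of words and assumes that no row of the matrix is the zero vector.

```python
import numpy as np

def closest_rows(query_idx, matrix, top_k=5):
    """Indices and similarities of the top_k rows most similar to row query_idx."""
    # Normalize every row to unit length; the dot product of two unit vectors
    # equals their cosine similarity.
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = unit @ unit[query_idx]
    sims[query_idx] = -np.inf            # exclude the query itself
    idxs = np.argsort(sims)[::-1][:top_k]
    return idxs, sims[idxs]

demo = np.array([[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [-1.0, 0.0]])
idxs, sims = closest_rows(0, demo, top_k=2)
print(idxs, sims)
```

Here the query is excluded explicitly, whereas `closest_word` relies on the query being the first entry of the sorted list.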

Let's test some words:

In [None]:
for word in ['president', 'phone', 'soccer', 'science']:
  print(word, closest_word(word, vocab, emb_matrix))
  print()

## 4.3 - Word Analogy

Another interesting task is the "Word analogy task", where we complete the sentence "`a` is to `b` as `c` is to __". In detail, we are trying to find a word `d`, such that the associated word vectors $e_a$, $e_b$, $e_c$, $e_d$, are related as follows:
$$ e_b - e_a \approx e_d - e_c. $$

The similarity between $ e_b - e_a $ and $ e_d - e_c $ is measured using cosine similarity.
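
To see the idea in isolation, consider a toy example with hand-made 2-D vectors in which one axis loosely stands for "royalty" and the other for "gender". All names (`toy_vectors`, `toy_cosine`, `toy_analogy`) and values are illustrative assumptions, not learned embeddings.

```python
import numpy as np

def toy_cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hand-made 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors = {
    "man":   np.array([0.0, 1.0]),
    "king":  np.array([1.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "queen": np.array([1.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def toy_analogy(a, b, c):
    """Return the word d that maximizes cos(e_b - e_a, e_d - e_c)."""
    target = toy_vectors[b] - toy_vectors[a]
    candidates = [w for w in toy_vectors if w not in (a, b, c)]
    return max(candidates,
               key=lambda w: toy_cosine(target, toy_vectors[w] - toy_vectors[c]))

print(toy_analogy("man", "king", "woman"))  # queen
```

The difference `king - man` points along the "royalty" axis, and `queen - woman` points in exactly the same direction, so "queen" wins over "apple".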

In [None]:
def complete_analogy(word_a, word_b, word_c):

  # Convert words to lower case
  word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()

  # Get the embedding vectors
  e_a = get_embedding_vector(word_a, vocab, emb_matrix)
  e_b = get_embedding_vector(word_b, vocab, emb_matrix)
  e_c = get_embedding_vector(word_c, vocab, emb_matrix)

  max_cosine_sim = -100
  best_word_idx = None

  # Loop over the whole word vector set
  for w_idx, w in enumerate(emb_matrix):

    # To avoid returning one of the input words, skip them.
    if (w == [e_a, e_b, e_c]).all(1).any():
      continue

    # Compute cosine similarity between the vector (e_b - e_a) and the vector (w - e_c)
    ### START YOUR CODE HERE ###  (≈1 LOC)
    cosine_sim = cosine_similarity( , )
    ### END YOUR CODE HERE ###

    if cosine_sim > max_cosine_sim:
      # Set new max similarity
      max_cosine_sim = cosine_sim
      # Select new best_word
      best_word_idx = w_idx

      print(cosine_sim, vocab[best_word_idx])

  return vocab[best_word_idx]

In [None]:
print(complete_analogy('man', 'king', 'woman'))

In [None]:
print(complete_analogy('germany', 'german', 'china'))

You may try different word combinations. However, our word embedding is not very powerful as we used a rather simple model for training, and more importantly, a very small text corpus.

## 4.4 - GloVe Embedding

Execute the cell below to download a more powerful word embedding, i.e., a 50-dimensional GloVe word embedding trained on Wikipedia (2014) and the Gigaword 5 corpus. The [GloVe embedding](https://nlp.stanford.edu/projects/glove/) is provided by the NLP research group at Stanford University.

In [None]:
#@title Download GloVe Embedding

import requests, os, zipfile
import numpy as np

data_path = '/tmp/glove'
glove_file = os.path.join(data_path, 'glove.6B.50d.txt')
!rm -rf $data_path
os.makedirs(data_path)

# download glove file
!wget -nv -t 0 --show-progress -O $glove_file 'https://cloud.tu-ilmenau.de/s/m558re2RpoW8X2s/download/glove.6B.50d.txt'
!sleep 1

def read_glove_vecs(glove_file):
  with open(glove_file, 'r', encoding='utf-8') as f:
    words = set()
    word_to_vec_map = {}

    for line in f:
      line = line.strip().split()
      curr_word = line[0]
      words.add(curr_word)
      word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)

  return words, word_to_vec_map

In [None]:
words, word_to_vec_map = read_glove_vecs(glove_file)

You can get the embedding vector of a string by a lookup in the `word_to_vec_map`. Let's compute some word similarities using the GloVe embedding:

In [None]:
def print_pair_similarity_glove(word_a, word_b):
  print(cosine_similarity(
      word_to_vec_map[word_a],
      word_to_vec_map[word_b]
  ))

print_pair_similarity_glove("queen", "woman")
print_pair_similarity_glove("queen", "man")
print_pair_similarity_glove("queen", "king")

Let's see if the GloVe Embedding allows for better analogies:

In [None]:
def complete_analogy_glove(word_a, word_b, word_c):

  # Convert words to lower case
  word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()

  # Get the embedding vectors
  e_a = word_to_vec_map[word_a]
  e_b = word_to_vec_map[word_b]
  e_c = word_to_vec_map[word_c]

  words = word_to_vec_map.keys()
  max_cosine_sim = -100
  best_word = None

  # Loop over the whole word vector set
  for w in words:

    # To avoid best_word being one of the input words, skip them.
    if w in [word_a, word_b, word_c] :
      continue

    # Compute cosine similarity between the vector (e_b - e_a) and the vector ((w's vector representation) - e_c)
    cosine_sim = cosine_similarity(e_b - e_a, word_to_vec_map[w] - e_c)

    if cosine_sim > max_cosine_sim:
      # Set new max similarity
      max_cosine_sim = cosine_sim
      # Select new best_word
      best_word = w

  return best_word

Try out a few! Nice triplets are
- `('king', 'man', 'queen')`
- `('germany', 'german', 'china')`
- `('india', 'delhi', 'japan')`
- `('man', 'woman', 'boy')`

In [None]:
#complete_analogy_glove('king', 'man', 'queen')
complete_analogy_glove('germany', 'german', 'china')
#complete_analogy_glove('india', 'delhi', 'japan')
#complete_analogy_glove('man', 'woman', 'boy')

And for the most similar words:

In [None]:
def closest_word_glove(embedding_vector, remove_words=(), top_k=5):

  # Get vocabulary
  vocabulary = list(word_to_vec_map.keys())
  # To avoid top words being one of the input words, remove them from list
  for w in remove_words:
    vocabulary.remove(w)

  # Compute embeddings of all words
  w_embeddings = np.array([word_to_vec_map[w] for w in vocabulary])

  # Compute similarities
  similarity = [ cosine_similarity(w_emb, embedding_vector) for w_emb in w_embeddings ]

  # Indices of the top-k most similar words
  idxs = np.argsort( similarity )[::-1][:top_k]

  return [ [vocabulary[i], similarity[i]] for i in idxs ]


for word in ['president', 'phone', 'soccer', 'science']:
  print(closest_word_glove(word_to_vec_map[word], [word]))
  print()