## Skip-gram from scratch

This notebook follows Mikolov et al. (2013a), "Efficient Estimation of Word Representations in Vector Space", and Mikolov et al. (2013b), "Distributed Representations of Words and Phrases and their Compositionality".

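Here I train the skip-gram model, which predicts the context words surrounding a given center word. The other architecture introduced by Mikolov et al. is the continuous bag-of-words model (CBOW), which predicts the center word from its context.

We want to maximize the average log probability (equivalently, minimize its negative):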

<div style="text-align: center;">
    <img src="https://tensorflow.org/text/tutorials/images/word2vec_skipgram_objective.png" width="400">
</div>

where 'w_t' is the center word and 'c' is the window size, i.e. the number of context words considered on each side. The probability inside the objective is defined by the full softmax:

<div style="text-align: center;">
    <img src="https://tensorflow.org/text/tutorials/images/word2vec_full_softmax.png" width="400">
</div>

where 'v' and 'v′' are the target and context vector representations of a word and 'W' is the vocabulary size.
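
As a rough illustration of the full-softmax formula, the NumPy sketch below computes the probability of every word being a context word of one center word in a toy vocabulary. The sizes, the random vectors, and the helper name 'softmax_prob' are assumptions made for this example only; they are not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
W, dim = 5, 3                        # toy vocabulary size and embedding size (assumed)
v_in = rng.normal(size=(W, dim))     # v:  vectors used when a word is the center word
v_out = rng.normal(size=(W, dim))    # v': vectors used when a word is a context word

def softmax_prob(center):
    """Return p(w | center) for every word w in the toy vocabulary."""
    scores = v_out @ v_in[center]    # one dot product per candidate context word
    scores = scores - scores.max()   # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

probs = softmax_prob(center=2)
print(probs.shape, probs.sum())
```

The denominator sums over all 'W' words, which is why the full softmax becomes expensive for large vocabularies and why this notebook trains with negative sampling instead.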

#### Setup

In [1]:
import io
import re
import string
import tqdm

import numpy as np

import tensorflow as tf
from tensorflow.keras import layers

In [2]:
%load_ext tensorboard

In [3]:
SEED = 42
AUTOTUNE = tf.data.AUTOTUNE

#### Skip-gram model

Compile the essential steps in word2vec_test into a function that can be called on a list of vectorized sentences obtained from any text dataset.

In [5]:
def generate_training_data(sequences, window_size, num_ns, vocab_size, seed):
  targets, contexts, labels = [], [], []

  # Subsample frequent words to improve embedding quality.
  # sampling_table[i] is the probability of keeping the i-th most common token:
  # index 0 gets roughly a 0.3% chance, so it is almost always dropped, while a
  # rarer token at a high index such as 4000 is kept far more often.
  # This mostly removes words like 'the' that carry little semantic information.
  sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(vocab_size)

  # Many sequences (sentences) in the dataset.
  for sequence in tqdm.tqdm(sequences):

    # positive pairs
    positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
          sequence,
          vocabulary_size=vocab_size,
          sampling_table=sampling_table,
          window_size=window_size,
          negative_samples=0)

    # negative samples
    for target_word, context_word in positive_skip_grams:
      context_class = tf.expand_dims(tf.constant([context_word], dtype="int64"), 1)
      negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
          true_classes=context_class,
          num_true=1,
          num_sampled=num_ns,
          unique=True,
          range_max=vocab_size,
          seed=seed,
          name="negative_sampling")

      context = tf.concat([tf.squeeze(context_class,1), negative_sampling_candidates], 0)
      label = tf.constant([1] + [0]*num_ns, dtype="int64")

      targets.append(target_word)
      contexts.append(context)
      labels.append(label)

  return targets, contexts, labels
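
To make the positive-pair step concrete, here is a plain-Python sketch of how (target, context) pairs can be collected inside a window. It is not the Keras implementation: the helper name 'skipgram_pairs' is hypothetical, and the sketch ignores the subsampling done through 'sampling_table'. It assumes, as in this notebook, that ID 0 is padding and should never form a pair.

```python
def skipgram_pairs(sequence, window_size):
    """Collect (target, context) pairs within window_size positions of each token."""
    pairs = []
    for i, target in enumerate(sequence):
        if target == 0:                      # skip padding as a target
            continue
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            if j != i and sequence[j] != 0:  # skip the target itself and padding
                pairs.append((target, sequence[j]))
    return pairs

print(skipgram_pairs([5, 8, 3, 0], window_size=1))
# [(5, 8), (8, 5), (8, 3), (3, 8)]
```

Each positive pair is then extended with 'num_ns' negative candidates, giving one row of 'contexts' with the positive word first and a matching label row of the form '[1, 0, ..., 0]'.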

#### Training data

In [6]:
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

In [7]:
with open(path_to_file) as f:
    lines = f.read().splitlines()
    print(len(lines))
print('\n')
for line in lines[:10]:
    print(line)

40000


First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:


Use the non-empty lines to construct a 'tf.data.TextLineDataset' object for the next steps:

In [8]:
text_ds = tf.data.TextLineDataset(path_to_file).filter(lambda x: tf.cast(tf.strings.length(x), bool))

#### Vectorize sentences from the corpus

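The following cells convert the text into integer token IDs, because models such as Word2Vec, LSTMs, and Transformers cannot work with raw strings. 'custom_standardization' lowercases the text and strips punctuation. 'vocab_size = 4096' caps the vocabulary at 4096 entries, including the reserved padding and '[UNK]' tokens, so only the most frequent words get their own ID and all others map to '[UNK]'. 'sequence_length = 10' pads or truncates every line to exactly 10 integers. 'batch(1024)' is the batch size used while 'adapt' scans the corpus to build the vocabulary; it is not the training batch size.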

In [9]:
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  return tf.strings.regex_replace(lowercase, '[%s]' % re.escape(string.punctuation), '')

vocab_size = 4096
sequence_length = 10

vectorize_layer = layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=sequence_length)

In [10]:
vectorize_layer.adapt(text_ds.batch(1024))

In [11]:
vectorize_layer(["The cute ginger cat sits comfortably on the mat."])

<tf.Tensor: shape=(1, 10), dtype=int64, numpy=array([[   2,    1,    1, 3783, 1211,    1,   47,    2,    1,    0]])>

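Here '1' is the '[UNK]' token, meaning the word is not in the vocabulary, and '2' is 'the', the most frequent word. Larger numbers are the vocabulary indices of known words. A trailing '0' is padding, which indicates that the sentence has fewer than 10 tokens.

The 'get_vocabulary()' method returns a list of all vocabulary tokens, sorted in descending order of frequency after the two reserved entries.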
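
The mapping from words to IDs can be sketched in plain Python. The example below only mimics the behaviour described here (lowercasing, punctuation stripping, lookup, padding); 'toy_vocab' and 'toy_vectorize' are hypothetical names, and the five-entry vocabulary is made up for illustration.

```python
import re
import string

toy_vocab = ["", "[UNK]", "the", "and", "cat"]   # index 0 = padding, 1 = unknown
token_to_id = {tok: i for i, tok in enumerate(toy_vocab)}

def toy_vectorize(text, sequence_length=6):
    """Lowercase, strip punctuation, map tokens to IDs, then pad or truncate."""
    text = text.lower()
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    ids = [token_to_id.get(tok, 1) for tok in text.split()]  # unknown words -> 1
    ids = ids[:sequence_length]                              # truncate long inputs
    return ids + [0] * (sequence_length - len(ids))          # pad short inputs with 0

print(toy_vectorize("The cute cat."))
# [2, 1, 4, 0, 0, 0]
```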

In [12]:
# Save the created vocabulary for reference.
inverse_vocab = vectorize_layer.get_vocabulary()
print(inverse_vocab[:20])

['', '[UNK]', np.str_('the'), np.str_('and'), np.str_('to'), np.str_('i'), np.str_('of'), np.str_('you'), np.str_('my'), np.str_('a'), np.str_('that'), np.str_('in'), np.str_('is'), np.str_('not'), np.str_('for'), np.str_('with'), np.str_('me'), np.str_('it'), np.str_('be'), np.str_('your')]


In [13]:
# Vectorize the data in text_ds.
text_vector_ds = text_ds.batch(1024).prefetch(AUTOTUNE).map(vectorize_layer).unbatch()

In [14]:
# flatten the dataset into a list of sentence vector sequences
sequences = list(text_vector_ds.as_numpy_iterator())
sequences[:20]

[array([ 89, 270,   0,   0,   0,   0,   0,   0,   0,   0]),
 array([138,  36, 982, 144, 673, 125,  16, 106,   0,   0]),
 array([34,  0,  0,  0,  0,  0,  0,  0,  0,  0]),
 array([106, 106,   0,   0,   0,   0,   0,   0,   0,   0]),
 array([ 89, 270,   0,   0,   0,   0,   0,   0,   0,   0]),
 array([   7,   41,   34, 1286,  344,    4,  200,   64,    4, 3690]),
 array([34,  0,  0,  0,  0,  0,  0,  0,  0,  0]),
 array([1286, 1286,    0,    0,    0,    0,    0,    0,    0,    0]),
 array([ 89, 270,   0,   0,   0,   0,   0,   0,   0,   0]),
 array([  89,    7,   93, 1187,  225,   12, 2442,  592,    4,    2]),
 array([34,  0,  0,  0,  0,  0,  0,  0,  0,  0]),
 array([  36, 2655,   36, 2655,    0,    0,    0,    0,    0,    0]),
 array([ 89, 270,   0,   0,   0,   0,   0,   0,   0,   0]),
 array([  72,   79,  506,   27,    3,   56,   24, 1390,   57,   40]),
 array([644,   9,   1,   0,   0,   0,   0,   0,   0,   0]),
 array([34,  0,  0,  0,  0,  0,  0,  0,  0,  0]),
 array([  32,   54, 2863,  885

#### Generate training examples from sequences

Call the 'generate_training_data' function to iterate over each word in each sequence and collect positive and negative context words. 'targets', 'contexts' and 'labels' should all have the same length, which is the total number of training examples.

In [21]:
targets, contexts, labels = generate_training_data(
    sequences=sequences,
    window_size=2,
    num_ns=5,
    vocab_size=vocab_size,
    seed=SEED)

targets = np.array(targets)
contexts = np.array(contexts)
labels = np.array(labels)

print('\n')
print(f"targets.shape: {targets.shape}")
print(f"contexts.shape: {contexts.shape}")
print(f"labels.shape: {labels.shape}")

100%|██████████████████████████████████████████████████████████████████████████| 32777/32777 [00:11<00:00, 2838.91it/s]




targets.shape: (65770,)
contexts.shape: (65770, 6)
labels.shape: (65770, 6)
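
The negative candidates above come from 'tf.random.log_uniform_candidate_sampler'. According to the TensorFlow documentation, it draws class 'k' with probability (log(k + 2) - log(k + 1)) / log(range_max + 1), which favours small IDs, i.e. frequent words in a frequency-sorted vocabulary. The NumPy sketch below only evaluates that formula; it does not call TensorFlow.

```python
import numpy as np

range_max = 4096                      # same value as vocab_size in this notebook
ids = np.arange(range_max)

# P(id) = (log(id + 2) - log(id + 1)) / log(range_max + 1)
probs = (np.log(ids + 2) - np.log(ids + 1)) / np.log(range_max + 1)

print(probs[:3])     # small IDs are drawn most often
print(probs.sum())   # the log terms telescope, so the probabilities sum to 1
```

One consequence worth keeping in mind: IDs 0 and 1 (padding and '[UNK]') have the highest sampling probability under this distribution, so they can appear among the negatives.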


Next, build a 'tf.data.Dataset' whose elements have the form '((target_word, context_words), labels)' for training the word2vec model.

In [22]:
BATCH_SIZE = 1024
BUFFER_SIZE = 10000
dataset = tf.data.Dataset.from_tensor_slices(((targets, contexts), labels))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
print(dataset)

<_BatchDataset element_spec=((TensorSpec(shape=(1024,), dtype=tf.int64, name=None), TensorSpec(shape=(1024, 6), dtype=tf.int64, name=None)), TensorSpec(shape=(1024, 6), dtype=tf.int64, name=None))>
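
Because 'drop_remainder=True' discards the last incomplete batch, the number of batches per epoch follows from simple integer division. A quick check with the example count printed earlier:

```python
num_examples = 65770     # targets.shape[0] from the output above
batch_size = 1024        # BATCH_SIZE

full_batches, leftover = divmod(num_examples, batch_size)
print(full_batches, leftover)
# 64 234
```

So each epoch runs 64 steps, which matches the '64/64' shown in the training log, and 234 examples are left out of each pass.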


In [23]:
dataset = dataset.cache().prefetch(buffer_size=AUTOTUNE)
print(dataset)

<_PrefetchDataset element_spec=((TensorSpec(shape=(1024,), dtype=tf.int64, name=None), TensorSpec(shape=(1024, 6), dtype=tf.int64, name=None)), TensorSpec(shape=(1024, 6), dtype=tf.int64, name=None))>


### Training

The word2vec model can be implemented as a classifier that distinguishes true context words from skip-grams from false context words obtained through negative sampling. Taking the dot product between the embeddings of the target and context words yields the predicted logits, and the loss is computed against the true labels in the dataset.

#### Subclassed word2vec model

The following class contains:

- 'target_embedding': looks up the embedding of a word when it appears as a target word. The number of parameters in this layer is (vocab_size * embedding_dim).
- 'context_embedding': looks up the embedding of a word when it appears as a context word. The number of parameters in this layer is the same as in 'target_embedding'.
- 'dots': the dot product of the target embedding with each context embedding of a training example, computed with 'tf.einsum'. The result already has shape (batch, context), so it serves directly as the logits and no separate flatten layer is needed.

Conceptually there is one hidden layer: 'target_embedding' maps the one-hot encoding of a word to its embedding, and 'context_embedding' turns that embedding into scores over the candidate context words, which are then compared with the one-hot labels.

'call()' accepts (target, context) pairs which can then be passed into their corresponding embedding layer.

In [24]:
class Word2Vec(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim):
    super(Word2Vec, self).__init__()
    self.target_embedding = layers.Embedding(vocab_size, embedding_dim, name="w2v_embedding")
    self.context_embedding = layers.Embedding(vocab_size, embedding_dim)

  def call(self, pair):
    target, context = pair
    if len(target.shape) == 2:
      target = tf.squeeze(target, axis=1)
    # target: (batch,)
    word_emb = self.target_embedding(target)
    # word_emb: (batch, embed)
    context_emb = self.context_embedding(context)
    # context_emb: (batch, context, embed)
    dots = tf.einsum('be,bce->bc', word_emb, context_emb)
    # dots: (batch, context)
    return dots
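
The einsum pattern 'be,bce->bc' can be checked outside the model. This NumPy sketch uses random toy tensors (the sizes are assumptions) and compares the einsum result with explicit per-example dot products.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, context, embed = 4, 6, 8                      # toy sizes (assumed)
word_emb = rng.normal(size=(batch, embed))           # one target vector per example
context_emb = rng.normal(size=(batch, context, embed))

# For each example b and candidate c, sum over the embedding axis e.
dots = np.einsum('be,bce->bc', word_emb, context_emb)

# Same computation written with explicit loops.
expected = np.array([[word_emb[b] @ context_emb[b, c] for c in range(context)]
                     for b in range(batch)])

print(dots.shape, np.allclose(dots, expected))
```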

#### Define loss function and compile model


In [25]:
embedding_dim = 300
word2vec = Word2Vec(vocab_size, embedding_dim)
word2vec.compile(optimizer='adam',
                 loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
                 metrics=['accuracy'])
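
To get a feel for the loss values, the NumPy sketch below evaluates softmax cross-entropy on a single example with one positive and five negative candidates. The helper 'softmax_cross_entropy' is written for this illustration and is not the Keras implementation; the all-zero logits stand in for a model that scores every candidate equally.

```python
import numpy as np

def softmax_cross_entropy(logits, one_hot):
    """Cross-entropy between one-hot labels and softmax(logits), per example."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -(one_hot * log_probs).sum(axis=-1)

num_ns = 5
label = np.array([[1, 0, 0, 0, 0, 0]], dtype=float)   # positive first, then negatives
uniform_logits = np.zeros((1, num_ns + 1))            # every candidate scored equally

loss = softmax_cross_entropy(uniform_logits, label)
print(loss, np.log(num_ns + 1))
```

With six equally scored candidates the loss is ln(6), about 1.79, which is close to the first-epoch loss reported in the training log.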

In [26]:
# Log training statistics for TensorBoard
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

In [27]:
word2vec.fit(dataset, epochs=50, callbacks=[tensorboard_callback])

Epoch 1/50
[1m64/64[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 32ms/step - accuracy: 0.1932 - loss: 1.7901  
Epoch 2/50
[1m64/64[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 30ms/step - accuracy: 0.6537 - loss: 1.7430
Epoch 3/50
[1m64/64[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 28ms/step - accuracy: 0.6014 - loss: 1.6393
Epoch 4/50
[1m64/64[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 33ms/step - accuracy: 0.5799 - loss: 1.5019
Epoch 5/50
[1m64/64[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 34ms/step - accuracy: 0.6163 - loss: 1.3553
Epoch 6/50
[1m64/64[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 35ms/step - accuracy: 0.6668 - loss: 1.2155
Epoch 7/50
[1m64/64[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 33ms/step - accuracy: 0.7143 - loss: 1.0865
Epoch 8/50
[1m64/64[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 24ms/step - accuracy: 0.7503 - loss: 0.9692
Epoch 9/50
[1m64/64[0m [32m━━━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x187eda69bd0>

In [28]:
%tensorboard --logdir logs

#### Embedding lookup and analysis

Obtain the weights from the model, and the vocabulary to build a metadata file with one token per line.

In [29]:
weights = word2vec.get_layer('w2v_embedding').get_weights()[0]
vocab = vectorize_layer.get_vocabulary()

In [30]:
# Write one embedding vector per line and the matching token per line.
with io.open('vectors.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('metadata.tsv', 'w', encoding='utf-8') as out_m:
  for index, word in enumerate(vocab):
    if index == 0:
      continue  # skip 0, it's padding.
    vec = weights[index]
    out_v.write('\t'.join([str(x) for x in vec]) + "\n")
    out_m.write(word + "\n")

Analyze the obtained embeddings in the [Embedding Projector](https://projector.tensorflow.org/) by loading 'vectors.tsv' and 'metadata.tsv'.

#### Use the model for similarity comparison

In [31]:
word2vec.target_embedding.weights

[<Variable path=word2_vec_1/w2v_embedding/embeddings, shape=(4096, 300), dtype=float32, value=[[-4.45919409e-02  2.82944553e-02  1.90933794e-03 ...  2.82108448e-02
    1.62904151e-02  4.34674062e-02]
  [ 3.80206168e-01 -3.36683877e-02 -3.18730265e-01 ... -1.84193611e-01
   -1.13965794e-01  2.87139446e-01]
  [ 1.33108020e-01 -4.98787723e-02  1.80316687e-01 ... -2.15081140e-01
   -1.44193619e-01 -2.21308470e-02]
  ...
  [ 7.14742020e-02 -2.31458619e-01  6.52562454e-02 ... -3.47672075e-01
   -1.76008776e-01 -1.34053543e-01]
  [ 1.09059431e-01  2.44962633e-01 -3.30420360e-02 ...  7.81400427e-02
   -3.16389924e-04 -2.72853106e-01]
  [ 1.12155490e-01  1.34904440e-02 -9.99084488e-02 ... -1.69383109e-01
    5.24358684e-03 -1.75879076e-01]]>]

In [32]:
embedding_matrix = word2vec.get_layer("w2v_embedding").get_weights()[0]

In [33]:
def embed(word):
    index = inverse_vocab.index(word)  # find word index
    return embedding_matrix[index]     # return its embedding vector

In [34]:
v1 = embed("king")
v2 = embed("queen")
v3 = embed("woman")
v4 = embed("man")

In [35]:
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

In [36]:
cosine_similarity((v1+v3-v4), v2)

np.float32(-0.016529579)
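
A score near zero suggests that the king/queen analogy is not captured by the embeddings trained here. For comparison, the sketch below shows what the same test looks like when the analogy holds exactly. The 2-D vectors are hand-made for illustration and are not taken from the trained model, and the 'analogy' helper is hypothetical.

```python
import numpy as np

# Hand-made vectors: axis 0 ~ "royalty", axis 1 ~ "gender" (purely illustrative).
toy_emb = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([0.5,  0.5]),
}

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c, emb):
    """Return the word closest to emb[a] - emb[b] + emb[c], excluding the inputs."""
    query = emb[a] - emb[b] + emb[c]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(query, emb[w]))

print(analogy("king", "man", "woman", toy_emb))
# queen
```

Excluding the three input words from the candidates is a common convention in analogy tests, since the query vector often stays closest to one of its own inputs.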

In [37]:
def most_similar(query_word, top_k=10):
    query_vec = embed(query_word)
    sims = np.dot(embedding_matrix, query_vec) / (
        np.linalg.norm(embedding_matrix, axis=1) * np.linalg.norm(query_vec)
    )
    top_indices = sims.argsort()[-top_k:][::-1]
    return [(inverse_vocab[i], sims[i]) for i in top_indices]

most_similar("king")

[(np.str_('king'), np.float32(1.0)),
 (np.str_('richard'), np.float32(0.4736628)),
 (np.str_('ii'), np.float32(0.44646168)),
 (np.str_('iii'), np.float32(0.42145312)),
 (np.str_('3'), np.float32(0.4196664)),
 (np.str_('vi'), np.float32(0.3986055)),
 (np.str_('iv'), np.float32(0.39721936)),
 (np.str_('xi'), np.float32(0.35582554)),
 (np.str_('conqueror'), np.float32(0.32454354)),
 (np.str_('birds'), np.float32(0.32224503))]