# Word2Vec

Word2Vec is a family of model architectures and optimizations that can be used to learn word embeddings from large datasets.


Two common methods for learning word representations are:

1. **Continuous Bag-of-Words Model**: Predicts the middle word from the surrounding context words. The context consists of a few words before and after the current (middle) word. Note: the order of the context words is not important, only their presence in the context.
2. **Continuous Skip-gram Model**: Predicts the words within a certain range before and after the current word in the same sentence.
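
To make the difference between the two models concrete, the sketch below builds training examples for both from a short tokenized sentence. The helper names (`context_window`, `cbow_examples`, `skipgram_examples`), the sentence, and the window size of 2 are illustrative assumptions, not part of any library.

```python
def context_window(tokens, i, window_size):
    """Return the tokens within `window_size` positions of index i, excluding i."""
    start = max(0, i - window_size)
    end = min(len(tokens), i + window_size + 1)
    return [tokens[j] for j in range(start, end) if j != i]


def cbow_examples(tokens, window_size=2):
    """CBOW: the whole context is the input, the middle word is the output."""
    return [(context_window(tokens, i, window_size), tokens[i])
            for i in range(len(tokens))]


def skipgram_examples(tokens, window_size=2):
    """Skip-gram: the word is the input, each context word is a separate output."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window_size)]


tokens = "the quick brown fox jumps".split()
cbow_for_brown = cbow_examples(tokens)[2]
skipgram_for_brown = [p for p in skipgram_examples(tokens) if p[0] == "brown"]
print(cbow_for_brown)
print(skipgram_for_brown)
```

CBOW yields one example per position (many inputs, one output), whereas skip-gram yields one example per (word, context word) pair.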

## Skip-Gram Example

### Overview of the Skip-gram model

In short, a skip-gram model predicts the context of a word, given the word. The model is trained on skip-grams, which are n-grams that allow tokens to be skipped.

The context of a word can be represented through a set of skip-gram pairs of <code>(target_word, context_word)</code>, where the context word appears in the neighboring context of the target word. The size of this neighborhood is given by a window size.

As an example, the sentence *The wide road shimmered in the hot sun* produces the following skip-gram pairs for the word *shimmered* with a window size of 2:

<code> (shimmered, road), (shimmered, wide), (shimmered, in), (shimmered, the)</code>
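
The pairs above can be reproduced with a few lines of plain Python. This is a minimal sketch; the function name `skipgram_pairs` is hypothetical and the tokenization is a simple lowercase whitespace split.

```python
def skipgram_pairs(tokens, target_index, window_size):
    """Pair the token at `target_index` with every token that is at most
    `window_size` positions away from it."""
    target = tokens[target_index]
    start = max(0, target_index - window_size)
    end = min(len(tokens), target_index + window_size + 1)
    return [(target, tokens[j]) for j in range(start, end) if j != target_index]


tokens = "The wide road shimmered in the hot sun".lower().split()
pairs = skipgram_pairs(tokens, tokens.index("shimmered"), window_size=2)
print(pairs)
```

The window is clipped at the sentence boundaries, so words near the start or end of the sentence produce fewer pairs.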

---

### Objective function

**The Objective of Skip-Gram**: maximizing the probability of predicting appropriate context words for a given target word.


For a sequence of training words $w_1, \dots, w_T$, the objective can be written as the average log probability:

$$
\dfrac{1}{T}\displaystyle\sum_{t=1}^T\sum_{-c\leq j\leq c, j\neq 0} \log p(w_{t+j} \mid w_t)
$$

where $c$ is the size of the training context.


The basic skip-gram formulation defines this probability using the softmax function:

$$
p(w_O \mid w_I) = \dfrac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\displaystyle\sum_{w=1}^W \exp\left({v'_w}^{\top} v_{w_I}\right)}
$$

where $v_w$ and $v'_w$ are the input (target) and output (context) vector representations of word $w$, and $W$ is the size of the vocabulary.
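
As a small numeric illustration of the softmax above, the following sketch computes the probability of every vocabulary word given one target word. The 2-dimensional vectors and the vocabulary of 4 words are made-up values chosen only for illustration, and `softmax_probability` is a hypothetical helper.

```python
import math


def softmax_probability(target_vec, output_vecs, context_index):
    """Softmax over the dot products of the target (input) vector with
    every output vector in the vocabulary."""
    scores = [sum(a * b for a, b in zip(out, target_vec)) for out in output_vecs]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    return exps[context_index] / sum(exps)  # the denominator touches every word


# Hypothetical vectors for a vocabulary of W = 4 words.
v_target = [1.0, 0.5]
v_outputs = [[0.9, 0.1], [-0.3, 0.8], [0.2, 0.2], [-1.0, -0.5]]

probs = [softmax_probability(v_target, v_outputs, i) for i in range(len(v_outputs))]
print(probs)
print(sum(probs))
```

Note that the denominator requires one dot product per vocabulary word for every single prediction.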



---

Evaluating this loss function can be computationally prohibitive, since the denominator sums over the entire vocabulary, which typically contains $10^5$ to $10^7$ words. Hence, the softmax is approximated using the noise-contrastive estimation (NCE) loss function.

### Negative Sampling

The NCE loss can be simplified further, since the objective is to learn the word embeddings and not really the distribution of the words. This simplification is known as **negative sampling**.

**Negative Sampling**: The task is reframed as a classification problem, in which the model must distinguish the true context word from *num_ns* negative samples drawn from a noise distribution $P_W(w)$ over words.
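
For reference, the original Word2Vec paper writes the negative-sampling objective for one (target, context) pair as $\log \sigma({v'_{w_O}}^{\top} v_{w_I}) + \sum_{k} \log \sigma(-{v'_{w_k}}^{\top} v_{w_I})$, where the sum runs over the negative samples $w_k$. The sketch below evaluates it with made-up 2-dimensional vectors; the helper names and all values are assumptions for illustration only.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_objective(v_target, v_context, v_negatives):
    """log sigma(v_context . v_target) + sum_k log sigma(-v_neg_k . v_target)."""
    positive_term = log_sigmoid(dot(v_context, v_target))
    negative_term = sum(log_sigmoid(-dot(v_neg, v_target)) for v_neg in v_negatives)
    return positive_term + negative_term


# Hypothetical vectors: one true context word and num_ns = 2 negative samples.
v_target = [1.0, 0.5]
v_context = [0.9, 0.1]
v_negatives = [[-0.3, 0.8], [-1.0, -0.5]]

objective = negative_sampling_objective(v_target, v_context, v_negatives)
print(objective)  # never positive; values closer to 0 indicate a better fit
```

Only `num_ns + 1` dot products are needed per training pair, instead of one per vocabulary word.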


A **negative sample** is a pair (target_word, context_word) such that the context_word **does not appear in the window_size neighborhood** of the target_word.

As an example, for our sentence and a window size of 2, negative samples might be:

<code> (hot, shimmered), (wide, hot), (wide, sun) </code>
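
The definition can be checked mechanically. The sketch below tests whether a pair is a valid negative sample for the example sentence; `is_negative_pair` is a hypothetical helper, not a library function.

```python
def is_negative_pair(tokens, target, context, window_size):
    """True if `context` never occurs within `window_size` positions of
    any occurrence of `target` in `tokens`."""
    for i, token in enumerate(tokens):
        if token != target:
            continue
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        if any(tokens[j] == context for j in range(start, end) if j != i):
            return False
    return True


tokens = "The wide road shimmered in the hot sun".lower().split()
candidates = [("hot", "shimmered"), ("wide", "hot"), ("wide", "sun"), ("shimmered", "road")]
results = {pair: is_negative_pair(tokens, *pair, window_size=2) for pair in candidates}
print(results)
```

The last candidate, (shimmered, road), is a positive skip-gram pair and is therefore rejected as a negative sample.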

## Generating Skip-grams for a Single Sentence

In [2]:
import io
import itertools
import numpy as np
import os
import re
import string
import tensorflow as tf
import tqdm

from tensorflow.keras import Model, Sequential
from tensorflow.keras.layers import Activation, Dense, Dot, Embedding, Flatten, GlobalAveragePooling1D, Reshape
from tensorflow.keras.layers import TextVectorization  # no longer experimental in recent TensorFlow releases

In [3]:
SEED = 42
AUTOTUNE = tf.data.AUTOTUNE

### Vectorize a simple sentence

The steps to follow are:

1. Tokenize the sentence.
2. Create a vocabulary dictionary and its inverse mapping.
3. Map the sentence to indices using the vocabulary.

In [7]:
# Tokenize

sentence = "The wide road shimmered in the hot sun"
tokens = sentence.lower().split()
print(tokens, len(tokens))

['the', 'wide', 'road', 'shimmered', 'in', 'the', 'hot', 'sun'] 8


In [10]:
# Vocabulary

vocab, index = {}, 1  # Start indexing from 1, since 0 is reserved for the padding token.
vocab["<pad>"] = 0
for token in tokens:
    if token not in vocab:
        vocab[token] = index
        index += 1
vocab_size = len(vocab)
print(vocab)


inverse_vocab = {index: token for token, index in vocab.items()}
print(inverse_vocab)

{'<pad>': 0, 'the': 1, 'wide': 2, 'road': 3, 'shimmered': 4, 'in': 5, 'hot': 6, 'sun': 7}
{0: '<pad>', 1: 'the', 2: 'wide', 3: 'road', 4: 'shimmered', 5: 'in', 6: 'hot', 7: 'sun'}


In [12]:
# Map the sentence

example_sentence = [vocab[token] for token in tokens]
print(example_sentence)

[1, 2, 3, 4, 5, 1, 6, 7]


### Skip-Grams: Positive and Negative

In [27]:
window_size = 2
positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(example_sentence, vocabulary_size=vocab_size,
                                                               window_size=window_size, negative_samples=0)
print(positive_skip_grams, len(positive_skip_grams))
for target, context in positive_skip_grams[:5]:
    print(f"target: {inverse_vocab[target]}, context: {inverse_vocab[context]}")

[[4, 3], [1, 6], [6, 1], [1, 5], [5, 3], [3, 5], [1, 3], [5, 1], [4, 1], [1, 7], [4, 5], [2, 1], [3, 4], [6, 7], [5, 4], [4, 2], [7, 6], [2, 3], [3, 2], [2, 4], [1, 4], [7, 1], [3, 1], [5, 6], [1, 2], [6, 5]] 26
target: shimmered, context: road
target: the, context: hot
target: hot, context: the
target: the, context: in
target: in, context: road


We have generated all possible positive skip-grams for a window size of 2. To generate negative samples, we sample `num_ns` words from the vocabulary for each positive pair, with the condition that a sampled word is not the true context word.

In [35]:
target_word, context_word = positive_skip_grams[0]
print(target_word, context_word)
num_ns = 4

context_class = tf.reshape(tf.constant(context_word, dtype="int64"), (1,1))

negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
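        true_classes=context_class, # The positive context word, shape (1, 1)
        num_true=1, # Each positive skip-gram has one positive context class
        num_sampled=num_ns, # Number of negative context words to sample
        unique=True, # All sampled negatives should be distinct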
        range_max=vocab_size,
        seed=SEED,
        name="negative_sampling")

print(negative_sampling_candidates)
print([inverse_vocab[index.numpy()] for index in negative_sampling_candidates])

4 3
tf.Tensor([6 4 2 0], shape=(4,), dtype=int64)
['hot', 'shimmered', 'wide', '<pad>']
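
The TensorFlow documentation describes the base distribution of `tf.random.log_uniform_candidate_sampler` as approximately log-uniform (Zipfian): `P(class) = (log(class + 2) - log(class + 1)) / log(range_max + 1)`. The sketch below evaluates that formula in plain Python for this toy vocabulary; it is an illustration of the documented formula, not a reimplementation of the sampler.

```python
import math


def log_uniform_probability(class_id, range_max):
    """P(class) = (log(class + 2) - log(class + 1)) / log(range_max + 1)."""
    return (math.log(class_id + 2) - math.log(class_id + 1)) / math.log(range_max + 1)


vocab_size = 8  # '<pad>' plus the 7 distinct words of the example sentence
probabilities = [log_uniform_probability(i, vocab_size) for i in range(vocab_size)]
for index, p in enumerate(probabilities):
    print(index, round(p, 3))
```

Lower indices receive higher probability, which suits vocabularies sorted by decreasing word frequency. In this toy vocabulary, index 0 is the padding token, so `<pad>` is the most likely candidate under this distribution.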


Key Point: Values of num_ns (the number of negative samples per positive context word) in the range [5, 20] have been shown to work best for smaller datasets, while values in the range [2, 5] suffice for larger datasets.


In [37]:
# Constructing the training example.

# Add a dimension so you can use concatenation (on the next step).
negative_sampling_candidates = tf.expand_dims(negative_sampling_candidates, 1)


# Concat positive context word with negative sampled words.
context = tf.concat([context_class, negative_sampling_candidates], 0)


# Label first context word as 1 (positive) followed by num_ns 0s (negative).
label = tf.constant([1]+[0]*num_ns, dtype="int64")


# Squeeze target to a scalar (shape ()) and context and label to shape (num_ns+1,).
target = tf.squeeze(target_word)
context = tf.squeeze(context)
label = tf.squeeze(label)


In [41]:
print(f"target_index    : {target}")
print(f"target_word     : {inverse_vocab[target_word]}")
print(f"context_indices : {context}")
print(f"context_words   : {[inverse_vocab[c.numpy()] for c in context]}")
print(f"label           : {label}")

target_index    : 4
target_word     : shimmered
context_indices : [3 6 4 2 0]
context_words   : ['road', 'hot', 'shimmered', 'wide', '<pad>']
label           : [1 0 0 0 0]


The training example is the tuple <code> (target_word, context_words, labels)</code>
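
The same structure can be sketched without TensorFlow, which makes the layout of the tuple easy to see. `make_training_example` is a hypothetical helper; the indices are the ones from the example above.

```python
def make_training_example(target_word, context_word, negative_samples):
    """Return (target, context, label): the positive context word comes first
    with label 1, followed by the negative samples with label 0."""
    context = [context_word] + list(negative_samples)
    label = [1] + [0] * len(negative_samples)
    return target_word, context, label


# Target 4 ('shimmered'), positive context 3 ('road'), negatives [6, 4, 2, 0].
target, context, label = make_training_example(4, 3, [6, 4, 2, 0])
print(target, context, label)
```

Keeping the positive context word in the first position means the label vector is the same for every training example.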

In [42]:
print("target  :", target)
print("context :", context)
print("label   :", label)

target  : tf.Tensor(4, shape=(), dtype=int32)
context : tf.Tensor([3 6 4 2 0], shape=(5,), dtype=int64)
label   : tf.Tensor([1 0 0 0 0], shape=(5,), dtype=int64)


![title](img/word2vec_negative_sampling.png.png)

