# Training and Exploring `Word2Vec` models

<a href="https://colab.research.google.com/drive/1fBIX57Op-lcyjqO-hH5jb7RqJu0cZuCT" target="_blank">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab">
</a>

Return to the [castle](https://github.com/Nkluge-correa/TeenyTinyCastle).

Nearly all contemporary language models, rooted in neural network architectures, rely on dense word embeddings as a cornerstone for language representation. These embeddings, often derived from pre-trained models like `Word2Vec` and `GloVe`, or contextualized representations like `ELMo` or `BERT`, encode semantic and syntactic information into fixed-size vectors. This process transforms words into continuous numerical vectors that capture their contextual meanings and relationships within a given corpus.

These embedding matrices are a go-to target for anyone seeking to interpret the inner workings of language models, particularly how they represent language. By examining the geometric relationships between word embeddings in the high-dimensional space, researchers can study how language models organize and process linguistic information, including the semantic and syntactic structures encoded within the embeddings and the way nuances in meaning and context are captured.

<img src="https://miro.medium.com/v2/resize:fit:4800/format:webp/1*T8WWibd7u8b7gfgeG0LgAA.gif" width=400 />

[Source](https://towardsdatascience.com/text-classification-with-nlp-tf-idf-vs-word2vec-vs-bert-41ff868d1794).

In this tutorial, we will analyze (while also training) a `Word2Vec` model. `Word2Vec` is a popular natural language processing technique representing words in a high-dimensional vector space. It is a neural network-based approach used to create distributed representations of words based on their co-occurrence patterns in a given text corpus.

The basic idea behind `Word2Vec` is that words used in similar contexts tend to have similar meanings. So, if two words appear in similar contexts, they should be close to each other in the vector space.
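As a minimal sketch of what "close in the vector space" can mean, the snippet below compares three hand-made 3-dimensional vectors using cosine similarity. The vectors and the helper `cosine_similarity` are hypothetical and chosen only for illustration; real `Word2Vec` embeddings are learned from data and have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (hand-written, not learned from any corpus)
toy_embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "banana": [0.1, 0.05, 0.95],
}

sim_king_queen = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_king_banana = cosine_similarity(toy_embeddings["king"], toy_embeddings["banana"])

print(f"king vs queen : {sim_king_queen:.3f}")
print(f"king vs banana: {sim_king_banana:.3f}")
```

In this toy setup, the two words assumed to share contexts get a similarity close to 1, while the unrelated word scores much lower.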

> **Note:** To learn more, "_[Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781)_" is the original study in which `Word2Vec` was proposed.

Two main techniques are used to create a `Word2Vec` model: Continuous Bag of Words (**CBOW**) and **Skip-gram**.

- **CBOW** is an algorithm to predict a target word based on its surrounding context words. The algorithm takes a window of context words as input and generates a probability distribution over the vocabulary of words for the target word.

- **Skip-gram**, on the other hand, is an algorithm to predict context words given a target word. The algorithm takes a target word as input and generates a probability distribution over the vocabulary of words for the context words.
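The difference between the two formulations can be sketched by building the training examples each one would see for the same sentence. The helper names `cbow_examples` and `skipgram_examples` are hypothetical, and the sketch only constructs (input, output) pairs; it does not train anything.

```python
def cbow_examples(tokens, window_size):
    """Return (context_words, target_word) pairs: the context predicts the target."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(i - window_size, 0):i]
        right = tokens[i + 1:i + 1 + window_size]
        examples.append((left + right, target))
    return examples

def skipgram_examples(tokens, window_size):
    """Return (target_word, context_word) pairs: the target predicts each context word."""
    examples = []
    for context, target in cbow_examples(tokens, window_size):
        examples.extend((target, c) for c in context)
    return examples

tokens = "the quick brown fox jumps".split()

# CBOW: one example per position, with all neighbours as the input
print(cbow_examples(tokens, window_size=1)[1])
# (['the', 'brown'], 'quick')

# Skip-gram: one example per (target, neighbour) pair
print(skipgram_examples(tokens, window_size=1)[:3])
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
```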

In this tutorial, we will explore the skip-gram approach. First, we will look at what skip-grams are; then, we will train a `Word2Vec` model on the [News Category Dataset](https://huggingface.co/datasets/AiresPucrs/News-Category-Dataset), available on the Hugging Face Hub. 🤗

While in CBOW, we predict a word based on the words that come before and after it, a skip-gram model seeks to predict the words that come before and after a given word (which is the inverse of CBOW). The model is trained using special groups of words called skip-grams. _But what is a skip-gram?_

Let us consider the following sentence:


In [1]:
sentence = """There is a missing word in this sentence."""

When counting skip-grams, we need to define a window size. The window size represents the context window for this sentence. In other words, the window size determines the span of words on either side of a target word that can be considered a context word. For example, a window of 2 means we only look up to two words to the left and right, and so forth.

In [3]:
# Initialize an empty list to store skip-grams
skip_grams = []

# Iterate through each word in the sentence
for i, word in enumerate(sentence.split()):

    # Create skip-grams within a window of size 2
    # Forward direction: iterate over words within the next 2 positions
    for j in range(i+1, min(i+3, len(sentence.split()))):
        skip_grams.append((word, sentence.split()[j]))

    # Backward direction: iterate over words within the previous 2 positions
    for j in range(max(i-2, 0), i):
        skip_grams.append((word, sentence.split()[j]))

print(f"""First 10 skip_grams of window_size 2 in '{sentence}'.""")
print("Sentence size: ", len(sentence.split()), "\n")

for skip in skip_grams[:10]:
    print(skip)

First 10 skip_grams of window_size 2 in 'There is a missing word in this sentence.'.
Sentence size:  8 

('There', 'is')
('There', 'a')
('is', 'a')
('is', 'missing')
('is', 'There')
('a', 'missing')
('a', 'word')
('a', 'There')
('a', 'is')
('missing', 'word')


In simple terms, the skip-gram model tries to guess the words that will likely appear around a given word. The goal is to make the model good at predicting these surrounding words. This objective can be written as the average log probability:

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq j \leq c, j\neq0} \log p(w_{t+j} | w_{t})$$

Where:

- $T$ is the total number of words in the training corpus.
- $c$ is the size of the context window.
- $w_t$ is the target word at position $t$ in the corpus.
- $w_{t+j}$ is the context word at position $t+j$ in the same context window.
- $p(w_{t+j} | w_{t})$ is the conditional probability of the context word given the target word, which the skip-gram model estimates.
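To make the double sum concrete, here is a small sketch that evaluates the objective on a toy corpus. The function `average_log_probability` and the uniform model `uniform_prob` are assumptions made for illustration: a trained skip-gram model would replace `uniform_prob` with its own estimate of $p(w_{t+j} | w_{t})$. Positions that fall outside the corpus are simply skipped.

```python
import math

def average_log_probability(tokens, window_size, prob):
    """Average over positions t of the summed log p(context | target) in the window."""
    total = 0.0
    for t, target in enumerate(tokens):
        for j in range(-window_size, window_size + 1):
            if j == 0 or not 0 <= t + j < len(tokens):
                continue  # skip the target itself and positions outside the corpus
            total += math.log(prob(tokens[t + j], target))
    return total / len(tokens)

corpus = "there is a missing word in this sentence".split()
vocabulary = sorted(set(corpus))

# Assumed baseline: every word is equally likely, regardless of the target word
def uniform_prob(context_word, target_word):
    return 1.0 / len(vocabulary)

objective = average_log_probability(corpus, window_size=2, prob=uniform_prob)
print(f"Average log probability (uniform model): {objective:.4f}")
```

Training aims to push this value up (closer to zero) relative to such an uninformed baseline.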


The softmax formulation for the skip-gram model can then be written as:

$$p(w_O | w_I) = \frac{\exp(v'_{w_O} \cdot v_{w_I})}{\sum_{w=1}^{W} \exp(v'_{w} \cdot v_{w_I})}$$

Where:

- $w_I$ is the input (target) word.
- $w_O$ is the output (context) word.
- $v_{w_I}$ and $v'_{w_O}$ are the input and output vector representations of words $w_I$ and $w_O$ respectively.
- $W$ is the size of the vocabulary of words.
- The dot ($\cdot$) represents the dot product of two vectors.

While the numerator computes the similarity between the input and output word vectors using the dot product, the denominator is a normalization term that sums up the input word's similarities with all the vocabulary words. The resulting probability distribution is over all the words in the vocabulary and is used to estimate the conditional probability of observing an output word given an input word.
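The following sketch evaluates this softmax for a tiny, made-up vocabulary. The vectors are hypothetical 2-dimensional values picked by hand, and `softmax_probabilities` is an illustrative helper, not part of any library.

```python
import math

def softmax_probabilities(input_vector, output_vectors):
    """Return p(w_O | w_I) for every word, given v_{w_I} and all v'_{w}."""
    scores = {
        word: sum(a * b for a, b in zip(vector, input_vector))
        for word, vector in output_vectors.items()
    }
    max_score = max(scores.values())  # subtract the max for numerical stability
    exp_scores = {word: math.exp(s - max_score) for word, s in scores.items()}
    normalizer = sum(exp_scores.values())  # the denominator: a sum over the whole vocabulary
    return {word: e / normalizer for word, e in exp_scores.items()}

# Hypothetical vectors for a 4-word vocabulary
input_vector = [1.0, 0.5]   # v_{w_I} for the input (target) word
output_vectors = {          # v'_{w} for every word in the vocabulary
    "missing": [0.9, 0.4],
    "word": [0.8, 0.6],
    "algebra": [-0.7, 0.1],
    "sentence": [0.2, 0.3],
}

probabilities = softmax_probabilities(input_vector, output_vectors)
for word, p in probabilities.items():
    print(f"p({word} | w_I) = {p:.3f}")
```

Note that the normalizer touches every word in the vocabulary, which is what makes the full softmax expensive for large vocabularies.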

Computing the full softmax is slow when the vocabulary is large, because the denominator sums over every word. A more efficient alternative is negative sampling, a simplified form of noise contrastive estimation (NCE). The idea is to randomly select a few words unrelated to the target word and train the model to distinguish the true context word from these randomly chosen words, which is far cheaper than normalizing over the whole vocabulary.

In this simplified approach, we select a few random words (called negative samples) and try to train the model to distinguish them from the context word. A negative sample is a pair of words where the context word is not near the target word. For example, if the target word is "_missing_" and the context window is two, then a negative sample could be "_algebra_" because "_algebra_" is not in the window size neighborhood of "_missing_" in our sentence example.
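As a rough sketch, the per-example loss under negative sampling can be written as a sum of binary (sigmoid) terms: one for the true context word and one for each negative sample. The vectors below are hypothetical, and `negative_sampling_loss` is an illustrative helper; it mirrors the commonly cited negative-sampling objective, not the exact loss used later in this tutorial.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_loss(target_vec, context_vec, negative_vecs):
    """Negative log-likelihood for one positive pair and its negative samples."""
    # The true context word should get a high score ...
    loss = -math.log(sigmoid(dot(context_vec, target_vec)))
    # ... and every negative sample should get a low score
    for negative_vec in negative_vecs:
        loss -= math.log(sigmoid(-dot(negative_vec, target_vec)))
    return loss

# Hypothetical 2-dimensional vectors
target_vec = [1.0, 0.5]                        # e.g. "missing"
context_vec = [0.9, 0.4]                       # a true context word
negative_vecs = [[-0.7, 0.1], [-0.2, -0.3]]    # randomly drawn words, e.g. "algebra"

loss = negative_sampling_loss(target_vec, context_vec, negative_vecs)
print(f"Loss for a well-separated example: {loss:.4f}")
```

Only `1 + len(negative_vecs)` dot products are needed per example, instead of one per vocabulary word.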

In practice, our model will not work with words but with tokens. Thus, let us create a tokenization dictionary for our custom sentence.




In [4]:
# Initialize an empty dictionary to store the vocabulary and an index counter
vocab, index = {}, 1

# Assign index 0 to an empty string, which serves as a padding token
vocab[''] = 0

# Iterate through each token (word) in the sentence
for token in sentence.split():
    # If the token is not already in the vocabulary
    if token not in vocab:
        # Add the token to the vocabulary with its corresponding index
        vocab[token] = index
        # Increment the index counter for the next token
        index += 1

# Calculate the size of the vocabulary
vocab_size = len(vocab)

# Create an inverse vocabulary mapping index to token
inverse_vocab = {index: token for token, index in vocab.items()}

print(vocab)
print(inverse_vocab)

print("Our tokenized sequence: ", [vocab[word] for word in sentence.split()])

print("Decoded sequence: ", [inverse_vocab[index] for index in [vocab[word] for word in sentence.split()]])


{'': 0, 'There': 1, 'is': 2, 'a': 3, 'missing': 4, 'word': 5, 'in': 6, 'this': 7, 'sentence.': 8}
{0: '', 1: 'There', 2: 'is', 3: 'a', 4: 'missing', 5: 'word', 6: 'in', 7: 'this', 8: 'sentence.'}
Our tokenized sequence:  [1, 2, 3, 4, 5, 6, 7, 8]
Decoded sequence:  ['There', 'is', 'a', 'missing', 'word', 'in', 'this', 'sentence.']


We could use the for loop implemented in our second code cell to create skip-grams. However, there is no need to reinvent the wheel. The [`tf.keras.preprocessing.sequence`](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence) module provides the [`tf.keras.preprocessing.sequence.skipgrams`](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/skipgrams) function, which can do this heavy lifting for us.

In [5]:
import tensorflow as tf

# Tokenize the input sentence using the previously created vocabulary
tokenized_sentence = [vocab[word] for word in sentence.split()]

# Generate positive skip-grams using TensorFlow's skipgrams function
positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
      tokenized_sentence,
      vocabulary_size=vocab_size,
      window_size=2,
      negative_samples=0)

for target, context in positive_skip_grams[:10]:
    print(f"({inverse_vocab[target]}, {inverse_vocab[context]})")



(word, missing)
(a, There)
(There, is)
(is, a)
(missing, word)
(There, a)
(in, missing)
(word, this)
(this, sentence.)
(a, missing)


The skip-grams function looks for pairs of words that appear together within a certain window span. These pairs are called **positive skip-grams**.

However, we also need **negative samples**. As mentioned, these are pairs of words that don't appear together. To create negative samples, we randomly choose words from the vocabulary that are not in the same window as the **positive skip-grams**.

We use the [`tf.random.log_uniform_candidate_sampler`](https://www.tensorflow.org/api_docs/python/tf/random/log_uniform_candidate_sampler) function to do this. It randomly samples words from the vocabulary to serve as negative samples. We tell the function how many negative samples we want (`num_ns`) and pass the positive skip-gram's context word as the true class (`true_classes`), which marks it as the positive example rather than a negative sample.

In [6]:
# Extract the target word and context word from the first positive skip-gram
target_word, context_word = positive_skip_grams[0]

# Define the number of negative samples
num_ns = 4

# Reshape the context word to a tensor shape required by the negative sampling function
context_class = tf.reshape(tf.constant(context_word, dtype="int64"), (1, 1))

# Generate negative samples using TensorFlow's random log uniform candidate sampler
negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
    true_classes=context_class,
    num_true=1,
    num_sampled=num_ns,
    unique=True,
    range_max=vocab_size,
    seed=42,
    name="negative_sampling"
)

# Print original sentence, positive skip-gram, and negative samples
print("Original Sentence: ", sentence)
print(f"Positive skip-grams: ({inverse_vocab[target_word]}, {inverse_vocab[context_word]})")
print("Negative samples: ", [inverse_vocab[index.numpy()] for index in negative_sampling_candidates])


Original Sentence:  There is a missing word in this sentence.
Positive skip-grams: (word, missing)
Negative samples:  ['a', 'There', 'word', 'is']


Now that we have positive and negative samples, we can combine them to create a set of training examples. For each positive skip-gram pair (`target_word`, `context_word`), we also have `num_ns` negative samples (words that don't appear in the same window). We group these positive and negative samples into a single set. Each positive sample is labeled 1, and each negative sample is labeled 0. So, for every target word, we end up with a set of positive skip-grams and negative samples that can be used to train the model.

In [7]:
context = tf.concat([tf.squeeze(context_class, 1), negative_sampling_candidates], 0)
label = tf.constant([1] + [0]*num_ns, dtype="int64")
target = target_word

print(f"""
One training sample: {{
target token    : {target}
target word     : {inverse_vocab[target_word]}
context tokens : {context}
context words   : {[inverse_vocab[c.numpy()] for c in context]}
labels           : {label}
}}
""")


One training sample: {
target token    : 5
target word     : word
context tokens : [4 3 1 5 2]
context words   : ['missing', 'a', 'There', 'word', 'is']
labels           : [1 0 0 0 0]
}



When we have a large dataset, we also have a lot of words to work with. Some words, like "_the_", "_is_", and "_on_", appear very frequently and don't provide much useful information to the model. To deal with this, we can subsample these very frequent words, removing some of their occurrences from the training data.

The [`tf.keras.preprocessing.sequence.skipgrams`](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/skipgrams) function can be used to subsample these frequent words by giving it a sampling table: a list of probabilities that tell it how likely each word is to be sampled. To get these values, we can use the [`tf.keras.preprocessing.sequence.make_sampling_table`](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/make_sampling_table) function. This function generates the probabilities from each word's frequency rank, assuming word frequencies follow a Zipf-like distribution, so the most frequent words are the least likely to be sampled.
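To build intuition for subsampling, the sketch below applies the heuristic described in the word2vec literature, where a word with relative frequency $f$ is kept with probability $\min(1, \sqrt{t / f})$ for a small threshold $t$. This is an illustration computed from actual counts, not the Keras implementation (which works from frequency ranks). The helper `keep_probabilities` is hypothetical, and the threshold of `0.05` is deliberately large so the effect is visible on a 13-word toy corpus; values around `1e-5` are typical for real corpora.

```python
import math
from collections import Counter

def keep_probabilities(tokens, threshold):
    """Probability of keeping each word: min(1, sqrt(threshold / relative_frequency))."""
    counts = Counter(tokens)
    total = len(tokens)
    return {
        word: min(1.0, math.sqrt(threshold / (count / total)))
        for word, count in counts.items()
    }

tokens = "the cat sat on the mat and the dog sat on the rug".split()
probs = keep_probabilities(tokens, threshold=0.05)

# Print words from most to least aggressively subsampled
for word, p in sorted(probs.items(), key=lambda item: item[1]):
    print(f"{word:>4}: keep with probability {p:.2f}")
```

The most frequent word ("the") ends up with the lowest keep probability, while words that occur once are kept most often.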

Finally, now that we have described all the necessary steps to preprocess text data for training word embeddings using the skip-gram model, we can compile them into a function. Once this function is defined, we can use it in the later sections to preprocess our text data and prepare it for training our `Word2Vec` model.

In [8]:
import tqdm

def generate_training_data(sequences, window_size, num_ns, vocab_size, seed):
  """
    Generate training data for a skip-gram model using negative sampling.

    Args:
        sequences: A list of sequences, where each sequence is a list of integers
        representing words.
        window_size: An integer, the size of the window for generating skip-grams.
        num_ns: An integer, the number of negative samples to use for each positive sample.
        vocab_size: An integer, the size of the vocabulary.
        seed: An integer, the random seed to use for sampling.

    Returns:
        Three lists: targets, contexts, and labels.
        Targets is a list of integers representing target words, contexts is a list of lists
        of integers representing context words and negative samples, and labels is a list of
        lists of integers representing the labels for each context. Specifically, each label
        list has a 1 in the first position (representing the positive sample) and 0s in the
        remaining positions (representing the negative samples).
  """
  targets, contexts, labels = [], [], []

  sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(vocab_size)

  for sequence in tqdm.tqdm(sequences):

    positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
          sequence,
          vocabulary_size=vocab_size,
          sampling_table=sampling_table,
          window_size=window_size,
          negative_samples=0)

    for target_word, context_word in positive_skip_grams:

      context_class = tf.expand_dims(tf.constant([context_word], dtype="int64"), 1)

      negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
          true_classes=context_class,
          num_true=1,
          num_sampled=num_ns,
          unique=True,
          range_max=vocab_size,
          seed=seed,
          name="negative_sampling")

      context = tf.concat([tf.squeeze(context_class,1), negative_sampling_candidates], 0)
      label = tf.constant([1] + [0]*num_ns, dtype="int64")

      targets.append(target_word)
      contexts.append(context)
      labels.append(label)

  return targets, contexts, labels

Now, we need some text. For this, we will use the [News Category Dataset](https://huggingface.co/datasets/AiresPucrs/News-Category-Dataset), created by [Rishabh Misra](https://arxiv.org/abs/2209.11429).

In [18]:
!pip install datasets -q

from datasets import load_dataset

dataset = load_dataset("AiresPucrs/News-Category-Dataset", split="train")

display(dataset.to_pandas())

| Unnamed: 0 | text | labels |
|---|---|---|
| 0 | Over 4 Million Americans Roll Up Sleeves For O... | U.S. NEWS |
| 1 | American Airlines Flyer Charged, Banned For Li... | U.S. NEWS |
| 2 | 23 Of The Funniest Tweets About Cats And Dogs ... | COMEDY |
| 3 | The Funniest Tweets From Parents This Week (Se... | PARENTING |
| 4 | Woman Who Called Cops On Black Bird-Watcher Lo... | U.S. NEWS |
| ... | ... | ... |
| 209522 | RIM CEO Thorsten Heins' 'Significant' Plans Fo... | TECH |
| 209523 | Maria Sharapova Stunned By Victoria Azarenka I... | SPORTS |
| 209524 | Giants Over Patriots, Jets Over Colts Among M... | SPORTS |
| 209525 | Aldon Smith Arrested: 49ers Linebacker Busted ... | SPORTS |


Now, we will create a folder to store the examples we will use in our dataset, since we will be using the [`TextLineDataset`](https://www.tensorflow.org/api_docs/python/tf/data/TextLineDataset) class from TensorFlow, which creates a `tf.data.Dataset` comprising lines from one or more text files.

In [19]:
import os
import tqdm

# Convert the dataset to a pandas DataFrame
dataset = dataset.to_pandas()

# Create a new directory named "dataset"
os.makedirs("dataset/", exist_ok=True)

# Iterate over unique labels (categories) in the dataset
for category in tqdm.tqdm(dataset.labels.unique()):
    # Create a subdirectory for each category within the "dataset" directory
    os.makedirs(f"dataset/{category}", exist_ok=True)

    # Filter the dataset to include only samples belonging to the current category
    dff = dataset[dataset['labels'] == category]

    # Iterate over each sample in the filtered dataset
    for i, sample in enumerate(list(dff.text)):
        # Write the text sample to a text file named "{i}.txt" within the corresponding category subdirectory
        with open(f'dataset/{category}/{i}.txt', 'w', encoding='utf-8') as fp:
            fp.write(sample)

print('Dataset Folder Created!')


100%|██████████| 42/42 [00:23<00:00,  1.81it/s]

Dataset Folder Created!





This dataset contains 42 directories with 209,527 files. However, this tutorial will use only a portion of it: a little more than 35K samples from the **"POLITICS"** folder.

> **Note:** You can use more folders or a completely different text dataset. Remember that the more text you give, the longer it will take to train the model.


In [21]:
import os

filenames = []

for name in os.listdir("dataset/POLITICS"):
    filenames.append(os.path.join("dataset/POLITICS", name))

print(f"Found {len(filenames)} files.")

Found 35602 files.


With all of our text files listed in `filenames`, we can create a dataset using the `tf.data.TextLineDataset`, which loads text from text files and creates a dataset where each line of the files becomes an element of the dataset.

In [22]:
import random
import tensorflow as tf

# Shuffle the list of filenames randomly
random.shuffle(filenames)

# Create a TextLineDataset from the shuffled filenames
text_ds = tf.data.TextLineDataset(filenames)

# Batch the dataset into batches of size 1024
text_ds = text_ds.batch(1024)

To create a tokenizer (i.e., a vectorization layer), we will use [`tf.keras.layers.TextVectorization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization), which maps text features to integer sequences. We will also pass it a custom standardization function that lowercases strings and strips punctuation from our text samples.
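The effect of this kind of standardization can be previewed in plain Python before wiring it into TensorFlow. The helpers `standardize` and `tokenize` below are hypothetical stand-ins: they imitate lowercasing, punctuation stripping, and whitespace splitting, but they are not the `TextVectorization` layer itself and do not map words to integer indices.

```python
import re
import string

def standardize(text):
    """Lowercase the text and remove ASCII punctuation characters."""
    lowercase = text.lower()
    return re.sub("[%s]" % re.escape(string.punctuation), "", lowercase)

def tokenize(text):
    """Split the standardized text on whitespace."""
    return standardize(text).split()

print(tokenize("Hello, World! It's 2024."))
# ['hello', 'world', 'its', '2024']
```

Note that stripping punctuation also merges contractions ("It's" becomes "its"), which is a side effect worth keeping in mind when inspecting the learned vocabulary.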

In [23]:
import re
import string

# Lowercase all strings and strip punctuation and symbols
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  return tf.strings.regex_replace(lowercase,
                                  '[%s]' % re.escape(string.punctuation), '')

# Maximum vocabulary size; sequences are padded or truncated to exactly 100 tokens
vocab_size = 10000
sequence_length = 100

# Create a vectorization layer and adapt it to the text
vectorize_layer = tf.keras.layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=sequence_length,
    encoding='utf-8')

# Fit the TextVectorization layer to the dataset
vectorize_layer.adapt(text_ds)

# Get words back from token indices
word2vec_vocabulary = vectorize_layer.get_vocabulary()

# Save the vocabulary as a text file
with open('word2vec_vocabulary.txt', 'w', encoding='utf-8') as fp:
    for word in word2vec_vocabulary:
        fp.write("%s\n" % word)

To get a dataset ready for training a `Word2Vec` model, you need to convert the dataset into a list of tokenized and batched sequences. Since we are implementing skip-gram, we must go through each sentence in the dataset and use it to create positive and negative examples to feed our model during training. For this, we will re-use our `generate_training_data` function.

> **Note:** This function looks at each word in each sentence and uses them to create examples to teach the model how to predict related words. The function creates three lists - target words, context words, and labels - each list has the same number of items, representing the total number of examples the model will be trained on.

When training a `Word2Vec` model, there are two critical things to consider: the size of the context window (`window_size`) and the number of negative samples (`num_ns`) drawn for each positive pair.

Different window sizes can be more beneficial depending on what you're trying to accomplish. Generally, smaller window sizes (2-15) yield embeddings in which a high similarity score indicates that two words are interchangeable. This can include antonyms, since words like "_good_" and "_bad_" often appear in the same contexts. Larger window sizes (15-50 or more) yield embeddings in which related words, which are not necessarily interchangeable, have higher similarity scores.

For a more complete explanation of the effect `window size` has, [watch this video](https://www.youtube.com/watch?v=tAxrlAVw-Tk&t=648s).

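Regarding `num_ns`, the paper that introduced negative sampling, "_[Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/abs/1310.4546)_", suggests 5-20 negative samples for small datasets and 2-5 for large ones.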

In [24]:
import numpy as np

# Define AUTOTUNE for TensorFlow data pipeline optimization
AUTOTUNE = tf.data.AUTOTUNE

# Prefetch the text dataset for performance optimization and apply vectorization to each element
text_vector_ds = text_ds.prefetch(AUTOTUNE).map(vectorize_layer).unbatch()

# Convert the text dataset into a list of sequences
sequences = list(text_vector_ds.as_numpy_iterator())
print("We have ", len(sequences), " sequences.")

# Define window size and number of negative samples
window_size = 2
num_ns = 4

# Generate training data (targets, contexts, labels) based on the sequences
targets, contexts, labels = generate_training_data(
    sequences=sequences,
    window_size=window_size,
    num_ns=num_ns,
    vocab_size=vocab_size,
    seed=42)

targets = np.array(targets)
contexts = np.array(contexts)
labels = np.array(labels)

print(f"Targets: {targets.shape}")
print(f"Contexts: {contexts.shape}")
print(f"Labels: {labels.shape}")

We have  35626  sequences.


100%|██████████| 35626/35626 [03:42<00:00, 160.24it/s]


Targets: (637410,)
Contexts: (637410, 5)
Labels: (637410, 5)


Depending on the chosen `window_size` and `num_ns`, generating the dataset can take a while. We created a 10,000-word `Word2Vec` vocabulary using the following sections of our dataset:

- "dataset/POLITICS".
- "dataset/WORLD NEWS".
- "dataset/ENTERTAINMENT".
- "dataset/ENVIRONMENT".
- "dataset/EDUCATION".
- "dataset/SCIENCE".
- "dataset/WELLNESS".

We also created two sets of [`targets`, `contexts`, and `labels`]. One has a `window_size` of 2, and the other has a `window_size` of 15. You can compare them to see how more text data and a larger `window_size` affect your `Word2Vec` model.

All of these are available in the repository tied to the training of our `Word2Vec` model. To download them, run:

```bash
!git lfs install
!git clone https://huggingface.co/AiresPucrs/Word2Vec
```

We will be using one of these already prepared datasets, but feel free to use or create the dataset as you wish!

In [26]:
import numpy as np

!git lfs install
!git clone https://huggingface.co/AiresPucrs/Word2Vec

# Define window size and number of negative samples
window_size = 2
num_ns = 4

# Load preprocessed training data from files
with open(f'./word2vec/w2v_dataset_w{window_size}_nn{num_ns}.npy', 'rb') as fp:
    targets = np.load(fp)  # Load targets
    contexts = np.load(fp)  # Load contexts
    labels = np.load(fp)    # Load labels

# Load word2vec vocabulary from file
with open('./word2vec/word2vec_vocabulary.txt', encoding='utf-8') as fp:
    word2vec_vocabulary = [line.strip() for line in fp]  # Read vocabulary lines

print(f"Targets: {targets.shape}")
print(f"Contexts: {contexts.shape}")
print(f"Labels: {labels.shape}")
print(f"Vocabulary Size: {len(word2vec_vocabulary)}")

# Create a TensorFlow dataset from loaded data
dataset = tf.data.Dataset.from_tensor_slices(((targets, contexts), labels))

# Shuffle and batch the dataset
dataset = dataset.shuffle(10000).batch(1024, drop_remainder=True)

Targets: (1569949,)
Contexts: (1569949, 5)
Labels: (1569949, 5)
Vocabulary Size: 10000


As said before, the `Word2Vec` model is a tool that can help us tell which words go together by looking at how often they appear near each other in sentences. It does this by comparing the meanings of different words and figuring out which ones are similar.

To train the model, we can give it pairs of words and ask it to predict whether they belong together. We can check if the model is correct by comparing its predictions to the actual pairs of words that we already know go together. The model improves over time as it learns from more and more examples of word pairs.

To create our `Word2Vec` model, we can use the Keras Subclassing API. Here is a breakdown of the implementation:

- The first layer is the `target_embedding` layer, which looks up the embedding of a word when it appears as a target. Its size depends on the size of the vocabulary and the dimension of the embeddings (`vocab_size` × `embedding_dim`).
- The second layer is the `context_embedding` layer, which looks up the embedding of a word when it appears in the context of another word. It has the same number of parameters as the `target_embedding` layer.
- The dot product between the `target` and `context` embeddings is computed inside `call()` with `tf.einsum`, yielding one score (logit) per context candidate.

We then define a `call()` function that takes a pair (`target` and `context`), passes them through the target and context embedding layers, and returns the dot products of their outputs, with shape `(batch_size, num_ns + 1)`.
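To make the shapes in the dot-product step concrete, the `'be,bce->bc'` einsum pattern used in `call()` can be written out in plain Python. The sketch assumes target embeddings of shape `(batch, embedding_dim)` and context embeddings of shape `(batch, num_ns + 1, embedding_dim)`; `batched_dot` is a hypothetical helper, and the values are made up.

```python
def batched_dot(word_emb, context_emb):
    """Pure-Python equivalent of einsum('be,bce->bc', word_emb, context_emb)."""
    return [
        [sum(w * c for w, c in zip(word_vec, context_vec)) for context_vec in contexts]
        for word_vec, contexts in zip(word_emb, context_emb)
    ]

# Batch of 2 target words, embedding dimension 2
word_emb = [[1.0, 0.0], [0.0, 2.0]]

# For each target: 3 context candidates (1 positive + 2 negatives)
context_emb = [
    [[1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]],
    [[1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]],
]

logits = batched_dot(word_emb, context_emb)
print(logits)
# [[1.0, 0.0, -1.0], [2.0, 2.0, 0.0]]
```

Each row holds one score per context candidate for the corresponding target word, which is the shape the loss function expects.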

In [28]:
import tensorflow as tf

class Word2Vec(tf.keras.Model):
    """
    Word2Vec model class for training word embeddings using skip-gram.
    """

    def __init__(self, vocab_size, embedding_dim):
        """
        Initialize the Word2Vec model.

        Parameters:
            - vocab_size (int): Size of the vocabulary.
            - embedding_dim (int): Dimensionality of word embeddings.
        """
        super(Word2Vec, self).__init__()  # Initialize the parent class (tf.keras.Model)

        # Define target embedding layer
        self.target_embedding = tf.keras.layers.Embedding(vocab_size,
                                                          embedding_dim,
                                                          input_length=1,
                                                          name="w2v_target_embedding")

        # Define context embedding layer
        self.context_embedding = tf.keras.layers.Embedding(vocab_size,
                                                           embedding_dim,
                                                           input_length=num_ns + 1,
                                                           name="w2v_context_embedding")

    def call(self, pair):
        """
        Perform forward pass of the Word2Vec model.

        Parameters:
            - pair (tuple): Tuple containing target and context words.

        Returns:
            - dots (tensor): Dot product between target and context embeddings.
        """
        target, context = pair  # Unpack target and context words from the input pair

        # Squeeze the target tensor if its shape has two dimensions
        if len(target.shape) == 2:
            target = tf.squeeze(target, axis=1)

        # Retrieve embeddings for target and context words using the embedding layers
        word_emb = self.target_embedding(target)  # Embedding for the target word
        context_emb = self.context_embedding(context)  # Embedding for the context words

        # Compute dot product between target and context embeddings using einsum
        dots = tf.einsum('be,bce->bc', word_emb, context_emb)

        return dots  # Return the dot product tensor


Since our labels are already one-hot-encoded, we will use `CategoricalCrossentropy` (with `from_logits=True`) as an alternative to the negative sampling loss and `Adam` as the optimizer. Finally, we instantiate our `Word2Vec` class with an embedding dimension of 512 and a vocabulary size of 10,000 words.
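As a rough illustration of what this loss computes for a single training example, the sketch below applies a softmax to the `num_ns + 1` logits and takes the negative log-probability of the position labeled 1. The helper `categorical_crossentropy_from_logits` and the logit values are hypothetical; the Keras loss additionally averages over the batch.

```python
import math

def categorical_crossentropy_from_logits(logits, one_hot_label):
    """Cross-entropy between a one-hot label and softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_normalizer = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    log_probs = [x - log_normalizer for x in logits]
    return -sum(y * lp for y, lp in zip(one_hot_label, log_probs))

# The positive context word sits in position 0, followed by 4 negative samples
label = [1, 0, 0, 0, 0]

good_logits = [4.0, -1.0, -2.0, 0.0, -3.0]   # highest score on the true context word
bad_logits = [-3.0, -1.0, 4.0, 0.0, -2.0]    # highest score on a negative sample

good_loss = categorical_crossentropy_from_logits(good_logits, label)
bad_loss = categorical_crossentropy_from_logits(bad_logits, label)

print(f"Loss when the true context scores highest: {good_loss:.4f}")
print(f"Loss when a negative sample scores highest: {bad_loss:.4f}")
```

The loss is small when the dot product for the true context word dominates the negatives, which is the behavior the embeddings are trained toward.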

> **Note: You can skip this part of the tutorial if you wish to avoid training this model from scratch!**

In [None]:
vocab_size = 10000
embedding_dimension = 512

word2vec = Word2Vec(vocab_size, embedding_dimension)

word2vec.compile(optimizer='adam',
                 loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
                 metrics=['accuracy'])

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")

word2vec.fit(dataset, verbose=1, epochs=20)

Now, we can recover the embeddings from both the `target` and `context` embedding layers. These embeddings will now hold some information about the relationship of words in our text corpus.

> **Note: For easy access, we also made this available in our [`Word2Vec`](https://huggingface.co/AiresPucrs/Word2Vec) repository.**

In [None]:
# Extract the embedding layer from the model
embeddings_target = word2vec.get_layer('w2v_target_embedding').get_weights()[0]
embeddings_context = word2vec.get_layer('w2v_context_embedding').get_weights()[0]

# Save the embeddings as a numpy array
with open('./word2vec/w2v_embeddings_w2.npy', 'wb') as fp:
    np.save(fp, embeddings_target)
    np.save(fp, embeddings_context)

Let us now load our pre-trained embeddings (trained with a `window_size` of 15) to explore.

In [31]:
# !git lfs install
# !git clone https://huggingface.co/AiresPucrs/Word2Vec

with open('./word2vec/w2v_embeddings_w15.npy', 'rb') as fp:
    embeddings_target = np.load(fp)
    embeddings_context = np.load(fp)

with open('./word2vec/word2vec_vocabulary.txt', encoding='utf-8') as fp:
    word2vec_vocabulary = [line.strip() for line in fp]

print(f"Target Embeddings shape: {embeddings_target.shape}")
print(f"Context Embeddings shape: {embeddings_context.shape}")
print(f"Vocabulary Size: {len(word2vec_vocabulary)}")

Target Embeddings shape: (10000, 512)
Context Embeddings shape: (10000, 512)
Vocabulary Size: 10000


To associate each embedding with a human-readable string, we must pair our embeddings with our vocabulary, as done below.

In [32]:
# Create a dictionary of "word: embedding"
word2vec_target_embeddings = {}
word2vec_context_embeddings = {}

# Iterating through the elements of the vocabulary
for i, word in enumerate(word2vec_vocabulary):
    # Skip embedding/token 0 (""), because it is just the PAD token.
    if i == 0:
        continue
    word2vec_target_embeddings[word] = embeddings_target[i]
    word2vec_context_embeddings[word] = embeddings_context[i]

Finally, we can perform some basic operations (`cosine similarity`) to understand and interpret what our model has learned, both for the `target` and `context` embeddings. While `target embeddings` hold information on **"relatedness among words"**, `context embeddings` hold information on **"what words usually accompany the target word"**.

In [33]:
import pandas as pd
from numpy.linalg import norm
from IPython.display import Markdown

def compute_cosine_table(string, dictionary,
                         vocabulary, top_n):
    """
    Computes the cosine similarity between a given word and all other words in a dictionary.

    Parameters:
    -----------
    string : str
        The word to compare against.
    dictionary : dict
        A dictionary with words as keys and their corresponding embeddings as values.
    vocabulary : list
        A list of words in the dictionary.
    top_n : int
        The number of closest matches to return.

    Returns:
    --------
    A pandas DataFrame with the closest matches to the input word and their
    corresponding similarity scores. The DataFrame is sorted in descending
    order of similarity score and limited to the top_n matches.
    The index of the DataFrame is set to the closest matches.
    """

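    # Copy the vocabulary so the caller's list is untouched, then drop the query word
    candidates = vocabulary.copy()
    candidates.remove(string)
    # Index 0 is the PAD token (""), which has no entry in the dictionary
    candidates = candidates[1:]

    query = dictionary[string]
    cos = []
    for word in candidates:
        cosine = np.dot(query, dictionary[word]) / (norm(query) * norm(dictionary[word]))
        cos.append(cosine)

    return pd.DataFrame({"Closest Match": candidates, "Similarity Score": cos})\
        .sort_values("Similarity Score", ascending=False)\
        .set_index("Closest Match").head(top_n)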

word = "trump"

df = compute_cosine_table(word,
        word2vec_target_embeddings,
        word2vec_vocabulary, 10)

print("Cosine Similarity for Target Embeddings")
display(Markdown(df.to_markdown()))

df = compute_cosine_table(word,
        word2vec_context_embeddings,
        word2vec_vocabulary, 10)

print("Cosine Similarity for Context Embeddings")
display(Markdown(df.to_markdown()))

Cosine Similarity for Target Embeddings


| Closest Match   |   Similarity Score |
|:----------------|-------------------:|
| donald          |           0.194642 |
| counsel         |           0.194392 |
| mexicos         |           0.192862 |
| insider         |           0.184099 |
| arpaio          |           0.183056 |
| pence           |           0.181958 |
| white           |           0.178896 |
| trevor          |           0.17812  |
| trumps          |           0.177152 |
| scarborough     |           0.176633 |

Cosine Similarity for Context Embeddings


| Closest Match   |   Similarity Score |
|:----------------|-------------------:|
| donald          |           0.67039  |
| on              |           0.4248   |
| with            |           0.400963 |
| his             |           0.398482 |
| tweeters        |           0.388154 |
| trumps          |           0.386425 |
| ¯ツ¯            |           0.385741 |
| president       |           0.381984 |
| he              |           0.376554 |
| taunts          |           0.367066 |
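
The same ranking can also be computed without a Python loop. The sketch below is one possible vectorized variant, not the function used in this tutorial: `top_cosine_matches` and the 2-D toy embeddings are hypothetical, and it assumes that no embedding row is all zeros.

```python
import numpy as np

def top_cosine_matches(query, words, matrix, top_n=3):
    """Rank `words` by cosine similarity between their rows in `matrix` and `query`."""
    # Normalizing rows and query to unit length turns dot products into cosines.
    unit_rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    scores = unit_rows @ unit_query
    # argsort is ascending, so reverse it to get the highest scores first.
    order = np.argsort(scores)[::-1][:top_n]
    return [(words[i], float(scores[i])) for i in order]

# Hypothetical 2-D toy embeddings, hand-picked so the ranking is easy to verify.
words = ["east", "northeast", "north", "west"]
matrix = np.array([[1.0, 0.0],
                   [1.0, 1.0],
                   [0.0, 1.0],
                   [-1.0, 0.0]])

result = top_cosine_matches(np.array([2.0, 0.0]), words, matrix)
print(result)
```

Here the query is an arbitrary vector, so nothing is excluded from the candidates. When the query is itself a vocabulary word, its own row should be dropped before ranking, as `compute_cosine_table` does.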

We can also perform basic arithmetic operations with these vector embeddings, which is another way to try to understand the knowledge they hold.

In [34]:
def find_closest_match(array, dictionary, vocabulary,
                           word1, word2, top_n):
    """
    Computes the cosine similarity between a given array and all other word
    embeddings in a dictionary except for two specified words.

    Parameters:
    -----------
    array : numpy.ndarray
        An array representing the embedding of a word or phrase.
    dictionary : dict
        A dictionary with words as keys and their corresponding embeddings as values.
    vocabulary : list
        A list of words in the dictionary.
    word1 : str
        The first word to exclude from the matches.
    word2 : str
        The second word to exclude from the matches.
    top_n : int
        The number of closest matches to return.

    Returns:
    --------
        A pandas DataFrame with the closest matches to the input array and
        their corresponding similarity scores. The DataFrame is sorted in
        descending order of similarity score and limited to the top_n matches.
        The index of the DataFrame is set to the closest matches.
    """

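    # Copy the vocabulary and drop the two words used to build the query vector
    candidates = vocabulary.copy()
    candidates.remove(word1)
    candidates.remove(word2)
    # Index 0 is the PAD token (""), which has no entry in the dictionary
    candidates = candidates[1:]

    cos = []
    for word in candidates:
        cosine = np.dot(array, dictionary[word]) / (norm(array) * norm(dictionary[word]))
        cos.append(cosine)

    return pd.DataFrame({"Closest Match": candidates, "Similarity Score": cos})\
        .sort_values("Similarity Score", ascending=False)\
        .set_index("Closest Match").head(top_n)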

word1 = 'man'
word2 = 'music'

# Add the two target embeddings to build a combined query vector
sum_vec = word2vec_target_embeddings[word1] + word2vec_target_embeddings[word2]

df = find_closest_match(sum_vec, word2vec_target_embeddings,
                        word2vec_vocabulary, word1, word2, 5)

display(Markdown(df.to_markdown()))

| Closest Match   |   Similarity Score |
|:----------------|-------------------:|
| jazz            |           0.214929 |
| christina       |           0.188508 |
| duff            |           0.185525 |
| peek            |           0.184137 |
| centers         |           0.181551 |

Apparently, **"man"** + **"music"** gets us close to **"jazz"** 🎶🎷.
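
Vector arithmetic is not limited to addition. As a minimal sketch, assuming hand-made 2-D embeddings in which one axis loosely stands for "royalty" and the other for "gender", subtracting one word and adding another moves the query along a single direction. All names and values below are hypothetical; real embeddings are far noisier and rarely line up this cleanly.

```python
import numpy as np

# Hypothetical 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
toy = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(vector, embeddings, exclude=()):
    """Return the word whose embedding has the highest cosine similarity to `vector`."""
    candidates = {w: e for w, e in embeddings.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(vector, candidates[w]))

# Subtraction removes one direction, addition brings another one in.
analogy = toy["king"] - toy["man"] + toy["woman"]
match = nearest(analogy, toy, exclude=("king", "man", "woman"))
print(match)
```

As in `find_closest_match`, the words used to build the query are excluded from the candidates, since they would otherwise tend to dominate the ranking.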

By manipulating word vectors, researchers can uncover semantic relationships and analogies encoded in the embedding space, which sheds light on how language models represent linguistic concepts. This kind of exploration can inform the development of more interpretable and transparent NLP models, and it supports tasks such as semantic similarity assessment, concept categorization, and analogy reasoning. It also offers a way to check the quality and coherence of learned representations, which contributes to the trustworthiness and reliability of NLP systems.

---

Return to the [castle](https://github.com/Nkluge-correa/TeenyTinyCastle).