# `Word2vec` - Embedding Words in Vector Space Representation

Return to the [castle](https://github.com/Nkluge-correa/teeny-tiny_castle).

**`Word2vec` is a popular natural language processing technique used to represent words in a high-dimensional vector space. It is a neural network-based approach that is used to create distributed representations of words based on their co-occurrence patterns in a given corpus of text.**

**The basic idea behind `word2vec` is that words that are used in similar contexts tend to have similar meanings. So, if two words appear in similar contexts, they should be close to each other in the vector space. `Word2vec` uses neural networks to learn the relationships between words by analyzing the context in which they appear in a given text.**

**`Word2vec` models can support `interpretability` research on text data and language models by letting us extract insights from the learned word embeddings. For example, we can:**

1. **Visualize the learned word embeddings.** 
2. **Use word embeddings to perform tasks such as word similarity and analogy tests.** 
3. **Use the word embeddings to identify clusters of related words.**

**However, we first need to know a little more about embeddings, vectors, and the `word2vec` model. The original work by [Tomas Mikolov](https://arxiv.org/search/cs?searchtype=author&query=Mikolov%2C+T), [Kai Chen](https://arxiv.org/search/cs?searchtype=author&query=Chen%2C+K), [Greg Corrado](https://arxiv.org/search/cs?searchtype=author&query=Corrado%2C+G), and [Jeffrey Dean](https://arxiv.org/search/cs?searchtype=author&query=Dean%2C+J) is a good introduction to the subject.**

<img src="https://cdn-images-1.medium.com/fit/t/1600/480/1*T8WWibd7u8b7gfgeG0LgAA.gif" width=400 />

**There are two main techniques used in `word2vec`: Continuous Bag of Words (`CBOW`) and `Skip-gram`.**

- **`CBOW` is an algorithm used to predict a target word based on its surrounding context words. The algorithm takes a window of context words as input and generates a probability distribution over the vocabulary of words for the target word.**
- **`Skip-gram`, on the other hand, is an algorithm used to predict context words given a target word. The algorithm takes a target word as input and generates a probability distribution over the vocabulary of words for the context words.**
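
**To make the difference between the two formulations concrete, the sketch below builds the training examples each one would see for a toy sentence. The helper names `cbow_examples` and `skipgram_examples`, and the sentence itself, are illustrative assumptions and not part of any library.**

```python
def cbow_examples(tokens, window_size):
    """Return (context_words, target_word) pairs: many inputs, one output."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(i - window_size, 0):i]
        right = tokens[i + 1:i + 1 + window_size]
        examples.append((left + right, target))
    return examples


def skipgram_examples(tokens, window_size):
    """Return (target_word, context_word) pairs: one input, one output each."""
    examples = []
    for context, target in cbow_examples(tokens, window_size):
        for context_word in context:
            examples.append((target, context_word))
    return examples


toy_tokens = "the cat sat on the mat".split()

cbow = cbow_examples(toy_tokens, window_size=1)
skipgram = skipgram_examples(toy_tokens, window_size=1)

print(cbow[1])       # (['the', 'sat'], 'cat')
print(skipgram[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
```

**Note how a single `CBOW` example with several context words becomes several `skip-gram` examples, one per context word.**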

**In this tutorial, we will explore a `skip-gram` approach. First, we will look at what skip-grams are, and then we will train a `word2vec` model with the [News Category Dataset](https://www.kaggle.com/datasets/rmisra/news-category-dataset?resource=download), available on `Kaggle`.**

**While in `CBOW` we try to predict a word based on the words that come before and after it, a `skip-gram` model seeks to predict the words that come before and after a given word (it is basically the inverse of `CBOW`). The model is trained using special groups of words called `skip-grams`, which allow certain words to be skipped in the prediction process.**

**Let us consider the following sentence:**


In [1]:
sentence = """There is a missing word in this sentence."""

**The context window for this sentence is defined by the window size. The window size determines the span of words on either side of a `target_word` that can be considered a context word. A window of 2 means we only look up to two words to the left and right, and so forth.**

In [2]:
skip_grams = []

for i, word in enumerate(sentence.split()):
    
    for j in range(i+1, min(i+3, len(sentence.split()))):
        skip_grams.append((word, sentence.split()[j]))

    for j in range(max(i-2, 0), i):
        skip_grams.append((word, sentence.split()[j]))

print(f"""First 10 skip_grams of window_size 2 in '{sentence}'.""")
print("Sentence size: ", len(sentence.split()), "\n")

for skip in skip_grams[:10]:
    print(skip)

First 10 skip_grams of window_size 2 in 'There is a missing word in this sentence.'.
Sentence size:  8 

('There', 'is')
('There', 'a')
('is', 'a')
('is', 'missing')
('is', 'There')
('a', 'missing')
('a', 'word')
('a', 'There')
('a', 'is')
('missing', 'word')


**In simple terms, the `skip-gram` model tries to guess the words that are likely to appear around a given word. The goal is to make the model good at predicting these surrounding words. This objective can be written as the average log probability:**

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq j \leq c, j\neq0} \log p(w_{t+j} | w_{t})$$


**where:**

- **$T$ is the total number of words in the training corpus.**
- **$c$ is the size of the context window.**
- **$w_t$ is the target word at position $t$ in the corpus.**
- **$w_{t+j}$ is the context word at position $t+j$ in the same context window.**
- **$p(w_{t+j} | w_{t})$ is the conditional probability of context word given the target word, which is estimated by the `skip-gram` model.**


**The `softmax` formulation for the `skip-gram` model can be written as:**

$$p(w_O | w_I) = \frac{\exp(v'_{w_O} \cdot v_{w_I})}{\sum_{w=1}^{W} \exp(v'_{w} \cdot v_{w_I})}$$

**where:**

- **$w_I$ is the input (target) word.**
- **$w_O$ is the output (context) word.**
- **$v_{w_I}$ and $v'_{w_O}$ are the input and output vector representations of words $w_I$ and $w_O$ respectively.**
- **$W$ is the size of the vocabulary of words.**
- **The dot (·) represents the dot product of two vectors.**

**The numerator computes the similarity between the input and output word vectors, using the dot product. The denominator is a normalization term, which sums up the similarities of the input word with all the words in the vocabulary. The resulting probability distribution is over all the words in the vocabulary, and is used to estimate the conditional probability of observing an output word given an input word.**
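
**The `softmax` above can be evaluated by hand on a tiny vocabulary. The sketch below is only an illustration: the four words and their 2-dimensional input and output vectors are made-up values, and `skipgram_softmax` is a hypothetical helper name.**

```python
import math

# Toy vocabulary of W = 4 words with 2-dimensional vectors (made-up values).
input_vectors = {            # v_w: used when the word is the input (target) word
    "king": [1.0, 0.5],
    "queen": [0.9, 0.6],
    "apple": [-0.8, 0.1],
    "pear": [-0.7, 0.2],
}
output_vectors = {           # v'_w: used when the word is the output (context) word
    "king": [0.8, 0.4],
    "queen": [1.0, 0.5],
    "apple": [-0.9, 0.0],
    "pear": [-0.6, 0.3],
}


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def skipgram_softmax(input_word):
    """Return p(w_O | w_I) for every candidate output word w_O."""
    v_in = input_vectors[input_word]
    scores = {w: dot(v_out, v_in) for w, v_out in output_vectors.items()}
    max_score = max(scores.values())  # subtracted for numerical stability
    exps = {w: math.exp(s - max_score) for w, s in scores.items()}
    total = sum(exps.values())        # the normalization term (denominator)
    return {w: e / total for w, e in exps.items()}


toy_probs = skipgram_softmax("king")
for w, p in sorted(toy_probs.items(), key=lambda item: -item[1]):
    print(f"p({w} | king) = {p:.3f}")
```

**The denominator loops over the whole vocabulary, which is cheap for four words but becomes costly for vocabularies with tens of thousands of entries.**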

**Computing the full `softmax` is expensive because the denominator sums over every word in the vocabulary. To make training more efficient, we can replace it with a cheaper objective based on noise contrastive estimation (`NCE`). `Word2vec` uses a simplified form of `NCE` known as negative sampling.**

**The idea behind negative sampling is to randomly select a few words that are not related to the target word and use them to train the model. The model learns to distinguish between the context word and these randomly selected words, which helps it to better understand the target word.**

**In this simplified approach, we select a few random words (called negative samples) and try to train the model to distinguish them from the context word. A negative sample is a pair of words where the context word is not near the target word. For example, if the target word is "_missing_" and the context window is two, then a negative sample could be "_algebra_" because "_algebra_" is not in the `window_size` neighborhood of "_missing_" in our sentence example.**
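
**As a rough illustration, the negative sampling objective for one (target, context) pair can be written as $-\log \sigma(v'_{w_O} \cdot v_{w_I}) - \sum_{k} \log \sigma(-v'_{w_k} \cdot v_{w_I})$, where the $w_k$ are the negative samples. The sketch below evaluates it on made-up 2-dimensional vectors; `negative_sampling_loss` is a hypothetical helper name, and this is not the loss used later in this notebook.**

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def negative_sampling_loss(target_vec, context_vec, negative_vecs):
    """Loss for one positive pair plus its negative samples."""
    # Reward a high score for the true context word...
    loss = -math.log(sigmoid(dot(context_vec, target_vec)))
    # ...and a low score for every randomly drawn negative word.
    for negative_vec in negative_vecs:
        loss -= math.log(sigmoid(-dot(negative_vec, target_vec)))
    return loss


toy_target = [1.0, 0.5]
toy_context = [0.9, 0.6]                     # points in a similar direction
toy_negatives = [[-0.8, 0.1], [-0.7, 0.2]]   # point elsewhere

good_loss = negative_sampling_loss(toy_target, toy_context, toy_negatives)
# Swap the roles: an unrelated vector as "context", the real one as a negative.
bad_loss = negative_sampling_loss(toy_target, toy_negatives[0],
                                  [toy_context, toy_negatives[1]])

print(f"well-aligned vectors: {good_loss:.3f}")
print(f"misaligned vectors:   {bad_loss:.3f}")
```

**Only `1 + k` dot products are needed per training pair, instead of one per vocabulary word.**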

**In practice, our model will not be working with words, but with tokens. Thus, let us create a tokenization dictionary for our custom sentence.**

In [3]:
vocab, index = {}, 1

vocab[''] = 0  # a padding token

for token in sentence.split():
  if token not in vocab:
    vocab[token] = index
    index += 1

vocab_size = len(vocab)
inverse_vocab = {index: token for token, index in vocab.items()}

print(vocab)
print(inverse_vocab)

print("Our tokenized sequence: ", [vocab[word] for word in sentence.split()])
print("Decoded sequence: ", [inverse_vocab[index] for index in \
                             [vocab[word] for word in sentence.split()]])

{'': 0, 'There': 1, 'is': 2, 'a': 3, 'missing': 4, 'word': 5, 'in': 6, 'this': 7, 'sentence.': 8}
{0: '', 1: 'There', 2: 'is', 3: 'a', 4: 'missing', 5: 'word', 6: 'in', 7: 'this', 8: 'sentence.'}
Our tokenized sequence:  [1, 2, 3, 4, 5, 6, 7, 8]
Decoded sequence:  ['There', 'is', 'a', 'missing', 'word', 'in', 'this', 'sentence.']


**We could use the for loops implemented in our second code cell to create `skip-grams`. However, there is no need to reinvent the wheel: the `tf.keras.preprocessing.sequence` module provides the `tf.keras.preprocessing.sequence.skipgrams` function, which does the same job as our nested for loops. Note that it returns the pairs in shuffled order by default.**

In [4]:
import tensorflow as tf

tokenized_sentence = [vocab[word] for word in sentence.split()]

positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
      tokenized_sentence,
      vocabulary_size=vocab_size,
      window_size=2,
      negative_samples=0)

for target, context in positive_skip_grams[:10]:
  print(f"({inverse_vocab[target]}, {inverse_vocab[context]})")


(in, this)
(this, word)
(word, missing)
(in, word)
(in, sentence.)
(a, missing)
(missing, in)
(is, a)
(sentence., in)
(this, sentence.)


**The `skipgrams` function looks for pairs of words that appear together within a certain window span. These pairs are called `positive skip-grams`.**

**However, we also need negative samples. As mentioned before, these are pairs of words that don't appear together. To create negative samples, we randomly choose words from the vocabulary that are not in the same window as the `positive skip-grams`.**

**To do this, we use a function called `tf.random.log_uniform_candidate_sampler`. This function randomly selects words from the vocabulary to create negative samples. We tell the function how many negative samples we want (`num_ns`), and also give it the positive skip-gram's target word and context word. The context word is marked as "_true_" so that it won't be chosen as a negative sample.**

In [5]:
# one positive skip-gram
target_word, context_word = positive_skip_grams[0]

# number of negative samples per positive skip-gram
num_ns = 4

context_class = tf.reshape(tf.constant(context_word, dtype="int64"), (1, 1))

negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
    true_classes=context_class, 
    num_true=1, 
    num_sampled=num_ns, 
    unique=True, 
    range_max=vocab_size, 
    seed=42,
    name="negative_sampling" 
)

print("Original Sentence: ", sentence)

print(f"Positive skip-grams: ({inverse_vocab[target_word]},{inverse_vocab[context_word]})")

print("Negative samples: ", [inverse_vocab[index.numpy()] for index in negative_sampling_candidates])

Original Sentence:  There is a missing word in this sentence.
Positive skip-grams: (in,this)
Negative samples:  ['a', 'There', 'word', 'is']
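
**The sampler used above does not draw words uniformly. According to the TensorFlow documentation, its base distribution is approximately log-uniform (Zipfian): $P(\text{class}) = \frac{\log(\text{class} + 2) - \log(\text{class} + 1)}{\log(\text{range\_max} + 1)}$. The sketch below simply evaluates this formula in plain Python; `log_uniform_probability` is a hypothetical helper name, not a TensorFlow function.**

```python
import math


def log_uniform_probability(class_id, range_max):
    """Base probability of drawing `class_id` from the range [0, range_max)."""
    return (math.log(class_id + 2) - math.log(class_id + 1)) / math.log(range_max + 1)


toy_range_max = 9  # size of the toy vocabulary above (8 words + padding token)
toy_sampling_probs = [log_uniform_probability(c, toy_range_max)
                      for c in range(toy_range_max)]

for class_id, p in enumerate(toy_sampling_probs):
    print(f"token {class_id}: {p:.3f}")
```

**Low token indices are drawn more often than high ones, so this sampler fits best when the vocabulary is sorted by decreasing word frequency.**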


**Now that we have both positive and negative samples, we can put them together to create a set of training examples. For each positive skip-gram pair (`target_word`, `context_word`), we also have `num_ns` negative samples (words that don't appear in the same window).**

**We group these positive and negative samples together into a single set. Each positive sample is labeled as 1 and each negative sample is labeled as 0.**

**So, for every target word, we end up with a set of positive skip-grams and negative samples that can be used to train the model.**

In [6]:
context = tf.concat([tf.squeeze(context_class, 1), negative_sampling_candidates], 0)
label = tf.constant([1] + [0]*num_ns, dtype="int64")
target = target_word

print(f"""
One training sample: {{
target token    : {target}
target word     : {inverse_vocab[target_word]}
context tokens : {context}
context words   : {[inverse_vocab[c.numpy()] for c in context]}
labels           : {label}
}}
""")


One training sample: {
target token    : 6
target word     : in
context tokens : [7 3 1 5 2]
context words   : ['this', 'a', 'There', 'word', 'is']
labels           : [1 0 0 0 0]
}



**When we have a large dataset, we also have a lot of words to work with. Some words, like "_the_", "_is_", and "_on_", appear very frequently and don't provide much useful information to the model.
To deal with this, we can remove some of these very frequent words from the training data.**

**The `tf.keras.preprocessing.sequence.skipgrams` function can subsample these frequent words if we give it a sampling table: a list in which entry *i* is the probability of keeping the word with index *i*. To create this table, we can use the `tf.keras.preprocessing.sequence.make_sampling_table` function. It does not count words in our data. Instead, it builds the table from word ranks, assuming that index *i* belongs to the *i*-th most frequent word and that word frequencies roughly follow Zipf's law.**
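
**To see the idea behind subsampling, the sketch below applies the rule from the word2vec papers, where a word with relative frequency $f(w)$ is kept with probability $\min(1, \sqrt{t / f(w)})$ for a small threshold $t$. This is an assumption-laden illustration, not the Keras implementation: it counts words directly, whereas the Keras helper works from ranks. The helper name `keep_probabilities`, the toy corpus, and the unusually large threshold (chosen because the corpus is tiny) are all hypothetical.**

```python
import math
from collections import Counter


def keep_probabilities(tokens, threshold=1e-5):
    """Probability of keeping each word: min(1, sqrt(threshold / frequency))."""
    counts = Counter(tokens)
    total = len(tokens)
    return {
        word: min(1.0, math.sqrt(threshold / (count / total)))
        for word, count in counts.items()
    }


# A toy "corpus" of 100 tokens with a skewed frequency distribution.
toy_corpus = ["the"] * 50 + ["is"] * 30 + ["senate"] * 15 + ["filibuster"] * 5

keep = keep_probabilities(toy_corpus, threshold=0.1)
for word, p in keep.items():
    print(f"{word:<12} keep probability = {p:.3f}")
```

**Frequent words such as "_the_" are dropped often, while rare words are always kept.**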

**Now that we have described all the necessary steps to preprocess text data for training word embeddings using the `skip-gram` model, we can compile them into a function. Once this function is defined, we can use it in the later sections to preprocess our text data and prepare it for training our `word2vec` model.**

In [7]:
import tqdm

def generate_training_data(sequences, window_size, num_ns, vocab_size, seed):
  """
    Generate training data for a skip-gram model using negative sampling.

    Args:
        sequences: A list of sequences, where each sequence is a list of integers 
        representing words. 
        window_size: An integer, the size of the window for generating skip-grams.
        num_ns: An integer, the number of negative samples to use for each positive sample.
        vocab_size: An integer, the size of the vocabulary.
        seed: An integer, the random seed to use for sampling.

    Returns:
        Three lists: targets, contexts, and labels. 
        Targets is a list of integers representing target words, contexts is a list of lists 
        of integers representing context words and negative samples, and labels is a list of 
        lists of integers representing the labels for each context. Specifically, each label
        list has a 1 in the first position (representing the positive sample) and 0s in the 
        remaining positions (representing the negative samples).
  """
  targets, contexts, labels = [], [], []

  sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(vocab_size)

  for sequence in tqdm.tqdm(sequences):

    positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
          sequence,
          vocabulary_size=vocab_size,
          sampling_table=sampling_table,
          window_size=window_size,
          negative_samples=0)

    for target_word, context_word in positive_skip_grams:

      context_class = tf.expand_dims(tf.constant([context_word], dtype="int64"), 1)

      negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
          true_classes=context_class,
          num_true=1,
          num_sampled=num_ns,
          unique=True,
          range_max=vocab_size,
          seed=seed,
          name="negative_sampling")

      context = tf.concat([tf.squeeze(context_class,1), negative_sampling_candidates], 0)
      label = tf.constant([1] + [0]*num_ns, dtype="int64")

      targets.append(target_word)
      contexts.append(context)
      labels.append(label)

  return targets, contexts, labels

**Now, we need some text. For this, we will use the [News Category Dataset](https://www.kaggle.com/datasets/rmisra/news-category-dataset?resource=download). The original dataset comes as an unformatted JSON file. However, you can download a properly formatted version (in a `pickle` format) on [this link](https://drive.google.com/uc?export=download&id=1EO_DBb-trFK-HWBWRMRhAKOnyMtf5GLL). Nonetheless, all credits go to the authors of the original dataset:**

```markdown
@article{misra2022news,
  title={News Category Dataset},
  author={Misra, Rishabh},
  journal={arXiv preprint arXiv:2209.11429},
  year={2022}
}
```

**After downloading the dataset, run the cell below to create a dataset folder.**

In [9]:
import os
import pickle
import pandas as pd

# the `with` block closes the file automatically
with open('News_Category_Dataset_v3.pickle', 'rb') as fp:
    news = pickle.load(fp)

texts = []
category = []

# We are uniting the title and abstract to create a single string.
for i in range(len(news['news'])):
    texts.append(
        f"""{news['news'][i]['headline']} {news['news'][i]['short_description']}""")
    category.append(news['news'][i]['category'])

# Saving strings with their associated category 
# in case we want to perform text classification.
df = pd.DataFrame({
    "category": category,
    "texts": texts
})

df.to_csv('News_Category_Dataset_v3.csv', index=False)


# Check whether the specified path exists or not
if not os.path.exists("dataset/"):

    # Create a new directory because it does not exist
    os.makedirs("dataset/")

    # saving all text samples as txt files, in different 
    # news category folders
    for category in tqdm.tqdm(df.category.unique()):
        os.mkdir(f"dataset/{category}")
        dff = df[df['category'] == category]

        for i, sample in enumerate(list(dff.texts)):
            with open(f'dataset/{category}/{i}.txt', 'w', encoding='utf-8') as fp:
                fp.write(sample)

    print('Dataset Folder Created!')

else:
    print('Dataset already exists!')

Dataset Folder Created!


**This dataset contains a lot of text (42 directories with 209,527 files). However, we will use only a portion of it (a little more than 35K samples).**

In [9]:
import os

directories = ["dataset/POLITICS"]

filenames = []

for directory in directories:
    for folder in os.listdir(directory):
        filenames.append(os.path.join(directory, folder))

print(f"Using {len(directories)} directories.")
print(f"Found {len(filenames)} files.")

Using 1 directories.
Found 35602 files.


**All of the found files are `txt` files with some text about some news topic.**

**Now, let us shuffle the order of our samples, and create a dataset using the `tf.data.TextLineDataset`, which loads text from text files and creates a dataset where each line of the files becomes an element of the dataset.**

In [10]:
import random
import tensorflow as tf

random.shuffle(filenames)

text_ds = tf.data.TextLineDataset(filenames)
text_ds = text_ds.batch(1024)

**Now, we use the `tf.keras.layers.TextVectorization` layer, passing a `custom_standardization` function that lowercases strings and strips punctuation, to create a vectorization layer. Then we adapt the `TextVectorization` layer to our dataset and get our vocabulary out of it.**

In [11]:
import re
import string

# Lowercase all strings and strip punctuation and symbols
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  return tf.strings.regex_replace(lowercase,
                                  '[%s]' % re.escape(string.punctuation), '')

# Maximum vocabulary size; sequences are padded or truncated to 100 tokens
vocab_size = 10000
sequence_length = 100

# Create a vectorization layer and adapt it to the text
vectorize_layer = tf.keras.layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=sequence_length)

# Fit the TextVectorization layer to the dataset
vectorize_layer.adapt(text_ds)

# Get the learned vocabulary (position in the list = token index)
word2vec_vocabulary = vectorize_layer.get_vocabulary()

# Save the vocabulary as a text file, one word per line
with open('word2vec_vocabulary.txt', 'w', encoding='utf-8') as fp:
    for word in word2vec_vocabulary:
        fp.write("%s\n" % word)

**You can use the `vectorize_layer` to create a set of tokenized sequences that represent each piece of text in our `text_ds`. Then, we can apply some operations like `Dataset.batch`, `Dataset.prefetch`, `Dataset.map`, and `Dataset.unbatch` to the collection of text data to make it easier to process. These operations help to group the data into smaller parts, optimize the processing order, and transform it into a format that can be fed into our `word2vec` model.**

**In order to get a dataset ready for training a `word2vec` model, you need to convert the dataset into a list of sequences, where each sequence represents a sentence and is made up of numbers that correspond to the words in the sentence. This is necessary because when you train a `word2vec` model, you need to go through each sentence in the dataset and use it to create both positive and negative examples for the model.**
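
**As a rough illustration of what this conversion produces, the sketch below mimics the vectorization steps in plain Python: standardize, build a frequency-sorted vocabulary, map words to indices, and pad or truncate. The function names and the two example headlines are hypothetical, and the real `TextVectorization` layer handles many more details. The convention of reserving index 0 for padding and index 1 for unknown words follows the Keras layer.**

```python
import re
import string
from collections import Counter


def standardize(text):
    """Lowercase and strip punctuation, like custom_standardization."""
    return re.sub("[%s]" % re.escape(string.punctuation), "", text.lower())


def build_vocabulary(texts, max_tokens):
    """Most frequent words first; index 0 is padding, index 1 is unknown."""
    counts = Counter(word for text in texts for word in standardize(text).split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common


def vectorize(text, vocabulary, sequence_length):
    """Map words to indices, then truncate or pad to a fixed length."""
    index = {word: i for i, word in enumerate(vocabulary)}
    ids = [index.get(word, 1) for word in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


toy_texts = ["Senate passes the bill.", "The bill stalls in the Senate!"]
toy_vocabulary = build_vocabulary(toy_texts, max_tokens=6)

print(toy_vocabulary)
print(vectorize("The economy!", toy_vocabulary, sequence_length=5))  # [2, 1, 0, 0, 0]
```

**Here "_the_" gets index 2 because it is the most frequent word, "_economy_" maps to the unknown token, and the rest of the sequence is padding.**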

In [12]:
AUTOTUNE = tf.data.AUTOTUNE
text_vector_ds = text_ds.prefetch(AUTOTUNE).map(vectorize_layer).unbatch()

sequences = list(text_vector_ds.as_numpy_iterator())
print("We have ", len(sequences), " sequences.")

We have  35626  sequences.


**We now have a list called `sequences` which contains sentences that have been turned into sets of numbers/tokens. We can now use the `generate_training_data` function that we defined earlier to create training examples for our `word2vec` model.**

**Basically, this function looks at each word in each sentence and uses them to create examples that will teach the model how to predict words that are related to each other. The function creates three lists - target words, context words, and labels - and each list has the same number of items, which represents the total number of examples that the model will be trained on.**

**When training a `word2vec` model, there are two important things to consider: how big the `window_size` of words is that you're looking at, and how many negative samples (`num_ns`) you're including.** 

**Depending on what you're trying to accomplish, different window sizes can be more useful. Generally, smaller window sizes (2-15) give you embeddings where a high similarity score indicates that two words are interchangeable, that is, they can fill the same slot in a sentence. Note that this includes antonyms: words like "_good_" and "_bad_" often appear in very similar contexts. Larger window sizes (15-50 or more) give you embeddings where words that are related, but not necessarily interchangeable, have higher similarity scores.**

**For a more complete explanation of the effect `window size` has, [watch this video](https://www.youtube.com/watch?v=tAxrlAVw-Tk&t=648s).**

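**As for `num_ns`, the `word2vec` authors suggest 5-20 negative samples for small datasets and 2-5 for large ones. This recommendation comes from their follow-up paper, *Distributed Representations of Words and Phrases and their Compositionality*, which introduced negative sampling. The [original paper](https://arxiv.org/abs/1301.3781) introduced the `CBOW` and `skip-gram` architectures.**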

In [13]:
import numpy as np

window_size = 2
num_ns = 4

targets, contexts, labels = generate_training_data(
    sequences=sequences,
    window_size=window_size,
    num_ns=num_ns,
    vocab_size=vocab_size,
    seed=42)

targets = np.array(targets)
contexts = np.array(contexts)
labels = np.array(labels)

print(f"Targets: {targets.shape}")
print(f"Contexts: {contexts.shape}")
print(f"Labels: {labels.shape}")

100%|██████████| 35626/35626 [02:22<00:00, 249.60it/s]


Targets: (636101,)
Contexts: (636101, 5)
Labels: (636101, 5)


**Depending on the chosen `window_size` and `num_ns`, the generation of our dataset can take a while. However, you can download `word2vec_vocabulary` and the `targets`, `contexts`, and `labels` files directly with [this link](https://drive.google.com/uc?export=download&id=1KtaBxlGb8y4Do9jjbYSuVnrwLZESJk_Q). We created two 10,000-word word2vec vocabularies using all of the sections of our dataset listed below:**

```python
["dataset/POLITICS",
 "dataset/WORLD NEWS",
 "dataset/ENTERTAINMENT",
 "dataset/ENVIRONMENT",
 "dataset/EDUCATION",
 "dataset/SCIENCE",
 "dataset/WELLNESS"]
```

**We also created two different sets of [`targets`, `contexts`, `labels`]: one with a `window_size` of 2, and the other with a `window_size` of 15. You can compare them to see how the increase in text data and `window_size` affects the `word2vec` model.**

**An alternative dataset, paired with vocabularies and word embeddings, is also available in Portuguese. To train it, we used the [`Fake.Br Corpus`](https://github.com/roneysco/Fake.br-Corpus).**

**Now, let us load one of our training datasets.**

In [38]:
import numpy as np

window_size = 2
num_ns = 4

# the three arrays are stored one after another in the same file
with open(f'data/w2v_dataset_w{window_size}_nn4.npy', 'rb') as fp:
    targets = np.load(fp)
    contexts = np.load(fp)
    labels = np.load(fp)

with open('data/word2vec_vocabulary.txt', encoding='utf-8') as fp:
    word2vec_vocabulary = [line.strip() for line in fp]

print(f"Targets: {targets.shape}")
print(f"Contexts: {contexts.shape}")
print(f"Labels: {labels.shape}")
print(f"Vocabulary Size: {len(word2vec_vocabulary)}")

Targets: (1569949,)
Contexts: (1569949, 5)
Labels: (1569949, 5)
Vocabulary Size: 10000


**When you have a lot of training examples, it can be difficult for a computer to process them all at once. To make it easier, we can use the `tf.data.Dataset` API again and group the examples into smaller batches, which can be processed more efficiently.**

In [39]:
dataset = tf.data.Dataset.from_tensor_slices(((targets, contexts), labels))
dataset = dataset.shuffle(10000).batch(1024, drop_remainder=True)
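
**For intuition, the shuffle-and-batch step can be sketched in plain Python. This is a simplification under stated assumptions: `shuffle_and_batch` is a hypothetical helper that shuffles the whole list in memory, whereas `Dataset.shuffle(10000)` shuffles within a buffer of 10,000 elements. Dropping the last incomplete batch mirrors `drop_remainder=True`.**

```python
import random


def shuffle_and_batch(examples, batch_size, seed=0):
    """Shuffle the examples and yield full batches only."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    for start in range(0, len(examples) - batch_size + 1, batch_size):
        yield examples[start:start + batch_size]


# Ten toy examples shaped like ((target, context), label).
toy_examples = [((t, [t + 1, t + 2]), [1, 0]) for t in range(10)]
toy_batches = list(shuffle_and_batch(toy_examples, batch_size=4))

print(len(toy_batches))               # 2 full batches; 2 leftover examples are dropped
print([len(b) for b in toy_batches])  # [4, 4]
```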

**As said before, the `word2vec` model is a tool that can help us tell which words go together by looking at how often they appear near each other in sentences. It does this by comparing the meanings of different words and figuring out which ones are similar.**

**To train the model, we can give it pairs of words and ask it to predict whether they belong together or not. We can check if the model is correct by comparing its predictions to the actual pairs of words that we already know go together. The model gets better over time as it learns from more and more examples of word pairs.**

**To create your `word2vec` model, you can use the `Keras Subclassing API` with two embedding layers and a dot product.**

- **The first layer is the `target_embedding` layer, which looks up the embedding of a word when it appears as a target. Its number of parameters is `vocab_size * embedding_dim`.**
- **The second layer is the `context_embedding` layer, which looks up the embedding of a word when it appears as a context word (positive or negative). It has the same number of parameters as the `target_embedding` layer.**
- **The dot product between the `target` embedding and each `context` embedding is computed with `tf.einsum`. This directly yields one score (logit) per candidate context word, with shape `(batch_size, num_ns + 1)`, so no separate flatten step is needed.**

**You can then define a `call()` function that takes a pair (`target`, `context`), passes each element through its embedding layer, and returns these dot products as logits.**

In [40]:
import tensorflow as tf

class Word2Vec(tf.keras.Model):
  
  def __init__(self, vocab_size, embedding_dim):

    super(Word2Vec, self).__init__()
    self.target_embedding = tf.keras.layers.Embedding(vocab_size,
                                      embedding_dim,
                                      input_length=1,
                                      name="w2v_target_embedding")
    self.context_embedding = tf.keras.layers.Embedding(vocab_size,
                                       embedding_dim,
                                       input_length=num_ns+1,
                                       name="w2v_context_embedding")

  def call(self, pair):

    target, context = pair

    if len(target.shape) == 2:
      target = tf.squeeze(target, axis=1)

    word_emb = self.target_embedding(target)

    context_emb = self.context_embedding(context)

    dots = tf.einsum('be,bce->bc', word_emb, context_emb)

    return dots
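
**The `einsum` pattern `'be,bce->bc'` used in `call()` can be hard to read at first. The sketch below reproduces it with plain Python lists on made-up numbers: for each of the `b` examples in a batch, it takes the dot product between the target vector (size `e`) and each of the `c` candidate context vectors. The function name `score_contexts` is hypothetical.**

```python
def score_contexts(target_embeddings, context_embeddings):
    """Pure-Python equivalent of the einsum pattern 'be,bce->bc'.

    target_embeddings:  b vectors, each of size e
    context_embeddings: b lists, each holding c vectors of size e
    returns:            b lists of c dot products (one logit per candidate)
    """
    return [
        [sum(t * c for t, c in zip(target, candidate)) for candidate in candidates]
        for target, candidates in zip(target_embeddings, context_embeddings)
    ]


# b = 2 examples, c = 3 candidates (1 positive + 2 negatives), e = 2 dimensions
toy_target_emb = [[1.0, 0.0], [0.0, 2.0]]
toy_context_emb = [
    [[1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]],
    [[1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]],
]

toy_logits = score_contexts(toy_target_emb, toy_context_emb)
print(toy_logits)  # [[1.0, 0.0, -1.0], [2.0, 2.0, 0.0]]
```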

**Since each label vector contains a single 1 (the positive context word) followed by `num_ns` zeros, our labels are effectively one-hot encoded. We will therefore use `CategoricalCrossentropy` (computed from logits) as an alternative to the negative sampling loss, and `Adam` as the optimizer. Now, we instantiate our `Word2Vec` class with an embedding dimension of 512 and a vocabulary size of 10,000 words. If you don't want to train this model, you can get the embeddings directly when you download [the available dataset for this notebook](https://drive.google.com/uc?export=download&id=1KtaBxlGb8y4Do9jjbYSuVnrwLZESJk_Q).**

In [41]:
vocab_size = 10000
embedding_dimension = 512

word2vec = Word2Vec(vocab_size, embedding_dimension)

word2vec.compile(optimizer='adam',
                 loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
                 metrics=['accuracy'])

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")

word2vec.fit(dataset, verbose=1, epochs=20)

Version:  2.10.1
Eager mode:  True
GPU is available
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x213914bd070>
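
**To get a feeling for the loss values reported during training, the categorical cross-entropy for a single example can be computed by hand. The sketch below is a plain-Python illustration with made-up logits. `cross_entropy_from_logits` is a hypothetical helper, not the Keras implementation.**

```python
import math


def cross_entropy_from_logits(logits, one_hot_label):
    """Softmax cross-entropy for a single example, computed from raw logits."""
    max_logit = max(logits)  # subtracted for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    log_probs = [z - log_sum_exp for z in logits]
    return -sum(y * lp for y, lp in zip(one_hot_label, log_probs))


toy_label = [1, 0, 0, 0, 0]  # positive context first, then num_ns = 4 negatives

confident = cross_entropy_from_logits([4.0, -1.0, -1.0, -1.0, -1.0], toy_label)
uninformed = cross_entropy_from_logits([0.0, 0.0, 0.0, 0.0, 0.0], toy_label)

print(f"confident and correct: {confident:.4f}")
print(f"all logits equal:      {uninformed:.4f}")
```

**A model that gives the same score to all five candidates has a loss of $\log 5$, so values well below that suggest the model is separating positive context words from negative samples.**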

**Now, we can recover the embeddings from both the `target` and `context` embedding layers. These embeddings will now hold some information about the relationship of words in our text corpus.**

In [42]:
# extract the embedding layer
embeddings_target = word2vec.get_layer('w2v_target_embedding').get_weights()[0]
embeddings_context = word2vec.get_layer('w2v_context_embedding').get_weights()[0]

# save the embeddings as a numpy array
with open('data/w2v_embeddings_w2.npy', 'wb') as fp:
    np.save(fp, embeddings_target)
    np.save(fp, embeddings_context)

**Again, you can directly load them if you downloaded our dataset.**

In [47]:
with open('data/w2v_embeddings_w15.npy', 'rb') as fp:
    embeddings_target = np.load(fp)
    embeddings_context = np.load(fp)

with open('data/word2vec_vocabulary.txt', encoding='utf-8') as fp:
    word2vec_vocabulary = [line.strip() for line in fp]

print(f"Target Embeddings shape: {embeddings_target.shape}")
print(f"Context Embeddings shape: {embeddings_context.shape}")
print(f"Vocabulary Size: {len(word2vec_vocabulary)}")

Target Embeddings shape: (10000, 512)
Context Embeddings shape: (10000, 512)
Vocabulary Size: 10000


**Now, we can pair our embeddings with our vocabulary.** 

In [48]:
# create a dictionary of "word: embedding"
word2vec_target_embeddings = {}
word2vec_context_embeddings = {}
 
# iterating through the elements of the vocabulary
for i, word in enumerate(word2vec_vocabulary):
    # here we skip the embedding/token 0 (""), because it is just the PAD token.
    if i == 0:
        continue
    word2vec_target_embeddings[word] = embeddings_target[i]
    word2vec_context_embeddings[word] = embeddings_context[i]

**Finally, we can perform some basic operations (`cosine similarity`) to try to understand and interpret what our model has learned, both for the `target` and `context` embeddings. While `target embeddings` hold information on "_relatedness among words_", `context embeddings` hold information on "_what words usually accompany the target word_".**
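
**Before applying it to the trained embeddings, it may help to see cosine similarity on a toy example. The three 3-dimensional vectors below are made-up values chosen for illustration. They are not taken from the trained model.**

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


toy_embeddings = {
    "senate": [0.9, 0.1, 0.0],
    "congress": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

for other in ("congress", "banana"):
    score = cosine_similarity(toy_embeddings["senate"], toy_embeddings[other])
    print(f"senate vs {other}: {score:.3f}")
```

**Vectors pointing in a similar direction score close to 1, regardless of their length, while unrelated directions score close to 0.**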

In [49]:
import pandas as pd
from numpy.linalg import norm
from IPython.display import Markdown 

def compute_cosine_table(string, dictionary, 
                         vocabulary, top_n):
    """
    Computes the cosine similarity between a given word and all other words in a dictionary.
    
    Parameters:
    -----------
    string : str
        The word to compare against.
    dictionary : dict
        A dictionary with words as keys and their corresponding embeddings as values.
    vocabulary : list
        A list of words in the dictionary.
    top_n : int
        The number of closest matches to return.
    
    Returns:
    --------
    A pandas DataFrame with the closest matches to the input word and their 
    corresponding similarity scores. The DataFrame is sorted in descending 
    order of similarity score and limited to the top_n matches.
    The index of the DataFrame is set to the closest matches.
    """

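    # Copy the vocabulary and drop the query word so it is not matched with itself
    candidates = vocabulary.copy()
    candidates.remove(string)
    # Skip index 0, the padding token (""), which has no entry in the dictionary
    candidates = candidates[1:]

    query = dictionary[string]

    cos = []
    for word in candidates:
        cosine = np.dot(query, dictionary[word]) / (norm(query) * norm(dictionary[word]))
        cos.append(cosine)

    return pd.DataFrame({"Closest Match": candidates, "Similarity Score": cos})\
        .sort_values("Similarity Score", ascending=False)\
        .set_index('Closest Match').head(top_n)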

word = "trump"

df = compute_cosine_table(word, 
        word2vec_target_embeddings, 
        word2vec_vocabulary, 10)

print("Cosine Similarity for Target Embeddings")
display(Markdown(df.to_markdown()))

df = compute_cosine_table(word, 
        word2vec_context_embeddings, 
        word2vec_vocabulary, 10)

print("Cosine Similarity for Context Embeddings")
display(Markdown(df.to_markdown()))

Cosine Similarity for Target Embeddings


| Closest Match   |   Similarity Score |
|:----------------|-------------------:|
| donald          |           0.194642 |
| counsel         |           0.194392 |
| mexicos         |           0.192862 |
| insider         |           0.184099 |
| arpaio          |           0.183056 |
| pence           |           0.181958 |
| white           |           0.178896 |
| trevor          |           0.17812  |
| trumps          |           0.177152 |
| scarborough     |           0.176633 |

Cosine Similarity for Context Embeddings


| Closest Match   |   Similarity Score |
|:----------------|-------------------:|
| donald          |           0.67039  |
| on              |           0.4248   |
| with            |           0.400963 |
| his             |           0.398482 |
| tweeters        |           0.388154 |
| trumps          |           0.386425 |
| ¯ツ¯            |           0.385741 |
| president       |           0.381984 |
| he              |           0.376554 |
| taunts          |           0.367066 |
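
**For larger vocabularies, the word-by-word loop can be replaced by a single matrix product. The sketch below is one possible alternative, not the function used above: `top_matches` and `toy_embeddings` are hypothetical names, the vectors are made up, and it assumes every embedding is a 1-D array of the same length. Unlike `compute_cosine_table`, it returns a plain list of `(word, score)` pairs and compares against every word except the query.**

```python
import numpy as np

def top_matches(query, embeddings, top_n=3):
    """Rank every other word by cosine similarity to `query` using one matrix product."""
    words = [w for w in embeddings if w != query]
    matrix = np.stack([embeddings[w] for w in words])   # shape: (n_words, dim)
    q = embeddings[query]
    # Dot product of each row with q, divided by the product of the norms.
    scores = matrix @ q / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q))
    order = np.argsort(-scores)[:top_n]                 # highest score first
    return [(words[i], float(scores[i])) for i in order]

# Hypothetical 2-D embeddings, made up for illustration only.
toy_embeddings = {
    "cat": np.array([1.0, 0.1]),
    "dog": np.array([0.9, 0.2]),
    "car": np.array([0.1, 1.0]),
    "bus": np.array([0.2, 0.9]),
}

matches = top_matches("cat", toy_embeddings, top_n=2)
print(matches)  # "dog" ranks first, followed by "bus"
```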

**We can also perform basic arithmetic operations with these vector embeddings, which is another way to try to understand the knowledge they hold.**

In [50]:
def find_closest_match(array, dictionary, vocabulary, 
                           word1, word2, top_n):
    """
    Computes the cosine similarity between a given array and all other word 
    embeddings in a dictionary except for two specified words.
    
    Parameters:
    -----------
    array : numpy.ndarray
        An array representing the embedding of a word or phrase.
    dictionary : dict
        A dictionary with words as keys and their corresponding embeddings as values.
    vocabulary : list
        A list of words in the dictionary.
    word1 : str
        The first word to exclude from the matches.
    word2 : str
        The second word to exclude from the matches.
    top_n : int
        The number of closest matches to return.
    
    Returns:
    --------
        A pandas DataFrame with the closest matches to the input array and
        their corresponding similarity scores. The DataFrame is sorted in 
        descending order of similarity score and limited to the top_n matches.
        The index of the DataFrame is set to the closest matches.
    """

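    # Copy the vocabulary so the caller's list is not mutated, then drop both input words.
    candidates = vocabulary.copy()
    candidates.remove(word1)
    candidates.remove(word2)
    # Skip the first remaining entry of the vocabulary list.
    candidates = candidates[1:]

    # Cosine similarity between the query array and every remaining embedding.
    cos = []
    for word in candidates:
        cosine = np.dot(array, dictionary[word]) / (norm(array) * norm(dictionary[word]))
        cos.append(cosine)

    return pd.DataFrame({"Closest Match": candidates, "Similarity Score": cos})\
        .sort_values("Similarity Score", ascending=False)\
        .set_index("Closest Match").head(top_n)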

word1 = 'man'
word2 = 'music'

sum_vec = word2vec_target_embeddings[word1] + word2vec_target_embeddings[word2]

df = find_closest_match(sum_vec, word2vec_target_embeddings, 
                           word2vec_vocabulary, word1, word2, 5)

display(Markdown(df.to_markdown()))

| Closest Match   |   Similarity Score |
|:----------------|-------------------:|
| jazz            |           0.214929 |
| christina       |           0.188508 |
| duff            |           0.185525 |
| peek            |           0.184137 |
| centers         |           0.181551 |

**Apparently, "_man_" + "_music_" gets us close to "_jazz_" 🎶🎷.**
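
**The reason `find_closest_match` excludes `word1` and `word2` can be seen on a toy example: the sum of two vectors often stays closer to its own inputs than to any other word. The sketch below uses made-up 3-D vectors and a hypothetical helper, `rank_by_cosine`. None of these values come from the trained model.**

```python
import numpy as np

# Hypothetical 3-D embeddings, made up for illustration only.
toy = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "music": np.array([0.0, 1.0, 0.0]),
    "jazz":  np.array([0.4, 0.4, 0.8]),
    "car":   np.array([0.0, 0.1, 1.0]),
}

def rank_by_cosine(vector, embeddings, exclude=()):
    """Sort words by cosine similarity to `vector`, skipping the words in `exclude`."""
    scores = {}
    for word, emb in embeddings.items():
        if word in exclude:
            continue
        scores[word] = float(
            np.dot(vector, emb) / (np.linalg.norm(vector) * np.linalg.norm(emb))
        )
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

query = toy["man"] + toy["music"]

with_inputs = rank_by_cosine(query, toy)
without_inputs = rank_by_cosine(query, toy, exclude=("man", "music"))

print(with_inputs[0][0])     # one of the two input words ranks first
print(without_inputs[0][0])  # "jazz" once the inputs are excluded
```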

**If you want to learn more about exploring `word embeddings`, such as projecting them into a 3D space, go to [`teeny-tiny_castle/ML Explainability/NLP Interpreter/`](https://github.com/Nkluge-correa/teeny-tiny_castle/tree/master/ML%20Explainability/NLP%20Interpreter) and check the [`investigating_word_embeddings`](https://github.com/Nkluge-correa/teeny-tiny_castle/blob/f68411d51bb6c9bbc877a084d4233218d09acbf7/ML%20Explainability/NLP%20Interpreter/investigating_word_embeddings.ipynb) notebook.**

---

Return to the [castle](https://github.com/Nkluge-correa/teeny-tiny_castle).