<a href="https://colab.research.google.com/github/GuilhermePascon/ml/blob/main/Lecture_14___word2vec_Hands_On.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

###  MC959-MO810 -- Introduction to Self-Supervised (Representation) Learning (SSRL)

* Instructor: Marcelo S. Reis. <a href="mailto:msreis@unicamp.br">msreis@unicamp.br</a>

Campinas, September 30, 2024.


## Lecture 14: word2vec Hands On

*Based on the [word2vec-pytorch](https://github.com/OlgaChernytska/word2vec-pytorch) implementation by Chernytska.*

In this lab, we will carry out an SSRL pipeline using the classic word2vec method. The original word2vec paper proposes two simple models whose pretext task is either the prediction of the center word (continuous bag-of-words, CBOW) or the prediction of the neighbor words (skip-gram). In skip-gram, we feed the model with a pair $(X,Z)$ of words, where $Z = w(i)$ is a neighbor word of $X = w(t)$ (either one of the $w(t-N), \ldots, w(t-1)$ words in the history or one of the $w(t+1), \ldots, w(t+N)$ words in the future), and the pretext task is to predict the neighbor words of a given $X$. An example for $k = 2$:

<img src="http://www.ic.unicamp.br/~msreis/Skip-gram.png">
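
As a rough illustration of how the $(X,Z)$ pairs arise, the sketch below enumerates them for a short sentence. The function name `skipgram_pairs` and the example sentence are hypothetical, and the sketch also emits pairs for words near the sentence borders, whereas the `collate_skipgram` function used later in this notebook only takes center words that have a full window on both sides.

```python
def skipgram_pairs(tokens, n):
    """Return (center, neighbor) pairs using up to n words on each side."""
    pairs = []
    for t, center in enumerate(tokens):
        lo = max(0, t - n)                 # first index of the history
        hi = min(len(tokens), t + n + 1)   # one past the last index of the future
        for i in range(lo, hi):
            if i != t:                     # the center word is not its own neighbor
                pairs.append((center, tokens[i]))
    return pairs


# Hypothetical example sentence with a window of 2 words on each side.
sentence = "the quick brown fox jumps".split()
pairs = skipgram_pairs(sentence, n=2)

for x, z in pairs:
    print(x, "->", z)
```

For the center word "brown", the pairs are ("brown", "the"), ("brown", "quick"), ("brown", "fox") and ("brown", "jumps").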

In this lab, we'll implement the word2vec model in PyTorch using the skip-gram pretext task. We'll fully pretrain that model, visualize the resulting embeddings with a dimensionality reduction technique called t-SNE, and verify the relationships between words through algebraic operations on their vectors.

### Summary <a class="anchor" id="topo"></a>

* [Part 1: Solving main dependencies](#part_01).
* [Part 2: Setting up the pipeline](#part_02).
* [Part 3: Loading the WikiText-2 dataset and preprocessing text](#part_03).
* [Part 4: Implementing the word2vec skip-gram model and trainer](#part_04).
* [Part 5: Pretraining of our word2vec skip-gram model](#part_05).
* [Part 6: Visualizing the yielded embeddings with t-SNE](#part_06).
* [Part 7: Algebraic checking of similar words](#part_07).
* [Part 8: Suggestions of exercises on this pipeline](#part_08).


### Part 1: Solving main dependencies <a class="anchor" id="part_01"></a>

Here we load the main libraries that will be used throughout this notebook.



In [None]:
import numpy as np

# Install (if needed) and import torchtext, a package for NLP.
#
# https://pytorch.org/text/stable/index.html
# https://github.com/pytorch/text#installation
#
#!pip install torch==2.3.0
#!pip install torchtext==0.18.0

import torch
import torch.nn as nn
import torch.nn.functional as F    # to import the cosine similarity and cross entropy
import torch.optim as optim
import torchtext

print(torch.__version__)
print(torchtext.__version__)



### Part 2: Setting up the pipeline <a class="anchor" id="part_02"></a>

Initialization of pseudo-random number generator seed, paths, and so forth.



In [None]:
# Global seed (useful for reproducibility of our pipeline)
#
torch.manual_seed(46)

# The following three variables are hyperparameters of the
# pretext task.

# Dimension of the embeddings (word vectors).
#
EMBED_DIMENSION = 300

# Minimum frequency to allow a word to be included into vocabulary.
#
MIN_WORD_FREQUENCY = 50

# Skip-gram window size.
#
SKIPGRAM_N_WORDS = 4


# Maximum number of tokens in a paragraph.
#
MAX_SEQUENCE_LENGTH = 256


# Maximum Euclidean (L2) norm allowed for any embedding.
# Setting a real value here instead of "None" acts as a
# regularization parameter of the pretraining procedure.
#
EMBED_MAX_NORM = 1

# Path to the folder where the datasets are/should be downloaded (e.g. )
#
DATASET_PATH = "./"

# Path to the folder where the pretrained models are saved.
#
CHECKPOINT_PATH = "./"

# Checking the number of CPU cores
#
import os
NUM_WORKERS = os.cpu_count()

# Ensure that all operations are deterministic on GPU (if used) for reproducibility.
#
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
print("Device:", device)
print("Number of workers:", NUM_WORKERS)

### Part 3: Loading the WikiText-2 dataset and preprocessing text <a class="anchor" id="part_03"></a>

WikiText-2 is a small text dataset available through torchtext (~ 2M tokens, ~ 36K lines). It is composed of a set of verified Good and Featured articles from the English Wikipedia.

In this step, we'll also define some preprocessing functions: one that sets the tokenizer (which extracts tokens from text), one that builds the vocabulary, one that gets the dataset iterator, and one that defines a collate_fn (a function that dynamically builds batches from paragraphs of different sizes).
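
As a rough illustration of the vocabulary step, the sketch below mimics in plain Python what a minimum-frequency vocabulary with an `<unk>` token at ID 0 does. It is only a stand-in under simplifying assumptions: the names `build_toy_vocab` and `encode` are hypothetical, tokens are split on whitespace only, and the ordering of the kept words is a choice made for this example rather than a statement about torchtext.

```python
from collections import Counter


def build_toy_vocab(paragraphs, min_freq, unk="<unk>"):
    """Keep only words seen at least min_freq times; ID 0 is reserved for unk."""
    counts = Counter(tok for p in paragraphs for tok in p.lower().split())
    # Assumed ordering for this sketch: most frequent first, ties alphabetical.
    kept = sorted(
        (w for w, c in counts.items() if c >= min_freq and w != unk),
        key=lambda w: (-counts[w], w),
    )
    itos = [unk] + kept                        # index -> token
    stoi = {w: i for i, w in enumerate(itos)}  # token -> index
    return stoi, itos


def encode(text, stoi, unk="<unk>"):
    """Map each token to its ID, falling back to the unk ID for rare words."""
    return [stoi.get(tok, stoi[unk]) for tok in text.lower().split()]


# Hypothetical toy corpus.
paragraphs = ["the cat sat", "the dog sat", "the bird flew"]
stoi, itos = build_toy_vocab(paragraphs, min_freq=2)

print(itos)                          # ['<unk>', 'the', 'sat']
print(encode("the cat flew", stoi))  # [1, 0, 0]
```

Words below the frequency threshold ("cat" and "flew" here) collapse to ID 0, which is why a high threshold shrinks both the vocabulary and the output layer of the model.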

In [None]:
from functools import partial
from torch.utils.data import DataLoader
from torchtext.data import to_map_style_dataset
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import WikiText2


def get_english_tokenizer():
    """
    The basic English tokenizer lowercases the text and splits it
    on whitespace. Punctuation marks are also included as tokens.

    For example, the sentence: I like Earth!

    Generates the tokens "i", "like", "earth" and "!".
    """
    tokenizer = get_tokenizer("basic_english", language="en")
    return tokenizer


def get_data_iterator(ds_type, data_dir):
    """
    Receives a dataset type (train, valid or test) and the dataset directory,
    and returns a map data iterator for the dataset in the directory.
    """
    data_iter = WikiText2(root=data_dir, split=(ds_type))
    data_iter = to_map_style_dataset(data_iter)
    return data_iter


def build_vocab(data_iter, tokenizer):
    """
    Returns a vocabulary that includes only the words that appear
    at least MIN_WORD_FREQUENCY times in the dataset.

    For a vocabulary with N such words, each of them has an ID
    ranging from 1 to N.

    Every word with frequency < MIN_WORD_FREQUENCY is encoded with ID = 0.
    """
    vocab = build_vocab_from_iterator(
        map(tokenizer, data_iter),
        specials=["<unk>"],          # In WikiText2, rare words are marked with the token <unk>.
        min_freq=MIN_WORD_FREQUENCY, # We'll set that token as a special symbol (ID = 0).
    )
    vocab.set_default_index(vocab["<unk>"])
    return vocab


def collate_skipgram(batch, text_pipeline):
    """
    Creates a customized collate_fn (the batch creator function)
    for the Dataloader.

    "batch" is expected to be a list of text paragraphs.

    Context is represented as N=SKIPGRAM_N_WORDS past words
    and N=SKIPGRAM_N_WORDS future words.

    Long paragraphs will be truncated to contain
    no more than MAX_SEQUENCE_LENGTH tokens.

    For a tutorial on modifying collate_fn, refer to:

    https://python.plainenglish.io/understanding-collate-fn-in-pytorch-f9d1742647d3
    """
    batch_input  = []  # each element of this list is a middle word ("X").
    batch_output = []  # each element of this list is a context word ("Z").

    for paragraph in batch:

        text_tokens_ids = text_pipeline(paragraph)

        N = SKIPGRAM_N_WORDS

        # Paragraph too short (can't take even a single skipgram); skip it.
        #
        if len(text_tokens_ids) < 2 * N + 1:
            continue

        # Paragraph too long; truncate it.
        #
        elif len(text_tokens_ids) > MAX_SEQUENCE_LENGTH:
            text_tokens_ids = text_tokens_ids[:MAX_SEQUENCE_LENGTH]

        # With the moving window of size 2N+1 (N history words, middle word,
        # and N future words) it loops through the paragraph.
        #
        for idx in range(len(text_tokens_ids) - 2 * N):

            # Takes a token subset that starts at position idx.
            #
            token_id_sequence = text_tokens_ids[idx : (idx + 2 * N + 1)]

            # Remove the element index N (X, middle word)
            #
            # For N == 4:
            #
            # index:     0  1  2  3  4  5  6  7  8
            #            z  z  z  z  x  z  z  z  z
            #
            input_ = token_id_sequence.pop(N)

            # Keep elements 0..N-1 and N+1..2N (Z, history and future words)
            #
            outputs = token_id_sequence

            # Create 2N pairs (X,Z) in the form:
            #
            #  (input_, output_1), (input_, output_2), ..., (input_, output_2N)
            #
            for output in outputs:
                batch_input.append(input_)
                batch_output.append(output)

    # Merge middle words of all paragraphs ("Xs")
    #
    batch_input = torch.tensor(batch_input, dtype=torch.long)

    # Merge context words of all paragraphs ("Zs")
    #
    batch_output = torch.tensor(batch_output, dtype=torch.long)

    return batch_input, batch_output


def get_dataloader_and_vocab(ds_type, data_dir, batch_size, shuffle, vocab=None):
    """
    Receives the dataset type, the data directory, the batch size and a shuffle flag.
    It gets the data iterator and the tokenizer, creates a vocabulary if the vocab
    argument doesn't contain one, and uses all of that to create and return a dataloader.
    """
    data_iter = get_data_iterator(ds_type, data_dir)
    tokenizer = get_english_tokenizer()

    if not vocab:
        vocab = build_vocab(data_iter, tokenizer)

    text_pipeline = lambda x: vocab(tokenizer(x))

    dataloader = DataLoader(
        data_iter,
        batch_size = batch_size,
        shuffle = shuffle,
        collate_fn = partial(collate_skipgram, text_pipeline = text_pipeline),
    )

    return dataloader, vocab


print("All functions successfully declared.")

### Part 4: Implementing the word2vec skip-gram model and trainer<a class="anchor" id="part_04"></a>

Here we will implement two classes: <code>SkipGram_Model</code> for the word2vec skip-gram model architecture; and <code>Trainer</code>, for a trainer.

<code>SkipGram_Model</code> is a class that inherits from <code>nn.Module</code>, the base class for all neural network modules in PyTorch.

<code>Trainer</code> is a class that at instantiation initializes a model, the data loader and optimization parameters. It also contains methods for training, validation and saving the yielded results (checkpoints, model, loss).
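
To make the architecture concrete, here is a minimal pure-Python sketch of what one forward pass and one loss evaluation compute: an embedding lookup, a linear layer that produces one raw score per vocabulary word, and the cross-entropy of the true neighbor word. The function names and all numeric values are hypothetical toy choices, and the sketch leaves out batching, the norm constraint on the embeddings, and gradient computation.

```python
import math


def forward(center_id, embeddings, weights, bias):
    """Embedding lookup followed by a linear layer: one raw score per word."""
    e = embeddings[center_id]
    return [
        sum(w_j * e_j for w_j, e_j in zip(row, e)) + b
        for row, b in zip(weights, bias)
    ]


def cross_entropy(logits, target_id):
    """Negative log-softmax of the target score (log-sum-exp form for stability)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in logits))
    return log_sum_exp - logits[target_id]


# Hypothetical toy values: vocabulary of 3 words, embeddings of dimension 2.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # vocab_size x embed_dim
weights = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]     # vocab_size x embed_dim
bias = [0.0, 0.0, 0.0]

logits = forward(0, embeddings, weights, bias)  # center word X has ID 0
loss = cross_entropy(logits, target_id=1)       # neighbor word Z has ID 1

print("logits:", logits)
print("loss:", loss)
```

Because the softmax is folded into the loss, the model itself only needs to return the raw scores.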

In [None]:
# For the class Trainer.
#
import json

class SkipGram_Model(nn.Module):
    """
    Implementation of Skip-Gram model described in the word2vec paper.
    https://arxiv.org/abs/1301.3781
    """
    def __init__(self, vocab_size: int):
        super(SkipGram_Model, self).__init__()

        # A simple lookup table that stores embeddings of
        # a fixed dictionary and size.
        #
        self.embeddings = nn.Embedding(
            num_embeddings = vocab_size,
            embedding_dim = EMBED_DIMENSION,
            max_norm = EMBED_MAX_NORM)

        # The model itself.
        #
        self.linear = nn.Linear(
            in_features = EMBED_DIMENSION,
            out_features = vocab_size)

    def forward(self, inputs_):
        x = self.embeddings(inputs_)  # inputs_ yields the projection
        x = self.linear(x)            # the projection yields the output

        # Output has no softmax: PyTorch's CrossEntropyLoss expects raw scores.
        #
        return x


class Trainer:
    """
    Main class for model training. It is implemented as
    a typical PyTorch train and validation flow.
    """
    def __init__(
        self,
        model,
        epochs,
        train_dataloader,
        val_dataloader,
        criterion,
        optimizer,
        lr_scheduler,
        device,
        model_dir,
        model_name,
    ):
        self.model = model
        self.epochs = epochs
        self.train_dataloader = train_dataloader
        self.val_dataloader = val_dataloader
        self.criterion = criterion
        self.optimizer = optimizer
        self.lr_scheduler = lr_scheduler
        self.device = device
        self.model_dir = model_dir
        self.model_name = model_name

        self.loss = {"train": [], "val": []}
        self.model.to(self.device)

    def train(self):
        for epoch in range(self.epochs):
            self._train_epoch()
            self._validate_epoch()
            print(
                "Epoch: {}/{}, Train Loss={:.5f}, Val Loss={:.5f}".format(
                    epoch + 1,
                    self.epochs,
                    self.loss["train"][-1],
                    self.loss["val"][-1],
                )
            )

            self.lr_scheduler.step()

            self._save_checkpoint(epoch)

    def _train_epoch(self):
        """
        Run a training epoch, recording the updated loss at each step.
        """
        self.model.train()
        running_loss = []

        for i, batch_data in enumerate(self.train_dataloader, 1):
            inputs = batch_data[0].to(self.device)
            labels = batch_data[1].to(self.device)

            self.optimizer.zero_grad()      # Resets the gradients of all optimized tensors.
            outputs = self.model(inputs)
            loss = self.criterion(outputs, labels)
            loss.backward()                        # Compute the gradient.
            self.optimizer.step()

            running_loss.append(loss.item())

        epoch_loss = np.mean(running_loss)
        self.loss["train"].append(epoch_loss)

    def _validate_epoch(self):
        """
        Run a validation epoch, recording the updated loss at each step.
        """
        self.model.eval()
        running_loss = []

        with torch.no_grad():
            for i, batch_data in enumerate(self.val_dataloader, 1):
                inputs = batch_data[0].to(self.device)
                labels = batch_data[1].to(self.device)

                outputs = self.model(inputs)
                loss = self.criterion(outputs, labels)

                running_loss.append(loss.item())

        epoch_loss = np.mean(running_loss)
        self.loss["val"].append(epoch_loss)

    def _save_checkpoint(self, epoch):
        """
        Save model checkpoint to `self.model_dir` directory.
        """
        epoch_num = epoch + 1
        model_path = "checkpoint_{}.pt".format(str(epoch_num).zfill(3))
        model_path = os.path.join(self.model_dir, model_path)
        torch.save(self.model, model_path)

    def save_model(self):
        """
        Save final model to `self.model_dir` directory.
        """
        model_path = os.path.join(self.model_dir, "model.pt")
        torch.save(self.model, model_path)

    def save_loss(self):
        """
        Save train/val loss as json file to `self.model_dir` directory.
        """
        loss_path = os.path.join(self.model_dir, "loss.json")
        with open(loss_path, "w") as fp:
            json.dump(self.loss, fp)

print("word2vec skip-gram model and trainer successfully implemented.")


### Part 5: Pretraining of our word2vec skip-gram model <a class="anchor" id="part_05"></a>

Training for 5 epochs on CPUs (eight Intel i5-8265U CPU cores at 1.60GHz) took around 1 hour. With a good GPU this time can be reduced to less than 20 minutes.
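
The pretraining in this part uses a learning rate that decreases linearly with the epoch index. As a small sketch of that schedule, the snippet below lists the learning rate per epoch for a 5-epoch run with a base value of 0.025; the helper name `linear_decay_factor` is hypothetical, and the snippet only reproduces the arithmetic, not the PyTorch scheduler itself.

```python
def linear_decay_factor(epoch, total_epochs):
    """Multiplier applied to the base learning rate at a given epoch index."""
    return (total_epochs - epoch) / total_epochs


base_lr = 0.025
total_epochs = 5

# Epoch indices 0..4 are the training epochs; index 5 is the value
# reached after the scheduler step that follows the last epoch.
schedule = [
    base_lr * linear_decay_factor(epoch, total_epochs)
    for epoch in range(total_epochs + 1)
]

for epoch, lr in enumerate(schedule):
    print(f"epoch index {epoch}: lr = {lr:.4f}")
```

Note that the last training epoch still runs with a non-zero learning rate (one fifth of the base value here); the rate only reaches zero after the final scheduler step.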


In [None]:
#!pip install torchdata
#!pip install "portalocker>=2.0.0"
import torchdata
from torch.optim.lr_scheduler import LambdaLR  # used by get_lr_scheduler below

def get_lr_scheduler(optimizer, total_epochs: int, verbose: bool = True):
    """
      Scheduler to linearly decrease learning rate,
      so that learning rate after the last epoch is 0.
    """
    lr_lambda = lambda epoch: (total_epochs - epoch) / total_epochs
    lr_scheduler = LambdaLR(optimizer, lr_lambda = lr_lambda, verbose = verbose)
    return lr_scheduler


def save_vocab(vocab, model_dir: str):
    """
      Save vocab file to `model_dir` directory.
    """
    vocab_path = os.path.join(model_dir, "vocab.pt")
    torch.save(vocab, vocab_path)


train_dataloader, vocab = get_dataloader_and_vocab(
    ds_type = "train",
    data_dir = DATASET_PATH,
    batch_size = 96,
    shuffle = True,
    vocab = None)

vocab_size = len(vocab.get_stoi())
print(f"Vocabulary size: {vocab_size}")

save_vocab(vocab, CHECKPOINT_PATH)
print("Model vocabulary saved to folder:", CHECKPOINT_PATH)



In [None]:
import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR

# We ignore the validation's vocab, since we already
# got it from the training dataset.
#
val_dataloader, _ = get_dataloader_and_vocab(
    ds_type = "valid",
    data_dir = DATASET_PATH,
    batch_size = 96,
    shuffle = True,
    vocab = vocab)

model = SkipGram_Model(vocab_size = vocab_size)

learning_rate = 0.025

optimizer = optim.Adam(model.parameters(), lr = learning_rate)

n_epochs = 1

trainer = Trainer(
    model = model,
    epochs = n_epochs,
    train_dataloader = train_dataloader,
    val_dataloader = val_dataloader,
    criterion = nn.CrossEntropyLoss(),
    optimizer = optimizer,
    lr_scheduler = get_lr_scheduler(optimizer, n_epochs),
    device = device,
    model_dir = CHECKPOINT_PATH,
    model_name = 'skipgram',
)

trainer.train()
print("Training finished.")

trainer.save_model()
trainer.save_loss()
print("Model files saved to folder:", CHECKPOINT_PATH)


### Part 6: Visualizing the yielded embeddings with t-SNE<a class="anchor" id="part_06"></a>

Each produced word vector has EMBED_DIMENSION components (300 in the execution for the lecture). Since we cannot directly plot data with more than 3 dimensions, we need to apply a dimensionality reduction technique.

To this end, we will employ t-SNE, which creates a non-linear projection of the data points from the EMBED_DIMENSION-dimensional space into a 2D space, whose points we can plot in a graph.

In [None]:
import pandas as pd
from sklearn.manifold import TSNE
!pip install plotly
import plotly.graph_objects as go

# Getting the embeddings (from the first model layer).
#
# Each embedding is a 1D vector with EMBED_DIMENSION size.
#
# Hence all embeddings are a matrix of size vocab size x EMBED_DIMENSION.
#
embeddings = list(model.parameters())[0]
embeddings = embeddings.cpu().detach().numpy() # convert from tensor to numpy

# Normalization of the embeddings.
#
norms = (embeddings ** 2).sum(axis=1) ** (1 / 2)
norms = np.reshape(norms, (len(norms), 1))
embeddings_norm = embeddings / norms
embeddings_norm.shape

# Get the embeddings into a dataframe.
#
embeddings_df = pd.DataFrame(embeddings)

# t-SNE projection.
#
tsne = TSNE(n_components=2)
embeddings_df_trans = tsne.fit_transform(embeddings_df)
embeddings_df_trans = pd.DataFrame(embeddings_df_trans)

# Getting the token order.
#
embeddings_df_trans.index = vocab.get_itos()

# In the case the token is a number.
#
is_numeric = embeddings_df_trans.index.str.isnumeric()

color = np.where(is_numeric, "green", "black")
fig = go.Figure()

fig.add_trace(
    go.Scatter(
        x=embeddings_df_trans[0],
        y=embeddings_df_trans[1],
        mode="text",
        text=embeddings_df_trans.index,
        textposition="middle center",
        textfont=dict(color=color)))

fig.show()


### Part 7: Algebraic checking of similar words<a class="anchor" id="part_07"></a>

One interesting property of the word embedding vectors yielded by word2vec is that, under a similarity measure such as the cosine similarity for a pairwise comparison of vectors, words related to each other should be more similar than words that are unrelated to each other. Moreover, algebraic operations on those vectors can capture analogies between words.

Two classical examples are "King - Man + Woman = Queen" and "Paris - France + Germany = Berlin".

Let us now try those algebraic operations and see what happens.
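
Before running these operations on the trained embeddings, a toy illustration may help. The sketch below uses hand-made 2D vectors (hypothetical values, not learned ones) in which the first axis loosely stands for "royalty" and the second for "gender", and it excludes the three query words from the candidates, which is a common convention assumed here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical hand-made vectors, only for illustration.
toy = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.2],
}

# King - Man + Woman, computed component by component.
query = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Rank the remaining words by cosine similarity to the query vector.
candidates = {
    word: cosine(query, vec)
    for word, vec in toy.items()
    if word not in ("king", "man", "woman")
}
best = max(candidates, key=candidates.get)

print("closest word:", best)
```

With embeddings pretrained on a small corpus for a few epochs, the result is usually much noisier than in this idealized setting.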

In [None]:
def get_top_similar(word: str, N: int = 10):
    word_id = vocab[word]
    if word_id == 0:
        print("Out of vocabulary word")
        return

    word_vec = embeddings_norm[word_id]
    word_vec = np.reshape(word_vec, (len(word_vec), 1))
    dists = np.matmul(embeddings_norm, word_vec).flatten()
    topN_ids = np.argsort(-dists)[1 : N + 1]

    topN_dict = {}
    for sim_word_id in topN_ids:
        sim_word = vocab.lookup_token(sim_word_id)
        topN_dict[sim_word] = dists[sim_word_id]
    return topN_dict


print("The five most similar words to the word 'Germany' are:\n")

for word, sim in get_top_similar("germany", 5).items():
    print("{}: {:.3f}".format(word, sim))

#----------------------------------------------------------#

emb1 = embeddings[vocab["king"]]
emb2 = embeddings[vocab["man"]]
emb3 = embeddings[vocab["woman"]]

emb4 = emb1 - emb2 + emb3

# Calculate the Euclidean norm.
#
emb4_norm = (emb4 ** 2).sum() ** (1 / 2)
emb4 = emb4 / emb4_norm

emb4 = np.reshape(emb4, (len(emb4), 1))
dists = np.matmul(embeddings_norm, emb4).flatten()

print("\n\nThe five most similar words to 'King' - 'Man' + 'Woman' are:\n")

closest_five_words = np.argsort(-dists)[:5]
for word_id in closest_five_words:
    print("{}: {:.3f}".format(vocab.lookup_token(word_id), dists[word_id]))

### Part 8: Suggestions of exercises on this pipeline<a class="anchor" id="part_08"></a>

Some possibilities to further explore the SSRL pipeline presented here include:


* Repeat the pipeline using the [WikiText103](https://pytorch.org/text/stable/_modules/torchtext/datasets/wikitext103.html) dataset instead of WikiText-2 (~ 100M tokens, 1.8M lines). Training on that dataset for 5 epochs should take about one night on a GPU server.


* Use the yielded embeddings for a downstream task. One suggestion is to try the [large movie review dataset](https://ai.stanford.edu/~amaas/data/sentiment/), which is designed for binary sentiment classification. It is a high-quality labeled dataset with 25K reviews for training and 25K reviews for testing. For this exercise, use a simple linear model such as logistic regression.


* As a baseline for the previous exercise, try to carry out supervised learning of logistic regression directly on the large movie review dataset and compare the results. Did the usage of embeddings in the previous exercise yield better results than here?


* Instead of the skip-gram pretext task of the original word2vec paper, we can also try the skip-gram with negative sampling pretext task presented in Lecture 13, whose formal procedure is described in [this paper](https://arxiv.org/abs/1310.4546).


* If your embeddings don't improve the downstream task, instead of repeating the training on a very large corpus (e.g., with billions of tokens), which would take days or even weeks, download pretrained word vectors (e.g., [GloVe](https://nlp.stanford.edu/projects/glove/)) and train the logistic regression again with the large movie review dataset.
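
For the downstream exercises above, one possible way (an assumption of this sketch, not the only option) to turn a review into a fixed-size input for a linear model is to average the embeddings of its known words. The function name `average_embedding` and the toy vocabulary and vectors below are hypothetical.

```python
def average_embedding(tokens, stoi, embeddings, unk_id=0):
    """Average the embeddings of the in-vocabulary tokens of a document."""
    ids = [stoi.get(tok, unk_id) for tok in tokens]
    ids = [i for i in ids if i != unk_id]  # ignore out-of-vocabulary words
    dim = len(embeddings[0])
    if not ids:
        return [0.0] * dim                 # no known word: zero vector
    return [sum(embeddings[i][d] for i in ids) / len(ids) for d in range(dim)]


# Hypothetical toy vocabulary and 2D embeddings.
stoi = {"<unk>": 0, "great": 1, "boring": 2, "movie": 3}
embeddings = [[0.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]

features = average_embedding("great movie".split(), stoi, embeddings)
print(features)  # [0.5, 0.5]
```

The resulting vectors, one per review, could then be used as the features of the logistic regression.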

[Back to summary.](#topo)

