NLP Training 4: Word2Vec (in PyTorch) from scratch
--- 


In [None]:
import os
os.chdir('..')
print(f'Setting working dir to: {os.getcwd()}')

In [None]:
import torch
from torch import nn
import torch.nn.functional as F
import pytorch_lightning as pl
from pytorch_lightning.callbacks.early_stopping import EarlyStopping

from utils.word2vec import get_dataloader_and_vocab

torch.manual_seed(0)

## PyTorch and Word2Vec

In this notebook, we are going to create a simple version of the original Word2Vec model in PyTorch.    
The implementation follows [this blog post](https://towardsdatascience.com/word2vec-with-pytorch-implementing-original-paper-2cd7040120b0) from Towards Data Science. Try to solve each exercise yourself before looking at the solution 😉.



### Exercise 1 - Data Loader

We have already implemented, imported, and instantiated a [PyTorch DataLoader](https://pytorch.org/docs/stable/data.html) and a [torchtext Vocab](https://pytorch.org/text/stable/vocab.html#id1) for you (see below).

The Vocab class is a simple mapping from words to token IDs. It does the following:

1. Create a vocabulary of a particular size and store it
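
To make the idea of a word-to-ID mapping concrete, here is a minimal pure-Python sketch. It is not the torchtext implementation: the helper names `build_vocab` and `lookup_indices`, the `min_freq` parameter, and the `<unk>` token are assumptions chosen for illustration only.

```python
from collections import Counter


def build_vocab(tokens, min_freq=1, unk_token="<unk>"):
    """Hypothetical helper: map each sufficiently frequent token to an integer ID."""
    counts = Counter(tokens)
    # Reserve ID 0 for unknown words, then add tokens by descending frequency
    # (ties are broken alphabetically so the result is deterministic).
    itos = [unk_token] + [
        tok
        for tok, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        if n >= min_freq
    ]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos


tokens = "the deep sea is deep and the sea is wide".split()
stoi, itos = build_vocab(tokens)


def lookup_indices(words):
    # Words outside the vocabulary fall back to the unknown-token ID.
    return [stoi.get(w, stoi["<unk>"]) for w in words]


print(len(stoi))
print(lookup_indices(["deep", "ocean"]))
```

The real Vocab object offers the same two directions of lookup (`lookup_indices` and `lookup_tokens`), which the exercises below rely on.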

The DataLoader draws batches of samples from the WikiText2 dataset, which we will later use to train the Word2Vec model. It performs the following steps:

1. Take a paragraph from the raw data file
2. Convert it to lowercase, tokenize, and encode it
3. Make sure paragraphs are neither too long nor too short
4. Transform the paragraph into context and target words using a moving window
5. Return both context (input) and target (output) as a batch
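
The moving-window step can be sketched in plain Python as follows. The actual implementation in `utils.word2vec` is not shown here, so `context_target_pairs` is a hypothetical helper; it assumes that each sample consists of `n_window` words to the left and `n_window` words to the right of the target.

```python
def context_target_pairs(token_ids, n_window):
    """Hypothetical helper: build (context, target) pairs with n_window words per side."""
    pairs = []
    span = 2 * n_window + 1  # full window: left context + target + right context
    for start in range(len(token_ids) - span + 1):
        window = token_ids[start:start + span]
        target = window[n_window]  # the middle token is the prediction target
        context = window[:n_window] + window[n_window + 1:]
        pairs.append((context, target))
    return pairs


ids = [10, 11, 12, 13, 14, 15]
for context, target in context_target_pairs(ids, n_window=2):
    print(context, "->", target)
```

A paragraph shorter than `2 * n_window + 1` tokens yields no pairs in this sketch, which is one reason a minimum paragraph length is useful.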

The DataLoader is a Python Iterable (implements the `__iter__()`-method).   

Try to do the following:   
- Check the length of the vocabulary we created
- Can you find out which token ID (aka index) was assigned to the word "deep"?

In [None]:
dataloader, vocab = get_dataloader_and_vocab(
        ds_name='WikiText2',
        ds_type='train',
        batch_size=4,
        n_window=4,
        shuffle=True,
        vocab=None)

# Add your solution here:
# ...

#### Solution

In [None]:
len_of_vocab = len(vocab)
print(f'Used a vocabulary of length: {len_of_vocab}')

idx_of_deep = vocab.lookup_indices(['deep'])
print(f'The index of the word "deep" is: {idx_of_deep}')

#### Exercise 1a

Next, do the following:
- Draw a single batch from the `dataloader`
- What does it return, and what are the shapes of the returned tensors?
- Check that input and output are correct: convert the token indices back to raw tokens and reconstruct the original sentence (for simplicity, use only the first sample of the batch)

In [None]:
# Add your solution here:
# ...

#### Solution

In [None]:
input, output = next(iter(dataloader))

print(f'Input shape: {input.shape}')
print(f'Output shape: {output.shape}')

In [None]:
# Convert the tensors to plain Python ints before looking up the tokens
first_input_idx = input[0].tolist()
first_output_idx = output[0].item()

first_input_tokens = vocab.lookup_tokens(first_input_idx)
first_output_tokens = vocab.lookup_tokens([first_output_idx])

print(f'Left context: {first_input_tokens[:4]}, Target: {first_output_tokens}, Right context: {first_input_tokens[4:]}')

### Exercise 2 - Implementing the Word2Vec CBOW Model

In the following exercise, you should complete the given code for the Word2Vec CBOW model. We already implemented most of the components for a [PyTorch Lightning](https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html) model class.   

See if you can fill in the missing parts.
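
As a compact reference, the model can be written down as follows. The notation is ours and not taken from the notebook: $c_1, \dots, c_{2n}$ are the context token IDs (with $n$ = `n_window`), $t$ is the target token ID, $E \in \mathbb{R}^{V \times d}$ is the embedding matrix, and $W \in \mathbb{R}^{V \times d}$, $b \in \mathbb{R}^{V}$ are the weights and bias of the linear layer, where $V$ is the vocabulary size and $d$ the embedding dimension. The sketch ignores the `max_norm` renormalization of the embeddings.

$$
h = \frac{1}{2n} \sum_{i=1}^{2n} E_{c_i}, \qquad
z = W h + b, \qquad
\mathcal{L} = -\log \frac{\exp(z_t)}{\sum_{j=1}^{V} \exp(z_j)}
$$

Here $E_{c_i}$ denotes row $c_i$ of $E$, $z$ holds the raw logits returned by the forward pass, and $\mathcal{L}$ is the cross-entropy loss for a single sample; for a batch, the loss is averaged over all samples.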

In [None]:
# Add your solution here:
# -> SEE CAPITAL LETTERS

class CBOW_Model(pl.LightningModule):
    """
    Implementation of CBOW model described in paper:
    https://arxiv.org/abs/1301.3781
    """
    def __init__(self, 
                 vocab_size: int, 
                 embed_dim: int = 300, 
                 embed_max_norm: int = 1):
        super(CBOW_Model, self).__init__()
        
        # We need to define the embedding layer first, see if you can figure
        # out how to do it.
        # Hint: PyTorch offers a special embedding layer that can handle this
        self.embeddings = # FILL IN HERE
        
        # Then we need to define the linear layer.
        self.linear = # FILL IN HERE
        
        # Finally, we need to define the loss function.
        self.loss = # FILL IN HERE

    def forward(self, inputs_):
        """
        Forward pass of CBOW model.

        No softmax is applied to the output because PyTorch's
        CrossEntropyLoss expects raw, unnormalized scores (logits).

        Args:
            inputs_: tensor of shape (batch_size, n_window*2),
                     where n_window is the number of context words
                     on each side of the target word.
        """
        
        # First, we need to get embeddings of all context words by passing
        # them through embedding layer.
        x = # FILL IN HERE
        
        # Then we need to average all embeddings.
        x = x.mean(axis=1)
        
        # Finally, we need to pass averaged embeddings through linear layer.
        x = # FILL IN HERE
        return x

    def training_step(self, batch, batch_idx):
        """
        Training step of CBOW model.

        Args:
            batch: tuple of (inputs, targets)
            batch_idx: index of batch
        """
        # Get inputs and targets from batch.
        x, y = batch
        
        # Forward pass will return raw logits
        logits = self.forward(x)
        
        # We need to calculate loss for each batch
        loss = # FILL IN HERE
        
        # We need to log the loss for each batch
        self.log('train_loss', loss)
        return loss

    def validation_step(self, val_batch, batch_idx):
        """
        Validation step of CBOW model.
        
        Args:
            val_batch: tuple of (inputs, targets)
            batch_idx: index of batch
        """
        x, y = val_batch
        logits = self.forward(x)
        loss = self.loss(logits, y)
        self.log('val_loss', loss)

    def configure_optimizers(self):
        """
        Configure optimizers for CBOW model.
        """
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

#### Solution

In [None]:
class CBOW_Model(pl.LightningModule):
    """
    Implementation of CBOW model described in paper:
    https://arxiv.org/abs/1301.3781
    """
    def __init__(self, 
                 vocab_size: int, 
                 embed_dim: int = 300, 
                 embed_max_norm: int = 1):
        super(CBOW_Model, self).__init__()
        
        self.embeddings = nn.Embedding(
            num_embeddings=vocab_size,
            embedding_dim=embed_dim,
            max_norm=embed_max_norm,
        )
        
        self.linear = nn.Linear(
            in_features=embed_dim,
            out_features=vocab_size,
        )
        
        self.loss = nn.CrossEntropyLoss()

    def forward(self, inputs_):
        """
        Forward pass of CBOW model.

        No softmax is applied to the output because PyTorch's
        CrossEntropyLoss expects raw, unnormalized scores (logits).

        Args:
            inputs_: tensor of shape (batch_size, n_window*2),
                     where n_window is the number of context words
                     on each side of the target word.
        """
        x = self.embeddings(inputs_)
        x = x.mean(axis=1)
        x = self.linear(x)
        return x

    def training_step(self, batch, batch_idx):
        """
        Training step of CBOW model.

        Args:
            batch: tuple of (inputs, targets)
            batch_idx: index of batch
        """
        x, y = batch
        logits = self.forward(x)
        loss = self.loss(logits, y)
        self.log('train_loss', loss)
        return loss

    def validation_step(self, val_batch, batch_idx):
        """
        Validation step of CBOW model.
        
        Args:
            val_batch: tuple of (inputs, targets)
            batch_idx: index of batch
        """
        x, y = val_batch
        logits = self.forward(x)
        loss = self.loss(logits, y)
        self.log('val_loss', loss)

    def configure_optimizers(self):
        """
        Configure optimizers for CBOW model.
        """
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

### Exercise 3 - Run the training

In [None]:
# Settings
DATASET = 'WikiText2'
BATCH_SIZE = 128
N_WINDOW = 4

# Data sets
train_dataloader, vocab = get_dataloader_and_vocab(
        ds_name=DATASET,
        ds_type='train',
        batch_size=BATCH_SIZE,
        n_window=N_WINDOW,
        shuffle=True,
        vocab=None,
    )

val_dataloader, _ = get_dataloader_and_vocab(
        ds_name=DATASET,
        ds_type='valid',
        batch_size=BATCH_SIZE,
        n_window=N_WINDOW,
        shuffle=False,
        vocab=vocab,
    )

# Init model
vocab_size = len(vocab.get_stoi())
model = CBOW_Model(vocab_size=vocab_size)

In [None]:
# Run training

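# Determine the accelerator. Lightning expects the string 'gpu' or 'cpu' here;
# note that torch.device('gpu') is not a valid PyTorch device.
accelerator = 'gpu' if torch.cuda.is_available() else 'cpu'
print(f'Using accelerator: {accelerator}')

trainer = pl.Trainer(accelerator=accelerator,
                     max_epochs=5,
                     callbacks=[EarlyStopping(monitor="val_loss", mode="min")])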

trainer.fit(model, train_dataloader, val_dataloader)

### Exercise 4 - Get Embeddings and Calculate Similarity

Finally, we extract the embeddings from the weight matrix of the linear layer. Each row of this matrix corresponds to a word in our vocabulary.   

- Can you verify that the embeddings have the correct shape?
- See how well the model performs by calculating the similarity of the word "mother" to all other words in our vocabulary and finding the five most similar ones (use the `utils.word2vec.get_top_similar` function for this)
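
The implementation of `get_top_similar` is not shown in this notebook. A common way to rank words by similarity is cosine similarity between embedding rows, so the following pure-Python sketch assumes that approach; `cosine_similarity`, `top_similar`, and the toy vectors are hypothetical and only illustrate the idea.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_similar(embeddings, words, query, top_n=2):
    """Hypothetical helper: rank all other words by cosine similarity to `query`."""
    q = embeddings[words.index(query)]
    scores = [
        (w, cosine_similarity(q, vec))
        for w, vec in zip(words, embeddings)
        if w != query
    ]
    # Highest similarity first
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy 2-dimensional embeddings, one row per word (made up for illustration)
words = ["mother", "father", "car"]
toy_embeddings = [[1.0, 1.0], [1.0, 0.8], [-1.0, 0.5]]
print(top_similar(toy_embeddings, words, "mother"))
```

With real embeddings, the same ranking would typically be computed with vectorized NumPy or PyTorch operations over the whole embedding matrix.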

In [None]:
from utils.word2vec import get_top_similar

# Add your solution here:
# ...

#### Solution

In [None]:
# Get embeddings from the linear layer of our model
w2v_embeddings = model.linear.weight.detach().cpu().numpy()

# Each row is a word embedding
print(f'Shape of embeddings: {w2v_embeddings.shape} and vocab size: {vocab_size}')

# If we wanted to, we could get the corresponding words from our vocabulary
# print(vocab.get_itos())

get_top_similar(w2v_embeddings, 'mother', vocab, top_n=5)