# Lab — Implementing Word2Vec (CBOW) in PyTorch

**Objective:**  
In this lab, we will build a Word2Vec model **from scratch** using the **CBOW (Continuous Bag of Words)** architecture.

The goal is to learn **vector representations** (embeddings) of words by training a small neural network:

$$\mathbb{P}(w_t \mid h_t) = \text{Softmax}(W^{(2)} \cdot h_t)$$

where:

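- $h_t$ is the average of the embeddings of the $2C$ **context words** surrounding $w_t$:  
  $$ h_t = \frac{1}{2C} \sum_{i=-C, i\neq 0}^{C} v(w_{t+i}), $$
- $v(w) \in \mathbb{R}^p$ is the embedding vector of word $w$, i.e. the column of $W^{(1)}$ associated with $w$,
- $W^{(1)} \in \mathbb{R}^{p \times N}$ and $W^{(2)} \in \mathbb{R}^{N \times p}$ are the model parameters, where $N$ is the vocabulary size (called `V` in the code) and $p$ is the embedding dimension,
- the **loss** is the negative log-likelihood over the $T$ training positions: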

$$\mathcal{L} = - \sum_{t=1}^{T} \log \mathbb{P}(w_t \mid h_t)$$

We will train our model on a small corpus extracted from *WikiText-2*.
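
To make the formulas above concrete, here is a minimal sketch that evaluates $h_t$, the softmax and the loss term for a single position. The sizes, the random matrices and the chosen indices are toy assumptions, not values from the lab.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

N, p, C = 6, 4, 2            # toy vocabulary size, embedding dim, window size (assumed)
W1 = torch.randn(p, N)       # W^(1): column j is the embedding v(w_j)
W2 = torch.randn(N, p)       # W^(2): maps the hidden vector back to vocabulary scores

context = torch.tensor([0, 1, 3, 4])   # indices of the 2C context words (made up)
target = 2                             # index of the center word w_t (made up)

h_t = W1[:, context].mean(dim=1)       # h_t = (1 / 2C) * sum of context embeddings
probs = F.softmax(W2 @ h_t, dim=0)     # probability distribution over the N words
loss_t = -torch.log(probs[target])     # contribution of position t to the loss

print(h_t.shape, probs.sum().item(), loss_t.item())
```

In the model built later, the column lookup is done by an embedding layer and the softmax and logarithm are fused into a log-softmax, but the computation is the same.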

## 1. Import Packages and Preprocess the Corpus

In [1]:
import pandas as pd
import spacy
from tqdm import tqdm
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
import random
import torch.nn.functional as F

In [2]:
train_iter = pd.read_csv(
    './wikitext-2-train.csv',
    header=None,
    names=['text'],
    encoding='utf-8'
)

- Apply a tokenization function (e.g. `tokenize_sentence`) to each non-empty text entry in `train_iter['text']`.
- You can use `tqdm` to visualize progress.
- Filter out sentences shorter than 3 tokens.
- Print a small sample (e.g. first 10 tokens) from one sentence to verify the result.

In [3]:
# Load the spaCy English pipeline used for tokenization
nlp = spacy.load("en_core_web_sm")

In [4]:
def tokenize_sentence(sentence):
    # Keep only purely alphabetic tokens (drops punctuation and numbers), lowercased
    return [token.text.lower() for token in nlp(sentence) if token.is_alpha]

In [1]:
sentences = ...
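
One possible way to fill in this step is sketched below. To keep the snippet runnable without the spaCy model, it uses a hypothetical regex tokenizer (`simple_tokenize`) and a small list of strings in place of `tokenize_sentence` and `train_iter['text']`. The helper name `build_sentences` and its `min_len` parameter are also assumptions.

```python
import re

def simple_tokenize(sentence):
    # Hypothetical stand-in for tokenize_sentence: lowercase runs of letters only
    return re.findall(r"[a-z]+", sentence.lower())

def build_sentences(texts, tokenize, min_len=3):
    """Tokenize each non-empty text and keep sentences with at least min_len tokens."""
    sentences = []
    for text in texts:                 # wrap `texts` in tqdm(...) to display progress
        if not isinstance(text, str) or not text.strip():
            continue                   # skip NaN or empty rows
        tokens = tokenize(text)
        if len(tokens) >= min_len:
            sentences.append(tokens)
    return sentences

texts = ["The quick brown fox jumps.", "", " = Title = ", "CBOW learns embeddings of words."]
toy_sentences = build_sentences(texts, simple_tokenize)
print(toy_sentences[0][:10])   # ['the', 'quick', 'brown', 'fox', 'jumps']
```

In the lab itself, the call would presumably look like `build_sentences(train_iter['text'], tokenize_sentence)`.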

## 2. Create a Vocabulary from the Tokenized Sentences

- Count all word occurrences across the corpus using `Counter`.
- Keep only words that appear at least 5 times.
- Build two dictionaries:
  - `word2idx`: maps each word to a unique index.
  - `idx2word`: inverse mapping from index to word.
- Compute the total vocabulary size `V` and print it.

In [None]:
from collections import Counter

...
V = ...
print(f"Vocabulary size: {V}")
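
The steps listed above can be sketched as follows on a toy corpus. The helper name `build_vocab`, the alphabetical ordering of indices and the lowered threshold (`min_count=2`, so that the toy example keeps some words) are assumptions; the lab asks for a threshold of 5.

```python
from collections import Counter

def build_vocab(sentences, min_count=5):
    # Count every token occurrence across all sentences
    counts = Counter(word for sentence in sentences for word in sentence)
    # Keep frequent words; sorting only makes the indices reproducible
    kept = sorted(w for w, c in counts.items() if c >= min_count)
    word2idx = {w: i for i, w in enumerate(kept)}
    idx2word = {i: w for w, i in word2idx.items()}
    return word2idx, idx2word

toy = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
toy_word2idx, toy_idx2word = build_vocab(toy, min_count=2)
print(f"Vocabulary size: {len(toy_word2idx)}")   # Vocabulary size: 3
```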

## 3. Generate (context, target) Pairs

In [4]:
def generate_context_target(sentences, context_size=5):
    """
    Generate (context, target) training pairs for the CBOW model.

    For each word in each sentence, this function extracts:
        - A target word (the central word)
        - Its surrounding context words within a specified window size

    Example:
        Sentence: ["the", "quick", "brown", "fox", "jumps"]
        context_size = 2
        → Target = "brown"
        → Context = ["the", "quick", "fox", "jumps"]

    Args:
        sentences (list of list of str): List of tokenized sentences
        context_size (int): Number of context words to take on each side of the target

    Returns:
        list of tuples: Each tuple (context, target) contains
                        - context: list of word indices (size = 2 * context_size)
                        - target: integer index of the center word

    Meaning:
        The result is a training dataset for the CBOW neural network, where each
        example teaches the model to predict the central word given its surrounding
        context words.
    """
    data = []
    for sentence in sentences:
        # Drop out-of-vocabulary words, then map the remaining words to indices
        indices = [word2idx[w] for w in sentence if w in word2idx]
        # Only positions with a full window on both sides become targets
        for i in range(context_size, len(indices) - context_size):
            context = indices[i - context_size:i] + indices[i + 1:i + context_size + 1]
            target = indices[i]
            data.append((context, target))
    return data

In [None]:
CONTEXT_SIZE = 5
data = generate_context_target(sentences, CONTEXT_SIZE)
print("Example:", data[0])
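
As a small worked example, the sketch below applies the same windowing rule to the five-word sentence from the docstring. The helper `pairs_for_sentence` and the toy vocabulary are hypothetical. It also shows that a sentence with fewer than `2 * context_size + 1` in-vocabulary tokens produces no pairs, which matters with `CONTEXT_SIZE = 5`.

```python
def pairs_for_sentence(indices, context_size):
    # Same windowing rule as generate_context_target, for one list of indices
    pairs = []
    for i in range(context_size, len(indices) - context_size):
        context = indices[i - context_size:i] + indices[i + 1:i + context_size + 1]
        pairs.append((context, indices[i]))
    return pairs

sentence = ["the", "quick", "brown", "fox", "jumps"]
toy_vocab = {w: i for i, w in enumerate(sentence)}   # hypothetical toy vocabulary
toy_indices = [toy_vocab[w] for w in sentence]

print(pairs_for_sentence(toy_indices, context_size=2))   # [([0, 1, 3, 4], 2)]
print(pairs_for_sentence(toy_indices, context_size=5))   # []
```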

## 4. Define a PyTorch Dataset for CBOW

- Create a class `CBOWDataset` that inherits from `torch.utils.data.Dataset`.
- Store all `(context, target)` pairs in the constructor (`__init__`).
- Implement:
  - `__len__`: returns the total number of samples.
  - `__getitem__`: returns one sample (context and target) as tensors.
- Keep the same method names and docstrings provided in the example.

In [None]:
class CBOWDataset(Dataset):
    """
    Custom PyTorch Dataset for the Continuous Bag-of-Words (CBOW) model.
    Each sample consists of a context (list of word indices) and a target word index.
    """

    def __init__(self, data):
        """
        Initialize the dataset with preprocessed (context, target) pairs.

        Args:
            data (list of tuples): Each tuple contains (context_indices, target_index)
        """
        self.data = data

    def __len__(self):
        """
        Return the total number of (context, target) samples.

        Returns:
            int: Number of samples in the dataset
        """
        return len(self.data)

    def __getitem__(self, idx):
        """
        Retrieve the context-target pair at the given index.

        Args:
            idx (int): Index of the desired sample

        Returns:
            tuple: (context_tensor, target_tensor)
        """
        context, target = self.data[idx]
        return torch.tensor(context, dtype=torch.long), torch.tensor(target, dtype=torch.long)

In [9]:
dataset = CBOWDataset(data)
dataloader = DataLoader(dataset, batch_size=512, shuffle=True)
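
A quick way to check what the `DataLoader` yields is to inspect one batch. The sketch below uses a plain list of tensor pairs as a stand-in for `CBOWDataset` (a list also provides `__len__` and `__getitem__`); the pairs themselves are made up.

```python
import torch
from torch.utils.data import DataLoader

# Made-up (context, target) pairs with C = 2, i.e. 4 context indices each
toy_data = [([0, 1, 3, 4], 2), ([1, 2, 4, 5], 3), ([2, 3, 5, 6], 4)]
samples = [
    (torch.tensor(c, dtype=torch.long), torch.tensor(t, dtype=torch.long))
    for c, t in toy_data
]

loader = DataLoader(samples, batch_size=2, shuffle=False)
context_batch, target_batch = next(iter(loader))
print(context_batch.shape, target_batch.shape)   # torch.Size([2, 4]) torch.Size([2])
```

The default collation stacks the contexts into a `(batch_size, 2C)` tensor and the targets into a `(batch_size,)` tensor, which is the input shape the model's `forward` expects.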

## 5. Implement the CBOW Neural Network

In [10]:
class CBOW(nn.Module):
    """
    Continuous Bag-of-Words (CBOW) neural network model.

    This model predicts a target word given the embeddings of its surrounding context words.
    It consists of two linear transformations:
        - W1: word embedding lookup table
        - W2: projection from embedding space back to vocabulary space
    """

    def __init__(self, vocab_size, embedding_dim):
        """
        Initialize CBOW model parameters.

        Args:
            vocab_size (int): Number of unique words in the vocabulary
            embedding_dim (int): Dimensionality of the embedding space
        """
        super().__init__()
        ...

    def forward(self, context_words):
        """
        Forward pass of the CBOW model.

        Args:
            context_words (Tensor): Tensor of shape (batch_size, 2C)
                containing indices of context words

        Returns:
            Tensor: Log-probabilities over the vocabulary for each target word
        """
        ...
        return log_probs

In [11]:
embedding_dim = 100
model = CBOW(V, embedding_dim)
print(model)

CBOW(
  (W1): Embedding(19402, 100)
  (W2): Linear(in_features=100, out_features=19402, bias=False)
)
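
One possible implementation that is consistent with the printed structure above (an `Embedding` named `W1` and a bias-free `Linear` named `W2`) is sketched below. The class name `CBOWSketch` and the toy sizes are assumptions, and other designs would satisfy the exercise as well.

```python
import torch
from torch import nn
import torch.nn.functional as F

class CBOWSketch(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.W1 = nn.Embedding(vocab_size, embedding_dim)            # embedding lookup
        self.W2 = nn.Linear(embedding_dim, vocab_size, bias=False)   # back to vocabulary

    def forward(self, context_words):
        embeds = self.W1(context_words)       # (batch_size, 2C, embedding_dim)
        h = embeds.mean(dim=1)                # average over the 2C context words
        scores = self.W2(h)                   # (batch_size, vocab_size)
        return F.log_softmax(scores, dim=1)   # log-probabilities, as NLLLoss expects

toy_model = CBOWSketch(vocab_size=50, embedding_dim=8)
toy_batch = torch.randint(0, 50, (4, 10))     # 4 samples, 2C = 10 context indices
toy_log_probs = toy_model(toy_batch)
print(toy_log_probs.shape)                    # torch.Size([4, 50])
```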


## 6. Training Loop for the CBOW Model

In [None]:
loss_fn = nn.NLLLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

EPOCHS = 5
for epoch in range(EPOCHS):
    ...
    print(f"Epoch {epoch+1}/{EPOCHS} - Loss: {total_loss:.4f}")
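
The body of the loop could look like the sketch below. To stay self-contained, it trains on random toy data and uses a stand-in model built from `nn.EmbeddingBag(mode="mean")`, which averages the context embeddings. The function name `train`, the toy sizes and the stand-in model are assumptions, not the lab's own code.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Toy stand-ins (assumptions): random (context, target) pairs and a CBOW-like model
V_toy, dim, C = 30, 8, 2
contexts = torch.randint(0, V_toy, (256, 2 * C))
targets = torch.randint(0, V_toy, (256,))
toy_loader = DataLoader(TensorDataset(contexts, targets), batch_size=64, shuffle=True)

toy_model = nn.Sequential(
    nn.EmbeddingBag(V_toy, dim, mode="mean"),   # mean of the context embeddings
    nn.Linear(dim, V_toy, bias=False),
    nn.LogSoftmax(dim=1),
)

def train(model, loader, epochs=3, lr=0.01):
    loss_fn = nn.NLLLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    history = []
    for epoch in range(epochs):
        model.train()
        total_loss = 0.0
        for context_batch, target_batch in loader:
            optimizer.zero_grad()                     # reset accumulated gradients
            log_probs = model(context_batch)          # (batch_size, vocab_size)
            loss = loss_fn(log_probs, target_batch)   # mean NLL over the batch
            loss.backward()
            optimizer.step()
            total_loss += loss.item()                 # sum of per-batch mean losses
        history.append(total_loss)
        print(f"Epoch {epoch+1}/{epochs} - Loss: {total_loss:.4f}")
    return history

history = train(toy_model, toy_loader)
```

Because `total_loss` sums per-batch averages, its magnitude depends on the number of batches. Dividing by `len(loader)` gives a per-batch average that is easier to compare across batch sizes.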

## 7. Explore the Learned Embedding Space

**Goal:** Write a function to inspect which words are most and least similar in the learned embedding space.

**Instructions:**
- Define a function `find_nearest(word, top_k=5)` that:
  - Checks if the word exists in the vocabulary; if not, print an informative message.
  - Retrieves the embedding vector of the given word from the model.
  - Computes cosine similarities between this vector and all other word embeddings in `model.W1.weight`.
  - Identifies:
    - The **top-k most similar** words (highest cosine similarity).
    - The **top-k most dissimilar** words (lowest cosine similarity).
  - Prints both lists side by side for comparison.
- Keep the **function name** and **docstring** exactly as in the example.

In [29]:
def find_nearest(word, top_k=5):
    """
    Display the top-k most similar and least similar words to a given word
    based on cosine similarity in the embedding space.

    Args:
        word (str): The query word.
        top_k (int): Number of top and bottom words to display.
    """
    ...

In [None]:
example_words = ["king", "piano", "doctor", "musician", "money"]

for example in example_words:
    print("\n" + "=" * 80)
    print(f"{'Word: ' + example:^80}")
    print("=" * 80)
    find_nearest(example, top_k=5)
    print("\n" + "-" * 80)
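
The similarity search described in the instructions can be sketched as below. To keep the snippet self-contained, the embedding matrix and the two dictionaries are passed explicitly and the results are returned as well as printed; this differs from the required `find_nearest(word, top_k=5)` signature. The function name `nearest_and_farthest` and the toy embeddings are assumptions.

```python
import torch
import torch.nn.functional as F

def nearest_and_farthest(word, weight, word2idx, idx2word, top_k=5):
    if word not in word2idx:
        print(f"'{word}' is not in the vocabulary.")
        return [], []
    weight = weight.detach()                             # no gradients needed here
    query_idx = word2idx[word]
    query = weight[query_idx].unsqueeze(0)               # (1, dim)
    sims = F.cosine_similarity(query, weight, dim=1)     # (V,)
    order = torch.argsort(sims, descending=True).tolist()
    order = [i for i in order if i != query_idx]         # exclude the query word itself
    most = [(idx2word[i], sims[i].item()) for i in order[:top_k]]
    least = [(idx2word[i], sims[i].item()) for i in order[::-1][:top_k]]
    for (w_hi, s_hi), (w_lo, s_lo) in zip(most, least):
        print(f"{w_hi:<20}{s_hi:>8.3f}    |    {w_lo:<20}{s_lo:>8.3f}")
    return most, least

# Toy 2-dimensional embeddings (made up) to check the behaviour
toy_words = ["king", "queen", "piano", "money"]
toy_w2i = {w: i for i, w in enumerate(toy_words)}
toy_i2w = {i: w for w, i in toy_w2i.items()}
toy_weight = torch.tensor([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])

most, least = nearest_and_farthest("king", toy_weight, toy_w2i, toy_i2w, top_k=1)
```

With the trained model, the matrix to pass would be `model.W1.weight`.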

## 8. Conclusion

We have implemented:
- **Tokenization** and **vocabulary construction**,  
- **Generation of (context, target)** pairs,  
- A **CBOW neural network** trained using *negative log-likelihood*,  
- And a **qualitative exploration** of the learned embeddings.

This practical session illustrates the core principle of **distributional language learning**: words that occur in similar contexts end up with similar embeddings.