# Lab: Implementing Word2Vec (CBOW) in PyTorch

**Objective:** In this lab, we build a Word2Vec model **by hand**, in its **CBOW (Continuous Bag of Words)** variant.

The goal is to learn **vector representations** (embeddings) of words by training a small neural network that predicts a word from its context:

$$\mathbb{P}(w_t \mid h_t) = \text{Softmax}(W^{(2)} \cdot h_t)$$

where:

- $h_t$ is the mean of the embeddings of the **context** words, with $C$ words taken on each side of $w_t$: $$ h_t = \frac{1}{2C} \sum_{i=-C, i\neq 0}^{C} v(w_{t+i}), $$
- $ v(w) \in \mathbb{R}^p $ is the vector of word $w$, i.e. the corresponding column of $W^{(1)}$,
- $ W^{(1)} \in \mathbb{R}^{p \times N} $ and $ W^{(2)} \in \mathbb{R}^{N \times p} $, where $N$ is the vocabulary size,
- and the **loss** is the negative log-likelihood:

$$\mathcal{L} = - \sum_{t=1}^{T} \log \mathbb{P}(w_t \mid h_t)$$
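
To make the notation concrete, the sketch below evaluates these formulas for a single position $t$ on random tensors. The sizes `N = 10`, `p = 4`, `C = 2` and all variable names are illustrative assumptions, not values used in the rest of the lab.

```python
import torch

torch.manual_seed(0)

N, p, C = 10, 4, 2            # toy vocabulary size, embedding dim, half-window (assumed values)
W1 = torch.randn(p, N)        # column j holds the embedding v(w_j)
W2 = torch.randn(N, p)        # output projection

context_ids = torch.tensor([1, 3, 5, 7])   # the 2C context word indices
target_id = 4                               # index of the center word w_t

h = W1[:, context_ids].mean(dim=1)          # h_t: average of the 2C context embeddings, shape (p,)
probs = torch.softmax(W2 @ h, dim=0)        # distribution over the N words, shape (N,)
loss = -torch.log(probs[target_id])         # contribution of position t to the loss

print(h.shape, probs.sum().item(), loss.item())
```

The full loss is the sum of such terms over every position $t$ of the corpus.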

We will work on a reduced corpus drawn from *WikiText-2*.

## 1. Importing libraries and preprocessing the corpus

In [1]:
import pandas as pd
import spacy
from tqdm import tqdm
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
import random
import torch.nn.functional as F

In [2]:
train_iter = pd.read_csv(
    './wikitext-2-train.csv',
    header=None,
    names=['text'],
    encoding='utf-8'
)

In [3]:
# Load the small English spaCy pipeline used for tokenization
nlp = spacy.load("en_core_web_sm")

In [4]:
def tokenize_sentence(sentence):
    return [token.text.lower() for token in nlp(sentence) if token.is_alpha]

In [5]:
sentences = [tokenize_sentence(t) for t in tqdm(train_iter['text'].dropna().tolist())]
sentences = [s for s in sentences if len(s) > 2]
print("Example:", sentences[0][:10])

100%|█████████████████████████████████████████| 615/615 [03:20<00:00,  3.07it/s]

Example: ['york', 'city', 'season', 'the', 'season', 'was', 'the', 'unk', 'season', 'of']





## 2. Building the vocabulary

In [6]:
from collections import Counter

vocab = Counter([w for sent in sentences for w in sent])
vocab = {w: c for w, c in vocab.items() if c >= 5}
word2idx = {w: i for i, w in enumerate(vocab.keys())}
idx2word = {i: w for w, i in word2idx.items()}
V = len(word2idx)
print(f"Vocabulary size: {V}")

Vocabulary size: 19402
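
The frequency filter and the two index mappings can be checked on a corpus small enough to read at a glance. This is a sketch on hypothetical data: `toy_sentences`, `MIN_COUNT` and the `toy_*` names are introduced here for illustration, and the threshold is lowered from 5 to 2 so the toy vocabulary is not empty.

```python
from collections import Counter

# Toy tokenized corpus (hypothetical data, for illustration only)
toy_sentences = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
MIN_COUNT = 2  # the lab uses 5; lowered here so the toy corpus is not emptied

counts = Counter(w for sent in toy_sentences for w in sent)
kept = {w: c for w, c in counts.items() if c >= MIN_COUNT}
toy_word2idx = {w: i for i, w in enumerate(kept)}
toy_idx2word = {i: w for w, i in toy_word2idx.items()}

print(kept)          # {'the': 3, 'cat': 2, 'sat': 2}
print(toy_word2idx)  # {'the': 0, 'cat': 1, 'sat': 2}
```

Indices follow the order of first appearance in the corpus, and rare words ("dog", "ran") receive no index at all.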


## 3. Generating (context, target word) pairs

In [7]:
CONTEXT_SIZE = 3

def generate_context_target(sentences, context_size=5):
    """
    Generate (context, target) training pairs for the CBOW model.

    For each word in each sentence, this function extracts:
        - A target word (the central word)
        - Its surrounding context words within a specified window size

    Example:
        Sentence: ["the", "quick", "brown", "fox", "jumps"]
        context_size = 2
        → Target = "brown"
        → Context = ["the", "quick", "fox", "jumps"]

    Args:
        sentences (list of list of str): List of tokenized sentences
        context_size (int): Number of context words to take on each side of the target

    Returns:
        list of tuples: Each tuple (context, target) contains
                        - context: list of word indices (size = 2 * context_size)
                        - target: integer index of the center word

    Meaning:
        The result is a training dataset for the CBOW neural network, where each
        example teaches the model to predict the central word given its surrounding
        context words.
    """
    data = []
    for sentence in sentences:
        indices = [word2idx[w] for w in sentence if w in word2idx]
        for i in range(context_size, len(indices) - context_size):
            context = indices[i - context_size:i] + indices[i + 1:i + context_size + 1]
            target = indices[i]
            data.append((context, target))
    return data

data = generate_context_target(sentences, CONTEXT_SIZE)
print("Example:", data[0])

Example: ([0, 1, 2, 2, 4, 3], 3)
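
Two properties of this windowing are easy to miss: out-of-vocabulary words are dropped before the window is applied, and words closer than `context_size` to a sentence boundary never become targets. The sketch below reproduces the same rule on one sentence. The helper `toy_context_target` and the toy vocabulary are hypothetical names used only here.

```python
def toy_context_target(indices, context_size):
    """Same windowing rule as generate_context_target, for one indexed sentence."""
    pairs = []
    for i in range(context_size, len(indices) - context_size):
        context = indices[i - context_size:i] + indices[i + 1:i + context_size + 1]
        pairs.append((context, indices[i]))
    return pairs

# Hypothetical toy vocabulary; "jumps" is deliberately left out
toy_word2idx = {"the": 0, "quick": 1, "brown": 2, "fox": 3, "over": 4}
sentence = ["the", "quick", "brown", "fox", "jumps", "over"]
indices = [toy_word2idx[w] for w in sentence if w in toy_word2idx]

print(indices)                          # [0, 1, 2, 3, 4]
print(toy_context_target(indices, 2))   # [([0, 1, 3, 4], 2)]
print(toy_context_target(indices, 3))   # []
```

With `context_size = 2`, the context of "brown" contains "over" even though "jumps" separated them in the raw text. With `context_size = 3`, the five remaining tokens are too few to produce any pair.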


## 4. Defining the PyTorch Dataset

In [8]:
class CBOWDataset(Dataset):
    """
    Custom PyTorch Dataset for the Continuous Bag-of-Words (CBOW) model.
    Each sample consists of a context (list of word indices) and a target word index.
    """

    def __init__(self, data):
        """
        Initialize the dataset with preprocessed (context, target) pairs.

        Args:
            data (list of tuples): Each tuple contains (context_indices, target_index)
        """
        self.data = data

    def __len__(self):
        """
        Return the total number of (context, target) samples.

        Returns:
            int: Number of samples in the dataset
        """
        return len(self.data)

    def __getitem__(self, idx):
        """
        Retrieve the context-target pair at the given index.

        Args:
            idx (int): Index of the desired sample

        Returns:
            tuple: (context_tensor, target_tensor)
        """
        context, target = self.data[idx]
        return torch.tensor(context, dtype=torch.long), torch.tensor(target, dtype=torch.long)

In [9]:
dataset = CBOWDataset(data)
dataloader = DataLoader(dataset, batch_size=512, shuffle=True)
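
A quick way to see what the `DataLoader` yields is to iterate over a tiny dataset and record the batch shapes. In this sketch, `ToyPairs` is a hypothetical stand-in with the same return types as `CBOWDataset`, and the ten identical pairs and `batch_size=4` are arbitrary choices.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyPairs(Dataset):
    """Hypothetical stand-in for CBOWDataset with the same return types."""
    def __init__(self, pairs):
        self.pairs = pairs

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        context, target = self.pairs[idx]
        return torch.tensor(context, dtype=torch.long), torch.tensor(target, dtype=torch.long)

# 10 fake pairs with 2C = 6 context indices each (as with CONTEXT_SIZE = 3)
pairs = [([0, 1, 2, 4, 5, 6], 3)] * 10
loader = DataLoader(ToyPairs(pairs), batch_size=4, shuffle=True)

# Contexts are stacked into (batch, 2C), targets into (batch,)
shapes = [(tuple(ctx.shape), tuple(tgt.shape)) for ctx, tgt in loader]
print(shapes)   # [((4, 6), (4,)), ((4, 6), (4,)), ((2, 6), (2,))]
```

The default collation stacks samples into one tensor, which works here because every context has exactly `2 * context_size` indices. The last batch is smaller when the dataset size is not a multiple of the batch size.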

## 5. Defining the CBOW model

In [10]:
class CBOW(nn.Module):
    """
    Continuous Bag-of-Words (CBOW) neural network model.

    This model predicts a target word given the embeddings of its surrounding context words.
    It consists of two linear transformations:
        - W1: word embedding lookup table
        - W2: projection from embedding space back to vocabulary space
    """

    def __init__(self, vocab_size, embedding_dim):
        """
        Initialize CBOW model parameters.

        Args:
            vocab_size (int): Number of unique words in the vocabulary
            embedding_dim (int): Dimensionality of the embedding space
        """
        super().__init__()
        self.W1 = nn.Embedding(vocab_size, embedding_dim)
        self.W2 = nn.Linear(embedding_dim, vocab_size, bias=False)

    def forward(self, context_words):
        """
        Forward pass of the CBOW model.

        Args:
            context_words (Tensor): Tensor of shape (batch_size, 2C)
                containing indices of context words

        Returns:
            Tensor: Log-probabilities over the vocabulary for each target word
        """
        embeds = self.W1(context_words)  # (batch_size, 2C, embedding_dim)
        h = embeds.mean(dim=1)           # Average context embeddings
        out = self.W2(h)                 # Project to vocabulary space
        log_probs = torch.log_softmax(out, dim=1)
        return log_probs

In [11]:
embedding_dim = 100
model = CBOW(V, embedding_dim)
print(model)

CBOW(
  (W1): Embedding(19402, 100)
  (W2): Linear(in_features=100, out_features=19402, bias=False)
)
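
As a sanity check on the printed architecture, the sketch below rebuilds the two layers with the sizes shown above, counts the parameters, and runs one forward pass on a fake batch. The batch of 8 random contexts is an assumption made for illustration.

```python
import torch
from torch import nn

V, p = 19402, 100                       # vocabulary size and embedding dim printed above
W1 = nn.Embedding(V, p)
W2 = nn.Linear(p, V, bias=False)

n_params = sum(t.numel() for t in list(W1.parameters()) + list(W2.parameters()))
print(n_params)                         # 3880400, i.e. 2 * V * p

print(W1.weight.shape, W2.weight.shape) # both are stored as (V, p)

context = torch.randint(0, V, (8, 6))   # fake batch: 8 examples, 2C = 6 indices each
log_probs = torch.log_softmax(W2(W1(context).mean(dim=1)), dim=1)
print(log_probs.shape)                  # torch.Size([8, 19402])
print(log_probs.exp().sum(dim=1))       # every row sums to about 1
```

Note that `nn.Embedding` stores one embedding per row, so `W1.weight` is the transpose of the $p \times N$ matrix $W^{(1)}$ used in the introduction, while `W2.weight` already has the $N \times p$ shape of $W^{(2)}$.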


## 6. Training loop

In [12]:
loss_fn = nn.NLLLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

EPOCHS = 3
for epoch in range(EPOCHS):
    total_loss = 0
    loop = tqdm(dataloader, desc=f"Epoch {epoch+1}/{EPOCHS}", leave=False)
    for context, target in loop:
        optimizer.zero_grad()
        log_probs = model(context)
        loss = loss_fn(log_probs, target)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()  # NLLLoss averages over the batch, so this sums per-batch means
    print(f"Epoch {epoch+1}/{EPOCHS} - Loss: {total_loss:.4f}")

Epoch 1/3 - Loss: 20810.0363
Epoch 2/3 - Loss: 17957.8131
Epoch 3/3 - Loss: 16805.4065
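
Three facts help when reading these numbers. `nn.NLLLoss` applied to `log_softmax` outputs is the same quantity as cross-entropy on the raw scores. A model that predicts uniformly has a per-example loss of $\ln V$. And the value printed per epoch is a sum of per-batch means, so dividing by the number of batches gives a per-example scale. The sketch below checks each point on toy tensors. The vocabulary size of 50 and the `batch_losses` values are hypothetical.

```python
import math
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
V = 50                                   # toy vocabulary size (assumption)
logits = torch.randn(8, V)               # stand-in for the output of W2
target = torch.randint(0, V, (8,))

# NLLLoss on log-probabilities equals cross-entropy on raw scores
nll = nn.NLLLoss()(torch.log_softmax(logits, dim=1), target)
ce = F.cross_entropy(logits, target)
print(torch.allclose(nll, ce))           # True

# A uniform predictor has loss ln(V) per example
uniform = nn.NLLLoss()(torch.log_softmax(torch.zeros(8, V), dim=1), target)
print(uniform.item(), math.log(V))

# Averaging per-batch losses instead of summing them
batch_losses = [2.0, 1.5, 1.0]           # hypothetical values
print(sum(batch_losses) / len(batch_losses))   # 1.5
```

In the training loop, the equivalent change would be to print `total_loss / len(dataloader)`, which makes epochs comparable across different batch sizes.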




## 7. Exploring the learned embeddings

In [13]:
def find_nearest(word, top_k=5):
    """
    Display the top-k most similar and least similar words to a given word
    based on cosine similarity in the embedding space.

    Args:
        word (str): The query word.
        top_k (int): Number of top and bottom words to display.
    """
    if word not in word2idx:
        print("Out-of-vocabulary word.")
        return

    idx = word2idx[word]
    # Detach so the similarity search does not build an autograd graph
    embeddings = model.W1.weight.detach()
    w_vec = embeddings[idx]

    # Compute cosine similarities between the query word and all embeddings
    sims = F.cosine_similarity(embeddings, w_vec.unsqueeze(0))

    # Get top positive and negative similarities
    best = torch.topk(sims, top_k + 1)           # +1 to skip the word itself
    worst = torch.topk(-sims, top_k)

    print(f"\nMost similar and dissimilar words to '{word}':\n")
    print(f"{'Most similar':<30}{'Most dissimilar':<30}")
    print("-" * 60)

    # Pair them together and print
    for (i_pos, score_pos), (i_neg, score_neg) in zip(
        zip(best.indices[1:], best.values[1:]),
        zip(worst.indices, -worst.values)
    ):
        print(f"{idx2word[i_pos.item()]:<15} cosine={score_pos.item():>6.3f}   "
              f"{idx2word[i_neg.item()]:<15} cosine={score_neg.item():>6.3f}")
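
The similarity search can be traced by hand on embeddings whose geometry is obvious. The sketch below uses hypothetical 2-D vectors and word labels. It also shows one possible alternative to skipping the first `topk` result: masking the query explicitly, which does not rely on the query being ranked first.

```python
import torch
import torch.nn.functional as F

# Hypothetical 2-D embeddings, chosen so the similarities are easy to read
words = ["a", "b", "c", "d"]
E = torch.tensor([[1.0, 0.0],
                  [2.0, 0.0],    # same direction as "a"
                  [0.0, 1.0],    # orthogonal to "a"
                  [-1.0, 0.0]])  # opposite to "a"

query = 0
sims = F.cosine_similarity(E, E[query].unsqueeze(0))   # broadcasts (4, 2) against (1, 2)
print(sims)                                             # approximately [1, 1, 0, -1]

# Most dissimilar word: largest value of the negated similarities
worst = torch.topk(-sims, 1)
print(words[worst.indices[0].item()])                   # d

# Most similar word, excluding the query explicitly instead of assuming it ranks first
masked = sims.clone()
masked[query] = float("-inf")
best = torch.topk(masked, 1)
print(words[best.indices[0].item()])                    # b
```

Cosine similarity ignores vector length, which is why "b" scores the same as "a" itself. When such ties occur, the order returned by `torch.topk` is not guaranteed, so the first result need not be the query.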

In [14]:
example_words = ["king", "piano", "doctor", "musician", "money"]

for example in example_words:
    print("\n" + "=" * 80)
    print(f"{'Word: ' + example:^80}")
    print("=" * 80)
    find_nearest(example, top_k=5)
    print("\n" + "-" * 80)


                                   Word: king                                   

Most similar and dissimilar words to 'king':

Most similar                  Most dissimilar               
------------------------------------------------------------
kingship        cosine= 0.461   steal           cosine=-0.293
anne            cosine= 0.435   sanctioned      cosine=-0.282
martyr          cosine= 0.396   creative        cosine=-0.281
tim             cosine= 0.388   quiet           cosine=-0.272
césar           cosine= 0.385   adaptable       cosine=-0.262

--------------------------------------------------------------------------------

                                  Word: piano                                   

Most similar and dissimilar words to 'piano':

Most similar                  Most dissimilar               
------------------------------------------------------------
deva            cosine= 0.418   postwar         cosine=-0.342
ono             cosine= 0.414   highlightin

## 8. Conclusion

We have implemented:
- **tokenization** and **vocabulary construction**,
- **generation of (context, target) pairs**,
- a **CBOW neural network** trained with the *negative log-likelihood* loss,
- and a qualitative exploration of the embeddings.

This lab illustrates the foundations of distributional representation learning for language.