
# 🔤 Using an Embedding with an LSTM to Process Token Sequences

This notebook shows how to use an `nn.Embedding` to turn a sequence of **discrete tokens** (integers representing words or categories) into dense vectors before feeding them to an LSTM.

This setup is very common in natural language processing (NLP), but it applies equally to any categorical sequence.
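
As a quick illustration of what the embedding step does, here is a minimal sketch. The sizes (`vocab_size=10`, `embedding_dim=4`) and the token values are made up for the example and are not the ones used later in the notebook. It shows that the layer maps a `[B, T]` tensor of integer ids to a `[B, T, E]` tensor, and that this is the same as selecting rows of the layer's weight matrix.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for this illustration.
vocab_size, embedding_dim = 10, 4
embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)

# Two sequences of four token ids each: shape [B=2, T=4], dtype int64.
tokens = torch.tensor([[1, 5, 5, 9],
                       [0, 2, 7, 1]])

vectors = embedding(tokens)  # shape [B, T, E] = [2, 4, 4]
print(vectors.shape)

# The layer is a lookup table: each token id selects one row of the weight matrix.
same_as_indexing = torch.equal(vectors, embedding.weight[tokens])
print(same_as_indexing)
```

Because the rows are ordinary trainable parameters, the vectors are adjusted during training along with the rest of the model.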

## 🔧 Goal:
1. Generate sequences of random tokens.
2. Apply an embedding.
3. Train an LSTM to classify each sequence.


In [41]:

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np


## 📦 Generating a dataset of random tokens

In [50]:

# Parameters
VOCAB_SIZE = 50
SEQ_LEN = 8

def generate_token_batch(batch_size=50):
    sequences = torch.randint(0, VOCAB_SIZE, (batch_size, SEQ_LEN))
    labels = torch.tensor([
        1 if (seq[0] + seq[-1]) % 2 == 0 else 0  # arbitrary rule: parity of first token + last token
        for seq in sequences
    ])
    return sequences, labels
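
To make the labeling rule more concrete, the following sketch (plain Python, no PyTorch needed) checks two properties by enumerating every possible pair of first and last tokens. The helper names `label` and `same_parity` are introduced here for illustration only. The check shows that the rule is equivalent to "first and last token have the same parity", and that the two classes are exactly balanced when tokens are drawn uniformly.

```python
VOCAB_SIZE = 50  # same value as in the notebook

def label(first, last):
    # Rule used by generate_token_batch: 1 when first + last is even.
    return 1 if (first + last) % 2 == 0 else 0

def same_parity(first, last):
    # Equivalent formulation: 1 when both tokens are even or both are odd.
    return 1 if first % 2 == last % 2 else 0

# Enumerate every (first, last) pair that can occur.
pairs = [(a, b) for a in range(VOCAB_SIZE) for b in range(VOCAB_SIZE)]
rules_agree = all(label(a, b) == same_parity(a, b) for a, b in pairs)
positive_fraction = sum(label(a, b) for a, b in pairs) / len(pairs)

print(rules_agree)        # True
print(positive_fraction)  # 0.5
```

Two consequences follow: the six middle tokens carry no information about the label, and an accuracy of 50% corresponds to chance level.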


In [51]:
X, y = generate_token_batch()
print(X)
print(y)

tensor([[21,  6, 37, 33, 18, 48,  1, 25],
        [27, 24, 29, 20, 37, 44, 45,  1],
        [ 9,  6, 29, 37, 31, 45, 40,  1],
        [10, 34, 41,  4, 47, 26, 24, 38],
        [49, 25,  9, 40,  4, 27, 22, 44],
        [ 6,  9,  5, 35,  4, 44, 29, 39],
        [ 2, 44, 43, 34, 12, 16,  3,  2],
        [44,  8, 42,  1, 23, 49, 46, 34],
        [36, 47, 44, 43, 25, 17, 40, 18],
        [ 2, 12, 42, 41, 25, 28, 10, 19],
        [22, 10, 17,  7, 25,  0, 28, 39],
        [ 4, 20, 11, 39, 12, 47, 23,  2],
        [41, 21, 49, 27, 22, 24, 43, 35],
        [15,  5, 42, 32, 28, 19, 14, 38],
        [44, 44,  9, 14,  7, 34, 23,  4],
        [42, 36, 31, 30, 23, 14,  5, 30],
        [ 7, 37,  5, 18, 18, 29, 38,  8],
        [24, 43, 12, 40, 35, 30, 16, 43],
        [ 2, 34, 37, 20, 23,  9, 38, 27],
        [12,  7,  9, 35,  8, 43, 11, 39],
        [37, 48, 33, 41, 42, 45, 27, 25],
        [18, 11, 46, 39, 21, 19, 29, 25],
        [48, 39,  3, 40, 40,  4, 39, 13],
        [ 8, 33, 14, 40, 43, 16, 1

## 🧠 Model: Embedding + LSTM + Classification

In [52]:

class TokenLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim=32, hidden_size=64):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 2)

    def forward(self, x):
        x = self.embedding(x)  # [B, T] -> [B, T, E]
        _, (hn, _) = self.lstm(x)  # hn: [1, B, H]
        return self.fc(hn.squeeze(0))  # [B, 2]
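
The shape comments in `forward` can be checked in isolation. The sketch below uses a random tensor in place of the embedding output, with an arbitrary batch size of 3. It prints the shapes returned by `nn.LSTM` and confirms that, for a single unidirectional layer, the final hidden state `hn` equals the last time step of the per-step output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Sizes for illustration: B is arbitrary, T/E/H mirror the notebook defaults.
B, T, E, H = 3, 8, 32, 64
lstm = nn.LSTM(input_size=E, hidden_size=H, batch_first=True)

x = torch.randn(B, T, E)  # stands in for the embedding output
out, (hn, cn) = lstm(x)

print(out.shape)  # per-step hidden states: [B, T, H]
print(hn.shape)   # final hidden state: [num_layers=1, B, H]

# For one unidirectional layer, hn[-1] matches the last time step of `out`.
last_step_matches = torch.allclose(out[:, -1, :], hn[-1])
print(last_step_matches)
```

Note that `hn` keeps the layer dimension first even with `batch_first=True`. Indexing with `hn[-1]` instead of `hn.squeeze(0)` is one possible way to keep the classifier working if the LSTM were later given more than one layer.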


## 🏋️ Training function

In [53]:

def train(model, epochs=50):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(epochs):
        X, y = generate_token_batch()
        outputs = model(X)
        loss = criterion(outputs, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        acc = (outputs.argmax(1) == y).float().mean().item()
        print(f"Epoch {epoch+1:02d} - Loss: {loss.item():.4f} - Accuracy: {acc*100:.2f}%")


## 🚀 Let's train the model on tokens!

In [54]:

model = TokenLSTMClassifier(vocab_size=VOCAB_SIZE)
train(model)


Epoch 01 - Loss: 0.7001 - Accuracy: 52.00%
Epoch 02 - Loss: 0.7429 - Accuracy: 38.00%
Epoch 03 - Loss: 0.6728 - Accuracy: 60.00%
Epoch 04 - Loss: 0.6890 - Accuracy: 52.00%
Epoch 05 - Loss: 0.6883 - Accuracy: 54.00%
Epoch 06 - Loss: 0.7296 - Accuracy: 42.00%
Epoch 07 - Loss: 0.7131 - Accuracy: 52.00%
Epoch 08 - Loss: 0.6996 - Accuracy: 44.00%
Epoch 09 - Loss: 0.7290 - Accuracy: 44.00%
Epoch 10 - Loss: 0.6974 - Accuracy: 52.00%
Epoch 11 - Loss: 0.7083 - Accuracy: 48.00%
Epoch 12 - Loss: 0.6897 - Accuracy: 56.00%
Epoch 13 - Loss: 0.7043 - Accuracy: 50.00%
Epoch 14 - Loss: 0.7078 - Accuracy: 40.00%
Epoch 15 - Loss: 0.7021 - Accuracy: 44.00%
Epoch 16 - Loss: 0.6965 - Accuracy: 46.00%
Epoch 17 - Loss: 0.6953 - Accuracy: 52.00%
Epoch 18 - Loss: 0.6777 - Accuracy: 58.00%
Epoch 19 - Loss: 0.6806 - Accuracy: 54.00%
Epoch 20 - Loss: 0.6999 - Accuracy: 48.00%
Epoch 21 - Loss: 0.6865 - Accuracy: 60.00%
Epoch 22 - Loss: 0.7032 - Accuracy: 44.00%
Epoch 23 - Loss: 0.6958 - Accuracy: 52.00%
Epoch 24 - 
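
When reading the log above, it helps to know how much the accuracy of a single batch fluctuates by chance alone. The following back-of-the-envelope sketch assumes a model that guesses at random on balanced classes and a batch of 50 sequences (the default of `generate_token_batch`). It computes the binomial standard error and a rough two-sigma band.

```python
import math

batch_size = 50  # default batch size of generate_token_batch
p_chance = 0.5   # the two classes are balanced under the labeling rule

# Standard deviation of the accuracy measured on one batch of random guesses.
std_error = math.sqrt(p_chance * (1 - p_chance) / batch_size)

# Rough two-sigma band around chance level.
low, high = p_chance - 2 * std_error, p_chance + 2 * std_error
print(f"std error: {std_error:.3f}")            # std error: 0.071
print(f"chance band: {low:.2f} to {high:.2f}")  # chance band: 0.36 to 0.64
```

The accuracies logged above range from 38% to 60%, which lies inside this band, so the visible part of the log is consistent with a model that has not yet learned the rule. Evaluating on a much larger batch would give a less noisy estimate.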


---

## ✅ Summary

- Tokens are encoded by an `nn.Embedding`, which lets the LSTM work in a dense vector space.
- The LSTM extracts sequential dependencies from these vectors.
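- The model is trained on an arbitrary task that depends only on the first and last values of the sequence. In the training log shown above, accuracy stays close to 50%, so the rule has not been learned within the epochs displayed.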

You can easily adapt this code to real text sequences by using a tokenizer (such as the ones from HuggingFace).
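
To give an idea of what the adaptation involves, here is a deliberately simplified sketch in plain Python. It is not the HuggingFace API: the whitespace splitting, the `<pad>` and `<unk>` tokens, the toy corpus, and the `encode` helper are all assumptions made for the example. It only shows the two steps any tokenizer has to provide here: mapping text to integer ids, and bringing every sequence to a fixed length.

```python
# Hypothetical toy corpus; a real project would use a trained tokenizer instead.
texts = ["the cat sat", "the dog sat down"]

PAD, UNK = "<pad>", "<unk>"  # assumed special tokens
vocab = {PAD: 0, UNK: 1}
for text in texts:
    for word in text.split():
        # Assign the next free id to each word seen for the first time.
        vocab.setdefault(word, len(vocab))

def encode(text, seq_len):
    """Map words to ids, then truncate or right-pad to seq_len."""
    ids = [vocab.get(word, vocab[UNK]) for word in text.split()]
    ids = ids[:seq_len]
    return ids + [vocab[PAD]] * (seq_len - len(ids))

# "bird" is not in the vocabulary, so it is mapped to the <unk> id.
batch = [encode(t, seq_len=4) for t in texts + ["the bird sat"]]
print(batch)  # [[2, 3, 4, 0], [2, 5, 4, 6], [2, 1, 4, 0]]
```

The resulting lists can be wrapped in `torch.tensor(batch)` and passed to the classifier, with `len(vocab)` as `vocab_size`. When padding is used, `nn.Embedding` accepts a `padding_idx` argument so that the padding row stays at zero and receives no gradient.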



In [56]:
# 🔮 Prediction function
# Note: the model is a binary classifier, so the value returned here is the
# predicted class (0 or 1) for the whole sequence, not a next token.
def predict_next_token(model, sequence):
    model.eval()
    with torch.no_grad():
        input_tensor = torch.tensor(sequence).unsqueeze(0)  # [1, seq_len]
        logits = model(input_tensor)  # [1, 2]
        probs = torch.softmax(logits, dim=1)
        predicted_token = torch.argmax(probs, dim=1).item()
    return predicted_token, probs.squeeze().tolist()

# Prediction example
sample_sequence = [21,  6, 37, 33, 18, 48,  1, 25]  # same length as SEQ_LEN, as in training
predicted_token, probs = predict_next_token(model, sample_sequence)

print(f"Input sequence  : {sample_sequence}")
print(f"Predicted class : {predicted_token}")


Input sequence  : [21, 6, 37, 33, 18, 48, 1, 25]
Predicted class : 0
