# Homework

## 3. Implementation

### 3.1. Data Processing

1. The first cells of the notebook are the same as in the TP on text convolution. Apply the same preprocessing to get a dataset (with the same tokenizer) with a train and a validation split, with two columns review_ids (list of int) and label (int).

**ANSWER**: We reuse the preprocessing from the TP on text convolution.

In [93]:
import numpy as np
import torch
import torch.nn.functional as F
import torch.nn as nn
import math
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from tabulate import tabulate
from datasets import load_dataset

from tqdm import tqdm
from transformers import BertTokenizer

import functools

In [75]:
print("PyTorch version:", torch.__version__)
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
DEVICE

PyTorch version: 2.2.0+cu121


device(type='cuda')

In [3]:
dataset = load_dataset("scikit-learn/imdb", split="train")
print(dataset)

Dataset({
    features: ['review', 'sentiment'],
    num_rows: 50000
})


In [4]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)

In [5]:
print("Type of the tokenizer:", type(tokenizer.vocab))
VOCSIZE = len(tokenizer.vocab)
print("Length of the vocabulary:", VOCSIZE)
print(str(tokenizer.vocab)[:50])

Type of the tokenizer: <class 'collections.OrderedDict'>
Length of the vocabulary: 30522
OrderedDict({'[PAD]': 0, '[unused0]': 1, '[unused1


In [6]:
def preprocessing_fn(x, tokenizer):
    x["review_ids"] = tokenizer(
        x["review"],
        add_special_tokens=False,
        truncation=True,
        max_length=256,
        padding='max_length',
        return_attention_mask=False,
    )["input_ids"]
    x["label"] = 0 if x["sentiment"] == "negative" else 1
    return x

In [7]:
n_samples = 5000  # total number of examples (split into train/validation below)

# We first shuffle the data !
dataset = dataset.shuffle(seed=0)

# Select 5000 samples
sampled_dataset = dataset.select(range(n_samples))

# Tokenize the dataset
sampled_dataset = sampled_dataset.map(
    preprocessing_fn, fn_kwargs={"tokenizer" : tokenizer}
)

In [8]:
# Remove useless columns
sampled_dataset = sampled_dataset.select_columns(['review_ids','label'])

# Split the train and validation
splitted_dataset = sampled_dataset.train_test_split(test_size=0.2)

document_train_set = splitted_dataset['train']
document_valid_set = splitted_dataset['test']

2. Write a function extract_words_contexts. It should retrieve all pairs of valid $(w, C^+)$ from a list of ids representing a text document. It takes the radius $R$ as an argument. Its output is therefore two lists :

To make sure that every $C^+$ has the same size, we pad the beginning and the end of the document with the padding token (id 0). For example, the first word of the document gets R padding tokens in place of the R tokens that would normally precede it. An alternative would be to pad with the token itself, since it has a high dot product with itself.
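The two padding options mentioned above can be compared on a toy document. This is only a minimal sketch: the helper name `context_windows`, the `pad_with_self` flag and the toy ids are hypothetical and not part of the homework code.

```python
def context_windows(token_ids, R, pad_id=0, pad_with_self=False):
    """Return, for each position, the R ids before it and the R ids after it.

    Missing neighbours are replaced by pad_id, or by the centre token
    itself when pad_with_self is True (hypothetical option).
    """
    n = len(token_ids)
    windows = []
    for i, tok in enumerate(token_ids):
        filler = tok if pad_with_self else pad_id
        window = []
        for j in range(i - R, i + R + 1):
            if j == i:
                continue  # the centre word is not part of its own context
            window.append(token_ids[j] if 0 <= j < n else filler)
        windows.append(window)
    return windows


toy_ids = [11, 12, 13, 14]  # hypothetical token ids
padded = context_windows(toy_ids, R=2)
self_padded = context_windows(toy_ids, R=2, pad_with_self=True)
print(padded)       # [[0, 0, 12, 13], [0, 11, 13, 14], [11, 12, 14, 0], [12, 13, 0, 0]]
print(self_padded)  # [[11, 11, 12, 13], [12, 11, 13, 14], [11, 12, 14, 13], [12, 13, 14, 14]]
```

In both cases every window has exactly 2R entries, which is what allows the contexts to be stacked into a single tensor later on.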

In [9]:
tokenizer.pad_token_id

0

In [10]:
def extract_words_contexts(sample, R):
    token_ids = sample["review_ids"]
    n_tokens = len(token_ids)
    # Pad both ends with R padding tokens (id 0) so that every window has size 2R
    token_ids_with_padding = [0]*R + token_ids + [0]*R
    positive_context = []
    for i in range(n_tokens):
        # Token i sits at index i+R in the padded list
        left = token_ids_with_padding[i:i+R]
        right = token_ids_with_padding[i+R+1:i+2*R+1]
        positive_context.append(left + right)
    return token_ids, positive_context

In [11]:
toto, test = extract_words_contexts(document_train_set[2], 3)

In [12]:
print("First 5 tokens :", toto[:5])
print("C+ of the first 5 tokens :")
test[:5]

First 5 tokens : [1045, 3856, 2039, 2023, 3185]
C+ of the first 5 tokens :


[[0, 0, 0, 3856, 2039, 2023],
 [0, 0, 1045, 2039, 2023, 3185],
 [0, 1045, 3856, 2023, 3185, 1998],
 [1045, 3856, 2039, 3185, 1998, 2001],
 [3856, 2039, 2023, 1998, 2001, 14603]]

3. Write a function flatten_dataset_to_list that applies the function extract_words_contexts on a whole dataset.

In [13]:
def flatten_dataset_to_list(dataset, R):
    token_ids = []
    positive_contexts = []
    for sample in dataset:
        sample_token_ids, positive_context = extract_words_contexts(sample, R)
        token_ids.append(sample_token_ids)
        positive_contexts.append(positive_context)
    return token_ids, positive_contexts

4. Apply the function to your initial document_train_set and document_valid_set, and get the corresponding flattened lists.

In [14]:
R = 2
token_ids, positive_contexts = flatten_dataset_to_list(document_train_set, R)
valid_token_ids, valid_positive_contexts = flatten_dataset_to_list(document_valid_set, R)

5. Embed these lists in two valid PyTorch Dataset, like in HW 1, call them train_set and valid_set.

In [15]:
class CustomDataset(Dataset):

    def __init__(self, document_set, R):
        self.document_set = document_set
        token_ids, positive_contexts = flatten_dataset_to_list(document_set, R)
        self.token_ids = torch.tensor(token_ids)
        self.positive_contexts = torch.tensor(positive_contexts)

    def __len__(self):
        return len(self.token_ids)

    def __getitem__(self, idx):
        return {
            "word_id" : self.token_ids[idx], 
            "positive_context_ids" : self.positive_contexts[idx],
            # "label" : torch.tensor(self.document_set[idx]["label"])
        }

In [16]:
train_set = CustomDataset(document_train_set, R)
valid_set = CustomDataset(document_valid_set, R)

In [17]:
len(valid_set), len(train_set)

(1000, 4000)

In [18]:
# Check that both integer indexing and slicing work
try:
    valid_set[951], train_set[1347:-2000]
except Exception as err:
    print("error:", err)

6. Write a collate_fn function that adds the negative context to the batch. It should be parametrized by the scaling factor K.

In [19]:
def collate_fn(batch, R, K, VOCSIZE):
    '''batch is a list of dictionaries with keys "word_id" and "positive_context_ids", each containing tensors.
    The output is a dictionary with keys:
    - "word_id": the token ids of every review in the batch, a tensor of shape (batch_size, n_tokens=256)
    - "positive_context_ids": the positive context of every token of every review in the batch,
      a tensor of shape (batch_size, n_tokens, 2R)
    - "negative_context_ids": the negative context, sampled uniformly from the vocabulary,
      a tensor of shape (batch_size, n_tokens, 2RK)
    '''
    batch_size = len(batch)
    n_tokens = len(batch[0]["word_id"])
    result = dict()
    result["word_id"] = torch.stack([review["word_id"] for review in batch])
    result["positive_context_ids"] = torch.stack([review["positive_context_ids"] for review in batch])
    # sample 2RK tokens from the vocabulary for each token in each review in the batch -> reshape it -> convert to a tensor
    result["negative_context_ids"] = torch.tensor(
        np.random.choice(np.arange(VOCSIZE), 2*R*K*n_tokens*batch_size, replace=True)\
            .reshape(batch_size, n_tokens, 2*R*K)
    )
    return result
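The negative sampling step can be illustrated without NumPy or PyTorch. The sketch below assumes the same choice as `collate_fn` (uniform sampling with replacement over the whole vocabulary); the helper name `sample_negative_contexts` and the `seed` argument are hypothetical.

```python
import random


def sample_negative_contexts(batch_size, n_tokens, R, K, vocsize, seed=0):
    """Draw 2*R*K random token ids for every token of every review.

    Returns nested lists of shape (batch_size, n_tokens, 2*R*K).
    """
    rng = random.Random(seed)
    return [
        [[rng.randrange(vocsize) for _ in range(2 * R * K)] for _ in range(n_tokens)]
        for _ in range(batch_size)
    ]


negatives = sample_negative_contexts(batch_size=3, n_tokens=5, R=2, K=2, vocsize=30522)
print(len(negatives), len(negatives[0]), len(negatives[0][0]))  # 3 5 8
```

With uniform sampling, a drawn id can occasionally coincide with the word itself or with one of its true context tokens; with a vocabulary of 30522 ids this is rare, and the code above does not filter such cases.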

7. Wrap everything in a DataLoader, like in HW 1.

In [20]:
batch_size = 32
R = 2
K = 2
collate_fn_with_params = functools.partial(collate_fn, R=R, K=K, VOCSIZE=VOCSIZE)

train_dataloader = DataLoader(
    train_set, batch_size=batch_size, collate_fn=collate_fn_with_params
)   
valid_dataloader = DataLoader(
    valid_set, batch_size=batch_size, collate_fn=collate_fn_with_params
)
n_valid = len(valid_set)
n_train = len(train_set)

8. Make 2 or 3 iterations in the DataLoader and print R, K and the shapes of all the tensors in the batches (let the output be visible).

In [21]:
print("R =", R)
print("K =", K)

for i, batch in enumerate(train_dataloader):
    print(f"batch {i} :")
    print(batch.keys())
    for key, value in batch.items():
        print(f"'{key}' shape :", value.shape)
    print("-"*50)
    
    if i > 2:
        break

R = 2
K = 2
batch 0 :
dict_keys(['word_id', 'positive_context_ids', 'negative_context_ids'])
'word_id' shape : torch.Size([32, 256])
'positive_context_ids' shape : torch.Size([32, 256, 4])
'negative_context_ids' shape : torch.Size([32, 256, 8])
--------------------------------------------------
batch 1 :
dict_keys(['word_id', 'positive_context_ids', 'negative_context_ids'])
'word_id' shape : torch.Size([32, 256])
'positive_context_ids' shape : torch.Size([32, 256, 4])
'negative_context_ids' shape : torch.Size([32, 256, 8])
--------------------------------------------------
batch 2 :
dict_keys(['word_id', 'positive_context_ids', 'negative_context_ids'])
'word_id' shape : torch.Size([32, 256])
'positive_context_ids' shape : torch.Size([32, 256, 4])
'negative_context_ids' shape : torch.Size([32, 256, 8])
--------------------------------------------------
batch 3 :
dict_keys(['word_id', 'positive_context_ids', 'negative_context_ids'])
'word_id' shape : torch.Size([32, 256])
'positive_conte

### 3.2. Model

9. Write a model named Word2Vec which is a valid torch.nn.Module (i.e.,
write a class that inherits from the torch.nn.Module), and implement the
Word2Vec model. It should be parametrized by the vocabulary size and
the embeddings dimension. Use the module torch.nn.Embedding.

In [77]:
from typing import Any


class Word2Vec(torch.nn.Module):

    def __init__(self, VOCSIZE, emb_dim, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.VOCSIZE = VOCSIZE
        self.emb_dim = emb_dim

        # Layers
        self.emb = torch.nn.Embedding(self.VOCSIZE, self.emb_dim, padding_idx=0)
    
    def similarity(self, word_emb, context_emb) -> Any:
        '''Takes the word embeddings, the context_embeddings and compute the dot product between each word embedding and its context embeddings
        word_emb : (B,L,E) = (batch_size, max_length, embedding_dim)
        context_emb : (B,L,C,E)
        output : (B,L,C)
        '''
        word_emb_expanded = word_emb.unsqueeze(2) # (B,L,1,E)
        context_emb_transposed = context_emb.transpose(-1,-2) #(B,L,E,C)
        return torch.matmul(word_emb_expanded, context_emb_transposed).squeeze(2) # Matmul -> (B,L,1,C) -> squeezed into (B,L,C)
    
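    def forward(self, inputs):
        # Defining forward (instead of overriding __call__) keeps nn.Module hooks working;
        # model(batch) still dispatches here.
        embeddings = dict()
        for k,v in inputs.items():
            embeddings[k] = self.emb(v)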
        positive_similarity = self.similarity(embeddings["word_id"], embeddings["positive_context_ids"])
        negative_similarity = self.similarity(embeddings["word_id"], embeddings["negative_context_ids"])
        return {"positive_similarity":positive_similarity, "negative_similarity":negative_similarity}
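For a single word, the `similarity` method reduces to one dot product per context vector. The plain-Python sketch below shows this on toy vectors; the name `dot_similarities` and the numbers are hypothetical, and the batched `matmul` above simply performs the same computation for every word of every review at once.

```python
def dot_similarities(word_vec, context_vecs):
    """One dot product between word_vec (length E) and each of the C context vectors."""
    return [sum(w * c for w, c in zip(word_vec, ctx)) for ctx in context_vecs]


word_vec = [1.0, 2.0, 0.0]        # E = 3
context_vecs = [
    [1.0, 0.0, 5.0],              # C = 3 context embeddings
    [0.5, 0.5, 0.5],
    [0.0, 0.0, 0.0],              # an all-zero vector, like the padding row
]
scores = dot_similarities(word_vec, context_vecs)
print(scores)  # [1.0, 1.5, 0.0]
```

The last score illustrates a consequence of `padding_idx=0`: the padding embedding is a zero vector, so any similarity involving a padding token is exactly 0.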

**Quick sanity check**

In [78]:
EMB_DIM = 50
VOCSIZE = tokenizer.vocab_size
model = Word2Vec(VOCSIZE, EMB_DIM)

In [79]:
out = model(batch)
out.keys()

dict_keys(['positive_similarity', 'negative_similarity'])

In [80]:
out["positive_similarity"].shape, out["negative_similarity"].shape

(torch.Size([32, 256, 4]), torch.Size([32, 256, 8]))

10. Train the model. The training should be parametrized by the batch size
B, and the number of epochs E.

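If we denote $y = \mathbb{1}_{c \in \mathcal{C}^+}$, then our loss can be seen as a binary cross-entropy loss:
$$ \frac{1}{n} \sum_{i=1}^n - [y_i \log(x_i) + (1-y_i) \log(1-x_i)]$$ 
where $x_i = \sigma(w_i \cdot c_i)$ and the average over the $n$ pairs corresponds to `reduction='mean'`.

Therefore we can use the BCE-with-logits loss, which takes the raw dot products $w_i \cdot c_i$ and applies the sigmoid internally. In the implementation below, it is applied separately to the positive pairs (target 1) and to the negative pairs (target 0), and the two means are added: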

In [81]:
class CustomLoss(nn.Module):

    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.BCELoss = nn.BCEWithLogitsLoss()

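    def forward(self, positive_similarity, negative_similarity):
        '''Computes the loss as the sum of the mean BCE on positive pairs (target 1)
        and the mean BCE on negative pairs (target 0).
        '''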
        # Positive context
        y_positive = torch.ones_like(positive_similarity, dtype=torch.float32)
        loss = self.BCELoss(positive_similarity, y_positive)

        # Negative context
        y_negative = torch.zeros_like(negative_similarity, dtype=torch.float32)
        loss += self.BCELoss(negative_similarity, y_negative)
        return loss
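To make the link between the formula and the two `BCEWithLogitsLoss` calls concrete, here is a plain-Python sketch of the same computation on a handful of made-up scores. The function names and the score values are hypothetical. This naive version is for illustration only and does not reproduce the numerical safeguards of the PyTorch implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_with_logits(logits, target):
    """Mean binary cross-entropy of raw scores against a constant target (0 or 1)."""
    total = 0.0
    for z in logits:
        x = sigmoid(z)
        total += -(target * math.log(x) + (1 - target) * math.log(1 - x))
    return total / len(logits)


positive_scores = [2.0, -1.0]              # hypothetical w.c for pairs in C+
negative_scores = [0.5, -3.0, 1.0, -0.5]   # hypothetical w.c for sampled negatives

# Same structure as CustomLoss: mean over positives + mean over negatives
loss = bce_with_logits(positive_scores, 1) + bce_with_logits(negative_scores, 0)
print(round(loss, 4))
```

A score of 0 gives $\sigma(0) = 0.5$ and a per-pair loss of $\log 2$ for either target, which is why thresholding the similarity at 0 is the natural decision rule for the accuracy computed later.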

**Sanity check**

In [82]:
MyLoss = CustomLoss()

In [85]:
MyLoss(**out)

tensor(5.1157, grad_fn=<AddBackward0>)

Let's implement the training function

In [125]:
model.to(DEVICE)

Word2Vec(
  (emb): Embedding(30522, 50, padding_idx=0)
)

In [153]:
def validation(model, valid_dataloader, loss_fn):
    model.eval()
    loss_total = 0.
    acc = 0
    with torch.no_grad():
        for batch in tqdm(valid_dataloader):
            batch = {k: v.to(DEVICE) for k, v in batch.items()}
            output = model(batch)
            loss = loss_fn(**output)
            loss_total += loss.detach().cpu().item()
            total_predictions = output["positive_similarity"].shape.numel() + output["negative_similarity"].shape.numel()
            acc += ((torch.sum(output["positive_similarity"]>0)+torch.sum(output["negative_similarity"]<=0))/total_predictions).cpu().item()
    return loss_total / len(valid_dataloader), acc / len(valid_dataloader)

validation(model, valid_dataloader, MyLoss)

100%|██████████| 32/32 [00:00<00:00, 53.96it/s]


(4.811533749103546, 0.5336046405136585)

In [161]:
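def training(model, lr, E, B, loss_fn, train_dataloader, valid_dataloader):
    # Note: B is not used inside this function; the batch size is fixed
    # when the DataLoaders are built (batch_size in the DataLoader cell).
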
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    # Performance metric tracking
    list_val_acc = []
    list_train_acc = []
    list_train_loss = []
    list_val_loss = []

    
    for e in range(E):
        # ========== Training ==========
        model.train()
        train_loss = 0.
        acc = 0.
        for batch in tqdm(train_dataloader):
            batch = {k:v.to(DEVICE) for k,v in batch.items()}
            optimizer.zero_grad()
            output = model(batch)
            loss = loss_fn(**output)
            loss.backward()
            optimizer.step()
            train_loss += loss.detach().cpu().item()
            total_predictions = output["positive_similarity"].shape.numel() + output["negative_similarity"].shape.numel()
            acc += ((torch.sum(output["positive_similarity"]>0)+torch.sum(output["negative_similarity"]<=0))/total_predictions).cpu().item()
        list_train_loss.append(train_loss / len(train_dataloader))
        list_train_acc.append(100 * acc / len(train_dataloader))

        # ========== Validation ==========
        l, a = validation(model, valid_dataloader, loss_fn)
        list_val_loss.append(l)
        list_val_acc.append(a * 100)
        print(
            e,
            "\n\t - Train loss: {:.4f}".format(list_train_loss[-1]),
            "Train acc: {:.4f}".format(list_train_acc[-1]),
            "Val loss: {:.4f}".format(l),
            "Val acc:{:.4f}".format(a * 100),
        )
    return list_train_loss, list_train_acc, list_val_loss, list_val_acc

In [156]:
model.to(DEVICE)

Word2Vec(
  (emb): Embedding(30522, 50, padding_idx=0)
)

In [162]:
EPOCHS = 10
lr = 5e-4
training(model, lr=lr, E=EPOCHS, B=32, loss_fn=MyLoss, train_dataloader=train_dataloader, valid_dataloader=valid_dataloader)

100%|██████████| 125/125 [00:02<00:00, 44.09it/s] 
100%|██████████| 32/32 [00:00<00:00, 272.03it/s]


0 
	 - Train loss: 4.6534 Train acc: 53.6473 Val loss: 4.5166 Val acc:53.8154


100%|██████████| 125/125 [00:01<00:00, 103.02it/s]
100%|██████████| 32/32 [00:00<00:00, 295.79it/s]


1 
	 - Train loss: 4.3731 Train acc: 54.0817 Val loss: 4.2585 Val acc:54.2727


100%|██████████| 125/125 [00:01<00:00, 111.95it/s]
100%|██████████| 32/32 [00:00<00:00, 403.53it/s]


2 
	 - Train loss: 4.1268 Train acc: 54.7030 Val loss: 4.0338 Val acc:54.9799


100%|██████████| 125/125 [00:00<00:00, 128.02it/s]
100%|██████████| 32/32 [00:00<00:00, 223.31it/s]


3 
	 - Train loss: 3.9036 Train acc: 55.4203 Val loss: 3.8208 Val acc:55.6834


100%|██████████| 125/125 [00:02<00:00, 48.73it/s]
100%|██████████| 32/32 [00:00<00:00, 186.47it/s]


4 
	 - Train loss: 3.6970 Train acc: 56.0532 Val loss: 3.6315 Val acc:56.2905


100%|██████████| 125/125 [00:02<00:00, 41.71it/s]
100%|██████████| 32/32 [00:00<00:00, 186.87it/s]


5 
	 - Train loss: 3.5044 Train acc: 56.8146 Val loss: 3.4546 Val acc:57.0144


100%|██████████| 125/125 [00:03<00:00, 37.37it/s]
100%|██████████| 32/32 [00:00<00:00, 130.20it/s]


6 
	 - Train loss: 3.3280 Train acc: 57.4771 Val loss: 3.2856 Val acc:57.7198


100%|██████████| 125/125 [00:03<00:00, 35.81it/s]
100%|██████████| 32/32 [00:00<00:00, 124.92it/s]


7 
	 - Train loss: 3.1624 Train acc: 58.3108 Val loss: 3.1285 Val acc:58.5523


100%|██████████| 125/125 [00:03<00:00, 36.02it/s]
100%|██████████| 32/32 [00:00<00:00, 123.43it/s]


8 
	 - Train loss: 3.0049 Train acc: 59.1914 Val loss: 2.9798 Val acc:59.3989


100%|██████████| 125/125 [00:03<00:00, 35.82it/s]
100%|██████████| 32/32 [00:00<00:00, 109.07it/s]

9 
	 - Train loss: 2.8568 Train acc: 59.9872 Val loss: 2.8461 Val acc:60.1314





([4.653419181823731,
  4.373093809127807,
  4.1267561454772945,
  3.9036430168151854,
  3.6969664497375487,
  3.504398693084717,
  3.327981958389282,
  3.162409549713135,
  3.004857526779175,
  2.8567676315307615],
 [53.64733266830444,
  54.081658935546876,
  54.70304570198059,
  55.420314407348634,
  56.05323257446289,
  56.81460165977478,
  57.477077102661134,
  58.31084189414978,
  59.19135947227478,
  59.98721714019776],
 [4.516626849770546,
  4.258542448282242,
  4.033826999366283,
  3.820789195597172,
  3.6315032243728638,
  3.4545966908335686,
  3.2856349423527718,
  3.128538556396961,
  2.9797911643981934,
  2.8460602909326553],
 [53.815398551523685,
  54.27271742373705,
  54.979898408055305,
  55.68336043506861,
  56.290532648563385,
  57.01443590223789,
  57.71983675658703,
  58.55233073234558,
  59.39887557178736,
  60.131360962986946])