# Homework

## 3. Implementation

### 3.1. Data Processing

1. The first cells of the notebook are the same as in the TP on text convolution. Apply the same preprocessing to get a dataset (with the same tokenizer) with a train and a validation split, with two columns review_ids (list of int) and label (int).

**ANSWER** : We reuse the preprocessing from the TP on text convolution.

In [1]:
import numpy as np
import torch
import torch.nn.functional as F
import torch.nn as nn
import math
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from tabulate import tabulate
from datasets import load_dataset

from tqdm.notebook import tqdm
from transformers import BertTokenizer

import functools

In [2]:
print("PyTorch version:", torch.__version__)
torch.device("cuda" if torch.cuda.is_available() else "cpu")

PyTorch version: 2.2.0+cu121


device(type='cuda')

In [3]:
dataset = load_dataset("scikit-learn/imdb", split="train")
print(dataset)

Dataset({
    features: ['review', 'sentiment'],
    num_rows: 50000
})


In [4]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)

In [5]:
print("Type of the tokenizer:", type(tokenizer.vocab))
VOCSIZE = len(tokenizer.vocab)
print("Length of the vocabulary:", VOCSIZE)
print(str(tokenizer.vocab)[:50])

Type of the tokenizer: <class 'collections.OrderedDict'>
Length of the vocabulary: 30522
OrderedDict({'[PAD]': 0, '[unused0]': 1, '[unused1


In [6]:
def preprocessing_fn(x, tokenizer):
    x["review_ids"] = tokenizer(
        x["review"],
        add_special_tokens=False,
        truncation=True,
        max_length=256,
        padding='max_length',
        return_attention_mask=False,
    )["input_ids"]
    x["label"] = 0 if x["sentiment"] == "negative" else 1
    return x

In [7]:
n_samples = 5000  # number of examples to sample (before the train/validation split)

# We first shuffle the data !
dataset = dataset.shuffle(seed=0)

# Select 5000 samples
sampled_dataset = dataset.select(range(n_samples))

# Tokenize the dataset
sampled_dataset = sampled_dataset.map(
    preprocessing_fn, fn_kwargs={"tokenizer" : tokenizer}
)

Map: 100%|██████████| 5000/5000 [00:14<00:00, 352.39 examples/s]


In [8]:
# Remove useless columns
sampled_dataset = sampled_dataset.select_columns(['review_ids','label'])

# Split the train and validation
splitted_dataset = sampled_dataset.train_test_split(test_size=0.2)

document_train_set = splitted_dataset['train']
document_valid_set = splitted_dataset['test']

2. Write a function extract_words_contexts. It should retrieve all pairs of valid $(w, C^+)$ from a list of ids representing a text document. It takes the radius $R$ as an argument. Its output is therefore two lists: the word ids $w$ and the corresponding positive contexts $C^+$.

**ANSWER** : To make sure that every $C^+$ has the same size ($2R$ ids), we pad the beginning and the end of the document with $R$ padding ids. For example, the first word of the document gets $R$ padding ids in place of the $R$ tokens that would normally precede it. An alternative would be to pad with the id of the word itself, so that the padded positions have a high dot product with the word.
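
The alternative mentioned above (filling missing neighbours with the word's own id instead of the padding id) is not what the notebook implements. A minimal sketch of how it could look, assuming a hypothetical helper named `contexts_with_self_padding` and a toy list of ids:

```python
def contexts_with_self_padding(token_ids, R):
    """Hypothetical variant: out-of-range neighbours are replaced by the centre word id."""
    n = len(token_ids)
    # Offsets -R..-1 (left context) followed by 1..R (right context)
    offsets = list(range(-R, 0)) + list(range(1, R + 1))
    contexts = []
    for i, w in enumerate(token_ids):
        context = []
        for offset in offsets:
            j = i + offset
            # Use the real neighbour when it exists, otherwise the word itself
            context.append(token_ids[j] if 0 <= j < n else w)
        contexts.append(context)
    return contexts


toy_ids = [11, 12, 13, 14]  # toy ids, not real vocabulary entries
toy_contexts = contexts_with_self_padding(toy_ids, R=2)
print(toy_contexts)
# [[11, 11, 12, 13], [12, 11, 13, 14], [11, 12, 14, 13], [12, 13, 14, 14]]
```

Every context still has exactly $2R$ ids, so the result can be stacked into a tensor in the same way as with zero padding.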

In [9]:
tokenizer.pad_token_id

0

In [10]:
def extract_words_contexts(sample, R):
    token_ids = sample["review_ids"]
    n_tokens = len(token_ids)
    # Pad both ends with R padding ids (0, see tokenizer.pad_token_id)
    # so that every context contains exactly 2R ids
    token_ids_with_padding = [0]*R + token_ids + [0]*R
    positive_context = []
    for i in range(n_tokens):
        # Token i sits at index i+R in the padded list
        left = token_ids_with_padding[i:i+R]
        right = token_ids_with_padding[i+R+1:i+2*R+1]
        positive_context.append(left + right)
    return token_ids, positive_context

In [11]:
toto, test = extract_words_contexts(document_train_set[2], 3)

In [18]:
print("First 5 tokens :", toto[:5])
print("C+ of the first 5 tokens :")
test[:5]

First 5 tokens : [2004, 1037, 11798, 5470, 1010]
C+ of the first 5 tokens :


[[0, 0, 0, 1037, 11798, 5470],
 [0, 0, 2004, 11798, 5470, 1010],
 [0, 2004, 1037, 5470, 1010, 1045],
 [2004, 1037, 11798, 1010, 1045, 2428],
 [1037, 11798, 5470, 1045, 2428, 2293]]

3. Write a function flatten_dataset_to_list that applies the function extract_words_contexts on a whole dataset.

In [19]:
def flatten_dataset_to_list(dataset, R):
    token_ids = []
    positive_contexts = []
    for sample in dataset:
        sample_token_ids, positive_context = extract_words_contexts(sample, R)
        token_ids.append(sample_token_ids)
        positive_contexts.append(positive_context)
    return token_ids, positive_contexts
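
Note that `flatten_dataset_to_list` uses `append`, so it returns one list per document rather than a single flat list of pairs. If one entry per $(w, C^+)$ pair is wanted instead, `extend` could be used. A minimal sketch, assuming a hypothetical helper named `flatten_pairs` and a toy dataset; `extract_words_contexts` is restated compactly (same behaviour) only so that the block runs on its own:

```python
def extract_words_contexts(sample, R):
    # Compact restatement of the function defined earlier (zero padding on both sides)
    token_ids = sample["review_ids"]
    padded = [0] * R + token_ids + [0] * R
    contexts = [
        padded[i:i + R] + padded[i + R + 1:i + 2 * R + 1]
        for i in range(len(token_ids))
    ]
    return token_ids, contexts


def flatten_pairs(dataset, R):
    """Hypothetical flat variant: one entry per (word, context) pair."""
    words, contexts = [], []
    for sample in dataset:
        sample_words, sample_contexts = extract_words_contexts(sample, R)
        words.extend(sample_words)        # extend, not append: no nesting per document
        contexts.extend(sample_contexts)
    return words, contexts


# Toy documents standing in for document_train_set
toy_dataset = [{"review_ids": [5, 6, 7]}, {"review_ids": [8, 9]}]
flat_words, flat_contexts = flatten_pairs(toy_dataset, R=1)
print(flat_words)     # [5, 6, 7, 8, 9]
print(flat_contexts)  # [[0, 6], [5, 7], [6, 0], [0, 9], [8, 0]]
```

With this layout, the length of the dataset would be the number of tokens rather than the number of documents, and each item would hold a single word id with its $2R$ context ids.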

4. Apply the function to your initial document_train_set and document_valid_set, and get the corresponding flattened lists.

In [20]:
R = 2
token_ids, positive_contexts = flatten_dataset_to_list(document_train_set, R)

5. Wrap these lists in two valid PyTorch Datasets, like in HW 1, and call them train_set and valid_set.

In [65]:
class CustomDataset(Dataset):

    def __init__(self, document_set, R):
        self.document_set = document_set
        token_ids, positive_contexts = flatten_dataset_to_list(document_set, R)
        self.token_ids = torch.tensor(token_ids)
        self.positive_contexts = torch.tensor(positive_contexts)

    def __len__(self):
        return len(self.token_ids)

    def __getitem__(self, idx):
        # Returns a dictionary of tensors (idx can be an int or a slice)
        return {
            "word_id" : self.token_ids[idx], 
            "positive_context_ids" : self.positive_contexts[idx],
            "label" : torch.tensor(self.document_set[idx]["label"])
        }

In [66]:
train_set = CustomDataset(document_train_set, R)
valid_set = CustomDataset(document_valid_set, R)

In [67]:
len(valid_set), len(train_set)

(1000, 4000)

In [68]:
valid_set[951], train_set[1345:1347]

({'word_id': tensor([ 2074,  2234,  2067,  2013,  1996,  2034,  4760,  1997,  3937, 12753,
           1016,  1012,  1045,  2001,  2183,  2046,  2009,  3241,  2009,  2052,
           2022, 10231,  7685,  2241,  2006, 19236,  4401,  1998,  1045,  2001,
          27726,  4527,   999,  2065,  2017,  4669,  1996,  2434,  3937, 12753,
           1045,  2228,  2017,  2097,  5959,  1001,  1016,  2074,  2004,  2172,
           2065,  2025,  2062,  1012,  2307,  2466,  2008,  2467,  7906,  2017,
           6603,  1998,  3241,  1012,  1996,  2189,  2003, 21688,  1010, 16360,
           6935,  2075,  1996,  2434,  1005,  1055,  4323,  1012,  2123,  1005,
           1056,  2175,  8074,  2914,  2400,  3430,  1010,  2175,  2000,  2156,
           2009,  2005, 20195,  1998,  4569,  1012,  2008,  1005,  1055,  2054,
           5691,  2024,  2881,  2005,  1011,  1011,  9686, 17695,  2964,  1012,
           1045,  2064,  1005,  1056,  2228,  1997,  1037,  2488,  2126,  2000,
           4019,  2084,  2000

In [69]:
train_set[:5].keys()

dict_keys(['word_id', 'positive_context_ids', 'label'])

In [71]:
train_set[:5]

{'word_id': tensor([[ 2024,  2017,  5220,  ...,  1000,  2190,  1997],
         [ 2023,  2003,  1996,  ...,  2123,  1005,  1056],
         [ 2004,  1037, 11798,  ...,     0,     0,     0],
         [ 2023,  3319,  3397,  ...,  2006,  1037, 13359],
         [ 2045,  2024,  2335,  ...,  2066,  1000,  8840]]),
 'positive_context_ids': tensor([[[    0,     0,  2017,  5220],
          [    0,  2024,  5220,  2007],
          [ 2024,  2017,  2007,  4145],
          ...,
          [ 2061,  2116,  2190,  1997],
          [ 2116,  1000,  1997,     0],
          [ 1000,  2190,     0,     0]],
 
         [[    0,     0,  2003,  1996],
          [    0,  2023,  1996,  2309],
          [ 2023,  2003,  2309,  4602],
          ...,
          [ 1012,  2339,  1005,  1056],
          [ 2339,  2123,  1056,     0],
          [ 2123,  1005,     0,     0]],
 
         [[    0,     0,  1037, 11798],
          [    0,  2004, 11798,  5470],
          [ 2004,  1037,  5470,  1010],
          ...,
          [    0,

6. Write a collate_fn function that adds the negative context to the batch. It should be parametrized by the scaling factor K.

In [138]:
def collate_fn(batch, R, K, VOCSIZE):
    ''' batch is a list of dictionaries with keys "word_id", "positive_context_ids" and "label", whose values are tensors.
    The output is a dictionary with keys :
    - "word_id", which contains all the token ids of every review in the batch. It is a tensor of shape (batch_size, n_tokens=256)
    - "positive_context_ids", which contains the positive context of all tokens of every review in the batch.
      It is a tensor of shape (batch_size, n_tokens, 2R)
    - "negative_context_ids", same thing for the negative context. It is a tensor of shape (batch_size, n_tokens, 2RK)
    - "labels", the label of every review in the batch. It is a tensor of shape (batch_size,)

    '''
    batch_size = len(batch)
    n_tokens = len(batch[0]["word_id"])
    result = dict()
    result["word_id"] = torch.stack([review["word_id"] for review in batch])
    result["positive_context_ids"] = torch.stack([review["positive_context_ids"] for review in batch])
    result["labels"] = torch.stack([review["label"] for review in batch])
    # Sample 2RK token ids uniformly from the vocabulary for each token of each review,
    # then reshape the result and convert it to a tensor
    result["negative_context_ids"] = torch.tensor(
        np.random.choice(np.arange(VOCSIZE), 2*R*K*n_tokens*batch_size, replace=True)
            .reshape(batch_size, n_tokens, 2*R*K)
    )
    return result
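
The negative context above is drawn with NumPy and then converted to a tensor. The same uniform sampling can probably be done directly in PyTorch with `torch.randint`. A minimal sketch, assuming a hypothetical helper named `sample_negative_context` (the optional `generator` argument is only there to make the toy run reproducible):

```python
import torch


def sample_negative_context(batch_size, n_tokens, R, K, vocsize, generator=None):
    """Hypothetical helper: 2RK ids drawn uniformly from [0, vocsize) for every token."""
    return torch.randint(
        low=0,
        high=vocsize,  # upper bound is exclusive
        size=(batch_size, n_tokens, 2 * R * K),
        generator=generator,
    )


g = torch.Generator().manual_seed(0)
neg = sample_negative_context(batch_size=4, n_tokens=6, R=2, K=2, vocsize=30522, generator=g)
print(neg.shape)  # torch.Size([4, 6, 8])
print(neg.dtype)  # torch.int64
```

As in `collate_fn`, the sampling is uniform over the whole vocabulary, so a negative id can occasionally coincide with the word itself, with one of its positive context ids, or with the padding id.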

7. Wrap everything in a DataLoader, like in HW 1.

In [139]:
batch_size = 32
R = 2
K = 2
collate_fn_with_params = functools.partial(collate_fn, R=R, K=K, VOCSIZE=VOCSIZE)

train_dataloader = DataLoader(
    train_set, batch_size=batch_size, collate_fn=collate_fn_with_params
)   
valid_dataloader = DataLoader(
    valid_set, batch_size=batch_size, collate_fn=collate_fn_with_params
)
n_valid = len(valid_set)
n_train = len(train_set)
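
The training loader above is created without `shuffle=True`, so the batches come in the same order at every epoch. A sketch of the same wrapping with shuffling enabled, on toy data; `ToyPairs` and `toy_collate` are hypothetical stand-ins for `CustomDataset` and `collate_fn`, not the notebook's code:

```python
import functools

import torch
from torch.utils.data import DataLoader, Dataset


class ToyPairs(Dataset):
    """Hypothetical stand-in: n_items documents of n_tokens ids with all-zero contexts."""

    def __init__(self, n_items, n_tokens, R):
        self.word_id = torch.arange(n_items * n_tokens).reshape(n_items, n_tokens)
        self.context = torch.zeros(n_items, n_tokens, 2 * R, dtype=torch.long)

    def __len__(self):
        return len(self.word_id)

    def __getitem__(self, idx):
        return {"word_id": self.word_id[idx], "positive_context_ids": self.context[idx]}


def toy_collate(batch, R, K, vocsize):
    # Stack the per-document tensors, then add uniformly sampled negative ids
    out = {
        key: torch.stack([item[key] for item in batch])
        for key in ("word_id", "positive_context_ids")
    }
    batch_size, n_tokens = out["word_id"].shape
    out["negative_context_ids"] = torch.randint(0, vocsize, (batch_size, n_tokens, 2 * R * K))
    return out


toy_loader = DataLoader(
    ToyPairs(n_items=10, n_tokens=6, R=2),
    batch_size=4,
    shuffle=True,  # reshuffle the documents at every epoch
    collate_fn=functools.partial(toy_collate, R=2, K=3, vocsize=100),
)
shapes = [tuple(batch["negative_context_ids"].shape) for batch in toy_loader]
print(shapes)  # [(4, 6, 12), (4, 6, 12), (2, 6, 12)]
```

The last batch is smaller because 10 is not a multiple of 4; shuffling changes which documents land in each batch, not the batch shapes. The validation loader can stay unshuffled.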

8. Make 2 or 3 iterations in the DataLoader and print R, K and the shapes of all the tensors in the batches (let the output be visible).

In [148]:
print("R =", R)
print("K =", K)

for i, batch in enumerate(train_dataloader):
    print(f"batch {i} :")
    print(batch.keys())
    for key, value in batch.items():
        print(f"'{key}' shape :", value.shape)
    print("-"*50)
    
    if i >= 2:  # stop after 3 batches
        break

R = 2
K = 2
batch 0 :
dict_keys(['word_id', 'positive_context_ids', 'labels', 'negative_context_ids'])
'word_id' shape : torch.Size([32, 256])
'positive_context_ids' shape : torch.Size([32, 256, 4])
'labels' shape : torch.Size([32])
'negative_context_ids' shape : torch.Size([32, 256, 8])
--------------------------------------------------
batch 1 :
dict_keys(['word_id', 'positive_context_ids', 'labels', 'negative_context_ids'])
'word_id' shape : torch.Size([32, 256])
'positive_context_ids' shape : torch.Size([32, 256, 4])
'labels' shape : torch.Size([32])
'negative_context_ids' shape : torch.Size([32, 256, 8])
--------------------------------------------------
batch 2 :
dict_keys(['word_id', 'positive_context_ids', 'labels', 'negative_context_ids'])
'word_id' shape : torch.Size([32, 256])
'positive_context_ids' shape : torch.Size([32, 256, 4])
'labels' shape : torch.Size([32])
'negative_context_ids' shape : torch.Size([32, 256, 8])
--------------------------------------------------

torch.Size([32, 256])