# Homework

## 3. Implementation

### 3.1. Data Processing

1. The first cells of the notebook are the same as in the lab session (TP) on text convolution. Apply the same preprocessing, with the same tokenizer, to get a dataset with a train and a validation split and two columns: review_ids (list of int) and label (int).

**ANSWER**: We reuse the preprocessing from the lab session (TP) on text convolution.

In [1]:
import numpy as np
import torch
import torch.nn.functional as F
import torch.nn as nn
import math
from torch.utils.data import DataLoader
from tabulate import tabulate
from datasets import load_dataset

from tqdm.notebook import tqdm
from transformers import BertTokenizer




In [2]:
print("PyTorch version:", torch.__version__)
torch.device("cuda" if torch.cuda.is_available() else "cpu")

PyTorch version: 2.2.0+cu121


device(type='cuda')

In [3]:
dataset = load_dataset("scikit-learn/imdb", split="train")
print(dataset)

Dataset({
    features: ['review', 'sentiment'],
    num_rows: 50000
})


In [4]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)

In [5]:
print("Type of the vocabulary:", type(tokenizer.vocab))
VOCSIZE = len(tokenizer.vocab)
print("Length of the vocabulary:", VOCSIZE)
print(str(tokenizer.vocab)[:50])

Type of the vocabulary: <class 'collections.OrderedDict'>
Length of the vocabulary: 30522
OrderedDict({'[PAD]': 0, '[unused0]': 1, '[unused1


In [6]:
def preprocessing_fn(x, tokenizer):
    x["review_ids"] = tokenizer(
        x["review"],
        add_special_tokens=False,
        truncation=True,
        max_length=256,
        padding=False,
        return_attention_mask=False,
    )["input_ids"]
    x["label"] = 0 if x["sentiment"] == "negative" else 1
    return x

In [7]:
preprocessing_fn(dataset[19], tokenizer)

{'review': "An awful film! It must have been up against some real stinkers to be nominated for the Golden Globe. They've taken the story of the first famous female Renaissance painter and mangled it beyond recognition. My complaint is not that they've taken liberties with the facts; if the story were good, that would perfectly fine. But it's simply bizarre -- by all accounts the true story of this artist would have made for a far better film, so why did they come up with this dishwater-dull script? I suppose there weren't enough naked people in the factual version. It's hurriedly capped off in the end with a summary of the artist's life -- we could have saved ourselves a couple of hours if they'd favored the rest of the film with same brevity.",
 'sentiment': 'negative',
 'review_ids': [2019,
  9643,
  2143,
  999,
  2009,
  2442,
  2031,
  2042,
  2039,
  2114,
  2070,
  2613,
  27136,
  2545,
  2000,
  2022,
  4222,
  2005,
  1996,
  3585,
  7595,
  1012,
  2027,
  1005,
  2310,
  25

In [8]:
n_samples = 5000  # number of examples to sample (train + validation)

# Shuffle the data first
dataset = dataset.shuffle(seed=0)

# Select 5000 samples
sampled_dataset = dataset.select(range(n_samples))

# Tokenize the dataset
sampled_dataset = sampled_dataset.map(
    preprocessing_fn, fn_kwargs={"tokenizer" : tokenizer}
)

In [9]:
# Keep only the columns we need
sampled_dataset = sampled_dataset.select_columns(['review_ids','label'])

# Split into train and validation sets
splitted_dataset = sampled_dataset.train_test_split(test_size=0.2)

train_set = splitted_dataset['train']
valid_set = splitted_dataset['test']

In [14]:
train_set[0]

{'review_ids': [2023,
  3185,
  2001,
  2200,
  2200,
  19960,
  3695,
  16748,
  1998,
  2200,
  2200,
  2175,
  2854,
  1012,
  3071,
  2187,
  2037,
  3772,
  8220,
  2012,
  2188,
  1998,
  6135,
  9471,
  2129,
  2000,
  2552,
  1045,
  2812,
  2009,
  2001,
  2061,
  2919,
  1998,
  2018,
  2053,
  2613,
  5436,
  1998,
  11793,
  2545,
  2071,
  2031,
  2517,
  1037,
  2488,
  2466,
  5436,
  3524,
  2054,
  2466,
  5436,
  1012,
  2025,
  2012,
  2035,
  12459,
  999],
 'label': 0}

2. Write a function extract_words_contexts. It should retrieve all valid pairs $(w, C^+)$ from a list of ids representing a text document. It takes the radius $R$ as an argument. Its output is therefore two lists:

In [16]:
tokenizer.pad_token_id

0

To make sure that every context $C^+$ has the same size ($2R$ ids), we pad the beginning and the end of the document with the [PAD] id (0). For example, the first word of the document gets $R$ padding ids in place of the $R$ tokens that would precede it.
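
As a small worked illustration of this padding scheme, here is a sketch on made-up ids. The helper name `context_windows`, its `pad_id` parameter and the toy input are assumptions for illustration only; they are not part of the homework code.

```python
def context_windows(token_ids, R, pad_id=0):
    """Return, for each token, the R ids before it and the R ids after it."""
    # Pad both ends so that border tokens also get exactly 2R context ids.
    padded = [pad_id] * R + list(token_ids) + [pad_id] * R
    windows = []
    for i in range(len(token_ids)):
        # Token i sits at index i + R in `padded`.
        left = padded[i : i + R]
        right = padded[i + R + 1 : i + 2 * R + 1]
        windows.append(left + right)
    return windows


toy_ids = [11, 12, 13, 14]  # made-up token ids
windows = context_windows(toy_ids, R=1)
print(windows)  # [[0, 12], [11, 13], [12, 14], [13, 0]]
```

The first and last windows contain the padding id 0, which matches what the padding rule predicts for border tokens.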

In [72]:
def extract_words_contexts(sample, R):
    """Return the token ids of a review and, for each token, its 2R context ids."""
    token_ids = sample["review_ids"]
    # Pad both ends with R [PAD] ids (0) so that every context has exactly 2R ids.
    padded = [0] * R + token_ids + [0] * R
    contexts = []
    for i in range(len(token_ids)):
        # Token i sits at index i + R in `padded`.
        left = padded[i : i + R]
        right = padded[i + R + 1 : i + 2 * R + 1]
        contexts.append(left + right)
    return token_ids, contexts

toto, tata = extract_words_contexts(train_set[0], 2)

In [74]:
tata

[[0, 0, 15373, 2006],
 [0, 1000, 2006, 2148],
 [1000, 15373, 2148, 2395],
 [15373, 2006, 2395, 1000],
 [2006, 2148, 1000, 2003],
 [2148, 2395, 2003, 1037],
 [2395, 1000, 1037, 2152],
 [1000, 2003, 2152, 3177],
 [2003, 1037, 3177, 3689],
 [1037, 2152, 3689, 2055],
 [2152, 3177, 2055, 1037],
 [3177, 3689, 1037, 2235],
 [3689, 2055, 2235, 2051],
 [2055, 1037, 2051, 4735],
 [1037, 2235, 4735, 2040],
 [2235, 2051, 2040, 3402],
 [2051, 4735, 3402, 4858],
 [4735, 2040, 4858, 2370],
 [2040, 3402, 2370, 7861],
 [3402, 4858, 7861, 12618],
 [4858, 2370, 12618, 18450],
 [2370, 7861, 18450, 1999],
 [7861, 12618, 1999, 1996],
 [12618, 18450, 1996, 3450],
 [18450, 1999, 3450, 1997],
 [1999, 1996, 1997, 1037],
 [1996, 3450, 1037, 2177],
 [3450, 1997, 2177, 1997],
 [1997, 1037, 1997, 13009],
 [1037, 2177, 13009, 1012],
 [2177, 1997, 1012, 1996],
 [1997, 13009, 1996, 2895],
 [13009, 1012, 2895, 2003],
 [1012, 1996, 2003, 3591],
 [1996, 2895, 3591, 1999],
 [2895, 2003, 1999, 1037],
 [2003, 3591, 1037, 22

In [None]:
def flatten_dataset_to_list():
    pass
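
The stub above has no body in the source, so its intended behaviour is not known. As one possible reading (an assumption, not the author's implementation), the sketch below flattens several documents into two parallel lists, one of words and one of contexts. The helper `flatten_pairs`, its parameters and the toy samples are hypothetical names introduced for illustration.

```python
def flatten_pairs(samples, R, pad_id=0):
    """Hypothetical helper: flatten documents into parallel word/context lists."""
    words, contexts = [], []
    for sample in samples:
        ids = sample["review_ids"]
        # Same padding scheme as above: R pad ids on each side of the document.
        padded = [pad_id] * R + list(ids) + [pad_id] * R
        for i, word in enumerate(ids):
            words.append(word)
            contexts.append(padded[i : i + R] + padded[i + R + 1 : i + 2 * R + 1])
    return words, contexts


# Made-up samples with the same 'review_ids' key as the dataset rows.
toy_samples = [{"review_ids": [5, 6, 7]}, {"review_ids": [8, 9]}]
flat_words, flat_contexts = flatten_pairs(toy_samples, R=1)
print(flat_words)     # [5, 6, 7, 8, 9]
print(flat_contexts)  # [[0, 6], [5, 7], [6, 0], [0, 9], [8, 0]]
```

Padding is applied per document, so contexts never cross a review boundary.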