# Dataset and Huggingface submission

In this tp8_submission notebook: Aymane El Hichami, Daniel Maksimov, Paul Malige

- we vizualise the SNLI dataset.
- implement a DistilBert Sequence Classification architecture


## 1/ Dataset Visualisation

First we load the Standford NLI corpus and experiment with it.

We can use the Huggingface online viewer for datasets: https://huggingface.co/datasets/viewer/?dataset=snli


In [2]:
from datasets import load_dataset
snli = load_dataset("snli")
#Removing sentence pairs with no label (-1)
snli = snli.filter(lambda example: example['label'] != -1) 

Reusing dataset snli (/usr/users/gpusdi1/gpusdi1_40/.cache/huggingface/datasets/snli/plain_text/1.0.0/1f60b67533b65ae0275561ff7828aad5ee4282d0e6f844fd148d05d3c6ea251b)


  0%|          | 0/3 [00:00<?, ?it/s]

Loading cached processed dataset at /usr/users/gpusdi1/gpusdi1_40/.cache/huggingface/datasets/snli/plain_text/1.0.0/1f60b67533b65ae0275561ff7828aad5ee4282d0e6f844fd148d05d3c6ea251b/cache-69916534c8046ad9.arrow
Loading cached processed dataset at /usr/users/gpusdi1/gpusdi1_40/.cache/huggingface/datasets/snli/plain_text/1.0.0/1f60b67533b65ae0275561ff7828aad5ee4282d0e6f844fd148d05d3c6ea251b/cache-3eb05855f5b61765.arrow
Loading cached processed dataset at /usr/users/gpusdi1/gpusdi1_40/.cache/huggingface/datasets/snli/plain_text/1.0.0/1f60b67533b65ae0275561ff7828aad5ee4282d0e6f844fd148d05d3c6ea251b/cache-48d9b7dcd4658c32.arrow


The snli object is a dictionary containing three elements: the train, valid and tests datasets.

In [3]:
print(snli)

# we reduce the datasets length for troubleshooting:
train = snli["train"] # .select([i for i in range(100000)])
valid = snli["validation"] # .select([i for i in range(9000)])
test = snli["test"] # .select([i for i in range(1000)])

DatasetDict({
    test: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 9824
    })
    train: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 549367
    })
    validation: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 9842
    })
})


In [4]:
print("train features are: {}".format(train.features))

train features are: {'premise': Value(dtype='string', id=None), 'hypothesis': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=3, names=['entailment', 'neutral', 'contradiction'], names_file=None, id=None)}


In [5]:
print("the first element of the train dataset is:", "\n")
print(train[0], "\n")

the first element of the train dataset is: 

{'premise': 'A person on a horse jumps over a broken down airplane.', 'hypothesis': 'A person is training his horse for a competition.', 'label': 1} 



## 2/ Tokenizer

**The tokenizers of the Huggingface Library:** https://huggingface.co/transformers/preprocessing.html

In this section we use pretrained *DistilBertTokenizer* (DistilBERT only uses token embeddings).

In [7]:
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased', num_labels=3)

### 2.1/ Tokenizing a pair of sentences

https://huggingface.co/transformers/preprocessing.html#preprocessing-pairs-of-sentences

To encode a pair of sentences in the format expected by your model, supply the two sentences as two distinct arguments (not a list since a list of two sentences will be interpreted as a batch of two single sentences).

In [8]:
tokenized = tokenizer(train[0]["hypothesis"], train[0]["premise"])
print(tokenized)

{'input_ids': [101, 1037, 2711, 2003, 2731, 2010, 3586, 2005, 1037, 2971, 1012, 102, 1037, 2711, 2006, 1037, 3586, 14523, 2058, 1037, 3714, 2091, 13297, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [9]:
print(tokenizer.decode(tokenized['input_ids']), "\n")

i=10
print("id {} encodes {}".format(i,tokenizer.decode(tokenized['input_ids'][i])))

[CLS] a person is training his horse for a competition. [SEP] a person on a horse jumps over a broken down airplane. [SEP] 

id 10 encodes .


### 2.2/ Tokenizing the whole dataset

**The Map method with *batched=true*:**

https://huggingface.co/docs/datasets/processing.html#processing-data-in-batches
https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map

In [10]:
batch_size = 32

def encode(sample):
    return tokenizer(sample["hypothesis"], sample["premise"], truncation=False, padding='longest')

tokenized_train = train.map(encode, batched=True, batch_size=batch_size)
tokenized_valid = valid.map(encode, batched=True, batch_size=batch_size)
tokenized_test = test.map(encode, batched=True, batch_size=batch_size)

  0%|          | 0/17168 [00:00<?, ?ba/s]

  0%|          | 0/308 [00:00<?, ?ba/s]

  0%|          | 0/307 [00:00<?, ?ba/s]

## 3/ Formatting the dataset

**To be able to train our model with this dataset and PyTorch, we will need to do three modifications:**

- rename our columns to match the names expected by the forward function of the DistilBERT model

- apply the .set_format method to transformation dataset items, here into tensors

- filter the columns to return only the subset of the columns that we need for our model inputs (input_ids,labels and attention_mask).

In [11]:
def sample_modification(sample):
    sample['labels'] = sample.pop('label') # changes the key name from label to labels
    return(sample)

tokenized_train = tokenized_train.map(sample_modification, batched=True, batch_size=batch_size)
tokenized_valid = tokenized_valid.map(sample_modification, batched=True, batch_size=batch_size)
tokenized_test = tokenized_test.map(sample_modification, batched=True, batch_size=batch_size)

  0%|          | 0/17168 [00:00<?, ?ba/s]

  0%|          | 0/308 [00:00<?, ?ba/s]

  0%|          | 0/307 [00:00<?, ?ba/s]

In [13]:
import torch

tokenized_train.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
tokenized_valid.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
tokenized_test.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

train_dataloader = torch.utils.data.DataLoader(tokenized_train, batch_size=batch_size)
valid_dataloader = torch.utils.data.DataLoader(tokenized_valid, batch_size=batch_size)
test_dataloader = torch.utils.data.DataLoader(tokenized_test, batch_size=batch_size)

next(iter(train_dataloader))

{'input_ids': tensor([[ 101, 1037, 2711,  ...,    0,    0,    0],
         [ 101, 1037, 2711,  ...,    0,    0,    0],
         [ 101, 1037, 2711,  ...,    0,    0,    0],
         ...,
         [ 101, 2048, 2308,  ...,    0,    0,    0],
         [ 101, 1037, 2136,  ...,    0,    0,    0],
         [ 101, 1037, 2136,  ...,    0,    0,    0]]),
 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]),
 'labels': tensor([1, 2, 0, 1, 0, 2, 2, 0, 1, 1, 2, 1, 1, 2, 0, 1, 2, 0, 0, 2, 1, 1, 2, 0,
         2, 0, 1, 1, 2, 0, 1, 0])}

## 4/ Models

In this section we define our DistilBert models.

https://huggingface.co/transformers/model_doc/distilbert.html

https://huggingface.co/transformers/model_doc/distilbert.html#transformers.DistilBertConfig


### 4.1/ DistilBERT for SequenceClassification

Blog about DistilBERT and Distillation https://medium.com/huggingface/distilbert-8cf3380435b5

In [14]:
from transformers import DistilBertForSequenceClassification

model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=3)

print(model.config)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.17.0",
  "vocab_size": 30522
}



#### 4.2.1/ Characteristics of the DistilBERT for SequenceClassification output

In [15]:
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt") # specify that tensor type is pytorch
print("model inputs is: \n{}".format(inputs), "\n")

outputs = model(**inputs)
print("model ouputs is: \n{}".format(outputs), "\n")

model inputs is: 
{'input_ids': tensor([[  101,  7592,  1010,  2026,  3899,  2003, 10140,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])} 

model ouputs is: 
SequenceClassifierOutput(loss=None, logits=tensor([[-0.0892,  0.0350,  0.0604]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None) 



## 5/ Training

In this section, we train our models.

**cuda version check**

In [16]:
print(torch.cuda.is_available())
print(torch.__version__)
print(torch.version.cuda)

True
1.9.1+cu111
11.1


**the loss function**

When we call a classification model with the labels argument, the first returned element is the **Cross Entropy loss** between the predictions and the passed labels.

In [17]:
from torch.nn import functional as F

def my_loss_fn(logits, labels):
    return F.cross_entropy(logits, labels)

**freezing parameters for the base layers**

when using the BERT base model with a custom head, we freeze the base parameters.
https://huggingface.co/transformers/training.html#freezing-the-encoder

In [None]:
# for param in model.base_model.parameters():
#     param.requires_grad = False

### 5.1/ metrics

We use our own accuracy and f_score for display only. Learning is done with the model loss.

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html

In [18]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import torch.nn.functional as F
import statistics
import numpy as np

def compute_metrics(loss, logits, labels):
    
    my_loss = loss.clone().detach()
    my_logits = logits.clone().detach()
    my_labels = labels.clone().detach()
    
    preds = torch.argmax(F.softmax(logits, dim=-1), dim=1) # computes preds from logits
    
    # before converting to numpy, tensor have to be copied locally on the cpu:
    my_loss = my_loss.cpu().numpy()
    my_labels = my_labels.cpu().numpy()
    preds = preds.cpu().numpy()
    
    loss = my_loss.item()
    precision, recall, f1, _ = precision_recall_fscore_support(my_labels, preds, average='macro', zero_division=0)
    acc = accuracy_score(my_labels, preds)
    
    return {'loss': loss, 'accuracy': acc, 'f1': f1, 'precision': precision, 'recall': recall}

def update_metrics(metrics:dict, batch_metrics:dict):
    for k,v in metrics.items():
        metrics[k].append(batch_metrics[k])

def display_metrics(metrics:dict, epoch, epochs, text):
    loss = statistics.mean(metrics['loss'])
    acc = statistics.mean(metrics['accuracy'])
    f1 = statistics.mean(metrics['f1'])
    print(f'{text} : Epoch [{epoch+1}/{epochs}], Loss [{loss:.4f}], Accuracy [{acc:.4f}], F1_score [{f1:.4f}]')
    

### 5.2/ training with pytorch, iterating over the dataset

The model is trained using an optimizer and a scheduler.

In [None]:
from tqdm import tqdm
from transformers import get_linear_schedule_with_warmup

device = 'cuda'
epochs = 3
num_warmup_steps = 5
num_train_steps = 2

model.to(device)
optimizer = torch.optim.AdamW(params=model.parameters(), lr=1e-5)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_train_steps)

for epoch in range(epochs):
    
    # 1/ Training
    model.train()
    my_metrics = {'loss': [], 'accuracy': [], 'f1': [],  'precision': [], 'recall': []}

    for batch in tqdm(train_dataloader):
        
        batch = {k: v.to(device) for k, v in batch.items()}
        
        # forward propagate the batch:
        outputs = model(**batch)
        
        # compute the loss:
        # my_loss = my_loss_fn(outputs.logits, batch['labels'])
        
        loss = outputs.loss
        
        # back propagate the loss:
        loss.backward()
        
        batch_metrics = compute_metrics(loss, outputs.logits, batch['labels'])
        update_metrics(my_metrics, batch_metrics)
        
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        
    display_metrics(my_metrics, epoch, epochs, "Training")
    
    # 2/ Evaluation
    model.eval()
    my_metrics = {'loss': [], 'accuracy': [], 'f1': [],  'precision': [], 'recall': []}
    
    with torch.no_grad():
        for batch in tqdm(valid_dataloader):

            batch = {k: v.to(device) for k, v in batch.items()}

            # forward propagate the batch:
            outputs = model(**batch)

            # compute the loss:
            # my_loss = my_loss_fn(outputs.logits, batch['labels'])

            loss = outputs.loss

            batch_metrics = compute_metrics(loss, outputs.logits, batch['labels'])
            update_metrics(my_metrics, batch_metrics)
        
    display_metrics(my_metrics, epoch, epochs, "Validation")
    
    

 76%|███████████████████████████▉         | 12962/17168 [37:20<12:23,  5.66it/s]

## 6/ Using our model

Build a function to play with our model capabilities.

In [None]:
from datasets import Dataset

def my_inferencing(hypothesis:str, premise:str): 
    input_dict = {'hypothesis': [hypothesis], 'premise': [premise]}
    dataset = Dataset.from_dict(input_dict)
    tokenized_dataset = dataset.map(encode, batched=True, batch_size=1)
    tokenized_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask']) # no label column
    dataloader = torch.utils.data.DataLoader(tokenized_dataset, batch_size=1)
    
    model.eval()
    with torch.no_grad():
        # propagate only batch through model
        batch = next(iter(dataloader))
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)

        logits = outputs.logits
        probs = F.softmax(logits, dim=-1)
        pred = torch.argmax(probs, dim=1)

        my_probs = probs.cpu().numpy()[0]
        my_pred = pred.cpu().numpy().item()

    idx_to_labels = {0: "entails", 1: "is neutral regarding", 2: "contradicts"}
    res = f"The premise ['{premise}'] {idx_to_labels[my_pred]} the hypothesis ['{hypothesis}']."
    
    return res, my_probs

We now test our model a manually chosen pair of sentences:

In [None]:
result, probs = my_inferencing("I am tall", "I have long legs")
print(result)
print(probs)

HBox(children=(IntProgress(value=0, max=1), HTML(value='')))


The premise ['I have long legs'] entails the hypothesis ['I am tall'].
[0.36271402 0.28665614 0.35062984]
