## NLP LAB06
- Nelson Vicel-Farrah
- Karen Kaspar
- Romain Brand

## 1. Fine-tune the model on the training data

In [1]:
"""
installing the Transformers packages needed for this lab
"""
!pip install datasets evaluate transformers[sentencepiece]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
"""
we use the recommended DistilBERT pre-trained model
"""
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

In [3]:
"""
we load the IMDB dataset, which includes train, test and unsupervised splits
of texts with labels indicating whether the text has a positive or negative connotation
"""
from datasets import load_dataset

raw_datasets = load_dataset("imdb")
raw_datasets




DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [4]:
"""
function that is mapped over all the elements of the dataset in order to tokenize them
    :param example: 
        dictionary containing a batch of dataset items
    :return: 
        the tokenized texts (input_ids and attention_mask), truncated to the model's maximum length
"""
def tokenize_function(example):
    return tokenizer(example["text"], truncation=True)

In [5]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets



DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 50000
    })
})

In [6]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
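
`DataCollatorWithPadding` pads each batch to the length of its longest sequence instead of padding the whole dataset to one fixed length. The idea can be sketched in plain Python as below. This is only an illustration, not the library's implementation: `pad_batch`, the token ids and `pad_id=0` are assumptions made for the example.

```python
def pad_batch(sequences, pad_id=0):
    """Pad lists of token ids to the longest length found in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up tokenized reviews of different lengths
batch = pad_batch([[101, 2023, 102], [101, 2023, 3185, 2003, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the padding length depends on the batch, short reviews grouped together cost less computation than they would with a single global maximum length.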

In [7]:
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")

In [8]:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device

device(type='cuda')

In [9]:
import numpy as np
import evaluate

"""
we use the accuracy in order to evaluate our model
"""

metric = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 25000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 9375
  Number of trainable parameters = 66955010
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.288         | 0.23656         | 0.91108  |
| 2     | 0.1678        | 0.304135        | 0.93012  |
| 3     | 0.0523        | 0.373167        | 0.93092  |


Saving model checkpoint to test-trainer/checkpoint-500
Configuration saved in test-trainer/checkpoint-500/config.json
Model weights saved in test-trainer/checkpoint-500/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-500/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-500/special_tokens_map.json
Saving model checkpoint to test-trainer/checkpoint-1000
Configuration saved in test-trainer/checkpoint-1000/config.json
Model weights saved in test-trainer/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-1000/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to test-trainer/checkpoint-1500
Configuration saved in test-trainer/checkpoint-1500/config.json
Model weights saved in test-trainer/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-1500/tokenizer_config.json
Special tokens file saved

TrainOutput(global_step=9375, training_loss=0.17382249572753905, metrics={'train_runtime': 4825.7096, 'train_samples_per_second': 15.542, 'train_steps_per_second': 1.943, 'total_flos': 9363658844900448.0, 'train_loss': 0.17382249572753905, 'epoch': 3.0})
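
As a sanity check, the figures in the training log can be recomputed from the number of examples, the batch size and the number of epochs. The short sketch below only uses values printed above; the variable names are ours.

```python
import math

num_examples = 25000        # "Num examples" in the log
batch_size = 8              # "Total train batch size"
num_epochs = 3              # "Num Epochs"
train_runtime = 4825.7096   # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs  # 9375, matches "Total optimization steps"

steps_per_second = total_steps / train_runtime
samples_per_second = num_examples * num_epochs / train_runtime

print(total_steps, round(steps_per_second, 3), round(samples_per_second, 3))
```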

## 2. Evaluate the model in terms of accuracy on the test data.

In [11]:
import evaluate
import numpy as np

"""
we calculate our model's accuracy
"""

predictions = trainer.predict(tokenized_datasets["test"])
print(predictions.predictions.shape, predictions.label_ids.shape)

preds = np.argmax(predictions.predictions, axis=-1)

metric = evaluate.load("accuracy")
metric.compute(predictions=preds, references=predictions.label_ids)

The following columns in the test set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 25000
  Batch size = 8


(25000, 2) (25000,)


{'accuracy': 0.93092}
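
To make the metric concrete, the sketch below reproduces the argmax-then-compare computation on a few hypothetical logits (the toy values are invented for illustration), then converts the reported accuracy into a count of misclassified test reviews.

```python
# Hypothetical logits for four reviews, one [negative, positive] pair per row
toy_logits = [[2.0, -1.0], [0.3, 0.9], [-0.5, 1.5], [1.2, 0.1]]
toy_labels = [0, 1, 0, 0]


def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])


toy_preds = [argmax(row) for row in toy_logits]
toy_accuracy = sum(p == y for p, y in zip(toy_preds, toy_labels)) / len(toy_labels)

# Figures reported above: 25000 test reviews, accuracy 0.93092
num_test = 25000
num_correct = round(0.93092 * num_test)
num_wrong = num_test - num_correct

print(toy_preds, toy_accuracy)
print(num_correct, num_wrong)
```

With the reported accuracy, 23273 of the 25000 test reviews are classified correctly and 1727 are not.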

## 3. For at least 2 samples which have been wrongly classified in the test set, try explaining why the model could have been wrong.

In [24]:
prediction_labels = preds
test_labels = np.array(tokenized_datasets['test']['label'])
test_text = np.array(tokenized_datasets['test']['text'])

# collect the first two test samples whose prediction differs from the label
max_examples = 2
examples = []

for index, text in enumerate(test_text):
    if len(examples) == max_examples:
        break
    if test_labels[index] != prediction_labels[index]:
        examples.append((text, test_labels[index], prediction_labels[index]))

for text, label, prediction in examples:
    print('text:', text)
    print('label:', label)
    print('prediction:', prediction)

text: First off let me say, If you haven't enjoyed a Van Damme movie since bloodsport, you probably will not like this movie. Most of these movies may not have the best plots or best actors but I enjoy these kinds of movies for what they are. This movie is much better than any of the movies the other action guys (Segal and Dolph) have thought about putting out the past few years. Van Damme is good in the movie, the movie is only worth watching to Van Damme fans. It is not as good as Wake of Death (which i highly recommend to anyone of likes Van Damme) or In hell but, in my opinion it's worth watching. It has the same type of feel to it as Nowhere to Run. Good fun stuff!
label: 0
prediction: 1
text: Ben, (Rupert Grint), is a deeply unhappy adolescent, the son of his unhappily married parents. His father, (Nicholas Farrell), is a vicar and his mother, (Laura Linney), is ... well, let's just say she's a somewhat hypocritical soldier in Jesus' army. It's only when he takes a summer job as 

Both wrongly classified examples are long and complex. The text is also ambiguous: it mixes misleading statements such as 'you probably will not like this movie' with 'it's worth watching' and 'Good fun stuff!', which makes it hard even for a human to label correctly. In the first example the overall tone reads as positive although the label is 0 (negative), so the label itself is debatable.
The model's bidirectional encoder builds its representation from the whole review, so alternating positive and negative sentences give it conflicting signals, which can explain the wrongly classified examples.
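
One way to check whether such errors are borderline cases is to look at how confident the model was when it was wrong. The sketch below is a possible approach, not something we ran for this lab: it turns logits into probabilities with a softmax and lists the misclassified samples from least to most confident. The helper names and the logits are hypothetical; in the notebook the inputs would be `predictions.predictions` and `predictions.label_ids`.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def misclassified_with_confidence(all_logits, labels):
    """Return (index, label, prediction, confidence) for every wrong prediction."""
    rows = []
    for index, (logits, label) in enumerate(zip(all_logits, labels)):
        probs = softmax(logits)
        pred = probs.index(max(probs))
        if pred != label:
            rows.append((index, label, pred, probs[pred]))
    # least confident mistakes first
    return sorted(rows, key=lambda row: row[3])


# Hypothetical logits for three negative reviews (label 0)
toy_logits = [[0.2, 0.4], [3.0, -3.0], [-2.0, 2.5]]
toy_labels = [0, 0, 0]
mistakes = misclassified_with_confidence(toy_logits, toy_labels)
for index, label, pred, confidence in mistakes:
    print(index, label, pred, round(confidence, 3))
```

A mistake made with a probability close to 0.5 suggests a genuinely ambiguous review, while a confident mistake is more likely to point to a questionable label or to a pattern the model has learned incorrectly.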

## 4. What are the advantages and disadvantages of using this model in production compared to the naive Bayes classifier we implemented in the first part of the course?

The main advantage of the naive Bayes classifier we implemented in the first part of the course is that it is a lot faster to train. It is simple to implement, does not require as much training data, and is fast enough at inference to make real-time predictions. However, it is not as accurate as the fine-tuned DistilBERT model, which can be adapted to the exact task we need. Using the fine-tuned model in production requires more time, data and computing resources (the training run here took about 80 minutes on a GPU), but it delivers better performance (93.1% accuracy on the test set).
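
To illustrate how little is needed to train a naive Bayes classifier, here is a minimal multinomial naive Bayes sketch with add-one smoothing. It is not the implementation from the first part of the course: the function names and the four toy reviews are invented for the example.

```python
import math
from collections import Counter


def train_naive_bayes(docs, labels):
    """Training is a single pass that counts words per class."""
    classes = sorted(set(labels))
    word_counts = {c: Counter() for c in classes}
    doc_counts = Counter(labels)
    for text, label in zip(docs, labels):
        word_counts[label].update(text.lower().split())
    vocab = {word for c in classes for word in word_counts[c]}
    return classes, word_counts, doc_counts, vocab


def predict(model, text):
    """Pick the class with the highest log prior plus log likelihood."""
    classes, word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_class, best_score = None, -math.inf
    for c in classes:
        score = math.log(doc_counts[c] / total_docs)
        denom = sum(word_counts[c].values()) + len(vocab)  # add-one smoothing
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[c][word] + 1) / denom)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Toy training set: 1 = positive, 0 = negative
docs = ["good fun movie", "great acting good plot",
        "boring bad movie", "bad plot terrible acting"]
labels = [1, 1, 0, 0]
nb_model = train_naive_bayes(docs, labels)
print(predict(nb_model, "good fun plot"))
print(predict(nb_model, "terrible boring movie"))
```

Training only counts words, which is why it takes seconds rather than the 9375 optimization steps needed above. The price is that each word is scored independently of its context, whereas the fine-tuned model reads the review as a whole.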