In [1]:
# Install the Hugging Face packages
! pip install datasets evaluate transformers[sentencepiece] torch

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.7.0-py3-none-any.whl (451 kB)
Collecting evaluate
  Downloading evaluate-0.3.0-py3-none-any.whl (72 kB)
Collecting transformers[sentencepiece]
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Collecting xxhash
  Downloading xxhash-3.1.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py37-none-any.whl (115 kB)
Collecting huggingface-hub<1.0.0,>=0.2.0
  Downloading huggingface_hub-0.11.0-py3-none-any.whl (182 kB)

### 1. Fine-tune the model on the training data
We will use the transformers library provided by Hugging Face to get a model.

In [2]:
# Import the libraries
import transformers
import torch

Load the IMDB dataset and the tokenizer of the pretrained DistilBERT model.  
IMDB is a dataset of movie reviews, each with a label that corresponds to a positive or negative opinion of the movie by the reviewer.

In [3]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

dataset = load_dataset("imdb")

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-cased')


Downloading and preparing dataset imdb/plain_text to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1...



Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1. Subsequent calls will reuse this data.



In [4]:
#The dataset is a DatasetDict
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [5]:
# Define a simple tokenization function for the "text" column of the dataset
def tokenize_function(example):
    return tokenizer(example["text"], truncation=True)

# Apply it to every split; the collator then pads each batch dynamically
tokenized_datasets = dataset.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
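
To make the role of `DataCollatorWithPadding` more concrete, here is a minimal pure-Python sketch of dynamic padding: each batch is padded only up to the length of its longest sequence. The helper `pad_batch`, the toy token ids, and the pad id of 0 are assumptions made for illustration; this is not the library's implementation.

```python
from typing import Dict, List


def pad_batch(sequences: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Pad token-id lists to the longest length in this batch (hypothetical helper)."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy token ids, not produced by a real tokenizer
batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a global maximum keeps short reviews from being processed at the full truncation length.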


In [6]:
from transformers import TrainingArguments

# Training takes a long time, so we train for 1 epoch instead of the default 3
training_args = TrainingArguments("test-trainer", num_train_epochs=1)

In [7]:
from transformers import DistilBertForSequenceClassification

model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-cased', num_labels=2)

Downloading:   0%|          | 0.00/263M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier

In [8]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 

In [9]:
model.to(device)
model.device

device(type='cuda', index=0)

The following line is used if the model is not run on Colab.

In [10]:
#model.cuda()

In [11]:
from transformers import Trainer

# Create the trainer with the train and test splits
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [12]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 25000
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 3125
  Number of trainable parameters = 65783042
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|---|---|
| 500 | 0.4507 |
| 1000 | 0.3729 |
| 1500 | 0.3203 |
| 2000 | 0.3032 |
| 2500 | 0.2833 |
| 3000 | 0.2844 |


Saving model checkpoint to test-trainer/checkpoint-500
Configuration saved in test-trainer/checkpoint-500/config.json
Model weights saved in test-trainer/checkpoint-500/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-500/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-500/special_tokens_map.json
Saving model checkpoint to test-trainer/checkpoint-1000
Configuration saved in test-trainer/checkpoint-1000/config.json
Model weights saved in test-trainer/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-1000/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to test-trainer/checkpoint-1500
Configuration saved in test-trainer/checkpoint-1500/config.json
Model weights saved in test-trainer/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-1500/tokenizer_config.json
Special tokens file saved

TrainOutput(global_step=3125, training_loss=0.3346817175292969, metrics={'train_runtime': 1190.7218, 'train_samples_per_second': 20.996, 'train_steps_per_second': 2.624, 'total_flos': 3141257378816640.0, 'train_loss': 0.3346817175292969, 'epoch': 1.0})
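
The numbers in the training log can be cross-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, which matches the log (total train batch size 8, accumulation steps 1); the variable names are chosen here for illustration.

```python
import math

num_examples = 25000        # size of the IMDB train split
batch_size = 8              # per-device batch size reported in the log
num_epochs = 1
train_runtime = 1190.7218   # seconds, taken from TrainOutput

# One optimization step per batch, since gradient accumulation is 1
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = round(num_examples * num_epochs / train_runtime, 3)
steps_per_second = round(total_steps / train_runtime, 3)

print(total_steps)
print(samples_per_second)
print(steps_per_second)
```

The results agree with `global_step`, `train_samples_per_second`, and `train_steps_per_second` in the output above.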

### 2. Evaluating the model

In [13]:
import pandas as pd
import numpy as np

In [14]:
predictions = trainer.predict(tokenized_datasets["test"])

The following columns in the test set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 25000
  Batch size = 8


In [15]:
preds = np.argmax(predictions.predictions, axis=-1)
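
`predictions.predictions` holds one row of raw logits per review, with one column per class (0 is negative and 1 is positive in IMDB). The following sketch uses made-up logits to show how `argmax` picks the class and how a softmax turns the logits into a confidence score; `softmax` is a small helper written for this example.

```python
import numpy as np

# Toy logits for three reviews (made-up numbers); columns are [negative, positive]
logits = np.array([[2.0, -1.0], [0.1, 0.3], [-1.5, 2.5]])


def softmax(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax; the row maximum is subtracted for numerical stability."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


probs = softmax(logits)
toy_preds = np.argmax(logits, axis=-1)  # identical to the argmax over probs
confidence = probs.max(axis=-1)         # probability of the predicted class

print(toy_preds)
print(confidence.round(3))
```

Low-confidence rows, such as the second toy review, are natural candidates when looking for misclassified samples.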

In [16]:
# Use a name that does not shadow the built-in eval()
eval_results = trainer.evaluate()
eval_results

The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 25000
  Batch size = 8


{'eval_loss': 0.237929105758667,
 'eval_runtime': 399.6654,
 'eval_samples_per_second': 62.552,
 'eval_steps_per_second': 7.819,
 'epoch': 1.0}
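
The evaluation output contains only the loss and timing figures because no `compute_metrics` function was passed to the `Trainer`. Below is a sketch of such a function, assuming it would be passed as `Trainer(..., compute_metrics=compute_metrics)`. It is exercised here on toy arrays rather than on the real model.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Compute accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}


# Toy data (made-up numbers): 4 reviews, columns are [negative, positive]
toy_logits = np.array([[1.2, -0.3], [0.2, 0.9], [-0.5, 0.4], [2.0, 0.1]])
toy_labels = np.array([0, 1, 0, 0])

metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)
```

With this in place, the accuracy would be reported by `trainer.evaluate()` alongside the loss instead of being computed by hand afterwards.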

In [17]:
prediction = pd.DataFrame(np.array(dataset['test']['text']), columns=['sentence'])

In [18]:
prediction['preds'] = np.array(preds)
prediction['label'] = np.array(dataset['test']['label'])

In [19]:
prediction

| | sentence | preds | label |
|---|---|---|---|
| 0 | I love sci-fi and am willing to put up with a ... | 0 | 0 |
| 1 | Worth the entertainment value of a rental, esp... | 0 | 0 |
| 2 | its a totally average film with a few semi-alr... | 0 | 0 |
| 3 | STAR RATING: \*\*\*\*\* Saturday Night \*\*\*\* Friday ... | 0 | 0 |
| 4 | First off let me say, If you haven't enjoyed a... | 1 | 0 |
| ... | ... | ... | ... |
| 24995 | Just got around to seeing Monster Man yesterda... | 1 | 1 |
| 24996 | I got this as part of a competition prize. I w... | 1 | 1 |
| 24997 | I got Monster Man in a box set of three films ... | 1 | 1 |
| 24998 | Five minutes in, i started to feel how naff th... | 0 | 1 |


In [20]:
false_prediction = prediction[prediction['preds'] != prediction['label']]
false_prediction

| | sentence | preds | label |
|---|---|---|---|
| 4 | First off let me say, If you haven't enjoyed a... | 1 | 0 |
| 18 | Ben, (Rupert Grint), is a deeply unhappy adole... | 1 | 0 |
| 36 | Beware, My Lovely (1952) Dir: Harry Horner <br... | 1 | 0 |
| 46 | Okay, so it was never going to change the worl... | 1 | 0 |
| 61 | This film features two of my favorite guilty p... | 1 | 0 |
| ... | ... | ... | ... |
| 24917 | I'm torn about this show. While MOST parts of ... | 0 | 1 |
| 24920 | Sex, drugs, racism and of course you ABC's. Wh... | 0 | 1 |
| 24938 | Should we take the opening shot as a strange f... | 0 | 1 |
| 24981 | "Gaming? Nicotine? Fisticuffs? We're moving in... | 0 | 1 |


In [33]:
accuracy = (1 - (len(false_prediction) / len(preds))) * 100
print('The accuracy with DistilBert:', accuracy)

The accuracy with DistilBert: 91.684
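
An accuracy of 91.684% on 25,000 test reviews corresponds to 2,079 misclassified reviews. To see how those errors split between the two classes, a breakdown like the following could be used. `error_breakdown` is a hypothetical helper and the arrays are toy data; in the notebook it could be called as `error_breakdown(preds, dataset['test']['label'])`.

```python
import numpy as np


def error_breakdown(preds, labels):
    """Count false positives and false negatives and compute accuracy in percent."""
    preds, labels = np.asarray(preds), np.asarray(labels)
    return {
        # predicted positive (1) while the label is negative (0)
        "false_positive": int(((preds == 1) & (labels == 0)).sum()),
        # predicted negative (0) while the label is positive (1)
        "false_negative": int(((preds == 0) & (labels == 1)).sum()),
        "accuracy_pct": float((preds == labels).mean() * 100),
    }


# Toy predictions and labels (made-up values)
toy = error_breakdown([0, 0, 1, 1, 0, 1], [0, 0, 0, 1, 1, 1])
print(toy)
```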


### 3. For at least 2 samples which have been wrongly classified in the test set, try explaining why the model could have been wrong.

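Example of a false prediction where the predicted class was positive (1) but the given label was negative (0).  
The model may be confused because the review mixes many negative phrases ('haven't enjoyed', 'you probably will not like this movie', 'is not as good as') with a favorable conclusion ('worth watching', 'Good fun stuff!'). The overall sentiment is ambiguous, and the model sided with the positive wording.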

In [None]:
np.array(false_prediction[false_prediction['preds'] == 1].sentence)[0]

"First off let me say, If you haven't enjoyed a Van Damme movie since bloodsport, you probably will not like this movie. Most of these movies may not have the best plots or best actors but I enjoy these kinds of movies for what they are. This movie is much better than any of the movies the other action guys (Segal and Dolph) have thought about putting out the past few years. Van Damme is good in the movie, the movie is only worth watching to Van Damme fans. It is not as good as Wake of Death (which i highly recommend to anyone of likes Van Damme) or In hell but, in my opinion it's worth watching. It has the same type of feel to it as Nowhere to Run. Good fun stuff!"

Example of a false prediction where the predicted class was negative (0) but the given label was positive (1).  
In this example the first sentence is quite positive, but most of the rest of the review is a complaint about the movie, which likely pushed the model toward the negative class.

In [None]:
np.array(false_prediction[false_prediction['preds'] == 0].sentence)[0]

'Overall, a well done movie. There were the parts that made me wince, and there were the parts that I threw my hands up at, but I came away with something more than I gone in with.<br /><br />I think the movie suffers from some serious excess ambition. Without spoiling it, let me say that the obvious references to the trial by fire in Ramayana, is way beyond what this movie stands for. The Ramayana is an epic. Not a 200 page book that puts down women in India. The movie is about two girls married into a very distinctive Indian family. While the basic tenets of the "unwritten laws of the family tradition" seem to be that of conservative India, let me assure my reader that I (having lived in Delhi for 12 years) found entire parts that just did not ring those bells. I mean some things and some actions are very true, but some other stuff is just way off the mark. Especially today.<br /><br />Delhi is complicated. India is complicated. The director tries to simplify both. And fails pretty m

### 4. What are the advantages and drawbacks of using this model in production compared to the naive Bayes we implemented in the first part of the course?

To build the Naive Bayes model, we had to preprocess the text with stemming or lemmatization, treat each word on its own, and relate it to a fixed vocabulary. The model was trained in a supervised way on the labeled reviews. It is easy to understand and gives good results for a first approach.

Here we fine-tune a pretrained model instead. DistilBERT was pretrained without labels using masked language modeling (MLM), where some words in a sentence are masked and the model has to predict which words fit the mask. It is a smaller model distilled from the larger BERT base model, yet its accuracy is similar. Only the fine-tuning step on IMDB uses labels.

One drawback of this method is that, unlike the Naive Bayes model, it needs a GPU in practice: even on the Colab GPU, one epoch of fine-tuning took about 20 minutes and predicting the 25,000 test reviews took almost 7 minutes.
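
For comparison, a multinomial Naive Bayes classifier fits in a few lines of pure Python and runs instantly on a CPU. The sketch below is a toy version with add-one smoothing, not the implementation from the first part of the course; the function names and the four training sentences are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(texts, labels):
    """Collect per-class word counts for a multinomial Naive Bayes (toy sketch)."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocab


def predict_nb(text, word_counts, class_counts, vocab):
    """Return the class with the highest log-posterior, using add-one smoothing."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


texts = ["good fun movie", "great acting good plot",
         "boring bad movie", "bad plot terrible acting"]
labels = [1, 1, 0, 0]
nb_model = train_nb(texts, labels)

print(predict_nb("good fun", *nb_model))
print(predict_nb("not good", *nb_model))
```

Because each word is scored independently, "not good" is still classified as positive here: word order and negation are invisible to a bag-of-words model. A contextual model such as DistilBERT can in principle capture these cues, at the price of the heavier training and inference cost described above.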