## Hugging Face Lesson (1 to 4)

In this section we used the **Hugging Face Transformers library** to fine-tune a model on the **IMDB** dataset and then evaluated it on the test set.

The Hugging Face Transformers course helped us a lot by walking step by step through the fine-tuning process. First we loaded the **IMDB** dataset, then we instantiated a tokenizer to preprocess it, and finally we created a model and trained it with a `Trainer`.

We used **DistilBERT** as the pre-trained model, as it is light and fast to fine-tune. We also used **accuracy** as the evaluation metric instead of the **loss** (the default). We saved our model on the Hugging Face Model Hub. We evaluated the model in terms of accuracy on the test data and obtained an accuracy of about 0.93. We also tried to explain why the model misclassified some samples of the test set. Finally, we compared the advantages and drawbacks of using this model in production with those of the naive Bayes classifier we implemented in the first part of the course.

---

## Installation

In [None]:
!pip install transformers
!pip install datasets

In [2]:
import transformers
import datasets

In [None]:
!python -m pip install huggingface_hub
!huggingface-cli login

---

## The dataset

First we need to load our dataset, try to understand it better and preprocess it. We used the **IMDB** dataset, which contains 50,000 movie reviews from IMDB labeled by sentiment (positive/negative), split into 25,000 training and 25,000 test reviews, plus an `unsupervised` split of 50,000 reviews.

In [4]:
from datasets import load_dataset

raw_datasets = load_dataset("imdb")
raw_datasets


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

We can access each review in our `raw_datasets` object by selecting a split and indexing it, like with a dictionary:

In [5]:
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

---

## Preprocessing

To preprocess the dataset, we need to convert the text to numbers the model can make sense of.
Here we use the **DistilBERT** tokenizer, i.e. the tokenizer that matches the **DistilBERT** checkpoint. We use **map** with `batched=True` to apply the tokenizer to whole batches of reviews at once, which is much faster than a Python for loop.

In [6]:
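import os

from transformers import DistilBertTokenizerFast, DataCollatorWithPadding

# Read the Hub token from the environment instead of hard-coding a secret in the notebook.
# HF_TOKEN is an assumed variable name; the public checkpoint below also loads without a token.
access_token = os.environ.get("HF_TOKEN")

checkpoint = "distilbert-base-uncased"
tokenizer = DistilBertTokenizerFast.from_pretrained(checkpoint, use_auth_token=access_token)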

def tokenize_function(example):
    return tokenizer(example["text"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets


DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 50000
    })
})
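
To make the new `input_ids` and `attention_mask` columns more concrete, here is a minimal sketch built on a toy whitespace tokenizer. The vocabulary, the `toy_tokenize` helper and the `max_length` of 6 are hypothetical and chosen only for illustration; the real DistilBERT tokenizer splits text into WordPiece sub-words, adds special tokens and uses a much larger limit.

```python
# Toy illustration only: this is NOT the DistilBERT tokenizer.
TOY_VOCAB = {"[UNK]": 0, "this": 1, "movie": 2, "is": 3, "great": 4, "bad": 5}


def toy_tokenize(texts, max_length=6):
    """Map a batch of texts to token ids, truncating each text to max_length tokens."""
    input_ids, attention_mask = [], []
    for text in texts:
        ids = [TOY_VOCAB.get(word, TOY_VOCAB["[UNK]"]) for word in text.lower().split()]
        ids = ids[:max_length]                 # the role played by truncation=True
        input_ids.append(ids)
        attention_mask.append([1] * len(ids))  # 1 marks a real (non-padding) token
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Like a batched map, the function receives a list of texts and returns lists of lists.
batch = toy_tokenize([
    "This movie is great",
    "This movie is really really really really bad",
])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The two reviews end up with different lengths, which is why a padding step is still needed before they can be stacked into one batch.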

Here we create the object responsible for assembling samples into a batch, called a collate function. A data collator takes a list of samples from a Dataset and collates them into a batch. We use **DataCollatorWithPadding**, which pads the samples of each batch to the length of the longest sample in that batch.

In [7]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
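
As a rough picture of what padding a batch means, the sketch below pads a list of tokenized samples by hand. The `pad_batch` helper, the token ids and the pad id of 0 are assumptions made for illustration; the actual `DataCollatorWithPadding` delegates to the tokenizer and returns tensors rather than Python lists.

```python
def pad_batch(samples, pad_token_id=0):
    """Pad every sample to the length of the longest sample in this batch."""
    longest = max(len(sample["input_ids"]) for sample in samples)
    batch = {"input_ids": [], "attention_mask": []}
    for sample in samples:
        n_pad = longest - len(sample["input_ids"])
        batch["input_ids"].append(sample["input_ids"] + [pad_token_id] * n_pad)
        # Padding positions get a 0 in the mask so the model ignores them.
        batch["attention_mask"].append(sample["attention_mask"] + [0] * n_pad)
    return batch


# Two hypothetical tokenized reviews of different lengths.
samples = [
    {"input_ids": [101, 2023, 102], "attention_mask": [1, 1, 1]},
    {"input_ids": [101, 2023, 3185, 2003, 102], "attention_mask": [1, 1, 1, 1, 1]},
]
padded = pad_batch(samples)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch rather than to one global length keeps batches of short reviews small, which saves computation.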

---

## Training

Now that our dataset is ready, we can create our model. We use **DistilBERT**, a light model that is fast to fine-tune, and the **Trainer** class to train it.

In [8]:
from transformers import TrainingArguments, DistilBertForSequenceClassification, Trainer

training_args = TrainingArguments("NLP_DEEP_2", push_to_hub=True)
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', use_auth_token=access_token)



Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

In [9]:
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

Cloning https://huggingface.co/Bictole/NLP_DEEP_2 into local empty directory.



We need to make sure to train on the GPU, otherwise training takes a very long time. The following cell prints the device the model is on:

In [10]:
print(model.device)

cuda:0


In [11]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 25000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 9375
  Number of trainable parameters = 66955010
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.




Saving model checkpoint to NLP_DEEP_2/checkpoint-500
Configuration saved in NLP_DEEP_2/checkpoint-500/config.json
Model weights saved in NLP_DEEP_2/checkpoint-500/pytorch_model.bin
tokenizer config file saved in NLP_DEEP_2/checkpoint-500/tokenizer_config.json
Special tokens file saved in NLP_DEEP_2/checkpoint-500/special_tokens_map.json
tokenizer config file saved in NLP_DEEP_2/tokenizer_config.json
Special tokens file saved in NLP_DEEP_2/special_tokens_map.json
Saving model checkpoint to NLP_DEEP_2/checkpoint-1000
Configuration saved in NLP_DEEP_2/checkpoint-1000/config.json
Model weights saved in NLP_DEEP_2/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in NLP_DEEP_2/checkpoint-1000/tokenizer_config.json
Special tokens file saved in NLP_DEEP_2/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to NLP_DEEP_2/checkpoint-1500
Configuration saved in NLP_DEEP_2/checkpoint-1500/config.json
Model weights saved in NLP_DEEP_2/checkpoint-1500/pytorch_model.bin
token

| Step | Training Loss |
|---|---|
| 500 | 0.4 |
| 1000 | 0.3337 |
| 1500 | 0.3215 |
| 2000 | 0.3093 |
| 2500 | 0.2911 |
| 3000 | 0.2896 |
| 3500 | 0.2098 |
| 4000 | 0.1854 |
| 4500 | 0.1784 |
| 5000 | 0.1693 |


Saving model checkpoint to NLP_DEEP_2/checkpoint-3500
Configuration saved in NLP_DEEP_2/checkpoint-3500/config.json
Model weights saved in NLP_DEEP_2/checkpoint-3500/pytorch_model.bin
tokenizer config file saved in NLP_DEEP_2/checkpoint-3500/tokenizer_config.json
Special tokens file saved in NLP_DEEP_2/checkpoint-3500/special_tokens_map.json
Saving model checkpoint to NLP_DEEP_2/checkpoint-4000
Configuration saved in NLP_DEEP_2/checkpoint-4000/config.json
Model weights saved in NLP_DEEP_2/checkpoint-4000/pytorch_model.bin
tokenizer config file saved in NLP_DEEP_2/checkpoint-4000/tokenizer_config.json
Special tokens file saved in NLP_DEEP_2/checkpoint-4000/special_tokens_map.json
Saving model checkpoint to NLP_DEEP_2/checkpoint-4500
Configuration saved in NLP_DEEP_2/checkpoint-4500/config.json
Model weights saved in NLP_DEEP_2/checkpoint-4500/pytorch_model.bin
tokenizer config file saved in NLP_DEEP_2/checkpoint-4500/tokenizer_config.json
Special tokens file saved in NLP_DEEP_2/checkpoi

TrainOutput(global_step=9375, training_loss=0.18786427083333335, metrics={'train_runtime': 3702.6349, 'train_samples_per_second': 20.256, 'train_steps_per_second': 2.532, 'total_flos': 9363658844900448.0, 'train_loss': 0.18786427083333335, 'epoch': 3.0})

The training of 3 epochs takes around 62 minutes on the GPU (`train_runtime` ≈ 3,703 s).
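
The step count and the duration reported in the log can be checked with a little arithmetic. The sketch below only reuses numbers printed above; the variable names are ours.

```python
import math

num_examples = 25000          # "Num examples" in the training log
batch_size = 8                # "Instantaneous batch size per device"
num_epochs = 3                # "Num Epochs"
train_runtime_s = 3702.6349   # 'train_runtime' in TrainOutput

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
runtime_minutes = train_runtime_s / 60

print(steps_per_epoch, total_steps, round(runtime_minutes, 1))
```

The total matches the "Total optimization steps = 9375" line of the log, and the runtime comes out at a little under 62 minutes.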

---

## Model Hub

Next, we discovered the **Hugging Face Model Hub**, where we can upload our model and share it with the community. We filled in the model card, a markdown file that describes the model and its training. We also uploaded the model's weights and the tokenizer's vocabulary.

In [12]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token


In [13]:
model.push_to_hub("Bictole/NLP_DEEP_2")
tokenizer.push_to_hub("Bictole/NLP_DEEP_2")

Configuration saved in NLP_DEEP_2/config.json
Model weights saved in NLP_DEEP_2/pytorch_model.bin
Uploading the following files to Bictole/NLP_DEEP_2: pytorch_model.bin,config.json
tokenizer config file saved in NLP_DEEP_2/tokenizer_config.json
Special tokens file saved in NLP_DEEP_2/special_tokens_map.json
Uploading the following files to Bictole/NLP_DEEP_2: special_tokens_map.json,tokenizer_config.json,vocab.txt,tokenizer.json


CommitInfo(commit_url='https://huggingface.co/Bictole/NLP_DEEP_2/commit/ea8d3de2bbe475f10e2a792bd4a97fe5b8ead3d7', commit_message='Upload tokenizer', commit_description='', oid='ea8d3de2bbe475f10e2a792bd4a97fe5b8ead3d7', pr_url=None, pr_revision=None, pr_num=None)

---

## Evaluation

Here we evaluate the model in terms of accuracy on the test data.

For that, we call the `predict` method of the trainer on the test dataset, then use the **accuracy** metric to evaluate the predictions.

In [14]:
predictions = trainer.predict(tokenized_datasets["test"])
print(predictions.predictions.shape, predictions.label_ids.shape)

The following columns in the test set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 25000
  Batch size = 8


(25000, 2) (25000,)


In [None]:
!pip install evaluate

With the `evaluate` module, we load the accuracy metric and we compute the accuracy on the test set.

In [18]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")
metric.compute(predictions=predictions.predictions.argmax(axis=-1), references=predictions.label_ids)


{'accuracy': 0.92924}
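
For reference, the computation behind this number is simple: take the argmax of each row of logits and count how often it matches the label. The sketch below does this on a few made-up logits; `accuracy_from_logits` and the toy values are illustrative and are not part of the `evaluate` library.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(pred == label) for pred, label in zip(preds, labels))
    return correct / len(labels)


# Made-up logits with two columns, like the (25000, 2) prediction array above.
toy_logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.0, 1.5]]
toy_labels = [0, 1, 1, 1]

toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(toy_accuracy)
```

Here three of the four toy predictions match their label.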

Here we can see that the model reaches an accuracy of `0.929` on the test set, which is pretty good for a model trained for 3 epochs in about an hour.

Now, for at least 2 samples that were wrongly classified in the test set, we will try to explain why the model could have been wrong.

Let's first take a look at the wrongly classified samples:

In [20]:
import numpy as np

wrong_predictions = np.where(predictions.predictions.argmax(axis=1) != predictions.label_ids)[0][:2]
wrong_predictions

array([ 4, 20])

In [21]:
tokenized_test_dataset = tokenized_datasets["test"]
tokenized_test_dataset[wrong_predictions]['text']

["First off let me say, If you haven't enjoyed a Van Damme movie since bloodsport, you probably will not like this movie. Most of these movies may not have the best plots or best actors but I enjoy these kinds of movies for what they are. This movie is much better than any of the movies the other action guys (Segal and Dolph) have thought about putting out the past few years. Van Damme is good in the movie, the movie is only worth watching to Van Damme fans. It is not as good as Wake of Death (which i highly recommend to anyone of likes Van Damme) or In hell but, in my opinion it's worth watching. It has the same type of feel to it as Nowhere to Run. Good fun stuff!",
 "Low budget horror movie. If you don't raise your expectations too high, you'll probably enjoy this little flick. Beginning and end are pretty good, middle drags at times and seems to go nowhere for long periods as we watch the goings on of the insane that add atmosphere but do not advance the plot. Quite a bit of gore. 

There are a few possible reasons for the model to fail here. First, both reviews mix praise and criticism: the first one concedes that these movies "may not have the best plots or best actors" while calling the film "Good fun stuff", and the second one says the beginning and end are "pretty good" but that the middle "drags at times". Hedged, mixed sentiment like this is hard to reduce to a single label. Second, the model may be **overfitting** the training data, i.e. relying on features of the training reviews that do not **generalize** well to the test data. Third, the model may be biased towards one class. Class imbalance is an unlikely cause, however, because the IMDB training split is balanced between positive and negative reviews. Finally, the model may simply be inaccurate on such samples, for instance because it has not been trained for **long enough** or because the data is too noisy.
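
One way to probe these hypotheses is to split the errors by direction and to look at how confident the model was when it was wrong. The sketch below is a minimal version of that check on made-up logits; the `error_report` helper is hypothetical, and it assumes the label convention 0 = negative, 1 = positive.

```python
import math


def softmax(row):
    """Turn a row of logits into probabilities (numerically stable)."""
    highest = max(row)
    exps = [math.exp(value - highest) for value in row]
    total = sum(exps)
    return [value / total for value in exps]


def error_report(logits, labels):
    """Count errors in each direction and collect the confidence of each error."""
    false_positives = false_negatives = 0
    error_confidences = []
    for row, label in zip(logits, labels):
        probs = softmax(row)
        pred = probs.index(max(probs))
        if pred != label:
            error_confidences.append(max(probs))
            if pred == 1:
                false_positives += 1   # predicted positive, labeled negative
            else:
                false_negatives += 1   # predicted negative, labeled positive
    return {
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        "error_confidences": error_confidences,
    }


# Made-up logits and labels, for illustration only.
toy_logits = [[0.2, 0.1], [3.0, -3.0], [-1.0, 2.0], [0.5, -0.5]]
toy_labels = [1, 1, 1, 0]
report = error_report(toy_logits, toy_labels)
print(report)
```

Applied to `predictions.predictions` and `predictions.label_ids`, a strong asymmetry between the two counts would point to a class bias, while many low-confidence errors would be consistent with ambiguous reviews.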

---

**What are the advantages and drawbacks of using this model in production compared to the naive Bayes we implemented in the first part of the course?**

There are several advantages to using this fine-tuned Hugging Face model in production compared to the naive Bayes we implemented in the first part of the course.

First, the fine-tuned model is more **accurate**. It has been pre-trained on a large corpus and then **fine-tuned** specifically for sentiment analysis, which results in a model that is more accurate than the naive Bayes model.

Second, the model takes **context** into account. Naive Bayes treats words as independent features and ignores word order, whereas DistilBERT reads each word in the context of the whole review, which helps with negation and similar constructions.

Third, the model is likely more **robust**. Because it was pre-trained on general text before being fine-tuned, it should be less sensitive than naive Bayes to reviews whose vocabulary differs from the training data.

There are also some disadvantages to using a fine-tuned Hugging Face model in production.

First, the fine-tuned model is more expensive. It requires more **resources**: it has about 67 million trainable parameters and each fine-tuning run here took more than an hour on a GPU, whereas naive Bayes trains quickly on a CPU.

Second, the model is **slower** at prediction time. Naive Bayes only sums a few log-probabilities per word, while the transformer runs a full forward pass through all its layers for every review.

Third, the model is less **interpretable**. With naive Bayes, each word has an explicit contribution to each class, so a prediction can be traced back to individual words. The fine-tuned model is much closer to a black box, and it is difficult to understand why it makes the predictions it does.

Fourth, the model is more complex. The naive Bayes model is very simple, and so is easy to understand, deploy and maintain. The fine-tuned model comes with a tokenizer, large weight files and heavier dependencies.

Fifth, the model is **less flexible** to update. The naive Bayes model can be cheaply retrained on different data or for a different task, while adapting the fine-tuned model requires another fine-tuning run.

---

## Bonus

Fine-tune your model using the **accuracy** as evaluation instead of the loss (default). You can use the base Trainer class, create your own custom trainer class, or even not use Trainer at all. Return the model with the best results on the validation set instead of the last one.

We decided to use the **base Trainer class** to fine-tune our model, with **accuracy** as the evaluation metric instead of the **loss**. A `compute_metrics` function computes the accuracy on the validation set.

In [25]:
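# Let's now fine-tune our model using the accuracy as evaluation instead of the loss

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # eval_pred holds the logits and the reference labels of the evaluation set
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Note: metric_for_best_model is only used to reload the best checkpoint when
# load_best_model_at_end=True is also set (with a save strategy matching the
# evaluation strategy); as written, the Trainer keeps the last-epoch weights.
training_args = TrainingArguments("NLP_DEEP_2", metric_for_best_model="accuracy", evaluation_strategy="epoch")
# `model` below is the instance already fine-tuned above, so this run continues from those weights.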
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    compute_metrics=compute_metrics,
    data_collator=data_collator,
    tokenizer=tokenizer,
)

trainer.train()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 25000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 9375
  Number of trainable parameters = 66955010




Saving model checkpoint to NLP_DEEP_2/checkpoint-500
Configuration saved in NLP_DEEP_2/checkpoint-500/config.json
Model weights saved in NLP_DEEP_2/checkpoint-500/pytorch_model.bin
tokenizer config file saved in NLP_DEEP_2/checkpoint-500/tokenizer_config.json
Special tokens file saved in NLP_DEEP_2/checkpoint-500/special_tokens_map.json
Saving model checkpoint to NLP_DEEP_2/checkpoint-1000
Configuration saved in NLP_DEEP_2/checkpoint-1000/config.json
Model weights saved in NLP_DEEP_2/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in NLP_DEEP_2/checkpoint-1000/tokenizer_config.json
Special tokens file saved in NLP_DEEP_2/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to NLP_DEEP_2/checkpoint-1500
Configuration saved in NLP_DEEP_2/checkpoint-1500/config.json
Model weights saved in NLP_DEEP_2/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in NLP_DEEP_2/checkpoint-1500/tokenizer_config.json
Special tokens file saved in NLP_DEEP_2/checkpoint-15

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.1119 | 0.408011 | 0.91568 |
| 2 | 0.05 | 0.549399 | 0.9138 |
| 3 | 0.024 | 0.594581 | 0.92252 |


Saving model checkpoint to NLP_DEEP_2/checkpoint-8000
Configuration saved in NLP_DEEP_2/checkpoint-8000/config.json
Model weights saved in NLP_DEEP_2/checkpoint-8000/pytorch_model.bin
tokenizer config file saved in NLP_DEEP_2/checkpoint-8000/tokenizer_config.json
Special tokens file saved in NLP_DEEP_2/checkpoint-8000/special_tokens_map.json
Saving model checkpoint to NLP_DEEP_2/checkpoint-8500
Configuration saved in NLP_DEEP_2/checkpoint-8500/config.json
Model weights saved in NLP_DEEP_2/checkpoint-8500/pytorch_model.bin
tokenizer config file saved in NLP_DEEP_2/checkpoint-8500/tokenizer_config.json
Special tokens file saved in NLP_DEEP_2/checkpoint-8500/special_tokens_map.json
Saving model checkpoint to NLP_DEEP_2/checkpoint-9000
Configuration saved in NLP_DEEP_2/checkpoint-9000/config.json
Model weights saved in NLP_DEEP_2/checkpoint-9000/pytorch_model.bin
tokenizer config file saved in NLP_DEEP_2/checkpoint-9000/tokenizer_config.json
Special tokens file saved in NLP_DEEP_2/checkpoi

TrainOutput(global_step=9375, training_loss=0.06052143096923828, metrics={'train_runtime': 4977.6335, 'train_samples_per_second': 15.067, 'train_steps_per_second': 1.883, 'total_flos': 9363658844900448.0, 'train_loss': 0.06052143096923828, 'epoch': 3.0})

The training of 3 epochs takes around 83 minutes on the GPU (`train_runtime` ≈ 4,978 s).

Finally, we evaluate the model in terms of **accuracy** on the test data.

In [26]:
predictions_accuracy = trainer.predict(tokenized_datasets["test"])
print(predictions_accuracy.predictions.shape, predictions_accuracy.label_ids.shape)

The following columns in the test set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 25000
  Batch size = 8


(25000, 2) (25000,)


In [27]:
metric = evaluate.load("accuracy")
metric.compute(predictions=predictions_accuracy.predictions.argmax(axis=-1), references=predictions_accuracy.label_ids)

{'accuracy': 0.92252}

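We got an accuracy of 0.92 (0.923), which is close to the 0.929 we got in the first run with the default loss. It is hard to draw firm conclusions from this comparison, for two reasons visible in the logs. The `Trainer` received the `model` object that had already been fine-tuned above, so this run continued training for three more epochs instead of starting from the pre-trained checkpoint: the training loss starts at 0.11 while the validation loss rises from 0.41 to 0.59, which suggests overfitting. In addition, the test set was also used as the validation set, so the reported accuracy is not an independent estimate. Training the model for **more epochs** from a fresh checkpoint could give us more insight, but it would take a lot of time.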
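
To illustrate why the choice of selection metric matters, the sketch below picks the best epoch from the evaluation table above, once by accuracy and once by validation loss. The `eval_log` list is a hand-copied version of that table and the key names are ours; this is not how `Trainer` stores its state internally.

```python
# Per-epoch evaluation results copied from the training log above.
eval_log = [
    {"epoch": 1, "eval_loss": 0.408011, "eval_accuracy": 0.91568},
    {"epoch": 2, "eval_loss": 0.549399, "eval_accuracy": 0.9138},
    {"epoch": 3, "eval_loss": 0.594581, "eval_accuracy": 0.92252},
]

# Higher is better for accuracy, lower is better for loss.
best_by_accuracy = max(eval_log, key=lambda row: row["eval_accuracy"])
best_by_loss = min(eval_log, key=lambda row: row["eval_loss"])

print("best epoch by accuracy:", best_by_accuracy["epoch"])
print("best epoch by loss:", best_by_loss["epoch"])
```

On these numbers the two criteria disagree: accuracy selects the last epoch, whereas validation loss selects the first one.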

---

Our model is now ready to be used in production. It is available on the **Hugging Face Model Hub** here: https://huggingface.co/Bictole/NLP_DEEP_2