# <center>Lab03 Transformers</center>

## HuggingFace Transformers

In [7]:
!pip install transformers[sentencepiece]



## Sentence classification

We use the HuggingFace Transformers library to fine-tune a model on the IMDB dataset, then evaluate it on the test set.

In [8]:
import transformers
from datasets import load_dataset
import numpy as np

BATCH_SIZE = 10

# we will use the distilbert model

In [9]:
# working with the GPU
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


In [10]:
dataset = load_dataset("imdb")
train_dataset = dataset["train"].train_test_split(
    stratify_by_column="label", test_size=0.2, seed=42
)
test_df = dataset["test"]
train_df = train_dataset["train"]
valid_df = train_dataset["test"]
train_df.shape, valid_df.shape, test_df.shape

Found cached dataset imdb (/home/pili/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)
100%|██████████| 3/3 [00:00<00:00, 950.51it/s]
Loading cached split indices for dataset at /home/pili/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-5f37fd0866e4f89f.arrow and /home/pili/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-dd5732a0e6ac784c.arrow


((20000, 2), (5000, 2), (25000, 2))
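
The shapes above follow from holding out 20% of the 25,000 training reviews while keeping both labels in the same proportion. As a rough illustration of what a stratified split does, here is a minimal pure-Python sketch on toy labels. `stratified_split` is a hypothetical helper written for this note, not the `datasets` implementation, and it will not select the same indices as `train_test_split`.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) keeping each label's proportion in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        # shuffle within each class, then hold out the same fraction of every class
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# toy stand-in for the IMDB training labels: 12,500 negative and 12,500 positive
labels = [0] * 12500 + [1] * 12500
train_idx, test_idx = stratified_split(labels, test_size=0.2, seed=42)
print(len(train_idx), len(test_idx))
```

With balanced labels, the held-out part contains exactly half positive and half negative reviews.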

### Preprocessing

In [11]:
# preprocessing
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = train_dataset.map(tokenize_function, batched=True)
tokenized_datasets.set_format("torch")

data_collator = transformers.DataCollatorWithPadding(tokenizer=tokenizer)

Loading cached processed dataset at /home/pili/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-74e187f956cd9276.arrow
Loading cached processed dataset at /home/pili/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-bf12997ecf0c40f9.arrow
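
The cell above pads every review to the model's maximum length at tokenization time (`padding="max_length"`) and also creates a `DataCollatorWithPadding`, which pads each batch to its longest sequence. The sketch below illustrates the difference between the two strategies on toy token ids. `pad_batch` is a hypothetical helper, the ids are made up, and the length of 512 is assumed to be the maximum input length of `distilbert-base-uncased`.

```python
def pad_batch(sequences, pad_id=0, length=None):
    """Pad token-id lists to `length`, or to the longest sequence when `length` is None."""
    target = length if length is not None else max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        kept = seq[:target]                      # truncate if longer than the target
        n_pad = target - len(kept)
        input_ids.append(kept + [pad_id] * n_pad)
        attention_mask.append([1] * len(kept) + [0] * n_pad)
    return input_ids, attention_mask

# toy token ids, for illustration only
batch = [[101, 2023, 102], [101, 2023, 3185, 2003, 102]]

fixed_ids, _ = pad_batch(batch, length=512)      # analogous to padding="max_length"
dynamic_ids, dynamic_mask = pad_batch(batch)     # analogous to padding per batch
print(len(fixed_ids[0]), len(dynamic_ids[0]))
```

Padding per batch keeps short reviews short, which usually reduces compute. Once the inputs are already padded to the maximum length, the collator has nothing left to pad.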


### Fine-tuning the model using accuracy

Training arguments: the largest batch size we could use was 10, which gives 2,000 steps per epoch for the 20,000 training reviews.  
The loss is logged every 100 steps; every 500 steps we compute the accuracy on the validation set and save the model.
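
The step counts can be checked with a few lines of arithmetic. This is only a sketch, assuming a single device and no gradient accumulation; the variable names are ours.

```python
import math

n_train = 20_000      # training reviews after the 80/20 split
batch_size = 10       # BATCH_SIZE in this notebook
eval_every = 500      # eval_steps and save_steps
log_every = 100       # logging_steps

steps_per_epoch = math.ceil(n_train / batch_size)
eval_steps = list(range(eval_every, steps_per_epoch + 1, eval_every))
n_logs = steps_per_epoch // log_every
print(steps_per_epoch, eval_steps, n_logs)
```

One epoch therefore triggers four evaluations and checkpoints, and twenty loss logs.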

In [12]:
# training
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    logging_dir="./logs",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    evaluation_strategy="steps",
    save_strategy="steps",
    eval_steps=500,
    save_steps=500,
    logging_steps=100,
    
)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

We use the pretrained model `distilbert-base-uncased` with a classification head for two labels.

In [None]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Custom metric function based on accuracy

In [13]:
import evaluate

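# load the metric once instead of reloading it at every evaluation
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # eval_pred holds the model logits and the reference labels
    logits, labels = eval_pred
    # the predicted class is the one with the highest logit
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)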
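
To make the metric concrete, here is a small pure-Python sketch of the same computation on made-up logits: take the argmax of each row, then count the matches with the labels. `accuracy_from_logits` is a hypothetical helper for illustration; the notebook itself relies on `np.argmax` and the `evaluate` accuracy metric.

```python
def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and return the fraction matching `labels`."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return predictions, correct / len(labels)

# made-up logits for four reviews and two classes (0 = negative, 1 = positive)
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
toy_labels = [0, 1, 0, 0]

predictions, accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(predictions, accuracy)
```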

Trainer setup

In [14]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [15]:
trainer.train()

  0%|          | 0/2000 [00:00<?, ?it/s]You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
  5%|▌         | 100/2000 [00:42<13:12,  2.40it/s]

{'loss': 0.4768, 'learning_rate': 4.75e-05, 'epoch': 0.05}


 10%|█         | 200/2000 [01:25<12:42,  2.36it/s]

{'loss': 0.364, 'learning_rate': 4.5e-05, 'epoch': 0.1}


 15%|█▌        | 300/2000 [02:07<11:53,  2.38it/s]

{'loss': 0.3581, 'learning_rate': 4.25e-05, 'epoch': 0.15}


 20%|██        | 400/2000 [02:50<11:15,  2.37it/s]

{'loss': 0.3774, 'learning_rate': 4e-05, 'epoch': 0.2}


 25%|██▌       | 500/2000 [03:32<10:34,  2.36it/s]

{'loss': 0.3243, 'learning_rate': 3.7500000000000003e-05, 'epoch': 0.25}


                                                  
 25%|██▌       | 500/2000 [04:47<10:34,  2.36it/s]

{'eval_loss': 0.32818135619163513, 'eval_accuracy': 0.8974, 'eval_runtime': 74.7844, 'eval_samples_per_second': 66.859, 'eval_steps_per_second': 6.686, 'epoch': 0.25}


 30%|███       | 600/2000 [05:31<09:57,  2.34it/s]  

{'loss': 0.3319, 'learning_rate': 3.5e-05, 'epoch': 0.3}


 35%|███▌      | 700/2000 [06:13<09:07,  2.37it/s]

{'loss': 0.3094, 'learning_rate': 3.2500000000000004e-05, 'epoch': 0.35}


 40%|████      | 800/2000 [06:56<08:28,  2.36it/s]

{'loss': 0.2843, 'learning_rate': 3e-05, 'epoch': 0.4}


 45%|████▌     | 900/2000 [07:38<07:45,  2.36it/s]

{'loss': 0.3232, 'learning_rate': 2.7500000000000004e-05, 'epoch': 0.45}


 50%|█████     | 1000/2000 [08:21<07:02,  2.36it/s]

{'loss': 0.2841, 'learning_rate': 2.5e-05, 'epoch': 0.5}


                                                   
 50%|█████     | 1000/2000 [09:36<07:02,  2.36it/s]

{'eval_loss': 0.29452940821647644, 'eval_accuracy': 0.9038, 'eval_runtime': 75.0229, 'eval_samples_per_second': 66.646, 'eval_steps_per_second': 6.665, 'epoch': 0.5}


 55%|█████▌    | 1100/2000 [10:20<06:18,  2.38it/s]  

{'loss': 0.2568, 'learning_rate': 2.25e-05, 'epoch': 0.55}


 60%|██████    | 1200/2000 [11:02<05:33,  2.40it/s]

{'loss': 0.2848, 'learning_rate': 2e-05, 'epoch': 0.6}


 65%|██████▌   | 1300/2000 [11:45<04:57,  2.35it/s]

{'loss': 0.2536, 'learning_rate': 1.75e-05, 'epoch': 0.65}


 70%|███████   | 1400/2000 [12:28<04:12,  2.38it/s]

{'loss': 0.225, 'learning_rate': 1.5e-05, 'epoch': 0.7}


 75%|███████▌  | 1500/2000 [13:10<03:31,  2.37it/s]

{'loss': 0.2592, 'learning_rate': 1.25e-05, 'epoch': 0.75}


                                                   
 75%|███████▌  | 1500/2000 [14:25<03:31,  2.37it/s]

{'eval_loss': 0.25944778323173523, 'eval_accuracy': 0.9122, 'eval_runtime': 74.768, 'eval_samples_per_second': 66.873, 'eval_steps_per_second': 6.687, 'epoch': 0.75}


 80%|████████  | 1600/2000 [15:08<02:48,  2.37it/s]  

{'loss': 0.2051, 'learning_rate': 1e-05, 'epoch': 0.8}


 85%|████████▌ | 1700/2000 [15:51<02:04,  2.41it/s]

{'loss': 0.2177, 'learning_rate': 7.5e-06, 'epoch': 0.85}


 90%|█████████ | 1800/2000 [16:33<01:23,  2.40it/s]

{'loss': 0.2365, 'learning_rate': 5e-06, 'epoch': 0.9}


 95%|█████████▌| 1900/2000 [17:15<00:44,  2.27it/s]

{'loss': 0.2742, 'learning_rate': 2.5e-06, 'epoch': 0.95}


100%|██████████| 2000/2000 [17:58<00:00,  2.36it/s]

{'loss': 0.2357, 'learning_rate': 0.0, 'epoch': 1.0}


                                                   
100%|██████████| 2000/2000 [19:14<00:00,  2.36it/s]

{'eval_loss': 0.23784467577934265, 'eval_accuracy': 0.9162, 'eval_runtime': 75.9594, 'eval_samples_per_second': 65.825, 'eval_steps_per_second': 6.582, 'epoch': 1.0}


100%|██████████| 2000/2000 [19:15<00:00,  1.73it/s]

{'train_runtime': 1155.7921, 'train_samples_per_second': 17.304, 'train_steps_per_second': 1.73, 'train_loss': 0.294110538482666, 'epoch': 1.0}





TrainOutput(global_step=2000, training_loss=0.294110538482666, metrics={'train_runtime': 1155.7921, 'train_samples_per_second': 17.304, 'train_steps_per_second': 1.73, 'train_loss': 0.294110538482666, 'epoch': 1.0})
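
The learning rates in the log decrease linearly from 5e-05 to 0 over the 2,000 steps. This is consistent with what we understand to be the `TrainingArguments` defaults (initial learning rate 5e-05, linear decay, no warmup), since the cell above sets none of these explicitly. A minimal sketch that reproduces the logged values, with `linear_lr` as a hypothetical helper:

```python
def linear_lr(step, base_lr=5e-5, total_steps=2000):
    """Learning rate after `step` optimizer updates under linear decay without warmup."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)

# compare with the 'learning_rate' entries logged at the same steps
for step in (100, 500, 1000, 2000):
    print(step, linear_lr(step))
```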

Saving and loading the model

In [16]:
# save the model
trainer.save_model("./model")

In [17]:
# load the model
from transformers import pipeline
classifier = pipeline("sentiment-analysis", model="./model", tokenizer=checkpoint)

Xformers is not installed correctly. If you want to use memorry_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


### Evaluation of the model on the test set

In [20]:
# evaluate the model on the test set
tokenized_test = dataset["test"].map(tokenize_function, batched=True)
tokenized_test.set_format("torch")
trainer.evaluate(tokenized_test)

100%|██████████| 2500/2500 [06:12<00:00,  6.71it/s]                


{'eval_loss': 0.21499311923980713,
 'eval_accuracy': 0.92404,
 'eval_runtime': 372.9686,
 'eval_samples_per_second': 67.03,
 'eval_steps_per_second': 6.703,
 'epoch': 1.0}

Here we obtain an accuracy of 0.92404 (about 92.4%) on the test set, which is consistent with the 0.9162 accuracy measured on the validation set at the end of training.
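
To put these two numbers in perspective, the sketch below converts each accuracy into a number of misclassified reviews and a rough standard error. It treats every review as an independent Bernoulli trial, which is a simplifying assumption, and `summarize` is a hypothetical helper.

```python
import math

def summarize(accuracy, n):
    """Return (number of errors, binomial standard error) for an accuracy measured on n examples."""
    errors = n - round(accuracy * n)
    std_err = math.sqrt(accuracy * (1 - accuracy) / n)
    return errors, std_err

test_errors, test_se = summarize(0.92404, 25_000)    # test set
valid_errors, valid_se = summarize(0.9162, 5_000)    # validation set
print(test_errors, round(test_se, 4))
print(valid_errors, round(valid_se, 4))
```

The validation estimate is noisier because it is based on five times fewer reviews.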

### Analysis on wrongly classified examples

In [30]:
# analysis on two wrongly classified examples on the test set
nb = 2
for example in dataset["test"]:
    # run the classifier once per review; truncate reviews longer than the model's maximum input length
    predicted = int(classifier(example["text"], truncation=True)[0]["label"][-1])
    if predicted != int(example["label"]):
        print(example["text"])
        print("classifier prediction : ", predicted)
        print("dataset label : ", example["label"])
        nb -= 1
        if nb == 0:
            break

First off let me say, If you haven't enjoyed a Van Damme movie since bloodsport, you probably will not like this movie. Most of these movies may not have the best plots or best actors but I enjoy these kinds of movies for what they are. This movie is much better than any of the movies the other action guys (Segal and Dolph) have thought about putting out the past few years. Van Damme is good in the movie, the movie is only worth watching to Van Damme fans. It is not as good as Wake of Death (which i highly recommend to anyone of likes Van Damme) or In hell but, in my opinion it's worth watching. It has the same type of feel to it as Nowhere to Run. Good fun stuff!
classifier prediction :  1
dataset label :  0
Isaac Florentine has made some of the best western Martial Arts action movies ever produced. In particular US Seals 2, Cold Harvest, Special Forces and Undisputed 2 are all action classics. You can tell Isaac has a real passion for the genre and his films are always eventful, crea
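
The comparison above reads the class id from the last character of the label string returned by the pipeline. A slightly more defensive way to do this is sketched below. It assumes the saved model uses the default label names `LABEL_0` and `LABEL_1` (no custom `id2label` was configured in this notebook); `label_to_id` is a hypothetical helper and the scores are made up.

```python
def label_to_id(label):
    """Convert a pipeline label such as 'LABEL_1' to the integer class id 1."""
    prefix, _, index = label.rpartition("_")
    if prefix != "LABEL" or not index.isdigit():
        raise ValueError(f"unexpected label format: {label!r}")
    return int(index)

# toy outputs in the shape returned by the text-classification pipeline
outputs = [{"label": "LABEL_1", "score": 0.98}, {"label": "LABEL_0", "score": 0.71}]
print([label_to_id(o["label"]) for o in outputs])
```

Unlike indexing the last character, this raises an error if the label names ever change, instead of silently producing a wrong class id.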

# TODO ANALYSIS

### What are the advantages and drawbacks of using this model in production compared to the naive Bayes classifier we implemented in the first part of the course? And compared to a recurrent model such as an RNN or an LSTM?

# TODO ANSWER