In [1]:
!pip install datasets
!pip install transformers[sentencepiece]
!pip install evaluate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.7.0-py3-none-any.whl (451 kB)
Collecting xxhash
  Downloading xxhash-3.1.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting huggingface-hub<1.0.0,>=0.2.0
  Downloading huggingface_hub-0.11.0-py3-none-any.whl (182 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py37-none-any.whl (115 kB)
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
  Downloading urllib3-1.25.11-py2.py3-none-any.whl (127 kB)
Installing collected

In [16]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

SEED = 42
imdb_ds = load_dataset("imdb")
imdb_ds



DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [21]:
# Previous practicals showed that the imdb dataset is not shuffled, which causes
# problems during training. Shuffle every split with a fixed seed to fix that.

for ds_split_key in imdb_ds:
  imdb_ds[ds_split_key] = imdb_ds[ds_split_key].shuffle(seed=SEED)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 12500
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
    test_full: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    val: Dataset({
        features: ['text', 'label'],
        num_rows: 12500
    })
})
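
The output above lists `val` and `test_full` splits and a 12,500-row `test` split, none of which the shuffle loop creates, so the step that produced them is not shown. Below is a minimal plain-Python sketch of what such a step presumably does, assuming the original 25,000-row test split was kept as `test_full` and cut in half after a seeded shuffle. The helper name `shuffle_and_halve` is hypothetical; with the `datasets` library, `Dataset.train_test_split(test_size=0.5, seed=SEED)` would be the usual tool for this.

```python
import random

SEED = 42

def shuffle_and_halve(rows, seed):
    """Return (first_half, second_half) of a seeded shuffle of rows."""
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # same seed -> same order on every run
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]

test_full = list(range(25000))             # stand-in for the 25,000 test reviews
val, test = shuffle_and_halve(test_full, SEED)
print(len(val), len(test))                 # 12500 12500
```

Keeping `test_full` alongside the two halves means that no review is lost and the full benchmark split stays available.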

In [24]:
checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["text"], truncation=True)

# This first model will not be tested extensively, so the test split of imdb
# doubles as the evaluation set during training.

tokenized_train_ds = imdb_ds['train'].map(tokenize_function, batched=True)
tokenized_val_ds = imdb_ds['test'].map(tokenize_function, batched=True)
tokenized_train_ds



Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 25000
})

In [29]:
len(tokenized_train_ds['input_ids'][0]), len(tokenized_train_ds['text'][0])

(483, 2410)

In [31]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
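
`DataCollatorWithPadding` pads each batch to the length of its longest sequence rather than padding the whole dataset to one global length. A rough plain-Python sketch of that idea follows. `pad_batch` is a hypothetical helper, not the library's implementation; the token ids are illustrative, and the pad id of 0 matches `pad_token_id` in the DistilBERT config.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in the batch to the longest one in that batch."""
    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

padded = pad_batch([[101, 2023, 102], [101, 2023, 3185, 2001, 102]])
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch keeps short reviews from being padded up to the length of the longest review in the whole dataset.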

In [32]:
from transformers import TrainingArguments, AutoModelForSequenceClassification, Trainer
import numpy as np
import evaluate

In [33]:
training_args = TrainingArguments("model_checkpoints", num_train_epochs=1, optim='adamw_torch')
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

def compute_metrics(eval_pred, metric=evaluate.load("accuracy")):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_train_ds,
    eval_dataset=tokenized_val_ds,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,  # otherwise compute_metrics above is never used
)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'classifier
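
`compute_metrics` turns the model's raw logits into class predictions with an argmax and then scores them against the labels. Here is a small plain-Python sketch of the same computation, using made-up logits and a hypothetical helper `accuracy_from_logits`, with no `evaluate` or NumPy required:

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [row.index(max(row)) for row in logits]  # argmax per row
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

logits = [[2.1, -1.3], [0.2, 0.9], [-0.5, 1.7], [1.0, 0.4]]
labels = [0, 1, 0, 0]
print(accuracy_from_logits(logits, labels))  # 0.75
```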

In [34]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 25000
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 3125
  Number of trainable parameters = 66955010
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
500,0.4151
1000,0.3226
1500,0.3091
2000,0.2965
2500,0.2598
3000,0.229


Saving model checkpoint to test-trainer/checkpoint-500
Configuration saved in test-trainer/checkpoint-500/config.json
Model weights saved in test-trainer/checkpoint-500/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-500/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-500/special_tokens_map.json
Saving model checkpoint to test-trainer/checkpoint-1000
Configuration saved in test-trainer/checkpoint-1000/config.json
Model weights saved in test-trainer/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-1000/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to test-trainer/checkpoint-1500
Configuration saved in test-trainer/checkpoint-1500/config.json
Model weights saved in test-trainer/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-1500/tokenizer_config.json
Special tokens file saved

TrainOutput(global_step=3125, training_loss=0.3039225634765625, metrics={'train_runtime': 1205.502, 'train_samples_per_second': 20.738, 'train_steps_per_second': 2.592, 'total_flos': 3109326526331232.0, 'train_loss': 0.3039225634765625, 'epoch': 1.0})
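
The figures in the training log are consistent with each other: 25,000 examples at a batch size of 8 give 3,125 optimization steps for one epoch, and dividing by the reported runtime reproduces the throughput numbers. A quick check (the variable names are only for illustration):

```python
import math

num_examples = 25000      # "Num examples" in the log
batch_size = 8            # "Total train batch size"
num_epochs = 1
train_runtime = 1205.502  # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(total_steps)                             # 3125
print(round(num_examples / train_runtime, 3))  # 20.738 samples per second
print(round(total_steps / train_runtime, 3))   # 2.592 steps per second
```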

In [45]:
from transformers import pipeline
pipe = pipeline("text-classification", model="mvonwyl/distilbert-base-uncased-imdb", device='cuda:0')

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--mvonwyl--distilbert-base-uncased-imdb/snapshots/e78f2fa182bace5db1195f3672ebd502c9d35157/config.json
Model config DistilBertConfig {
  "_name_or_path": "mvonwyl/distilbert-base-uncased-imdb",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.24.0",
  "vocab_size": 30522
}

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--mvonwyl--distilbert-base-uncased-imdb/snapshots/e78f2fa182ba

*Evaluate the model in terms of accuracy on the test data.*

In [48]:
from evaluate import evaluator

# To measure accuracy we use Hugging Face's evaluator on the raw test split of
# the imdb dataset; the pipeline handles tokenization itself.

task_evaluator = evaluator("text-classification")
eval_results = task_evaluator.compute(
    model_or_pipeline=pipe,
    data=imdb_ds['test'],
    label_mapping={"LABEL_0": 0, "LABEL_1": 1}
)
eval_results['accuracy']

0.92888

In [49]:
eval_results

{'accuracy': 0.92888,
 'total_time_in_seconds': 138.461606373,
 'samples_per_second': 90.27773349910737,
 'latency_in_seconds': 0.01107692850984}
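
The timing fields in `eval_results` are related to one another, which also reveals how many samples were scored. Assuming `samples_per_second` is the sample count divided by `total_time_in_seconds` and `latency_in_seconds` is its inverse, the numbers above imply 12,500 evaluated reviews, matching the `test` split size shown earlier, and therefore 11,611 correct predictions:

```python
eval_results = {
    "accuracy": 0.92888,
    "total_time_in_seconds": 138.461606373,
    "samples_per_second": 90.27773349910737,
    "latency_in_seconds": 0.01107692850984,
}

# Recover the sample count from the two throughput fields
num_samples = round(
    eval_results["total_time_in_seconds"] * eval_results["samples_per_second"]
)
num_correct = round(eval_results["accuracy"] * num_samples)
print(num_samples, num_correct)   # 12500 11611
print(num_samples - num_correct)  # 889 misclassified reviews
```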

*For at least 2 samples which have been wrongly classified in the test set, try explaining why the model could have been wrong.*

In [106]:
import textwrap

# During development we observed that the pipeline fails on inputs longer than
# the model's 512-token limit, so each review has to be shortened first. Note
# that textwrap.shorten counts characters, not tokens, so this keeps at most
# max_len characters and collapses runs of whitespace.
def my_truncate(string, max_len):
  return textwrap.shorten(string, width=max_len, placeholder='')

# The pipeline outputs labels as the strings 'LABEL_1' and 'LABEL_0', whereas
# the imdb dataset stores them as the integers 1 and 0.

def convert_label(label_value):
  if label_value == 'LABEL_1':
    return 1
  return 0

misclassified_indexes = []
iteration = 0
test_ds = imdb_ds['test']

# Scan at most the first 1000 test samples and stop after 5 misclassifications
while len(misclassified_indexes) < 5 and iteration < 1000:

  # Index the row first: test_ds['text'][i] would load the whole column each time
  sample = test_ds[iteration]
  truncated_txt = my_truncate(sample['text'], 512)

  prediction = pipe(truncated_txt)
  converted_pred = convert_label(prediction[0]['label'])

  if converted_pred != sample['label']:
    misclassified_indexes.append(iteration)
  iteration += 1



In [108]:
for idx in misclassified_indexes:  # avoid shadowing the built-in id()
  sample = imdb_ds['test'][idx]
  print(sample['text'])
  print(sample['label'])

The only thing it has to offer is the interesting opposites of Tru and Jack, their choices and viewpoints, and the philosophical questions that it raises. Tru feels that she is helping people who aren't supposed to die, and Jack feels that they are supposed to die, and she is messing with fate's plan, or the universe's plan, or such-whatnot.<br /><br />But she is obviously able to change things, so there is obviously no such thing as fate in the series' metaphysics. Jack has no basis for believing that there is. And very conveniently, Tru never asks him the right questions. Nobody does. Which obviously proves that the makers of the series don't have an answer.<br /><br />There simply is no plot!<br /><br />Instead, they leave it murky in order for the series to be able to continue with it's boring girl stuff, only occasionally interrupted by Tru and Jack's racing against each other towards ends that are unknown...<br /><br />It turns out that there is nothing to any of it. A teenage po

Among the misclassified samples, errors most often occur when a review mixes strongly positive and strongly negative words. The second and third examples printed above illustrate this well. In the first of these, the review literally starts with the word "Worst" even though its overall message is to go and watch the movie. The mistake is understandable, since "worst" rarely appears in a positive context. In the second, the positive first half of the review only covers the director's previous works and the reviewer's appreciation of them; the reviewer's opinion of the actual movie comes later. A further possible cause is that the sample was truncated for inference, in which case the part of the review carrying the actual verdict may have been lost.
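
The truncation mentioned above is more severe than it may look. `textwrap.shorten` measures width in characters, so `my_truncate(text, 512)` keeps far less text than a 512-token budget would allow. The standard-library sketch below shows the effect on a made-up review; whitespace-separated words stand in for tokens here, which is only a rough proxy for the tokenizer's subword count.

```python
import textwrap

def my_truncate(string, max_len):
    return textwrap.shorten(string, width=max_len, placeholder='')

# Made-up 400-word review: positive first half, negative second half
review = " ".join(["wonderful"] * 200 + ["terrible"] * 200)
shortened = my_truncate(review, 512)

print(len(review.split()), len(shortened.split()))  # words before and after
print(len(shortened) <= 512)                        # the limit is in characters
print("terrible" in shortened)                      # is the negative half still there?
```

In this toy case the negative second half of the review is dropped entirely. Token-level truncation, as `tokenize_function` does with `truncation=True`, would keep much more of each review.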

*What are the advantages and disadvantages of using this model in production compared to the naive Bayes we implemented in the first part of the course?*

Overall, we cannot ignore the transformer's overwhelming advantage in accuracy over the naive Bayes approach. If I remember correctly, naive Bayes reached an accuracy of at most 70 percent, which pales in comparison to the roughly 93 percent obtained with the DistilBERT model. However, depending on the devices targeted for inference (e.g. lighter devices like phones vs. machines with more powerful hardware), the transformer may be too costly to run locally, or may need workarounds such as offloading inference to the cloud. Furthermore, if model interpretability matters more than accuracy for our solution, the naive Bayes approach could be preferred.
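
To put the deployment cost in numbers: the training log reports 66,955,010 trainable parameters and the model config lists `float32` weights, so the weights alone take roughly 268 MB before any activations or runtime overhead. A back-of-the-envelope check (4 bytes per float32 parameter; the variable names are illustrative):

```python
num_parameters = 66_955_010  # "Number of trainable parameters" in the training log
bytes_per_param = 4          # float32, per "torch_dtype" in the model config

size_bytes = num_parameters * bytes_per_param
size_mb = size_bytes / 1e6   # decimal megabytes
print(f"{size_mb:.0f} MB")   # 268 MB
```

A naive Bayes classifier only needs to store per-word statistics, which is typically far smaller, although its exact size depends on the vocabulary.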