### Read in dataset
The IMDb dataset, available from https://huggingface.co/datasets/imdb, consists of 50,000 movie reviews labelled as positive or negative, split evenly into 25,000 training and 25,000 test examples.
The dataset is loaded using the 🤗 Datasets library.

In [1]:
#Read in imdb dataset using datasets
from datasets import load_dataset
imdb_data = load_dataset("imdb")

Downloading and preparing dataset imdb/plain_text (download: 80.23 MiB, generated: 127.02 MiB, post-processed: Unknown size, total: 207.25 MiB) to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1...


Splits generated: train (25,000 examples), test (25,000 examples), unsupervised (50,000 examples).

Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1. Subsequent calls will reuse this data.



### Tokenize reviews
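Reviews are tokenized using the 🤗 Transformers AutoTokenizer. Passing the distilbert-base-uncased checkpoint name loads the tokenizer that was used when that model was pretrained. Using the same tokenizer matters because the model's embeddings are tied to that tokenizer's vocabulary; a different tokenizer would map text to token ids the model was never trained on.

The tokenization is applied to all examples in the imdb dataset using the map method. Because batch_size=None, each split is processed as a single batch, so every review in a split is padded to the length of the longest review in that split (capped at 512 tokens by truncation).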
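
To make the effect of padding and truncation concrete, here is a minimal pure-Python sketch. It is not the tokenizer's actual implementation: the token ids, PAD_ID, MAX_LENGTH and the helper pad_and_truncate are hypothetical stand-ins, and the real tokenizer also takes care of special tokens when it truncates, which this sketch ignores by simply slicing.

```python
# Illustrative only: toy token ids, not real DistilBERT vocabulary ids.
PAD_ID = 0       # assumed pad id for this sketch
MAX_LENGTH = 6   # stand-in for the model's 512-token limit


def pad_and_truncate(batch, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Truncate each sequence to max_length, then pad to the longest one left."""
    truncated = [seq[:max_length] for seq in batch]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = [[101, 7, 8, 102], [101, 5, 6, 7, 8, 9, 10, 102], [101, 102]]
encoded = pad_and_truncate(toy_batch)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The second toy review is cut down to six tokens, and the other two are padded up to that length, so all rows end up the same size and can be stacked into a single tensor.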

In [2]:
from transformers import AutoTokenizer

#Instantiate tokenizer instance from distilbert-base-uncased checkpoint
model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

#Define a function to tokenize a batch of texts.
#padding=True pads each review with zeros (the pad token id) to the length of the longest review in the batch
#truncation=True truncates any review longer than the maximum input length of distilbert-base-uncased (512 tokens)
#to that maximum length
def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)

#Apply tokenize function to all examples using map
imdb_encoded = imdb_data.map(tokenize, batched=True, batch_size=None)


### Load in model
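The distilbert-base-uncased model is loaded using AutoModelForSequenceClassification, which adds a classification head on top of the pretrained model. The number of classes the head predicts is set by the num_labels argument.
The classification head is a small neural network. Because it is differentiable, it can be trained together with the pretrained layers during fine-tuning. Its weights are newly initialized rather than loaded from the checkpoint, so the model must be fine-tuned before its predictions are meaningful.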
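
As a rough illustration of what such a head computes, the sketch below runs a toy sentence vector through two small linear layers and a softmax in plain Python. The layer names pre_classifier and classifier are borrowed from the loading warning; the exact layer structure, the toy sizes, the random weights and the helper functions are assumptions made for illustration, not the library's implementation.

```python
import math
import random

random.seed(0)

HIDDEN_SIZE = 4   # toy size; the real model uses a much larger hidden dimension
NUM_LABELS = 2    # negative / positive


def linear(x, weight, bias):
    """Compute weight @ x + bias for a single input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Randomly initialised parameters, standing in for the "newly initialized" head weights
pre_classifier_w = random_matrix(HIDDEN_SIZE, HIDDEN_SIZE)
pre_classifier_b = [0.0] * HIDDEN_SIZE
classifier_w = random_matrix(NUM_LABELS, HIDDEN_SIZE)
classifier_b = [0.0] * NUM_LABELS


def classification_head(sentence_vector):
    """Map one sentence representation to one logit per class."""
    hidden = relu(linear(sentence_vector, pre_classifier_w, pre_classifier_b))
    return linear(hidden, classifier_w, classifier_b)


logits = classification_head([0.2, -0.1, 0.4, 0.3])
probs = softmax(logits)
print(logits, probs)
```

Every operation here is differentiable almost everywhere, which is what lets gradients flow from the classification loss back through the head and into the pretrained layers.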

In [3]:
#Define device to run on GPU if available, else use CPU
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

#Load in model with a binary classification head
from transformers import AutoModelForSequenceClassification
num_labels = 2
model = (AutoModelForSequenceClassification
         .from_pretrained(model_checkpoint, num_labels=num_labels)
         .to(device))

Downloading:   0%|          | 0.00/256M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'classifier

### Finetune model

The model is fine-tuned using the Trainer API. Accuracy and weighted F1 score are computed on the test split at the end of each epoch.

In [4]:
#Define function to compute model performance
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}
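
For readers who want to see what these two metrics compute, here is a small pure-Python sketch that mirrors the argmax, accuracy and weighted F1 steps on a handful of made-up logits. The helper names and the toy data are hypothetical; scikit-learn remains the implementation used in this notebook.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(labels, preds):
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


def weighted_f1(labels, preds):
    """Per-class F1, averaged with weights equal to each class's share of true labels."""
    total = 0.0
    for cls in sorted(set(labels)):
        tp = sum(l == cls and p == cls for l, p in zip(labels, preds))
        fp = sum(l != cls and p == cls for l, p in zip(labels, preds))
        fn = sum(l == cls and p != cls for l, p in zip(labels, preds))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = sum(l == cls for l in labels)
        total += f1 * support / len(labels)
    return total


# Hypothetical logits for five reviews (columns: negative, positive)
toy_logits = [[2.0, -1.0], [0.3, 0.9], [-0.5, 1.5], [1.2, 0.1], [0.4, 0.6]]
toy_labels = [0, 1, 1, 1, 0]

toy_preds = [argmax(row) for row in toy_logits]
print(toy_preds)
print(accuracy(toy_labels, toy_preds))
print(weighted_f1(toy_labels, toy_preds))
```

On a balanced dataset such as IMDb, weighted F1 and accuracy tend to land close together, which matches the near-identical columns in the training results.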

In [5]:
#Define training parameters
from transformers import Trainer, TrainingArguments

batch_size = 16
logging_steps = len(imdb_encoded["train"]) // batch_size
model_name = f"{model_checkpoint}-finetuned-imdb"
training_args = TrainingArguments(output_dir=model_name,
                                  num_train_epochs=2,
                                  learning_rate=2e-5,
                                  per_device_train_batch_size=batch_size,
                                  per_device_eval_batch_size=batch_size,
                                  weight_decay=0.01,
                                  evaluation_strategy="epoch",
                                  disable_tqdm=False,
                                  logging_steps=logging_steps,
                                  push_to_hub=False,
                                  log_level="error")

trainer = Trainer(model=model, args=training_args,
                  compute_metrics=compute_metrics,
                  train_dataset=imdb_encoded["train"],
                  eval_dataset=imdb_encoded["test"],
                  tokenizer=tokenizer)
trainer.train();
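
The logging_steps value above is chosen so that the training loss is logged roughly once per epoch. The arithmetic can be checked with a short sketch, assuming a single device and no gradient accumulation (both assumptions, since the notebook does not state the hardware setup):

```python
import math

train_examples = 25_000   # the IMDb train split has 25,000 reviews
batch_size = 16
num_train_epochs = 2

logging_steps = train_examples // batch_size              # value passed to TrainingArguments
steps_per_epoch = math.ceil(train_examples / batch_size)  # last batch is smaller but still counts
total_steps = steps_per_epoch * num_train_epochs

print(logging_steps, steps_per_epoch, total_steps)
```

With these numbers the loss is logged every 1,562 optimizer steps, one step short of a full epoch of 1,563 steps, so each epoch produces one training-loss entry.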



| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.2653 | 0.190673 | 0.92648 | 0.926456 |
| 2 | 0.1489 | 0.230375 | 0.9308 | 0.930794 |
