<table align="center" style="text-align: center; border: hidden;">
    <td align="center">
        <a target="_blank" href="https://colab.research.google.com/github/SMMousaviSP/huggingface_transformers_tutorial/blob/master/transformers_and_datasets.ipynb">
            <img src="./images/colab.png" height="60px" style="padding-bottom: 5px; height: 60px;" />
            <br>
            Run in Google Colab
        </a>
    </td>
    <td align="center" style="padding-left: 20px;">
        <a target="_blank" href="https://github.com/SMMousaviSP/huggingface_transformers_tutorial">
            <img src="./images/github.png" height="60px" style="padding-bottom: 5px; height: 60px;" />
            <br>
            View Source on GitHub
        </a>
    </td>
</table>

# Fine-Tuning Hugging Face Transformers for Text Classification
This tutorial is based on [Hugging Face course](https://huggingface.co/course).

[You can see the YouTube video recorded for this tutorial (Persian)](https://youtu.be/taAURowmzks)

---

### [Available transformer models on Hugging Face](https://huggingface.co/models)

`distilbert-base-uncased` is recommended, since it's faster than `bert-base-uncased` and offers a good performance. Also it was pretrained with the same corpus as BERT. This model is aimed at being **fine-tuned** for NLP tasks such as text classification, token classification, and question answering, for text generation you should go for models such as `gpt2`. [More information about this model is available here.](https://huggingface.co/distilbert-base-uncased)

---

### [Available datasets on Hugging Face](https://huggingface.co/datasets)

`sst2` from `glue` benchmark is used on this tutorial. It consists of sentences from movie reviews and human annotations of their sentiment. The task is to predict the sentiment of a given sentence. It uses the two-way (positive/negative) class split, with only sentence-level labels.

## Installing transformers, datasets and numpy library

This notebook has been tested on Google Colab, if you want to run it on your own system, you may have to install other libraries as well, also you need to install CUDA, if you wish to train your model faster with GPU.

In [1]:
!pip install transformers datasets numpy > /dev/null

## Checking the GPU you are currently using
Currently, Google gives a 12 GB K80 to free users, you can use it in a 12 hour session. If you need faster GPUs or more memory, you can buy Colab Pro subscription (10 USD / Month). Colab Pro gives you a T4 or P100 for a 24 hour session.

In [2]:
!nvidia-smi -L

GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-949c7f9a-f79f-4726-e483-4a5aebd40d69)


## Importing required libraries and modules

In [3]:
import numpy as np

from datasets import (
    load_dataset,
    load_metric,
    DatasetDict,
    Dataset,
)

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
    set_seed,
)


## Setting Constants
We're going to use `SEED`, for shuffling our dataset, transformer models and everywhere we use something that is done randomly so we get the same result each time. `CHECKPOINT` is used to when we load the model and the tokenizer of the model, you can simply change the `CHECKPOINT` to a similiar model and run this whole notebook and it will work; for example you can change it to `bert-base-uncased`.

In [4]:
SEED = 1000
CHECKPOINT = "distilbert-base-uncased"

## Loading `sst2` Dataset from `glue` Benchmark

In [5]:
sst2_datasets = load_dataset("glue", "sst2")

Downloading:   0%|          | 0.00/7.78k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/4.47k [00:00<?, ?B/s]

Downloading and preparing dataset glue/sst2 (download: 7.09 MiB, generated: 4.81 MiB, post-processed: Unknown size, total: 11.90 MiB) to /root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...


Downloading:   0%|          | 0.00/7.44M [00:00<?, ?B/s]

0 examples [00:00, ? examples/s]

0 examples [00:00, ? examples/s]

0 examples [00:00, ? examples/s]

Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

## Creating a Smaller Version of `sst2` Dataset


In [6]:
select_example = sst2_datasets["train"].select(range(300))

custom_sst2 = DatasetDict({
    "train": sst2_datasets["train"].shuffle(seed=SEED).select(range(1000)).flatten_indices(),
    "validation": sst2_datasets["validation"],
    "test": sst2_datasets["train"].shuffle(seed=SEED).select(range(300, 400)).flatten_indices()
})

  0%|          | 0/1 [00:00<?, ?ba/s]

Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-70a6ee4383153c3b.arrow


  0%|          | 0/1 [00:00<?, ?ba/s]

## Defining F1 Compute Metric

In [7]:
def compute_metrics(eval_preds):
    metric = load_metric("f1")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Set Transformers Seed
If we don't set the seed, the first time we train a model, the transformers library is going to set a seed itself. [More information about this.](https://discuss.huggingface.co/t/multiple-training-will-give-exactly-the-same-result-except-for-the-first-time/8493?u=smmousavi)

In [8]:
set_seed(SEED)

## Loading the Model and the Tokenizer and Tokenizing the Dataset

In [9]:
num_labels = len(custom_sst2["train"].unique('label'))

model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=num_labels)
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)

def tokenize_function(example):
    return tokenizer(example["sentence"], truncation=True)

tokenized_datasets = custom_sst2.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Downloading:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/256M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

## Setting Training Arguments and Creating a Trainer Object

In [10]:
saving_folder = "custom_sst2_distilbert"
training_args = TrainingArguments(
    saving_folder,
    load_best_model_at_end=True,
    num_train_epochs=2,
    evaluation_strategy="steps",
    eval_steps=100,
    save_steps=100,
    metric_for_best_model="f1",
    save_total_limit=10,
)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

## Training the Model

In [11]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running training *****
  Num examples = 1000
  Num Epochs = 2
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 250


Step,Training Loss,Validation Loss,F1
100,No log,0.415124,0.807931
200,No log,0.502109,0.859669


The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
  Num examples = 872
  Batch size = 8


Downloading:   0%|          | 0.00/2.07k [00:00<?, ?B/s]

Saving model checkpoint to custom_sst2_distilbert/checkpoint-100
Configuration saved in custom_sst2_distilbert/checkpoint-100/config.json
Model weights saved in custom_sst2_distilbert/checkpoint-100/pytorch_model.bin
tokenizer config file saved in custom_sst2_distilbert/checkpoint-100/tokenizer_config.json
Special tokens file saved in custom_sst2_distilbert/checkpoint-100/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
  Num examples = 872
  Batch size = 8
Saving model checkpoint to custom_sst2_distilbert/checkpoint-200
Configuration saved in custom_sst2_distilbert/checkpoint-200/config.json
Model weights saved in custom_sst2_distilbert/checkpoint-200/pytorch_model.bin
tokenizer config file saved in custom_sst2_distilbert/checkpoint-200/tokenizer_config.json
Special tokens file saved in custom_sst2_distilbert/chec

TrainOutput(global_step=250, training_loss=0.32484719848632815, metrics={'train_runtime': 18.3626, 'train_samples_per_second': 108.917, 'train_steps_per_second': 13.615, 'total_flos': 15366218244096.0, 'train_loss': 0.32484719848632815, 'epoch': 2.0})

## Testing the Model on `test` Dataset

In [12]:
predictions = trainer.predict(tokenized_datasets["test"])

The following columns in the test set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Prediction *****
  Num examples = 100
  Batch size = 8


## Getting the Predicted Label from the Logits (Raw Predictions)

In [13]:
preds = np.argmax(predictions.predictions, axis=-1)

## Calculating the Probability of Each Class with Softmax

In [14]:
import torch

softmax = torch.nn.Softmax(dim=-1)
predictions_tensor = torch.tensor(predictions.predictions)
probability = softmax(predictions_tensor)
probability

tensor([[0.2808, 0.7192],
        [0.0143, 0.9857],
        [0.0038, 0.9962],
        [0.9858, 0.0142],
        [0.0035, 0.9965],
        [0.9848, 0.0152],
        [0.9841, 0.0159],
        [0.9863, 0.0137],
        [0.9832, 0.0168],
        [0.9849, 0.0151],
        [0.9789, 0.0211],
        [0.9718, 0.0282],
        [0.9830, 0.0170],
        [0.9824, 0.0176],
        [0.0432, 0.9568],
        [0.9667, 0.0333],
        [0.0061, 0.9939],
        [0.9847, 0.0153],
        [0.0041, 0.9959],
        [0.9811, 0.0189],
        [0.0056, 0.9944],
        [0.0044, 0.9956],
        [0.0042, 0.9958],
        [0.9837, 0.0163],
        [0.9843, 0.0157],
        [0.9860, 0.0140],
        [0.9758, 0.0242],
        [0.9778, 0.0222],
        [0.0036, 0.9964],
        [0.0038, 0.9962],
        [0.9853, 0.0147],
        [0.9809, 0.0191],
        [0.0035, 0.9965],
        [0.9841, 0.0159],
        [0.9834, 0.0166],
        [0.0070, 0.9930],
        [0.0036, 0.9964],
        [0.0038, 0.9962],
        [0.0