## Importing Libraries

Instead of training a model from scratch, we'll fine-tune an existing pretrained model (DistilBERT).

In [2]:
!pip install torch
!pip install datasets transformers
!pip install scikit-learn


Since we are working with yes/no questions, a model that picks an answer at random would be right about 50% of the time. Our goal is to train a model that clearly beats this baseline, so we aim for an accuracy above 50%.
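As a rough illustration of the baseline idea, the sketch below computes the expected accuracy of random guessing and, as a related sanity check, the accuracy of always predicting the most frequent answer. The function names and the toy labels are made up for this example and are not taken from BoolQ.

```python
from collections import Counter
from typing import List


def random_guess_accuracy(num_classes: int) -> float:
    """Expected accuracy of uniform random guessing over num_classes labels."""
    return 1.0 / num_classes


def majority_class_accuracy(labels: List[int]) -> float:
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical labels for illustration only (1 = yes, 0 = no).
toy_labels = [1, 1, 1, 0, 1, 0, 1, 0]

print(random_guess_accuracy(2))             # 0.5
print(majority_class_accuracy(toy_labels))  # 0.625
```

If the answers in a dataset are not evenly split, the majority-class figure is the more demanding of the two baselines to beat.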

In [3]:
from datasets import load_dataset, load_metric
from transformers import DistilBertTokenizerFast
from transformers import AutoModelForSequenceClassification, DataCollatorWithPadding
from transformers import Trainer, TrainingArguments

DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark.

In [4]:
checkpoint = "distilbert-base-uncased"

## Importing the Dataset

We load the dataset directly from the Hugging Face Hub with the following command. The dataset name is 'boolq'.

In [5]:
dataset = load_dataset("boolq")


In [6]:
dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'passage'],
        num_rows: 9427
    })
    validation: Dataset({
        features: ['question', 'answer', 'passage'],
        num_rows: 3270
    })
})

## Data Preprocessing

To begin, we first need to create a text tokenizer.

For this, we'll use one of the pretrained tokenizers that ship with the transformers library.

In [7]:
tokenizer = DistilBertTokenizerFast.from_pretrained(checkpoint)


DistilBertTokenizerFast runs end-to-end tokenization: punctuation splitting followed by WordPiece.
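To make the two stages concrete, here is a minimal pure-Python sketch of punctuation splitting followed by greedy longest-match-first WordPiece. It is a simplification, not the library's implementation: the vocabulary is a tiny made-up set, the helper names are hypothetical, and the real tokenizer also performs further normalization and adds special tokens such as [CLS] and [SEP].

```python
import string
from typing import List, Set


def split_punctuation(text: str) -> List[str]:
    """Lowercase, split on whitespace, and isolate ASCII punctuation marks."""
    words, current = [], ""
    for ch in text.lower():
        if ch.isspace() or ch in string.punctuation:
            if current:
                words.append(current)
            current = ""
            if not ch.isspace():
                words.append(ch)  # punctuation becomes its own token
        else:
            current += ch
    if current:
        words.append(current)
    return words


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are prefixed
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for this example.
toy_vocab = {"is", "the", "sky", "blue", "play", "##ing", "?"}
tokens = [p for w in split_punctuation("Is the sky playing?") for p in wordpiece(w, toy_vocab)]
print(tokens)  # ['is', 'the', 'sky', 'play', '##ing', '?']
```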

In [8]:
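def tokenize_function(example):
    # Encode each question/passage pair as one sequence, truncating inputs that are too long for the model.
    encoded = tokenizer(example["question"], example["passage"], truncation=True)
    # Convert the boolean answers (True/False) to integer class labels (1/0).
    encoded["labels"] = [int(a) for a in example["answer"]]
    return encoded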

In [9]:
tokenized_datasets = dataset.map(tokenize_function, batched=True)


Additionally, we need to define a data collator, which builds batches from examples that do not all have the same length by padding them to a common length.

Data collators are objects that will form a batch by using a list of dataset elements as input. These elements are of the same type as the elements of train_dataset or eval_dataset.

To be able to build batches, data collators may apply some processing (like padding). Some of them (like DataCollatorForLanguageModeling) also apply some random data augmentation (like random masking) on the formed batch.
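The padding step can be sketched in a few lines of plain Python. This is only an illustration of the idea of padding each batch to its longest example, not the code of DataCollatorWithPadding; the function name `pad_batch` and the token ids are made up for the example.

```python
from typing import Dict, List


def pad_batch(features: List[Dict[str, List[int]]], pad_token_id: int = 0) -> Dict[str, List[List[int]]]:
    """Pad every example to the length of the longest example in this batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore.
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
    return batch


# Illustrative token ids, not the output of a real tokenizer call.
toy_features = [
    {"input_ids": [101, 2003, 102]},
    {"input_ids": [101, 2003, 1996, 3712, 102]},
]
batch = pad_batch(toy_features)
print(batch["input_ids"])       # [[101, 2003, 102, 0, 0], [101, 2003, 1996, 3712, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Padding per batch rather than to one global maximum keeps short batches short, which is the usual motivation for doing the padding in the collator instead of at tokenization time.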

In [10]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

## Model definition and training

AutoModelForSequenceClassification loads the pretrained model with a sequence classification/regression head on top (a linear layer on top of the pooled output).

With this we can directly download the weights of the model we want to fine-tune.
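As a rough picture of what "a linear layer on top of the pooled output" computes, the sketch below maps a pooled vector to one logit per label and turns the logits into probabilities. The numbers, the tiny hidden size of 3, and the function names are invented for illustration; the actual head in the library is a trained PyTorch module and may contain additional layers.

```python
import math
from typing import List


def linear_head(pooled: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """One logit per label: logits[k] = dot(weights[k], pooled) + bias[k]."""
    return [sum(w * x for w, x in zip(row, pooled)) + b for row, b in zip(weights, bias)]


def softmax(logits: List[float]) -> List[float]:
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical pooled representation and head parameters (num_labels=2).
pooled = [0.5, -1.0, 2.0]
weights = [[1.0, 0.0, 0.0],   # row for label 0 ("no")
           [0.0, 1.0, 1.0]]   # row for label 1 ("yes")
bias = [0.0, 0.5]

logits = linear_head(pooled, weights, bias)
probs = softmax(logits)
print(logits)  # [0.5, 1.5]
```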

In [11]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'pre_classifi

## Training Arguments

The Trainer class provides an API for feature-complete training in PyTorch for most standard use cases.

Before instantiating the Trainer, we create a TrainingArguments object, which exposes all the points of customization during training.

per_device_train_batch_size: The batch size per GPU/TPU core/CPU for training.

learning_rate: The initial learning rate for the AdamW optimizer.


In [12]:
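# Fine-tuning a pretrained Transformer usually calls for a small learning rate; 2e-5 is a common starting point.
args = TrainingArguments("distilbert-boolq", per_device_train_batch_size=16, learning_rate=2e-5, num_train_epochs=3)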



Now we put all the objects we defined earlier together into an instance of the Trainer class.

In [13]:
trainer = Trainer(model, args, train_dataset=tokenized_datasets["train"],
                  eval_dataset=tokenized_datasets["validation"],
                  data_collator=data_collator, tokenizer=tokenizer)

In [None]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: passage, question, answer. If passage, question, answer are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 9427
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 1770
  Number of trainable parameters = 66955010
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
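The "Total optimization steps = 1770" line in the log above can be reproduced by hand from the other logged values. The sketch below assumes a single device and that the last, partially filled batch of each epoch is kept; the helper name is hypothetical.

```python
import math


def total_optimization_steps(num_examples: int, batch_size: int, num_epochs: int,
                             grad_accum_steps: int = 1) -> int:
    """Number of optimizer updates over the whole training run."""
    # Batches per epoch; the final partial batch still counts as one batch.
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # One optimizer update every grad_accum_steps batches (at least one per epoch).
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return updates_per_epoch * num_epochs


# Values taken from the training log: 9427 examples, batch size 16, 3 epochs.
print(total_optimization_steps(9427, 16, 3))  # 1770
```

Here 9427 / 16 = 589.1875, which rounds up to 590 batches per epoch, and 590 × 3 = 1770.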




## Evaluation

#### Generating Predictions

In [None]:
predictions = trainer.predict(tokenized_datasets["validation"])
y_pred = predictions.predictions.argmax(-1)
labels = predictions.label_ids

#### Loading Accuracy Metric

In [None]:
metric = load_metric("accuracy")

#### Generating Performance Score

In [None]:
metric.compute(predictions=y_pred, references=labels)
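For intuition, the evaluation steps above amount to taking the highest-scoring class for each example and counting how often it matches the label. The following is a minimal pure-Python sketch of that computation with made-up logits and labels; it is not the metric's implementation, which returns its result as a dictionary keyed by "accuracy".

```python
from typing import List, Sequence


def argmax_last_axis(logits: Sequence[Sequence[float]]) -> List[int]:
    """Index of the largest logit in each row (the predicted class)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy(predictions: Sequence[int], references: Sequence[int]) -> float:
    """Fraction of predictions that equal the reference labels."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical model outputs for four examples (columns: label 0, label 1).
toy_logits = [[0.2, 1.3], [2.0, -0.5], [0.1, 0.4], [1.1, 0.9]]
toy_references = [1, 0, 0, 0]

toy_predictions = argmax_last_axis(toy_logits)
print(toy_predictions)                            # [1, 0, 1, 0]
print(accuracy(toy_predictions, toy_references))  # 0.75
```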