# Effortless NLP using HuggingFace's Transformers Ecosystem

![Image](https://raw.githubusercontent.com/RajkumarGalaxy/dataset/master/Images/z0002.jpg)

> Image by [Author](https://raw.githubusercontent.com/RajkumarGalaxy/dataset/master/Images/z0002.jpg)
### How to fine-tune a BERT model on a custom dataset using PyTorch?

#### ------------------------------------------------ 
#### *Articles So Far In This Series*
#### -> [[NLP Tutorial] Finish Tasks in Two Lines of Code](https://www.kaggle.com/rajkumarl/nlp-tutorial-finish-tasks-in-two-lines-of-code)
#### -> [[NLP Tutorial] Unwrapping Transformers Pipeline](https://www.kaggle.com/rajkumarl/nlp-unwrapping-transformers-pipeline)
#### -> [[NLP Tutorial] Exploring Tokenizers](https://www.kaggle.com/rajkumarl/nlp-tutorial-exploring-tokenizers)
#### -> [[NLP Tutorial] Fine-Tuning in TensorFlow](https://www.kaggle.com/rajkumarl/nlp-tutorial-fine-tuning-in-tensorflow) 
#### -> [[NLP Tutorial] Fine-Tuning in PyTorch](https://www.kaggle.com/rajkumarl/nlp-tutorial-fine-tuning-in-pytorch) 
#### -> [[NLP Tutorial] Fine-Tuning with Trainer API](https://www.kaggle.com/rajkumarl/nlp-tutorial-fine-tuning-with-trainer-api) 
#### ------------------------------------------------ 

# Prepare Environment and Data

In this article we fine-tune a BERT model on the well-known WNLI dataset with a hand-written PyTorch training loop rather than the Trainer API. A GPU makes training and inference much faster, but the code also runs on a CPU.

In [1]:
# upgrade transformers and datasets to latest versions
!pip install --upgrade transformers
!pip install --upgrade datasets
import transformers
import datasets
print(transformers.__version__)
print(datasets.__version__)

Collecting transformers
  Downloading transformers-4.12.5-py3-none-any.whl (3.1 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.1.2-py3-none-any.whl (59 kB)
Installing collected packages: huggingface-hub, transformers
  Attempting uninstall: huggingface-hub
    Found existing installation: huggingface-hub 0.0.19
    Uninstalling huggingface-hub-0.0.19:
      Successfully uninstalled huggingface-hub-0.0.19
  Attempting uninstall: transformers
    Found existing installation: transformers 4.5.1
    Uninstalling transformers-4.5.1:
      Successfully uninstalled transformers-4.5.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datasets 1.14.0 requires huggingface-hub<0.1.0,>=0.0.19, but you ha

In [2]:
# Make necessary imports

# for array operations 
import numpy as np 
# PyTorch framework
import torch
# plotting
from matplotlib import pyplot as plt
# reproducibility
import random
# to watch progress
from tqdm.auto import tqdm

# HuggingFace ecosystem
# tokenizer
from transformers import AutoTokenizer, DataCollatorWithPadding
# model
from transformers import AutoModelForSequenceClassification
# optimizer, lr-scheduler
from transformers import AdamW, get_scheduler
# dataset
from datasets import load_dataset, load_metric

# a seed for reproducibility
SEED = 42
# set seed
np.random.seed(SEED)
torch.manual_seed(SEED)
random.seed(SEED)

# check for GPU device
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print('Device available:', device) 

Device available: cuda:0


Load the WNLI dataset from the GLUE benchmark.

In [3]:
raw_data = load_dataset("glue", "wnli")


Downloading and preparing dataset glue/wnli (download: 28.32 KiB, generated: 154.03 KiB, post-processed: Unknown size, total: 182.35 KiB) to /root/.cache/huggingface/datasets/glue/wnli/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...


Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/wnli/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.



In [4]:
# what does it look like?
raw_data

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 635
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 71
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 146
    })
})

In [5]:
# look at one example
raw_data["train"][0]

{'sentence1': 'I stuck a pin through a carrot. When I pulled the pin out, it had a hole.',
 'sentence2': 'The carrot had a hole.',
 'label': 1,
 'idx': 0}

Each data point contains a pair of sentences, an index and a label. What labels are there, and which integer stands for which class?

In [6]:
# what features are there in data?
# What are the label names?
raw_data["train"].features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=2, names=['not_entailment', 'entailment'], names_file=None, id=None),
 'idx': Value(dtype='int32', id=None)}

The features show that this is a supervised *sentence-pair entailment classification* task with 2 classes: `not_entailment` [0] and `entailment` [1].
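
As a small illustration, the integer labels can be mapped back to their names with a plain lookup. The `LABEL_NAMES` list is copied from the `ClassLabel` feature printed above; the helper `label_name` is a hypothetical name introduced here for the sketch, not part of the `datasets` API.

```python
# Label names in index order, copied from the ClassLabel feature above.
LABEL_NAMES = ["not_entailment", "entailment"]


def label_name(label_id: int) -> str:
    """Return the class name for an integer label (hypothetical helper)."""
    if not 0 <= label_id < len(LABEL_NAMES):
        raise ValueError(f"unknown label id: {label_id}")
    return LABEL_NAMES[label_id]


# The first training example shown earlier carries label 1.
print(label_name(1))  # entailment
```

Rejecting out-of-range ids is a deliberate choice in this sketch, so that an unexpected label value is reported instead of being silently mapped to a class.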

# Tokenizer and Data Collator

We fine-tune the pre-trained `bert-base-uncased` checkpoint. The tokenizer converts each sentence pair into model inputs, and the data collator pads every batch only up to the longest sequence in that batch, which keeps memory usage low and data handling quick during training.

In [7]:
checkpoint = 'bert-base-uncased'
# bert tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# data collator for dynamic padding as per batch
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


In [8]:
# define a tokenize function
def tokenize_function(example):
    # encode each sentence pair together; truncate pairs that exceed the model limit
    return tokenizer(example['sentence1'], example['sentence2'], truncation=True)

In [9]:
# tokenize entire data
tokenized_data = raw_data.map(tokenize_function, batched=True)


Remove the columns the model does not need.
Rename the column `label` to `labels`, the argument name the model's forward pass expects.
What columns are there in the tokenized data?

In [10]:
tokenized_data = tokenized_data.remove_columns(['idx','sentence1','sentence2'])
tokenized_data = tokenized_data.rename_column('label','labels')
tokenized_data.set_format('pt')
tokenized_data["train"].column_names

['attention_mask', 'input_ids', 'labels', 'token_type_ids']

`attention_mask`, `input_ids` and `token_type_ids` are the model inputs, and `labels` is the target. The removed columns (`idx`, `sentence1`, `sentence2`) hold raw text or bookkeeping values that the model cannot consume.
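
To make the three input features more concrete, here is a minimal sketch of how a sentence pair is laid out for BERT. It assumes plain whitespace splitting as a stand-in for the real WordPiece tokenizer and keeps tokens as strings instead of integer `input_ids`; `encode_pair` is a hypothetical helper written for this illustration.

```python
def encode_pair(sentence1: str, sentence2: str) -> dict:
    """Lay out a sentence pair in BERT's pair format (whitespace tokens only)."""
    tokens_a = sentence1.lower().split()
    tokens_b = sentence2.lower().split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence 1 and its [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # No padding has been added yet, so every position is a real token.
    attention_mask = [1] * len(tokens)
    return {
        "tokens": tokens,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


example = encode_pair("I pulled the pin out", "The carrot had a hole")
print(example["tokens"])
print(example["token_type_ids"])
```

The real tokenizer additionally maps every token to an id from its vocabulary, which is what ends up in `input_ids`.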

Prepare the DataLoaders, which batch the examples and pad each batch dynamically through the data collator.

In [11]:
train_data = torch.utils.data.DataLoader(tokenized_data["train"],
                                         shuffle=True,
                                         batch_size=8,
                                         collate_fn=data_collator
                                        )
val_data = torch.utils.data.DataLoader(tokenized_data["validation"],
                                       batch_size=8,
                                       collate_fn=data_collator
                                      )
test_data = torch.utils.data.DataLoader(tokenized_data["test"],
                                        batch_size=8,
                                        collate_fn=data_collator
                                       )

In [12]:
# check that preprocessing produced the expected shapes
for batch in train_data:
    for k, v in batch.items():
        print('{:>20} : {}'.format(k, v.shape))
    break

      attention_mask : torch.Size([8, 49])
           input_ids : torch.Size([8, 49])
              labels : torch.Size([8])
      token_type_ids : torch.Size([8, 49])


Data is ready now for efficient data loading and faster training.
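
The sequence length of 49 in the shapes above is specific to that batch. A minimal pure-Python sketch of the idea behind dynamic padding follows; `pad_batch` is a hypothetical helper, and the token ids are illustrative values, not real tokenizer output.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest length found in this batch only."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two batches with different longest sequences get different padded widths.
batch_a = pad_batch([[101, 7, 8, 102], [101, 7, 102]])
batch_b = pad_batch([[101, 7, 8, 9, 10, 11, 102], [101, 102]])
print(len(batch_a["input_ids"][0]), len(batch_b["input_ids"][0]))  # 4 7
```

Padding per batch avoids spending compute on padding tokens that a single global maximum length would add to every short batch.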

# Model Fine-tuning

Load a BERT model from the checkpoint

In [13]:
# load a pre-trained BERT model with a new two-class classification head
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Check for data and model compatibility by passing a sample batch of data.

In [14]:
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)

tensor(0.6845, grad_fn=<NllLossBackward>) torch.Size([8, 2])


Data and model are ready for fine-tuning. Now, we need to define an optimizer and a learning rate scheduler.

In [15]:
EPOCHS = 3
NUM_TRAINING_STEPS = EPOCHS * len(train_data)
print(NUM_TRAINING_STEPS)

optimizer = AdamW(model.parameters(), lr=5e-5)
lr_scheduler = get_scheduler("linear",
                             optimizer=optimizer,
                             num_warmup_steps=0,
                             num_training_steps=NUM_TRAINING_STEPS
                            )

240
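
The step count and the shape of the schedule can be reproduced with a few lines of arithmetic. The sketch below reflects my understanding of a linear warmup-and-decay schedule; `linear_lr` is a hypothetical helper written for illustration, not the library's implementation.

```python
import math


def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate after `step` optimizer updates under linear warmup and decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)


# 635 training examples in batches of 8 give 80 batches (the last holds 3 examples).
steps_per_epoch = math.ceil(635 / 8)
total_steps = 3 * steps_per_epoch  # 240, matching the printed value

for step in (0, 120, 240):
    print(step, linear_lr(step, 5e-5, total_steps))
```

With zero warmup steps, the learning rate starts at 5e-5, is halved by the midpoint of training and reaches zero at the final step.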


Move the model to the device selected earlier. Each batch is moved to the same device inside the training loop.

In [16]:
model.to(device)
device

device(type='cuda', index=0)

Define the training loop

In [17]:
progress_bar = tqdm(range(NUM_TRAINING_STEPS))

model.train()
for epoch in range(EPOCHS):
    for batch in train_data:
        batch = {k:v.to(device) for k,v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)


Training is over.
Define an Evaluation loop to evaluate the validation set.

In [18]:
metric = load_metric("glue","wnli")

model.eval()
for batch in val_data:
    batch = {k:v.to(device) for k,v in batch.items()}
    print(batch['labels'], batch['labels'].shape)
    with torch.no_grad():
        outputs = model(**batch)
    logits = outputs.logits
    preds = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=preds,references=batch['labels'])
metric.compute()


tensor([0, 1, 0, 1, 1, 0, 1, 1], device='cuda:0') torch.Size([8])
tensor([0, 0, 0, 1, 0, 0, 0, 0], device='cuda:0') torch.Size([8])
tensor([1, 0, 0, 0, 0, 0, 0, 1], device='cuda:0') torch.Size([8])
tensor([0, 1, 0, 1, 1, 1, 1, 0], device='cuda:0') torch.Size([8])
tensor([1, 1, 0, 1, 0, 0, 1, 1], device='cuda:0') torch.Size([8])
tensor([0, 0, 0, 1, 0, 0, 1, 0], device='cuda:0') torch.Size([8])
tensor([1, 0, 0, 1, 0, 0, 1, 0], device='cuda:0') torch.Size([8])
tensor([1, 0, 1, 1, 0, 0, 1, 1], device='cuda:0') torch.Size([8])
tensor([0, 1, 1, 0, 1, 0, 0], device='cuda:0') torch.Size([7])


{'accuracy': 0.5633802816901409}
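
A useful sanity check for an accuracy figure is the majority-class baseline: the accuracy obtained by always predicting the most frequent label. The sketch below computes it from the validation labels copied out of the batches printed above; the variable names are introduced here for illustration.

```python
from collections import Counter

# Validation labels copied from the nine batches printed above.
val_labels = [
    0, 1, 0, 1, 1, 0, 1, 1,
    0, 0, 0, 1, 0, 0, 0, 0,
    1, 0, 0, 0, 0, 0, 0, 1,
    0, 1, 0, 1, 1, 1, 1, 0,
    1, 1, 0, 1, 0, 0, 1, 1,
    0, 0, 0, 1, 0, 0, 1, 0,
    1, 0, 0, 1, 0, 0, 1, 0,
    1, 0, 1, 1, 0, 0, 1, 1,
    0, 1, 1, 0, 1, 0, 0,
]

counts = Counter(val_labels)
majority_label, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / len(val_labels)

print(counts)
print(majority_label, baseline_accuracy)
```

The baseline works out to 40/71, about 0.563, which equals the accuracy reported above. That is consistent with the model predicting `not_entailment` for every validation example, although the notebook does not print the validation predictions, so this remains an inference.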

# Prediction 

Predict the labels for the test data. 

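The GLUE test split ships without gold labels, so `labels` is -1 for every data point. The model's loss computation expects class indices 0 or 1 and would fail on -1, so the loop below overwrites `labels` with placeholder int64 ones of the same batch size. The placeholders only affect the loss, which is ignored here; the logits do not depend on them.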

In [19]:
# make predictions
preds = [] 
model.eval()
for batch in test_data:
    batch['labels'] = torch.ones(len(batch['labels'])).type(torch.int64)
    batch = {k:v.to(device) for k,v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
    logits = outputs.logits
    yhat = torch.argmax(logits, dim=-1)
    preds.append(yhat)

In [20]:
preds

[tensor([0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0'),
 tensor([0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0'),
 tensor([0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0'),
 tensor([0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0'),
 tensor([0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0'),
 tensor([0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0'),
 tensor([0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0'),
 tensor([0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0'),
 tensor([0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0'),
 tensor([0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0'),
 tensor([0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0'),
 tensor([0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0'),
 tensor([0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0'),
 tensor([0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0'),
 tensor([0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0'),
 tensor([0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0'),
 tensor([0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0'),
 tensor([0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0'),
 tensor([0, 0], device='cuda:0')]
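
The per-batch predictions are easier to inspect once they are flattened and counted by class. The sketch below uses nested Python lists as a stand-in for the list of tensors above (with real tensors, each batch could first be converted with `.tolist()`); `summarize_predictions` is a hypothetical helper.

```python
from collections import Counter

LABEL_NAMES = ["not_entailment", "entailment"]


def summarize_predictions(batched_preds):
    """Flatten per-batch predictions and count how often each class was predicted."""
    flat = [int(p) for batch in batched_preds for p in batch]
    return flat, Counter(LABEL_NAMES[p] for p in flat)


# Stand-in for the output above: 18 batches of 8 zeros and a final batch of 2.
batched_preds = [[0] * 8 for _ in range(18)] + [[0, 0]]
flat, counts = summarize_predictions(batched_preds)
print(len(flat), counts)
```

All 146 test examples receive the `not_entailment` class here, which suggests the fine-tuned model has not learned to separate the two classes on this small dataset.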

### That's the end. We walked through fine-tuning a BERT model on the WNLI dataset for an entailment classification task in PyTorch!

##### Key reference: [HuggingFace's NLP Course](https://huggingface.co/course)

### Thank you for your valuable time!