# Effortless NLP using HuggingFace's Transformers Ecosystem

![Image](https://raw.githubusercontent.com/RajkumarGalaxy/dataset/master/Images/z0003.jpg)

> Image by [Author](https://raw.githubusercontent.com/RajkumarGalaxy/dataset/master/Images/z0003.jpg)
### How to fine-tune a BERT model on a custom dataset using the Trainer API

#### ------------------------------------------------ 
#### *Articles So Far In This Series*
#### -> [[NLP Tutorial] Finish Tasks in Two Lines of Code](https://www.kaggle.com/rajkumarl/nlp-tutorial-finish-tasks-in-two-lines-of-code)
#### -> [[NLP Tutorial] Unwrapping Transformers Pipeline](https://www.kaggle.com/rajkumarl/nlp-unwrapping-transformers-pipeline)
#### -> [[NLP Tutorial] Exploring Tokenizers](https://www.kaggle.com/rajkumarl/nlp-tutorial-exploring-tokenizers)
#### -> [[NLP Tutorial] Fine-Tuning in TensorFlow](https://www.kaggle.com/rajkumarl/nlp-tutorial-fine-tuning-in-tensorflow) 
#### -> [[NLP Tutorial] Fine-Tuning in PyTorch](https://www.kaggle.com/rajkumarl/nlp-tutorial-fine-tuning-in-pytorch) 
#### -> [[NLP Tutorial] Fine-Tuning with Trainer API](https://www.kaggle.com/rajkumarl/nlp-tutorial-fine-tuning-with-trainer-api) 
#### ------------------------------------------------ 

# Prepare Environment and Data

In this article, we fine-tune a BERT model on CoLA (the Corpus of Linguistic Acceptability), one of the tasks in the GLUE benchmark, using the Trainer API. A GPU environment is recommended for faster training and inference.

In [1]:
# upgrade transformers and datasets to latest versions
!pip install --upgrade transformers
!pip install --upgrade datasets
import transformers
import datasets
print(transformers.__version__)
print(datasets.__version__)

Collecting transformers
  Downloading transformers-4.12.5-py3-none-any.whl (3.1 MB)
     |████████████████████████████████| 3.1 MB 786 kB/s 
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.1.2-py3-none-any.whl (59 kB)
     |████████████████████████████████| 59 kB 5.8 MB/s 
Installing collected packages: huggingface-hub, transformers
  Attempting uninstall: huggingface-hub
    Found existing installation: huggingface-hub 0.0.19
    Uninstalling huggingface-hub-0.0.19:
      Successfully uninstalled huggingface-hub-0.0.19
  Attempting uninstall: transformers
    Found existing installation: transformers 4.5.1
    Uninstalling transformers-4.5.1:
      Successfully uninstalled transformers-4.5.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datasets 1.14.0 requires huggingface-hub<0.1.0,>=0.0.19, but you have hug

In [2]:
# Make necessary imports

# for array operations 
import numpy as np 
# PyTorch framework
import torch
# for pretty printing
from pprint import pprint
# plotting
from matplotlib import pyplot as plt
# reproducibility
import random

# HuggingFace ecosystem
# tokenizer
from transformers import AutoTokenizer, DataCollatorWithPadding
# model
from transformers import AutoModelForSequenceClassification
# trainer
from transformers import TrainingArguments
from transformers import Trainer
from transformers import AdamW
# dataset
from datasets import load_dataset, load_metric

# disable WandB defaults
import os
os.environ["WANDB_DISABLED"] = "true"


# a seed for reproducibility
SEED = 42
# set seed
np.random.seed(SEED)
torch.manual_seed(SEED)
random.seed(SEED)

# check for GPU device
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print('Device available:', device) 

Device available: cuda:0


Load the CoLA dataset from the GLUE benchmark.

In [3]:
raw_data = load_dataset("glue", "cola")
# what does it look like?
raw_data

Downloading and preparing dataset glue/cola (download: 368.14 KiB, generated: 596.73 KiB, post-processed: Unknown size, total: 964.86 KiB) to /root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...


Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.


DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})

In [4]:
# look at one sample
raw_data["train"][0]

{'sentence': "Our friends won't buy this analysis, let alone the next one we propose.",
 'label': 1,
 'idx': 0}

Each data point contains a sentence, its index, and its label. What labels are there, and which integer id does each one map to?

In [5]:
# what features are there in data?
# What are the label names?
raw_data["train"].features

{'sentence': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=2, names=['unacceptable', 'acceptable'], names_file=None, id=None),
 'idx': Value(dtype='int32', id=None)}

This dataset defines a supervised *sequence classification* task with two classes: unacceptable [0] and acceptable [1].
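
As a small illustration of how the integer labels relate to the class names, the mapping can be sketched in plain Python. The list below simply copies the `names` reported by the features output above; the helper names `id_to_name` and `name_to_id` are made up for this sketch and are not part of the `datasets` API.

```python
# Class names in the order reported by raw_data["train"].features above.
# The position of a name in this list is its integer label.
label_names = ["unacceptable", "acceptable"]


def id_to_name(label_id: int) -> str:
    """Hypothetical helper: integer label -> class name."""
    return label_names[label_id]


def name_to_id(name: str) -> int:
    """Hypothetical helper: class name -> integer label."""
    return label_names.index(name)


# The first training sample shown earlier carries label 1.
sample_label = 1
print(sample_label, "->", id_to_name(sample_label))
```

So the sample sentence printed earlier, with `'label': 1`, is marked as acceptable.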

# Tokenizer and Data Collator

We use the pre-trained `bert-base-uncased` checkpoint for fine-tuning. Tokenizing without padding and letting a data collator pad each batch only to the length of its longest sequence (dynamic padding) avoids spending memory and compute on unnecessary padding tokens during training.
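
To make the idea of per-batch padding concrete, here is a minimal pure-Python sketch of what a padding collator does conceptually. It is not the `DataCollatorWithPadding` implementation; the function `pad_batch` is hypothetical, and the token ids between `101` ([CLS]) and `102` ([SEP]) are made-up values for illustration. It assumes a padding id of `0`, which is the `[PAD]` id in `bert-base-uncased`.

```python
from typing import Dict, List

PAD_ID = 0  # assumed padding token id ([PAD] in bert-base-uncased)


def pad_batch(batch: List[List[int]], pad_id: int = PAD_ID) -> Dict[str, List[List[int]]]:
    """Pad every sequence to the length of the longest sequence in this batch."""
    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = longest - len(ids)
        # real tokens first, then padding on the right
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up tokenized sentences of different lengths.
batch = [
    [101, 2256, 102],
    [101, 2256, 2814, 2180, 102],
]
padded = pad_batch(batch)
for ids, mask in zip(padded["input_ids"], padded["attention_mask"]):
    print(ids, mask)
```

Because the target length is decided per batch, a batch of short sentences stays short instead of being padded to the longest sentence in the whole dataset.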

In [6]:
checkpoint = 'bert-base-uncased'
# bert tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# data collator for dynamic padding as per batch
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [7]:
# load a pre-trained BERT model with a two-class classification head
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [8]:
# define a tokenize function
def Tokenize_function(example):
    return tokenizer(example['sentence'], truncation=True)

In [9]:
# tokenize entire data
tokenized_data = raw_data.map(Tokenize_function, batched=True)

What does the tokenized data look like after dropping the columns the model does not need?

In [10]:
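# drop the columns the model does not accept
tokenized_data = tokenized_data.remove_columns(['idx','sentence'])
# the model expects the target column to be named 'labels'
tokenized_data = tokenized_data.rename_column('label','labels')
# with_format returns a new object; it is only displayed here, not assigned
tokenized_data.with_format('pt')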

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'labels', 'token_type_ids'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['attention_mask', 'input_ids', 'labels', 'token_type_ids'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'labels', 'token_type_ids'],
        num_rows: 1063
    })
})

`attention_mask`, `input_ids` and `token_type_ids` are the input features the model needs, and `labels` is the target. The `idx` and `sentence` columns were removed because the model does not accept them.

The data is now ready for efficient loading and faster training.

# Model Fine-tuning

Define the training arguments.

In [11]:
training_args = TrainingArguments('bert-finetuning-cola', 
                                  evaluation_strategy='epoch',
                                  num_train_epochs=2,
                                  learning_rate=5e-5,
                                  weight_decay=0.005,
                                  per_device_train_batch_size=8,
                                  per_device_eval_batch_size=8,
                                  report_to = 'none'
                                 )
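
The `learning_rate=5e-5` above is only the starting value. As a rough sketch, assuming the Trainer's default schedule for this version is a linear decay to zero with no warmup steps (that default is an assumption here, not something the cell above sets explicitly), the learning rate at a given optimization step can be computed as follows. The function `linear_lr` is a hypothetical helper, and the total of 2138 steps is taken from the training log of this run.

```python
def linear_lr(step: int, total_steps: int, base_lr: float = 5e-5) -> float:
    """Hypothetical helper: linear decay from base_lr to 0, assuming zero warmup."""
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps)


TOTAL_STEPS = 2138  # "Total optimization steps" reported in the training log

for step in (0, 500, 1069, 2138):
    print(f"step {step:>4}: lr = {linear_lr(step, TOTAL_STEPS):.2e}")
```

Under this assumption the learning rate is half of its initial value at the end of the first epoch and reaches zero at the last step.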

In [12]:
# use the pre-built metrics 
def compute_metrics(eval_preds):
    metric = load_metric("glue", "cola")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
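
The GLUE metric for CoLA reports the Matthews correlation coefficient, which is why the training log has a "Matthews Correlation" column. As a sketch of what that number measures, the binary case can be computed by hand from the confusion-matrix counts. This is an illustrative re-implementation, not the code behind `load_metric`; the toy labels and predictions are made up, and returning `0.0` when the denominator is zero is a convention chosen for this sketch.

```python
import math
from typing import Sequence


def matthews_corrcoef(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # convention for this sketch: undefined cases (zero denominator) map to 0.0
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Made-up toy data: 1 = acceptable, 0 = unacceptable.
references = [1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 0, 1, 1]
mcc = matthews_corrcoef(references, predictions)
print(f"MCC = {mcc:.4f}")
```

The score ranges from -1 to 1, where 1 is a perfect prediction and 0 is no better than chance, which makes it more informative than accuracy when the two classes are imbalanced.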

In [13]:
# formulate a trainer with necessary data, metrics, tokenizer and arguments
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [14]:
# Train the model
trainer.train()

***** Running training *****
  Num examples = 8551
  Num Epochs = 2
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 2138


| Epoch | Training Loss | Validation Loss | Matthews Correlation |
|---|---|---|---|
| 1 | 0.478 | 0.562753 | 0.483597 |
| 2 | 0.2777 | 0.64731 | 0.563068 |


Saving model checkpoint to bert-finetuning-cola/checkpoint-500
Configuration saved in bert-finetuning-cola/checkpoint-500/config.json
Model weights saved in bert-finetuning-cola/checkpoint-500/pytorch_model.bin
tokenizer config file saved in bert-finetuning-cola/checkpoint-500/tokenizer_config.json
Special tokens file saved in bert-finetuning-cola/checkpoint-500/special_tokens_map.json
Saving model checkpoint to bert-finetuning-cola/checkpoint-1000
Configuration saved in bert-finetuning-cola/checkpoint-1000/config.json
Model weights saved in bert-finetuning-cola/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in bert-finetuning-cola/checkpoint-1000/tokenizer_config.json
Special tokens file saved in bert-finetuning-cola/checkpoint-1000/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 1043
  Batch size = 8


Saving model checkpoint to bert-finetuning-cola/checkpoint-1500
Configuration saved in bert-finetuning-cola/checkpoint-1500/config.json
Model weights saved in bert-finetuning-cola/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in bert-finetuning-cola/checkpoint-1500/tokenizer_config.json
Special tokens file saved in bert-finetuning-cola/checkpoint-1500/special_tokens_map.json
Saving model checkpoint to bert-finetuning-cola/checkpoint-2000
Configuration saved in bert-finetuning-cola/checkpoint-2000/config.json
Model weights saved in bert-finetuning-cola/checkpoint-2000/pytorch_model.bin
tokenizer config file saved in bert-finetuning-cola/checkpoint-2000/tokenizer_config.json
Special tokens file saved in bert-finetuning-cola/checkpoint-2000/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 1043
  Batch size = 8


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=2138, training_loss=0.3915980314963103, metrics={'train_runtime': 150.6256, 'train_samples_per_second': 113.54, 'train_steps_per_second': 14.194, 'total_flos': 160465368776460.0, 'train_loss': 0.3915980314963103, 'epoch': 2.0})
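
The step counts in the log above can be checked with a little arithmetic. The sketch below assumes a single device, no gradient accumulation (the log reports 1), and that the last, smaller batch of each epoch is kept. The value `save_steps = 500` is inferred from the checkpoint names in the log and is an assumption about the default in this version, since the training arguments above do not set it.

```python
import math

num_examples = 8551   # "Num examples" in the training log
batch_size = 8        # per_device_train_batch_size
num_epochs = 2        # num_train_epochs
save_steps = 500      # assumed default; matches checkpoint-500, checkpoint-1000, ...

# the last batch of an epoch is smaller, so round up
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
checkpoint_steps = list(range(save_steps, total_steps + 1, save_steps))

print("steps per epoch:", steps_per_epoch)    # 1069
print("total steps:", total_steps)            # 2138
print("checkpoints at:", checkpoint_steps)    # [500, 1000, 1500, 2000]
```

This matches the reported `Total optimization steps = 2138` and the four saved checkpoints.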

# Prediction 

Predict the labels for the test data. The GLUE test split does not ship usable gold labels, so remove the `labels` column first; the trainer then runs inference only.

In [15]:
# prepare test data by removing labels
test_data = tokenized_data['test'].remove_columns(['labels'])

In [16]:
# make predictions
yhat = trainer.predict(test_data)
yhat

***** Running Prediction *****
  Num examples = 1063
  Batch size = 8


PredictionOutput(predictions=array([[-3.0082583,  3.116696 ],
       [-0.6556443,  1.114683 ],
       [-2.9872937,  3.0884838],
       ...,
       [ 1.9674035, -1.6151687],
       [-2.2325091,  2.2541625],
       [-1.9814276,  2.0933564]], dtype=float32), label_ids=None, metrics={'test_runtime': 1.426, 'test_samples_per_second': 745.419, 'test_steps_per_second': 93.265})

In [17]:
# pick the class with the highest logit for each sentence
preds = np.argmax(yhat.predictions, axis=1)
preds

array([1, 1, 1, ..., 0, 1, 1])
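
The predictions returned by the trainer are raw logits, not probabilities. As a small sketch, a softmax turns each pair of logits into class probabilities, and the arg-max index can be mapped back to a label name. The three logit rows below are copied from the prediction output above; the helper `softmax` is written by hand here to keep the example dependency-free.

```python
import math
from typing import List

label_names = ["unacceptable", "acceptable"]

# three rows copied from the PredictionOutput shown above
logits = [
    [-3.0082583, 3.116696],
    [-0.6556443, 1.114683],
    [1.9674035, -1.6151687],
]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


probs = [softmax(row) for row in logits]
# index of the largest probability (same as the largest logit)
pred_ids = [row.index(max(row)) for row in probs]
pred_names = [label_names[i] for i in pred_ids]

for p, name in zip(probs, pred_names):
    print(f"{name:<12} P(acceptable) = {p[1]:.3f}")
```

The softmax does not change which class wins, so `np.argmax` on the logits gives the same labels; the probabilities are only useful when a confidence score is wanted.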

### That's the end. We now have a good understanding of fine-tuning a BERT model on the CoLA dataset for a linguistic acceptability classification task with the Trainer API!

##### Key reference: [HuggingFace's NLP Course](https://huggingface.co/course)

### Thank you for your valuable time!