<a href="https://www.kaggle.com/code/aisuko/fine-tuning-fill-mask-llm-to-text-classification?scriptVersionId=164614155" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

Fine-tuning a pretrained model has significant benefits: it reduces computation costs and carbon footprint, and it lets you use state-of-the-art models without training one from scratch.

When we use a pretrained model, we train it on a dataset specific to our task. This is known as fine-tuning, an incredibly powerful training technique.

There are several frameworks we can choose from, such as:
* Transformers
* Keras in TensorFlow
* Native PyTorch

We will use the Transformers `Trainer` with PyTorch in this notebook.


# What is the Trainer class?

The Trainer class provides an API for feature-complete training in PyTorch for most standard use cases.

Before instantiating your `Trainer`, create a `TrainingArguments` to access all the points of customization during training. The API supports distributed training on multiple GPUs/TPUs.

The Trainer contains the basic training loop which supports the above features.

The `Trainer` class is powerful, but it has some limitations. It is optimized for Transformers models and can behave in surprising ways when used with other models. When using it with our own model, make sure that:

* The model always returns tuples or subclasses of `ModelOutput`
* The model can compute the loss if a `labels` argument is provided, and that loss is returned as the first element of the tuple
* The model can accept multiple label arguments (use `label_names` in `TrainingArguments` to indicate their names to the `Trainer`), but none of them should be named `label`

In [1]:
%%capture
!pip install transformers==4.35.2
!pip install datasets==2.15.0
!pip install evaluate==0.4.1

In [2]:
import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()

login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

os.environ["WANDB_API_KEY"]=user_secrets.get_secret("WANDB_API_KEY")
os.environ["WANDB_PROJECT"] = "Fine-tune-models"
os.environ["WANDB_NOTES"] = "Fine tune model bert-base-cased"
os.environ["WANDB_NAME"] = "ft-bert-base-cased-run-v-0-3"

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
import torch
from torch import nn
from transformers import Trainer

class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels=inputs.pop("labels")
        # forward pass
        outputs=model(**inputs)
        logits=outputs.get("logits")
        # compute custom loss (suppose one has 3 labels with different weights);
        # the weight tensor needs one entry per class, so its length must equal num_labels
        loss_fct=nn.CrossEntropyLoss(weight=torch.tensor([1.0,2.0,3.0], device=model.device))
        loss=loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
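
To make the effect of the class weights concrete, here is a minimal pure-Python sketch of weighted cross-entropy. The helper `weighted_cross_entropy` and the example logits are hypothetical. The reduction used here (weighted sum divided by the sum of the target weights) is meant to mirror the default `mean` reduction of `nn.CrossEntropyLoss` when `weight` is set.

```python
import math

def weighted_cross_entropy(logits, labels, weights):
    """Weighted mean of per-example negative log-likelihoods.

    Hypothetical helper for illustration only.
    """
    total = 0.0
    weight_sum = 0.0
    for row, y in zip(logits, labels):
        # numerically stable log-sum-exp over the class scores
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        nll = log_z - row[y]
        total += weights[y] * nll
        weight_sum += weights[y]
    return total / weight_sum

# Two made-up examples: the model favours class 0 in both,
# but the second example actually belongs to class 2.
logits = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
labels = [0, 2]

unweighted = weighted_cross_entropy(logits, labels, [1.0, 1.0, 1.0])
weighted = weighted_cross_entropy(logits, labels, [1.0, 2.0, 3.0])
print(unweighted, weighted)
```

With the weights `[1.0, 2.0, 3.0]`, the mistake on the class 2 example counts three times as much as the correct class 0 example, so the weighted loss comes out higher than the unweighted one.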



# Loading Data

The first step in fine-tuning a model is to download a dataset and prepare it for training. Here we load the Yelp Reviews dataset with `datasets`.

In [4]:
from datasets import load_dataset

dataset=load_dataset("yelp_review_full")
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

# Tokenizing

Most of the time, we use a [padding and truncation strategy](https://www.kaggle.com/code/aisuko/preprocess-natural-language-processing/notebook) to handle variable sequence lengths.

In [5]:
from transformers import AutoTokenizer

tokenizer=AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets=dataset.map(tokenize_function, batched=True)

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
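
As a rough illustration of what `padding="max_length"` and `truncation=True` do to each sequence, here is a pure-Python sketch. The helper `pad_or_truncate` and the token ids are made up for the example. A real tokenizer also takes care of special tokens when truncating, which this sketch ignores. The pad id of 0 is chosen to match the `padding_idx=0` shown in the model printout.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut a sequence to max_length, or pad it up to max_length.

    Hypothetical helper; returns the same two fields a tokenizer
    would typically produce for a padded sequence.
    """
    ids = list(token_ids[:max_length])            # truncation
    num_padding = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * num_padding
    ids = ids + [pad_id] * num_padding            # padding
    return {"input_ids": ids, "attention_mask": attention_mask}

short = pad_or_truncate([11, 12, 13, 14], max_length=6)
long = pad_or_truncate([11, 12, 13, 14, 15, 16, 17, 18], max_length=6)
print(short)
print(long)
```

Every sequence ends up with the same length, which is what lets the examples be stacked into fixed-size batches. The attention mask marks which positions hold real tokens (1) and which are padding (0).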

# Loading the Model

Here we use the Transformers `Trainer` class for training. It lets us start training without writing the training loop by hand, and it also provides logging, gradient accumulation, and mixed precision.

Note: Sometimes we can see a warning about some of the pretrained weights not being used and some weights being randomly initialized. Don't worry, this is completely normal! The pretrained head of the BERT model is discarded, and replaced with a randomly initialized classification head. We will fine-tune this new model head on our sequence classification task, transferring the knowledge of the pretrained model to it.

In [6]:
from transformers import AutoModelForSequenceClassification

model=AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)
print(model)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

# Define Metrics

`Trainer` does not automatically evaluate model performance during training, so we need to pass it a function that computes and reports metrics. The Evaluate library provides a simple accuracy metric that we can load with `evaluate.load`. Calling `compute` on the metric calculates the accuracy of our predictions. Before passing the predictions to `compute`, we need to convert the logits returned by the model into predicted class indices.

In [7]:
import numpy as np
import evaluate

metric=evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels=eval_pred
    predictions=np.argmax(logits, axis=1)
    return metric.compute(predictions=predictions, references=labels)
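
To show what the argmax step and the accuracy computation amount to, here is a dependency-free sketch. The helper names and the logits are hypothetical; the notebook itself uses `np.argmax` and the `evaluate` accuracy metric.

```python
def argmax(row):
    """Index of the largest score in one row of logits."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy_from_logits(logits, labels):
    """Turn logits into class predictions, then score them."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Three made-up examples with five classes each
logits = [
    [0.1, 0.2, 0.3, 0.4, 2.0],   # predicted class 4
    [1.5, 0.2, 0.1, 0.0, 0.3],   # predicted class 0
    [0.0, 0.1, 0.9, 0.2, 0.1],   # predicted class 2
]
labels = [4, 1, 2]

result = accuracy_from_logits(logits, labels)
print(result)
```

Two of the three predictions match their labels, so the accuracy here is 2/3.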

# Training Hyperparameters

Here we create a `TrainingArguments` instance, which contains all the hyperparameters we can tune as well as flags for activating different training options.

In [8]:
from transformers import TrainingArguments

training_args=TrainingArguments(
    output_dir=os.getenv("WANDB_NAME"),
    # a positive max_steps overrides num_train_epochs, so training stops after 100 steps
    num_train_epochs=2,
    max_steps=100,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
    # wandb
    report_to="wandb",
    run_name=os.getenv("WANDB_NAME"),
)


trainer=Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

wandb: Currently logged in as: urakiny (causal_language_trainer). Use `wandb login --relogin` to force relogin
wandb: wandb version 0.16.3 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.16.0
wandb: Run data is saved locally in /kaggle/working/wandb/run-20240228_004828-g12k2gt0
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run ft-bert-base-cased-run-v-0-3
wandb: ⭐️ View project at https://wandb.ai/causal_language_trainer/Fine-tune-models
wandb: 🚀 View run at https://wandb.ai/causal_language_trainer/Fine-tune-models/runs/g12k2gt0


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 1.155426 | 0.523 |




TrainOutput(global_step=100, training_loss=1.390007781982422, metrics={'train_runtime': 157.909, 'train_samples_per_second': 10.132, 'train_steps_per_second': 0.633, 'total_flos': 418884082802688.0, 'train_loss': 1.390007781982422, 'epoch': 1.59})
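
The `'epoch': 1.59` in the output above is a consequence of `max_steps=100`. The sketch below recovers the arithmetic from the logged numbers. It assumes the effective batch size can be inferred from the reported throughput; the notebook does not state it directly, so treat this as a back-of-the-envelope check rather than a definitive account.

```python
import math

# Values taken from the notebook: 1000 training examples, max_steps=100,
# and the runtime and throughput reported in TrainOutput
num_train_examples = 1000
max_steps = 100
train_runtime = 157.909
train_samples_per_second = 10.132

# Inferred, not stated in the notebook: samples processed per optimizer step
samples_seen = train_runtime * train_samples_per_second
effective_batch_size = round(samples_seen / max_steps)

steps_per_epoch = math.ceil(num_train_examples / effective_batch_size)
epochs_completed = max_steps / steps_per_epoch

print(effective_batch_size, steps_per_epoch, round(epochs_completed, 2))
# prints: 16 63 1.59
```

Because evaluation runs once per epoch, only the first epoch boundary (step 63) falls within the 100 steps, which is consistent with the single row in the results table. The "No log" entry for training loss would be expected if the logging interval is longer than the whole run.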

# Evaluate Accuracy

In [9]:
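import math

eval_results=trainer.evaluate()
# exp(eval_loss) is perplexity, a metric mostly used for language modeling;
# for this classifier the accuracy is available as eval_results["eval_accuracy"]
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")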



Perplexity: 3.18
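
The cell above prints perplexity, which is the exponential of the evaluation loss. For a classifier, accuracy is usually the more informative number. Here is a small sketch of reading both. The dictionary is a hand-written stand-in for what `trainer.evaluate()` returns, filled with the epoch 1 values from the results table, and only these two keys are assumed.

```python
import math

# Stand-in for trainer.evaluate() output, using the values from the table above
eval_results = {"eval_loss": 1.155426, "eval_accuracy": 0.523}

perplexity = math.exp(eval_results["eval_loss"])
accuracy = eval_results["eval_accuracy"]
chance_level = 1 / 5  # five star ratings, assuming roughly balanced classes

print(f"Perplexity: {perplexity:.2f}")   # prints "Perplexity: 3.18"
print(f"Accuracy: {accuracy:.3f}")       # prints "Accuracy: 0.523"
print(f"Chance level: {chance_level:.3f}")
```

An accuracy of 0.523 on five classes is well above the 0.2 expected from random guessing, which is a reasonable result for 100 training steps on 1000 examples.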


# Saving to Hub

In [10]:
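# Trainer.push_to_hub takes a commit message as its first argument;
# the target repository name comes from output_dir (set to WANDB_NAME above)
trainer.push_to_hub(commit_message="End of training")
tokenizer.push_to_hub(os.getenv("WANDB_NAME"))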

CommitInfo(commit_url='https://huggingface.co/aisuko/ft-bert-base-cased-run-v-0-3/commit/39e5b9665d4f161d202e43d3eba76d0a66ae0ad8', commit_message='Upload tokenizer', commit_description='', oid='39e5b9665d4f161d202e43d3eba76d0a66ae0ad8', pr_url=None, pr_revision=None, pr_num=None)

# Inference

The original `bert-base-cased` checkpoint is a fill-mask model, but the model we pushed after fine-tuning is a text-classification model. Here is why:

`AutoModelForSequenceClassification` reuses the pretrained BERT encoder, discards the masked-language-modeling head, and adds a new classification head (the `classifier.weight` and `classifier.bias` reported as newly initialized when we loaded the model). Fine-tuning trains the encoder and this new head on labeled reviews, so the saved checkpoint is a `BertForSequenceClassification` and loads with the `text-classification` pipeline. It no longer has a head for filling in masked tokens. Fine-tuning a masked language model for further mask filling is a different workflow: it keeps the fill-mask head and needs a special data collator that randomly masks some of the tokens in each batch of texts.

In [11]:
from transformers import pipeline

classifier=pipeline("text-classification", model="aisuko/"+os.getenv("WANDB_NAME"))
classifier("I like you. I love you")

[{'label': 'LABEL_4', 'score': 0.32115018367767334}]
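
The pipeline returns a generic label name because no readable label names were stored in the model config. Here is a small sketch of turning that name into a star rating. It assumes that class index `i` in `yelp_review_full` corresponds to `i + 1` stars; the helper `label_to_stars` is hypothetical.

```python
def label_to_stars(label):
    """Map a generic pipeline label such as 'LABEL_4' to a star rating.

    Assumes class index i means i + 1 stars.
    """
    index = int(label.rsplit("_", 1)[1])
    return index + 1

# The prediction printed by the pipeline above
prediction = [{"label": "LABEL_4", "score": 0.32115018367767334}]

stars = label_to_stars(prediction[0]["label"])
print(f"{stars} stars (score {prediction[0]['score']:.2f})")
# prints: 5 stars (score 0.32)
```

Under that assumption, the model rates "I like you. I love you" as a 5-star review, though a score of 0.32 shows it is not very confident. Passing `id2label` and `label2id` mappings when loading the model is one way to have the pipeline return readable names directly.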