# Fine-tuning

*by Arif Ozan Kızıldağ*

In the last tutorial, we explored how to train a transformer model from scratch. Although it is sometimes necessary, training these models can be time-consuming due to their large size. To mitigate this, instead of training them from scratch, we can fine-tune pre-trained models.

Fine-tuning refers to the process of taking a pre-trained model, which was initially trained on a larger or more general dataset, and adapting it for a specific task. For example, let's say you want to classify traffic lights in photos, and you have access to a classifier that can already identify traffic lights, cars, pedestrians, bridges, and so on. Instead of using this comprehensive model as is, you can make minor adjustments to tailor it to your specific needs. This often involves training the model on a smaller dataset related to your task, allowing you to create a specialized model without having to train it from scratch.

Let's look at an example to see what we can do with a trained model, utilizing the model we created in the last session.

In [6]:
from time import time
import torch
import torchtext
from torch import nn
import torchdata
import math
import numpy as np
%matplotlib inline
print('version of the torch:' + torch.__version__)
print('version of the torchtext:' + torchtext.__version__)
print('version of the torchdata:' + torchdata.__version__)

version of the torch:2.0.1
version of the torchtext:0.15.2
version of the torchdata:0.6.1


In [7]:
class PositionalEncoding(nn.Module):
    def __init__(self,
                 embedding_size: int,
                 dropout: float= 0.1,
                 maximum_length: int = 5000):
        super(PositionalEncoding, self).__init__()

        divider = torch.exp(- torch.arange(0, embedding_size, 2)* math.log(10000) / embedding_size)
        position = torch.arange(0, maximum_length).unsqueeze(1)

        positionalembedding = torch.zeros((maximum_length,1, embedding_size))
        positionalembedding[:,0, 0::2] = torch.sin(position * divider)
        positionalembedding[:,0, 1::2] = torch.cos(position * divider)

        self.dropout = nn.Dropout(dropout)
        self.register_buffer('positionalembedding', positionalembedding)

    def forward(self, input_token: torch.Tensor):
        embedding = input_token + self.positionalembedding[:input_token.size(0), :]
        return self.dropout(embedding)


class TransformerModel(nn.Module):

    def __init__(self, ntoken: int, embedding_size: int, nhead: int, d_hid: int,
                 nlayers: int, nclass: int, dropout: float = 0.5):
        super().__init__()
        self.embedding_size = embedding_size

        self.embedding = nn.Embedding(ntoken, embedding_size)

        self.PositionalEncoding = PositionalEncoding(embedding_size, dropout)
        encoder_layers = nn.TransformerEncoderLayer(embedding_size, nhead, d_hid, dropout)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layers, nlayers)


        self.linear = nn.Linear(embedding_size, nclass)

        self.init_weights()

    def init_weights(self) -> None:
        initrange = 0.1
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.linear.bias.data.zero_()
        self.linear.weight.data.uniform_(-initrange, initrange)

    def forward(self, src: torch.Tensor, src_mask: torch.Tensor = None) -> torch.Tensor:
        """
        Arguments:
            src: Tensor, shape ``[seq_len, batch_size]``
            src_mask: Tensor, shape ``[seq_len, seq_len]``

        Returns:
            output Tensor of shape ``[seq_len, batch_size, nclass]``
        """
        src = self.embedding(src) * math.sqrt(self.embedding_size)
        src = self.PositionalEncoding(src)
        output = self.transformer_encoder(src, src_mask)
        output = self.linear(output)
        return output

In [8]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [9]:
ntokens = 100  # size of vocabulary
emsize = 16  # embedding dimension
d_hid = 8  # dimension of the feedforward network model in ``nn.TransformerEncoder``
nlayers = 2  # number of ``nn.TransformerEncoderLayer`` in ``nn.TransformerEncoder``
nhead = 2  # number of heads in ``nn.MultiheadAttention``
dropout = 0.2  # dropout probability
nclasses = 4
model = TransformerModel(ntokens, emsize, nhead, d_hid, nlayers,nclasses, dropout).to(device)

In [10]:
print(model)

TransformerModel(
  (embedding): Embedding(100, 16)
  (PositionalEncoding): PositionalEncoding(
    (dropout): Dropout(p=0.2, inplace=False)
  )
  (transformer_encoder): TransformerEncoder(
    (layers): ModuleList(
      (0-1): 2 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=16, out_features=16, bias=True)
        )
        (linear1): Linear(in_features=16, out_features=8, bias=True)
        (dropout): Dropout(p=0.2, inplace=False)
        (linear2): Linear(in_features=8, out_features=16, bias=True)
        (norm1): LayerNorm((16,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((16,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.2, inplace=False)
        (dropout2): Dropout(p=0.2, inplace=False)
      )
    )
  )
  (linear): Linear(in_features=16, out_features=4, bias=True)
)


As a reminder, the model we trained last time consists of an embedding layer, a transformer encoder, and a final linear layer. For fine-tuning, we have two approaches. The first is to fine-tune the entire model, which is feasible here because the model is small. The second is to fine-tune only specific layers by "freezing" the others, so that only the remaining layers (typically the last few) are updated during training. In our case, that means freezing the transformer encoder.

In [11]:
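# requires_grad_ is a method: calling it with False freezes every parameter of the encoder
model.transformer_encoder.requires_grad_(False)

With this change, gradients are no longer computed for the parameters of the encoder layers, so they stay fixed during training.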
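
One way to confirm that freezing took effect is to count the parameters that still require gradients and to hand only those to the optimizer. The sketch below uses a small stand-in encoder and head (hypothetical names, sized like the model above) rather than the trained model itself.

```python
import torch
from torch import nn

# Hypothetical stand-in for the tutorial's model: an encoder followed by a linear head.
encoder_layer = nn.TransformerEncoderLayer(d_model=16, nhead=2, dim_feedforward=8, dropout=0.2)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
head = nn.Linear(16, 4)

# Freeze every parameter of the encoder in place.
encoder.requires_grad_(False)


def count_trainable(module: nn.Module) -> int:
    """Number of scalar parameters that will receive gradients."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


frozen_encoder_params = count_trainable(encoder)
trainable_head_params = count_trainable(head)
print(frozen_encoder_params)  # 0
print(trainable_head_params)  # 16 * 4 + 4 = 68

# Give the optimizer only the parameters that still require gradients.
all_params = list(encoder.parameters()) + list(head.parameters())
trainable = [p for p in all_params if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)
```

Filtering the parameter list is optional, since frozen parameters never receive gradients, but it makes the intent explicit and keeps the optimizer state small.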

Another thing to consider is that the model currently classifies four different types of news. Let's say we are only interested in determining whether a news article is about sports or not. Although we could keep the model as it is and focus on just one output, doing so would increase our computational cost. To address this, we have two options: either modify the last layer of the model to output only the classification we're interested in or add a new layer to make that specific classification. Both approaches aim to optimize computational resources while tailoring the model to our specific needs.

If we want to add a new layer, we simply create it and freeze the pre-trained model as a whole, so that its weights are reused unchanged.

In [12]:
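inference = nn.Linear(4, 1).to(device)  # new layer: 4 class scores -> 1 "sports" score
duminput = torch.randint(0, 100, (100, 1)).to(device)  # dummy input of shape [seq_len=100, batch_size=1]
model.requires_grad_(False)  # freeze the whole pre-trained model
x = model(duminput)[-1]  # output at the last sequence position, shape [1, 4]
inference(x)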

tensor([[0.6412]], device='cuda:0', grad_fn=<AddmmBackward0>)
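
To keep the frozen model and the new layer together, the two can be wrapped in a single module. The following is a minimal sketch under stated assumptions: `FrozenBackboneWithHead` is a hypothetical name, and a small `nn.Sequential` stands in for the pre-trained classifier.

```python
import torch
from torch import nn


class FrozenBackboneWithHead(nn.Module):
    """Hypothetical wrapper: a frozen pre-trained classifier plus a new trainable layer."""

    def __init__(self, backbone: nn.Module, n_backbone_outputs: int):
        super().__init__()
        self.backbone = backbone.requires_grad_(False)  # freeze the pre-trained weights
        self.head = nn.Linear(n_backbone_outputs, 1)    # new layer: "sports" vs. "not sports"

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        self.backbone.eval()       # keep dropout switched off in the frozen part
        with torch.no_grad():      # no gradients are needed for the backbone
            features = self.backbone(x)
        return self.head(features)


# Stand-in backbone with 4 outputs; in the tutorial this role is played by the trained model.
backbone = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 4))
wrapped = FrozenBackboneWithHead(backbone, n_backbone_outputs=4)
logit = wrapped(torch.randn(1, 16))
print(logit.shape)  # torch.Size([1, 1])
```

Running the backbone under `torch.no_grad()` also saves memory, because no intermediate activations are stored for the frozen part.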

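Alternatively, we can modify the last layer: we create a new model with a single output and copy the pre-trained encoder weights into it.

In [13]:
nclasses = 1
model2 = TransformerModel(ntokens, emsize, nhead, d_hid, nlayers, nclasses, dropout).to(device)

# Reuse the pre-trained encoder weights and freeze them; the new linear layer starts from fresh weights
model2.transformer_encoder.load_state_dict(model.transformer_encoder.state_dict())
model2.transformer_encoder.requires_grad_(False)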

After modifying the model, you can proceed to train it as you normally would.
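
A minimal training sketch for the binary "sports or not" task might look like the following. It uses toy tensors and hypothetical stand-in layers (`body`, `head`) instead of the real model and data, and it assumes that class index 1 denotes sports, as in the Hugging Face `ag_news` label set.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-ins: a frozen feature extractor and a new single-output head.
body = nn.Linear(16, 8).requires_grad_(False)
head = nn.Linear(8, 1)

# Toy data: 32 feature vectors with AG News-style labels in {0, 1, 2, 3}.
features = torch.randn(32, 16)
labels = torch.randint(0, 4, (32,))
# Assumption: class index 1 is "Sports"; everything else becomes the negative class.
is_sports = (labels == 1).float().unsqueeze(1)

criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(head.parameters(), lr=1e-2)  # only the head is optimized

body_before = body.weight.clone()
losses = []
for step in range(50):
    optimizer.zero_grad()
    logits = head(body(features))
    loss = criterion(logits, is_sports)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(losses[0], losses[-1])                  # the loss should decrease on this toy data
print(torch.equal(body_before, body.weight))  # True: the frozen weights did not change
```

With a single output, `nn.BCEWithLogitsLoss` replaces the cross-entropy loss that a four-class model would use.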

## Hugging Face

You've probably heard about the Hugging Face library if you're interested in transformers. Hugging Face describes itself as "The platform where the machine learning community collaborates on models, datasets, and applications." It serves as a hub where researchers and engineers can access a wide variety of pre-trained models, making it a valuable resource for those looking to implement or fine-tune transformer models for specific tasks.

For fine-tuning, you need a pre-trained model that is relevant to your task. You may not always have such a model readily available, and that's where Hugging Face becomes useful. The platform allows you to easily search for and access a wide range of pre-trained models that you can then fine-tune on your own dataset. This not only saves time but also leverages the generalizability of existing models to better perform your specific task.

For detailed information on the Hugging Face library, you can check out their comprehensive [tutorials](https://huggingface.co/learn) covering various topics. These tutorials provide valuable insights into how to utilize pre-trained models, fine-tune them for specific tasks, and even create your own models using their framework.





In [14]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("ag_news", split='train[:10%]')  # use only the first 10% of the training split
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [15]:
def tokenize_function(example):
    return tokenizer(example["text"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
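
The tokenizer above is called without padding, so each example keeps its own length, and `DataCollatorWithPadding` pads every batch to its longest sequence when the batch is assembled. The idea can be sketched in plain PyTorch; the token ids and the padding id 0 below are illustrative assumptions, not output of the real tokenizer.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical token-id sequences of different lengths, as a tokenizer might produce.
batch = [torch.tensor([101, 7592, 102]),
         torch.tensor([101, 2023, 2003, 2146, 102])]

# Pad to the longest sequence in this batch only (padding id 0 is an assumption).
input_ids = pad_sequence(batch, batch_first=True, padding_value=0)
# The attention mask marks real tokens with 1 and padding with 0.
attention_mask = (input_ids != 0).long()

print(input_ids.shape)    # torch.Size([2, 5])
print(attention_mask[0])  # tensor([1, 1, 1, 0, 0])
```

Padding per batch rather than to a global maximum length avoids spending computation on padding tokens when most texts are short.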

In [16]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=4)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [17]:
# Freeze the pre-trained BERT body; only the new classification head remains trainable
for param in model.bert.parameters():
    param.requires_grad = False
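
To see what remains trainable after this loop, the parameters can be listed by name. The sketch below builds a deliberately tiny, randomly initialised BERT from a `BertConfig` (the sizes are hypothetical) so that nothing has to be downloaded; the same check should apply to the real checkpoint.

```python
from transformers import BertConfig, BertForSequenceClassification

# A tiny, randomly initialised BERT with hypothetical sizes; no checkpoint is downloaded.
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64, num_labels=4)
tiny_model = BertForSequenceClassification(config)

# Freeze the BERT body, as in the cell above.
for param in tiny_model.bert.parameters():
    param.requires_grad = False

trainable_names = [name for name, p in tiny_model.named_parameters() if p.requires_grad]
n_trainable = sum(p.numel() for p in tiny_model.parameters() if p.requires_grad)
n_total = sum(p.numel() for p in tiny_model.parameters())

print(trainable_names)  # ['classifier.weight', 'classifier.bias']
print(n_trainable)      # 32 * 4 + 4 = 132
```

For `bert-base-uncased` with a hidden size of 768 and four labels, the same arithmetic gives 768 * 4 + 4 = 3076 trainable parameters, a very small fraction of the whole model.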

In [18]:
model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [19]:
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test-trainer", num_train_epochs=2, evaluation_strategy="epoch")

In [20]:
print(training_args)

TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=IntervalStrategy.EPOCH,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_pu

In [23]:
import evaluate
metric = evaluate.load('accuracy')
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
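
What `compute_metrics` calculates can be reproduced with NumPy alone: take the highest-scoring class per example and compare it with the label. The logits and labels below are made-up values for illustration.

```python
import numpy as np

# Hypothetical logits for 4 examples and 4 classes, plus their true labels.
logits = np.array([[2.0, 0.1, 0.3, 0.2],
                   [0.1, 1.5, 0.2, 0.4],
                   [0.3, 0.2, 0.1, 0.9],
                   [1.2, 0.4, 0.8, 0.1]])
labels = np.array([0, 1, 2, 0])

predictions = np.argmax(logits, axis=-1)          # index of the highest score per row
accuracy = float((predictions == labels).mean())  # fraction of correct predictions

print(predictions)  # [0 1 3 0]
print(accuracy)     # 0.75
```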

In [24]:
raw_datasets = load_dataset("ag_news", split='test[:10%]')  # use only the first 10% of the test split for evaluation
tokenized_datasets2 = raw_datasets.map(tokenize_function, batched=True)

In [25]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets,
    eval_dataset=tokenized_datasets2,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [26]:
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 1.2691 | 1.248207 | 0.5 |
| 2 | 1.2231 | 1.216287 | 0.522368 |


TrainOutput(global_step=3000, training_loss=1.273492696126302, metrics={'train_runtime': 60.0288, 'train_samples_per_second': 399.808, 'train_steps_per_second': 49.976, 'total_flos': 1086898398104064.0, 'train_loss': 1.273492696126302, 'epoch': 2.0})
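
The numbers in this output can be cross-checked with a little arithmetic. The sketch assumes the default per-device training batch size of 8 on a single GPU, and that 10% of AG News corresponds to 12,000 training and 760 test examples.

```python
import math

train_examples = 12_000  # 10% of the 120,000 AG News training examples
eval_examples = 760      # 10% of the 7,600 AG News test examples
batch_size = 8           # assumed default per-device train batch size, one GPU
epochs = 2

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(total_steps)  # 3000, matching global_step in the output above

# Accuracy after epoch 2: how many of the 760 evaluation examples were correct?
correct = round(0.522368 * eval_examples)
print(correct)  # 397
```

Since only the classification head was trained while the BERT body stayed frozen, an accuracy of about 0.52 after two epochs is plausible; unfreezing more layers or training longer would likely be needed for better results.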