<a href="https://www.kaggle.com/code/aisuko/introducing-pro-process-of-ft-a-llm-step-by-steps?scriptVersionId=160569575" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

In this notebook, we describe in detail how to modify the process of fine-tuning a pre-trained BERT model for a classification task so that it can work with long texts.

We use a BERT model that was pre-trained on a large corpus of data in a self-supervised fashion. That is, the training data consists of raw texts only, without human labeling. The model is pre-trained with two objectives: guessing the masked word in a sentence and predicting whether one sentence comes after another. Note, however, that both of these tasks are concerned only with separate sentences and not with the entire context. Hence, pre-training provides no mechanism for handling texts longer than the model's input limit.

In [1]:
!pip install transformers==4.36.2
!pip install datasets==2.15.0
# !pip install accelerate==0.25.0
# !pip install peft==0.7.1
# !pip install bitsandbytes==0.41.3

# The approach of fine-tuning

We want to adapt the model for our specific task of binary sequence classification. To do this, we use the supervised learning approach and prepare the training set of reviews manually labeled as positive or negative, and then feed it to the model with an additional classification layer on top of the model.

Furthermore, we want to modify the fine-tuning step so that the model looks at the entire text and not just the first 512 tokens. This turns out to be non-trivial.

# Load the model

If a warning appears in the output, it informs us that the downloaded model must be fine-tuned on the downstream task. This is expected.

In [2]:
from transformers import AutoModel, BitsAndBytesConfig
import torch

model_name='bert-base-uncased'

# We do not need to use GPU to run this notebook, so we don't use model quantization either.
# bnb_config=BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_use_double_quant=True,
#     bnb_4bit_quant_type='nf4',
#     bnb_4bit_compute_dtype=torch.bfloat16
# )

model=AutoModel.from_pretrained(
    model_name, 
#     quantization_config=bnb_config,
    use_cache=False,
#     device_map='auto',
    torch_dtype=torch.bfloat16
)
model.config

BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "bfloat16",
  "transformers_version": "4.36.2",
  "type_vocab_size": 2,
  "use_cache": false,
  "vocab_size": 30522
}

# Load the tokenizer

The tokenizer output below shows `model_max_length=512`, which matches `max_position_embeddings` in the model config listed above. This limit is the main obstacle we work around in this notebook. Applying the model without modification simply truncates every text to 512 tokens, so all the information and context in the rest of the document is discarded during the fine-tuning and prediction stages.

In [3]:
from transformers import AutoTokenizer

tokenizer=AutoTokenizer.from_pretrained('bert-base-uncased')
tokenizer

BertTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

# Load the dataset and find the longest review as an example

Let's pick a very long review from the dataset; it contains 2278 words. We want to split it into chunks that are small enough to fit within the 512-token limit of the BERT input.

In [4]:
from datasets import load_dataset

imdb=load_dataset('imdb')

# Demonstrate processing the data step by step

In [5]:
long_review=imdb['test']['text'][21132]
number_of_words=len(long_review.split())
number_of_words

2278

## 1. Tokenization of the whole text with an already fine-tuned BERT classifier

We assume that we already have a fine-tuned BERT classifier. Here we load the `fabriceyhc/bert-base-uncased-imdb` checkpoint.

In [6]:
from transformers import BertForSequenceClassification, BertTokenizer
import torch

tokenizer=BertTokenizer.from_pretrained('fabriceyhc/bert-base-uncased-imdb')
model = BertForSequenceClassification.from_pretrained(
    'fabriceyhc/bert-base-uncased-imdb', 
#     device_map='auto'
)
tokenizer

BertTokenizer(name_or_path='fabriceyhc/bert-base-uncased-imdb', vocab_size=30522, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [7]:
tokens=tokenizer(
    long_review,
    # We will add special tokens at the beginning and the end manually after the splitting procedure.
    add_special_tokens=False,
    # we do not want to throw away any part of the text
    truncation=False,
    # get the result in the form of the torch Tensor
    return_tensors='pt'
)
tokens

Token indices sequence length is longer than the specified maximum sequence length for this model (3155 > 512). Running this sequence through the model will result in indexing errors


{'input_ids': tensor([[2045, 1005, 1055,  ..., 1997, 2184, 1012]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1]])}

The warning informs us that the tokenized sequence is too long: after tokenization we obtained 3155 tokens, significantly more than the 2278 words of the review. If we pass such a tensor directly to the model, it will not work.


## Why we need to preprocess the data

**Note: the error below is the reason we need to preprocess the tokenized data before passing it to the model.**

In [8]:
try:
    prediction=model(**tokens)
except RuntimeError as e:
    print(e)

The size of tensor a (3155) must match the size of tensor b (512) at non-singleton dimension 1


## 2. What are the tokens?

Let us now take a look at what exactly are these tokens we are referring to.

From the tokenized text below, we can see the following keys:

* `input_ids` - this part is crucial: it encodes the tokens as integers. It can also contain special tokens indicating the beginning (value 101) and the end of the text (value 102). We will add them manually after the splitting procedure.

* `token_type_ids` - this binary tensor is used to separate the two segments of an input, such as a question and an answer, in some specific applications of BERT. Because we are interested only in the classification task, we can ignore this part.

* `attention_mask` - this binary tensor marks real tokens with 1 and padded positions with 0. Later we will manually add zeros there so that all chunks have exactly the required size of 512.

In [9]:
example=['the man went to the store and bought a gallon of milk']
tokens=tokenizer(example, add_special_tokens=False, truncation=False, return_tensors='pt')
tokens

{'input_ids': tensor([[ 1996,  2158,  2253,  2000,  1996,  3573,  1998,  4149,  1037, 25234,
          1997,  6501]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [10]:
tokens=tokenizer(example, add_special_tokens=True, truncation=False, return_tensors='pt')
tokens

{'input_ids': tensor([[  101,  1996,  2158,  2253,  2000,  1996,  3573,  1998,  4149,  1037,
         25234,  1997,  6501,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
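
To make the ids above more concrete, the following sketch pairs each word of the example sentence with the id printed in the first output. It is only an illustration: the one-to-one pairing works here because the sentence has 12 words and 12 ids, whereas longer or rarer words are generally split into several tokens. The names `words`, `ids`, `id_to_token`, and `with_special` are introduced here and are not part of the notebook.

```python
# Ids copied from the tokenizer output above (add_special_tokens=False).
words = "the man went to the store and bought a gallon of milk".split()
ids = [1996, 2158, 2253, 2000, 1996, 3573, 1998, 4149, 1037, 25234, 1997, 6501]

# 12 words and 12 ids, so each word is exactly one token in this sentence.
assert len(words) == len(ids)
id_to_token = dict(zip(ids, words))

# Special token ids as listed in the tokenizer's added_tokens_decoder.
CLS_ID, SEP_ID = 101, 102
id_to_token.update({CLS_ID: "[CLS]", SEP_ID: "[SEP]"})

# With add_special_tokens=True the sequence is wrapped in [CLS] ... [SEP].
with_special = [CLS_ID] + ids + [SEP_ID]
print([id_to_token[i] for i in with_special])
```

The wrapped sequence has 14 entries, which matches the length of the second output above.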

## 3. Splitting the tokens

To fit the tokens into the model, we need to split them into chunks with a length of 512 tokens or less. However, we also need to put 2 special tokens at the beginning and the end of each chunk; hence the upper bound for the chunk content is 510 tokens.

The procedure of splitting and pooling is determined by the hyperparameters. These are:
* maximal_text_length
* chunk_size
* stride
* minimal_chunk_length
* pooling_strategy


**maximal_text_length**

It is used to truncate the tokens. It can be either `None`, which means no truncation, or an integer, determining the number of tokens to consider. Standard BERT truncates to 510 tokens because it needs 2 additional tokens at the beginning and the end.

**chunk_size**

It is an integer that determines the size (in number of tokens) of each chunk. This parameter cannot be larger than 510; otherwise, the chunk will not fit into the input of BERT.

**stride**

Chunks may overlap depending on the `stride` parameter. We get chunks by moving a window of size `chunk_size` by a step equal to `stride`. The stride cannot be bigger than the chunk size, so consecutive chunks either overlap or are adjacent.
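
As a rough illustration of how the window moves, the sketch below lists the start and end positions of each window for a given text length. The helper `chunk_spans` is hypothetical and works on plain integers rather than tensors; it only mirrors the slicing pattern `tensor[i:i+chunk_size]` with `i` advancing by `stride`.

```python
def chunk_spans(n_tokens, chunk_size, stride):
    """Return (start, end) index pairs of the sliding windows, end exclusive."""
    if not 0 < stride <= chunk_size:
        raise ValueError("stride must be between 1 and chunk_size")
    return [
        (start, min(start + chunk_size, n_tokens))
        for start in range(0, n_tokens, stride)
    ]

# 14 tokens, the length of the short example sentence with [CLS] and [SEP].
spans = chunk_spans(14, chunk_size=5, stride=3)
print(spans)  # [(0, 5), (3, 8), (6, 11), (9, 14), (12, 14)]

# Consecutive full windows share chunk_size - stride tokens.
overlap = spans[0][1] - spans[1][0]
print(overlap)  # 2
```

With `stride` equal to `chunk_size` the overlap drops to zero and the windows are adjacent.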

**minimal_chunk_length**

We ignore chunks that are too small, that is, smaller than `minimal_chunk_length` (unless the text yields only a single chunk).

**pooling_strategy**

The string parameter `pooling_strategy` is used at the end to aggregate the model results. It can be either `mean` or `max`.
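
One possible way to keep these five hyperparameters together and to check the constraints described above is a small configuration object. This is a sketch under assumptions: the class `SplitConfig`, the constant `BERT_MAX_CHUNK`, and the default values are hypothetical and are not used elsewhere in this notebook.

```python
from dataclasses import dataclass
from typing import Optional

BERT_MAX_CHUNK = 510  # 512 input positions minus [CLS] and [SEP]

@dataclass(frozen=True)
class SplitConfig:
    """Hypothetical container for the splitting and pooling hyperparameters."""
    chunk_size: int = 510
    stride: int = 510
    minimal_chunk_length: int = 1
    maximal_text_length: Optional[int] = None  # None means no truncation
    pooling_strategy: str = "mean"

    def __post_init__(self):
        # Each check corresponds to one constraint stated in the text above.
        if not 1 <= self.chunk_size <= BERT_MAX_CHUNK:
            raise ValueError("chunk_size must be between 1 and 510")
        if not 1 <= self.stride <= self.chunk_size:
            raise ValueError("stride must be between 1 and chunk_size")
        if self.minimal_chunk_length < 1:
            raise ValueError("minimal_chunk_length must be at least 1")
        if self.maximal_text_length is not None and self.maximal_text_length < 1:
            raise ValueError("maximal_text_length must be None or a positive integer")
        if self.pooling_strategy not in ("mean", "max"):
            raise ValueError("pooling_strategy must be 'mean' or 'max'")

config = SplitConfig(chunk_size=510, stride=255, pooling_strategy="max")
print(config)
```

Validating the values once, at construction time, means an invalid combination such as `chunk_size=511` fails early instead of producing a shape error inside the model.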

In [11]:
from torch import Tensor

def split_overlapping(tensor, chunk_size, stride, minimal_chunk_length=5):
    result=[tensor[i:i+chunk_size] for i in range(0, len(tensor), stride)]
    if len(result)>1:
        result=[x for x in result if len(x)>=minimal_chunk_length]
    return result

example_tensor=tokens["input_ids"][0]
example_tensor

tensor([  101,  1996,  2158,  2253,  2000,  1996,  3573,  1998,  4149,  1037,
        25234,  1997,  6501,   102])

In [12]:
splitted=split_overlapping(example_tensor, chunk_size=5, stride=5, minimal_chunk_length=5)
splitted

[tensor([ 101, 1996, 2158, 2253, 2000]),
 tensor([1996, 3573, 1998, 4149, 1037])]

In [13]:
splitted=split_overlapping(example_tensor, chunk_size=5, stride=3, minimal_chunk_length=5)
splitted

[tensor([ 101, 1996, 2158, 2253, 2000]),
 tensor([2253, 2000, 1996, 3573, 1998]),
 tensor([ 3573,  1998,  4149,  1037, 25234]),
 tensor([ 1037, 25234,  1997,  6501,   102])]

In [14]:
splitted=split_overlapping(example_tensor, chunk_size=5, stride=3, minimal_chunk_length=3)
splitted

[tensor([ 101, 1996, 2158, 2253, 2000]),
 tensor([2253, 2000, 1996, 3573, 1998]),
 tensor([ 3573,  1998,  4149,  1037, 25234]),
 tensor([ 1037, 25234,  1997,  6501,   102])]
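
The last two outputs are identical because the trailing window contains only two tokens, which is below both thresholds (5 and 3). The sketch below reproduces this with plain Python lists; `split_overlapping_list` is a hypothetical list-based counterpart of `split_overlapping`, written only to show the chunk lengths for different thresholds.

```python
def split_overlapping_list(items, chunk_size, stride, minimal_chunk_length):
    """List-based counterpart of split_overlapping, for illustration only."""
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), stride)]
    # As in the tensor version, short chunks are dropped only if several exist.
    if len(chunks) > 1:
        chunks = [c for c in chunks if len(c) >= minimal_chunk_length]
    return chunks

# The 14 ids of the example sentence, including [CLS] (101) and [SEP] (102).
ids = [101, 1996, 2158, 2253, 2000, 1996, 3573, 1998, 4149, 1037, 25234, 1997, 6501, 102]

lengths = {}
for threshold in (5, 3, 2):
    chunks = split_overlapping_list(ids, chunk_size=5, stride=3, minimal_chunk_length=threshold)
    lengths[threshold] = [len(c) for c in chunks]
    print(threshold, lengths[threshold])
# 5 [5, 5, 5, 5]
# 3 [5, 5, 5, 5]
# 2 [5, 5, 5, 5, 2]
```

Only a threshold of 2 or less keeps the final two-token chunk.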

## 4. Adding special tokens

After splitting into smaller chunks:

1. We must add special tokens at the beginning and the end
2. We must add some padding tokens to ensure that all chunks have a size of precisely 512
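
The two steps can be sketched on plain Python lists before looking at the tensor version. The helper `frame_and_pad` and the toy target length of 8 are assumptions made for readability; the notebook itself pads to 512.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def frame_and_pad(chunk_ids, total_length):
    """Wrap a chunk in [CLS]/[SEP], then right-pad ids and mask to total_length."""
    ids = [CLS_ID] + list(chunk_ids) + [SEP_ID]
    if len(ids) > total_length:
        raise ValueError("chunk plus special tokens exceeds total_length")
    mask = [1] * len(ids)  # real tokens, including the special ones
    pad_len = total_length - len(ids)
    return ids + [PAD_ID] * pad_len, mask + [0] * pad_len

# Toy length of 8 instead of 512 so the result is easy to read.
ids, mask = frame_and_pad([1996, 2158, 2253], total_length=8)
print(ids)   # [101, 1996, 2158, 2253, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The attention mask is 1 for the special tokens as well, and 0 only for the padded positions.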

In [15]:
def add_special_tokens_at_beginning_and_end(input_id_chunks, mask_chunks):
    """
    Adds special CLS token (token id = 101) at the beginning.
    Adds SEP token (token id = 102) at the end of each chunk.
    Adds corresponding attention masks equal to 1 (attention mask is boolean).
    The lists are modified in place.
    """
    for i in range(len(input_id_chunks)):
        # adding CLS (token id 101) and SEP (token id 102) tokens;
        # torch.tensor keeps the ids as integers (torch.Tensor would create floats)
        input_id_chunks[i]=torch.cat([torch.tensor([101]), input_id_chunks[i], torch.tensor([102])])
        # adding attention masks corresponding to special tokens
        mask_chunks[i]=torch.cat([torch.tensor([1]), mask_chunks[i], torch.tensor([1])])

def add_padding_tokens(input_id_chunks, mask_chunks):
    """
    Adds padding tokens (token id = 0) at the end to make sure that all chunks have exactly 512 tokens.
    The lists are modified in place.
    """
    for i in range(len(input_id_chunks)):
        # get required padding length
        pad_len=512-input_id_chunks[i].shape[0]
        # pad only if the chunk is shorter than 512 tokens
        if pad_len>0:
            # integer zeros serve both as padding ids and as mask values
            padding=torch.zeros(pad_len, dtype=torch.long)
            input_id_chunks[i]=torch.cat([input_id_chunks[i], padding])
            mask_chunks[i]=torch.cat([mask_chunks[i], padding])

## 5. Stacking the tensors

After applying this procedure to a single text, `input_ids` is a list of K tensors of size 512, where K is the number of chunks. To feed this into the BERT model, we must stack these K tensors into one tensor of size K x 512 and ensure that the tensor values have the appropriate type.

In [16]:
def stack_tokens_from_all_chunks(input_id_chunks, mask_chunks):
    """
    Reshapes data to a form compatible with BERT model input.
    """
    input_ids=torch.stack(input_id_chunks)
    attention_mask=torch.stack(mask_chunks)
    
    return input_ids.long(), attention_mask.int()

# Wrapping all into one preprocess function

In [17]:
def tokenize_whole_text(text, tokenizer):
    """Tokenizes the entire text without truncation and without special tokens."""
    tokens = tokenizer(text, add_special_tokens=False, truncation=False, return_tensors="pt")
    return tokens


def tokenize_text_with_truncation(text, tokenizer, maximal_text_length):
    """Tokenizes the text with truncation to maximal_text_length and without special tokens."""
    tokens = tokenizer(
        text, add_special_tokens=False, max_length=maximal_text_length, truncation=True, return_tensors="pt"
    )
    return tokens

def split_tokens_into_smaller_chunks(
    tokens,
    chunk_size,
    stride,
    minimal_chunk_length,
):
    """Splits tokens into overlapping chunks with given size and stride."""
    input_id_chunks = split_overlapping(tokens["input_ids"][0], chunk_size, stride, minimal_chunk_length)
    mask_chunks = split_overlapping(tokens["attention_mask"][0], chunk_size, stride, minimal_chunk_length)
    return input_id_chunks, mask_chunks

def preprocess_func(
    text,
    tokenizer,
    chunk_size,
    stride,
    minimal_chunk_length,
    maximal_text_length
):
    """Transforms (the entire) text to model input of BERT model."""
    if maximal_text_length:
        tokens=tokenize_text_with_truncation(text, tokenizer, maximal_text_length)
    else:
        tokens=tokenize_whole_text(text, tokenizer)
    
    input_id_chunks, mask_chunks=split_tokens_into_smaller_chunks(tokens, chunk_size, stride, minimal_chunk_length)
    add_special_tokens_at_beginning_and_end(input_id_chunks, mask_chunks)
    add_padding_tokens(input_id_chunks, mask_chunks)
    input_ids, attention_mask=stack_tokens_from_all_chunks(
        input_id_chunks,
        mask_chunks
    )
    return input_ids, attention_mask

In [18]:
input_ids, attention_mask = preprocess_func(long_review, tokenizer, 510, 510, 1, None)
print(input_ids)
print(attention_mask)
print(input_ids.shape)

tensor([[  101,  2045,  1005,  ...,  7367, 29287,   102],
        [  101,  1010,  1037,  ...,  1028,  1026,   102],
        [  101,  7987,  1013,  ...,  1010, 26577,   102],
        ...,
        [  101,  2227,  3544,  ...,  2014,  2125,   102],
        [  101,  1010,  2016,  ...,  2014, 10069,   102],
        [  101,  1012,  1026,  ...,     0,     0,     0]])
tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        ...,
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 0, 0, 0]], dtype=torch.int32)
torch.Size([7, 512])
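
The shape `[7, 512]` can be checked by hand from the 3155 tokens reported in the tokenizer warning. The short calculation below assumes the settings used in the call above (`chunk_size=510`, `stride=510`, `minimal_chunk_length=1`, no truncation); the variable names are introduced here for illustration.

```python
import math

n_tokens = 3155           # reported in the tokenizer warning above
chunk_size = stride = 510

# One window starts every `stride` tokens, so the count is a ceiling division.
n_chunks = math.ceil(n_tokens / stride)
last_chunk_tokens = n_tokens - (n_chunks - 1) * stride
padding_in_last_row = 512 - (last_chunk_tokens + 2)  # +2 for [CLS] and [SEP]

print(n_chunks)             # 7
print(last_chunk_tokens)    # 95
print(padding_in_last_row)  # 415
```

This agrees with the printed tensors: six full rows ending in 102, and a final row that ends in zeros.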


# Using the fine-tuned model on the prepared data

The prepared data is ready to plug into our fine-tuned classifier:

In [19]:
model_output=model(input_ids, attention_mask)
probs=torch.nn.functional.softmax(model_output[0], dim=-1)
probabilities=probs[:,-1]
print(probabilities.mean())
print(probabilities.max())

tensor(0.9335, grad_fn=<MeanBackward0>)
tensor(0.9997, grad_fn=<MaxBackward1>)


# Summary

* The fine-tuned model returned logit values for each chunk
* We applied the softmax and slicing to get the probability that the review is positive
* We obtained the list of probabilities, one for each chunk: `[0.9997, 0.9996, 0.5399, 0.9994, 0.9995, 0.9975, 0.9987]`
* Finally, we can apply some pooling function (mean or maximum) to obtain one aggregated probability for the entire review
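
The pooling step can be sketched in a few lines of plain Python. The function `pool` is a hypothetical helper, and the input is the list of per-chunk probabilities quoted above, which is rounded to four decimals, so the mean agrees with the printed tensor value only up to that rounding.

```python
from statistics import mean

def pool(probabilities, pooling_strategy="mean"):
    """Aggregate per-chunk probabilities into one score for the whole text."""
    if not probabilities:
        raise ValueError("at least one chunk probability is required")
    if pooling_strategy == "mean":
        return mean(probabilities)
    if pooling_strategy == "max":
        return max(probabilities)
    raise ValueError("pooling_strategy must be 'mean' or 'max'")

chunk_probs = [0.9997, 0.9996, 0.5399, 0.9994, 0.9995, 0.9975, 0.9987]
print(round(pool(chunk_probs, "mean"), 4))  # 0.9335
print(pool(chunk_probs, "max"))             # 0.9997
```

Mean pooling lets a single uncertain chunk (here 0.5399) lower the overall score, while max pooling ignores it entirely; which behaviour is preferable depends on the task.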


# Credit

* https://medium.com/mim-solutions-blog/fine-tuning-bert-model-for-arbitrarily-long-texts-part-1-299f1533b976
* https://github.com/mim-solutions/bert_for_longer_texts/blob/fb889e220fa22bc379d279b9efd312d7061e8a2c/belt_nlp/splitting.py