# Fine-tuning BERT

In this section, we first give a brief overview of BERT and of how it is fine-tuned. Then we:
- fine-tune a pretrained BERT model on the 'ag_news' dataset with masked language modeling, a process known as domain adaptation.
- fine-tune the domain-adapted model further on the specific classification task.
- compare the accuracy of this model with that of a BERT model fine-tuned directly on the classification task.

## BERT or Bidirectional Encoder Representations from Transformers    

BERT, or Bidirectional Encoder Representations from Transformers, is a deep learning model for natural language processing (NLP) tasks. Google introduced it in 2018, and it was a major breakthrough in the field.

BERT has a **transformer architecture**, a specific type of deep learning model that uses **self-attention** mechanisms. The transformer model **learns contextual relationships** between words in a text. In contrast to previous models such as LSTM (Long Short-Term Memory) that read text input sequentially (either from left-to-right or right-to-left), **BERT reads the entire sequence of words at once**, which is why it's considered **bidirectional**.    

This bidirectional approach allows BERT to understand the context and meaning of a word based on all of its surroundings (left and right of the word). For example, in the sentence "I picked up a pen to write", the word "write" informs the model about the meaning of "pen". This feature makes BERT particularly effective for NLP tasks that require understanding context, including sentiment analysis, named entity recognition, and question answering among others.     
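
To make the idea of attending to both left and right context concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is a simplification, not BERT's implementation: the learned query/key/value projections and multiple heads are omitted, and the 2-dimensional embeddings are hypothetical values chosen only for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections.

    Real transformers apply learned query/key/value matrices first;
    they are left out here to keep the sketch short.
    """
    dim = len(vectors[0])
    weights, outputs = [], []
    for query in vectors:
        # Score this position against every position, to its left and right.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        row = softmax(scores)
        weights.append(row)
        # The output is the attention-weighted average of all input vectors.
        outputs.append(
            [sum(w * v[d] for w, v in zip(row, vectors)) for d in range(dim)]
        )
    return weights, outputs

# Hypothetical 2-d embeddings for the tokens of "pen to write".
tokens = ["pen", "to", "write"]
embeddings = [[1.0, 0.2], [0.1, 0.1], [0.9, 0.4]]
attention_weights, contextual = self_attention(embeddings)
for token, row in zip(tokens, attention_weights):
    print(token, [round(w, 2) for w in row])
```

Every row of `attention_weights` has a non-zero entry for every token, so the representation of "pen" is influenced by "write" even though "write" appears later in the sequence.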

BERT is pretrained on a large amount of text data using **two training strategies**:

- **Masked Language Model (MLM)**: In this strategy, a fraction of the input tokens (15% in the original BERT setup) is masked (hidden) at random, and the model is trained to predict the masked tokens from the context provided by the non-masked tokens.

- **Next Sentence Prediction (NSP)**: In this strategy, the model is trained to predict whether one sentence follows another in a given text, so that it learns the relationship between sentences.
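
The masked language modeling strategy can be sketched in a few lines of plain Python. This is a simplified illustration, not the procedure used by any particular library: the function name `mask_tokens` and the example sentence are made up, and every selected token is replaced by `[MASK]`, whereas the original BERT recipe replaces only 80% of the selected tokens with `[MASK]`, swaps 10% for a random token, and leaves 10% unchanged.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_probability=0.15, seed=0):
    """Return (masked_tokens, labels) for one masked-language-model example.

    labels[i] holds the original token where a mask was applied and None
    elsewhere, so a loss would only be computed on the masked positions.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_probability:
            masked.append(MASK_TOKEN)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels

sentence = "the central bank raised interest rates again on tuesday".split()
masked, labels = mask_tokens(sentence, mask_probability=0.15, seed=0)
print(masked)
print(labels)
```

The model only receives `masked` as input and is scored on how well it recovers the non-`None` entries of `labels`.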

After this pretraining, a BERT model can be **fine-tuned on a specific task with a smaller amount of data** because it has already learned useful representations of language from the pretraining stage. This fine-tuning is done by adding an extra output layer that matches the task, and then training the entire model on the specific task.      
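
The "extra output layer" is typically just a linear layer followed by a softmax over the task's classes. The sketch below shows the shape of that computation in plain Python. It makes several assumptions: the pooled vector is random and only stands in for the encoder output, the helper names `init_head` and `classify` are hypothetical, and the weights are untrained, so the resulting probabilities carry no meaning beyond illustrating the mechanics.

```python
import math
import random

HIDDEN_SIZE = 768   # hidden size of bert-base models
NUM_LABELS = 4      # number of classes used for ag_news in this notebook

def init_head(hidden_size, num_labels, seed=0):
    """Create small random weights and zero biases for a linear output layer."""
    rng = random.Random(seed)
    weights = [
        [rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
        for _ in range(num_labels)
    ]
    biases = [0.0] * num_labels
    return weights, biases

def classify(pooled, weights, biases):
    """Map one pooled sentence vector to class probabilities."""
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, biases)
    ]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical pooled encoder output standing in for BERT's sentence vector.
rng = random.Random(1)
pooled_output = [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN_SIZE)]
weights, biases = init_head(HIDDEN_SIZE, NUM_LABELS)
probabilities = classify(pooled_output, weights, biases)
print([round(p, 3) for p in probabilities])
```

During fine-tuning, both this new layer and the pretrained encoder weights are updated on the labeled task data.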

Provided that the corpus used for pretraining is not too different from the corpus used for fine-tuning, transfer learning will usually produce good results.

### Fine-tuning on Task

For many NLP applications involving Transformer models, you can simply take a pretrained BERT and fine-tune it directly on your data **for the task at hand**. For example, here we use it for a news classification task with our labeled data.

### Fine-tuning on dataset (domain adaptation)

In some cases it is preferable to first adapt the language model to your data before training a task-specific head. For instance, if your dataset consists of legal contracts or scientific articles, a standard Transformer model such as BERT will often treat the domain-specific words in your corpus as infrequent tokens, which can lead to subpar performance. By fine-tuning the language model on data from the same domain, you can improve performance on many downstream tasks. Because the adapted language model can be reused across those tasks, you typically only need to do this step once. This process of **fine-tuning** a pretrained language model on **in-domain data** is usually called **domain adaptation**.
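
One way to see why domain-specific words are hard for a general-purpose model is to look at how a WordPiece-style tokenizer handles them. The sketch below implements the greedy longest-match-first splitting that WordPiece tokenizers use. The vocabulary `toy_vocab` is hypothetical and tiny; the real `bert-base-uncased` vocabulary may split these words differently.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word.

    Pieces that continue a word carry the "##" prefix, as in BERT vocabularies.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"contract", "in", "##dem", "##ni", "##fication", "##s"}
print(wordpiece("contract", toy_vocab))         # ['contract']
print(wordpiece("indemnification", toy_vocab))  # ['in', '##dem', '##ni', '##fication']
```

A common word maps to a single token, while the legal term is broken into several fragments that the pretrained model has rarely seen in this combination. Domain adaptation gives the model a chance to learn useful representations for such sequences.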

## Fine-tuning a pretrained BERT model on the 'ag_news' dataset using masked language modeling (domain adaptation)

In [6]:
from transformers import BertTokenizerFast, BertForMaskedLM, DataCollatorForLanguageModeling
from datasets import load_dataset
from transformers import Trainer, TrainingArguments

# Load the dataset
raw_datasets = load_dataset("ag_news")
train_subset = raw_datasets['train'].select(range(1000))
texts = [example['text'] for example in train_subset]

# Tokenize the texts
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
inputs = tokenizer(texts, truncation=True, padding=True)

# Prepare for masked language modeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# Convert inputs to a Dataset
from torch.utils.data import Dataset

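class MyDataset(Dataset):
    """Wrap the tokenizer output so the Trainer can index individual examples."""

    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        # Return one example as a dict with one entry per tokenizer output field
        return {key: values[idx] for key, values in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)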

train_dataset = MyDataset(inputs)

# Load pre-trained model
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

# Specify the training arguments
training_args = TrainingArguments(
    output_dir="./MyBERT",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=8,
    save_steps=100,  # This run has 125 steps, so a checkpoint is written at step 100 (checkpoint-100); decrease this if your dataset is small
    save_total_limit=2,
)

# Create a Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)

# Train the model
trainer.train()


Found cached dataset ag_news (/Users/mahnaz/.cache/huggingface/datasets/ag_news/default/0.0.0/bc2bcb40336ace1a0374767fc29bb0296cdaf8a6da7298436239c54d79180548)
100%|██████████| 2/2 [00:00<00:00, 154.40it/s]
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
  0%|          | 0/125 [00:00<?, ?it/s]You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster

{'train_runtime': 1563.3703, 'train_samples_per_second': 0.64, 'train_steps_per_second': 0.08, 'train_loss': 2.811888916015625, 'epoch': 1.0}





TrainOutput(global_step=125, training_loss=2.811888916015625, metrics={'train_runtime': 1563.3703, 'train_samples_per_second': 0.64, 'train_steps_per_second': 0.08, 'train_loss': 2.811888916015625, 'epoch': 1.0})
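
The numbers in this log can be cross-checked with a little arithmetic. The sketch below assumes training on a single device, which is consistent with the 125 steps reported. The exponential of the loss is the perplexity on the masked training tokens only; it is not a held-out evaluation score.

```python
import math

num_samples = 1000               # size of train_subset
batch_size = 8                   # per_device_train_batch_size
train_runtime = 1563.3703        # seconds, from the log above
train_loss = 2.811888916015625   # average MLM loss, from the log above

steps_per_epoch = math.ceil(num_samples / batch_size)
samples_per_second = num_samples / train_runtime
train_perplexity = math.exp(train_loss)

print(steps_per_epoch)               # 125
print(round(samples_per_second, 2))  # 0.64
print(round(train_perplexity, 1))
```

The step count and throughput match `global_step=125` and `train_samples_per_second: 0.64` in the output, and the training perplexity comes out at about 16.6.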

## Fine-tuning a pretrained BERT model specifically for the classification task

The fine-tuning of BERT directly on the task can be found in initial_training.ipynb.

## Fine-tuning the domain-adapted model on the specific task

In [12]:
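import os
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

# `load_metric` was removed from recent `datasets` releases; fall back to the `evaluate` package
try:
    from datasets import load_metric
except ImportError:
    from evaluate import load as load_metric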


def load_data_and_model(dataset_name, model_name, num_labels):
    raw_datasets = load_dataset(dataset_name)
    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased') # Load the original tokenizer
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels) # Load the domain-adapted encoder; the MLM head is dropped for a classification head
    return raw_datasets, tokenizer, model


def tokenize_data(raw_datasets, tokenizer):
    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True)
    
    tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
    small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000)) 
    small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000)) 
    full_train_dataset = tokenized_datasets["train"]
    full_eval_dataset = tokenized_datasets["test"]
    
    return small_train_dataset, small_eval_dataset, full_train_dataset, full_eval_dataset

def get_training_args(path, evaluation_strategy):
    return TrainingArguments(path, evaluation_strategy=evaluation_strategy)

def get_metric(metric_name):
    return load_metric(metric_name)

def compute_metrics(eval_pred, metric):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

def get_trainer(model, args, train_dataset, eval_dataset, compute_metrics):
    return Trainer(model=model, args=args, train_dataset=train_dataset, eval_dataset=eval_dataset, compute_metrics=compute_metrics)

def evaluate_model(trainer):
    return trainer.evaluate()

def save_model(trainer, model_path):
    if not os.path.exists(model_path):
        os.makedirs(model_path)
        
    trainer.save_model(model_path)

def main():
    try:
        # Load the domain-adapted checkpoint saved at step 100 of the MLM run
        raw_datasets, tokenizer, model = load_data_and_model("ag_news", "./MyBERT/checkpoint-100", 4)
    
        small_train_dataset, small_eval_dataset, full_train_dataset, full_eval_dataset = tokenize_data(raw_datasets, tokenizer)
        
        # Specify the number of epochs
        training_args = get_training_args("test_trainer", "epoch")
        training_args.num_train_epochs = 5

        metric = get_metric("accuracy")

        trainer = get_trainer(model, training_args, small_train_dataset, small_eval_dataset, lambda eval_pred: compute_metrics(eval_pred, metric))
        
        # Train the model
        trainer.train()

        eval_results = evaluate_model(trainer)
        print(eval_results)

        save_model(trainer, './models')

    except Exception as e:
        print(f"An error occurred: {e}")

if __name__ == "__main__":
    main()


Found cached dataset ag_news (/Users/mahnaz/.cache/huggingface/datasets/ag_news/default/0.0.0/bc2bcb40336ace1a0374767fc29bb0296cdaf8a6da7298436239c54d79180548)
100%|██████████| 2/2 [00:00<00:00, 416.39it/s]
Some weights of the model checkpoint at ./MyBERT/checkpoint-100 were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initia

## Comparing the differently fine-tuned BERT models

The accuracy of the BERT model fine-tuned directly on the task (see initial_training.ipynb) is 0.885. The accuracy of the BERT model that is first fine-tuned on the data and then fine-tuned on the task is  . This shows ...
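
For reference, the accuracy reported here is the fraction of evaluation examples whose highest-scoring class matches the label, which is what `compute_metrics` calculates with `np.argmax` and the accuracy metric. The sketch below reproduces that calculation in plain Python. The logits and labels are hypothetical and are not results from either model.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical logits for four examples over four classes.
logits = [
    [2.1, 0.3, -1.0, 0.2],
    [0.1, 1.7, 0.4, -0.5],
    [0.0, 0.2, 0.1, 1.9],
    [1.2, 0.9, 1.5, 0.3],
]
labels = [0, 1, 3, 0]
print(accuracy(logits, labels))  # 0.75
```

Three of the four predictions match their labels, so the accuracy is 0.75. With 1,000 evaluation examples, as in `small_eval_dataset`, a difference of 0.001 in accuracy corresponds to a single example.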

Now that we have established the models on a small sample of the dataset, we move on to fine-tuning on the entire dataset, which contains 120k samples.