# Fine-tuning a pre-trained model

The ability to adapt pre-trained models to specific tasks has revolutionized how we approach complex language problems. BERT (Bidirectional Encoder Representations from Transformers) has emerged as a particularly powerful model for a wide range of NLP tasks. The process of adapting a pre-trained model to a specific task is known as fine-tuning. Fine-tuning allows us to leverage the model's learned representations, adjusting them slightly with additional training on a smaller, task-specific dataset. 

### Objective
In this notebook, we illustrate how to use a pre-trained BERT model for a sentiment analysis task. We demonstrate the effectiveness of fine-tuning by comparing two setups on a small custom dataset: transfer learning with the pre-trained BERT layers frozen, and fine-tuning with all layers trainable. The task is to classify text into two categories: positive and negative sentiment.

#### Import Libraries

In [8]:
from transformers import BertTokenizer, BertForSequenceClassification
import torch
from torch.utils.data import DataLoader, Dataset
import torch.nn.functional as F
import torch.optim as optim

### Creating a Custom Dataset for Sentiment Analysis
First, we construct a small dataset for demonstration. This dataset consists of texts labeled for sentiment: 1 for positive and 0 for negative. Normally, you'd use a more extensive dataset for robust model training.


In [9]:
class ClassificationDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, item):
        text = self.texts[item]
        label = self.labels[item]

        encoding = self.tokenizer.encode_plus(
          text,
          add_special_tokens=True,
          max_length=self.max_len,
          return_token_type_ids=False,
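          padding='max_length',  # replaces the deprecated pad_to_max_length=True
          truncation=True,       # explicitly truncate texts longer than max_length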
          return_attention_mask=True,
          return_tensors='pt',
        )

        return {
          'review_text': text,
          'input_ids': encoding['input_ids'].flatten(),
          'attention_mask': encoding['attention_mask'].flatten(),
          'labels': torch.tensor(label, dtype=torch.long)
        }

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
texts = [
        'I love this product!',
        'Absolutely wonderful service.',
        'Not what I expected, sadly.',
        'The experience was bad, very bad.',
        'Fantastic! Will come again.',
        'Do not recommend.',
        'Great value for the money.',
        'Worst purchase I ever made.',
        'Happy with my purchase!',
        'Terrible, I hated it.'
    ]
labels = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0] # 1 for positive, 0 for negative sentiment
dataset = ClassificationDataset(texts, labels, tokenizer)
loader = DataLoader(dataset, batch_size=2, shuffle=True)
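
To make the `input_ids` and `attention_mask` fields concrete, here is a minimal pure-Python sketch of what padding to a fixed length produces. It is not the tokenizer's actual implementation: `pad_and_mask` is a hypothetical helper, and the three word ids in the example are placeholders. The values 101, 102 and 0 are the `[CLS]`, `[SEP]` and `[PAD]` ids in the `bert-base-uncased` vocabulary.

```python
# Special-token ids from the bert-base-uncased vocabulary.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def pad_and_mask(token_ids, max_len):
    """Wrap ids in [CLS]/[SEP], truncate to max_len, then pad and build the mask."""
    # Leave room for the two special tokens when truncating.
    body = token_ids[: max_len - 2]
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)        # 1 marks a real token
    n_pad = max_len - len(ids)
    ids += [PAD_ID] * n_pad      # pad on the right
    mask += [0] * n_pad          # 0 marks padding, which attention ignores
    return ids, mask

# Placeholder word ids, chosen only for illustration.
ids, mask = pad_and_mask([1045, 2293, 2023], max_len=8)
print(ids)   # [101, 1045, 2293, 2023, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Every item returned by `ClassificationDataset` has the same length (`max_len`), which is what lets the `DataLoader` stack items into a batch.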

### Transfer Learning by Freezing Layers of BERT
In this notebook, transfer learning means using the pre-trained model as a fixed feature extractor: BERT encodes our texts, and only the newly added classification head is trained on those representations. Freezing the pre-trained layers keeps their weights fixed during training, which reduces the risk of overfitting on a small dataset and lowers the training cost. In the code below, every parameter under `model.bert` is frozen, so only the classifier layer receives weight updates.

In [10]:
# Initialize the model for transfer learning
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Freezing Layers of BERT from weight updates
for param in model.bert.parameters():
    param.requires_grad = False

optimizer = optim.Adam(model.parameters(), lr=2e-5)

def compute_accuracy(predictions, labels):
    _, preds = torch.max(predictions, dim=1)
    correct = (preds == labels).float()
    accuracy = correct.sum() / len(correct)
    return accuracy * 100

# Transfer Learning Training Loop
model.train()  # Set the model to training mode
for epoch in range(4):  # Assume 4 epochs for simplicity
    model.train()
    for batch in loader:
        optimizer.zero_grad()
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['labels']
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

    # Compute accuracy
    model.eval()
    total_accuracy = 0
    with torch.no_grad():
        for batch in loader:
            input_ids = batch['input_ids']
            attention_mask = batch['attention_mask']
            labels = batch['labels']
            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            total_accuracy += compute_accuracy(logits, labels).item()

    avg_accuracy = total_accuracy / len(loader)
    print(f'Epoch {epoch+1}, Loss: {loss.item()}, Accuracy: {avg_accuracy:.2f}%')
print(f'Accuracy after Transfer Learning: {avg_accuracy:.2f}%')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Epoch 1, Loss: 0.7083183526992798, Accuracy: 50.00%
Epoch 2, Loss: 0.5474739074707031, Accuracy: 50.00%
Epoch 3, Loss: 0.8631421327590942, Accuracy: 50.00%
Epoch 4, Loss: 0.7385900616645813, Accuracy: 50.00%
Accuracy after Transfer Learning: 50.00%


We observe a constant accuracy of 50% across all epochs, which is no better than random guessing on this balanced binary classification task. The main factor is that the BERT layers are frozen: their pre-trained weights are not updated, so only the randomly initialized classifier head can adapt to the custom dataset. With a learning rate of 2e-5 and only 20 optimizer steps (5 batches per epoch over 4 epochs), that head likely moves very little from its initial values. Note also that the loss printed for each epoch is the loss of the last batch only, which is why it fluctuates from epoch to epoch.
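
The effect of freezing can be checked by counting trainable and frozen parameters. The sketch below uses a toy model rather than BERT so that it runs without downloading weights. `ToyClassifier` and `count_parameters` are hypothetical names, and the learning rate is an arbitrary illustrative value, not a recommendation.

```python
import torch
import torch.nn as nn

class ToyClassifier(nn.Module):
    """Toy stand-in: an 'encoder' playing the role of BERT plus a new head."""
    def __init__(self, hidden=16, num_labels=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(8, hidden), nn.Tanh(), nn.Linear(hidden, hidden)
        )
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, x):
        return self.classifier(self.encoder(x))

def count_parameters(model):
    """Return (trainable, frozen) parameter counts."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    return trainable, frozen

toy = ToyClassifier()
# Same freezing pattern as the notebook applies to model.bert.
for param in toy.encoder.parameters():
    param.requires_grad = False

trainable, frozen = count_parameters(toy)
print(f"trainable: {trainable}, frozen: {frozen}")  # trainable: 34, frozen: 416

# Pass only the trainable parameters to the optimizer.
head_optimizer = torch.optim.Adam(
    [p for p in toy.parameters() if p.requires_grad], lr=1e-3
)
```

Applied to the notebook's `model`, the same `count_parameters` helper would show that only `classifier.weight` and `classifier.bias` remain trainable after freezing `model.bert`.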

### Fine-Tuning BERT with All Layers Trainable
In this example, we'll fine-tune the entire BERT model on our dataset. During fine-tuning, the common practice is to allow all (or most) of the layers to update their weights slightly according to the new data. This means that the layers are not frozen and the entire model is trainable, as shown in the example below.

In [11]:
# Re-initialize the model for fine-tuning
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
optimizer = optim.Adam(model.parameters(), lr=2e-5)

# Fine-Tuning Training Loop
model.train()  # Set the model to training mode
for epoch in range(4):  # Assume 4 epochs for simplicity
    model.train()
    for batch in loader:
        optimizer.zero_grad()
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['labels']
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

    # Compute accuracy
    model.eval()
    total_accuracy = 0
    with torch.no_grad():
        for batch in loader:
            input_ids = batch['input_ids']
            attention_mask = batch['attention_mask']
            labels = batch['labels']
            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            total_accuracy += compute_accuracy(logits, labels).item()

    avg_accuracy = total_accuracy / len(loader)
    print(f'Epoch {epoch+1}, Loss: {loss.item()}, Accuracy: {avg_accuracy:.2f}%')

print(f'Accuracy after Fine-Tuning: {avg_accuracy:.2f}%')


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1, Loss: 0.6042753458023071, Accuracy: 50.00%
Epoch 2, Loss: 0.5723692178726196, Accuracy: 90.00%
Epoch 3, Loss: 0.46594011783599854, Accuracy: 100.00%
Epoch 4, Loss: 0.3435141444206238, Accuracy: 100.00%
Accuracy after Fine-Tuning: 100.00%


With fine-tuning, accuracy rose from 50% after the first epoch to 100% by the third epoch and stayed there through the fourth. This marked improvement demonstrates the effectiveness of fine-tuning, where updating the weights of the entire BERT model allows it to adapt more closely to the specific characteristics of the custom dataset. Keep in mind that accuracy here is measured on the same ten examples used for training, so it shows that the model can fit this dataset rather than how well it generalizes to unseen text.

To freeze layers selectively during fine-tuning, apply the same `requires_grad = False` technique shown above to only a subset of layers. Which layers, and how many, to leave trainable depends on the task and the dataset size.
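
A minimal sketch of selective freezing, assuming the layers are available as an ordered `nn.ModuleList`. `freeze_all_but_last_n` is a hypothetical helper, and the toy stack of linear layers only stands in for the 12 encoder layers of `bert-base-uncased`.

```python
import torch.nn as nn

def freeze_all_but_last_n(layers, n_trainable):
    """Freeze every layer in `layers` except the last `n_trainable` ones."""
    cutoff = len(layers) - n_trainable
    for index, layer in enumerate(layers):
        for param in layer.parameters():
            # Layers at or after the cutoff stay trainable; earlier ones are frozen.
            param.requires_grad = index >= cutoff

# Toy stack standing in for a 12-layer encoder.
toy_layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(12)])
freeze_all_but_last_n(toy_layers, n_trainable=2)

status = [all(p.requires_grad for p in layer.parameters()) for layer in toy_layers]
print(status)  # ten False entries followed by two True entries
```

In the Hugging Face BERT implementation the encoder layers are exposed as `model.bert.encoder.layer`, so a call such as `freeze_all_but_last_n(model.bert.encoder.layer, 2)` should apply the same idea to the notebook's model. The embedding layer (`model.bert.embeddings`) is not part of that list and would need to be frozen separately.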

### Transfer Learning vs Fine-tuning
When comparing the outcomes of transfer learning and fine-tuning on a custom dataset using BERT, we observe a stark contrast in performance:

- Transfer Learning Accuracy: 50%
- Fine-tuning Accuracy: 100%

Transfer learning, with all of BERT's pre-trained layers frozen and only the classifier head trained, limited the model's ability to learn from the custom dataset, resulting in an accuracy equivalent to random guessing. Fine-tuning the entire BERT model, on the other hand, led to a significant improvement, with accuracy on the training examples reaching 100%. This demonstrates the model's capacity to adapt to the specific nuances of the dataset through fine-tuning.

For tasks involving custom datasets, especially when high accuracy is paramount, fine-tuning is generally the better choice. Fine-tuning allows pre-trained models like BERT to adjust their learned representations to better fit the specific characteristics and requirements of the task at hand. It requires more computational resources and time than transfer learning with frozen layers, but the gains in performance seen in our small experiment illustrate why it is often worth the cost.
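
Both training loops above measure accuracy on the same ten examples used for training. A minimal sketch of holding out part of the data for evaluation is shown below. `stratified_split` is a hypothetical helper written for this illustration, and the 20% validation fraction is an arbitrary choice.

```python
import random

def stratified_split(texts, labels, val_fraction=0.2, seed=0):
    """Split (text, label) pairs so that each class appears in both subsets."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)

    train, val = [], []
    for label, items in by_class.items():
        rng.shuffle(items)
        # Hold out at least one example per class.
        n_val = max(1, round(len(items) * val_fraction))
        val += [(text, label) for text in items[:n_val]]
        train += [(text, label) for text in items[n_val:]]
    return train, val

# The ten example sentences and labels from this notebook.
sample_texts = [
    'I love this product!', 'Absolutely wonderful service.',
    'Not what I expected, sadly.', 'The experience was bad, very bad.',
    'Fantastic! Will come again.', 'Do not recommend.',
    'Great value for the money.', 'Worst purchase I ever made.',
    'Happy with my purchase!', 'Terrible, I hated it.',
]
sample_labels = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]

train_pairs, val_pairs = stratified_split(sample_texts, sample_labels)
print(len(train_pairs), len(val_pairs))  # 8 2
```

The two lists could then be wrapped in separate `ClassificationDataset` instances, training on one and computing accuracy on the other. With only ten examples the held-out estimate is very noisy, so a larger dataset is needed for a meaningful comparison.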

### Summary
- **Transfer Learning**: Leverages the pre-trained knowledge of BERT on vast amounts of text data. Good starting point, especially with limited data, but may not fully adapt to your custom dataset's nuances. Typically involves freezing most of the pre-trained model's layers to prevent overfitting on a small dataset or a task very different from the original training task.
- **Fine-tuning**: Further customizes the BERT model to your specific dataset and task. This usually leads to higher performance if you have sufficient data. Involves training most or all of the model's layers on the new task, allowing the model to adjust its weights to better fit the specific task.

Choosing which layers to freeze or unfreeze is a crucial aspect of adapting pre-trained models to new tasks and is often tuned based on experimental results.