## Tutorial: Transfer Learning - Fine-Tuning Language Models for Various Language Tasks

In this tutorial, let's explore how to fine-tune various language models to carry out some important text processing tasks. **This tutorial is NOT graded but could be useful for your course project.**

There are three main types of text processing tasks that are part of many text applications.

1. **Text classification**: Given an input text, classify the text into one of several categories. Examples: Sentiment Classification, Topic Classification, Intent Classification.

2. **Sequence classification** (also called sequence labeling or token classification): Given an input text, classify each word in the text. Examples: Part-of-Speech Tagging, Named Entity Recognition, etc.

3. **Sequence-to-sequence generation**: Given an input text, generate another text by first encoding (or "understanding") the input and then generating the output from that encoding. Example: Summarization.

We will **not build** and train models from scratch; instead, we will take advantage of transfer learning. That is, given a pre-trained language model, we use a portion of it to encode the text and then attach a classifier or generator "head" to classify or generate text. Training such a customized model is often referred to as fine-tuning.

Some good choices of language models are:

- **BERT:** Bidirectional Encoder Representations from Transformers, a pre-trained language model based on the Transformer architecture and trained on a large text corpus (BooksCorpus and English Wikipedia) (Devlin et al., 2018). BERT is an "encoder-only" model, so for fine-tuning we add a task-specific head on top of the encoder. BERT cannot be used for text generation in a straightforward manner.

- **RoBERTa:** similar to BERT but trained for longer on more data, with a modified pre-training procedure and a different vocabulary (Liu et al., 2019). This is also an "encoder-only" model.

- **T5:** Stands for Text-to-Text Transfer Transformer. This is an encoder-decoder model whose decoder is auto-regressive, i.e., it can be used to generate text. We can also use only the encoder part to encode text.

- **GPT-X:** Stands for Generative Pre-trained Transformer. This is a "decoder-only" model that is auto-regressive in nature, i.e., it can be used to generate text. It has no separate encoder, but its hidden states can still be used to encode text. GPT has many versions such as GPT-1, 2, 3, 3.5, and 4.

- **BART:** Stands for Bidirectional and Auto-Regressive Transformers. This is a sequence-to-sequence model that combines a bidirectional (BERT-like) encoder with a left-to-right auto-regressive (GPT-like) decoder. It is pre-trained as a denoising autoencoder: the input text is corrupted (for example, by masking spans of tokens) and the model learns to reconstruct the original text. BART is well suited to text generation tasks such as summarization.
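The difference between "bidirectional" encoders and "auto-regressive" decoders in the list above comes down to which positions each token is allowed to attend to. The following is a minimal sketch in plain Python (the function names are made up for illustration, and real models build these masks as tensors inside the library):

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


# Print both masks for a 4-token sequence (1 = may attend, 0 = blocked)
print("bidirectional:")
for row in bidirectional_mask(4):
    print(row)

print("causal:")
for row in causal_mask(4):
    print(row)
```

Because a causal mask never lets a token see what comes after it, the same model can generate text one token at a time, which is what makes decoder-style models suitable for generation.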

These models are implemented in various Python libraries built on PyTorch and TensorFlow. One very popular library is the `transformers` library by Hugging Face.

**Note:** To complete the exercises, it is strongly recommended that you switch to a GPU-based machine (at least a T4 GPU). Check out this link to enable GPUs in Colab: https://www.tutorialspoint.com/google_colab/google_colab_using_free_gpu.htm

**Also note:** We will be working with train and test files for various tasks. Please download all *txt* files from Canvas and upload them under Files to make the code work.

Let's first install the necessary libraries (Hugging Face's `transformers`, `datasets`, and `accelerate`).


In [1]:
!pip install transformers[torch]
!pip install datasets
!pip install accelerate -U



Deep-learning implementations often rely on randomized processes and algorithms, so results may vary significantly from run to run. To make results reproducible and model behavior as deterministic as possible, let's set the various random seeds, as shown below:

In [2]:
import random
import torch
import numpy as np

def set_seed(seed_value=42):
    """Set seed for reproducibility for Python's random module, NumPy, and PyTorch.

    Args:
        seed_value (int): The seed value to set for random number generators.
    """
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)

    # Additional steps for deterministic behavior
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

# Set the seed
set_seed(42)  # You can replace 42 with any other seed value of your choice
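
The effect of seeding can be checked with a small experiment. This is a minimal sketch that uses only Python's built-in `random` module (the helper `draw` is a made-up name); the same idea carries over to the NumPy and PyTorch generators seeded above.

```python
import random


def draw(seed, n=3):
    """Re-seed the generator and return n pseudo-random numbers."""
    random.seed(seed)
    return [random.random() for _ in range(n)]


first = draw(42)
second = draw(42)   # same seed: the sequence repeats exactly
other = draw(7)     # different seed: a different sequence

print("same seed gives same numbers:", first == second)
print("different seed gives same numbers:", first == other)
```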

## PART 1. Fine-tuning a language model for a text classification task - use ONLY 5 examples for training

The first task is to fine-tune a pre-trained model (in this case, we will use BERT) for a text classification task.

The idea is to load a pre-trained BERT encoder, which encodes text inputs into features. We then use a feed-forward network (also known as a classification head) to consume the features and output class labels.
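
To make the role of the classification head concrete, here is a toy sketch in plain Python: a single linear layer followed by a softmax, applied to a feature vector. All numbers are made up for illustration. In the real model the features come from the BERT encoder (768 values per input for `bert-base-uncased`), and the head's weights are what gets newly initialized and trained.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classification_head(features, weights, bias):
    """Linear layer: logits[c] = sum_i features[i] * weights[c][i] + bias[c]."""
    logits = [
        sum(f * w for f, w in zip(features, row)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Made-up 3-dimensional "sentence features" standing in for the encoder output
features = [0.5, -1.0, 2.0]
# Made-up head parameters: one weight row and one bias per class (2 classes)
weights = [[0.1, 0.2, -0.3], [-0.1, 0.4, 0.3]]
bias = [0.0, 0.0]

probs = classification_head(features, weights, bias)
predicted = probs.index(max(probs))
print("class probabilities:", probs)
print("predicted class:", predicted)
```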

Let's fine-tune the model using just **FIVE** training examples.

In [18]:
import torch
from torch.utils.data import Dataset
from transformers import BertForSequenceClassification, BertTokenizer, Trainer, TrainingArguments

# Define your dataset class
class MyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

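    def __getitem__(self, idx):
        # The encodings are already tensors (return_tensors='pt'), so index them directly;
        # wrapping them in torch.tensor() again triggers a copy-construct UserWarning.
        item = {key: val[idx].clone().detach() for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item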

    def __len__(self):
        return len(self.labels)

# Load pre-trained BERT model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Prepare your dataset
# Here we are using only five examples for illustration purposes
# Also we are considering a binary classification task
# For multiclass classification, you can set num_labels to more than 2.
# You can replace this example with your own examples and labels loaded from data files

train_texts = ["This movie is good", "Very bad acting", "I hate the movie", "lovely acting", "too good"]
train_labels = [1, 0, 0, 1, 1]

# We need to tokenize the data first
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=128, return_tensors='pt')

print ("Tokenized training data", train_encodings)

# Create the training dataset (the Trainer builds the dataloader internally)
train_dataset = MyDataset(train_encodings, train_labels)

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=2,
    logging_dir='./logs',
    logging_steps = 1
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

# Fine-tune the model
trainer.train()

model.save_pretrained('fine_tuned_bert_model')
tokenizer.save_pretrained('fine_tuned_bert_model')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Tokenized training data {'input_ids': tensor([[ 101, 2023, 3185, 2003, 2204,  102],
        [ 101, 2200, 2919, 3772,  102,    0],
        [ 101, 1045, 5223, 1996, 3185,  102],
        [ 101, 8403, 3772,  102,    0,    0],
        [ 101, 2205, 2204,  102,    0,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 0, 0]])}
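
The printed encodings above show how padding works: shorter sentences are filled up with the padding ID `0`, and the `attention_mask` marks real tokens with `1` and padding with `0`. The sketch below reproduces this for two of the sentences in plain Python; `pad_batch` is a made-up helper, not a library function, and the token IDs are copied from the output above.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


# 101 and 102 are BERT's [CLS] and [SEP] special tokens
batch = [
    [101, 2023, 3185, 2003, 2204, 102],  # "This movie is good"
    [101, 8403, 3772, 102],              # "lovely acting"
]

input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```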




Step,Training Loss
1,0.7556
2,0.8929
3,0.9696
4,0.6914
5,0.5001
6,0.5128
7,0.6535
8,0.7298
9,0.4645
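
The nine logged steps in the table above follow from the training configuration: 5 examples with a batch size of 2 give 3 batches per epoch (the last batch holds a single example), and 3 epochs give 9 optimizer steps. A small sketch of the arithmetic, assuming a single device and no gradient accumulation (`total_steps` is a made-up helper):

```python
import math


def total_steps(num_examples, batch_size, num_epochs):
    """Number of optimizer steps when the last, smaller batch is kept."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


steps = total_steps(num_examples=5, batch_size=2, num_epochs=3)
print(steps)  # 9
```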


('fine_tuned_bert_model/tokenizer_config.json',
 'fine_tuned_bert_model/special_tokens_map.json',
 'fine_tuned_bert_model/vocab.txt',
 'fine_tuned_bert_model/added_tokens.json')

We now delete all variables and models from memory. This marks the end of training.

In [19]:
# Empty VRAM
del model
del trainer

# Invoke garbage collector
import gc
gc.collect()
gc.collect()

0

We will load the model into memory again and perform testing.

In [20]:
from transformers import  BertForSequenceClassification, BertTokenizer
from transformers import pipeline

model_path = "./fine_tuned_bert_model"

# Example usage of the saved model for evaluation
model = BertForSequenceClassification.from_pretrained(model_path)
tokenizer = BertTokenizer.from_pretrained(model_path)

classifier = pipeline(task = "text-classification",model = model, tokenizer =tokenizer)

Here we illustrate a few example test cases. You can loop over a test dataset to compute classification accuracy.

In [31]:
# Use `text` rather than `input` so that Python's built-in input() is not shadowed
#text = "Could the movie be more boring!!!"
text = "I Agree: This Is The Best War Movie Ever Made"

output = classifier(text)

print(output)

text = "I know it's fashionable to trash successful movies but at least be honest about the trashing... Pvt. Ryan was fiction but it was pretty good HISTORICAL fiction. The details were well thought out and based on reality.!!!"

output = classifier(text)

print(output)

text = "I hate the movie"

output = classifier(text)

print(output)


[{'label': 'LABEL_1', 'score': 0.6073551177978516}]
[{'label': 'LABEL_1', 'score': 0.5899153351783752}]
[{'label': 'LABEL_0', 'score': 0.6984620094299316}]
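
To turn predictions like the ones above into an accuracy figure, the predicted label names have to be mapped back to integers and compared with the gold labels. The following is a minimal sketch; `label_to_int` and `accuracy` are made-up helper names, and the gold labels are hypothetical values chosen only to make the example runnable.

```python
def label_to_int(label):
    """Map a default label name such as 'LABEL_1' to the integer 1."""
    return int(label.rsplit("_", 1)[1])


def accuracy(predictions, gold_labels):
    """Fraction of predictions whose label matches the gold label."""
    predicted = [label_to_int(p["label"]) for p in predictions]
    correct = sum(int(p == g) for p, g in zip(predicted, gold_labels))
    return correct / len(gold_labels)


# One prediction dict per test sentence, copied from the output above
predictions = [
    {"label": "LABEL_1", "score": 0.6073551177978516},
    {"label": "LABEL_1", "score": 0.5899153351783752},
    {"label": "LABEL_0", "score": 0.6984620094299316},
]
gold = [1, 1, 0]  # hypothetical gold labels for the three sentences

print("accuracy:", accuracy(predictions, gold))
```

For a real test file, the same function can be applied after collecting one prediction per test example.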


## Exercise E1 (not graded). Try evaluating the classifier's performance using test data.

1. Load the test data `imdb_test.csv`, process the inputs, and predict the sentiment labels using the last block of code above. Now compute the accuracy of the classifier by comparing the predicted and actual labels.

2. Try to form and use 10 examples instead of 5. Retrain the model and recompute the accuracy. Do you see any difference?

## PART 2. Fine-tuning a language model for text generation: summarization example

Summarization is a sequence-to-sequence learning task (a text generation task), because the input and output tokens do not have a one-to-one correspondence. Ideally, a good system should process and comprehend the input text and generate summaries that are **adequate** (i.e., retain the gist) and **fluent** (i.e., maintain proper grammatical structure).

For this task, we choose the XSUM (Extreme Summarization) dataset (https://github.com/EdinburghNLP/XSum). As fine-tuning the models on the entire data can be time consuming, we have derived a small portion of the data to be used for training and testing.

The dataset I have provided has six files:

1. **summary_training.input**: contains 200 paragraphs, one per line. This is our training data.

2. **summary_training.output**: contains 200 summaries, one per line and aligned with the inputs. This is our supervision signal.

Similarly, the validation and test splits each contain roughly 50 inputs and summaries, in `summary_*.input` and `summary_*.output` respectively.

So far in the course, we have not discussed the importance of having a validation set. As a reminder, the validation set is often used for model selection: among the candidate models, we pick the one that overfits the least, i.e., the one that performs well on the validation data and not just on the training data.
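
As a rough illustration of model selection, the sketch below picks the checkpoint with the lowest validation loss. The loss values are invented purely for illustration and are not results from this tutorial; `select_best_checkpoint` is a made-up helper.

```python
def select_best_checkpoint(validation_losses):
    """Return the name of the checkpoint with the lowest validation loss."""
    return min(validation_losses, key=validation_losses.get)


# Invented validation losses, one per saved checkpoint (illustration only)
validation_losses = {"epoch_1": 2.10, "epoch_2": 1.85, "epoch_3": 1.97}

best = select_best_checkpoint(validation_losses)
print("best checkpoint:", best)
```

In this made-up example the loss rises again after the second epoch, which is the typical sign that further training has started to overfit.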

Let's get on board with fine-tuning. I am showing an example of fine-tuning using Facebook's BART model. You could choose another pre-trained model such as `t5-small`, which is also a sequence-to-sequence model with an auto-regressive decoder.

To process and tokenize the inputs (i.e., split the text into tokens and convert them into integer token IDs), we also have to initialize a tokenizer, which usually comes bundled with the pre-trained model.

In [33]:
from transformers import BartForConditionalGeneration, BartTokenizer
from transformers import Trainer, TrainingArguments
from datasets import Dataset
import torch

# Load the tokenizer and model
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")



In [34]:
def create_dataset(input_file, output_file):
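    # Read one example per line, dropping the trailing newline of each line
    with open(input_file, "r") as f:
        inputs = [line.rstrip("\n") for line in f]
    with open(output_file, "r") as f:
        targets = [line.rstrip("\n") for line in f]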

    # Throw error if number of documents is not equal to number of summaries
    assert len(inputs) == len(targets)

    dataset_dict = {"input_text": inputs, "target_text": targets}

    # Create a huggingface dataset from dictionary
    dataset = Dataset.from_dict(dataset_dict)

    # Tokenize inputs and targets into integer token IDs (padded/truncated to 300 tokens)
    def tokenize_and_encode(examples):
        inputs = tokenizer(examples["input_text"], padding="max_length", truncation=True, max_length=300, return_tensors="pt")
        targets = tokenizer(examples["target_text"], padding="max_length", truncation=True, max_length=300, return_tensors="pt")
        print ("Dataset input shape", inputs["input_ids"].shape)
        print ("Dataset output shape", targets["input_ids"].shape)
        return {"input_ids": inputs.input_ids, "attention_mask": inputs.attention_mask, "labels": targets.input_ids}

    dataset = dataset.map(tokenize_and_encode, batched=True)
    return dataset


train_data = create_dataset("summary_training.input","summary_training.output")
validation_data = create_dataset("summary_validation.input","summary_validation.output")

# Fine-tune the model
training_args = TrainingArguments(
    output_dir="./results",          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=4,   # batch size per device during training
    save_steps=1000,                 # number of updates steps before checkpoint saves
    save_total_limit=2,         # limit the total amount of saved checkpoints
    logging_steps = 10          #print losses after 10 steps
    )

trainer = Trainer(
    model=model,                          # the instantiated 🤗 Transformers model to be trained
    args=training_args,                   # training arguments, defined above
    train_dataset=train_data,       # training dataset
    eval_dataset = validation_data
)

trainer.train()

# Save the model after training
model_path = "./fine_tuned_bart_summarization"
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)



Dataset input shape torch.Size([200, 300])
Dataset output shape torch.Size([200, 300])



Dataset input shape torch.Size([51, 300])
Dataset output shape torch.Size([51, 300])


Step,Training Loss
10,11.2943
20,6.5652
30,4.506
40,3.5049
50,2.7051
60,1.9953
70,1.4814
80,1.0534
90,0.7791
100,0.6281


Non-default generation parameters: {'early_stopping': True, 'num_beams': 4, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}


('./fine_tuned_bart_summarization/tokenizer_config.json',
 './fine_tuned_bart_summarization/special_tokens_map.json',
 './fine_tuned_bart_summarization/vocab.json',
 './fine_tuned_bart_summarization/merges.txt',
 './fine_tuned_bart_summarization/added_tokens.json')
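
One detail worth knowing about the training code above: the target summaries are padded to 300 tokens, and those padding positions are used as labels too, so the model is also rewarded for predicting padding. A common refinement is to replace padding IDs in the labels with `-100`, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea in plain Python; `mask_padding_in_labels` is a made-up helper, the token IDs are toy values, and the padding ID of 1 is an assumption here (in practice, read it from `tokenizer.pad_token_id`).

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def mask_padding_in_labels(label_ids, pad_token_id):
    """Replace padding IDs with IGNORE_INDEX so they do not contribute to the loss."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]


# Toy label sequence: three real tokens and an end token, followed by two padding IDs
labels = [[0, 713, 16, 2, 1, 1]]
masked = mask_padding_in_labels(labels, pad_token_id=1)
print(masked)
```

With this change the reported training loss would reflect only the real summary tokens, so it would not be directly comparable with the numbers in the table above.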

Now that we have the fine tuned model and we have saved the artefacts, we can delete the models from our GPU memory. We will perform testing by loading the fine tuned model from our saved location.

In [35]:
# Empty VRAM
del model
del trainer

# Invoke garbage collector
import gc
gc.collect()
gc.collect()

0

## 2.1 Testing the summarizer

We can load the model and test it using the `transformers` library. This amounts to carrying out the following steps:

1. For each text in the test data, tokenize the text, converting it into a sequence of integer token IDs. Also prepare the attention mask so that the model ignores padding tokens.

2. Pass the token IDs and attention mask as inputs to the fine-tuned BART model. Get the outputs (which are again token IDs, corresponding to the tokens of the generated summary).

3. Decode the token IDs back to string form, using the same tokenizer.

Instead of doing this manually, we can use the `pipeline` implementation in `transformers`, which takes care of all these steps.
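
To see what the encode and decode steps do, here is a toy round trip in plain Python. It uses a made-up whitespace vocabulary rather than BART's real subword tokenizer, and it skips the model call in between, so treat it only as an illustration of the data flow from text to token IDs and back.

```python
# Made-up vocabulary: the position in the list is the token ID
vocab = ["<pad>", "<s>", "</s>", "the", "fund", "supports", "computing", "clubs"]
token_to_id = {token: idx for idx, token in enumerate(vocab)}
special_tokens = {"<pad>", "<s>", "</s>"}


def encode(text, max_length):
    """Text -> (token IDs, attention mask), padded to max_length."""
    tokens = ["<s>"] + text.lower().split() + ["</s>"]
    ids = [token_to_id[t] for t in tokens][:max_length]
    num_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * num_pad
    return ids + [token_to_id["<pad>"]] * num_pad, attention_mask


def decode(ids):
    """Token IDs -> text, skipping special tokens."""
    return " ".join(vocab[i] for i in ids if vocab[i] not in special_tokens)


ids, mask = encode("The fund supports computing clubs", max_length=10)
print(ids)
print(mask)
print(decode(ids))
```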

In [36]:
from transformers import  BartForConditionalGeneration, BartTokenizer
from transformers import pipeline

model_path = "./fine_tuned_bart_summarization"

# Example usage of the saved model for evaluation
model = BartForConditionalGeneration.from_pretrained(model_path)
tokenizer = BartTokenizer.from_pretrained(model_path)

summarizer = pipeline(task = "summarization",model = model, tokenizer =tokenizer)

Test one example.

In [37]:
# Note: `max_length` is forwarded to generation, so it caps the length of the generated summary
tokenizer_kwargs = {'truncation':True,'max_length':100}


# Use `text` rather than `input` so that Python's built-in input() is not shadowed
text = "Skills Development Scotland, Highlands and Islands Enterprise, \
ScotlandIS and Education Scotland are backing the £250,000 fund called Digital \
Xtra.Among the aims of the scheme is to support extracurricular computing clubs \
for youngsters aged 16 and under.A panel will evaluate submissions for funding.\
Representatives from technology businesses, Scottish government and education will \
be on the panel."

generated_summary = summarizer(text,**tokenizer_kwargs)

print(generated_summary)

Your max_length is set to 100, but your input_length is only 78. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=39)


[{'summary_text': 'The Scottish Government has announced plans to fund a £250,000 fund to help young people with computing skills.\n'}]


## Exercise E2 (not graded). Explore fine-tuning

1. Fine-tune `facebook/bart-base` (the model used above) for another task: automatically generating paper titles from the abstracts of scientific papers. The sample data is provided as `titlegen_*.input` and `titlegen_*.output`. Report both BLEU and ROUGE scores.

2. Change the base (pre-trained) model from `facebook/bart-base` to `t5-small` and fine-tune a T5 model for the same task. What do you observe? Is there any change in the BLEU / ROUGE scores?


(**Hint:** You can use the following tokenizer and model loading starter code and everything else should be similar to the BART example)

```
from transformers import T5ForConditionalGeneration, T5Tokenizer, Trainer, TrainingArguments

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
```
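
For the scores asked for in the exercise, it helps to know what ROUGE measures. The sketch below is a simplified ROUGE-1 F1 (unigram overlap on lowercased, whitespace-split tokens) written in plain Python. It is not the official implementation, which adds details such as stemming, so for reported numbers an established package (for example, the metrics available through Hugging Face's `evaluate` library) is the safer choice. The example strings are made up.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


generated = "a fund for computing clubs"
reference = "new fund for school computing clubs"

score = rouge1_f1(generated, reference)
print(round(score, 4))
```

Here 4 of the 5 generated words appear in the reference (precision 0.8) and 4 of the 6 reference words are covered (recall about 0.67), which is why the F1 lands between the two.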
