# Language Models and Parameter-Efficient Fine-Tuning

## Language Models


Language models are fundamental to natural language processing. They come in three major categories:

1. **Encoder-only models** (e.g., BERT, RoBERTa, ELECTRA): Best suited for understanding tasks such as classification and regression.
2. **Encoder-decoder models** (e.g., T5, BART): Ideal for tasks like translation and summarization.
3. **Decoder-only models** (e.g., GPT-n models): Primarily used for text generation.


## Autoregressive Language Models


Autoregressive models predict the next token in a sequence based on previous tokens. This enables **conditional generation**, where outputs depend on the given prompt.
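To make the idea concrete before loading a real model, here is a minimal sketch of an autoregressive loop with temperature and top-k sampling. The `TOY_LOGITS` table, the four-word vocabulary, and the helper names are hypothetical stand-ins for a trained model; a real language model conditions on the whole prefix, not just the last token.

```python
import math
import random

# Hypothetical toy "model": next-token logits for a four-word vocabulary.
TOY_LOGITS = {
    "the": {"cat": 2.0, "dog": 1.5, "sat": -1.0},
    "cat": {"sat": 2.5, "the": 0.1, "dog": -0.5},
    "dog": {"sat": 2.0, "the": 0.3, "cat": -0.2},
    "sat": {"the": 1.0, "cat": 0.2, "dog": 0.1},
}


def sample_next_token(logits, temperature=1.0, top_k=2, rng=random):
    """Sample one token from a {token: logit} dict with temperature and top-k."""
    # Keep only the top_k highest-scoring candidates.
    candidates = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Temperature-scaled softmax weights (subtract the max for numerical stability).
    scaled = [score / temperature for _, score in candidates]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    tokens = [token for token, _ in candidates]
    return rng.choices(tokens, weights=weights, k=1)[0]


def generate(prompt_tokens, steps, temperature=1.0, top_k=2, seed=0):
    """Autoregressive loop: every new token is appended and fed back in."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(steps):
        logits = TOY_LOGITS[tokens[-1]]  # this toy model only looks at the last token
        tokens.append(sample_next_token(logits, temperature, top_k, rng))
    return tokens


# top_k=1 keeps a single candidate, which reduces sampling to greedy decoding.
print(generate(["the"], steps=3, top_k=1))
```

Lower temperatures sharpen the distribution toward the most likely token, while a larger `top_k` lets less likely tokens through.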


In [1]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Use the older GPT-2 model to generate text
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel


# Initialize GPT-2 model and tokenizer
def initialize_gpt2():
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)
    return tokenizer, model


gpt_tokenizer, gpt_model = initialize_gpt2()


# Function to generate text using GPT-2
def generate_text(prompt, max_length=50, temperature=1.0, top_k=50):
    """
    Generate text using GPT-2 with customizable parameters.

    Args:
        prompt (str): The initial text to seed the model.
        max_length (int): The maximum length of the generated text.
        temperature (float): Sampling temperature. Higher values make output more random.
        top_k (int): Limits sampling to the top-k most likely tokens.

    Returns:
        str: The generated text.
    """
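    inputs = gpt_tokenizer(prompt, return_tensors="pt")

    # Sample from the model; do_sample=True is required for temperature and top_k
    # to take effect (beam search would silently ignore them)
    with torch.no_grad():
        outputs = gpt_model.generate(
            inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_length=max_length,
            do_sample=True,
            temperature=temperature,
            top_k=top_k,
            no_repeat_ngram_size=2,
        )
    return gpt_tokenizer.decode(outputs[0], skip_special_tokens=True)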


prompts = [
    'Whose dialog is this? "Say my name."'
]

for prompt in prompts:
    print(f"Input Prompt: {prompt}")
    print(f"Generated Text: {generate_text(prompt, max_length=100, temperature=0.7)}\n")

## Large Language Models (LLMs)


Large Language Models (LLMs) scale up the size and capacity of traditional language models. Key concepts include:

- **Scale**: Models like GPT-3 (175 billion parameters) are far larger than earlier language models, which leads to significant improvements in performance.
- **Pre-training and Adaptation**: Pre-trained on massive datasets and later adapted to specific tasks.


## Ways to Adapt to New Tasks


Methods to adapt pre-trained models include:

1. **Zero-shot learning**: Use task descriptions as prompts without any training examples.
2. **Few-shot learning**: Provide a small number of task-specific examples.
3. **Lightweight Fine-tuning**: Modify only a subset of the model's parameters.
4. **Fine-tuning for human-aligned models**: Align models with human preferences using fine-tuning.
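The difference between the first two methods is only in what goes into the prompt. The sketch below illustrates this; the `Input:`/`Output:` template and the function names are assumptions chosen for illustration, not a format required by any particular model.

```python
def build_zero_shot_prompt(task_description, query):
    """Zero-shot: a task description and the query, with no solved examples."""
    return f"{task_description}\nInput: {query}\nOutput:"


def build_few_shot_prompt(task_description, examples, query):
    """Few-shot: the same prompt with a few solved examples before the query."""
    demos = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return f"{task_description}\n{demos}\nInput: {query}\nOutput:"


task = "Classify the sentiment as positive or negative."
examples = [
    ("I love the sunny weather.", "positive"),
    ("I am not happy with the service.", "negative"),
]

print(build_zero_shot_prompt(task, "The food was excellent!"))
print("---")
print(build_few_shot_prompt(task, examples, "The food was excellent!"))
```

In both cases the model's weights stay untouched; lightweight and full fine-tuning, by contrast, update parameters.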


## Zero-shot Learning


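T0 is a T5-based model fine-tuned on a multi-task mixture of prompted datasets, which lets it handle unseen tasks in a zero-shot setting. The cell below uses `bigscience/T0_3B` for question answering without any task-specific training.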


In [None]:
!pip install datasets transformers

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

def zero_shot_question_answering(context, question, model_name="bigscience/T0_3B"):
    # Load the tokenizer and model from Hugging Face Hub
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)


    # Prepare the input with the context and question
    prompt = f"context: {context} question: {question}"

    # Tokenize the prompt with padding and truncation
    inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True)

    # Generate an answer from the model
    with torch.no_grad():
        outputs = model.generate(inputs["input_ids"], max_length=50, num_beams=5)

    # Decode the generated output to get the answer
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return answer

# Define the context and the question
context = "Albert Einstein was a theoretical physicist who developed the theory of relativity."
question = "Who developed the theory of relativity?"

# Perform question answering
answer = zero_shot_question_answering(context, question)
print("Answer:", answer)


## Few-shot Learning


Few-shot learning strategies include:

1. **Prompt-based fine-tuning**: Recast the task as a prompt (e.g., a cloze-style template) and fine-tune the model on a small number of examples in that format.
2. **In-context learning (ICL)**: Provide a few examples as part of the prompt for task demonstrations.


#### Prompt-based fine-tuning

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from torch.optim import AdamW  # Use PyTorch's AdamW
from torch.utils.data import DataLoader, Dataset
from datasets import load_dataset

# Define a custom dataset for prompt-based fine-tuning
class PromptDataset(Dataset):
    def __init__(self, data, tokenizer, max_length=512):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        prompt, response = self.data[idx]
        encoded_input = self.tokenizer(
            prompt,
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt",
        )
        encoded_response = self.tokenizer(
            response,
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt",
        )
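        # Replace padding token IDs in the labels with -100 so the loss ignores them
        labels = encoded_response["input_ids"].squeeze(0).clone()
        labels[labels == self.tokenizer.pad_token_id] = -100
        return {
            "input_ids": encoded_input["input_ids"].squeeze(0),
            "attention_mask": encoded_input["attention_mask"].squeeze(0),
            "labels": labels,
        }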

# Load pre-trained model and tokenizer of t5-small
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Load SST-2 dataset
raw_dataset = load_dataset("glue", "sst2") ## use a dataset for sentiment analysis task like sst-2 or sst-5

# Prepare data for fine-tuning
processed_data = []
for example in raw_dataset["train"]:
    sentence = example["sentence"]  # input sentence of this training example
    label = "positive" if example["label"] == 1 else "negative"  # SST-2 uses 1 for positive, 0 for negative
    prompt = f"Sentiment analysis: {sentence} The sentiment is [MASK]."
    response = label
    processed_data.append((prompt, response))

# Create dataset and dataloader
dataset = PromptDataset(processed_data, tokenizer, max_length=64)
dataloader = DataLoader(dataset, batch_size=16, shuffle=True)  # the batch size is arbitrary, e.g. 16

# Define optimizer and device
optimizer = AdamW(model.parameters(), lr=5e-5)  # Use PyTorch's AdamW
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Fine-tuning loop
num_epochs = 3 # define number of epochs
model.train()
for epoch in range(num_epochs):
    for batch in dataloader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss  # cross-entropy loss the model computes from `labels`
        loss.backward()  # backpropagate the gradients
        optimizer.step()

    print(f"Epoch {epoch + 1}, Loss: {loss.item()}")  # loss of the last batch in this epoch

# Evaluate the fine-tuned model
def generate_response(prompt):
    model.eval()
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding="max_length", max_length=64).to(device)
    with torch.no_grad():
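        # Pass the attention mask so the padded positions are ignored during generation
        outputs = model.generate(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, max_length=64)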
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test the fine-tuned model
prompt = "Sentiment analysis: This movie was great! The sentiment is [MASK]."
response = generate_response(prompt)
print(f"Prompt: {prompt}")
print(f"Response: {response}")


#### In-Context Learning (ICL)

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


# Initialize a lighter model for efficient in-context learning
def initialize_model():
    model_name = "facebook/opt-125m"  # A lightweight OPT model
    tokenizer = AutoTokenizer.from_pretrained(model_name)  # use pretrained tokenizer
    model = AutoModelForCausalLM.from_pretrained(model_name)  # use pretrained model

    return tokenizer, model


opt_tokenizer, opt_model = initialize_model()


def in_context_learning(prompt_examples, test_prompt, max_length=50, temperature=0.7):
    """
    Perform in-context learning by providing a few examples as part of the prompt.

    Args:
        prompt_examples (list of tuples): List of (input, output) examples.
        test_prompt (str): The input for which the output needs to be predicted.
        max_length (int): Maximum length of the generated text.
        temperature (float): Sampling temperature for controlling randomness.

    Returns:
        str: Generated output for the test_prompt.
    """
    # Build the in-context prompt
    context = "\n---\n".join(
        [f"Input: {inp}\nOutput: {out}" for inp, out in prompt_examples]
    )
    final_prompt = f"{context}\n---\nInput: {test_prompt}\nOutput:"

    # Tokenize and encode the prompt
    inputs = opt_tokenizer(
        final_prompt, return_tensors="pt", padding=True, truncation=True
    )
    input_ids = inputs["input_ids"]
    attention_mask = inputs["attention_mask"]

    # Generate the output; do_sample=True is required for temperature and top_k to take effect
    outputs = opt_model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_length=max_length,
        do_sample=True,
        temperature=temperature,
        top_k=50,
        pad_token_id=opt_tokenizer.eos_token_id,
    )

    # Decode only the newly generated tokens (everything after the prompt),
    # which is more robust than slicing the decoded string by prompt length
    new_tokens = outputs[0][input_ids.shape[1]:]
    generated_output = opt_tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
    return generated_output


# Few-shot examples
prompt_examples = [
    ("I love the sunny weather.", "positive"),
    ("I am not happy with the service.", "negative"),
    ("The food was excellent!", "positive"),
]

# Test input
test_prompt = "The product quality is very good."

# Generate the answer
output = in_context_learning(
    prompt_examples, test_prompt, max_length=100, temperature=0.7
)
print(f"Input: {test_prompt}")
print(f"Generated Output: {output}")

## Prompting Paradigm


Prompt engineering is critical for leveraging models like GPT-3. It involves writing task-specific prompts that guide the model toward the desired output.

- **Advantages**: Rapid prototyping and no parameter updates.
- **Disadvantages**: Sensitivity to prompt design and structure.


## Fine-tuning vs. In-context Learning


Comparison of techniques:

- **Fine-tuning**: Adjusts model weights, which often leads to better performance but requires more resources.
- **In-context learning**: Provides task demonstrations as input without modifying the model.


## Parameter-Efficient Fine-Tuning (PEFT)


PEFT methods include:

1. **Adapters**: Insert small bottleneck layers into the transformer layers and train only those.
2. **Prompt Tuning & Prefix Tuning**: Optimize continuous prompt or prefix vectors while keeping the model's own weights frozen.
3. **LoRA (Low-Rank Adaptation)**: Fine-tune only low-rank updates to model parameters.


#### Adapters


In [None]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from transformers import BertTokenizer, BertModel
from datasets import load_dataset


# Define a custom dataset for classification
class TextClassificationDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length):
        """
        Args:
            texts (list): List of input texts.
            labels (list): List of corresponding labels.
            tokenizer: Pre-trained tokenizer.
            max_length (int): Maximum sequence length.
        """
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoded = self.tokenizer(
            text,
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt",
        )
        return {
            "input_ids": encoded["input_ids"].squeeze(0),
            "attention_mask": encoded["attention_mask"].squeeze(0),
            "label": torch.tensor(label, dtype=torch.long),
        }


# Define the Adapter class
class Adapter(nn.Module):
    def __init__(self, hidden_size, adapter_size):
        super(Adapter, self).__init__()
        self.down_proj = nn.Linear(hidden_size, adapter_size)
        self.relu = nn.ReLU()
        self.up_proj = nn.Linear(adapter_size, hidden_size)

    def forward(self, x):
        down = self.down_proj(x)
        activated = self.relu(down)
        up = self.up_proj(activated)
        return x + up  # Residual connection


# Define the model class using BERT with Adapters
class BERTWithAdapters(nn.Module):
    def __init__(self, num_classes, adapter_size=64):
        super(BERTWithAdapters, self).__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.adapters = nn.ModuleList(
            [
                Adapter(self.bert.config.hidden_size, adapter_size)
                for _ in range(self.bert.config.num_hidden_layers)
            ]
        )
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        hidden_states = outputs.hidden_states

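        # Apply one adapter to each transformer layer's output
        # (hidden_states[0] is the embedding output, so layer i sits at index i + 1).
        # Simplification: the adapters post-process BERT's outputs instead of sitting
        # inside the layers, so only the last adapter's output reaches the classifier.
        adapted_hidden_states = []
        for i, adapter in enumerate(self.adapters):
            adapted_hidden_states.append(adapter(hidden_states[i + 1]))

        # Use the [CLS] token representation from the last adapter output
        pooled_output = adapted_hidden_states[-1][:, 0]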
        x = self.dropout(pooled_output)
        return self.classifier(x)


# Load tokenizer and model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BERTWithAdapters(num_classes=2)  # Binary classification (Positive/Negative)

# Load GLUE SST-2 dataset
data = load_dataset("glue", "sst2")
train_texts = data["train"]["sentence"]
train_labels = data["train"]["label"]
val_texts = data["validation"]["sentence"]
val_labels = data["validation"]["label"]

# Create dataset and dataloaders
train_dataset = TextClassificationDataset(
    train_texts, train_labels, tokenizer, max_length=64
)
val_dataset = TextClassificationDataset(val_texts, val_labels, tokenizer, max_length=64)
train_dataloader = DataLoader(
    train_dataset, batch_size=16, shuffle=True
)  # the batch size is arbitrary; remember to shuffle the training data
val_dataloader = DataLoader(val_dataset, batch_size=16)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

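# Freeze BERT so that only the adapters and the classifier are trained
for param in model.bert.parameters():
    param.requires_grad = False

# Define optimizer and loss function (lr=1e-4 is an assumed value for the small trainable set)
optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()  # cross-entropy over the two class logits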

# Training loop
num_epochs = 3
for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for batch in train_dataloader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)

        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_fn(outputs, labels)  # use the loss function you defined earlier
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    avg_loss = total_loss / len(
        train_dataloader
    )  # average loss of the train_dataloader
    print(f"Epoch {epoch + 1}, Loss: {avg_loss}")

# Evaluation loop
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for batch in val_dataloader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs, dim=1)
        correct += (predictions == labels).sum().item()
        total += labels.size(0)

accuracy = correct / total
print(f"Validation Accuracy: {accuracy * 100:.2f}%")
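To see why adapters count as lightweight, the following sketch tallies the parameters of the `Adapter` module defined above. It assumes the `bert-base-uncased` configuration (hidden size 768, 12 layers) and the default `adapter_size=64`; the helper names are hypothetical.

```python
def linear_params(in_features, out_features, bias=True):
    """Number of parameters in a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


def adapter_params(hidden_size, adapter_size):
    """Down-projection plus up-projection, matching the Adapter module's two linear layers."""
    return linear_params(hidden_size, adapter_size) + linear_params(adapter_size, hidden_size)


hidden_size, adapter_size, num_layers = 768, 64, 12
per_adapter = adapter_params(hidden_size, adapter_size)
total = per_adapter * num_layers

print(f"Parameters per adapter: {per_adapter:,}")
print(f"Parameters for {num_layers} adapters: {total:,}")
```

Under these assumptions the twelve adapters add about 1.2 million parameters, roughly 1% of the approximately 110 million parameters in BERT-base.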

## Intrinsic Dimensionality


Research suggests that fine-tuning pre-trained language models has a low intrinsic dimension: optimizing a small number of parameters in a lower-dimensional subspace can recover most of the performance of full fine-tuning.
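One common way to formalize this idea, sketched here as an illustration rather than something implemented in this notebook, is to train only a small vector $\theta^{(d)}$ and map it into the full parameter space through a fixed random projection $P$:

$$
\theta^{(D)} = \theta_0^{(D)} + P\,\theta^{(d)}, \qquad P \in \mathbb{R}^{D \times d}, \quad d \ll D
$$

Here $\theta_0^{(D)}$ denotes the frozen pre-trained weights and $D$ the total number of model parameters. The intrinsic dimension is then the smallest $d$ for which training $\theta^{(d)}$ alone reaches a chosen fraction (for example 90%) of the full fine-tuning performance.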


## LoRA (Low-Rank Adaptation)


LoRA reduces the number of tunable parameters by:

1. Keeping original weights fixed.
2. Adding low-rank matrices to capture task-specific adaptations.
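The two steps above can be sketched in plain Python. The effective weight is $W = W_0 + \frac{\alpha}{r} BA$, where $W_0$ is frozen and only the low-rank factors $A$ and $B$ are trained. The tiny matrices and helper names below are illustrative assumptions; real implementations apply this to large weight matrices with tensor libraries.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(w0, lora_b, lora_a, alpha, r):
    """Return W0 + (alpha / r) * B @ A; W0 stays frozen, only A and B are trained."""
    delta = matmul(lora_b, lora_a)
    scale = alpha / r
    return [
        [w0[i][j] + scale * delta[i][j] for j in range(len(w0[0]))]
        for i in range(len(w0))
    ]


def lora_param_counts(d_out, d_in, r):
    """Parameters in the full weight matrix versus the two low-rank factors."""
    return d_out * d_in, r * (d_out + d_in)


w0 = [[1.0, 2.0], [3.0, 4.0]]  # frozen pre-trained weight (2 x 2)
lora_a = [[0.5, -0.5]]         # A has shape r x d_in, here with r = 1
lora_b = [[0.0], [0.0]]        # B has shape d_out x r and starts at zero

# With B = 0 the adapted weight equals the original weight.
print(lora_effective_weight(w0, lora_b, lora_a, alpha=2, r=1))

# Once B is non-zero, the weight shifts by a rank-1 update.
print(lora_effective_weight(w0, [[1.0], [2.0]], lora_a, alpha=2, r=1))

full, low_rank = lora_param_counts(768, 768, r=8)
print(f"Full matrix: {full:,} parameters, LoRA factors: {low_rank:,} ({low_rank / full:.2%})")
```

For a 768 x 768 projection with rank 8, the two factors hold about 2% of the parameters of the full matrix, which is where the savings come from.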

### Results and Takeaways
- Comparable performance to full fine-tuning with fewer parameters.
- Sometimes even outperforms full fine-tuning.


## Parameter-Efficient Fine-Tuning (PEFT) and Low-Rank Adaptation (LoRA)


### Why PEFT?
Traditional fine-tuning of large language models requires updating all model parameters, which can be computationally expensive and memory-intensive. Parameter-Efficient Fine-Tuning (PEFT) techniques address this limitation by modifying only a small subset of parameters, reducing resource requirements significantly.

### What is LoRA?
LoRA (Low-Rank Adaptation) is a specific PEFT method that freezes the pre-trained weights and adds small trainable low-rank matrices alongside selected weight matrices (typically the attention projections), enabling efficient adaptation for downstream tasks. It is widely used for fine-tuning large language models like GPT and BERT without modifying their original parameters.

In this section, we will implement LoRA for fine-tuning a pre-trained BART model on a text summarization task.


In [None]:
# Install required libraries
!pip install transformers peft datasets rouge-score --quiet

In [None]:
# Import libraries
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Seq2SeqTrainer, Seq2SeqTrainingArguments
from peft import get_peft_model, LoraConfig, TaskType
from datasets import load_dataset
# from datasets import load_metric
import matplotlib.pyplot as plt

In [None]:
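# Load the CNN/DailyMail dataset (it ships with train/validation/test splits)
dataset = load_dataset("cnn_dailymail", "3.0.0")
# Keep only the first 5000 training samples to shorten training
dataset["train"] = dataset["train"].select(range(5000))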

In [None]:
# Load pre-trained model and tokenizer
model_name = "facebook/bart-base" # specify the pre-trained model name (e.g., "facebook/bart-base")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name) #load the pre-trained model for Seq2SeqLM

In [None]:
# Tokenize dataset
def preprocess_function(examples):
    inputs = examples["article"]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True, padding="max_length") #tokenize the inputs with a max length of 1024, truncation, and padding

    # Tokenize the target summaries; text_target replaces the deprecated as_target_tokenizer()
    labels = tokenizer(text_target=examples["highlights"], max_length=1024, truncation=True, padding="max_length")

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = dataset.map(preprocess_function, batched=True)

In [None]:
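# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,  # sequence-to-sequence language modeling
    r=8,  # rank of the low-rank update matrices
    lora_alpha=16,  # scaling factor (the update is scaled by lora_alpha / r)
    target_modules=["q_proj", "v_proj"],  # attention query/value projections in BART
    lora_dropout=0.1,
)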

In [None]:
# Apply LoRA to the model
peft_model = get_peft_model(model, lora_config)
print("LoRA model created with PEFT.")
peft_model.print_trainable_parameters()  # reports trainable vs. total parameter counts

In [None]:
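# Define training arguments (the hyperparameter values are assumed starting points, not tuned)
training_args = Seq2SeqTrainingArguments(
    output_dir="./lora-bart-cnn",  # assumed output path
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=1,
    logging_steps=50,
    predict_with_generate=True,  # return generated token IDs so ROUGE can be computed
    generation_max_length=128,
    report_to="none",
)

# Trainer setup
trainer = Seq2SeqTrainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)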

In [None]:
# Train model
trainer.train()

# Save the fine-tuned LoRA adapter weights (the output path is an assumed example)
peft_model.save_pretrained("./lora-bart-cnn-adapter")

### ROUGE Metric
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a popular metric for evaluating text summarization tasks. It measures the overlap between the generated summaries and reference summaries using metrics such as ROUGE-1, ROUGE-2, and ROUGE-L:

- **ROUGE-1**: Measures overlap of unigrams (single words).
- **ROUGE-2**: Measures overlap of bigrams (two consecutive words).
- **ROUGE-L**: Considers the longest common subsequence (LCS).
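As a rough illustration of what these three scores measure, here is a simplified pure-Python sketch that reports F1 values on whitespace-split tokens. It is an assumption-laden teaching aid: library implementations apply their own tokenization and optional stemming, so their numbers can differ.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def f1(overlap, num_candidate, num_reference):
    """F1 from an overlap count and the candidate/reference sizes."""
    if overlap == 0:
        return 0.0
    precision = overlap / num_candidate
    recall = overlap / num_reference
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    """ROUGE-N F1 based on clipped n-gram matches."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    overlap = sum((cand & ref).values())  # each n-gram counts at most as often as in both
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1 based on the longest common subsequence."""
    cand, ref = candidate.split(), reference.split()
    return f1(lcs_length(cand, ref), len(cand), len(ref))


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(f"ROUGE-1: {rouge_n(candidate, reference, 1):.3f}")
print(f"ROUGE-2: {rouge_n(candidate, reference, 2):.3f}")
print(f"ROUGE-L: {rouge_l(candidate, reference):.3f}")
```

ROUGE-2 is lower than ROUGE-1 here because a single substituted word breaks two bigrams while costing only one unigram.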

In this implementation, we use the `evaluate` library to compute ROUGE scores for the generated summaries compared to the ground truth.


In [None]:
# Install evaluate library
!pip install evaluate --quiet

# Load the ROUGE evaluation metric
import evaluate

rouge = evaluate.load("rouge")

In [None]:
# Generate predictions and evaluate
import numpy as np


def evaluate_model(trainer, dataset, tokenizer):
    # Requires predict_with_generate=True so that predictions are token IDs, not logits
    predictions, labels, _ = trainer.predict(dataset)
    # Replace the -100 ignore index (if present) with the pad token so decoding works
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Compute ROUGE scores from the decoded predictions and references
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels)
    return decoded_preds, decoded_labels, dict(result)


preds, labels, results = evaluate_model(
    trainer, tokenized_datasets["validation"], tokenizer
)
print("ROUGE scores:", results)

### Predictions vs Ground Truth

Below are some examples of the model's predictions compared to the ground truth summaries:


In [None]:
validation_set = tokenized_datasets["validation"]
for i in range(3):  # Display 3 samples
    print(f"\n**Input Article {i+1}:**\n", validation_set[i]["article"][:500], "...")
    print(f"\n**Ground Truth Summary {i+1}:**\n", labels[i])
    print(f"\n**Model Prediction {i+1}:**\n", preds[i])

# Plot ROUGE scores as a bar chart
plt.bar(list(results.keys()), list(results.values()), color='skyblue')
plt.title("ROUGE Scores")
plt.ylabel("F1-Score")
plt.xlabel("Metric")
plt.show()

After fine-tuning with LoRA, we can compare the performance against traditional fine-tuning:

1. **Training Time**: LoRA computes gradients and optimizer updates for only a small set of parameters, which typically shortens training.
2. **Memory Usage**: Only the low-rank matrices need gradients and optimizer states, which lowers memory consumption.
3. **Performance Metrics**: Compare the ROUGE scores on the CNN/DailyMail validation split used above.

The bar chart above visualizes the ROUGE metrics, and the displayed predictions provide qualitative insights into the model's performance.
