### Regis University

**MSDS688_X70: Artificial Intelligence**  
Master of Science in Data Science Program

#### Week 4: Natural Language Processing for AI  
*GPU Required*

## Lecture: Week 4 - Fine-Tuning GPT-2 and T5 for Text Generation

### Overview

This week, we focus on two powerful models in **Natural Language Processing (NLP)**: **GPT-2** and **T5**. Both models are widely used for tasks like text generation, summarization, and translation, but they differ significantly in architecture and training approach. We will also discuss how these models can be **fine-tuned** for specific tasks, like generating text from the **Amazon Polarity dataset**.

---

### 1. **How GPT-2 Works**

#### GPT-2 Architecture:
**GPT-2** (Generative Pre-trained Transformer 2) is a **transformer-based** language model that uses a **decoder-only architecture**. GPT-2 is trained in an unsupervised manner on large amounts of text to predict the next word in a sentence, making it excellent for tasks like text generation.

#### Key Concepts:
- **Unidirectional (Left-to-Right)**: GPT-2 generates text by predicting the next token in a sequence based on all the previous tokens. This makes it ideal for generative tasks.
- **Pretraining**: GPT-2 is pretrained on a massive corpus using the task of **next-word prediction** (also called **causal language modeling**). During pretraining, it learns the statistical properties of the language, which can later be fine-tuned for specific tasks.
- **Self-Attention Mechanism**: GPT-2 uses self-attention layers to allow the model to focus on different parts of the input sequence when making predictions. This mechanism is key to handling long-range dependencies in text.
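
To make the left-to-right constraint concrete, the sketch below applies a causal mask to a toy matrix of attention scores in plain Python. The scores are made-up numbers and `causal_attention_weights` is a hypothetical helper for illustration only; GPT-2 itself does this on batched tensors with multiple attention heads.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """Turn raw attention scores into weights under a causal mask.

    Row i holds the scores of token i against every token j. Positions with
    j > i (future tokens) are excluded before the softmax, so each token can
    only attend to itself and to earlier tokens.
    """
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                      # tokens 0..i are allowed
        probs = softmax(visible)
        weights.append(probs + [0.0] * (len(row) - i - 1))  # future tokens get 0
    return weights

# Toy scores for a 3-token sequence (illustrative values only).
toy_scores = [
    [1.0, 2.0, 3.0],
    [1.0, 1.0, 5.0],
    [0.5, 0.5, 0.5],
]
weights = causal_attention_weights(toy_scores)
for row in weights:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, so its row is `[1.0, 0.0, 0.0]` no matter how large the scores for later tokens are.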

#### How GPT-2 Generates Text:
1. **Tokenization**: The input text is tokenized into smaller units (tokens). These tokens are then fed into the model.
2. **Sequential Generation**: Given an input sequence, GPT-2 generates its output one token at a time, with each new token conditioned on the prompt and on all tokens generated so far.
3. **Sampling Methods**: During text generation, different sampling methods like **top-k sampling** or **nucleus sampling (top-p)** can be used to control randomness and creativity in the generated text.
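
The two sampling methods named above can be sketched in a few lines of plain Python. This is a minimal illustration over a made-up next-token distribution; `top_k_filter`, `top_p_filter`, and `sample` are hypothetical helpers, not the library's implementation (in Hugging Face, the same ideas are exposed through the `top_k` and `top_p` arguments of `generate`).

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in keep)
    return {t: probs[t] / total for t in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    keep, cumulative = [], 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        keep.append(token)
        cumulative += probs[token]
        if cumulative >= p:
            break
    total = sum(probs[t] for t in keep)
    return {t: probs[t] / total for t in keep}

def sample(probs, rng):
    """Draw one token according to the (filtered) distribution."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Made-up distribution over the next token.
next_token_probs = {"great": 0.5, "good": 0.25, "okay": 0.125, "bad": 0.125}

top2 = top_k_filter(next_token_probs, k=2)       # keeps "great" and "good"
nucleus = top_p_filter(next_token_probs, p=0.8)  # keeps "great", "good", "okay"

rng = random.Random(0)  # seeded for reproducibility
print("top-k:", top2)
print("top-p:", nucleus)
print("sampled:", sample(nucleus, rng))
```

Smaller values of `k` or `p` keep fewer candidates and make the output more predictable; larger values admit more unlikely tokens and increase variety.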

#### Fine-Tuning GPT-2:
Fine-tuning GPT-2 involves training it on a new, specific dataset, allowing it to adapt its knowledge to a particular domain or task. This can be done using smaller datasets and fewer resources compared to full pretraining.

Steps for Fine-Tuning GPT-2:
1. **Load Pretrained Model**: Start with a pretrained GPT-2 model from libraries like Hugging Face.
2. **Tokenization**: Preprocess the dataset by tokenizing the text. Since GPT-2 expects input in token format, use its specific tokenizer.
3. **Training**: Use the fine-tuning dataset to adjust the weights of the model. The objective is still next-token prediction, but now it’s applied to the new dataset.
4. **Evaluation**: Evaluate the fine-tuned model using metrics like **perplexity**, the exponential of the average per-token cross-entropy loss. Lower perplexity means the model assigns higher probability to the held-out text.
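
As a rough illustration of the evaluation step, the sketch below computes perplexity from per-token log-probabilities in plain Python. The log-probabilities and label IDs are made-up values and `perplexity` is a hypothetical helper; the one real convention it relies on is that PyTorch's cross-entropy loss skips positions whose label is `-100`.

```python
import math

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def perplexity(token_log_probs, labels):
    """Perplexity = exp(mean negative log-likelihood) over non-ignored tokens."""
    kept = [lp for lp, label in zip(token_log_probs, labels) if label != IGNORE_INDEX]
    mean_nll = -sum(kept) / len(kept)
    return math.exp(mean_nll)

# Hypothetical log-probabilities the model assigned to each target token.
# The last position is padding that the model predicts with certainty.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(1.0)]

masked_labels = [11, 42, 7, IGNORE_INDEX]  # padding excluded from the loss
unmasked_labels = [11, 42, 7, 0]           # padding counted as a real token

ppl_masked = perplexity(log_probs, masked_labels)
ppl_unmasked = perplexity(log_probs, unmasked_labels)
print(f"Perplexity with padding masked:  {ppl_masked:.3f}")
print(f"Perplexity with padding counted: {ppl_unmasked:.3f}")
```

Counting easy-to-predict padding positions pulls the average loss down, so the unmasked number looks better than the model really is. This is worth keeping in mind when comparing perplexities between models that pad differently.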

---

### 2. **How T5 Works**

#### T5 Architecture:
**T5** (Text-To-Text Transfer Transformer) is a transformer model designed to handle **all NLP tasks** in a unified format: **text-to-text**. Unlike GPT-2, T5 uses an **encoder-decoder architecture**, which makes it versatile for tasks like text generation, translation, summarization, and classification.

#### Key Concepts:
- **Encoder-Decoder**: T5 uses an encoder to process the input sequence and a decoder to generate the output. The encoder captures the context, while the decoder generates the final text.
- **Text-to-Text Framework**: In T5, all tasks are cast into a text-to-text format. For instance, a classification task would look like:
  ```
  Input: "Is this review positive or negative? Review: The product was great."
  Output: "Positive"
  ```
- **Pretraining Task (Span Corruption)**: T5 is pretrained using a method called **span corruption**. Random spans of the input are each replaced with a unique sentinel token (e.g., `<extra_id_0>`), and the model must generate the missing spans. This objective trains T5 to use the context on both sides of a gap when generating text.
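
A small sketch can show what a span-corruption training pair looks like. It assumes whitespace-split words and hand-picked spans purely for readability; real T5 pretraining selects spans at random and operates on SentencePiece token IDs. `span_corrupt` is a hypothetical helper, while the `<extra_id_N>` names match the sentinel tokens in the T5 vocabulary.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel and build the target.

    `spans` must be sorted, non-overlapping, half-open index ranges.
    Returns (corrupted_input, target) as space-joined strings.
    """
    inputs, targets = [], []
    cursor = 0
    for n, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{n}>"
        inputs.extend(tokens[cursor:start])   # copy the untouched text
        inputs.append(sentinel)               # mark where the span was removed
        targets.append(sentinel)              # the target repeats the sentinel...
        targets.extend(tokens[start:end])     # ...followed by the removed span
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel ends the target
    return " ".join(inputs), " ".join(targets)

words = "Thank you for inviting me to your party last week .".split()
corrupted, target = span_corrupt(words, spans=[(2, 4), (8, 9)])
print("Input: ", corrupted)
print("Target:", target)
```

The encoder sees the corrupted input, and the decoder is trained to produce the target, so the model only has to generate the missing pieces rather than the whole sentence.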

#### How T5 Generates Text:
1. **Input Encoding**: The input text is tokenized and fed into the encoder, which transforms it into a latent representation.
2. **Text Generation**: The decoder then takes the latent representation and generates the output token by token, similar to GPT-2, but using context from the encoder.
3. **Task Prefixes**: T5 uses task-specific prefixes (e.g., “summarize:”, “translate English to French:”) to tell the model what type of task it should perform.

#### Fine-Tuning T5:
Fine-tuning T5 for a specific task is slightly different from GPT-2 because T5 can handle a variety of tasks using the text-to-text framework. You can fine-tune T5 by providing task-specific prompts.

Steps for Fine-Tuning T5:
1. **Load Pretrained T5 Model**: Start with a pretrained T5 model.
2. **Tokenization**: Preprocess the input text and add a task-specific prefix to each input. For example, use “summarize:” (a prefix the released T5 checkpoints already recognize) or a custom prefix such as “generate:”, whose meaning the model learns during fine-tuning.
3. **Training**: Train the model on the new dataset. Training minimizes the cross-entropy loss between the predicted output tokens and the expected output tokens.
4. **Evaluation**: T5 can be evaluated using metrics like **perplexity** for text generation or **accuracy** for classification tasks.
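
Because T5 emits its answer as text, accuracy for a classification task is usually computed by comparing generated strings with target strings. The sketch below shows one simple way to do that in plain Python; the predictions are made-up examples, and `normalize` and `exact_match_accuracy` are hypothetical helpers rather than part of any library.

```python
def normalize(text):
    """Ignore case and surrounding whitespace when comparing labels."""
    return text.strip().lower()

def exact_match_accuracy(predictions, references):
    """Fraction of predictions that match their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    matches = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return matches / len(references)

# Made-up decoded model outputs and their expected labels.
predictions = ["positive", "Negative ", "positive", "neutral"]
references = ["positive", "negative", "negative", "negative"]

accuracy = exact_match_accuracy(predictions, references)
print(f"Accuracy: {accuracy:.2f}")
```

Note that a text-to-text model can produce strings outside the label set (like `"neutral"` here); under exact match these simply count as wrong.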

---

### 3. **Comparing GPT-2 and T5**

#### Similarities:
- **Transformer Models**: Both GPT-2 and T5 are based on the transformer architecture, which uses attention mechanisms to handle long-range dependencies in text.
- **Pretraining and Fine-Tuning**: Both models are pretrained on large corpora and can be fine-tuned for specific tasks with smaller datasets.
- **Tokenization**: Both models require tokenizing input text before passing it to the transformer layers.

#### Differences:
- **Architecture**:
  - GPT-2 is a **decoder-only** model (unidirectional), generating text by predicting the next token based on previous tokens.
  - T5 uses an **encoder-decoder** architecture: the encoder attends to the input bidirectionally, while the decoder generates the output left to right and reads the encoder's representation through cross-attention.
- **Task Flexibility**:
  - GPT-2 excels at generative tasks like text completion and creative writing.
  - T5 is versatile and can handle both **generation** (e.g., summarization) and **understanding tasks** (e.g., question-answering).
- **Pretraining**:
  - GPT-2 is pretrained on the task of predicting the next word in a sentence (causal language modeling).
  - T5 is pretrained on **span corruption**, making it more robust for understanding context across a broader range of tasks.

---

### 4. **Fine-Tuning in This Assignment**

In this assignment, you will fine-tune both **GPT-2** and **T5** on a subset of the **Amazon Polarity dataset** for text generation tasks. You will:
1. **Tokenize the Dataset**: Prepare the data for both models using their respective tokenizers.
2. **Train GPT-2**: Fine-tune GPT-2 using its causal language modeling head to generate text based on the dataset.
3. **Train T5**: Fine-tune T5 by framing the task as a text-to-text problem (e.g., the input "classify sentiment: <review text>" with the target "positive" or "negative").
4. **Compare Results**: Evaluate the models on metrics like **perplexity** and compare their performance in terms of inference time and the quality of the generated text.
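
For the inference-time comparison, a single timed call can be noisy. One possible approach, sketched below under the assumption that the thing being timed is wrapped in a zero-argument function, is to run a warm-up call and then average over several repeats. `average_latency` and `fake_generate` are hypothetical names; `fake_generate` is only a stand-in for a real `model.generate(...)` call.

```python
import time

def average_latency(fn, warmup=1, repeats=5):
    """Return the mean wall-clock seconds per call of `fn`."""
    for _ in range(warmup):
        fn()                      # warm-up runs are not timed
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

def fake_generate():
    """Stand-in workload; replace with a call to model.generate(...)."""
    return sum(i * i for i in range(10_000))

latency = average_latency(fake_generate)
print(f"Mean latency: {latency * 1000:.3f} ms")
```

For a fair comparison, both models should be timed on the same device with the same maximum output length. On a GPU, operations can run asynchronously, so calling `torch.cuda.synchronize()` before reading the clock gives more reliable numbers.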

---

### Conclusion

This week’s assignment dives into the intricacies of **GPT-2** and **T5**, two influential transformer models for NLP tasks. You will learn how to fine-tune these models for specific tasks and evaluate their performance. Pay attention to the differences in how they approach text generation, and observe how fine-tuning can adapt these models to new datasets and tasks.

---


## Assignment Part 1: Follow Me – Fine-Tuning GPT-2 on a Subset of the Dataset

In this section, you will fine-tune the GPT-2 model on a subset of the dataset for text generation tasks. You'll explore how pre-trained language models can be customized for specific tasks by adjusting their weights through fine-tuning.


In [None]:
import os
os.environ["WANDB_MODE"] = "disabled"
os.environ["WANDB_DISABLED"] = "true"

In [None]:
# Install the necessary libraries
!pip install transformers datasets torch sentencepiece  # sentencepiece is needed by T5Tokenizer in Part 2

In [None]:
# Import necessary libraries
from datasets import load_dataset
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments
import math
import torch

In [None]:
# Load the Amazon Polarity dataset and take a small subset
dataset = load_dataset("amazon_polarity")
train_dataset = dataset["train"].select(range(1000))  # Select only 1000 examples for training
test_dataset = dataset["test"].select(range(200))  # Select only 200 examples for evaluation

In [None]:
# GPT-2 Tokenizer and Model
tokenizer_gpt2 = GPT2Tokenizer.from_pretrained("gpt2", clean_up_tokenization_spaces=True)
tokenizer_gpt2.pad_token = tokenizer_gpt2.eos_token  # Set pad token to eos token

In [None]:
# Preprocess and tokenize for GPT-2
def preprocess_gpt2(examples):
    # Prefix each review with its sentiment so the model learns to condition on it
    inputs = [
        ("Positive review: " if label == 1 else "Negative review: ") + text
        for text, label in zip(examples["content"], examples["label"])
    ]
    return tokenizer_gpt2(inputs, padding="max_length", truncation=True, max_length=128)

In [None]:
tokenized_gpt2_train = train_dataset.map(preprocess_gpt2, batched=True)
tokenized_gpt2_test = test_dataset.map(preprocess_gpt2, batched=True)

In [None]:
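# Use input_ids as labels for GPT-2 fine-tuning, but mask the padded positions.
# Labels of -100 are ignored by the loss, so padding does not distort the loss or perplexity.
def group_texts(examples):
    examples["labels"] = [
        [token if mask == 1 else -100 for token, mask in zip(ids, attention)]
        for ids, attention in zip(examples["input_ids"], examples["attention_mask"])
    ]
    return examples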

In [None]:
# Apply labels to both train and test datasets
tokenized_gpt2_train = tokenized_gpt2_train.map(group_texts, batched=True)
tokenized_gpt2_test = tokenized_gpt2_test.map(group_texts, batched=True)

In [None]:
# Fine-tune GPT-2 on the subset
model_gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")

In [None]:
# Set device (GPU if available, else CPU)
device = "cuda" if torch.cuda.is_available() else "cpu"
model_gpt2 = model_gpt2.to(device)  # Move model to device

In [None]:
training_args_gpt2 = TrainingArguments(
    output_dir="./results_gpt2",
    eval_strategy="epoch",  # Use eval_strategy instead of deprecated evaluation_strategy
    num_train_epochs=1,  # Reduce the number of epochs to speed up training
    per_device_train_batch_size=2,
    save_steps=500,
    save_total_limit=2,
)

In [None]:
trainer_gpt2 = Trainer(
    model=model_gpt2,
    args=training_args_gpt2,
    train_dataset=tokenized_gpt2_train,
    eval_dataset=tokenized_gpt2_test,
)

In [None]:
# Train GPT-2
trainer_gpt2.train()

In [None]:
# Generate sentiment-based text using GPT-2 with attention mask and pad_token_id
input_ids_positive_gpt2 = tokenizer_gpt2.encode("Positive review: ", return_tensors="pt").to(device)
attention_mask_gpt2 = input_ids_positive_gpt2.ne(tokenizer_gpt2.pad_token_id).long().to(device)

In [None]:
output_gpt2 = model_gpt2.generate(
    input_ids_positive_gpt2,
    attention_mask=attention_mask_gpt2,  # Add attention mask
    max_length=50,
    pad_token_id=tokenizer_gpt2.eos_token_id  # Set pad_token_id to eos_token_id
)

In [None]:
generated_text_gpt2 = tokenizer_gpt2.decode(output_gpt2[0], skip_special_tokens=True)
print("Generated Text from GPT-2:", generated_text_gpt2)

In [None]:
print("Sample content:", train_dataset[0]["content"])
print("Sample label (0 = negative, 1 = positive):", train_dataset[0]["label"])

In [None]:
# Evaluate GPT-2
eval_results_gpt2 = trainer_gpt2.evaluate()
perplexity_gpt2 = math.exp(eval_results_gpt2['eval_loss'])
print(f"Perplexity from GPT-2: {perplexity_gpt2}")

## Assignment Part 2: Your Turn – Fine-Tuning T5 for Text Generation (on a subset)

In this section, you will fine-tune the T5 model on a subset of the dataset for text generation. You will compare the performance and text generation capabilities of T5 with the GPT-2 model from Part 1. **A framework has been provided, and your job is to complete the TODOs.**

In [None]:
from datasets import load_dataset
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments
import math
import torch

In [None]:
# Load the Amazon Polarity dataset and take a small subset
dataset = load_dataset("amazon_polarity")
train_dataset = dataset["train"].select(range(1000))  # Select only 1000 examples for training
test_dataset = dataset["test"].select(range(200))  # Select only 200 examples for evaluation


In [None]:
# T5 Tokenizer and Model
tokenizer_t5 = T5Tokenizer.from_pretrained("t5-small")
model_t5 = T5ForConditionalGeneration.from_pretrained("t5-small")


In [None]:
#### TODO: Preprocess and tokenize for T5 model fine-tuning ####
def preprocess_t5(examples):
    # HINT: Start by preparing input prompts with a phrase that indicates the task, such as "classify sentiment:".
    # Use the "content" field from the dataset for the text.

    # HINT: Next, create the target labels as strings. Assign "positive" or "negative" based on the label value.

    # HINT: Tokenize the inputs, setting a maximum length to handle variable text sizes.
    # Make sure to use padding and truncation.

    # HINT: Tokenize the labels separately from inputs, keeping them shorter since they only contain the target sentiment.
    # Include the tokenized labels in `model_inputs` under the key "labels".

    return model_inputs


In [None]:
tokenized_t5_train = train_dataset.map(preprocess_t5, batched=True)
tokenized_t5_test = test_dataset.map(preprocess_t5, batched=True)


In [None]:
# Fine-tune T5 on the subset
device = "cuda" if torch.cuda.is_available() else "cpu"
model_t5 = model_t5.to(device)


In [None]:
#### TODO: Set up training arguments for fine-tuning the T5 model ####
training_args_t5 = TrainingArguments(
    # HINT: Specify the directory path where you want to save the model checkpoints and other outputs.

    # HINT: Choose an evaluation strategy to determine how often the model should be evaluated. Consider setting it to evaluate after each epoch.

    # HINT: Define the total number of training epochs. Start with a small number if you want faster results or are experimenting.

    # HINT: Set the batch size per device (GPU or CPU) to manage memory usage effectively.

    # HINT: Specify the number of steps between saving checkpoints to avoid excessive storage use.

    # HINT: Limit the total number of checkpoints saved. This prevents storage from being overloaded with old checkpoints.
)


In [None]:
trainer_t5 = Trainer(
    model=model_t5,
    args=training_args_t5,
    train_dataset=tokenized_t5_train,
    eval_dataset=tokenized_t5_test,
)

In [None]:
# Train T5
trainer_t5.train()

In [None]:
# Generate sentiment-based text using T5
sample_review = train_dataset[0]["content"]
input_text = f"classify sentiment: {sample_review}"
input_ids_positive_t5 = tokenizer_t5(input_text, return_tensors="pt").input_ids.to(device)
output_t5 = model_t5.generate(input_ids_positive_t5, max_length=10)  # Limiting output length

generated_text_t5 = tokenizer_t5.decode(output_t5[0], skip_special_tokens=True)
print("Generated Text from T5:", generated_text_t5)

print("Sample content:", train_dataset[0]["content"])
print("Sample label (0 = negative, 1 = positive):", train_dataset[0]["label"])


In [None]:
# Evaluate T5
eval_results_t5 = trainer_t5.evaluate()
perplexity_t5 = math.exp(eval_results_t5['eval_loss'])
print(f"Perplexity from T5: {perplexity_t5}")

In [None]:
# Imports for Visualization
import matplotlib.pyplot as plt
import time

In [None]:
# Prepare the GPT-2 prompt (tokenization is kept outside the timed region)
input_ids_positive_gpt2 = tokenizer_gpt2.encode("Positive review: ", return_tensors="pt").to(device)
attention_mask_gpt2 = input_ids_positive_gpt2.ne(tokenizer_gpt2.pad_token_id).long().to(device)

In [None]:
# Measure inference time for GPT-2. The timer starts in the same cell as generate();
# starting it in an earlier cell would also count the pause between running cells.
start_gpt2 = time.time()
output_gpt2 = model_gpt2.generate(
    input_ids_positive_gpt2,
    attention_mask=attention_mask_gpt2,  # Add attention mask
    max_length=50,
    pad_token_id=tokenizer_gpt2.eos_token_id  # Set pad_token_id to eos_token_id
)
time_gpt2 = time.time() - start_gpt2

In [None]:
generated_text_gpt2 = tokenizer_gpt2.decode(output_gpt2[0], skip_special_tokens=True)
print("Generated Text from GPT-2:", generated_text_gpt2)

In [None]:
# Evaluate GPT-2 perplexity
eval_results_gpt2 = trainer_gpt2.evaluate()
perplexity_gpt2 = math.exp(eval_results_gpt2['eval_loss'])
print(f"Perplexity from GPT-2: {perplexity_gpt2}")

In [None]:
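# Measure inference time for T5 on a single generate() call so that it is
# comparable to the GPT-2 timing (timing trainer_t5.evaluate() would instead
# measure a full pass over the 200-example test set).
# Note: the output lengths still differ (max_length=10 here vs. 50 for GPT-2).
start_t5 = time.time()
output_t5 = model_t5.generate(input_ids_positive_t5, max_length=10)
time_t5 = time.time() - start_t5

generated_text_t5 = tokenizer_t5.decode(output_t5[0], skip_special_tokens=True)
print("Generated Text from T5:", generated_text_t5)
print(f"Inference Time for T5: {time_t5:.2f} seconds")

# Evaluate T5 perplexity
eval_results_t5 = trainer_t5.evaluate()

# Calculate and print T5 perplexity
if 'eval_loss' in eval_results_t5:
    perplexity_t5 = math.exp(eval_results_t5['eval_loss'])
    print(f"Perplexity from T5: {perplexity_t5}")
else:
    print("Evaluation loss not found in eval_results for T5.")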


In [None]:
# Collect inference time and perplexity for GPT-2 and T5
models = ['GPT-2', 'T5']
inference_times = [time_gpt2, time_t5]
perplexities = [perplexity_gpt2, perplexity_t5]

In [None]:
# Draw both charts in one cell so the two subplots share a single figure
plt.figure(figsize=(10, 4))

# Plot inference time
plt.subplot(1, 2, 1)
plt.bar(models, inference_times, color=['red', 'green'])
plt.title('Inference Time (Seconds)')
plt.ylabel('Time in Seconds')

# Plot perplexity
plt.subplot(1, 2, 2)
plt.bar(models, perplexities, color=['green', 'orange'])
plt.title('Perplexity Comparison')
plt.ylabel('Perplexity')

plt.tight_layout()
plt.show()

In [None]:
# Evaluate Generated Text Quality (GPT-2 and T5 Placeholder)
print("GPT-2 Generated Text:")
print(generated_text_gpt2)
print("\nT5 Generated Text:")
print(generated_text_t5)

### TODO: T5 Fine-Tuning Analysis and Comparison to GPT-2

Now that you've fine-tuned the T5 model on the Amazon Polarity dataset, summarize your observations and learning by addressing the following questions:

- **T5 Performance:**  
  How well did T5 perform on text generation tasks in terms of perplexity, coherence, and inference time?

- **Challenges with T5:**  
  What specific challenges or issues did you encounter while fine-tuning T5? How did these differ from your experience observing GPT-2?

- **Comparative Insights:**  
  Based on your results, how does T5’s encoder-decoder architecture impact its text generation capability compared to GPT-2’s decoder-only design?

- **Practical Considerations:**  
  Under what circumstances would T5 be a better choice than GPT-2 for real-world NLP applications?

**Action:**  
Write a brief analysis (1-2 paragraphs) summarizing your findings and insights clearly in a markdown cell below.
