# ADS-509 Assignment 5.1

## Finetuning LLMs
**Student Version**

In this assignment, you will use a small, locally-hosted LLM (`google/flan-t5-small`) to evaluate performance on the SST‑2 sentiment classification benchmarking dataset. You will compare how the same model performs after:
1) Zero‑shot prompting
2) Few‑shot prompting
3) Fine‑tuning


## General Assignment Instructions

These instructions are included in every assignment, to remind you of the coding standards for the class. Feel free to delete this cell after reading it.

Work through this notebook as if it were a worksheet, completing the code sections marked with **TODO** in the cells provided. Similarly, written questions will be marked by a "Q:" and will have a corresponding "A:" spot for you to fill in with your answers. **Make sure to answer every question marked with a Q: for full credit**.

Your code should be relatively easy-to-read, sensibly commented, and clean. Writing code is a messy process, so please be sure to edit your final submission. Remove any cells that are not needed or parts of cells that contain unnecessary code. Remove inessential import statements and make sure that all such statements are moved into the designated cell.

A .pdf of this notebook, with your completed code and written answers, is what you should submit in Canvas for full credit. **DO NOT SUBMIT A NEW NOTEBOOK FILE OR A RAW .PY FILE**. Submitting in a different format makes it difficult to grade your work, and students who have done this in the past inevitably miss some of the required work or written questions.

## Imports and Downloads

In [20]:
try:
    import datasets, transformers, evaluate, torch  # type: ignore
except ImportError:
    # Install missing dependencies (IPython magic; only works inside a notebook)
    %pip install -q datasets transformers evaluate accelerate sentencepiece

import warnings
import numpy as np
import torch
from sklearn.metrics import confusion_matrix
from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer
warnings.filterwarnings('ignore')


## Load Dataset and Model

For this assignment, you will be comparing performance on a common language model benchmarking task: predicting the sentiment for the Stanford Sentiment Treebank ([SST-2](https://nlp.stanford.edu/sentiment/index.html)). We will use the same model, [Flan-T5-Small](https://huggingface.co/google/flan-t5-small), across all of our "training" methods so that the results are directly comparable.

**TODO**:

- Use your preferred method to select a sample of 2000 sentences from the train dataset.

**Q**: After reading a little bit about the Sentiment Treebank project at the link above, and recognizing that the paper was written in 2013, what method do we now use to provide the same kind of benefit that they intended with their tree-based sentiment representations?

**A**: The 2013 Stanford Sentiment Treebank paper addressed a fundamental limitation of earlier sentiment analysis systems: they treated sentences as bags of words, summing positive and negative word scores while ignoring word order and compositional structure. This made them easy to fool with sentences like "This movie was actually neither that funny, nor super witty," where positive words like "funny" and "witty" appear but the overall sentiment is negative because of negation. The authors' solution was a recursive neural network that operated on syntactic parse trees, building sentiment representations compositionally from smaller phrases to larger ones along the grammatical structure. This let the model capture how words combine at different levels of the sentence and handle negation, contrastive conjunctions, and other compositional phenomena. To train and evaluate these models, they created the Sentiment Treebank dataset, with fine-grained labels for over 215,000 phrases in parse trees.

Today, we get the same benefit from Transformer models built on self-attention, such as BERT, T5, and GPT. Rather than requiring explicit parse trees, these models learn compositional semantics implicitly: stacked self-attention layers let each token attend to the other tokens in its context. The attention weights are learned, so the model discovers which words and phrases are relevant to each other and builds contextual representations that capture how meaning composes across the sentence. This removes the need for syntactic parsing and has outperformed the explicit tree-based methods on sentence-level sentiment classification.
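
To make the bag-of-words limitation concrete, here is a minimal sketch of a lexicon-based scorer. The lexicon `TOY_LEXICON` and the function `bag_of_words_sentiment` are hypothetical names with invented scores; they illustrate the failure mode described above and are not taken from the 2013 paper.

```python
# Hypothetical toy lexicon: word-level sentiment scores invented for illustration.
TOY_LEXICON = {"funny": 1, "witty": 1, "boring": -1, "terrible": -1}

def bag_of_words_sentiment(sentence: str) -> str:
    """Sum per-word scores, ignoring word order and negation entirely."""
    words = sentence.lower().replace(",", " ").replace(".", " ").split()
    score = sum(TOY_LEXICON.get(word, 0) for word in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

example = "This movie was actually neither that funny, nor super witty."
# The scorer sees "funny" and "witty" and ignores "neither ... nor".
print(bag_of_words_sentiment(example))
```

The scorer labels the negated sentence as positive because nothing in a word-count representation encodes the scope of "neither ... nor". Parse trees (in 2013) and self-attention (today) both address this by letting the representation of a phrase depend on its surrounding words.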

In [2]:
!pip install -U transformers
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

Collecting transformers
  Downloading transformers-5.1.0-py3-none-any.whl.metadata (31 kB)
Downloading transformers-5.1.0-py3-none-any.whl (10.3 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 5.0.0
    Uninstalling transformers-5.0.0:
      Successfully uninstalled transformers-5.0.0
Successfully installed transformers-5.1.0



In [19]:
from datasets import load_dataset
raw = load_dataset('glue', 'sst2')
raw

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})

In [30]:
raw['train'][5], raw['validation'][5]

({'sentence': "that 's far too tragic to merit such superficial treatment ",
  'label': 0,
  'idx': 5},
 {'sentence': 'although laced with humor and a few fanciful touches , the film is a refreshingly serious look at young women . ',
  'label': 1,
  'idx': 5})

In [31]:
# TODO: select a sample for your training dataset

train_size = 2000
train_ds = raw['train'].shuffle(seed=42).select(range(train_size))
label_names = train_ds.features['label'].names
eval_ds  = raw['validation']


In [32]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
MODEL_NAME = 'google/flan-t5-small'
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
flan = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)


## Zero‑Shot and Few‑Shot Prompting

One of the major benefits of today's generative models is that they can often be used effectively without supervised, task-specific training (i.e., fine-tuning). This approach, called zero-shot or few-shot prompting, avoids the time and expense of compiling a labeled dataset and training on it. However, using a generative model makes performance evaluation more complex, since the set of possible outputs is not predefined (i.e., the model can produce any sequence of tokens from its vocabulary).

**TODO**:

- Define a function to format zero-shot prompts. *HINT: Your prompt should introduce the labeling task (without providing an example), and end with an indication that it should respond.*
- Define a function to format few-shot prompts using the examples provided. *HINT: This should be very similar to the zero-shot format, but with a couple of example input/output pairs provided.*
- Define a function that will normalize the generated output for evaluation.

**Q**: Discuss one additional benefit and drawback to using a generative model with zero-shot or few-shot prompting in place of traditional supervised learning methods.

**A**: One major benefit of zero-shot or few-shot prompting is flexibility and speed of deployment. Traditional supervised learning requires collecting thousands of labeled examples, training a model for hours or days, and retraining whenever requirements change. Prompting lets you adapt to a new task or modify behavior immediately by changing the prompt text. If you need to add a third sentiment category or change the output format, you can do so without any retraining. This makes prompting particularly valuable for rapid prototyping, for rare or niche tasks where large labeled datasets don't exist, and for situations where requirements evolve frequently.

However, a significant drawback is the lack of reliability and consistency in outputs. A supervised classifier trained on a specific task always predicts one of a fixed set of classes and attaches a score to each class, which makes its behavior predictable and easier to validate. In contrast, a generative model can produce arbitrary text that may not match the expected format, requiring post-processing and normalization logic. The model might respond with "I think this is positive" instead of just "positive," use synonyms, or decline to answer. This unpredictability makes it harder to achieve consistently high performance, harder to debug failures, and harder to deploy in production systems that require reliable, structured outputs. Performance can also vary with subtle changes in prompt wording, which adds another layer of engineering complexity.
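
The normalization problem described above can be sketched with a stricter parser than a plain substring check. This is only one possible design, not the assignment's required solution: `strict_norm_label` and the `"unknown"` fallback label are hypothetical names introduced here. The sketch matches whole words, so an output such as "impossible" (which contains the substring "pos") is not mistaken for a label.

```python
import re

# Assumed label set for this sketch; whole-word matches only.
LABEL_PATTERN = re.compile(r"\b(positive|negative)\b")

def strict_norm_label(generated: str) -> str:
    """Map free-form generated text to 'positive', 'negative', or 'unknown'."""
    matches = set(LABEL_PATTERN.findall((generated or "").lower()))
    # Accept the output only when exactly one label is mentioned.
    if len(matches) == 1:
        return matches.pop()
    return "unknown"  # hypothetical fallback label for unparseable output

for text in ["I think this is positive", "Negative.",
             "It is impossible to say", "positive or negative?"]:
    print(repr(text), "->", strict_norm_label(text))
```

Returning an explicit fallback label keeps unparseable generations visible in the evaluation (they always count as errors) instead of letting them pass through as arbitrary strings.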

In [33]:
FEW_SHOTS = [
    ('This movie was fantastic and heartwarming.', 'positive'),
    ('The plot was boring and predictable.', 'negative'),
]

def zshot_prompt(text):
    # TODO: format your input text into a zero-shot prompt
    return f"Classify the sentiment of this movie review as positive or negative.\nReview: {text}\nSentiment:"

def fshot_prompt(text):
    # TODO: format your input text into a few-shot prompt, using the examples above
    prompt = "Classify the sentiment of movie reviews as positive or negative.\n\n"
    for review, label in FEW_SHOTS:
        prompt += f"Review: {review}\nSentiment: {label}\n\n"
    prompt += f"Review: {text}\nSentiment:"
    return prompt

def norm_label(s: str):
    s = (s or '').lower()
    # TODO: normalize the generated output to produce the labels 'positive' or 'negative'
    if 'positive' in s or 'pos' in s:
        return 'positive'
    if 'negative' in s or 'neg' in s:
        return 'negative'
    return s.strip()

@torch.no_grad() #tells pytorch not to store any gradients
def flan_predict(texts, mode='zero', max_new_tokens=3):
    preds = []
    for t in texts:
        prompt = zshot_prompt(t) if mode == 'zero' else fshot_prompt(t)
        inputs = tok(prompt, return_tensors='pt')
        outputs = flan.generate(**inputs, max_new_tokens=max_new_tokens)
        out = tok.batch_decode(outputs, skip_special_tokens=True)[0]
        preds.append(norm_label(out))
    return preds

print('Zero-shot:', flan_predict(['I loved this movie', 'This was terrible'], mode='zero'))
print('Few-shot :', flan_predict(['I loved this movie', 'This was terrible'], mode='few'))

Zero-shot: ['positive', 'negative']
Few-shot : ['positive', 'negative']


#### Evaluate Prompting Methods

**TODO**:

- Define a function (or use a pre-existing implementation) that computes accuracy, with lists of labels and predictions as input.
- Use your `flan_predict` function from above to produce zero-shot and few-shot predictions over your evaluation data.

**Q**: Reflect on the performance of these two methods. Was there anything that surprised you?

**A**: Zero-shot and few-shot prompting achieved an identical overall accuracy of 87.96%, which is surprisingly strong without any task-specific training. Although the accuracy is the same, the confusion matrices show that the two methods make different types of errors. The zero-shot approach is more balanced, with 373 correct negatives and 394 correct positives against 55 false positives and 50 false negatives.

The few-shot method shows a notable shift in its error pattern. It correctly identified more negative examples (390 vs. 373) but produced more false negatives (67 vs. 50), meaning it was more conservative about predicting positive sentiment. This suggests the few-shot examples made the model slightly more cautious, though the net effect on accuracy was zero because the gain on the negative class offset the loss on the positive class. We usually expect few-shot prompting to outperform zero-shot, but here the model appears to understand the sentiment classification task well enough from its pre-training that additional examples offered no benefit. This suggests Flan-T5 was heavily exposed to sentiment analysis during training. It's also worth noting that both methods reached about 88% accuracy, which is respectable compared with the roughly 80% prior state of the art cited in the original 2013 paper (whose own tree-based model reported 85.4% on binary sentence classification), showing how far pre-trained models have come on this task without fine-tuning.

In [9]:
def accuracy(y_true, y_pred):
    # TODO: compute the accuracy of your predictions
    return sum(1 for true, pred in zip(y_true, y_pred) if true == pred) / len(y_true)

def label_to_str(y): # the SST-2 dataset stores the sentiment labels as integers
    return label_names[int(y)]

eval_texts = [ex['sentence'] for ex in eval_ds]
eval_labels = [label_to_str(ex['label']) for ex in eval_ds]

# TODO: predict the sentiment of your evaluation dataset with the zero-shot and few-shot prompting methods
z_preds = flan_predict(eval_texts, mode='zero')
f_preds = flan_predict(eval_texts, mode='few')

z_acc = accuracy(eval_labels, z_preds)
f_acc = accuracy(eval_labels, f_preds)
print({'zero_shot_acc': round(z_acc, 4), 'few_shot_acc': round(f_acc, 4)})
print("Zero-Shot Confusion Matrix:\n", confusion_matrix(eval_labels, z_preds))
print("Few-Shot Confusion Matrix:\n", confusion_matrix(eval_labels, f_preds))



{'zero_shot_acc': 0.8796, 'few_shot_acc': 0.8796}
Zero-Shot Confusion Matrix:
 [[373  55]
 [ 50 394]]
Few-Shot Confusion Matrix:
 [[390  38]
 [ 67 377]]
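
The error-pattern comparison can be checked with a short worked calculation on the two matrices printed above. This sketch assumes rows are true labels and columns are predictions, both ordered `['negative', 'positive']` (scikit-learn sorts string labels alphabetically when no `labels` argument is given). The helper name `per_class_metrics` is hypothetical.

```python
def per_class_metrics(matrix):
    """Compute accuracy plus precision/recall for the 'positive' class.

    Assumes a 2x2 matrix with rows = true labels, columns = predictions,
    both ordered ['negative', 'positive'].
    """
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": round((tn + tp) / total, 4),
        "precision_pos": round(tp / (tp + fp), 4),
        "recall_pos": round(tp / (tp + fn), 4),
    }

zero_shot = [[373, 55], [50, 394]]  # copied from the output above
few_shot = [[390, 38], [67, 377]]   # copied from the output above
print("zero-shot:", per_class_metrics(zero_shot))
print("few-shot :", per_class_metrics(few_shot))
```

Both matrices yield the same accuracy (767 of 872 correct), while the few-shot prompt trades recall on the positive class for precision, which is the shift toward more cautious positive predictions discussed in the answer.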


## Model Fine Tuning

Now we will fine-tune the same Flan-T5 model on the SST-2 training dataset and compare performance. Since we are still using a generative model, the data needs to be formatted to support generation, as we did above. We also need to define a custom metric function for the model to use during training, ensuring that the output matches our expected labels for evaluation.

**TODO**:

- Define a function to format data to use in a generative model training pipeline.
- Use your `norm_label` function from above to process the model output for evaluation.

**Q**: How would the model training pipeline change if we were using a representative model (i.e. an encoder-side transformer model like BERT) instead of a generative model? Which type of model makes more sense when doing a supervised training task?

**A**:
With a representative model like BERT, the training pipeline would change in both architecture and objective. Instead of training a sequence-to-sequence model to generate text outputs like "positive" or "negative," you would add a classification head on top of BERT's encoder output and train it to predict one of two classes directly. The loss would be cross-entropy over the two class logits rather than the token-level generation loss used here, and you wouldn't need to tokenize target labels or use a data collator designed for generation. The model would output logits for each class, and you'd simply take the argmax to get predictions, eliminating the need for any text normalization or decoding steps.

For supervised classification tasks like sentiment analysis, representative models like BERT make more sense than generative models. BERT-style classifiers are more efficient because they need only a single forward pass through the encoder to produce a prediction, whereas generative models must also run the decoder autoregressively, once per output token. They're also more reliable since they output a probability distribution over a fixed set of classes rather than free-form text that requires normalization.

The fine-tuned Flan-T5 achieved 88.76% accuracy, which is only marginally better than the 87.96% from zero-shot and few-shot prompting. This modest improvement after training on 2000 examples illustrates a key point: generative models are powerful for their flexibility across tasks without fine-tuning, but when you have labeled data and a well-defined classification problem, a dedicated encoder model trained as a classifier would likely achieve better performance more efficiently. The generative approach adds unnecessary complexity for what is fundamentally a binary classification task.
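
The classification-head objective described above can be sketched in plain Python. This is an illustration of the arithmetic only (softmax, cross-entropy, argmax), not a BERT training pipeline. The logit values are hypothetical, and `CLASS_NAMES` assumes the SST-2 convention that label 0 is negative and label 1 is positive.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, true_class):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[true_class])

CLASS_NAMES = ["negative", "positive"]  # assumed SST-2 order: 0 = negative, 1 = positive

# Hypothetical logits that a two-way classification head might emit for one sentence.
logits = [-1.2, 2.3]
probs = softmax(logits)
predicted = CLASS_NAMES[probs.index(max(probs))]  # argmax; no text decoding needed
loss = cross_entropy(logits, true_class=1)

print(predicted, [round(p, 3) for p in probs], round(loss, 3))
```

Because the prediction is an index into a fixed class list, there is nothing to normalize: every output is a valid label by construction, unlike the generated text handled by `norm_label`.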

In [35]:
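def format_data(example):
    # Use the same wording as the zero-shot prompt so training inputs match inference prompts
    inp = f"Classify the sentiment of this movie review as positive or negative.\nReview: {example['sentence']}\nSentiment:"
    tgt = label_to_str(example['label'])
    return {'input_text': inp, 'target_text': tgt}

train_formatted = train_ds.map(format_data) # .map() applies the function to every row without an explicit loop
eval_formatted  = eval_ds.map(format_data)

def tokenize_fn(batch):
    model_inputs = tok(batch['input_text'], truncation=True) # convert text to token ids and cut off very long inputs
    # tok.as_target_tokenizer() is deprecated; T5 uses one vocabulary, so targets are tokenized directly
    labels = tok(batch['target_text'], truncation=True)
    model_inputs['labels'] = labels['input_ids']
    return model_inputs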

train_toks = train_formatted.map(tokenize_fn, batched=True, remove_columns=train_formatted.column_names)
eval_toks  = eval_formatted.map(tokenize_fn,  batched=True, remove_columns=eval_formatted.column_names)

data_collator = DataCollatorForSeq2Seq(tok, model=flan)

In [36]:
def compute_metrics(eval_pred):
    preds, labels = eval_pred
    if isinstance(preds, tuple):  # some model outputs wrap the generated ids in a tuple
        preds = preds[0]
    # -100 marks ignored positions; swap it for the pad token so the tokenizer can decode
    preds = np.where(preds != -100, preds, tok.pad_token_id)
    labels = np.where(labels != -100, labels, tok.pad_token_id)
    pred_texts = tok.batch_decode(preds, skip_special_tokens=True)  # translate token ids to text
    ref_texts = tok.batch_decode(labels, skip_special_tokens=True)

    # TODO: normalize the model outputs for evaluation
    preds_norm = [norm_label(p) for p in pred_texts]
    return {'accuracy': accuracy(ref_texts, preds_norm)}


In [38]:
training_args = Seq2SeqTrainingArguments(
    output_dir='outputs/flan_t5_sst2',
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    eval_strategy='epoch',
    save_strategy='no',
    predict_with_generate=True,
    logging_steps=50,
    report_to=[],
)

trainer = Seq2SeqTrainer(
    model=flan,
    args=training_args,
    train_dataset=train_toks,
    eval_dataset=eval_toks,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train();

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 0.159461 | 0.142931 | 0.889908 |
| 2 | 0.1445 | 0.144568 | 0.887615 |


In [39]:
eval_out = trainer.predict(eval_toks)
ft_acc = round(eval_out.metrics["test_accuracy"], 4)
print({'finetuned_flan_t5_eval_accuracy': ft_acc})

# Replace the -100 padding with the pad token, then decode predictions and references
pred_ids = np.where(eval_out.predictions != -100, eval_out.predictions, tok.pad_token_id)
label_ids = np.where(eval_out.label_ids != -100, eval_out.label_ids, tok.pad_token_id)
pred_texts = [norm_label(p) for p in tok.batch_decode(pred_ids, skip_special_tokens=True)]
ref_texts = tok.batch_decode(label_ids, skip_special_tokens=True)

# Rows/columns are ordered positive, negative here (the prompting matrices above use negative, positive)
print(confusion_matrix(ref_texts, pred_texts, labels=["positive", "negative"]))

{'finetuned_flan_t5_eval_accuracy': 0.8876}
[[393  51]
 [ 47 381]]


### Model Comparison

**Q**: Reflect on the performance of your three methods. Is there anything that was surprising? Would you do anything to improve the performance of any of the methods? Are there any other methods that you would like to compare?

**A**:
The most surprising result is how small the performance differences are across all three methods. The zero-shot and few-shot approaches achieved an identical 87.96% accuracy, and fine-tuning on 2000 labeled examples improved performance by only 0.8 percentage points, to 88.76% (774 vs. 767 correct predictions out of 872). This suggests that Flan-T5's pre-training already equipped it with strong sentiment analysis capabilities, leaving little room for gains from either demonstrations or supervised fine-tuning at this dataset size.

The fact that few-shot prompting didn't outperform zero-shot is particularly interesting and counterintuitive. It could indicate that the task description alone is sufficient for the model to understand what's required, or that the two demonstration examples weren't diverse enough to add value. It's also possible that prompt formatting matters more than the examples themselves. The modest improvement from fine-tuning raises the question of whether 2000 examples are enough to meaningfully shift the model's behavior, or whether the learning rate and other hyperparameters could be better tuned. The training log also shows validation accuracy dipping slightly from epoch 1 to epoch 2 (0.8899 to 0.8876), so simply training longer on the same sample is unlikely to help.

To improve the prompting methods, I would experiment with different prompt phrasings, add more diverse few-shot examples covering edge cases like negation and contrastive conjunctions, or try chain-of-thought prompting to encourage reasoning. For fine-tuning, training on the full dataset rather than just 2000 examples would likely yield better results, and tuning the learning rate, batch size, and number of epochs could improve the training process. A natural additional comparison, following the discussion in the previous section, would be an encoder-only classifier such as BERT fine-tuned on the same 2000 examples.

In [40]:
summary = {
    'zero_shot_flan_t5_acc': round(z_acc, 4),
    'few_shot_flan_t5_acc': round(f_acc, 4),
    'finetuned_flan_t5_acc' : ft_acc,
}
summary

{'zero_shot_flan_t5_acc': 0.8796,
 'few_shot_flan_t5_acc': 0.8796,
 'finetuned_flan_t5_acc': 0.8876}
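
To put the size of these differences in context, the following sketch computes a normal-approximation 95% confidence interval for each accuracy. It is a rough check under stated assumptions, not part of the assignment: the helper `accuracy_interval` is a hypothetical name, the correct-prediction counts are read off the confusion matrices above, and because all three methods were scored on the same 872 sentences, a paired test such as McNemar's would be the more rigorous comparison.

```python
import math

def accuracy_interval(correct: int, total: int, z: float = 1.96):
    """Normal-approximation 95% confidence interval for an accuracy estimate."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width

N_EVAL = 872  # size of the SST-2 validation split
results = {
    "zero_shot": 767,   # 373 + 394 correct, from the confusion matrix
    "few_shot": 767,    # 390 + 377 correct
    "fine_tuned": 774,  # 393 + 381 correct
}
for name, correct in results.items():
    low, high = accuracy_interval(correct, N_EVAL)
    print(f"{name:>10}: {correct / N_EVAL:.4f}  (95% CI {low:.4f} to {high:.4f})")
```

Each interval spans roughly two percentage points on either side of the estimate, so the 0.8-point gain from fine-tuning falls well inside the sampling noise of an 872-sentence evaluation set.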