<a href="https://colab.research.google.com/github/GoldPapaya/info256-applied-nlp/blob/main/10.llms/HW9_LLM_Inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# HW9: LLM Inference

In this homework, you will experiment with different ways of improving LLM classification performance.

In [1]:
import torch

from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer

In [2]:
# use the 4B model

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B", device_map="cuda", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



## Loading data

For the majority of this homework, we will be using data from *Who Feels What and Why? Annotation of a Literature Corpus with Semantic Roles of Emotions* [(Kim and Klinger, 2018)](https://aclanthology.org/C18-1114.pdf).

In [3]:
!wget https://raw.githubusercontent.com/bamman-group/ca-classification-data/refs/heads/main/data/emotion/train.jsonl
!wget https://raw.githubusercontent.com/bamman-group/ca-classification-data/refs/heads/main/data/emotion/test.jsonl

--2025-10-24 22:19:07--  https://raw.githubusercontent.com/bamman-group/ca-classification-data/refs/heads/main/data/emotion/train.jsonl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 112051 (109K) [text/plain]
Saving to: ‘train.jsonl’


2025-10-24 22:19:08 (47.4 MB/s) - ‘train.jsonl’ saved [112051/112051]

--2025-10-24 22:19:08--  https://raw.githubusercontent.com/bamman-group/ca-classification-data/refs/heads/main/data/emotion/test.jsonl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 114133 (111K) [text/plain]
Saving to: ‘test.

In [4]:
import json

def load_data(filepath):
    with open(filepath, "r") as f:
        data = [
            json.loads(line) for line in f
        ]
    return data

In [5]:
train_data = load_data("train.jsonl")
test_data = load_data("test.jsonl")

### Question 1

Take a look through the paper, as well as the actual dataset. What are the classification labels? **Fill them in below.**

In [6]:
train_data[0]

{'id': '4588|897',
 'text': 'Jacob led his horses down the carriage-way to the gate, which he closed carefully after passing through; and then mounting to his seat, drove off rapidly. But little conversation took place between Mrs. Allen and her traveling companion; and that was in so low a tone of voice, that Jacob Perkins failed to catch a single word, though ***he*** bent his ear and listened with the closest attention whenever he heard a murmur of voices. It was after daylight when they arrived in Boston, where Jacob Perkins left them, and returned home with all speed, to wake up the town of S----with a report of his strange adventure.',
 'label': 'anticipation',
 'group': '4588'}

In [9]:
# collect the distinct labels that occur in the training data
label_set = {d['label'] for d in train_data}

print(label_set)

{'trust', 'surprise', 'fear', 'sadness', 'joy', 'anger', 'anticipation', 'disgust'}


In [24]:
# FILL ME IN
labels = ['trust', 'surprise', 'fear', 'sadness', 'joy', 'anger', 'anticipation', 'disgust']

## Setting up the LLM

For greater consistency, we set the temperature to a low value (0.01) by default; pass a different `generation_config` to `call_llm` to change it.
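
To make the effect of temperature concrete, here is a minimal sketch of temperature-scaled softmax over a toy set of scores. The helper name `softmax_with_temperature` and the logit values are illustrative assumptions; this is not the code path `model.generate` uses internally, only the underlying idea.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

for t in (1.0, 0.5, 0.01):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature approaches zero, nearly all probability mass moves to the highest-scoring token, which is why a value like 0.01 gives almost deterministic outputs.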

In [10]:
from textwrap import dedent
import itertools
import inspect

def call_llm(prompt, system_prompt="You are a helpful assistant.", generation_config=None):
    if generation_config is None:
        generation_config = {
            "max_new_tokens": 10,
            "temperature": 0.01
        }
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False
    )

    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

    # conduct text completion
    generated = model.generate(
        **model_inputs,
        **generation_config
    )

    # let's break this down:
    #                      | we take the element of the batch (our batch size is 1)
    #                      |  |-----------------------------| skip our original input
    output_ids = generated[0][len(model_inputs.input_ids[0]):].tolist()

    # decode the generated token ids back into text
    return tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")

## Classification

In [11]:
def evaluate(classifier):
    predictions = classifier(train_data, test_data)
    return sum(pred == target["label"] for pred, target in zip(predictions, test_data)) / len(test_data)

In [82]:
from collections import Counter
from textwrap import dedent
import random


def classify_majority_label(train_data, test_data):
    """Majority label baseline"""

    majority_class = Counter([d["label"] for d in train_data]).most_common(1)[0][0]
    test_predictions = []
    for i, datum in enumerate(tqdm(test_data)):
        test_predictions.append(majority_class)

    print(test_predictions[:5])
    return test_predictions

In [21]:
evaluate(classify_majority_label)

100%|██████████| 276/276 [00:00<00:00, 1612295.13it/s]


0.2028985507246377

Fill the rest of these in!

## Question 2

We've implemented a majority vote baseline for you.

Implement a zero-shot prompting classifier. Try at least 3 versions of the prompt to compare their outputs.

**In a few sentences,** describe how different prompting choices result in different outputs.

In [83]:
# for the rest of my code I got rid of 'enumerate' from the baseline
# and just used 'i' to represent the item/datum, since I didn't really
# need to use the iterator

def classify_zero_shot1(train_data, test_data):
    """Classification with zero-shot prompting."""
    system_prompt = 'You are a psychologist'
    test_predictions = []
    for i in tqdm(test_data):
      text = i['text']
      prompt = f'''
      Here is a passage of text: {text}
      Classify the passage with the following labels:
      trust, surprise, fear, sadness, joy, anger, anticipation, disgust
      Only respond with a single label from this list
      '''
      model_prediction = call_llm(prompt, system_prompt)
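      if model_prediction not in labels:
        print('classify_zero_shot1 is not using labels as specified')
        print(model_prediction)
      # always record the output so predictions stay aligned with test_data;
      # an out-of-label output is simply scored as incorrect
      test_predictions.append(model_prediction)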
    print(test_predictions[:5])
    return test_predictions

def classify_zero_shot2(train_data, test_data):
    """Classification with zero-shot prompting."""
    system_prompt = 'You are an emotions expert'
    test_predictions = []
    for i in tqdm(test_data):
      text = i['text']
      prompt = f'''
      Some text that you can classify is: {text}
      Use the following for your classification:
      trust, surprise, fear, sadness, joy, anger, anticipation, disgust
      Respond with a single label only, from this list
      '''
      model_prediction = call_llm(prompt, system_prompt)
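      if model_prediction not in labels:
        print('classify_zero_shot2 is not using labels as specified')
        print(model_prediction)
      # always record the output so predictions stay aligned with test_data;
      # an out-of-label output is simply scored as incorrect
      test_predictions.append(model_prediction)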
    return test_predictions

def classify_zero_shot3(train_data, test_data):
    """Classification with zero-shot prompting."""
    system_prompt = 'You are a data scientist who is great at predicting feelings'
    test_predictions = []
    for i in tqdm(test_data):
      text = i['text']
      prompt = f'''
      Given the following text: {text}
      Classify the passage with labels:
      trust, surprise, fear, sadness, joy, anger, anticipation, disgust
      Your response should only contain one label from that list
      '''
      model_prediction = call_llm(prompt, system_prompt)
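      if model_prediction not in labels:
        print('classify_zero_shot3 is not using labels as specified')
        print(model_prediction)
      # always record the output so predictions stay aligned with test_data;
      # an out-of-label output is simply scored as incorrect
      test_predictions.append(model_prediction)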
    return test_predictions
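
A model may return a valid label with different casing or trailing punctuation, which an exact `in labels` check would reject. The following is a minimal sketch of a normalization step; the helper name `normalize_label` and the constant `LABELS` are hypothetical and not part of the homework scaffold.

```python
import string

LABELS = ["trust", "surprise", "fear", "sadness", "joy", "anger", "anticipation", "disgust"]

def normalize_label(raw_output, labels=LABELS):
    """Map a raw model output to a known label, or None if it does not match."""
    # lowercase, then trim surrounding whitespace and punctuation
    cleaned = raw_output.strip().lower().strip(string.punctuation + " ")
    return cleaned if cleaned in labels else None

print(normalize_label("Joy."))          # matches after cleanup
print(normalize_label(" serenity\n"))   # not in the label set
```

Returning `None` instead of skipping the item keeps one entry per test example, so the caller can decide how to score out-of-label outputs.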

In [61]:
# testing prompt
text = 'I am very happy'

prompt = f'''
      Here is a passage of text: {text}
      Classify the passage with the following labels:
      trust, surprise, fear, sadness, joy, anger, anticipation, disgust
      Only respond with one label
      '''

# call_llm takes the user prompt first and the system prompt second
print(call_llm(prompt, 'You are a psychologist'))

joy


In [62]:
prompt1_predictions = classify_zero_shot1(train_data, test_data)

 43%|████▎     | 119/276 [01:17<01:45,  1.49it/s]

classify_zero_shot1 is not using labels as specified
contentment


 97%|█████████▋| 268/276 [02:54<00:06,  1.33it/s]

classify_zero_shot1 is not using labels as specified
serenity


100%|██████████| 276/276 [02:59<00:00,  1.54it/s]


In [63]:
prompt2_predictions = classify_zero_shot2(train_data, test_data)

  0%|          | 1/276 [00:01<04:57,  1.08s/it]

classify_zero_shot2 is not using labels as specified
neutral


 33%|███▎      | 92/276 [00:59<02:03,  1.49it/s]

classify_zero_shot2 is not using labels as specified
neutral


 97%|█████████▋| 268/276 [02:54<00:05,  1.36it/s]

classify_zero_shot2 is not using labels as specified
serenity


100%|██████████| 276/276 [02:59<00:00,  1.53it/s]


In [64]:
prompt3_predictions = classify_zero_shot3(train_data, test_data)

 97%|█████████▋| 268/276 [02:56<00:05,  1.36it/s]

classify_zero_shot3 is not using labels as specified
serenity


100%|██████████| 276/276 [03:02<00:00,  1.51it/s]


In [67]:
print(prompt1_predictions[:10])
print(prompt2_predictions[:10])
print(prompt3_predictions[:10])
for i in test_data[:10]:
  print(i['text'])

['surprise', 'fear', 'sadness', 'fear', 'sadness', 'fear', 'joy', 'surprise', 'trust', 'anticipation']
['fear', 'sadness', 'fear', 'sadness', 'fear', 'joy', 'surprise', 'trust', 'joy', 'sadness']
['trust', 'fear', 'sadness', 'fear', 'sadness', 'fear', 'joy', 'surprise', 'trust', 'anticipation']
They were all literary gentlemen, though unknown as yet to Pen. There was Mr. Bole, the real editor of the magazine, of which Mr. Wagg was the nominal chief; ***Mr. Trotter***, who, from having broken out on the world as a poet of a tragic and suicidial cast, had now subsided into one of Mr. Bungay's back shops as reader for that gentleman; and Captain Sumph, an ex-beau reader about town, and related in some indistinct manner to Literature and the Peerage. He was said to have written a book once, to have been a friend of Lord Byron, to be related to Lord Sumphington; in fact, anecdotes of Byron formed his staple, and he seldom spoke but with the name of that poet or some of his contemporaries in

The premise behind prompt engineering is to craft inputs that lead to the best LLM outputs without adjusting the model itself. It resembles a form of preprocessing: we take unstructured data and organize it in a way that maximizes its interpretability to the model. With the zero-shot approach above, small adjustments to the wording change the outputs. The first recorded prediction differs across all three prompts (`surprise`, `fear`, and `trust`), and the out-of-label responses differ too: the second prompt produced `neutral` twice, while the other two never did. The data do not change, nor do the task requirements; only the wording does.
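
One way to quantify how much two prompt variants differ is position-by-position agreement between their prediction lists. The sketch below uses the first ten predictions printed above; the helper name `agreement` is an assumption for illustration. The last comparison suggests that the second list is largely the first list offset by one position, which would be consistent with an early out-of-label output (`neutral`) having been left out of it, so part of the apparent disagreement may be an alignment artifact rather than a real difference between prompts.

```python
def agreement(a, b):
    """Fraction of positions where two equally long prediction lists match."""
    assert len(a) == len(b), "lists must be equally long"
    return sum(x == y for x, y in zip(a, b)) / len(a)

# First ten predictions per prompt, copied from the output above.
p1 = ['surprise', 'fear', 'sadness', 'fear', 'sadness', 'fear', 'joy', 'surprise', 'trust', 'anticipation']
p2 = ['fear', 'sadness', 'fear', 'sadness', 'fear', 'joy', 'surprise', 'trust', 'joy', 'sadness']
p3 = ['trust', 'fear', 'sadness', 'fear', 'sadness', 'fear', 'joy', 'surprise', 'trust', 'anticipation']

print(agreement(p1, p3))            # prompts 1 and 3, position by position
print(agreement(p1, p2))            # prompts 1 and 2, position by position
print(agreement(p1[1:], p2[:-1]))   # prompt 2 against prompt 1 shifted by one
```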

## Question 3

Implement the following:

1. Few-shot (k=3) classification
2. Zero-shot with chain-of-thought
3. Few-shot (k=3) with chain-of-thought (you will need to write reasoning chains)
4. Zero-shot with self-consistency (use `generation_config` to change the temperature)

For each of these, print out the raw LLM output for the first 5 data points in the test data.

Use the `evaluate` function to measure the accuracy of your method. **Write a few sentences comparing the performance of different prompting methods (including the above, and zero-shot from Q2).**

In [87]:
def classify_few_shot(train_data, test_data):
    """Classification with 3-shots."""
    system_prompt = "You are an expert in emotion classification. Given examples and a new text, predict the emotion from: trust, surprise, fear, sadness, joy, anger, anticipation, disgust. Return only a single emotion"
    test_predictions = []
    # compile 3 examples using random training data
    random.seed(42)
    example_data = random.sample(train_data, 3)
    examples = ""
    for i in example_data:
        examples += f"Text:{i['text']}\nEmotion:{i['label']}\n\n"

    for i in tqdm(test_data):
        text = i["text"]
        prompt = f"""
            Below are examples of text with their emotions:
            {examples}
            Classify the emotion in the following text: {text}
            Choose one emotion from: trust, surprise, fear, sadness, joy, anger, anticipation, disgust.
            Return only the emotion label.
        """
        model_prediction = call_llm(prompt, system_prompt)
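        if model_prediction not in labels:
          print('classify_few_shot is not using labels as specified')
          print(model_prediction)
        # always record the output so predictions stay aligned with test_data;
        # an out-of-label output is simply scored as incorrect
        test_predictions.append(model_prediction)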
    print(test_predictions[:5])
    return test_predictions


In [88]:
def classify_zero_shot_cot(train_data, test_data):
    """Classification with zero-shot chain-of-thought."""
    system_prompt = "You are an expert in emotion classification. You go step-by-step to find the emotion in text, and provide just the final emotion label."
    test_predictions = []
    for i in tqdm(test_data):
        text = i["text"]
        prompt = f"""
            To classify the emotion in the text: {text}
            Use the following steps:
            1. Identify words or phrases that are emotional.
            2. Consider tone and context of text.
            3. Choose one emotion from: trust, surprise, fear, sadness, joy, anger, anticipation, disgust.
            Return only the emotion label.
        """
        model_prediction = call_llm(prompt, system_prompt)
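        if model_prediction not in labels:
          print('classify_zero_shot_cot is not using labels as specified')
          print(model_prediction)
        # always record the output so predictions stay aligned with test_data;
        # an out-of-label output is simply scored as incorrect
        test_predictions.append(model_prediction)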
    print(test_predictions[:5])
    return test_predictions

In [92]:
def classify_few_shot_cot(train_data, test_data):
    """Classification with 3-shot chain-of-thought."""
    system_prompt = "You are an expert in emotion classification. You go step-by-step to find the emotion in text, and provide just the final emotion label."
    test_predictions = []

    # compile 3 examples using random training data
    random.seed(42)
    example_data = random.sample(train_data, 3)
    examples = ""
    for i in example_data:
        examples += f"Text:{i['text']}\nEmotion:{i['label']}\n\n"

    for i in tqdm(test_data):
        text = i["text"]
        prompt = f"""
            Below are examples of text with their emotions:
            {examples}
            To classify the emotion in the text: {text}
            Use the following steps:
            1. Identify words or phrases that are emotional.
            2. Consider tone and context of text.
            3. Choose one emotion from: trust, surprise, fear, sadness, joy, anger, anticipation, disgust.
            Return only the emotion label.
        """
        model_prediction = call_llm(prompt, system_prompt)
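        if model_prediction not in labels:
          print('classify_few_shot_cot is not using labels as specified')
          print(model_prediction)
        # always record the output so predictions stay aligned with test_data;
        # an out-of-label output is simply scored as incorrect
        test_predictions.append(model_prediction)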
    print(test_predictions[:5])
    return test_predictions
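
The assignment asks for written reasoning chains, whereas the function above reuses label-only examples. Below is a minimal sketch of how demonstrations with reasoning could be formatted. The helper name `format_cot_examples`, the example texts, and the reasoning strings are all hypothetical placeholders; real demonstrations would use passages from `train_data` with hand-written reasoning.

```python
def format_cot_examples(demos):
    """Render text/reasoning/label records as few-shot chain-of-thought demonstrations."""
    blocks = []
    for demo in demos:
        blocks.append(
            f"Text: {demo['text']}\n"
            f"Reasoning: {demo['reasoning']}\n"
            f"Emotion: {demo['label']}"
        )
    return "\n\n".join(blocks)

# Hypothetical demonstrations, invented for illustration only.
demos = [
    {"text": "She tore open the letter and laughed aloud.",
     "reasoning": "Laughing aloud at the letter suggests its news pleased her.",
     "label": "joy"},
    {"text": "He froze at the sound of footsteps behind the door.",
     "reasoning": "Freezing at an unexpected sound suggests he feels threatened.",
     "label": "fear"},
    {"text": "They waited at the window for the carriage to arrive.",
     "reasoning": "Watching for an arrival suggests they are looking ahead to an event.",
     "label": "anticipation"},
]

rendered = format_cot_examples(demos)
print(rendered)
```

For the model to write its own reasoning before the label, `max_new_tokens` in `generation_config` would need to be larger than the default of 10, and the label would then have to be parsed from the end of the output.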

In [90]:
from statistics import mode

def classify_zero_shot_self_consistency(train_data, test_data):
    """Implement self-consistency for zero-shot prompting."""
    system_prompt = "You are an expert in emotion classification."
    temperatures = [0.2, 0.4, 0.6, 0.8]
    test_predictions = []
    for i in tqdm(test_data):
        text = i["text"]
        prompt = f"""
            Here is a passage of text: {text}
            Classify the passage with the following labels:
            trust, surprise, fear, sadness, joy, anger, anticipation, disgust
            Only respond with a single label from this list
        """
        outputs = []
        for temp in temperatures:
            outputs.append(call_llm(prompt, system_prompt=system_prompt, generation_config={"max_new_tokens": 10, "temperature": temp}))
        # Select most common valid label
        valid_outputs = [o for o in outputs if o in labels] # should catch bad labels in same way as before
        prediction = mode(valid_outputs) if valid_outputs else random.choice(labels)
        test_predictions.append(prediction)
    print(test_predictions[:5])
    return test_predictions
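
The voting step of self-consistency can be looked at in isolation. This is a minimal sketch, assuming the sampled outputs are already collected in a list; the helper name `majority_vote` and the sample values are hypothetical. Unlike the function above, it returns `None` when no sample is a valid label, leaving the fallback choice to the caller.

```python
from collections import Counter

def majority_vote(outputs, labels):
    """Return the most frequent valid label among sampled outputs, or None if none are valid."""
    valid = [o for o in outputs if o in labels]
    if not valid:
        return None
    # most_common orders by count; equal counts keep first-seen order
    return Counter(valid).most_common(1)[0][0]

labels = ["trust", "surprise", "fear", "sadness", "joy", "anger", "anticipation", "disgust"]
samples = ["joy", "serenity", "joy", "trust"]  # hypothetical sampled outputs

print(majority_vote(samples, labels))
```

Self-consistency is commonly run by sampling several times at a single nonzero temperature; varying the temperature across samples, as done above, is one possible variation.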

In [93]:
for name, fn in [
    ("majority", classify_majority_label),
    ("zero-shot", classify_zero_shot1), # just going to use the first one as I made 3
    ("few-shot", classify_few_shot),
    ("zero-shot-cot", classify_zero_shot_cot),
    ("few-shot-cot", classify_few_shot_cot),
    ("self-consistency", classify_zero_shot_self_consistency)
]:
    score = evaluate(fn)
    print(f"{name}\t{score}")

100%|██████████| 276/276 [00:00<00:00, 332460.63it/s]


['joy', 'joy', 'joy', 'joy', 'joy']
majority	0.2028985507246377


 43%|████▎     | 119/276 [01:17<01:42,  1.53it/s]

classify_zero_shot1 is not using labels as specified
contentment


 97%|█████████▋| 268/276 [02:54<00:05,  1.35it/s]

classify_zero_shot1 is not using labels as specified
serenity


100%|██████████| 276/276 [02:59<00:00,  1.53it/s]


['surprise', 'fear', 'sadness', 'fear', 'sadness']
zero-shot	0.20652173913043478


100%|██████████| 276/276 [06:24<00:00,  1.39s/it]


['anticipation', 'trust', 'sadness', 'anger', 'anticipation']
few-shot	0.35507246376811596


  0%|          | 1/276 [00:01<05:38,  1.23s/it]

classify_zero_shot_cot is not using labels as specified
neutral


 11%|█         | 31/276 [00:24<03:12,  1.27it/s]

classify_zero_shot_cot is not using labels as specified
shock


 15%|█▍        | 41/276 [00:32<03:10,  1.24it/s]

classify_zero_shot_cot is not using labels as specified
neutral


 19%|█▉        | 53/276 [00:42<03:08,  1.18it/s]

classify_zero_shot_cot is not using labels as specified
affection


 30%|███       | 84/276 [01:05<02:09,  1.48it/s]

classify_zero_shot_cot is not using labels as specified
neutral


 32%|███▏      | 89/276 [01:09<02:29,  1.25it/s]

classify_zero_shot_cot is not using labels as specified
neutral


 33%|███▎      | 92/276 [01:12<02:20,  1.31it/s]

classify_zero_shot_cot is not using labels as specified
neutral


 35%|███▌      | 97/276 [01:16<02:20,  1.27it/s]

classify_zero_shot_cot is not using labels as specified
jealousy


 39%|███▉      | 108/276 [01:24<01:57,  1.43it/s]

classify_zero_shot_cot is not using labels as specified
determination


 43%|████▎     | 119/276 [01:33<02:11,  1.20it/s]

classify_zero_shot_cot is not using labels as specified
contentment


 48%|████▊     | 133/276 [01:44<01:53,  1.26it/s]

classify_zero_shot_cot is not using labels as specified
serenity


 51%|█████     | 141/276 [01:50<01:52,  1.20it/s]

classify_zero_shot_cot is not using labels as specified
neutral


 67%|██████▋   | 184/276 [02:22<01:16,  1.20it/s]

classify_zero_shot_cot is not using labels as specified
neutral


 83%|████████▎ | 229/276 [02:58<00:37,  1.25it/s]

classify_zero_shot_cot is not using labels as specified
neutral


 97%|█████████▋| 267/276 [03:28<00:07,  1.13it/s]

classify_zero_shot_cot is not using labels as specified
curiosity


 97%|█████████▋| 269/276 [03:30<00:06,  1.10it/s]

classify_zero_shot_cot is not using labels as specified
neutral


 98%|█████████▊| 271/276 [03:31<00:04,  1.11it/s]

classify_zero_shot_cot is not using labels as specified
wonderment


100%|██████████| 276/276 [03:35<00:00,  1.28it/s]


['fear', 'sadness', 'sadness', 'sadness', 'sadness']
zero-shot-cot	0.10869565217391304


 11%|█         | 31/276 [00:45<05:52,  1.44s/it]

classify_few_shot_cot is not using labels as specified
shock


 19%|█▉        | 53/276 [01:17<05:33,  1.49s/it]

classify_few_shot_cot is not using labels as specified
affection


 35%|███▌      | 97/276 [02:20<04:21,  1.46s/it]

classify_few_shot_cot is not using labels as specified
jealousy


100%|██████████| 276/276 [06:40<00:00,  1.45s/it]


['anticipation', 'sadness', 'anticipation', 'anger', 'joy']
few-shot-cot	0.17753623188405798


100%|██████████| 276/276 [12:03<00:00,  2.62s/it]

['sadness', 'fear', 'sadness', 'fear', 'sadness']
self-consistency	0.322463768115942





The error-handling messages make the output above hard to read, so the scores are collected below.
```
majority	0.2028985507246377
zero-shot	0.20652173913043478
few-shot	0.35507246376811596
zero-shot-cot	0.10869565217391304
few-shot-cot	0.17753623188405798
self-consistency	0.322463768115942
```
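Across the prompting strategies, the most noteworthy observation is that the chain-of-thought prompts (both zero-shot and few-shot) underperformed their counterparts without chain-of-thought: 0.109 versus 0.207 for zero-shot, and 0.178 versus 0.355 for few-shot. One possible explanation is that the LLM does not do well when told explicitly how to categorize emotion, or that following the exact procedure outlined in the prompt leads to poor conclusions. Two caveats apply. First, these prompts still ask for only the label and generation is capped at 10 new tokens, so the model has no room to write out any reasoning. Second, in the run recorded above, out-of-label outputs were left out of the prediction list, which shifts later predictions relative to the gold labels in `evaluate`; the chain-of-thought prompts produced the most out-of-label outputs (17 for zero-shot), so their scores are probably depressed by this misalignment. Another pattern is that few-shot prompts, with or without chain-of-thought, scored roughly 1.6 to 1.7 times as high as the corresponding zero-shot prompts. It makes sense that few-shot would perform better, since the prompt contains more context in the form of examples, but this is a large improvement over zero-shot. Self-consistency also clearly improved on plain zero-shot and came close to few-shot.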
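
Because `evaluate` pairs predictions with test items by position, a prediction list that omits items is compared against the wrong gold labels. The sketch below illustrates this on toy data and reports the out-of-label rate alongside accuracy. The helper name `score` and the toy lists are assumptions for illustration, not results from the dataset.

```python
def score(predictions, gold, labels):
    """Accuracy and out-of-label rate for predictions aligned one-to-one with gold."""
    assert len(predictions) == len(gold), "one prediction per test item is required"
    correct = sum(p == g for p, g in zip(predictions, gold))
    invalid = sum(p not in labels for p in predictions)
    return correct / len(gold), invalid / len(gold)

labels = ["trust", "surprise", "fear", "sadness", "joy", "anger", "anticipation", "disgust"]
gold = ["joy", "fear", "sadness", "anger"]          # hypothetical gold labels
aligned = ["neutral", "fear", "sadness", "anger"]   # invalid output kept in place
dropped = ["fear", "sadness", "anger"]              # invalid output removed, list shifts

print(score(aligned, gold, labels))

# zip silently truncates, so the shorter list is matched against the wrong items
shifted_accuracy = sum(p == g for p, g in zip(dropped, gold)) / len(gold)
print(shifted_accuracy)
```

In this toy case the same three correct answers score 0.75 when alignment is preserved and 0.0 when the invalid output is dropped.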

