# Assignment 1 - In-Context Learning

In this assignment, students experiment with in-context learning by selecting and ordering demonstrations to train a large language model at inference time to classify text. In this task, an online store is interested in classifying whether a review describes one or more general topics of interest. The topics are specific to a class of product, in this case vacuum cleaners. Other topics would be relevant to other products.

The dataset has been divided into a development, training and test sets. Students should practice setting up their experiments and writing their prompts using only the development set. Demonstrations for in-context leanring can be drawn from the training set. Final evaluation prior to submission should use the test set.

In [1]:
import random

In [3]:
from openai import OpenAI
client = OpenAI()

def prompt_model(prompt):
    completion = client.chat.completions.create(
        model="gpt-4o",
        store=True,
        messages=[
            {"role": "user", 'content': prompt}
        ]
    )
    return completion.choices[0].message.content

## Open Source Models (Optional)

If students wish to evaluate their solution on open source models, they may use Ollama, if their hardware supports it.

In [5]:
# from ollama import chat
# from ollama import ChatResponse

# def prompt_ollama(prompt):
#     response: ChatResponse = chat(model='llama3.3', messages=[{
#         'role': 'user',
#         'content': prompt,
#       },
#     ])
#     return response['message']['content']

## Load Reviews with Hashtags

The dataset is partitioned into development, training and testing sets. While writing the code to setup your experiments and write your prompts, only use the development set. The training set should be used to sample demonstrations. Only when your code is completed and you are ready to turn in your assignment should you run your experiment on the test set.

In [5]:
import json

data_dev = json.load(open('dataset-dev.json', 'r'))
data_train = json.load(open('dataset-train.json', 'r'))
data_test = json.load(open('dataset-test.json', 'r'))

print('\nDataset Sizes: Dev %i, Train %i, Test %i\n' % (len(data_dev), len(data_train), len(data_test)))

data_dev[0]


Dataset Sizes: Dev 100, Train 100, Test 300



{'text': 'Used the product and was very happy with it until about a month ago. Motor sounded like it was working harder; thought maybe I was imagining things. Look all through hoses and brush roller assembly for any blockages. Today it was not getting good suction; then motor suddenly cut back on output. Barely runs; does not run in upright position. No suction. Bought this as an "inexpensive" replacement to Dyson that died after 5 years. You get what you pay for evidently. Wondering if manufacturer warranty in effect, though I failed to send in the warranty card.',
 'expected': ['#PerformanceAndFunctionality',
  '#ValueForMoneyAndInvestment',
  '#CustomerExperienceAndExpectations'],
 'sentiment': ['N', 'N', 'N']}

## Define the Hashtag List for Prediction

In [7]:
tags = [
    '#DesignAndUsabilityIssues',
    '#PerformanceAndFunctionality',
    '#BatteryAndPowerIssues',
    '#DurabilityAndMaterialConcerns',
    '#MaintenanceAndCleaning',
    '#CustomerExperienceAndExpectations',
    '#ValueForMoneyAndInvestment',
    '#AssemblyAndSetup'
]

tag_list = ' '.join(tags)

## Review the Hashtag Distribution

In general, it is good practice when classifying items to know the distribution of target categories. Categories that are underrepresented, especially in the training data, would lead to underperformance.

In [10]:
# Helper Functions for Demonstrations & Prompt Construction

def sample_demonstrations(data, k, seed=42):
    """
    Randomly sample k demonstration examples from the training data.
    (You can later experiment with different ordering strategies.)
    """
    random.seed(seed)
    return random.sample(data, k)

def format_demo(example):
    """
    Format a single demonstration example.
    Assumes each example has keys "review" and "hashtags".
    """
    review = example.get('review', '')
    hashtags = example.get('hashtags', '')
    if isinstance(hashtags, list):
        hashtags = ' '.join(hashtags)
    return f"Review: {review}\nHashtags: {hashtags}"

def build_demonstrations_text(demos):
    """
    Concatenate formatted demonstration examples into one text block.
    """
    return "\n\n".join([format_demo(ex) for ex in demos])

## Define the Prompt and Experiment

The experiment generally has the following steps: (1) sample the training data to identify k demonstrations for 0 =< k < training set size; (2) construct linearize the demonstrations into text; (3) iterate over the test data and insert the test review and text linearization of the demonstrations into the prompt template; (4) send the prompt to the model and receive the response; (5) validate the response, if the response passes then store the response for later, else if the response fails validation, then save the response to a list of errors. It is generally good to save responses and errors with an index that can be linked back to the test data.

After running the experiment, the evaluation metrics should be computed from the answers and the errors should be inspected. Adjustments to the prompt and/or experiment can be made to reduce the errors, e.g., by post-processing the responses prior to validation.

In [13]:
PROMPT_TEMPLATE = """You are given a review for a vacuum cleaner and a list of allowed hashtag categories.
Allowed Categories: {tag_list}

Here are some examples:
{demonstrations}

Now, classify the following review by outputting the relevant hashtags from the allowed list.
Only output the hashtags (separated by a space) and nothing else.

Review: {review_text}
Hashtags:"""

def construct_prompt(review_text, demos_text):
    """
    Construct the full prompt by filling in the template.
    """
    return PROMPT_TEMPLATE.format(tag_list=tag_list, demonstrations=demos_text, review_text=review_text)

In [15]:
from tiktoken import encoding_for_model

def count_tokens(prompt, model="gpt-4o"):
    enc = encoding_for_model(model)
    return len(enc.encode(prompt))

print("Token Count:", count_tokens(PROMPT_TEMPLATE))

Token Count: 75


In [17]:
# Response Processing and Validation
def process_response(response):
    """
    Post-process the response:
      - Split the response into tokens.
      - Filter tokens so that only allowed tags are retained (to counter hallucinations).
    """
    tokens = response.split()
    formatted_tokens = [f"#{token}" if not token.startswith("#") else token for token in tokens]
    valid_tags = [token for token in formatted_tokens if token in tags]
    return valid_tags

In [19]:
# Run the Experiment
def run_experiment(data, demos, debug=False, debug_samples=5):
    """
    Run the experiment on a dataset:
    1. Build demonstration text from the provided demonstrations.
    2. For each record in the dataset:
       - Construct the prompt with the review and demonstrations.
       - Query the model.
       - Process and validate the response.
    3. Return predictions and a log of errors.
    
    If `debug=True`, only processes `debug_samples` reviews to inspect outputs.
    """
    demos_text = build_demonstrations_text(demos)
    predictions = {}  # Store predicted hashtags keyed by record index.
    errors = {}       # Store any responses that failed validation.
    
    # Limit to debug_samples if debugging
    dataset_size = debug_samples if debug else len(data)
    
    for i, record in enumerate(data[:dataset_size]):
        review_text = record.get('text', '')
        prompt = construct_prompt(review_text, demos_text)
        
        try:
            response = prompt_model(prompt)  # Get raw response from the model
            
            # print(f"\n### Example {i} ###")
            # print(f"Review: {review_text}")
            # print(f"Raw Model Output: {response}")  # Show unprocessed response
            
            processed = process_response(response)
            # print(f"Processed Hashtags: {processed}")  # Show cleaned tags
            
            # If no valid hashtags are returned, record it as an error.
            if not processed:
                errors[i] = response
            predictions[i] = processed
        
        except Exception as e:
            errors[i] = str(e)
            predictions[i] = []
    
    return predictions, errors

## Evaluate the Experimental Results

The evaluation metrics include precision, recall and F1 score. For the total number of true positives (tp), false positives (fp) and false negatives (fn), these calculations should be used to report results:
* Precision = tp / (tp + fp)
* Recall = tp / (tp + fn)
* F1 = 2tp / (2tp + fp + fn)

In [22]:
from sklearn.metrics import precision_score, recall_score, f1_score

def compute_metrics(data, predictions):
    """
    Compute precision, recall, and F1-score using sklearn.
    """
    all_ground_truths = []
    all_predictions = []
    
    for i, record in enumerate(data):
        ground_truth = record.get('expected', [])  # True labels
        pred = predictions.get(i, [])  # Model predictions
        
        # Convert sets to binary format for multi-label classification
        y_true = [1 if tag in ground_truth else 0 for tag in tags]
        y_pred = [1 if tag in pred else 0 for tag in tags]

        all_ground_truths.append(y_true)
        all_predictions.append(y_pred)

    # Compute metrics (averaged across all samples)
    precision = precision_score(all_ground_truths, all_predictions, average='micro', zero_division=0)
    recall = recall_score(all_ground_truths, all_predictions, average='micro', zero_division=0)
    f1 = f1_score(all_ground_truths, all_predictions, average='micro', zero_division=0)

    return precision, recall, f1

In [37]:
demo_counts = [16,32,8,24]
metrics_by_k = {}

print("\nRunning experiments on the development set for different values of k (demonstration count):\n")

for k in demo_counts:
    # Sample k demonstrations from the training set.
    demos = sample_demonstrations(data_train, k, seed=42)
    
    # Run the experiment on the development set using these demonstrations.
    dev_predictions, dev_errors = run_experiment(data_dev, demos)
    
    # Compute evaluation metrics.
    precision, recall, f1 = compute_metrics(data_dev, dev_predictions)
    
    # Store the metrics and number of errors for this k.
    metrics_by_k[k] = {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "num_errors": len(dev_errors)
    }
    
    # Print the results.
    print(f"Demonstration count (k) = {k}:")
    print(f"  Precision: {precision:.3f}")
    print(f"  Recall:    {recall:.3f}")
    print(f"  F1 Score:  {f1:.3f}")
    print(f"  Number of error responses: {len(dev_errors)}\n")

print("Metrics by demonstration count test:", metrics_by_k)


Running experiments on the development set for different values of k (demonstration count):

Demonstration count (k) = 16:
  Precision: 0.691
  Recall:    0.918
  F1 Score:  0.788
  Number of error responses: 0

Demonstration count (k) = 32:
  Precision: 0.711
  Recall:    0.909
  F1 Score:  0.798
  Number of error responses: 0

Demonstration count (k) = 8:
  Precision: 0.679
  Recall:    0.909
  F1 Score:  0.777
  Number of error responses: 0

Demonstration count (k) = 24:
  Precision: 0.689
  Recall:    0.881
  F1 Score:  0.774
  Number of error responses: 0

Metrics by demonstration count test: {16: {'precision': 0.6907216494845361, 'recall': 0.9178082191780822, 'f1': 0.788235294117647, 'num_errors': 0}, 32: {'precision': 0.7107142857142857, 'recall': 0.908675799086758, 'f1': 0.7975951903807615, 'num_errors': 0}, 8: {'precision': 0.6791808873720137, 'recall': 0.908675799086758, 'f1': 0.77734375, 'num_errors': 0}, 24: {'precision': 0.6892857142857143, 'recall': 0.8812785388127854, '

In [24]:
selected_k = 32 
seed = 42
print(f"\nRunning final evaluation on the test set with k = {selected_k} demonstrations.")
final_demos = sample_demonstrations(data_dev, selected_k, seed=seed)
test_predictions, test_errors = run_experiment(data_test, final_demos)

# Save test predictions to results.json
results = []
for i, record in enumerate(data_test):
    record_copy = record.copy()
    record_copy['predicted'] = test_predictions.get(i, [])
    results.append(record_copy)

with open('results.json', 'w') as f:
    json.dump(results, f, indent=2)

print("Final results saved to results.json")


Running final evaluation on the test set with k = 32 demonstrations.
Final results saved to results.json


### Evaluation metric for test data

In [48]:
import json

# Define different k values to test
demo_counts = [8, 16, 24, 32]
metrics_by_k = {}

print("\nRunning experiments on the test set for different values of k (demonstration count):\n")

for k in demo_counts:
    # Sample k demonstrations from the training set
    demos = sample_demonstrations(data_train, k, seed=42)

    # Run the experiment on the test set
    test_predictions, test_errors = run_experiment(data_test, demos)

    # Compute evaluation metrics on the test set
    precision, recall, f1 = compute_metrics(data_test, test_predictions)

    # Store results
    metrics_by_k[k] = {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "num_errors": len(test_errors)
    }

    # Print results
    print(f"Demonstration count (k) = {k}:")
    print(f"  Precision: {precision:.3f}")
    print(f"  Recall:    {recall:.3f}")
    print(f"  F1 Score:  {f1:.3f}")
    print(f"  Number of error responses: {len(test_errors)}\n")


Running experiments on the test set for different values of k (demonstration count):

Demonstration count (k) = 8:
  Precision: 0.698
  Recall:    0.923
  F1 Score:  0.795
  Number of error responses: 0

Demonstration count (k) = 16:
  Precision: 0.679
  Recall:    0.929
  F1 Score:  0.785
  Number of error responses: 0

Demonstration count (k) = 24:
  Precision: 0.696
  Recall:    0.926
  F1 Score:  0.795
  Number of error responses: 0

Demonstration count (k) = 32:
  Precision: 0.701
  Recall:    0.918
  F1 Score:  0.795
  Number of error responses: 0



### Hallucination Check on the Test Data

In [43]:
import json

def validate_categories(predictions, allowed_tags):
    validated_results = []
    hallucinated_results = []
    
    for item in predictions:
        predicted_tags = set(item["predicted"])  # No need to split, it's already a list
        valid_tags = [tag for tag in predicted_tags if tag in allowed_tags]
        
        if predicted_tags - set(valid_tags):  # If hallucinated tags exist
            hallucinated_results.append({
                "review_id": item["text"],  # Use review text instead of ID since it's missing
                "invalid_tags": list(predicted_tags - set(valid_tags))
            })
        
        validated_results.append({"text": item["text"], "predicted": valid_tags})  # Keep list format

    return validated_results, hallucinated_results

In [45]:
# Convert tags list to a set for validation
allowed_tags = set(tags)

# Validate predictions directly from `results`
validated_results, hallucinations = validate_categories(results, allowed_tags)

In [47]:
# Save the cleaned results
with open("validated_results.json", "w") as f:
    json.dump(validated_results, f, indent=4)

# Output hallucination findings
if hallucinations:
    print("Hallucinated tags found:", hallucinations)
else:
    print("No hallucinations detected. ✅")

No hallucinations detected. ✅
