### Section 2.2.3 Sentiment Classification via Prompting LLMs (20')

In [2]:
!pip install torch numpy nltk datasets matplotlib pydantic transformers outlines scikit-learn



In [3]:
from huggingface_hub import notebook_login

notebook_login()



#### Task 2.2.3.3 Sentiment Classifier with Explanation (10')

Still using pydantic and outlines, design a program that (1) first thinks step by step to generate an explanation for the classification, and (2) then outputs the label.

For example, given the input sentence "I loved the soundtrack but the story was weak.", your prompt could look like this:
```Analyze the sentiment of this sentence with reasoning steps first, and then classify it into 5 categories: very positive, positive, neutral, negative, very negative.\n```

Your model should produce a structured output with the following two parts, formatted as JSON:

{"reason": "The sentence expresses both positive sentiment towards the soundtrack and a slight criticism of the story.", "label": "neutral"}
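One possible way to encode this two-part output is a Pydantic model whose fields are declared in the order they should be generated: `reason` first, then `label`. The sketch below only inspects the resulting JSON schema. The class name `SentimentWithReason` is hypothetical, and the idea that a schema-constrained generator emits fields in declaration order is an assumption worth checking against the tool you use.

```python
from typing import Literal

from pydantic import BaseModel


class SentimentWithReason(BaseModel):  # hypothetical name, for illustration only
    # Declared first so the explanation precedes the label in the schema.
    reason: str
    # Literal restricts the label to the five allowed categories.
    label: Literal["very positive", "positive", "neutral", "negative", "very negative"]


# The JSON schema is what a structured-generation library would be constrained by.
schema = SentimentWithReason.model_json_schema()
field_order = list(schema["properties"].keys())
allowed_labels = schema["properties"]["label"]["enum"]

print(field_order)
print(allowed_labels)
```

Putting `reason` before `label` is a design choice: the label is then generated after the reasoning text, so it can depend on that reasoning.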

Hint:
* This is an open design question, so feel free to use any prompt or tools, such as LangChain or LlamaIndex. We suggest pydantic and outlines, but these tools all work similarly.
* Please read more examples of complex structures that support both the reasoning and the label. https://github.com/dottxt-ai/outlines
* Use model_validate_json in pydantic. https://docs.pydantic.dev/latest/concepts/models/#basic-model-usage
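As a minimal sketch of the last hint, the snippet below parses a JSON string with `model_validate_json` and shows that a label outside the five categories is rejected. The class name `SentimentWithReason` and the second JSON string are made up for illustration; the first string is the example output from the task description.

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class SentimentWithReason(BaseModel):  # hypothetical name, for illustration only
    reason: str
    label: Literal["very positive", "positive", "neutral", "negative", "very negative"]


# A well-formed response: parsed into a typed object with attribute access.
raw = (
    '{"reason": "The sentence expresses both positive sentiment towards the '
    'soundtrack and a slight criticism of the story.", "label": "neutral"}'
)
parsed = SentimentWithReason.model_validate_json(raw)
print(parsed.label)

# An invented response with a label that is not one of the five categories.
bad = '{"reason": "Mixed feelings.", "label": "mixed"}'
try:
    SentimentWithReason.model_validate_json(bad)
    rejected = False
except ValidationError:
    # Validation fails because "mixed" is not an allowed Literal value.
    rejected = True
print(rejected)
```

Catching `ValidationError` specifically, rather than a bare `Exception`, makes it easier to tell malformed model output apart from unrelated runtime errors.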

In [6]:
import os
import time
import json
import csv
import random
from typing import Literal
from pydantic import BaseModel, Field
from transformers import AutoModelForCausalLM, AutoTokenizer
import outlines
from datasets import load_dataset
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import numpy as np
import torch

# 1. Setup: Environment, Model, Tokenizer
device = "cuda" if torch.cuda.is_available() else "cpu"

os.environ["TOKENIZERS_PARALLELISM"] = "false"
trans_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-1b-it",
    dtype=torch.bfloat16 if device == "cuda" else torch.float32,
).to(device)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")

# Wrap the model with outlines to generate structured output
model = outlines.from_transformers(trans_model, tokenizer)


# 2. Define the desired structured output using Pydantic
class SentimentAnalysis(BaseModel):
    reason: str = Field()
    label: Literal["very positive", "positive", "neutral", "negative", "very negative"]


# 3. Load the dataset and get examples for few-shot prompting
sst_dataset = load_dataset("SetFit/sst5")
train_dataset = sst_dataset["train"]
test_dataset = sst_dataset["test"]


def get_random_examples():
    """
    Randomly select one example from each sentiment category from the training set.
    Returns a dictionary with examples for each label.
    """
    examples = {
        "very negative": [],
        "negative": [],
        "neutral": [],
        "positive": [],
        "very positive": [],
    }

    # Collect all examples for each category
    for example in train_dataset:
        label_text = example["label_text"]
        text = example["text"]
        examples[label_text].append(text)

    # Randomly select one example from each category
    random_examples = {}
    for label, texts in examples.items():
        random_examples[label] = random.choice(texts)

    return random_examples


# Get random examples for the prompt
few_shot_examples = get_random_examples()

# 4. Create the prompt template with few-shot examples
messages = [
    [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are a helpful assistant that analyzes sentiment.",
                },
            ],
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"""Analyze the sentiment of movie reviews using these examples:

Very positive: "{few_shot_examples['very positive']}"
Positive: "{few_shot_examples['positive']}"
Neutral: "{few_shot_examples['neutral']}"
Negative: "{few_shot_examples['negative']}"
Very negative: "{few_shot_examples['very negative']}"

Now classify this sentence into one of 5 categories (very positive, positive, neutral, negative, very negative) and explain your reasoning:

Sentence: {{{{ text }}}}
""",
                },
            ],
        },
    ],
]

prompt_template = tokenizer.apply_chat_template(messages, tokenize=False)[0]
outline_template = outlines.Template.from_string(prompt_template)

print("Few-shot examples used in the prompt:")
for label, example in few_shot_examples.items():
    print(f"{label}: {example[:100]}...")  # Print first 100 chars
print("-" * 50)


def run_inference_with_explanation():
    """
    Runs inference on the test dataset, generating a sentiment analysis
    with a reason for each example.
    """
    predictions = []
    ground_truths = []
    results = []

    i = 0

    # Iterate over the full test set.
    # For a faster demonstration, use a subset, e.g. test_dataset.select(range(50)).
    for example in test_dataset:
        text = example["text"]
        true_label_name = example["label_text"]
        i += 1
        if i % 50 == 0:
            print(f"Processed {i} examples so far.")
        try:
            # Generate structured output using outlines with Pydantic model
            json_response = model(
                outline_template(text=text),
                SentimentAnalysis,
                max_new_tokens=200,  # Increased tokens for complete reasoning
            )

            # Parse the JSON response into a Pydantic object
            structured_response = SentimentAnalysis.model_validate_json(json_response)

            # Access attributes directly from the Pydantic object
            predictions.append(structured_response.label)
            ground_truths.append(true_label_name)

            # Store results for CSV output
            results.append(
                {
                    "text": text,
                    "true_label": true_label_name,
                    "predicted_label": structured_response.label,
                    "reason": structured_response.reason,
                    "correct": true_label_name == structured_response.label,
                }
            )
            if i < 10:
                print(f"Example {i + 1}:")
                print(f"Text: {text}")
                print(f"True Label: {true_label_name}")
                print(f"Predicted Label: {structured_response.label}")
                print(f"Reason: {structured_response.reason}")
                print("-" * 30)

        except Exception as e:
            # If validation fails, log the error and continue with the next example
            print(f"ERROR processing text: {text}")
            print(f"Error details: {str(e)}")
            print("Skipping this example and continuing...")
            print("-" * 30)
            continue

    return predictions, ground_truths, results


def save_results_to_csv(results, filename="llm_sst_results.csv"):
    """
    Save the inference results to a CSV file using pipe separator.
    """
    with open(filename, "w", newline="", encoding="utf-8") as csvfile:
        fieldnames = ["text", "true_label", "predicted_label", "reason", "correct"]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, delimiter="|")

        writer.writeheader()
        for result in results:
            writer.writerow(result)

    print(f"\nResults saved to {filename}")


def main():

    start_time = time.time()
    predictions, ground_truths, results = run_inference_with_explanation()
    end_time = time.time()

    print(f"\nTotal inference time: {end_time - start_time:.2f} seconds")

    # Save results to CSV
    save_results_to_csv(results)

    # Define label order for consistent matrix display
    labels = ["very negative", "negative", "neutral", "positive", "very positive"]

    print("\nClassification Report:")
    print(
        classification_report(
            ground_truths,
            predictions,
            labels=labels,
            zero_division=0,
        )
    )

    print("\nAccuracy Score:")
    print(accuracy_score(ground_truths, predictions))

    print("\nConfusion Matrix:")
    cm = confusion_matrix(ground_truths, predictions, labels=labels)
    print("Labels order: very negative, negative, neutral, positive, very positive")
    print("Rows = True labels, Columns = Predicted labels")
    print(cm)

if __name__ == "__main__":
    # Check GPU info if available
    if torch.cuda.is_available():
        print(f"GPU: {torch.cuda.get_device_name(0)}")
        print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
        print(f"CUDA Version: {torch.version.cuda}")
    main()


Repo card metadata block was not found. Setting CardData to empty.


Few-shot examples used in the prompt:
very negative: the whole thing feels like a ruse , a tactic to cover up the fact that the picture is constructed ar...
negative: neither as scary-funny as tremors nor demented-funny as starship troopers , the movie is n't tough t...
neutral: it 's another retelling of alexandre dumas ' classic ....
positive: one feels the dimming of a certain ambition , but in its place a sweetness , clarity and emotional o...
very positive: brings an irresistible blend of warmth and humor and a consistent embracing humanity in the face of ...
--------------------------------------------------
GPU: NVIDIA L4
GPU Memory: 23.8 GB
CUDA Version: 12.6
Example 2:
Text: no movement , no yuks , not much of anything .
True Label: negative
Predicted Label: negative
Reason: The sentence expresses a feeling of emptiness and lack of engagement, indicating a negative sentiment.
------------------------------
Example 3:
Text: a gob of drivel so sickly sweet , even the eager consu