### Section 2.2.3 Sentiment Classification via Prompting LLMs (20')

In [None]:
!pip install torch numpy nltk datasets matplotlib pydantic transformers outlines scikit-learn

In this section, you will write a program that uses `AutoModelForCausalLM` to load the pretrained model, prompts Google's "gemma-3-1b-it" for the SST-5 classification task, and reports the performance on the test set.

Hint:

* Create a [Hugging Face account](https://huggingface.co).
* Accept Google terms on the [gemma-3-1b-it model page](https://huggingface.co/google/gemma-3-1b-it).  
* Generate an [Access Token](https://huggingface.co/settings/tokens) with *read* permission.  
* Set your Access Token as an environment variable (`export HF_TOKEN`) rather than writing it in your code; see https://medium.com/@oadaramola/a-pitfall-i-almost-fell-into-d1d3461b2fb8
* Read through the pretrained models tutorial at https://community.ibm.com/community/user/blogs/ruslan-idelfonso-magaa-vsevolodovna/2023/10/05/how-to-work-with-pretrained-models-with-transforme
* Read the structured-output tutorial (https://huggingface.co/docs/inference-providers/en/guides/structured-output). Then, for each sentence in the SST-5 test set, prompt "gemma-3-1b-it" to produce one of the five labels.
* If the job will not run on your laptop after you have tried your best, submit it to OSCER.
* Report the performance and compare with the previous models.
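
The token hint above can be sketched as follows. The helper name `get_hf_token` is hypothetical; it only reads the `HF_TOKEN` environment variable and fails with a clear message when the variable is missing. Passing the result to `huggingface_hub.login` is left as a comment because it needs network access.

```python
import os


def get_hf_token(env=os.environ):
    """Return the Hugging Face token stored in HF_TOKEN (hypothetical helper)."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. Export it in your shell before starting "
            "Python; do not hard-code the token in the notebook."
        )
    return token


# Typical use (requires network access and the huggingface_hub package):
# from huggingface_hub import login
# login(token=get_hf_token())
```

Keeping the token out of the notebook means it cannot leak when the notebook is shared or committed.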

In [2]:
from huggingface_hub import notebook_login

notebook_login()

#### Task 2.2.3.1 Sentiment Classifier without Constraint (5')

In [3]:
# use https://huggingface.co/google/gemma-3-1b-it to infer on the SST-5 dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import BitsAndBytesConfig
# Optional 8-bit quantization; ignore this if you don't have a GPU.
# quantization_config = BitsAndBytesConfig(load_in_8bit=True)
# model = AutoModelForCausalLM.from_pretrained("google/gemma-3-1b-it", quantization_config=quantization_config).eval()

model = AutoModelForCausalLM.from_pretrained("google/gemma-3-1b-it").eval()
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")
inputs = tokenizer("The move is nice. The sentiment is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=5, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])


The move is nice. The sentiment is very sweet.



The output above is open-ended: it is not restricted to the five labels in SST-5.
Even if you limit generation to 2-3 tokens, the result can still fall outside the label set.
Of course, you could give more guidance in the prompt, as follows:

In [4]:
inputs = tokenizer("The move is so great. Which is the sentiment among very positive, positive, neutral, negative, very negative.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])

The move is so great. Which is the sentiment among very positive, positive, neutral, negative, very negative.
I think the sentiment is mostly positive.




However, the output still requires some parsing, and its wording is unpredictable. Hence, in the next step, we will introduce pydantic and outlines to help you build your prompt and parse your results.
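
To make the parsing problem concrete, here is a minimal sketch of the keyword matching that free-form output would need. `parse_label` and `LABELS` are hypothetical names for illustration, not part of the assignment code.

```python
# Labels ordered so "very positive" is tried before "positive".
LABELS = ["very positive", "very negative", "positive", "negative", "neutral"]


def parse_label(generated: str):
    """Return the first SST-5 label found in free-form text, or None."""
    lowered = generated.lower()
    for label in LABELS:
        if label in lowered:
            return label
    return None


print(parse_label("I think the sentiment is mostly positive."))  # positive
print(parse_label("very sweet."))  # None
```

This kind of matching is brittle: a completion such as "not negative at all" would be mapped to `negative`, and a completion with no label word cannot be mapped at all.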

#### Task 2.2.3.2 Chat Format, Pydantic and Outlines (5')

Different LLMs use different conventions to support the chat format, such as adding special tokens or special separators between messages.
For example, Llama 3 uses the following format:
```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
Write a haiku about autumn.<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
```

OpenAI's ChatML format, in contrast, uses the `<|im_start|>` and `<|im_end|>` markers, while Mistral's instruct models wrap user turns in `[INST]` and `[/INST]`.
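
As a rough illustration of what a chat template does, the sketch below renders a message list into the simplified Llama 3 layout shown above. The function `render_llama3_style` is hypothetical, and real templates differ in whitespace and other details, so in practice you should rely on the tokenizer's own template rather than hand-written string formatting.

```python
def render_llama3_style(messages):
    """Render chat messages in the simplified Llama 3 layout shown above."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n"
            f"{message['content']}<|eot_id|>\n"
        )
    # Open the assistant turn so the model knows it should answer next.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n")
    return "".join(parts)


chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a haiku about autumn."},
]
rendered = render_llama3_style(chat)
print(rendered)
```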

You can use the Hugging Face `tokenizer.apply_chat_template` method to inspect the details.
The following example shows the format used by Gemma3-1b, which is different again.


More tutorials on pydantic and outlines are available at https://github.com/dottxt-ai/outlines

In [5]:
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
from typing import Literal
import outlines

import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
trans_model = AutoModelForCausalLM.from_pretrained("google/gemma-3-1b-it")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")

# use the outlines to wrap the model
# https://dottxt-ai.github.io/outlines/latest/features/models/transformers/
model = outlines.from_transformers(
    trans_model, tokenizer
)

# Define the output categories using Literal
SentimentCategory = Literal[
    "very positive",
    "positive",
    "neutral",
    "negative",
    "very negative"
]

# define the general chat messages template
messages = [
    [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant."},]
        },
        {
            "role": "user",
            "content": [{"type": "text", "text": "Classify this sentence into 5 categories: very positive, positive, neutral, negative, very negative.\n Text: {{ text }}"},]
        },
    ],
]

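# Different LLMs use different conventions to support the chat format above,
# such as adding special tokens or special separators between messages.
# Note: passing add_generation_prompt=True to apply_chat_template would also
# append the header that opens the model's turn.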

prompt_template = tokenizer.apply_chat_template(messages, tokenize=False)[0]
print(prompt_template)
outline_template = outlines.Template.from_string(prompt_template)

texts = [
    "I loved the soundtrack but the story was weak.",  # comma added: the two strings were silently concatenated
    "The movie is not great, but it has some good moments.",
]

sentiment_rst = model(
    outline_template(text=texts[0]),
    SentimentCategory, max_new_tokens=10
    )

print(sentiment_rst)


<bos><start_of_turn>user
You are a helpful assistant.

Classify this sentence into 5 categories: very positive, positive, neutral, negative, very negative.
 Text: {{ text }}<end_of_turn>

negative
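
The `Literal` type in the cell above is what defines the permitted outputs. As a small sketch using only the standard library, `typing.get_args` recovers the label tuple from that type, which can be handy for sanity-checking predictions; `ALLOWED_LABELS` and `is_valid_label` are hypothetical names.

```python
from typing import Literal, get_args

SentimentCategory = Literal[
    "very positive", "positive", "neutral", "negative", "very negative"
]

# get_args returns the literal values in declaration order.
ALLOWED_LABELS = get_args(SentimentCategory)


def is_valid_label(prediction: str) -> bool:
    """Check that a prediction is one of the five permitted labels."""
    return prediction in ALLOWED_LABELS


print(ALLOWED_LABELS)
print(is_valid_label("negative"), is_valid_label("mostly positive"))
```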


* Please adapt the code and prompt above to run over the whole SST-5 test set.
* Open exploration: use batch processing to speed up inference. (optional)
* If your laptop cannot run inference with this 1B model, write it as an sbatch job and run it on OSCER.
* You are encouraged to try larger models, or other models that support this kind of structured output.
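
For the optional batching exploration, one possible starting point is to split the inputs into fixed-size chunks and hand each chunk to a classifier. This is a sketch under assumptions: `texts` is a plain list of strings, and `classify_batch` is a hypothetical callable that takes a list of texts and returns a list of labels. How the wrapped model accepts batched prompts is not shown here and should be checked against the Outlines documentation.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def classify_in_batches(texts, classify_batch, batch_size=16):
    """Apply `classify_batch` (a hypothetical callable) to each batch of texts."""
    predictions = []
    for batch in chunked(texts, batch_size):
        predictions.extend(classify_batch(batch))
    return predictions


# Stand-in classifier so the sketch runs without a model.
demo = classify_in_batches(["a", "b", "c"], lambda batch: ["neutral"] * len(batch), 2)
print(demo)
```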

**Using Gemma-3-1b for SST-5**

In [6]:
import torch
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
from typing import Literal
import outlines
from datasets import load_dataset
import json
import time
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"
trans_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-1b-it",
    device_map="auto",
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")
tokenizer.pad_token_id = tokenizer.eos_token_id
trans_model.config.pad_token_id = tokenizer.eos_token_id
trans_model.generation_config.pad_token_id = tokenizer.eos_token_id


# use the outlines to wrap the model
# https://dottxt-ai.github.io/outlines/latest/features/models/transformers/
model = outlines.from_transformers(
    trans_model, tokenizer
)

# Define the output categories using Literal
SentimentCategory = Literal[
    "very positive",
    "positive",
    "neutral",
    "negative",
    "very negative"
]

# define the general chat messages template
messages = [
    [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant."},]
        },
        {
            "role": "user",
            "content": [{"type": "text", "text": "Classify this sentence into 5 categories: very positive, positive, neutral, negative, very negative.\n Text: {{ text }}"},]
        },
    ],
]

prompt_template = tokenizer.apply_chat_template(messages, tokenize=False)[0]
outline_template = outlines.Template.from_string(prompt_template)

# Load SST-5 dataset
sst_dataset = load_dataset("SetFit/sst5")
test_dataset = sst_dataset["test"]


In [7]:
def run_inference():
    """Classify every test example and return (predictions, ground_truths)."""
    predictions = []
    ground_truths = []
    for i, example in enumerate(test_dataset):
        text = example["text"]
        true_label_name = example["label_text"]

        # Constrained generation: the output must be one of the five labels.
        predicted_label = model(
            outline_template(text=text), SentimentCategory, max_new_tokens=10
        )

        predictions.append(predicted_label)
        ground_truths.append(true_label_name)
        if i % 10 == 0:
            print(f"Processed {i} examples")

    return predictions, ground_truths

In [8]:
# Check GPU info if available
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    print(f"CUDA Version: {torch.version.cuda}")

def main():
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")

    print(f"Starting inference")
    start_time = time.time()
    predictions, ground_truths = run_inference()
    end_time = time.time()

    print(f"\n{'='*50}")
    print(f"INFERENCE COMPLETED")
    print(f"{'='*50}")
    print(f"Total examples processed: {len(predictions)}")
    print(f"Total inference time: {end_time - start_time:.2f} seconds")
    print(
        f"Average time per example: {(end_time - start_time)/len(predictions):.3f} seconds"
    )
    print(f"Examples per second: {len(predictions)/(end_time - start_time):.2f}")

    print(f"\n{'='*50}")
    print("PERFORMANCE METRICS")
    print(f"{'='*50}")

    # Define label order for consistent matrix display
    labels = ["very negative", "negative", "neutral", "positive", "very positive"]

    print("\nClassification Report:")
    print(
        classification_report(
            ground_truths,
            predictions,
            labels=labels,
            zero_division=0,
        )
    )

    print(f"\nAccuracy Score: {accuracy_score(ground_truths, predictions):.4f}")



    print("\nConfusion Matrix:")
    cm = confusion_matrix(ground_truths, predictions, labels=labels)
    print("Labels order: very negative, negative, neutral, positive, very positive")
    print("Rows = True labels, Columns = Predicted labels")
    print(cm)  # the matrix was computed but never displayed

# Call the main function
if __name__ == "__main__":
    main()

GPU: Tesla T4
GPU Memory: 15.8 GB
CUDA Version: 12.6
Using device: cuda
Starting inference
Processed 0 examples
Processed 10 examples
Processed 20 examples
...
Processed 390 examples
