# Zero-shot text classification with LLMs

This notebook illustrates how to use `pydantic` and JSON schemas to generate **structured outputs** with different LLMs for text classification.

- closed-source LLMs by OpenAI
- open-weights models hosted via Hugging Face Inference Providers/Endpoints
- open-weights LLMs run locally with `ollama`

## Setup

In [None]:
from pathlib import Path
import pandas as pd
from src.utils.io import read_tabular
import re

from tqdm.notebook import tqdm
from sklearn.metrics import classification_report

from pydantic import BaseModel, Field
from typing import List, Optional, Literal

### Load data

In [None]:
COLAB = False # no support for colab yet
base_path = Path("/content/advanced_text_analysis/" if COLAB else "../../")
data_path = base_path / "data" / "labeled" / "benoit_crowdsourced_2016"

In [None]:
## (down)load the data
fp = data_path / "benoit_crowdsourced_2016-policy_area.csv"
if not fp.exists():
    url = "https://cta-text-datasets.s3.eu-central-1.amazonaws.com/labeled/" + fp.parent.name + '/' + fp.name
    df = pd.read_csv(url)
    fp.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(fp, index=False)

df = read_tabular(fp, columns=['uid', 'text', 'label', 'metadata__gold'])

In [None]:
# subset to gold examples (i.e., those labeled by experts)
df = df[df.metadata__gold]
del df['metadata__gold']

In [None]:
id2label = {
    2: 'economic',
    3: 'social',
    1: 'neither',
}
df.label = df.label.map(id2label)

print(df.label.value_counts())

In [None]:
# get 20 examples per label class
expls = df.groupby('label').sample(20, random_state=42)

In [None]:
texts = expls.text.to_list()

## Define the task

In this example, we adapt the instructions for one of the sentence classification tasks examined in Benoit et al. ([2016](https://doi.org/10.1017/S0003055416000058)), "Crowd-sourced Text Analysis: Reproducible and Agile Production of Political Data":

- see [this README file](../../data/labeled/benoit_crowdsourced_2016/README.md) for a description of the data and tasks covered in the paper
- see [this file](../../data/labeled/benoit_crowdsourced_2016/instructions/econ_social_policy.md) for a copy of their original task instructions

#### Using structured outputs

Let's say that, in addition to the classification, we also want an **explanation/justification**.
Then the output becomes more **complex** than a single label.

In this scenario, it is **best practice** to define an output class that can be passed to the LLM so that it generates an adequately formatted response.

This is done by defining a class that represents the desired structure of the output.
This information is then passed as a JSON schema to the LLM and applied as a decoding constraint when generating next tokens.

In [None]:
# create a pydantic model for the response

# the `ClassificationResponse` response model defines the structure of the expected output from the LLM.
# It includes two fields: `category`, which can take one of three literal values ('economic', 'social', 'neither'),
# and `explanation`, which is a string providing a one-sentence justification for the classification.
class ClassificationResponse(BaseModel):
    category: Literal['economic', 'social', 'neither'] = Field(..., description="The category assigned to the sentence")
    explanation: str = Field(..., description="A one-sentence justification for the classification")

# we can get and look at the corresponding JSON schema for the response model
json_schema = ClassificationResponse.model_json_schema()
print(json_schema)
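The same class can also be used to validate a raw JSON string after generation. The following is a minimal sketch of that idea, assuming pydantic v2; the helper `parse_response` and the two example strings are hypothetical and only serve as illustration.

```python
from typing import Literal, Optional

from pydantic import BaseModel, Field, ValidationError


# same response model as defined above, repeated so this sketch is self-contained
class ClassificationResponse(BaseModel):
    category: Literal['economic', 'social', 'neither'] = Field(..., description="The category assigned to the sentence")
    explanation: str = Field(..., description="A one-sentence justification for the classification")


def parse_response(raw: str) -> Optional[ClassificationResponse]:
    """Hypothetical helper: return the validated response, or None if `raw` violates the schema."""
    try:
        return ClassificationResponse.model_validate_json(raw)
    except ValidationError:
        return None


# a string that conforms to the schema
valid = parse_response('{"category": "economic", "explanation": "Mentions taxation."}')
# a string with a category that is not among the allowed literals
invalid = parse_response('{"category": "fiscal", "explanation": "Mentions taxation."}')

print(valid)
print(invalid)
```

Validation like this only checks a finished response; passing the schema to the LLM, as done below, is meant to keep malformed responses from being generated in the first place.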

*Note:* Adding a post-hoc explanation is only meant as an illustration. See [this notebook](./structured_generation_outlines_demo.ipynb) for more and better examples.

**Important:** We now also need to instruct the model to add an explanation and to return JSON (instead of only the chosen category).

In [None]:
instructions = f"""
Act as a text classification system versatile in performing content analysis.

You will read a sentence from a political text.
You will judge whether this sentence deals with economic or social policy.
You must classify the sentence into one of the following categories: "economic", "social", or "neither".

## Definitions

These categories have the following definitions:

- Sentences should be coded as "economic" if they deal with aspects of the economy, such as: Taxation; Government spending; Services provided by the government or other public bodies; Pensions, unemployment and welfare benefits, and other state benefits; Property, investment and share ownership, public or private; Interest rates and exchange rates; Regulation of economic activity, public or private; Relations between employers, workers and trade unions
- Sentences should be coded as "social" if they deal with aspects of social and moral life, relationships between social groups, and matters of national and social identity. These include: Policing, crime, punishment and rehabilitation of offenders; Immigration, relations between social groups, discrimination and multiculturalism; The role of the state in regulating the social and moral behavior of individuals

## Step-by-step instructions

Follow these steps to classify the sentence:

1. Carefully read the text of the sentence, paying close attention to details.
2. Assess whether the sentence belongs to any of the categories. If not, return 'neither' as your response.
3. Classify the sentence with the category it belongs to. 
4. Add a one-sentence justification for your classification.

## Response format

Return your response as a JSON dictionary with the following fields:
- "category": the category you assigned to the sentence, one of "economic", "social", or "neither"
- "explanation": a one-sentence justification for your classification
"""

## With ChatGPT

In [None]:
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
MODEL = 'gpt-4o-2024-08-06'

#### illustration without passing a JSON schema

In [None]:
text = df.text.iloc[5]
print(text)

messages = [
    # system prompt
    {"role": "system", "content": instructions},
    # user input
    {"role": "user", "content": text},
]

response = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    temperature=0.001,
    seed=42
)

In [None]:
print(response.choices[0].message.content)

This is not what we want. We only need the JSON dictionary, without the surrounding ``` code fences.
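One workaround would be to strip the fence manually before parsing. The sketch below shows this with the standard library only; `strip_code_fence` is a hypothetical helper, and the `raw` string is a made-up stand-in for a model response. Passing the response model to the API, as in the next cell, avoids this kind of post-processing.

```python
import json
import re

FENCE = "`" * 3  # the Markdown code-fence marker
# optional "json" language tag after the opening fence, content captured in group 1
FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"(?:json)?\s*\n(.*?)\n\s*" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_code_fence(content: str) -> str:
    """Hypothetical helper: return the text inside a surrounding code fence, if there is one."""
    match = FENCE_RE.match(content)
    return match.group(1) if match else content.strip()


# made-up example of a fenced model response
raw = FENCE + 'json\n{"category": "social", "explanation": "Deals with policing."}\n' + FENCE
parsed = json.loads(strip_code_fence(raw))
print(parsed)
```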

In [None]:
text = df.text.iloc[5]
print(text)

messages = [
    # system prompt
    {"role": "system", "content": instructions},
    # user input
    {"role": "user", "content": text},
]

response = client.chat.completions.parse(
    model=MODEL,
    messages=messages,
    temperature=0.001,
    seed=42,
    response_format=ClassificationResponse,
)

In [None]:
# get the parsed output
parsed_output = response.choices[0].message.parsed

In [None]:
import textwrap
# print the fields of the parsed output obtained in the previous cell
print("\033[1mcategory:\033[0m", repr(parsed_output.category))
print("\033[1mexplanation:\033[0m", textwrap.fill(parsed_output.explanation, width=70, subsequent_indent='    '))


**TODO** Create a function we can use to iterate over multiple examples.
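One possible starting point is a loop that applies a single-text classifier to every text and records `None` when a call fails. This is only a sketch: `classify_many` and the stand-in `fake_classify` are hypothetical names, and the stand-in replaces the real API call so the example runs without an API key.

```python
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")


def classify_many(texts: Iterable[str], classify_one: Callable[[str], T]) -> List[Optional[T]]:
    """Apply `classify_one` to every text; store None where a call fails."""
    results: List[Optional[T]] = []
    for text in texts:
        try:
            results.append(classify_one(text))
        except Exception as err:  # e.g., network errors or invalid model output
            print(f"classification failed: {err}")
            results.append(None)
    return results


# stand-in classifier (purely illustrative, NOT an LLM call)
def fake_classify(text: str) -> str:
    if not text.strip():
        raise ValueError("empty text")
    return "economic" if "tax" in text.lower() else "neither"


results = classify_many(["We will cut taxes.", "", "Thank you all."], fake_classify)
print(results)  # ['economic', None, 'neither']
```

With the OpenAI client from above, `classify_one` could be a small function that builds `messages` from `instructions` and the text, calls `client.chat.completions.parse(...)` with `response_format=ClassificationResponse`, and returns `response.choices[0].message.parsed`.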

## With Hugging Face _Inference Providers_


In [None]:
import os
from huggingface_hub import InferenceClient

MODEL = "meta-llama/Meta-Llama-3-70B-Instruct"
client = InferenceClient(MODEL, token=os.environ.get("HF_TOKEN"))

The **cool thing** is that the `InferenceClient` works almost exactly like the `openai.Client` class.
So the request code from above barely changes.

With the Hugging Face client, however, the response format is specified differently (in the current version):

In [None]:
from huggingface_hub.inference._generated.types.chat_completion import (
    ChatCompletionInputResponseFormatJSONSchema, 
    ChatCompletionInputJSONSchema
)

# wrap the JSON schema of our response model in the request types used by `InferenceClient`
hf_json_schema = ChatCompletionInputJSONSchema(name="ClassificationResponse", schema=json_schema)
hf_response_format = ChatCompletionInputResponseFormatJSONSchema(type="json_schema", json_schema=hf_json_schema)

#### illustration with a _single_ sentence

In [None]:
text = df.text.iloc[5]
print(text)

messages = [
    # system prompt
    {"role": "system", "content": instructions},
    # user input
    {"role": "user", "content": text},
]

response = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    temperature=0.001,
    seed=42,
    response_format=hf_response_format
)

In [None]:
# look at the output
response.choices[0].message.content

This is a JSON string. We first need to parse it with our output class.

In [None]:
output = ClassificationResponse.model_validate_json(response.choices[0].message.content)

In [None]:
print("\033[1mcategory:\033[0m", repr(output.category))
print("\033[1mexplanation:\033[0m", textwrap.fill(output.explanation, width=70, subsequent_indent='    '))

Tada! 🎉

## With Ollama

In [None]:
from ollama import Client
client = Client()
MODEL = 'gemma3:4b'

In [None]:
text = df.text.iloc[5]
print(text)
messages = [
    # system prompt
    {"role": "system", "content": instructions},
    # user input
    {"role": "user", "content": text},
]

# set some options controlling generation behavior
# NOTE: this changed slightly compared to using `openai` Client
opts = {
    'seed': 42,
    'temperature': 0.0,
}
# NOTE: this changed slightly compared to using `openai` Client
response = client.chat(
    model=MODEL,
    messages=messages,
    options=opts,
    format=json_schema
)

In [None]:
output = ClassificationResponse.model_validate_json(response.message.content)

In [None]:
print("\033[1mcategory:\033[0m", repr(output.category))
print("\033[1mexplanation:\033[0m", textwrap.fill(output.explanation, width=70, subsequent_indent='    '))