# Zero-shot text classification with LLMs

This notebook illustrates how to use different LLMs for text classification.

- closed-source LLMs by OpenAI
- open-weights models hosted via Hugging Face Inference Providers/Endpoints
- open-weights LLMs run locally with `ollama`

## Setup

In [8]:
import os
from pathlib import Path
import pandas as pd
from src.utils.io import read_tabular
import re

from tqdm.notebook import tqdm
from sklearn.metrics import classification_report

### Load data

In [9]:
COLAB = False # no support for colab yet
base_path = Path("/content/advanced_text_analysis/" if COLAB else "../../")
data_path = base_path / "data" / "labeled" / "benoit_crowdsourced_2016"

In [10]:
## (down)load the data
fp = data_path / "benoit_crowdsourced_2016-policy_area.csv"
if not fp.exists():
    url = "https://cta-text-datasets.s3.eu-central-1.amazonaws.com/labeled/" + fp.parent.name + '/' + fp.name
    df = pd.read_csv(url)
    fp.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(fp, index=False)

df = read_tabular(fp, columns=['uid', 'text', 'label', 'metadata__gold'])

In [11]:
# subset to gold examples (i.e., those labeled by experts)
df = df[df.metadata__gold]
del df['metadata__gold']

In [12]:
id2label = {
    2: 'economic',
    3: 'social',
    1: 'neither',
}
df['label'] = df.label.map(id2label)

print(df['label'].value_counts())

label
economic    225
neither     181
social      100
Name: count, dtype: int64


In [13]:
# get 20 examples per label class
expls = df.groupby('label').sample(20, random_state=42)

In [15]:
len(expls)

60
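
To make explicit what `groupby('label').sample(20, random_state=42)` does, here is a minimal plain-Python sketch of drawing a fixed number of rows per class. The helper name `sample_per_class` and the toy rows are illustrative assumptions, and the sketch will not reproduce the exact rows pandas draws.

```python
import random
from collections import defaultdict

def sample_per_class(rows, label_key, n, seed=42):
    """Draw n rows per label without replacement (hypothetical helper)."""
    rng = random.Random(seed)
    # group the rows by their label
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    # sample the same number of rows from every group
    sampled = []
    for label in sorted(by_label):
        sampled.extend(rng.sample(by_label[label], n))
    return sampled

# toy data (not the real dataset): 5 rows per class
toy_rows = [
    {"uid": i, "label": label}
    for i, label in enumerate(["economic", "social", "neither"] * 5)
]
toy_sample = sample_per_class(toy_rows, "label", n=2)
```

Sampling the same number of examples per class gives a balanced evaluation set even though the full gold data are imbalanced (225 / 181 / 100).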

## Define the task

In this example, we adapt the instructions for one of the text classification tasks examined in Benoit et al. ([2016](https://doi.org/10.1017/S0003055416000058)), "Crowd-sourced Text Analysis: Reproducible and Agile Production of Political Data".

- see [this README file](../../data/labeled/benoit_crowdsourced_2016/README.md) for a description of the data and tasks covered in the paper
- see [this file](../../data/labeled/benoit_crowdsourced_2016/instructions/econ_social_policy.md) for a copy of their original task instructions

In [16]:
instructions = f"""
Act as a text classification system versatile in performing content analysis.

You will read a sentence from a political text.
You will judge whether this sentence deals with economic or social policy.
You must classify the sentence into one of the following categories: "economic", "social", or "neither". 

## Definitions

These categories have the following definitions:

- Sentences should be coded as "economic" if they deal with aspects of the economy, such as: Taxation, Government spending, Services provided by the government or other public bodies, Pensions, unemployment and welfare benefits, and other state benefits, Property, investment and share ownership, public or private, Interest rates and exchange rates, Regulation of economic activity, public or private, Relations between employers, workers and trade unions
- Sentences should be coded as "social" if they deal with aspects of social and moral life, relationships between social groups, and matters of national and social identity. These include: Policing, crime, punishment and rehabilitation of offenders; Immigration, relations between social groups, discrimination and multiculturalism; The role of the state in regulating the social and moral behavior of individuals

## Step-by-step instructions

Follow these steps to classify the sentence:

1. Carefully read the text of the sentence, paying close attention to details.
2. Assess whether the sentence belongs to any of the categories. If not, return 'neither' as your response.
3. Classify the sentence with the category it belongs to. Return only the name of the category.

## Response format

Only include the selected category in your response and no further text.
"""

In [17]:
texts = expls.text.to_list()
texts[:3]

['They are no longer content that some of the most important decisions in their lives what school their children attend, for example, or whether or not to go on strike should be taken by officialdom or trade union bosses.',
 'Any extra burden on business will destroy jobs.',
 'We will increase the bonus by paying a double pension in the first week of December.']

## With ChatGPT

In [18]:
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
MODEL = 'gpt-4o-2024-08-06'

#### illustration with a _single_ sentence

In [19]:
text = df.text.iloc[5]
print(text)

messages = [
    # system prompt
    {"role": "system", "content": instructions},
    # user input
    {"role": "user", "content": text},
]

response = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    temperature=0.001,
    seed=42
)

response.choices[0].message.content

And that Labour's much-vaunted pay pact with the unions collapsed in the industrial anarchy of the winter of discontent, n which the dead went unburied, rubbish piled up in the streets and the country was gripped by a creeping paralysis which Labour was powerless to cure?


'economic'

In [20]:
df.label.iloc[5]

'economic'

### Iterate over multiple examples

Let's first define a custom function to classify texts:

In [21]:
def classify_text(text, system_message, model):

  # clean the text 
  text = re.sub(r'\s+', ' ', text).strip()

  # construct input

  messages = [
    # system prompt
    {"role": "system", "content": system_message},
    # user input
    {"role": "user", "content": text},
  ]

  response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=0.001,
    seed=42
  )
  
  result = response.choices[0].message.content
  
  return result

In [23]:
len(texts)
MODEL

'gpt-4o-2024-08-06'

Now we can iterate over example texts:

In [24]:
classifications_gpt4o = [
    classify_text(text, instructions, model=MODEL)
    for text in tqdm(texts)
]


In [25]:
cr = classification_report(
    y_true=expls.label,
    y_pred=classifications_gpt4o,
)
print(cr)

              precision    recall  f1-score   support

    economic       0.95      0.90      0.92        20
     neither       1.00      0.90      0.95        20
      social       0.87      1.00      0.93        20

    accuracy                           0.93        60
   macro avg       0.94      0.93      0.93        60
weighted avg       0.94      0.93      0.93        60
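
To see where the numbers in such a report come from, here is a small sketch that computes precision, recall and F1 for a single class by hand. The function name `per_class_metrics` and the toy labels are assumptions for illustration. They are not the actual predictions from above.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall and F1 for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# hypothetical toy labels
toy_true = ["economic", "economic", "social", "social", "neither", "neither"]
toy_pred = ["economic", "social", "social", "social", "neither", "social"]
precision, recall, f1 = per_class_metrics(toy_true, toy_pred, "social")
```

In the toy data, every true "social" sentence is found (recall 1.0), but "social" is also assigned to two other sentences (precision 0.5). The report above shows the same pattern in milder form: perfect recall but lower precision for "social" means the model over-assigns this category.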




#### Caveat {style="color: orange"}

The annoying thing about OpenAI is that their models are closed-source, meaning we cannot download or inspect them and can only query them through the API.
This limits reproducibility (see Palmer et al. [2024](https://www.nature.com/articles/s43588-023-00585-1)).

So instead of relying on them, we can use "open-weights" models.
These are models for which we can freely download the model weights (i.e., parameters).
We examine two options below: 

1. using Hugging Face _Inference Providers_ (via API)
2. using Ollama (run locally)

## With Hugging Face _Inference Providers_


In [26]:
import os
from huggingface_hub import InferenceClient

MODEL = "meta-llama/Meta-Llama-3-70B-Instruct"
client = InferenceClient(MODEL, token=os.environ.get("HF_TOKEN"))

The **cool thing** is that the `InferenceClient` exposes the same `chat.completions.create` interface as the `openai.OpenAI` client.
So the code from above really _doesn't change_!
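
Because both clients are called through `client.chat.completions.create`, one way to avoid redefining the function for every backend is to pass the client in as an argument. The following is only a sketch: `classify_with` is a hypothetical name, and the stub client below merely stands in for a real client so the example runs offline.

```python
import re
from types import SimpleNamespace

def classify_with(client, text, system_message, model, **gen_kwargs):
    """Classify one text with any client exposing `chat.completions.create`."""
    # collapse runs of whitespace, as in the classify_text function above
    text = re.sub(r"\s+", " ", text).strip()
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": text},
    ]
    response = client.chat.completions.create(
        model=model, messages=messages, **gen_kwargs
    )
    return response.choices[0].message.content

# Stub client (hypothetical): records the messages and always answers "economic"
calls = []

def _fake_create(model, messages, **kwargs):
    calls.append(messages)
    reply = SimpleNamespace(message=SimpleNamespace(content="economic"))
    return SimpleNamespace(choices=[reply])

stub_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
)
demo_label = classify_with(
    stub_client,
    "Any extra  burden on business\nwill destroy jobs.",
    "instructions go here",
    model="stub-model",
    temperature=0.001,
    seed=42,
)
```

With a real client, the call would look the same, with `stub_client` replaced by the `OpenAI` or `InferenceClient` instance.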

#### illustration with a _single_ sentence

In [27]:
text = df.text.iloc[5]
print(text)

messages = [
    # system prompt
    {"role": "system", "content": instructions},
    # user input
    {"role": "user", "content": text},
]

response = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    temperature=0.001,
    seed=42
)

response.choices[0].message.content

And that Labour's much-vaunted pay pact with the unions collapsed in the industrial anarchy of the winter of discontent, n which the dead went unburied, rubbish piled up in the streets and the country was gripped by a creeping paralysis which Labour was powerless to cure?


'social'

In [28]:
df.label.iloc[5]

'economic'

### Iterate over multiple examples

Let's first define a custom function to classify texts:

In [29]:
def classify_text(text, system_message, model):
  # NOTE: `model` is not strictly needed here because we already set up the InferenceClient with the model

  # clean the text 
  text = re.sub(r'\s+', ' ', text).strip()

  # construct input

  messages = [
    # system prompt
    {"role": "system", "content": system_message},
    # user input
    {"role": "user", "content": text},
  ]

  response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=0.001,
    seed=42
  )
  
  result = response.choices[0].message.content
  
  return result

Now we can iterate over example texts:

In [30]:
classifications_llama3_70b = [
    classify_text(text, instructions, model=MODEL)
    for text in tqdm(texts)
]


In [31]:
cr = classification_report(
    y_true=expls.label,
    y_pred=classifications_llama3_70b,
)
print(cr)

              precision    recall  f1-score   support

    economic       0.85      0.85      0.85        20
     neither       1.00      0.70      0.82        20
      social       0.77      1.00      0.87        20

    accuracy                           0.85        60
   macro avg       0.87      0.85      0.85        60
weighted avg       0.87      0.85      0.85        60



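Alright, the performance is somewhat lower, but this is only one of many available models.

What if we try the well-known V3 model from DeepSeek? (The model ID below is `DeepSeek-V3-0324`, not R1, even though the variable names further down say `deepseekR1`.)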

In [32]:
MODEL = "deepseek-ai/DeepSeek-V3-0324"

client = InferenceClient(MODEL, provider="sambanova", token=os.environ.get("HF_TOKEN"))

In [33]:
classify_text(text=df.text.iloc[5], system_message=instructions, model=MODEL)

'social'

In [34]:
classifications_deepseekR1 = [
    classify_text(text, instructions, model=MODEL)
    for text in tqdm(texts)
]


In [None]:
cr = classification_report(
    y_true=expls.label,
    y_pred=classifications_deepseekR1,
    labels=list(id2label.values())
)
print(cr)

              precision    recall  f1-score   support

    economic       1.00      0.90      0.95        20
      social       0.69      1.00      0.82        20
     neither       1.00      0.60      0.75        20

   micro avg       0.85      0.83      0.84        60
   macro avg       0.90      0.83      0.84        60
weighted avg       0.90      0.83      0.84        60



Well, this didn't get any better. But Inference Providers are very flexible, so we could try other models: see [here](https://huggingface.co/inference/models) for the models available from different providers.
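
Note that the report above has a `micro avg` row where the earlier reports show `accuracy`. To my understanding, scikit-learn prints this row when the `labels` argument does not cover every label present in the data, which suggests that some responses were not exactly one of the three category names. A quick way to check is to count the responses outside the label set. The sketch below uses a hypothetical helper `invalid_predictions` and made-up responses, not the actual model output.

```python
from collections import Counter

VALID_LABELS = {"economic", "social", "neither"}

def invalid_predictions(predictions, valid_labels=VALID_LABELS):
    """Count responses that are not exactly one of the valid labels."""
    return Counter(p for p in predictions if p not in valid_labels)

# hypothetical responses for illustration only
toy_predictions = ["economic", "Social", "neither", "social.", "economic", "Social"]
invalid_counts = invalid_predictions(toy_predictions)
```

Running the same check on `classifications_deepseekR1` would show which responses, if any, need cleaning before they are scored.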


## With Ollama

In [41]:
from ollama import Client
client = Client()
MODEL = 'gemma3:4b'

In [42]:
# list models
available_models = [m['model'] for m in client.list()['models']]
available_models

['deepseek-r1:32b',
 'gemma3:27b',
 'sikamikanikobg/OlympicCoder-7B:latest',
 'qwq:latest',
 'qwq:32b',
 'llama3.3:70b',
 'mistral-small:latest',
 'mistral-small:24b',
 'qwen2.5:32b',
 'llama3.1:8b',
 'aya:35b',
 'llama3.3:latest',
 'mxbai-embed-large:335m',
 'phi4:latest',
 'phi4:14b']

In [43]:
if MODEL not in available_models:
    import ollama
    ollama.pull(MODEL)

ResponseError: pull model manifest: 412: 

The model you are attempting to pull requires a newer version of Ollama.

Please download the latest version at:

	https://ollama.com/download

 (status code: 500)

### Iterate over multiple examples

Let's first define a custom function to classify texts:

In [None]:
def classify_text(text, system_message, model):

  # clean the text 
  text = re.sub(r'\s+', ' ', text).strip()

  # construct input

  messages = [
    # system prompt
    {"role": "system", "content": system_message},
    # user input
    {"role": "user", "content": text},
  ]

  # set some options controlling generation behavior
  # NOTE: this changed slightly compared to using `openai` Client
  opts = {
      'seed': 42,         # seed controlling random number generation and thus stochastic generation
      'temperature': 0.0, # hyperparameter controlling "creativity", see https://learnprompting.org/docs/basics/configuration_hyperparameters
      'num_predict': 3    # maximum number of tokens to generate in completion (Ollama's counterpart to `max_tokens`)
  }
  # NOTE: this changed slightly compared to using `openai` Client
  response = client.chat(
    model=model,
    messages=messages,
    options=opts
  )
  
  # NOTE: this changed slightly compared to using `openai` Client
  result = response.message.content.strip()
  
  return result

In [None]:
# the first call might take some time because the model needs to be loaded first
classify_text(texts[5], instructions, MODEL)

Now we can iterate over example texts:

In [None]:
classifications_gemma3_4b = [
    classify_text(text, instructions, model=MODEL)
    for text in tqdm(texts)
]

In [None]:
cr = classification_report(
    y_true=expls.label,
    y_pred=classifications_gemma3_4b,
)
print(cr)

## With `transformers`

**_Caveat:_** We can only use a very small LLM for illustrative purposes here.

Note: on a CUDA GPU you can also use quantization:

```python
import torch
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

model = AutoModelForCausalLM.from_pretrained(..., quantization_config=bnb_config, ...)
```

In [44]:
import os
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
os.environ["TRANSFORMERS_VERBOSITY"] = "error"

# load the model and tokenizer
model_id = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, pad_token_id=tokenizer.eos_token_id, device_map="auto")

In [45]:
text = df.text.iloc[5]
print(text)

messages = [
    # system prompt
    {"role": "system", "content": instructions},
    # user input
    {"role": "user", "content": text},
]

And that Labour's much-vaunted pay pact with the unions collapsed in the industrial anarchy of the winter of discontent, n which the dead went unburied, rubbish piled up in the streets and the country was gripped by a creeping paralysis which Labour was powerless to cure?


We first need to apply the chat template so we can perform [chat completion](https://huggingface.co/docs/inference-providers/en/tasks/chat-completion) instead of mere text generation/completion:

In [46]:
chat_messages = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print(chat_messages)

<|im_start|>system

Act as a text classification system versatile in performing content analysis.

You will read a sentence from a political text.
You will judge whether this sentence deals with economic or social policy.
You must classify the sentence into one of the following categories: "economic", "social", or "neither". 

## Definitions

These categories have the following definitions:

- Sentences should be coded as "economic" if they deal with aspects of the economy, such as: Taxation, Government spending, Services provided by the government or other public bodies, Pensions, unemployment and welfare benefits, and other state benefits, Property, investment and share ownership, public or private, Interest rates and exchange rates, Regulation of economic activity, public or private, Relations between employers, workers and trade unions
- Sentences should be coded as "social" if they deal with aspects of social and moral life, relationships between social groups, and matters of nat

As you can see, this just converts the list of messages into a single text by adding special tokens that demarcate the system, user, and assistant turns.

Next, we need to tokenize this input:
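
To illustrate the idea, here is a simplified re-implementation of such a template in plain Python. It is only a sketch: the real template ships with the tokenizer, and the `<|im_end|>` closing token and the trailing assistant header are assumptions based on the ChatML convention, since the printed output above is truncated. The function name `render_chatml` is hypothetical.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Simplified ChatML-style rendering (not the tokenizer's actual template)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # open the assistant turn so the model continues with its answer
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

toy_messages = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi"},
]
rendered = render_chatml(toy_messages)
print(rendered)
```

The `add_generation_prompt` flag matters for classification: without the opening assistant header, the model may continue the user's turn instead of answering.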

In [47]:
inputs = tokenizer(chat_messages, return_tensors="pt")
inputs = inputs.to(model.device) # move to same device as model (GPU if available)

The tokenized inputs can be processed through the model to generate a response:

In [52]:
outputs = model.generate(**inputs, max_new_tokens=4, do_sample=False)

In [53]:
offset = inputs['input_ids'].shape[1]
response = tokenizer.decode(outputs[0][offset:].cpu(), skip_special_tokens=True)

In [54]:
response

'economical'
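
Note that `'economical'` is not one of our three category names, so it would be counted as an error if scored as is. One option is to map raw responses onto the valid labels before evaluation. The sketch below is one possible rule (match the first word of the response against each label as a prefix). The helper name `normalize_response` and the rule itself are assumptions, and whether "economical" should count as "economic" is a judgment call.

```python
import re

LABELS = ("economic", "social", "neither")

def normalize_response(response, labels=LABELS):
    """Map a raw model response to a valid label, or None if nothing matches."""
    # lowercase and replace anything that is not a letter with a space
    words = re.sub(r"[^a-z]", " ", response.lower()).split()
    if not words:
        return None
    first = words[0]
    for label in labels:
        # prefix match: 'economical' starts with 'economic'
        if first.startswith(label):
            return label
    return None

normalized = normalize_response("economical")
```

Returning `None` for unmatched responses keeps them visible, so they can be inspected rather than silently assigned to a category.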

## Inter-LLM agreement?

What if we treat the different LLMs' classifications as annotations?
Then we can compute the degree of their inter-coder agreement (ICA).

This is equivalent to what we did in the [notebook](../annotation/compute_ica_pledge_classification.ipynb) on computing ICA in our policy pledge codings.

In [None]:
import pandas as pd
from krippendorff import alpha

tmp = pd.DataFrame({
    'gpt4o': classifications_gpt4o,
    'gemma3_4b': classifications_gemma3_4b,
    'llama3_70b': classifications_llama3_70b,
    'deepseekR1': classifications_deepseekR1,
})

label2id = {
    'economic': 0,
    'social': 1,
    'neither': 2,
}

tmp = tmp.apply(lambda x: x.map(label2id))
alpha(tmp.T.values, level_of_measurement='nominal')

😳 This is very strong agreement between the LLMs' classifications.
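
For intuition about what the agreement coefficient measures, here is a minimal plain-Python sketch of Krippendorff's alpha for nominal data, following the textbook formula (one minus observed over expected disagreement, computed from the coincidence matrix). The function name `nominal_alpha` and the toy codings are assumptions for illustration. For real analyses, the `krippendorff` package used above is the better choice.

```python
from collections import Counter
from itertools import permutations

def nominal_alpha(reliability_data):
    """Krippendorff's alpha for nominal labels.

    reliability_data: one row per coder, one column per unit; None = missing.
    """
    coincidences = Counter()
    for unit in zip(*reliability_data):
        values = [v for v in unit if v is not None]
        if len(values) < 2:
            continue  # a unit with fewer than two labels has no pairs
        weight = 1 / (len(values) - 1)
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += weight
    # marginal totals per category
    totals = Counter()
    for (a, _), w in coincidences.items():
        totals[a] += w
    n = sum(totals.values())
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    # alpha is undefined if there is no expected disagreement
    return 1 - observed / expected if expected else float("nan")

# hypothetical codings by two coders for four units
coder_a = ["economic", "economic", "social", "social"]
coder_b = ["economic", "social", "social", "social"]
toy_alpha = nominal_alpha([coder_a, coder_b])
```

In the toy example, the coders disagree on one of four units and alpha is 8/15 (about 0.53). Alpha is lower than the raw agreement of 0.75 because it corrects for the agreement expected by chance.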