# Few-shot text classification with LLMs

This notebook illustrates how to use different LLMs for few-shot text classification:

- closed-source LLMs by OpenAI
- open-weights models hosted via Hugging Face Inference Providers/Endpoints
- open-weights LLMs run locally with `ollama`

## Setup

In [None]:
import os
from pathlib import Path
import pandas as pd
from src.utils.io import read_tabular
import re

from tqdm.notebook import tqdm
from sklearn.metrics import classification_report

### Load data

In [None]:
COLAB = False # no support for colab yet
base_path = Path("/content/advanced_text_analysis/" if COLAB else "../../")
data_path = base_path / "data" / "labeled" / "benoit_crowdsourced_2016"

In [None]:
## (down)load the data
fp = data_path / "benoit_crowdsourced_2016-policy_area.csv"
if not fp.exists():
    url = "https://cta-text-datasets.s3.eu-central-1.amazonaws.com/labeled/" + fp.parent.name + '/' + fp.name
    df = pd.read_csv(url)
    fp.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(fp, index=False)

df = read_tabular(fp, columns=['uid', 'text', 'label', 'metadata__gold'])

In [None]:
# subset to gold examples (i.e., those labeled by experts)
df = df[df.metadata__gold]
del df['metadata__gold']

In [None]:
id2label = {
    2: 'economic',
    3: 'social',
    1: 'neither',
}
df.label = df.label.map(id2label)

print(df.label.value_counts())

In [None]:
# get 20 examples per label class
inputs = df.groupby('label').sample(20, random_state=42)

#### sample few-shot examples

In [None]:
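# sample 3 exemplars per label class from the examples not used as inputs, then shuffle their order
exemplars = df[~df.uid.isin(inputs.uid)].groupby('label').sample(3, random_state=42).reset_index(drop=True).sample(frac=1.0, random_state=42)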

We use these "exemplars" in the conversation history to demonstrate the desired annotation behavior.

For this, we have to format them as **turns** of input text and the assistant's response (using the observed "true" label):

In [None]:
text_template = "Text: '''{text}'''"

def get_exemplar_messages(exemplars):
    exemplar_messages = []
    for _, row in exemplars.iterrows():
        exemplar_messages.append({"role": "user", "content": text_template.format(text=row.text)})
        exemplar_messages.append({"role": "assistant", "content": f"{row.label}"})
    return exemplar_messages

exemplar_messages = get_exemplar_messages(exemplars)

In [None]:
exemplar_messages[:4]
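
To make the turn structure concrete, here is a minimal, self-contained sketch that builds the same kind of message list from two made-up rows. The texts and the names `toy_rows` and `toy_messages` are illustrative and not taken from the dataset:

```python
# Toy illustration of few-shot "turns"; the rows below are invented for this sketch.
text_template = "Text: '''{text}'''"

toy_rows = [
    {"text": "We will cut income tax.", "label": "economic"},
    {"text": "We will put more police on the streets.", "label": "social"},
]

toy_messages = []
for row in toy_rows:
    # each exemplar becomes a user turn (the input) ...
    toy_messages.append({"role": "user", "content": text_template.format(text=row["text"])})
    # ... followed by an assistant turn (the desired output)
    toy_messages.append({"role": "assistant", "content": row["label"]})

for m in toy_messages:
    print(m["role"], "->", m["content"])
```

Each exemplar contributes exactly two messages, so 9 exemplars yield 18 messages between the system prompt and the text to classify.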

## Define the task

In this example, we adapt the instructions for one of the sentence classification tasks examined in Benoit et al. ([2016](https://doi.org/10.1017/S0003055416000058)), "Crowd-sourced Text Analysis: Reproducible and Agile Production of Political Data".

- see [this README file](../../data/labeled/benoit_crowdsourced_2016/README.md) for a description of the data and tasks covered in the paper
- see [this file](../../data/labeled/benoit_crowdsourced_2016/instructions/econ_social_policy.md) for a copy of their original task instructions

In [None]:
instructions = f"""
Act as a text classification system versatile in performing content analysis.

You will read a sentence from a political text.
You will judge whether this sentence deals with economic or social policy.
You must classify sentences into one of the following categories: "economic", "social", or "neither".

## Definitions

These categories have the following definitions:

- Sentences should be coded as "economic" if they deal with aspects of the economy, such as: Taxation, Government spending, Services provided by the government or other public bodies, Pensions, unemployment and welfare benefits, and other state benefits, Property, investment and share ownership, public or private, Interest rates and exchange rates, Regulation of economic activity, public or private, Relations between employers, workers and trade unions
- Sentences should be coded as "social" if they deal with aspects of social and moral life, relationships between social groups, and matters of national and social identity. These include: Policing, crime, punishment and rehabilitation of offenders; Immigration, relations between social groups, discrimination and multiculturalism; The role of the state in regulating the social and moral behavior of individuals

## Step-by-step instructions

Follow these steps to classify the sentence:

1. Carefully read the text of the sentence, paying close attention to details.
2. Assess whether the sentence belongs to any of the categories. If not, return 'neither' as your response.
3. Classify the sentence with the category it belongs to. Return only the name of the category.

## Response format

Only include the selected category in your response and no further text.
"""

In [None]:
texts = inputs.text.to_list()
texts[:3]

## With ChatGPT

In [None]:
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
MODEL = 'gpt-4o-2024-08-06'

#### illustration with a _single_ sentence

In [None]:
text = df.text.iloc[5]
print(text)

messages = [
    # system prompt
    {"role": "system", "content": instructions},
    # NOTE: here we inject the few-shot examples between the instruction and the to-be-classified text
    *exemplar_messages,
    # user input
    {"role": "user", "content": text_template.format(text=text)},
]

response = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    temperature=0.001,
    seed=42
)

response.choices[0].message.content

### Iterate over multiple examples

Let's first define a custom function to classify texts:

In [None]:
def classify_text(text, system_message, exemplars, model):

  # clean the text 
  text = re.sub(r'\s+', ' ', text).strip()

  # construct input

  messages = [
    # system prompt
    {"role": "system", "content": system_message},
    # NOTE: here we inject the few-shot examples between the instruction and the to-be-classified text
    *exemplars,
    # user input
    {"role": "user", "content": text_template.format(text=text)},
  ]

  response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=0.001,
    seed=42
  )
  
  result = response.choices[0].message.content
  
  return result

Now we can iterate over example texts:

In [None]:
classifications_gpt4o = [
    classify_text(text, instructions, exemplar_messages, model=MODEL)
    for text in tqdm(texts)
]
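
Before scoring, it can help to check that every response is actually one of the three categories, since any unexpected string would show up as an extra class in the classification report. The following is a minimal sketch; `normalize_label` and `VALID_LABELS` are hypothetical helpers, not part of the notebook or of any library:

```python
VALID_LABELS = {"economic", "social", "neither"}  # the categories named in the instructions

def normalize_label(response, valid_labels=VALID_LABELS, fallback=None):
    """Map a raw model response to one of the valid labels, or to `fallback`."""
    # lowercase, then strip surrounding whitespace, quotes, and simple punctuation
    cleaned = response.lower().strip(" \t\n\"'`.!")
    return cleaned if cleaned in valid_labels else fallback

raw = ["economic", " Social.", "'neither'", "I think this is economic"]
print([normalize_label(r) for r in raw])
```

Responses that cannot be mapped come back as `None` here, so they can be counted and inspected instead of being silently scored as errors.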

In [None]:
cr = classification_report(
    y_true=inputs.label,
    y_pred=classifications_gpt4o,
)
print(cr)

Without exemplars (zero-shot inference), the macro F1 was 0.93.
This was already very strong.

So in this case, adding exemplars doesn't achieve an improvement.
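
For readers unfamiliar with the metric, the macro F1 is the unweighted mean of the per-class F1 scores. The dependency-free sketch below spells this out on invented labels; `macro_f1` is a hypothetical helper written for illustration, and for real evaluations the `macro avg` row of the classification report should give the same number up to rounding:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (an undefined F1 counts as 0)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn  # F1 = 2TP / (2TP + FP + FN)
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# invented example: the single "neither" item is misclassified as "economic"
y_true = ["economic", "social", "neither", "economic"]
y_pred = ["economic", "social", "economic", "economic"]
print(round(macro_f1(y_true, y_pred), 2))
```

Because every class counts equally, one completely missed class pulls the macro F1 down sharply even when most items are classified correctly.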

## With Hugging Face _Inference Providers_


In [None]:
import os
from huggingface_hub import InferenceClient

MODEL = "meta-llama/Meta-Llama-3-70B-Instruct"
client = InferenceClient(MODEL, token=os.environ.get("HF_TOKEN"))

The **cool thing** is that the `InferenceClient` exposes the same `chat.completions.create` interface as the `openai.OpenAI` client.
So the calling code from above really _doesn't change_!
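
Since both clients share this interface, one possible design is to pass the client in as an argument instead of relying on the global `client` variable. The sketch below assumes only that the client offers `chat.completions.create`; `classify_with` and `StubClient` are hypothetical names, and the stub stands in for a real client so the example runs without an API key:

```python
from types import SimpleNamespace

def classify_with(client, text, system_message, exemplars, model):
    """Hypothetical variant of `classify_text` that receives the client explicitly."""
    messages = [
        {"role": "system", "content": system_message},
        *exemplars,
        {"role": "user", "content": f"Text: '''{text}'''"},
    ]
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content

class StubClient:
    """Stand-in for an OpenAI-style client that always returns a fixed answer."""
    def __init__(self, answer):
        def create(model, messages, **kwargs):
            message = SimpleNamespace(content=answer)
            return SimpleNamespace(choices=[SimpleNamespace(message=message)])
        self.chat = SimpleNamespace(completions=SimpleNamespace(create=create))

print(classify_with(StubClient("economic"), "We will cut taxes.", "Classify.", [], model="stub"))
```

With this signature, switching between providers means swapping the `client` argument instead of redefining the function.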

#### illustration with a _single_ sentence

In [None]:
text = df.text.iloc[5]
print(text)

messages = [
    # system prompt
    {"role": "system", "content": instructions},
    # NOTE: here we inject the few-shot examples between the instruction and the to-be-classified text
    *exemplar_messages,
    # user input
    {"role": "user", "content": text_template.format(text=text)},
]

response = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    temperature=0.001,
    seed=42
)

response.choices[0].message.content

### Iterate over multiple examples

Let's first define a custom function to classify texts:

In [None]:
def classify_text(text, system_message, exemplars, model):
  # NOTE: passing `model` is optional here because the InferenceClient was already set up with the model

  # clean the text 
  text = re.sub(r'\s+', ' ', text).strip()

  # construct input

  messages = [
    # system prompt
    {"role": "system", "content": system_message},
    # NOTE: here we inject the few-shot examples between the instruction and the to-be-classified text
    *exemplars,
    # user input
    {"role": "user", "content": text_template.format(text=text)},
  ]

  response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=0.001,
    seed=42
  )
  
  result = response.choices[0].message.content
  
  return result

Now we can iterate over example texts:

In [None]:
classifications_llama3_70b = [
    classify_text(text, instructions, exemplar_messages, model=MODEL)
    for text in tqdm(texts)
]

In [None]:
cr = classification_report(
    y_true=inputs.label,
    y_pred=classifications_llama3_70b,
)
print(cr)

With zero-shot inference, the macro F1 was 0.83.
So in this case, adding few-shot exemplars boosts it to 0.89.


Let's also try a model from DeepSeek (the code below loads `DeepSeek-V3-0324`, although the variable names refer to R1):

In [None]:
MODEL= "deepseek-ai/DeepSeek-V3-0324"

client = InferenceClient(MODEL, provider="sambanova", token=os.environ.get("HF_TOKEN"))

In [None]:
classifications_deepseekR1 = [
    classify_text(text, instructions, exemplar_messages, model=MODEL)
    for text in tqdm(texts)
]

In [None]:
cr = classification_report(
    y_true=inputs.label,
    y_pred=classifications_deepseekR1,
)
print(cr)

Here we boost the macro F1 from 0.85 (zero-shot) to 0.90 (9-shot).

## With Ollama

In [None]:
from ollama import Client
client = Client()
MODEL = 'gemma3:4b'

In [None]:
# list models
available_models = [m['model'] for m in client.list()['models']]

if MODEL not in available_models:
    import ollama
    ollama.pull(MODEL)

### Iterate over multiple examples

Let's first define a custom function to classify texts:

In [None]:
def classify_text(text, system_message, exemplars, model):

  # clean the text 
  text = re.sub(r'\s+', ' ', text).strip()

  # construct input

  messages = [
    # system prompt
    {"role": "system", "content": system_message},
    # NOTE: here we inject the few-shot examples between the instruction and the to-be-classified text
    *exemplars,
    # user input
    {"role": "user", "content": text_template.format(text=text)},
  ]

  # set some options controlling generation behavior
  # NOTE: this changed slightly compared to using `openai` Client
  opts = {
      'seed': 42,         # seed controlling random number generation and thus stochastic generation
      'temperature': 0.0, # hyperparameter controlling "creativity", see https://learnprompting.org/docs/basics/configuration_hyperparameters
      'num_predict': 3    # maximum number of tokens to generate (Ollama's option name; `max_tokens` is the OpenAI name)
  }
  # NOTE: this changed slightly compared to using `openai` Client
  response = client.chat(
    model=model,
    messages=messages,
    options=opts
  )
  
  # NOTE: this changed slightly compared to using `openai` Client
  result = response.message.content.strip()
  
  return result

In [None]:
classifications_gemma3_4b = [
    classify_text(text, instructions, exemplar_messages, model=MODEL)
    for text in tqdm(texts)
]

In [None]:
cr = classification_report(
    y_true=inputs.label,
    y_pred=classifications_gemma3_4b,
)
print(cr)

Here we boost the macro F1 from 0.80 (zero-shot) to 0.87 (few-shot), a relative gain of 8.75%.
So for the smallest model, we get the strongest relative gain from using few-shot exemplars.
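
The relative gains can be checked with a few lines of arithmetic. The sketch below uses the macro F1 values reported in this notebook for the three open-weights models; the dictionary keys are just illustrative names:

```python
# macro F1 values as reported in this notebook: (zero-shot, few-shot)
scores = {
    "llama3_70b": (0.83, 0.89),
    "deepseek": (0.85, 0.90),
    "gemma3_4b": (0.80, 0.87),
}

# relative gain in percent: (few-shot - zero-shot) / zero-shot * 100
gains = {
    name: (few_shot - zero_shot) / zero_shot * 100
    for name, (zero_shot, few_shot) in scores.items()
}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}% relative")
```

Note that these numbers come from only 60 test sentences, so small differences between models should be read with caution.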

## Similarity-based exemplar selection 

In [None]:
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

embedder = SentenceTransformer('all-mpnet-base-v2')

In [None]:
def select_exemplars_by_similarity(text, embeddings, labels, k=3):
    """
    Select the top-k most similar exemplars for each class based on cosine similarity.
    """
    # embed the input text
    text_embedding = embedder.encode(text)
    # compute cosine similarities
    similarities = cosine_similarity([text_embedding], embeddings)[0]
    # put similarities and labels in a DataFrame
    out = pd.Series(similarities).to_frame('similarity')
    out['label'] = labels
    # select top-k exemplars per class
    out = out.groupby('label')['similarity'].apply(lambda x: x.nlargest(k)).reset_index(level=0, drop=True)
    # reshuffle
    out = out.sample(frac=1.0, random_state=42)
    # return the indices of the selected exemplars
    return out.index.tolist()

# pre-compute embeddings for all available exemplars (i.e., those not in the set of input texts to classify)
exemplars = df[~df.uid.isin(inputs.uid)].reset_index(drop=True)
exemplar_embeddings = embedder.encode(exemplars.text.to_list())

n_exemplars = 3
exemplars_by_text = []
# iterate over all texts to select the most similar exemplars for each text
for text in tqdm(texts, desc="Selecting exemplars by similarity"):
    idxs = select_exemplars_by_similarity(text, exemplar_embeddings, exemplars.label, k=n_exemplars)
    exs = exemplars.iloc[idxs]
    exemplars_by_text.append(get_exemplar_messages(exs))

In [None]:
# classify all texts using similarity-based exemplar selection
classifications_gemma3_4b_sim = []
for text, exs in tqdm(zip(texts, exemplars_by_text), total=len(texts)):
    pred = classify_text(text, instructions, exs, model=MODEL)
    classifications_gemma3_4b_sim.append(pred)

In [None]:
cr = classification_report(
    y_true=inputs.label,
    y_pred=classifications_gemma3_4b_sim,
)
print(cr)

Compared to randomly sampled exemplars, the macro F1 doesn't change, but the performance across label classes becomes more balanced.
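
To see what per-class top-k selection does, here is a dependency-free toy version with made-up 2-d vectors in place of real sentence embeddings. `cosine` and `top_k_per_class` are hypothetical helpers; unlike the function above, this sketch skips the final reshuffling step:

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k_per_class(query, vectors, labels, k=1):
    """Return indices of the k most similar vectors within each label class."""
    by_label = defaultdict(list)
    for idx, (vec, label) in enumerate(zip(vectors, labels)):
        by_label[label].append((cosine(query, vec), idx))
    selected = []
    for label in sorted(by_label):
        ranked = sorted(by_label[label], reverse=True)  # highest similarity first
        selected.extend(idx for _, idx in ranked[:k])
    return selected

# invented vectors: two "economic" and two "social" exemplars
vectors = [(1.0, 0.0), (0.8, 0.6), (0.0, 1.0), (0.6, 0.8)]
labels = ["economic", "economic", "social", "social"]
print(top_k_per_class((1.0, 0.1), vectors, labels, k=1))
```

Selecting within each class keeps the exemplar set balanced, so the nearest neighbors of a text cannot all come from a single label.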

## Inter-LLM agreement?

What if we treat the different LLMs' classifications as annotations?
Then we can compute the degree of their inter-coder agreement (ICA).

This is equivalent to what we did in the [notebook](../annotation/compute_ica_pledge_classification.ipynb) on computing ICA in our policy pledge codings.

In [None]:
import pandas as pd
from krippendorff import alpha

tmp = pd.DataFrame({
    'gpt4o': classifications_gpt4o,
    'gemma3_4b': classifications_gemma3_4b,
    'llama3_70b': classifications_llama3_70b,
    'deepseekR1': classifications_deepseekR1,
})

label2id = {
    'economic': 0,
    'social': 1,
    'neither': 2,
}

tmp = tmp.apply(lambda x: x.map(label2id))
alpha(tmp.T.values, level_of_measurement='nominal')

😳 This is a very strong agreement and about 10% higher than the agreement between LLMs' zero-shot classifications.
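
Krippendorff's alpha condenses agreement into a single chance-corrected number. As a complement, raw pairwise agreement shows which pairs of models disagree most often. The sketch below uses invented labels for three hypothetical annotators; `pairwise_agreement` is an illustrative helper, and its output is not chance-corrected:

```python
from itertools import combinations

def pairwise_agreement(annotations):
    """Share of items on which each pair of annotators assigns the same label."""
    result = {}
    for a, b in combinations(annotations, 2):
        same = sum(x == y for x, y in zip(annotations[a], annotations[b]))
        result[(a, b)] = same / len(annotations[a])
    return result

# invented labels for three hypothetical annotators and four items
toy = {
    "model_a": ["economic", "social", "neither", "economic"],
    "model_b": ["economic", "social", "economic", "economic"],
    "model_c": ["economic", "neither", "neither", "economic"],
}
for pair, share in pairwise_agreement(toy).items():
    print(pair, share)
```

The same function could be applied to the columns of the `tmp` data frame, for example via `tmp.to_dict('list')`.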