# Application: Pledge extraction through in-context learning

This notebook illustrates how to apply in-context learning to identify policy pledges in party manifesto sentences.

We will rely on LLMs served via the Hugging Face _Inference Providers_ API. 

### Background

We can apply in-context learning to task a large pre-trained generative language model with an annotation task.
This is commonly referred to as ["prompting"](https://learnprompting.org/vocabulary/prompting).

The **most important ingredients** for this approach are:

1. A large pre-trained generative language model (LLM) that can generate [chat completions](https://huggingface.co/docs/inference-providers/en/tasks/chat-completion).
2. A task description that describes the to-be-completed annotation task to the LLM.
3. To-be-annotated texts.
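To make these ingredients concrete, here is a minimal sketch of how a task description and a to-be-annotated text are typically combined into a chat conversation. The helper name `build_messages` and the shortened task description are illustrative assumptions, not part of this notebook's code.

```python
from typing import Dict, List


def build_messages(task_description: str, text: str) -> List[Dict[str, str]]:
    """Combine a task description and a to-be-annotated text into chat messages."""
    return [
        # ingredient 2: the task description goes into the system message
        {"role": "system", "content": task_description.strip()},
        # ingredient 3: the to-be-annotated text goes into the user message
        {"role": "user", "content": text.strip()},
    ]


# hypothetical, shortened task description for illustration only
messages = build_messages(
    "Extract all pledges from the sentence.",
    "We will build 100,000 affordable housing units by 2030.",
)
print([message["role"] for message in messages])
```

The first ingredient, the LLM, then receives this list of messages and completes the conversation with an assistant message.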


## Basic setup

### Defining the prompt

The formatted string below specifies the task instruction that we will use as a [**system prompt**](https://promptengineering.org/system-prompts-in-large-language-models/) for the model.

The task instructions are based on the **definition of a pledge** used in the [_Comparative Pledges Project_](https://comparativepledges.net/).
As is common, the instructions include
    
1. a task description,
2. a definition of the focal concept,
3. instructions for how to format the response, and
4. (optionally) a few examples that illustrate the desired annotation behavior.

Note that in the example below, 

1. we use a *persona*. This is a common practice, although there is [no clear evidence](https://arxiv.org/abs/2311.10054) that it improves LLMs' performance.
2. we use a strategy called ["Chain-of-Thought" prompting](https://learnprompting.org/vocabulary/CoT_Prompting), that is, a prompt that describes the "thought process" the LLM should apply to generate its response.

In [1]:
prompt = """
You are an expert annotator of political texts. \
Your task is to identify **pledges** and extract them from a sentence taken from a party manifesto.  

### Definition of a Pledge  

A pledge is a specific kind of statement in which the author commits to a concrete action or outcome.
To be counted as a pledge, a statement must meet two criteria: 

1. **Specific and Concrete**: The statement should outline a clear, measurable action or outcome.
2. **Clear Commitment**: The statement should represent a firm commitment to take that action, not just a general principle or stance.

Importantly, this is a *narrow* definition: only statements with clear, measurable, and time-bound outcomes are counted as pledges.

### Reasoning Steps (Chain-of-Thought)  

1. **Read the input text carefully.**  
2. **Identify candidate statements** in the text that might be pledges.  
3. For each candidate, verify the two criteria:  
   - Is it specific and concrete? (Does it mention a measurable action or outcome?)  
   - Does it express a clear commitment to act?  
4. Only if both criteria are met, consider a statement as a pledge.
5. **Extract the pledge verbatim** from the text by copying it exactly as it appears.

### Examples  

sentence: "We will build 100,000 affordable housing units by 2030."
output: {"pledges": ["build 100,000 affordable housing units by 2030"]}
explanation: the phrase "We will build 100,000 affordable housing units by 2030" is a concrete commitment with a measurable outcome.

sentence: "We believe in fair housing."
output: {"pledges": []}
explanation: This statement merely notes a policy _stance_ but does not contain any pledges.

sentence: "This year's election will be the most important in our history."
output: {"pledges": []}
explanation: This statement does not contain any pledges.

sentence: "We will ensure all public schools are inclusive and welcoming."
output: {"pledges": []}
explanation: While the phrase "ensure all public schools are inclusive and welcoming" might be viewed as a pledge, our narrow definition would not count it due to the lack of a specific metric.

sentence: "We will foster a culture of innovation in our economy."
output: {"pledges": []}
explanation: While "foster a culture of innovation in our economy" might be seen as a pledge in a broader sense, under our narrow criteria it is too abstract to count.

### Instruction  

Follow the reasoning steps above. \
Then, output your final decision by listing all verbatim pledges in a JSON array. If no pledges are found, return an empty array.  
"""

### Defining the output format

When we use generative LLMs to complete annotation tasks, it is a **best practice** to specify how the [outputs should be structured](https://humanloop.com/blog/structured-outputs).
After all, as in content analysis more generally, we don't want the model to respond in natural language but with data.

In **extractive tasks**, 

- we commonly want the model to extract the verbatim passages in the text that represent a certain concept, and
- each text can contain none, one, or more such passages.

Therefore, we can think of the intended output format as a _list_ of _strings_ (that must occur _literally_ in the input text).
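The "must occur literally" requirement can be checked with a plain substring test. The following is a minimal sketch; the function name `non_verbatim` and the example extractions are assumptions made for illustration.

```python
from typing import List


def non_verbatim(extractions: List[str], text: str) -> List[str]:
    """Return the extracted passages that do NOT occur literally in the text."""
    return [passage for passage in extractions if passage not in text]


text = "We will build 100,000 affordable housing units by 2030."
good = ["build 100,000 affordable housing units by 2030"]
bad = ["build 100000 affordable homes"]  # paraphrased, hence not verbatim

print(non_verbatim(good, text))  # []
print(non_verbatim(bad, text))   # ['build 100000 affordable homes']
```

An empty result means that every extracted passage can be located in the input text.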

Below, we use `pydantic` to define a custom class that defines the intended structure of an LLM's output.

In [2]:
from pydantic import BaseModel, fields
from typing import List, Tuple, Optional
from huggingface_hub.inference._generated.types.chat_completion import (
    ChatCompletionInputResponseFormatJSONSchema, 
    ChatCompletionInputJSONSchema
)

class PledgeExtractionOutput(BaseModel):
    # attribute capturing the list of verbatim pledges extracted from the text
    # NOTE: `required=True` is not a validation rule here; it is passed through to the JSON schema as an extra key
    pledges: List[str] = fields.Field(default_factory=list, description="A list of verbatim pledges extracted from the text. If no pledges are found, this list should be empty.", required=True)
    # an optional attribute capturing the character spans of the pledges in the original text
    spans: Optional[List[Tuple[int, int]]] = None

    def get_spans(self, text: str) -> List[Tuple[int, int]]:
        """
        Get the character spans of the extracted pledges in the original text.
        """
        spans = []
        for pledge in self.pledges:
            if not pledge:
                # skip empty strings: `str.find` would match them forever
                continue
            start = text.find(pledge)
            while start != -1:
                end = start + len(pledge)
                spans.append((start, end))
                start = text.find(pledge, end)
        return spans
    
    def _set_spans(self, text: str):
        """
        Set the character spans of the extracted pledges in the original text.
        """
        setattr(self, 'spans', self.get_spans(text))

Let's get the corresponding JSON schema of this output class:

In [3]:
import json
schema = PledgeExtractionOutput.model_json_schema()
print(json.dumps(schema, indent=2))

{
  "properties": {
    "pledges": {
      "description": "A list of verbatim pledges extracted from the text. If no pledges are found, this list should be empty.",
      "items": {
        "type": "string"
      },
      "required": true,
      "title": "Pledges",
      "type": "array"
    },
    "spans": {
      "anyOf": [
        {
          "items": {
            "maxItems": 2,
            "minItems": 2,
            "prefixItems": [
              {
                "type": "integer"
              },
              {
                "type": "integer"
              }
            ],
            "type": "array"
          },
          "type": "array"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "title": "Spans"
    }
  },
  "title": "PledgeExtractionOutput",
  "type": "object"
}


It specifies the "properties" of an instance of the output class.

- `pledges`: a JSON array of strings that capture the extracted passages
- `spans`: a JSON array of 2-tuples that represent the start and end character positions of a given phrase in the `pledges` element

Say that, given an input text, an LLM returns

```
'{"pledges": ["reduce unemployment to 5%"]}'
```

Then this JSON-formatted string can be parsed into a python object like this:

In [4]:
response = '{"pledges": ["reduce unemployment to 5%"]}'
output = PledgeExtractionOutput.model_validate_json(response)
output

PledgeExtractionOutput(pledges=['reduce unemployment to 5%'], spans=None)

If we then pass the text from which this pledge was extracted ...

In [5]:
text = "Our goal is to reduce unemployment to 5% by next year."
output._set_spans(text)

... we have all the info we need:

In [6]:
output

PledgeExtractionOutput(pledges=['reduce unemployment to 5%'], spans=[(15, 40)])

Combined with the original input text, we can always reconstruct the extracted pledges from the character spans:

In [7]:
# NOTE: The spans can be used to subset the original text to get the exact pledge phrases
print("text:", repr(text))
print("pledges:")
for span in output.spans:
    print(" -", repr(text[slice(*span)]))

text: 'Our goal is to reduce unemployment to 5% by next year.'
pledges:
 - 'reduce unemployment to 5%'
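The span search also covers the case in which the same phrase occurs more than once in a text. The standalone sketch below mirrors the search loop of `get_spans`; the function name `find_all_spans` and the example sentence are made up for illustration.

```python
from typing import List, Tuple


def find_all_spans(text: str, phrase: str) -> List[Tuple[int, int]]:
    """Find the (start, end) spans of all non-overlapping occurrences of a phrase."""
    spans: List[Tuple[int, int]] = []
    if not phrase:
        # an empty phrase would match at every position, so we return no spans
        return spans
    start = text.find(phrase)
    while start != -1:
        end = start + len(phrase)
        spans.append((start, end))
        # continue searching after the end of the current match
        start = text.find(phrase, end)
    return spans


# hypothetical sentence in which the same phrase occurs twice
example = "We will cut taxes. Yes, we will cut taxes."
print(find_all_spans(example, "cut taxes"))  # [(8, 17), (32, 41)]
```

Each span can be used to slice the text, so `example[8:17]` and `example[32:41]` both give `'cut taxes'`.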


**_Note_:** If you don't understand why we need a custom output class, read this [blog post](https://humanloop.com/blog/structured-outputs).

### Set up the model API

We will rely on LLMs served via the Hugging Face _Inference Providers_ API. 

In [8]:
from huggingface_hub import InferenceClient

client = InferenceClient("meta-llama/Meta-Llama-3-70B-Instruct")

We can use this model for **chat completion** by passing it a conversation history, that is, a list of turns between the user and assistant.
To instruct the LLM what it should do, we pass the **task instruction prompt** as a system message.
The user then inputs a to-be-annotated text.

In [9]:
text = r"We will reduce carbon emissions by 50% by 2030 and achieve net-zero emissions by 2050."

messages = [
    {
        "role": "system",
        "content": prompt
    }
    ,
    {
        "role": "user",
        "content": text.strip()
    }
]

The LLM completes this conversation with an assistant message we can parse:

In [10]:
completion = client.chat.completions.create(messages=messages)
print(completion.choices[0].message.content)

Let's go through the reasoning steps to identify pledges in the given sentence:

1. Read the input text carefully: "We will reduce carbon emissions by 50% by 2030 and achieve net-zero emissions by 2050."
2. Identify candidate statements: There are two statements in the sentence that might be pledges.
3. Verify the two criteria for each candidate:
   - "reduce carbon emissions by 50% by 2030":
     - Is it specific and concrete? Yes, it mentions a measurable reduction (50%) and a specific deadline (2030).
     - Does it express a clear commitment to act? Yes, it uses the phrase "We will".
   - "achieve net-zero emissions by 2050":
     - Is it specific and concrete? Yes, it mentions a measurable outcome (net-zero emissions) and a specific deadline (2050).
     - Does it express a clear commitment to act? Yes, it uses the phrase "We will".

Both candidate statements meet the criteria, so they are considered pledges.

4. Extract the pledges verbatim from the text: 
   - "reduce carbon emi

⚡️ Wait! The model generates an assistant response that has much more than the JSON-formatted output we need.

This is where our **custom output class** and the JSON schema it defines come into play.
We wrap the JSON schema (a Python dictionary) in the request types imported above and then pass the result to the `response_format` argument of the `client.chat.completions.create()` method:


In [11]:
json_schema = ChatCompletionInputJSONSchema(name="extracted pledge statements JSON schema", schema=schema)
response_format = ChatCompletionInputResponseFormatJSONSchema(type="json_schema", json_schema=json_schema)

completion = client.chat.completions.create(messages=messages, response_format=response_format)
print(completion.choices[0].message.content)

{"pledges": ["reduce carbon emissions by 50% by 2030", "achieve net-zero emissions by 2050"]}


This looks much better! 💫

As the generated **response adheres to our output format**, we can _parse_ it with our custom output class:

In [12]:
output = PledgeExtractionOutput.model_validate_json(completion.choices[0].message.content)
output._set_spans(text)
output

PledgeExtractionOutput(pledges=['reduce carbon emissions by 50% by 2030', 'achieve net-zero emissions by 2050'], spans=[(8, 46), (51, 85)])

## Bringing everything together in a handy custom "extractor" class

Instead of executing the steps above one by one, we can define a class (we call it `PledgeExtractor`) that sets up a given model to behave like a pledge extractor.

In [None]:
from tqdm.auto import tqdm

class PledgeExtractor:
    """
    A class to extract pledge statements from text using a language model.
    """
    system_prompt = prompt # NOTE: use prompt defined above
    response_format = response_format # NOTE: use response_format defined above
    
    def __init__(self, model: str, **kwargs):
        """
        Initialize the PledgeExtractor with a given model.

        Args:
            model: The name of the model to use for extraction.
            kwargs: Additional keyword arguments to pass to the InferenceClient.
        """
        self.model = model
        self.client = InferenceClient(model=self.model, **kwargs)
    
    def _extract(self, text: str) -> dict:
        """
        Extract pledge statements from a single text input.

        Args:
            text: The input text from which to extract pledges.

        Returns:
            A dictionary with the fields of a `PledgeExtractionOutput` object, that is, the extracted pledges and their spans.
        """
        
        completion = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": text.strip()}
            ],
            response_format=self.response_format,
            seed=42,
            extra_body={"do_sample": False}, # for deterministic output
            temperature=0.0, # just to be safe
        )
        output = PledgeExtractionOutput.model_validate_json(completion.choices[0].message.content)
        output._set_spans(text)
        return output.__dict__
    
    def __call__(self, texts: List[str]) -> List[dict]:
        """
        Extract pledge statements from one or more text inputs.
        
        Args:
            texts: An input text or list of input texts from which to extract pledges.

        Returns:
            If the input is a list of texts, a list of dictionaries (see `_extract`), one for each input text.
            If a single string is provided, a single dictionary is returned.
        """

        if isinstance(texts, str):
            return self._extract(texts)
        outputs = []
        for text in tqdm(texts, desc="Extracting pledges"):
            outputs.append(self._extract(text))
        return outputs



**_Note_:** defining the `__call__` method allows us to use the class instance similar to how we would use a function (see [here](https://www.geeksforgeeks.org/python/__call__-in-python/)).
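As a toy illustration of this pattern, the sketch below defines a callable class that, like `PledgeExtractor`, accepts either a single string or a list of strings. The class `Shouter` is invented for this example and has nothing to do with pledge extraction.

```python
from typing import List, Union


class Shouter:
    """Toy callable that upper-cases a single text or a list of texts."""

    def __call__(self, texts: Union[str, List[str]]) -> Union[str, List[str]]:
        # a single string is processed directly ...
        if isinstance(texts, str):
            return texts.upper()
        # ... while a list is processed element by element
        return [text.upper() for text in texts]


shout = Shouter()
print(shout("we will"))           # WE WILL
print(shout(["we will", "act"]))  # ['WE WILL', 'ACT']
```

The instance `shout` is used with the same syntax as a function, which is how `pledge_extractor(text)` and `pledge_extractor(texts)` work in this notebook.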

Now we can check the list of models available through Hugging Face Inference Providers: https://huggingface.co/inference/models

Let's use a state-of-the-art instruction-tuned model from the Qwen family: `Qwen3-235B-A22B-Instruct-2507`

**_Important_:** Given that we want to generate _structured_ outputs, we can only use models that have "Yes" in the last column ("Structured") of the table linked above.

In [14]:
# NOTE: check for available models and status here: https://huggingface.co/inference/models
model_id = "Qwen/Qwen3-235B-A22B-Instruct-2507"
pledge_extractor = PledgeExtractor(model=model_id, provider="together")

In [15]:
pledge_extractor(text)

{'pledges': ['reduce carbon emissions by 50% by 2030',
  'achieve net-zero emissions by 2050'],
 'spans': [(8, 46), (51, 85)]}

## Scale the annotation task to multiple texts

We now have a custom extractor class that allows us to annotate multiple texts.

**_Note_:** Below we process texts sequentially (i.e., send texts one at a time to the API). Batch processing, a better alternative, is beyond the scope of this tutorial.

Let's load some suitable sentence-level data from [Fornaciari et al. 2021](https://aclanthology.org/2021.findings-acl.301/).

Their data has been human-labeled into two categories: sentences with and sentences without pledge(s).
(Here, we use a balanced subsample of the data.)
Our extraction approach is more granular but having some labels is better than nothing.

In [17]:
from pathlib import Path
import pandas as pd

data_path = Path("../../data/labeled/fornaciari_we_2021")

fp = data_path / "annotation_set_01.csv"
df = pd.read_csv(fp)

# get the text column into a list of strings
texts = df['text'].tolist()

We can now pass the texts to our extractor initialized above:

In [18]:
outputs = pledge_extractor(texts)

Extracting pledges:   0%|          | 0/50 [00:00<?, ?it/s]

In this run, we annotated these 50 sentences in 37 seconds.
This is likely much quicker than a human annotator would take.

### Evaluation

We use the sentence-level labels for this sample to evaluate our LLM annotations.

In [19]:
# convert the list of dicts into a dataframe and merge with the original dataframe
outputs_df = pd.Series(outputs).apply(pd.Series)
# add the annotations as new columns to the original dataframe
df[outputs_df.columns] = outputs_df
# binarize the annotations into sentence-level labels
df['pred'] = df.pledges.apply(lambda x: len(x)>0)

With binary observed and predicted labels at hand, we can compute standard (binary) classification metrics:

In [20]:
from sklearn.metrics import classification_report
print(classification_report(df['label'], df['pred'], zero_division=0))

              precision    recall  f1-score   support

           0       0.77      0.96      0.86        25
           1       0.95      0.72      0.82        25

    accuracy                           0.84        50
   macro avg       0.86      0.84      0.84        50
weighted avg       0.86      0.84      0.84        50
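As a sanity check, the figures in this report can be reproduced by hand. The confusion-matrix counts used below are not printed by the notebook; they are inferred from the reported recall and support values (18 of 25 pledge sentences and 24 of 25 non-pledge sentences classified correctly), so treat them as an assumption.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for one class from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# counts inferred from the report above (assumption, see lead-in)
tp, fp, fn, tn = 18, 1, 7, 24

# class 1 (sentences with pledges)
pos = precision_recall_f1(tp, fp, fn)
# class 0 (sentences without pledges): the roles of the counts swap
neg = precision_recall_f1(tn, fn, fp)

accuracy = (tp + tn) / (tp + fp + fn + tn)
macro_f1 = (pos[2] + neg[2]) / 2

print("class 1:", [round(value, 2) for value in pos])  # [0.95, 0.72, 0.82]
print("class 0:", [round(value, 2) for value in neg])  # [0.77, 0.96, 0.86]
print("accuracy:", round(accuracy, 2), "macro F1:", round(macro_f1, 2))
```

The rounded values match the rows of the report, which shows that the lower recall for class 1 stems from 7 labeled pledge sentences for which no pledge was extracted.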



Even with this very concise prompt and only 5 [few-shot examples](https://learnprompting.org/vocabulary/few-shot_prompting), we achieve a solid macro-averaged F1 of 0.84! 🥳

But still, there are some "errors" according to the "true" labels recorded in the original data.
Let's look at a sample of these instances.

In [21]:
import textwrap
error_sample = df.query("label != pred").groupby('label')[['label', 'pred', 'text', 'pledges']].apply(lambda x: x.sample(min(len(x), 3), random_state=42), include_groups=True).reset_index(drop=True)
for i, row in error_sample.iterrows():
    print(f"Text: '{textwrap.fill(row['text'], width=70, subsequent_indent=' '*8)}'")
    print(f"Label: {bool(row['label'])}; Pred: {row['pred']}")
    if row['pred']:
        print(f"Pledges: {row['pledges']}")
    print()

Text: '11 . We will strengthen the legal and institutional framework to
        protect our children .'
Label: False; Pred: True
Pledges: ['strengthen the legal and institutional framework to protect our children']

Text: 'The Total Sanitation Campaign, launched by the NDA Government in 1999,
        has been a remarkable success .'
Label: True; Pred: False

Text: 'Immediately after forming the governments in Chhattisgarh, Madhya
        Pradesh and Rajasthan, as promised, the 3 Congress Governments
        waived the loans of farmers .'
Label: True; Pred: False

Text: 'In consonance with its policy, the BJP supports the creation of
        Telangana as a separate State of the Union of India .'
Label: True; Pred: False



The first example is a _false positive_: the extracted candidate pledge does _not_ satisfy the criterion that the promised outcome be measurable.
But it certainly is a conceptual boundary case.

Yet, among the three examples of _supposed_ false negatives, the first two are arguably incorrectly labeled in the "ground truth" data.
Both are past-oriented, which disqualifies them as pledges.

This suggests that the LLM's performance might be _underestimated_, and it raises the issue of how reliable the "ground truth" labels are.


### Export as annotations

We can save these annotations to disk (including metadata) for later use (e.g., token classifier fine-tuning).

In [22]:
def to_jsonlines(df, fp, cols=["text", "spans"], metadata_exclude=('metadata', 'pred', 'pledges')):
    # gather all columns not in cols in metadata dictionary
    metadata_cols = [col for col in df.columns if col not in cols if col not in metadata_exclude]
    metadata = df[metadata_cols].to_dict(orient="records")
    out = df[cols].copy()
    out['metadata'] = metadata
    out.rename(columns={"spans": "label"}, inplace=True)
    out.to_json(fp, orient="records", lines=True, index=False, force_ascii=False, default_handler=str)

annotations_dir = data_path / "annotations" / "extraction"
annotations_dir.mkdir(exist_ok=True, parents=True)

to_jsonlines(df, annotations_dir / f"{model_id.split('/')[-1]}.jsonl")
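To illustrate what a consumer of these files might do, the sketch below parses one JSON line and recovers the pledge phrases from the stored character spans. The record is a hand-written stand-in that assumes the layout produced by `to_jsonlines` (a `text` field, the spans under `label`, and a `metadata` dictionary); the metadata content is invented for illustration.

```python
import json

# hypothetical record mimicking one line of an exported .jsonl file
line = json.dumps({
    "text": "We will reduce carbon emissions by 50% by 2030 and achieve net-zero emissions by 2050.",
    "label": [[8, 46], [51, 85]],  # tuples are serialized as JSON arrays
    "metadata": {"source": "illustration only"},
})

# parse the line and slice the text with each (start, end) span
record = json.loads(line)
phrases = [record["text"][start:end] for start, end in record["label"]]
print(phrases)
```

Because the spans index into the original text, no information about the extracted phrases is lost by storing only `text` and `label`.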

## Run with multiple models

One of the coolest things about using LLMs through Hugging Face _Inference Providers_ is the large number of models that are readily available for use.
This invites experimentation and [ensemble approaches](https://arxiv.org/abs/2502.18036) to LLM annotation.

In [None]:
models = [
    ("meta-llama/Llama-4-Maverick-17B-128E-Instruct", "sambanova"),
    ("openai/gpt-oss-120b", "nebius"),
    ("deepseek-ai/DeepSeek-V3-0324", "sambanova")
]

In [29]:
# NOTE: inside the loop, we just repeat the logic above for each model
for model_id, provider in models:
    print(f"Processing model: {model_id} (provider: {provider})")
    pledge_extractor = PledgeExtractor(model=model_id, provider=provider)

    outputs = pledge_extractor(texts)
    outputs_df = pd.Series(outputs).apply(pd.Series)
    df[outputs_df.columns] = outputs_df
    df['pred'] = df.pledges.apply(lambda x: len(x)>0)
    
    print(classification_report(df['label'], df['pred'], zero_division=0))
    
    to_jsonlines(df, annotations_dir / f"{model_id.split('/')[-1]}.jsonl")

Processing model: openai/gpt-oss-120b (provider: nebius)


Extracting pledges:   0%|          | 0/50 [00:00<?, ?it/s]

              precision    recall  f1-score   support

           0       0.69      0.96      0.80        25
           1       0.93      0.56      0.70        25

    accuracy                           0.76        50
   macro avg       0.81      0.76      0.75        50
weighted avg       0.81      0.76      0.75        50

Processing model: deepseek-ai/DeepSeek-V3-0324 (provider: sambanova)


Extracting pledges:   0%|          | 0/50 [00:00<?, ?it/s]

              precision    recall  f1-score   support

           0       0.81      1.00      0.89        25
           1       1.00      0.76      0.86        25

    accuracy                           0.88        50
   macro avg       0.90      0.88      0.88        50
weighted avg       0.90      0.88      0.88        50
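One simple way to act on the ensemble idea mentioned above is a majority vote over the models' sentence-level predictions. The sketch below is an assumption about how this could be done, not something this notebook implements; the model names and predictions are made up.

```python
from typing import Dict, List


def majority_vote(predictions: Dict[str, List[bool]]) -> List[bool]:
    """Label a sentence True if more than half of the models predict True."""
    n_models = len(predictions)
    # regroup the per-model lists into one tuple of votes per sentence
    per_sentence = zip(*predictions.values())
    return [sum(votes) > n_models / 2 for votes in per_sentence]


# made-up predictions of three hypothetical models for four sentences
predictions = {
    "model_a": [True, False, True, False],
    "model_b": [True, False, False, False],
    "model_c": [False, True, True, False],
}
print(majority_vote(predictions))  # [True, False, True, False]
```

Note that the loop above overwrites `df['pred']` on every iteration, so the per-model predictions would first have to be kept (for example, in a dictionary like the one in this sketch) before they can be combined.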

