# Metadata extraction using DSPy and a local LLM, with evaluation metrics

To run this, first start a local vLLM server in the background with a command like this:

    vllm serve $MODEL_ID --port 7987 --max-model-len 32768 --gpu-memory-utilization 0.9

where `MODEL_ID` is e.g. `meta-llama/Llama-3.1-8B-Instruct` and the port must match the `PORT` setting below.
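
Before running the cells below, it can help to confirm that the server is reachable and to see which model name it reports. The following is a minimal sketch, assuming vLLM's OpenAI-compatible `/v1/models` endpoint on the port used above; the helper names (`models_url`, `served_model_ids`, `list_served_models`) are illustrative and not part of vLLM or DSPy.

```python
import json
import urllib.error
import urllib.request

PORT = 7987  # assumed to be the port passed to `vllm serve`


def models_url(port, host="localhost"):
    """Build the URL of the OpenAI-style model listing endpoint."""
    return f"http://{host}:{port}/v1/models"


def served_model_ids(payload):
    """Extract the model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(port=PORT, timeout=5):
    """Return the served model ids, or None if the server cannot be reached."""
    try:
        with urllib.request.urlopen(models_url(port), timeout=timeout) as resp:
            return served_model_ids(json.load(resp))
    except (urllib.error.URLError, OSError) as exc:
        print(f"Could not reach the server on port {port}: {exc}")
        return None
```

Calling `list_served_models()` once the server has finished loading should return a list that contains the model name to use as `MODEL_ID`.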

In [1]:
import dspy

MODEL_ID = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"  # must match the model name vLLM is serving; requests for an unknown model name are rejected
PORT = 7987  # should match the port where vLLM is running
MAX_TOKENS = 2048  # limit on how many new tokens to generate (default: 4000)
TEMPERATURE = 0.7

lm = dspy.LM("openai/" + MODEL_ID,
             api_base=f"http://localhost:{PORT}/v1",  # ensure this points to your port
             api_key="local", model_type="chat", max_tokens=MAX_TOKENS, temperature=TEMPERATURE)
dspy.configure(lm=lm)

# test the connection to the LLM
lm("Say this is a test!", temperature=0.0)  # returns a list with one completion, e.g. ['This is a test!']; the exact reply depends on the model

["Alright, let's proceed with the test. What would you like to test? Here are a few options:\n\n1. **Trivia**: I can ask you questions on a topic of your choice.\n2. **Language**: I can help you practice a foreign language.\n3. **Math**: I can provide math problems to solve.\n4. **General Knowledge**: I can ask you questions on a wide range of topics.\n\nPlease choose one, or let me know if there's something specific you'd like to test."]

In [2]:
# Load and prepare dataset

import json
import glob
import random

random.seed(42)  # for deterministic sampling of validation set

train_files = glob.glob("../../llm-dataset/*-train.jsonl")
test_files = glob.glob("../../llm-dataset/*-test.jsonl")

VAL_SIZE = 128  # how many documents to validate on during optimization

def preprocess_sample(sample):
    # replace hyphens with underscores so the field names match the signature fields (valid Python identifiers)
    ground_truth = { fld.replace('-', '_'): val for fld, val in sample["ground_truth"].items() }
    output = json.dumps(ground_truth)
    input_ = json.dumps(sample["content"])
    return dspy.Example({"content": input_, "metadata": output}).with_inputs("content")

def dataset_to_records(files):
    records = []
    for filename in files:
        with open(filename) as infile:
            for line in infile:
                sample = json.loads(line)
                records.append(preprocess_sample(sample))
    return records


train_val_set = dataset_to_records(train_files)
random.shuffle(train_val_set)

train_set = train_val_set[VAL_SIZE:]
val_set = train_val_set[:VAL_SIZE]

test_set = dataset_to_records(test_files)

len(train_set), len(val_set), len(test_set)

(512, 128, 182)

In [3]:
print("Input Message:")
print(train_set[-1]['content'])

print("\n\nGold Answer:")
for k, v in json.loads(train_set[-1]['metadata']).items():
    print(f"{k}: {v}")

Input Message:
{"pdfinfo": {"creationDate": "D:20201214215341+01'00'", "modDate": "D:20201214215418+01'00'"}, "pages": [{"page": 1, "text": "# ANTAA TAITEEN OPETTAA\n\n\n"}, {"page": 3, "text": "ANTA A TAITEEN OPETTA A GERT BIESTA\n\n\n"}, {"page": 4, "text": "00:00:08.18\n\n\n"}, {"page": 5, "text": "00:00:36.03 00:00:52.19 00:00:54.19\n\n\n"}, {"page": 6, "text": "00:00:58.16 00:01:00.17 00:01:0\n\n\n"}, {"page": 65, "text": "\u2018Opastan sinua kaikessa, n\u00e4yt\u00e4n sinulle kaiken ja nime\u00e4n kaiken.\u2019\n\u2014 COMENIUS\nT\u00e4ss\u00e4 kirjassa Gert Biesta esitt\u00e4\u00e4 uuden n\u00e4kemyksen nykyaikaisesta taidekasvatuksesta\n\nosoittamalla, ett\u00e4 taide tarjoaa ainutlaatuisia v\u00e4lineit\u00e4 olla dialogissa maailman kanssa. N\u00e4kemys\n\nperustuu ajatukseen, ett\u00e4 opettaminen on n\u00e4ytt\u00e4mist\u00e4. Opettaja n\u00e4ytt\u00e4\u00e4 oppilaalle millaisiin\n\nhyviin, t\u00e4rkeisiin tai merkitt\u00e4viin asioihin maailmassa voisi kiinnitt\u00e4\u00e4

In [4]:
from typing import Optional

class ExtractInfo(dspy.Signature):
    """Extract structured metadata from text extracted from a PDF."""

    content: str = dspy.InputField()
    language: str = dspy.OutputField(desc="The language of the resource expressed as a BCP47 language tag.")
    title: str = dspy.OutputField(desc="The main title of the publication.")
    alt_title: list[str] = dspy.OutputField(desc="Alternative or parallel titles of the publication, suffixed with a BCP47 language tag in curly brackets.")
    creator: list[str] = dspy.OutputField(desc="The primary author(s) of the resource.")
    year: Optional[str] = dspy.OutputField(desc="The year in which the resource was issued or made available.")
    publisher: list[str] = dspy.OutputField(desc="The entity/entities responsible for making the resource available.")
    doi: Optional[str] = dspy.OutputField(desc="The Digital Object Identifier (DOI) associated with the resource.")
    e_isbn: list[str] = dspy.OutputField(desc="The ISBN associated with the electronic resource.")
    p_isbn: list[str] = dspy.OutputField(desc="The ISBN of the printed version of this document.")
    e_issn: Optional[str] = dspy.OutputField(desc="The ISSN associated with the electronic resource.")
    p_issn: Optional[str] = dspy.OutputField(desc="The ISSN of the printed version of this document.")
    type_coar: str = dspy.OutputField(desc="The type of the resource according to the COAR Resource Types classification.")

module = dspy.ChainOfThought(ExtractInfo)

# note the trailing space: adjacent string literals are concatenated as-is
text = "Apple Inc. announced its latest iPhone 14 today. " \
    "The CEO, Tim Cook, highlighted its new features in a press release."
response = module(content=text)

print(response)


Prediction(
    reasoning='The text is a press release announcing the latest iPhone 14 by Apple Inc. It mentions the CEO, Tim Cook, and highlights new features. The text does not provide a specific title, alternative titles, year of publication, DOI, ISBNs, or ISSNs. The type of the resource is a press release.',
    language='en',
    title='Apple Inc. Announces Latest iPhone 14',
    alt_title=[],
    creator=['Apple Inc.'],
    year=None,
    publisher=['Apple Inc.'],
    doi=None,
    e_isbn=[],
    p_isbn=[],
    e_issn=None,
    p_issn=None,
    type_coar='text'
)
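
The `alt_title` description asks for titles suffixed with a BCP47 language tag in curly brackets. Below is a minimal sketch of how such a value could be split into title and tag downstream, assuming values of the form `Some title {en}`. The exact format used in the dataset is not shown in this notebook, `split_alt_title` is a hypothetical helper, and the pattern is only a simplified approximation of BCP47 syntax.

```python
import re

# Simplified tag pattern: a 2-3 letter primary subtag plus optional subtags.
# This is an approximation for illustration, not a full BCP47 validator.
_TAGGED_TITLE = re.compile(
    r"^(?P<title>.*?)\s*\{(?P<lang>[A-Za-z]{2,3}(?:-[A-Za-z0-9]+)*)\}\s*$"
)


def split_alt_title(value):
    """Split 'Title {tag}' into (title, tag); tag is None if no suffix is found."""
    match = _TAGGED_TITLE.match(value)
    if match is None:
        return value.strip(), None
    return match.group("title"), match.group("lang")


print(split_alt_title("Example Title {en}"))
print(split_alt_title("Title without a tag"))
```

A helper like this could also be used to check predicted `alt_title` values for a well-formed suffix before scoring them.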


In [14]:
import Levenshtein

ALMOST_THRESHOLD = 0.9  # Adjust as needed

def feedback_simple_string(field, true_val, pred_val):
    score = 1.0 if true_val == pred_val else 0.0
    if score == 1.0:
        feedback = f"✅ `{field}` is correct: `{true_val}`."
    else:
        feedback = f"❌ `{field}` is incorrect. You predicted `{pred_val}`, but the correct value is `{true_val}`."
    return score, feedback

def feedback_fuzzy_string(field, true_val, pred_val):
    base_score = 1.0 if true_val == pred_val else 0.0
    if base_score == 1.0 or (true_val and pred_val and Levenshtein.ratio(true_val.lower(), pred_val.lower()) >= ALMOST_THRESHOLD):
        score = 1.0
        feedback = f"✅ `{field}` is approximately correct: `{pred_val}` matches `{true_val}` closely."
    else:
        score = 0.0
        feedback = f"❌ `{field}` is incorrect. You predicted `{pred_val}`, but the correct value is `{true_val}`."
    return score, feedback

def feedback_set(field, true_val, pred_val):
    true_set = set(true_val or [])
    pred_set = set(pred_val or [])

    if not true_set and not pred_set:
        return 1.0, f"✅ `{field}` is empty as expected."
    elif not true_set or not pred_set:
        return 0.0, f"❌ `{field}` is incorrect. Expected `{true_set}`, but got `{pred_set}`."

    tp = len(true_set & pred_set)
    fp = len(pred_set - true_set)
    fn = len(true_set - pred_set)

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

    feedback = f"🔍 `{field}` partial match.\n"
    feedback += f"- Correctly included: `{list(true_set & pred_set)}`\n"
    if fp:
        feedback += f"- Incorrectly included: `{list(pred_set - true_set)}`\n"
    if fn:
        feedback += f"- Missed: `{list(true_set - pred_set)}`"

    return f1, feedback.strip()

def feedback_e_issn(field, true_val, pred_val, p_issn_val):
    if true_val == pred_val:
        return 1.0, f"✅ `{field}` is correct: `{true_val}`."
    elif p_issn_val and pred_val == p_issn_val and true_val is None:
        return 1.0, f"✅ `{field}` is correctly inferred from `p_issn`: `{pred_val}`."
    else:
        return 0.0, f"❌ `{field}` is incorrect. You predicted `{pred_val}`, but the correct value is `{true_val}`."

def metadata_metric_with_feedback(example, pred, trace=None, pred_name=None, pred_trace=None):
    fields = [
        'language', 'title', 'creator', 'year', 'publisher',
        'doi', 'e_isbn', 'p_isbn', 'e_issn', 'p_issn', 'type_coar'
    ]

    scores = []
    feedback_parts = []

    # gold values; each example carries only `content` and `metadata` (a JSON string)
    metadata = json.loads(example.get("metadata", "{}"))

    for field in fields:
        true_val = metadata.get(field)
        pred_val = pred.get(field) or None

        if field in ['language', 'year', 'doi', 'p_issn', 'type_coar']:
            score, feedback = feedback_simple_string(field, true_val, pred_val)
        elif field == 'title':
            score, feedback = feedback_fuzzy_string(field, true_val, pred_val)
        elif field in ['creator', 'publisher', 'e_isbn', 'p_isbn']:
            score, feedback = feedback_set(field, true_val, pred_val)
        elif field == 'e_issn':
            # read the gold p_issn from the parsed metadata; the examples have no "ground_truth" key
            p_issn_val = metadata.get("p_issn")
            score, feedback = feedback_e_issn(field, true_val, pred_val, p_issn_val)
        else:
            score, feedback = feedback_simple_string(field, true_val, pred_val)

        scores.append(score)
        feedback_parts.append(feedback)

    overall_score = sum(scores) / len(scores) if scores else 0
    full_feedback = "\n".join(feedback_parts)

    return dspy.Prediction(score=overall_score, feedback=full_feedback)
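
To make the scoring concrete, here is a small standalone sketch of how one set-valued field and the overall average behave. It mirrors the arithmetic of `feedback_set` and the final averaging over the 11 scored fields, without the feedback strings. The helper name `set_f1` and the creator names are made up for illustration.

```python
def set_f1(true_items, pred_items):
    """F1 between two collections treated as sets, following feedback_set's rules."""
    true_set = set(true_items or [])
    pred_set = set(pred_items or [])
    if not true_set and not pred_set:
        return 1.0  # both empty counts as a match
    if not true_set or not pred_set:
        return 0.0  # exactly one side is empty
    tp = len(true_set & pred_set)
    precision = tp / len(pred_set)
    recall = tp / len(true_set)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# One of two creators is right and one is wrong: precision = recall = 0.5.
creator_score = set_f1(
    ["Biesta, Gert", "Doe, Jane"],      # hypothetical gold creators
    ["Biesta, Gert", "Smith, John"],    # hypothetical prediction
)

# Assume the other ten fields are all correct, then average over 11 fields.
field_scores = [1.0] * 10 + [creator_score]
overall = sum(field_scores) / len(field_scores)

print(creator_score, round(overall, 4))  # 0.5 0.9545
```

Because all fields are weighted equally, a single wrong exact-match field costs 1/11 of the example's score, while a set-valued field can lose only part of that.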


In [15]:
from dspy import GEPA

optimizer = GEPA(
    metric=metadata_metric_with_feedback,
    auto="light", # <-- We will use a light budget for this tutorial. However, we typically recommend using auto="heavy" for optimized performance!
    num_threads=32,
    track_stats=True,
    use_merge=False,
    reflection_lm=lm
)

In [17]:
optimized_program = optimizer.compile(
    module,
    trainset=train_set,
    valset=val_set,
)

2025/09/26 13:20:59 INFO dspy.teleprompt.gepa.gepa: Running GEPA for approx 892 metric calls of the program. This amounts to 1.39 full evals on the train+val set.
2025/09/26 13:20:59 INFO dspy.teleprompt.gepa.gepa: Using 128 examples for tracking Pareto scores. You can consider using a smaller sample of the valset to allow GEPA to explore more diverse solutions within the same budget.
GEPA Optimization:   0%|          | 0/892 [00:00<?, ?rollouts/s]2025/09/26 13:23:38 INFO dspy.evaluate.evaluate: Average Metric: 77.88636363636364 / 128 (60.8%)
2025/09/26 13:23:38 INFO dspy.teleprompt.gepa.gepa: Iteration 0: Base program full valset score: 0.6084872159090909
GEPA Optimization:  14%|█▍        | 128/892 [02:39<15:49,  1.24s/rollouts]2025/09/26 13:23:38 INFO dspy.teleprompt.gepa.gepa: Iteration 1: Selected program 0 score: 0.6084872159090909


Average Metric: 1.64 / 3 (54.5%): 100%|██████████| 3/3 [00:20<00:00,  6.79s/it]

2025/09/26 13:23:58 INFO dspy.evaluate.evaluate: Average Metric: 1.6363636363636362 / 3 (54.5%)





2025/09/26 13:24:39 INFO dspy.teleprompt.gepa.gepa: Iteration 1: Proposed new text for predict: markdown
## Task: Extract Structured Metadata from PDF Text

### Instructions

1. **Input Format**: The input will be a JSON object containing the following keys:
   - `pdfinfo`: A dictionary with metadata about the PDF, including `title`, `author`, `creationDate`, and `modDate`.
   - `pages`: A list of dictionaries, each containing:
     - `page`: The page number.
     - `text`: The extracted text from that page.

2. **Output Format**: Extract and return the following metadata in a structured format:

   - **language**: The primary language of the text. Use the appropriate ISO 639-1 code (e.g., `en` for English, `fi` for Finnish).
   - **title**: The main title of the document. Ensure it matches the text exactly as presented in the document, including any special characters or formatting.
   - **alt_title**: Alternative titles or subtitles, if present. Include the language code in curly bra

Average Metric: 2.12 / 3 (70.5%): 100%|██████████| 3/3 [00:22<00:00,  7.66s/it]

2025/09/26 13:28:14 INFO dspy.evaluate.evaluate: Average Metric: 2.1151515151515152 / 3 (70.5%)





2025/09/26 13:28:49 INFO dspy.teleprompt.gepa.gepa: Iteration 2: Proposed new text for predict: markdown
# Detailed Instructions for Extracting Structured Metadata from PDFs

## Task Description
The task is to extract structured metadata from text extracted from a PDF document. The metadata should include standard bibliographic fields such as language, title, alternative titles, creators, year, publisher, DOI, ISBNs, ISSNs, and resource type.

## Input Format
The input is provided in a JSON format with the following structure:

```json
{
  "pdfinfo": {
    "author": "string",
    "creationDate": "string",
    "modDate": "string",
    "title": "string",
    "subject": "string",
    "keywords": "string"
  },
  "pages": [
    {
      "page": integer,
      "text": "string"
    }
    ...
  ]
}
```

- `author`: The author of the PDF.
- `creationDate`: The date the PDF was created.
- `modDate`: The date the PDF was last modified.
- `title`: The title of the PDF.
- `subject`: The subject of t

Average Metric: 2.23 / 3 (74.2%): 100%|██████████| 3/3 [00:14<00:00,  4.89s/it]

2025/09/26 13:33:13 INFO dspy.evaluate.evaluate: Average Metric: 2.227272727272727 / 3 (74.2%)





2025/09/26 13:34:20 INFO dspy.teleprompt.gepa.gepa: Iteration 3: Proposed new text for predict: markdown
## Task: Extract Structured Metadata from PDF Text

### Instructions

1. **Input Format**: The input will be a JSON object containing the following keys:
   - `pdfinfo`: A dictionary with metadata about the PDF, including `title`, `author`, `creationDate`, and `modDate`.
   - `pages`: A list of dictionaries, each containing:
     - `page`: The page number.
     - `text`: The extracted text from that page.

2. **Output Format**: Extract and return the following metadata in a structured format:
   - **language**: The primary language of the text. Use the appropriate ISO 639-1 code (e.g., `en` for English, `fi` for Finnish, `sv` for Swedish).
   - **title**: The main title of the document. Ensure it matches the text exactly as presented in the document, including any special characters or formatting.
   - **alt_title**: Alternative titles or subtitles, if present. Include the language 

Average Metric: 2.56 / 3 (85.4%): 100%|██████████| 3/3 [00:23<00:00,  7.74s/it]

2025/09/26 13:37:41 INFO dspy.evaluate.evaluate: Average Metric: 2.5606060606060606 / 3 (85.4%)





2025/09/26 13:38:48 INFO dspy.teleprompt.gepa.gepa: Iteration 4: Proposed new text for predict: markdown
## Task: Extract Structured Metadata from PDF Text

### Instructions

1. **Input Format**: The input will be a JSON object containing the following keys:
   - `pdfinfo`: A dictionary with metadata about the PDF, including `title`, `author`, `subject`, `keywords`, `creationDate`, and `modDate`.
   - `pages`: A list of dictionaries, each containing:
     - `page`: The page number.
     - `text`: The extracted text from that page.

2. **Output Format**: Extract and return the following metadata in a structured format:
   - **language**: The primary language of the text. Use the appropriate ISO 639-1 code (e.g., `en` for English, `fi` for Finnish, `sv` for Swedish).
   - **title**: The main title of the document. Ensure it matches the text exactly as presented in the document, including any special characters or formatting.
   - **alt_title**: Alternative titles or subtitles, if present

Average Metric: 2.42 / 3 (80.8%): 100%|██████████| 3/3 [00:19<00:00,  6.56s/it]

2025/09/26 13:42:10 INFO dspy.evaluate.evaluate: Average Metric: 2.424242424242424 / 3 (80.8%)





2025/09/26 13:43:18 INFO dspy.teleprompt.gepa.gepa: Iteration 5: Proposed new text for predict: markdown
## Task: Extract Structured Metadata from PDF Text

### Instructions

1. **Input Format**: The input will be a JSON object containing the following keys:
   - `pdfinfo`: A dictionary with metadata about the PDF, including `title`, `author`, `subject`, `keywords`, `creationDate`, and `modDate`.
   - `pages`: A list of dictionaries, each containing:
     - `page`: The page number.
     - `text`: The extracted text from that page.

2. **Output Format**: Extract and return the following metadata in a structured format:
   - **language**: The primary language of the text. Use the appropriate ISO 639-1 code (e.g., `en` for English, `fi` for Finnish, `sv` for Swedish).
   - **title**: The main title of the document. Ensure it matches the text exactly as presented in the document, including any special characters or formatting.
   - **alt_title**: Alternative titles or subtitles, if present

Average Metric: 2.45 / 3 (81.8%): 100%|██████████| 3/3 [00:18<00:00,  6.04s/it]

2025/09/26 13:47:22 INFO dspy.evaluate.evaluate: Average Metric: 2.4545454545454546 / 3 (81.8%)





2025/09/26 13:48:30 INFO dspy.teleprompt.gepa.gepa: Iteration 6: Proposed new text for predict: markdown
## Task: Extract Structured Metadata from PDF Text

### Instructions

1. **Input Format**: The input will be a JSON object containing the following keys:
   - `pdfinfo`: A dictionary with metadata about the PDF, including `title`, `author`, `creationDate`, and `modDate`.
   - `pages`: A list of dictionaries, each containing:
     - `page`: The page number.
     - `text`: The extracted text from that page.

2. **Output Format**: Extract and return the following metadata in a structured format:

   - **language**: The primary language of the text. Use the appropriate ISO 639-1 code (e.g., `en` for English, `fi` for Finnish).
   - **title**: The main title of the document. Ensure it matches the text exactly as presented in the document, including any special characters or formatting.
   - **alt_title**: Alternative titles or subtitles, if present. Include the language code in curly bra

Average Metric: 2.30 / 3 (76.6%): 100%|██████████| 3/3 [00:22<00:00,  7.34s/it]

2025/09/26 13:49:15 INFO dspy.evaluate.evaluate: Average Metric: 2.2987012987012987 / 3 (76.6%)





2025/09/26 13:49:56 INFO dspy.teleprompt.gepa.gepa: Iteration 7: Proposed new text for predict: markdown
# Detailed Instructions for Extracting Structured Metadata from PDFs

## Task Description
The task is to extract structured metadata from text extracted from a PDF document. The metadata should include standard bibliographic fields such as language, title, alternative titles, creators, year, publisher, DOI, ISBNs, ISSNs, and resource type. The task requires attention to domain-specific terminology and formatting to ensure accurate extraction of metadata.

## Input Format
The input is provided in a JSON format with the following structure:

```json
{
  "pdfinfo": {
    "author": "string",
    "creationDate": "string",
    "modDate": "string",
    "title": "string",
    "subject": "string",
    "keywords": "string"
  },
  "pages": [
    {
      "page": integer,
      "text": "string"
    }
    ...
  ]
}
```

- `author`: The author of the PDF.
- `creationDate`: The date the PDF was cre

Average Metric: 2.55 / 3 (84.8%): 100%|██████████| 3/3 [00:18<00:00,  6.09s/it]

2025/09/26 13:50:35 INFO dspy.evaluate.evaluate: Average Metric: 2.5454545454545454 / 3 (84.8%)





2025/09/26 13:51:43 INFO dspy.teleprompt.gepa.gepa: Iteration 8: Proposed new text for predict: markdown
## Task: Extract Structured Metadata from PDF Text

### Instructions

1. **Input Format**: The input will be a JSON object containing the following keys:
   - `pdfinfo`: A dictionary with metadata about the PDF, including `title`, `author`, `creationDate`, and `modDate`.
   - `pages`: A list of dictionaries, each containing:
     - `page`: The page number.
     - `text`: The extracted text from that page.

2. **Output Format**: Extract and return the following metadata in a structured format:
   - **language**: The primary language of the text. Use the appropriate ISO 639-1 code (e.g., `en` for English, `fi` for Finnish, `sv` for Swedish).
   - **title**: The main title of the document. Ensure it matches the text exactly as presented in the document, including any special characters or formatting.
   - **alt_title**: Alternative titles or subtitles, if present. Include the language 

In [18]:
for name, pred in optimized_program.named_predictors():
    print("================================")
    print(f"Predictor: {name}")
    print("================================")
    print("Prompt:")
    print(pred.signature.instructions)
    print("*********************************")

Predictor: predict
Prompt:
markdown
## Task: Extract Structured Metadata from PDF Text

### Instructions

1. **Input Format**: The input will be a JSON object containing the following keys:
   - `pdfinfo`: A dictionary with metadata about the PDF, including `title`, `author`, `subject`, `keywords`, `creationDate`, and `modDate`.
   - `pages`: A list of dictionaries, each containing:
     - `page`: The page number.
     - `text`: The extracted text from that page.

2. **Output Format**: Extract and return the following metadata in a structured format:
   - **language**: The primary language of the text. Use the appropriate ISO 639-1 code (e.g., `en` for English, `fi` for Finnish, `sv` for Swedish).
   - **title**: The main title of the document. Ensure it matches the text exactly as presented in the document, including any special characters or formatting.
   - **alt_title**: Alternative titles or subtitles, if present. Include the language code in curly braces (e.g., `{en}` for English

In [20]:
%%time

evaluate = dspy.Evaluate(
    devset=test_set,
    metric=metadata_metric_with_feedback,
    num_threads=32,
    display_table=True,
    display_progress=True,
    provide_traceback=True
)

eval_result = evaluate(optimized_program)

Average Metric: 142.22 / 182 (78.1%): 100%|██████████| 182/182 [00:00<00:00, 197.75it/s]

2025/09/26 14:04:44 INFO dspy.evaluate.evaluate: Average Metric: 142.22460012123014 / 182 (78.1%)



CPU times: user 1.15 s, sys: 231 ms, total: 1.39 s
Wall time: 1.24 s


In [None]:
lm.inspect_history()