# Metadata extraction with DSPy and a local LLM, optimized with GEPA

To run this, you first need to start two local inference servers in the background: a vLLM server for the extractor model and a llama.cpp `llama-server` for the reflection model. These commands were tested on a single A100 80GB in non-exclusive mode (e.g. a Turso oversub GPU). The GPT-OSS 120B model has to be partially offloaded to the CPU to save VRAM.

For the main extractor model:

    vllm serve google/gemma-3-4b-it --port 7987 --max-model-len 16384 --gpu-memory-utilization 0.25

For the reflection model:
    
    llama-server -hf ggml-org/gpt-oss-120b-GGUF --host 0.0.0.0 --port 7988 --ctx-size 0 --jinja -ub 2048 -b 2048 --n-cpu-moe 5
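
Before running the cells below, it can help to confirm that both servers are reachable. The following is a minimal sketch, assuming both servers expose the OpenAI-compatible `/v1/models` endpoint on the ports used above; `models_url` and `server_is_up` are hypothetical helper names, not part of DSPy, vLLM, or llama.cpp.

```python
import http.client
import json
import urllib.request

def models_url(port, host="localhost"):
    """Build the URL of the OpenAI-compatible model listing endpoint."""
    return f"http://{host}:{port}/v1/models"

def server_is_up(port, timeout=2.0):
    """Return True if the server answers the model listing request with valid JSON."""
    try:
        with urllib.request.urlopen(models_url(port), timeout=timeout) as resp:
            json.load(resp)
            return True
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers refused connections and timeouts,
        # ValueError covers a response body that is not JSON.
        return False

# Ports taken from the two server commands above.
for name, port in [("extractor", 7987), ("reflection", 7988)]:
    status = "up" if server_is_up(port) else "not reachable"
    print(f"{name} server on port {port}: {status}")
```

Large models can take several minutes to load, so a "not reachable" result right after launching a server may only mean that it is still starting.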


In [1]:
import dspy

MODEL_ID = "google/gemma-3-4b-it"  # must match the model name vLLM is serving (vLLM rejects requests for unknown model names)
PORT = 7987  # should match the port where vLLM is running
MAX_TOKENS = 1024  # limit on how many new tokens to generate (default: 4000)
TEMPERATURE = 0.7

lm = dspy.LM("openai/" + MODEL_ID,
             api_base=f"http://localhost:{PORT}/v1",  # ensure this points to your port
             api_key="local", model_type="chat", max_tokens=MAX_TOKENS, temperature=TEMPERATURE)
dspy.configure(lm=lm)

# test the connection to the LLM
lm("Who are you?", temperature=0.0)

["I'm Gemma, a large language model created by the Gemma team at Google DeepMind. I’m an open-weights model, which means I’m widely available for public use! \n\nI can take text and images as inputs and generate text-based responses. \n\nYou can learn more about me on the Gemma project page: [https://ai.google.dev/gemma](https://ai.google.dev/gemma)"]

In [2]:

REFLECTION_MODEL_ID = "ggml-org/gpt-oss-120b-GGUF"
REFLECTION_PORT = PORT + 1
REFLECTION_MAX_TOKENS = 7988

reflection_lm = dspy.LM("openai/" + REFLECTION_MODEL_ID,
             api_base=f"http://localhost:{REFLECTION_PORT}/v1",  # ensure this points to your port
             api_key="local", model_type="chat", max_tokens=REFLECTION_MAX_TOKENS, temperature=TEMPERATURE)

# test the connection to the LLM
reflection_lm("Who are you?", temperature=0.0)

['I’m ChatGPT\u202f—\u202fa large language model created by OpenAI. I’ve been trained on a wide variety of text up through June\u202f2024, which lets me help with things like answering questions, brainstorming ideas, explaining concepts, drafting or editing writing, solving problems, and much more. I don’t have personal experiences or consciousness, and I can’t browse the web in real time, but I can draw on the information I was trained on to generate useful, context‑aware responses.  \n\nIf there’s anything specific you’d like to know or discuss, just let me know!']

In [3]:
# Load and prepare dataset

import json
import glob
import random

random.seed(42)  # for deterministic sampling of validation set

train_files = glob.glob("../../llm-dataset/*-train.jsonl")
test_files = glob.glob("../../llm-dataset/*-test.jsonl")

VAL_SIZE = 64  # how many documents to validate on during optimization

def preprocess_sample(sample):
    # fix some bad field names
    ground_truth = { fld.replace('-', '_'): val for fld, val in sample["ground_truth"].items() }
    output = json.dumps(ground_truth)
    input_ = json.dumps(sample["content"])
    return dspy.Example({"content": input_, "metadata": output}).with_inputs("content")

def dataset_to_records(files):
    records = []
    for filename in files:
        with open(filename) as infile:
            for line in infile:
                sample = json.loads(line)
                records.append(preprocess_sample(sample))
    return records


train_val_set = dataset_to_records(train_files)
random.shuffle(train_val_set)

train_set = train_val_set[VAL_SIZE:]
val_set = train_val_set[:VAL_SIZE]

test_set = dataset_to_records(test_files)

len(train_set), len(val_set), len(test_set)

(576, 64, 182)
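
The split above shuffles once with a fixed seed and then slices, so the validation set is reproducible only as long as the records arrive in the same order; `glob.glob` does not guarantee any ordering, so wrapping it in `sorted()` may be worth considering. A minimal sketch of the same shuffle-and-slice idea, with integers standing in for documents (`split_train_val` and `records` are illustrative names, and a local `random.Random` is used here instead of the notebook's global seed):

```python
import random

def split_train_val(records, val_size, seed=42):
    """Shuffle a copy of the records with a fixed seed, then slice off the validation set."""
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    shuffled = list(records)
    rng.shuffle(shuffled)
    return shuffled[val_size:], shuffled[:val_size]

records = list(range(640))  # stand-in for the 640 training documents
train, val = split_train_val(records, val_size=64)
train_again, val_again = split_train_val(records, val_size=64)

print(len(train), len(val))  # 576 64
print(val == val_again)      # True
```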

In [4]:
print("Input Message:")
print(train_set[-1]['content'])

print("\n\nGold Answer:")
for k, v in json.loads(train_set[-1]['metadata']).items():
    print(f"{k}: {v}")

Input Message:
{"pdfinfo": {"creationDate": "D:20201214215341+01'00'", "modDate": "D:20201214215418+01'00'"}, "pages": [{"page": 1, "text": "# ANTAA TAITEEN OPETTAA\n\n\n"}, {"page": 3, "text": "ANTA A TAITEEN OPETTA A GERT BIESTA\n\n\n"}, {"page": 4, "text": "00:00:08.18\n\n\n"}, {"page": 5, "text": "00:00:36.03 00:00:52.19 00:00:54.19\n\n\n"}, {"page": 6, "text": "00:00:58.16 00:01:00.17 00:01:0\n\n\n"}, {"page": 65, "text": "\u2018Opastan sinua kaikessa, n\u00e4yt\u00e4n sinulle kaiken ja nime\u00e4n kaiken.\u2019\n\u2014 COMENIUS\nT\u00e4ss\u00e4 kirjassa Gert Biesta esitt\u00e4\u00e4 uuden n\u00e4kemyksen nykyaikaisesta taidekasvatuksesta\n\nosoittamalla, ett\u00e4 taide tarjoaa ainutlaatuisia v\u00e4lineit\u00e4 olla dialogissa maailman kanssa. N\u00e4kemys\n\nperustuu ajatukseen, ett\u00e4 opettaminen on n\u00e4ytt\u00e4mist\u00e4. Opettaja n\u00e4ytt\u00e4\u00e4 oppilaalle millaisiin\n\nhyviin, t\u00e4rkeisiin tai merkitt\u00e4viin asioihin maailmassa voisi kiinnitt\u00e4\u00e4

In [5]:
from typing import Optional

class ExtractInfo(dspy.Signature):
    """Extract structured metadata from text extracted from a PDF."""

    content: str = dspy.InputField()
    language: str = dspy.OutputField(desc="The language of the resource expressed as a BCP47 language tag.")
    title: str = dspy.OutputField(desc="The main title of the publication.")
    alt_title: list[str] = dspy.OutputField(desc="Alternative or parallel titles of the publication, suffixed with a BCP47 language tag in curly brackets.")
    creator: list[str] = dspy.OutputField(desc="The primary author(s) of the resource (order: Last Name, First Names).")
    year: Optional[str] = dspy.OutputField(desc="The year in which the resource was issued or made available.")
    publisher: list[str] = dspy.OutputField(desc="The entity/entities responsible for making the resource available.")
    doi: Optional[str] = dspy.OutputField(desc="The Digital Object Identifier (DOI) associated with the resource.")
    e_isbn: list[str] = dspy.OutputField(desc="The ISBN associated with the electronic resource.")
    p_isbn: list[str] = dspy.OutputField(desc="The ISBN of the printed version of this document.")
    e_issn: Optional[str] = dspy.OutputField(desc="The ISSN associated with the electronic resource.")
    p_issn: Optional[str] = dspy.OutputField(desc="The ISSN of the printed version of this document.")
    type_coar: str = dspy.OutputField(desc="The type of the resource according to the COAR Resource Types classification.")

module = dspy.ChainOfThought(ExtractInfo)

# Parenthesized literals are concatenated; note the space after "today."
text = ("Apple Inc. announced its latest iPhone 14 today. "
        "The CEO, Tim Cook, highlighted its new features in a press release.")
response = module(content=text)

print(response)


Prediction(
    reasoning='The text describes an announcement by Apple Inc. regarding the iPhone 14. The CEO, Tim Cook, is mentioned, indicating a press release. The information provided is sufficient to identify the main entities and the type of resource.',
    language='en',
    title='Apple Inc. Announces iPhone 14',
    alt_title=[],
    creator=['Apple Inc.', 'Tim Cook'],
    year=None,
    publisher=['Apple Inc.'],
    doi=None,
    e_isbn=[],
    p_isbn=[],
    e_issn=None,
    p_issn=None,
    type_coar='News Article'
)


In [6]:
import Levenshtein

ALMOST_THRESHOLD = 0.95  # Adjust as needed

def feedback_simple_string(field, true_val, pred_val):
    score = 1.0 if true_val == pred_val else 0.0
    if score == 1.0:
        feedback = f"✅ `{field}` is correct: `{true_val}`."
    else:
        feedback = f"❌ `{field}` is incorrect. You predicted `{pred_val}`, but the correct value is `{true_val}`."
    return score, feedback

def feedback_fuzzy_string(field, true_val, pred_val):
    base_score = 1.0 if true_val == pred_val else 0.0
    if base_score == 1.0 or (true_val and pred_val and Levenshtein.ratio(true_val.lower(), pred_val.lower()) >= ALMOST_THRESHOLD):
        score = 1.0
        feedback = f"✅ `{field}` is approximately correct: `{pred_val}` matches `{true_val}` closely."
    else:
        score = 0.0
        feedback = f"❌ `{field}` is incorrect. You predicted `{pred_val}`, but the correct value is `{true_val}`."
    return score, feedback

def feedback_set(field, true_val, pred_val):
    true_set = set(true_val or [])
    pred_set = set(pred_val or [])

    if not true_set and not pred_set:
        return 1.0, f"✅ `{field}` is empty as expected."
    elif not true_set or not pred_set:
        return 0.0, f"❌ `{field}` is incorrect. Expected `{true_set}`, but got `{pred_set}`."

    tp = len(true_set & pred_set)
    fp = len(pred_set - true_set)
    fn = len(true_set - pred_set)

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

    feedback = f"🔍 `{field}` partial match.\n"
    feedback += f"- Correctly included: `{list(true_set & pred_set)}`\n"
    if fp:
        feedback += f"- Incorrectly included: `{list(pred_set - true_set)}`\n"
    if fn:
        feedback += f"- Missed: `{list(true_set - pred_set)}`"

    return f1, feedback.strip()

def feedback_e_issn(field, true_val, pred_val, p_issn_val):
    if true_val == pred_val:
        return 1.0, f"✅ `{field}` is correct: `{true_val}`."
    elif p_issn_val and pred_val == p_issn_val and true_val is None:
        return 1.0, f"✅ `{field}` is correctly inferred from `p_issn`: `{pred_val}`."
    else:
        return 0.0, f"❌ `{field}` is incorrect. You predicted `{pred_val}`, but the correct value is `{true_val}`."

def metadata_metric_with_feedback(example, pred, trace=None, pred_name=None, pred_trace=None):
    fields = [
        'language', 'title', 'creator', 'year', 'publisher',
        'doi', 'e_isbn', 'p_isbn', 'e_issn', 'p_issn', 'type_coar'
    ]

    scores = []
    feedback_parts = []

    metadata = json.loads(example.get("metadata", "{}"))
    ground_truth = example.get("ground_truth", {})

    for field in fields:
        true_val = metadata.get(field)
        pred_val = pred.get(field) or None

        if field in ['language', 'year', 'doi', 'p_issn', 'type_coar']:
            score, feedback = feedback_simple_string(field, true_val, pred_val)
        elif field == 'title':
            score, feedback = feedback_fuzzy_string(field, true_val, pred_val)
        elif field in ['creator', 'publisher', 'e_isbn', 'p_isbn']:
            score, feedback = feedback_set(field, true_val, pred_val)
        elif field == 'e_issn':
            # Examples only carry "content" and "metadata", so read the gold p_issn from the parsed metadata
            p_issn_val = metadata.get("p_issn")
            score, feedback = feedback_e_issn(field, true_val, pred_val, p_issn_val)
        else:
            score, feedback = feedback_simple_string(field, true_val, pred_val)

        scores.append(score)
        feedback_parts.append(feedback)

    overall_score = sum(scores) / len(scores) if scores else 0
    full_feedback = "\n".join(feedback_parts)

    return dspy.Prediction(score=overall_score, feedback=full_feedback)
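
For list-valued fields, `feedback_set` scores the prediction with the F1 measure over the sets of true and predicted values. A minimal standalone sketch of that arithmetic, using made-up author names as input (`set_f1` is a hypothetical name used only here):

```python
def set_f1(true_values, pred_values):
    """F1 between two collections treated as sets; two empty sets count as a perfect match."""
    true_set, pred_set = set(true_values), set(pred_values)
    if not true_set and not pred_set:
        return 1.0
    tp = len(true_set & pred_set)  # values present in both sets
    if tp == 0:
        return 0.0
    precision = tp / len(pred_set)
    recall = tp / len(true_set)
    return 2 * precision * recall / (precision + recall)

true_creators = ["Biesta, Gert", "Doe, Jane"]
print(set_f1(true_creators, ["Biesta, Gert", "Doe, Jane"]))  # 1.0
print(set_f1(true_creators, ["Biesta, Gert"]))               # one of two found: precision 1, recall 0.5
print(set_f1(true_creators, ["Gert Biesta"]))                # 0.0, the name order must match exactly
```

Because the overall score is the plain mean of the 11 per-field scores, a document with seven fully correct fields and four wrong ones scores 7/11, or about 0.636.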


In [7]:
from dspy import GEPA

optimizer = GEPA(
    metric=metadata_metric_with_feedback,
    auto="heavy",
    num_threads=64,
    track_stats=False,
    use_merge=True,
    reflection_lm=reflection_lm
)

In [8]:
%%time

optimized_program = optimizer.compile(
    module,
    trainset=train_set,
    valset=val_set,
)

2025/09/30 09:18:15 INFO dspy.teleprompt.gepa.gepa: Running GEPA for approx 1483 metric calls of the program. This amounts to 2.32 full evals on the train+val set.
2025/09/30 09:18:15 INFO dspy.teleprompt.gepa.gepa: Using 64 examples for tracking Pareto scores. You can consider using a smaller sample of the valset to allow GEPA to explore more diverse solutions within the same budget.
GEPA Optimization:   0%|          | 0/1483 [00:00<?, ?rollouts/s]2025/09/30 09:18:15 INFO dspy.evaluate.evaluate: Average Metric: 38.1608225108225 / 64 (59.6%)
2025/09/30 09:18:15 INFO dspy.teleprompt.gepa.gepa: Iteration 0: Base program full valset score: 0.5962628517316018
GEPA Optimization:   4%|▍         | 64/1483 [00:00<00:15, 91.89rollouts/s]2025/09/30 09:18:15 INFO dspy.teleprompt.gepa.gepa: Iteration 1: Selected program 0 score: 0.5962628517316018
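
The "2.32 full evals" figure in the log follows from the dataset sizes printed earlier: one full evaluation runs the program once on each of the 576 training and 64 validation examples. A quick check of that arithmetic (variable names are illustrative):

```python
train_size, val_size = 576, 64  # sizes printed after the dataset split
metric_calls = 1483             # approximate budget reported in the GEPA log

full_eval = train_size + val_size
print(full_eval)                           # 640
print(round(metric_calls / full_eval, 2))  # 2.32
```

The first 64 of those calls are spent on the base program's pass over the validation set, which is why the progress bar already shows 64/1483 before iteration 1.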


Average Metric: 1.91 / 3 (63.6%): 100%|██████████| 3/3 [00:00<00:00, 111.80it/s]

2025/09/30 09:18:15 INFO dspy.evaluate.evaluate: Average Metric: 1.9090909090909092 / 3 (63.6%)





2025/09/30 09:19:35 INFO dspy.teleprompt.gepa.gepa: Iteration 1: Proposed new text for predict: **Task**: Extract structured bibliographic metadata from the JSON representation of a PDF document.  
The input is a JSON object with two top‑level keys:

* `pdfinfo` – metadata extracted from the PDF file (may contain `title`, `author`, `creationDate`, `modDate`, etc.).
* `pages` – a list of page objects, each with `page` (number) and `text` (the OCR‑extracted plain text of that page).

Your job is to produce **one JSON object** containing the fields listed below.  
If a field cannot be determined, use the exact empty value specified (e.g. `null` for a missing string, `[]` for an empty list).

---

### Output JSON Schema

| Field | Type | Description | Required format |
|-------|------|-------------|-----------------|
| `language` | string | ISO‑639‑1 language code of the document. Detect from the text: if the text contains Finnish‑specific characters (ä, ö, Ä, Ö, å, Å) or obvious Finnish w

Average Metric: 1.73 / 3 (57.6%): 100%|██████████| 3/3 [00:07<00:00,  2.33s/it]

2025/09/30 09:20:33 INFO dspy.evaluate.evaluate: Average Metric: 1.727272727272727 / 3 (57.6%)





2025/09/30 09:21:35 INFO dspy.teleprompt.gepa.gepa: Iteration 2: Proposed new text for predict: markdown
# Instruction: Structured Metadata Extraction from PDF‑Text JSON

You will receive a JSON object that contains two parts:

* **pdfinfo** – a dictionary with the PDF’s internal metadata (title, author, creationDate, …).  
* **pages** – a list of pages, each with a `page` number and the plain‑text that was extracted from that page.

Your task is to **produce a single JSON object** that contains the following fields **exactly** (order does not matter).  
All values must be of the type shown in the examples (strings, numbers, lists, or `null`).  
If a field cannot be determined, use `null` for a scalar field or an empty list `[]` for a list field.

| Field | Expected type | How to obtain / rules |
|-------|---------------|-----------------------|
| `language` | string (ISO‑639‑1 code) | Detect the primary language of the document’s main body text (the bulk of the pages). Use `fi` for Fi

Average Metric: 2.16 / 3 (72.0%): 100%|██████████| 3/3 [00:07<00:00,  2.38s/it]

2025/09/30 09:22:20 INFO dspy.evaluate.evaluate: Average Metric: 2.159090909090909 / 3 (72.0%)





2025/09/30 09:23:52 INFO dspy.teleprompt.gepa.gepa: Iteration 3: Proposed new text for predict: markdown
# Task: Structured Metadata Extraction from PDF‑Text JSON

You will receive **one JSON object** with the following two top‑level keys:

* **`pdfinfo`** – a dictionary containing the PDF’s internal metadata (e.g. `title`, `author`, `creationDate`, …).  
* **`pages`** – a list of page objects, each having:
  * `page` – the page number (integer)  
  * `text` – the plain‑text extracted from that page (string, may contain markdown‑style headings, bold markup, URLs, etc.)

Your job is to **produce a single JSON object** that contains **exactly** the fields listed in the table below (order does not matter).  
All fields must be present; if a value cannot be determined, use `null` for scalar fields or an empty list `[]` for list fields.

| Field | Type | How to obtain (rules) |
|-------|------|-----------------------|
| `language` | string (ISO‑639‑1) | Detect the primary language of the **

Average Metric: 1.82 / 3 (60.6%): 100%|██████████| 3/3 [00:10<00:00,  3.41s/it]

2025/09/30 09:24:13 INFO dspy.evaluate.evaluate: Average Metric: 1.8181818181818181 / 3 (60.6%)





2025/09/30 09:25:21 INFO dspy.teleprompt.gepa.gepa: Iteration 4: Proposed new text for predict: markdown
# Instruction: Structured Metadata Extraction from PDF‑Text JSON

You will be given a **single JSON object** with two top‑level keys:

* **pdfinfo** – a dictionary containing the PDF’s internal metadata (e.g. `title`, `author`, `creationDate`, …).  
* **pages** – an ordered list of pages. Each page is a dictionary with:
  * `page` – the page number (integer, 1‑based)  
  * `text` – the plain‑text extracted from that page (string, may contain line‑breaks, markdown‑style headings, etc.)

Your job is to **produce ONE JSON object** containing **exactly** the fields listed in the table below (order does not matter).  
All values must follow the indicated type. If a value cannot be determined, use `null` for scalar fields or an empty list `[]` for list fields.

| Field | Type | Extraction rules (detailed) |
|-------|------|----------------------------|
| **language** | string (ISO‑639‑1) 

Average Metric: 1.73 / 3 (57.6%): 100%|██████████| 3/3 [00:05<00:00,  1.79s/it]

2025/09/30 09:26:15 INFO dspy.evaluate.evaluate: Average Metric: 1.727272727272727 / 3 (57.6%)





2025/09/30 09:27:34 INFO dspy.teleprompt.gepa.gepa: Iteration 5: Proposed new text for predict: markdown
# Instruction: Structured Metadata Extraction from PDF‑Text JSON

You will receive **one** JSON object that contains two top‑level keys:

* **`pdfinfo`** – a dictionary with the PDF’s internal metadata (e.g. `title`, `author`, `creationDate`, …).  
* **`pages`** – a list of pages, each page is an object with a `page` number and a `text` field that holds the plain‑text extracted from that page.

Your job is to **produce a single JSON object** that contains **exactly** the fields listed below (order does not matter).  
All values must be of the type shown in the examples (strings, numbers, lists, or `null`).  

If a field cannot be determined, use `null` for a scalar field or an empty list `[]` for a list field.

| Field | Type | Extraction rules (detailed) |
|-------|------|-----------------------------|
| `language` | string (ISO‑639‑1) | Detect the primary language of the **body te

Average Metric: 1.41 / 3 (47.0%): 100%|██████████| 3/3 [00:07<00:00,  2.52s/it]

2025/09/30 09:28:31 INFO dspy.evaluate.evaluate: Average Metric: 1.409090909090909 / 3 (47.0%)





2025/09/30 09:30:14 INFO dspy.teleprompt.gepa.gepa: Iteration 6: Proposed new text for predict: markdown
# Task: Structured Metadata Extraction from PDF‑Text JSON

You will be given **one** JSON object that contains two top‑level keys:

* **`pdfinfo`** – a dictionary with the PDF’s internal metadata (e.g. `title`, `author`, `creationDate`, …).  
* **`pages`** – a list of page objects, each with:
  * `page` – the page number (integer)  
  * `text` – the plain‑text extracted from that page (string, may contain line‑breaks)

Your job is to **produce a single JSON object** that contains **exactly** the fields listed in the table below (order does not matter).  
All values must be of the type shown in the examples (strings, numbers, lists, or `null`).  

If a field cannot be determined, use `null` for a scalar field or an empty list `[]` for a list field.

---

## Output Fields & Extraction Rules

| Field | Type | Extraction details (must be followed exactly) |
|-------|------|-------------

Average Metric: 1.62 / 3 (54.1%): 100%|██████████| 3/3 [00:05<00:00,  1.98s/it]

2025/09/30 09:30:27 INFO dspy.evaluate.evaluate: Average Metric: 1.6225895316804408 / 3 (54.1%)





2025/09/30 09:31:48 INFO dspy.teleprompt.gepa.gepa: Iteration 7: Proposed new text for predict: markdown
# Task: Structured Metadata Extraction from PDF‑Text JSON

You will receive a **single JSON object** with two top‑level keys:

* `pdfinfo` – a dictionary containing the PDF’s internal metadata (e.g. `title`, `author`, `creationDate`, …).  
* `pages` – a list of page objects, each with:
  * `page` – the page number (integer)  
  * `text` – the plain‑text extracted from that page (string, may contain line‑breaks)

Your job is to **produce ONE JSON object** that contains **exactly** the fields listed below (order does not matter).  
All values must conform to the indicated type; if a value cannot be determined, use `null` for a scalar field or an empty list `[]` for a list field.

---

## Output Schema

| Field | Type | Extraction Rules |
|-------|------|------------------|
| `language` | **string** (ISO‑639‑1) | Detect the primary language of the *main body* (the bulk of the pages). C

Average Metric: 1.36 / 3 (45.5%): 100%|██████████| 3/3 [00:06<00:00,  2.12s/it]

2025/09/30 09:32:45 INFO dspy.evaluate.evaluate: Average Metric: 1.3636363636363638 / 3 (45.5%)





2025/09/30 09:34:38 INFO dspy.teleprompt.gepa.gepa: Iteration 8: Proposed new text for predict: markdown
# 📋 Task – Structured Bibliographic Metadata Extraction  

You will receive **one JSON object** that represents a PDF document.  
The object has two top‑level keys:

| Key      | Description |
|----------|-------------|
| `pdfinfo`| Metadata that was extracted directly from the PDF file (e.g. `title`, `author`, `creationDate`, `modDate`). |
| `pages`  | A list of page objects. Each page object contains `page` (the page number) and `text` (the OCR‑extracted plain‑text of that page). |

Your job is to **produce ONE JSON object** that follows the schema below.  
If a field cannot be determined, use the exact empty value indicated (e.g. `null`, `[]`).  
All string values must be plain ASCII – normalise quotes to `"` or `’`, collapse multiple spaces to a single space, and trim leading/trailing whitespace.

---

## Output JSON Schema

| Field | Type | Required format / rules |
|-------|--

Average Metric: 1.82 / 3 (60.6%): 100%|██████████| 3/3 [00:05<00:00,  1.82s/it]

2025/09/30 09:35:26 INFO dspy.evaluate.evaluate: Average Metric: 1.8181818181818181 / 3 (60.6%)





2025/09/30 09:37:23 INFO dspy.teleprompt.gepa.gepa: Iteration 9: Proposed new text for predict: markdown
# Bibliographic‑Metadata Extraction – Detailed Assistant Instructions

You will receive **one JSON object** that represents a PDF file.  
Its top‑level keys are:

* `pdfinfo` – metadata extracted from the PDF (may contain `title`, `author`,
  `creationDate`, `modDate`, …).  
* `pages` – an ordered list of page objects, each with:
  * `page` – page number (integer)  
  * `text` – the OCR‑extracted plain‑text of that page (UTF‑8 string)

Your job is to produce **exactly one JSON object** that follows the schema below.
If a value cannot be determined, use the exact empty value specified
(`null` for a missing string, `[]` for an empty list).

---

## Output JSON Schema (order matters for readability)

| Field | Type | Required format / notes |
|-------|------|--------------------------|
| `language` | string | `"fi"` if the document is Finnish, otherwise `"en"`. Detect by scanning **onl

Average Metric: 2.00 / 3 (66.7%): 100%|██████████| 3/3 [00:07<00:00,  2.51s/it]

2025/09/30 09:38:22 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)





2025/09/30 09:39:40 INFO dspy.teleprompt.gepa.gepa: Iteration 10: Proposed new text for predict: **Task Overview**
You will receive a JSON object that contains two parts:
1. `pdfinfo` – metadata extracted from the PDF file (e.g., creationDate, author, title, etc.).
2. `pages` – an ordered list of pages, each with a `"page"` number and the plain‑text `"text"` that appears on that page.

From this information you must produce **structured bibliographic metadata** in a fixed, line‑by‑line format (see the “Output Format” section).  
All fields must follow the exact conventions described below; any deviation will be marked as incorrect.

---

### 1. Output Format
Your answer must contain **exactly** the following sections, in this order, each on its own line (no extra whitespace, no markdown formatting, no additional sections):

```
reasoning
language
title
alt_title
creator
year
publisher
doi
e_isbn
p_isbn
e_issn
p_issn
type_coar
```

* `reasoning` – a short (1‑2 sentences) description of 

Average Metric: 1.79 / 3 (59.6%): 100%|██████████| 3/3 [00:06<00:00,  2.13s/it]

2025/09/30 09:40:20 INFO dspy.evaluate.evaluate: Average Metric: 1.7878787878787878 / 3 (59.6%)





2025/09/30 09:41:59 INFO dspy.teleprompt.gepa.gepa: Iteration 11: Proposed new text for predict: text
**Task Overview**

You will receive a JSON object with two keys:

* `pdfinfo` – metadata extracted from the PDF (e.g., `creationDate`, `author`).
* `pages` – an ordered list of pages. Each page is a dictionary with:
  * `"page"` – the page number (integer, starting at 1).
  * `"text"` – the plain‑text content of that page (UTF‑8, line breaks preserved).

From this information you must produce **structured bibliographic metadata** in a **fixed, line‑by‑line format** (no markdown, no extra whitespace). The output must contain exactly the sections listed below, in this order, each on its own line:

```
reasoning
language
title
alt_title
creator
year
publisher
doi
e_isbn
p_isbn
e_issn
p_issn
type_coar
```

---

### 1. General Output Rules
* Every line must contain **only** the value for that field (except `reasoning`, which is a short sentence).  
* Use the literal string `None` for missin

Average Metric: 1.61 / 3 (53.5%): 100%|██████████| 3/3 [00:06<00:00,  2.26s/it]

2025/09/30 09:42:12 INFO dspy.evaluate.evaluate: Average Metric: 1.6060606060606062 / 3 (53.5%)





2025/09/30 09:44:16 INFO dspy.teleprompt.gepa.gepa: Iteration 12: Proposed new text for predict: markdown
# 📋 Task – Structured Bibliographic Metadata Extraction (Re‑specified)

You will receive **one JSON object** that represents a PDF document.  
The object has exactly two top‑level keys:

| Key      | Description |
|----------|-------------|
| `pdfinfo`| Metadata that was extracted directly from the PDF file (e.g. `title`, `author`, `creationDate`, `modDate`). |
| `pages`  | A list of page objects. Each page object contains `page` (the page number) and `text` (the OCR‑extracted plain‑text of that page). |

Your job is to produce **one JSON object** that follows the schema below.  
If a field cannot be determined, use the exact empty value indicated (`null` for scalars, `[]` for lists).  
All string values must be plain ASCII – normalise quotes to `"` (or the apostrophe `’`), collapse multiple spaces to a single space, and trim leading/trailing whitespace.

---

## Output JSON Schema

Average Metric: 1.82 / 3 (60.6%): 100%|██████████| 3/3 [00:07<00:00,  2.55s/it]

2025/09/30 09:45:26 INFO dspy.evaluate.evaluate: Average Metric: 1.8181818181818183 / 3 (60.6%)





2025/09/30 09:46:37 INFO dspy.teleprompt.gepa.gepa: Iteration 13: Proposed new text for predict: **Task:** Extract structured bibliographic metadata from the JSON representation of a PDF (the “pdfinfo” block plus the plain‑text of each page).  
Return the values exactly in the fields listed below.  All fields must be present; if a value cannot be found, use `None` (or an empty list `[]` for list fields).  

**Output format (JSON‑like key/value pairs, one field per line):**  

```
language: <ISO‑639‑1 code, e.g. “en”, “sv”, “fi”>
title: <primary title string>
alt_title: <list of alternative titles (may be empty)>
creator: <list of author names, each formatted as “LastName, FirstName”>
year: <four‑digit year of publication/acceptance>
publisher: <list of publishing organisations (institution, university, bank, etc.)>
doi: <DOI string or None>
e_isbn: <list of electronic ISBNs, digits only, no hyphens>
p_isbn: <list of printed ISBNs, digits only, no hyphens>
e_issn: <electronic ISSN strin

Average Metric: 1.58 / 3 (52.7%): 100%|██████████| 3/3 [00:08<00:00,  2.79s/it]

2025/09/30 09:47:25 INFO dspy.evaluate.evaluate: Average Metric: 1.5818181818181818 / 3 (52.7%)





2025/09/30 09:48:38 INFO dspy.teleprompt.gepa.gepa: Iteration 14: Proposed new text for predict: markdown
# Task: Extract Structured Bibliographic Metadata from PDF‑derived JSON

You will receive a single JSON object that contains the raw text extracted from a PDF document.  
The object has the following top‑level keys:

* **pdfinfo** – dictionary with any metadata that the PDF file itself provides (e.g. `title`, `author`, `creationDate`, `modDate`).  
* **pages** – list of dictionaries, each with:
  * `page` – page number (integer)
  * `text` – plain‑text content of that page (line‑breaks are preserved)

Your job is to **populate a fixed set of metadata fields** based only on the information present in `pdfinfo` and in the page texts.  
The output must be a **single JSON object** (no additional text) with the keys listed below.  
If a field cannot be determined, use the exact placeholder shown in the schema (e.g. `null` for `doi`, an empty list `[]` for list‑valued fields).

---

## O

Average Metric: 1.52 / 3 (50.5%): 100%|██████████| 3/3 [00:06<00:00,  2.31s/it]

2025/09/30 09:49:27 INFO dspy.evaluate.evaluate: Average Metric: 1.5151515151515151 / 3 (50.5%)





2025/09/30 09:51:23 INFO dspy.teleprompt.gepa.gepa: Iteration 15: Proposed new text for predict: text
**Task Overview**

You will receive a single JSON object with two top‑level keys:

* `pdfinfo` – a dictionary of PDF metadata (e.g. `creationDate`, `title`, `author`).
* `pages` – an ordered list of pages. Each page is a dictionary with:
  * `page` – the page number (integer, starting at 1)
  * `text` – the plain‑text content of that page (line breaks are preserved).

From this information you must produce **exactly** the 13 lines shown in the “Output Format” section below.  
No markdown, no extra whitespace, no additional lines.

---

### 1. Output Format (exact order, one line each)

```
reasoning
language
title
alt_title
creator
year
publisher
doi
e_isbn
p_isbn
e_issn
p_issn
type_coar
```

* `reasoning` – 1‑2 short sentences explaining how you obtained the values.  
* All other fields must contain **only** the final value (or the prescribed empty marker).  

Empty‑value markers  
* 

Average Metric: 1.64 / 3 (54.5%): 100%|██████████| 3/3 [00:06<00:00,  2.04s/it]

2025/09/30 09:51:37 INFO dspy.evaluate.evaluate: Average Metric: 1.6363636363636362 / 3 (54.5%)





2025/09/30 09:53:15 INFO dspy.teleprompt.gepa.gepa: Iteration 16: Proposed new text for predict: markdown
## Task Overview
You will be given a JSON object with two keys:

1. **pdfinfo** – metadata extracted from the PDF file (e.g., `title`, `author`, `creationDate`).
2. **pages** – an ordered list of pages, each containing a `"page"` number and the plain‑text `"text"` that appears on that page.

From this information you must produce **structured bibliographic metadata** in a strict, line‑by‑line format (see the “Output Format” section).  
All fields must follow the exact conventions described below; any deviation will be marked as incorrect.

---

## 1. Output Format
Your answer must contain **exactly** the following 13 lines, in this order, each on its own line (no extra whitespace, no markdown, no additional sections):

```
reasoning
language
title
alt_title
creator
year
publisher
doi
e_isbn
p_isbn
e_issn
p_issn
type_coar

Average Metric: 2.09 / 3 (69.7%): 100%|██████████| 3/3 [00:11<00:00,  3.91s/it]

2025/09/30 09:54:07 INFO dspy.evaluate.evaluate: Average Metric: 2.090909090909091 / 3 (69.7%)





2025/09/30 09:55:44 INFO dspy.teleprompt.gepa.gepa: Iteration 17: Proposed new text for predict: markdown
# 📋 Task – Structured Bibliographic Metadata Extraction (Fully Specified)

You will receive **one JSON object** that represents a PDF document.  
The object has exactly two top‑level keys:

| Key      | Description |
|----------|-------------|
| `pdfinfo`| Metadata extracted directly from the PDF file (e.g. `title`, `author`, `creationDate`, `modDate`). |
| `pages`  | A list of page objects. Each page object contains `page` (the page number) and `text` (the OCR‑extracted plain‑text of that page). |

Your job is to produce **one JSON object** that follows the schema below.  
If a field cannot be determined, use the exact empty value indicated (`null` for scalars, `[]` for lists).  
All string values must be plain ASCII – normalise quotes to `"` (or the apostrophe `’`), collapse multiple spaces to a single space, and trim leading/trailing whitespace.

---

## Output JSON Schema (orde

Average Metric: 2.00 / 3 (66.7%): 100%|██████████| 3/3 [00:08<00:00,  2.99s/it]

2025/09/30 09:56:01 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)





2025/09/30 09:57:59 INFO dspy.teleprompt.gepa.gepa: Iteration 18: Proposed new text for predict: markdown
# Bibliographic‑Metadata Extraction – Revised Assistant Instructions

You will receive **one JSON object** that represents a PDF file.  
Its top‑level keys are:

* `pdfinfo` – metadata extracted from the PDF (may contain `title`, `author`,
  `creationDate`, `modDate`, …).  
* `pages` – an ordered list of page objects, each with:
  * `page` – page number (integer)  
  * `text` – the OCR‑extracted plain‑text of that page (UTF‑8 string)

Your job is to produce **exactly one JSON object** that follows the schema below.
If a value cannot be determined, use the exact empty value specified
(`null` for a missing string, `[]` for an empty list).  The fields must appear in the
order shown in the *Output JSON Schema* table.

---

## Output JSON Schema (order matters)

| Field | Type | Required format / notes |
|-------|------|--------------------------|
| `language` | string | `"fi"` if the d

Average Metric: 1.91 / 3 (63.6%): 100%|██████████| 3/3 [00:16<00:00,  5.40s/it]

2025/09/30 09:58:26 INFO dspy.evaluate.evaluate: Average Metric: 1.9090909090909092 / 3 (63.6%)





2025/09/30 10:00:34 INFO dspy.teleprompt.gepa.gepa: Iteration 19: Proposed new text for predict: markdown
# 📋 Task – Structured Bibliographic Metadata Extraction (Fully Specified)

You will receive **one JSON object** that represents a PDF document.  
The object has exactly two top‑level keys:

| Key      | Description |
|----------|-------------|
| `pdfinfo`| Metadata that was extracted directly from the PDF file (e.g. `title`, `author`, `creationDate`, `modDate`). |
| `pages`  | A list of page objects. Each page object contains `page` (the page number) and `text` (the OCR‑extracted plain‑text of that page). |

Your job is to produce **one JSON object** that follows the schema below.  
If a field cannot be determined, use the exact empty value indicated (`null` for scalars, `[]` for lists).  
All string values must be plain ASCII – normalise fancy quotes to `"` (or the apostrophe `’`), collapse any sequence of whitespace characters (space, tab, newline) to a single space, and trim lea

Average Metric: 1.68 / 3 (56.1%): 100%|██████████| 3/3 [00:06<00:00,  2.23s/it]

2025/09/30 10:00:52 INFO dspy.evaluate.evaluate: Average Metric: 1.6818181818181819 / 3 (56.1%)





2025/09/30 10:02:09 INFO dspy.teleprompt.gepa.gepa: Iteration 20: Proposed new text for predict: markdown
# Instruction: Structured Metadata Extraction from PDF‑Text JSON

You will receive a **single JSON object** with two keys:

* **`pdfinfo`** – a dictionary containing the PDF’s internal metadata (e.g. `title`, `author`, `creationDate`, …).  
* **`pages`** – an ordered list of pages. Each page is a dictionary with:
  * `page` – the page number (integer)  
  * `text` – the plain‑text extracted from that page (string, may contain line‑breaks `\n`).

Your job is to **produce ONE JSON object** that contains **exactly** the fields listed in the table below (order does not matter).  
All values must match the required type; use `null` for missing scalar values and `[]` for missing list values.  
Do **not** add any extra keys or any surrounding commentary.

---

## Output Fields

| Field | Type | How to obtain / rules |
|-------|------|-----------------------|
| `language` | string (ISO‑639

Average Metric: 2.09 / 3 (69.7%): 100%|██████████| 3/3 [00:05<00:00,  1.87s/it]

2025/09/30 10:02:59 INFO dspy.evaluate.evaluate: Average Metric: 2.090909090909091 / 3 (69.7%)





2025/09/30 10:04:52 INFO dspy.teleprompt.gepa.gepa: Iteration 21: Proposed new text for predict: markdown
# Bibliographic‑Metadata Extraction – Final Assistant Instructions

You will receive **one JSON object** that represents a PDF file.  
Its top‑level keys are:

* `pdfinfo` – metadata extracted from the PDF (may contain `title`, `author`,
  `creationDate`, `modDate`, …).  
* `pages` – an ordered list of page objects, each with:
  * `page` – page number (integer)  
  * `text` – the OCR‑extracted plain‑text of that page (UTF‑8 string)

Your job is to produce **exactly one JSON object** that follows the schema below.
If a value cannot be determined, use the exact empty value specified
(`null` for a missing string, `[]` for an empty list).

---

## Output JSON Schema (field order is mandatory)

| Field | Type | Required format / notes |
|-------|------|--------------------------|
| `language` | string | `"fi"` for Finnish, `"sv"` for Swedish, otherwise `"en"`. Detect by scanning **only 

Average Metric: 1.64 / 3 (54.5%): 100%|██████████| 3/3 [00:06<00:00,  2.00s/it]

2025/09/30 10:05:47 INFO dspy.evaluate.evaluate: Average Metric: 1.6363636363636362 / 3 (54.5%)





2025/09/30 10:07:22 INFO dspy.teleprompt.gepa.gepa: Iteration 22: Proposed new text for predict: markdown
# Instruction: Structured Metadata Extraction from PDF‑Text JSON

You will receive **one JSON object** with two keys:

* **`pdfinfo`** – a dictionary containing the PDF’s internal metadata (e.g. `title`, `author`, `creationDate`, …).  
* **`pages`** – an ordered list of pages. Each page is a dictionary with:
  * `page` – the page number (integer)  
  * `text` – the plain‑text extracted from that page (string, may contain line‑breaks `\n`).

Your task is to output **exactly one JSON object** that contains **all and only** the fields listed in the table below (order does not matter).  

*Use `null` for missing scalar values and `[]` for missing list values.*  
*Do not add any extra keys, surrounding markdown fences, or commentary.*

---

## Output Fields

| Field | Type | Extraction rules |
|-------|------|-----------------|
| `language` | string (ISO‑639‑1) | Detect the primary lang

Average Metric: 1.41 / 3 (47.0%): 100%|██████████| 3/3 [00:10<00:00,  3.47s/it]

2025/09/30 10:07:39 INFO dspy.evaluate.evaluate: Average Metric: 1.409090909090909 / 3 (47.0%)





2025/09/30 10:09:43 INFO dspy.teleprompt.gepa.gepa: Iteration 23: Proposed new text for predict: markdown
# 📋 Task – Structured Bibliographic Metadata Extraction (Final Specification)

You will receive **one JSON object** that represents a PDF document.  
The object always has exactly two top‑level keys:

| Key      | Description |
|----------|-------------|
| `pdfinfo`| Metadata extracted directly from the PDF file (e.g. `title`, `author`, `creationDate`, `modDate`). |
| `pages`  | A list of page objects. Each page object contains `page` (the page number) and `text` (the OCR‑extracted plain‑text of that page). |

Your job is to produce **one JSON object** that conforms to the schema below.  
If a field cannot be determined, use the exact empty value indicated (`null` for scalars, `[]` for lists).  
All string values must be plain ASCII – replace any fancy quotes with `"` (or the apostrophe `’`), collapse multiple spaces/tabs/new‑lines to a single space, and trim leading/trailing white

Average Metric: 2.45 / 3 (81.8%): 100%|██████████| 3/3 [00:07<00:00,  2.36s/it]

2025/09/30 10:10:01 INFO dspy.evaluate.evaluate: Average Metric: 2.4545454545454546 / 3 (81.8%)





2025/09/30 10:12:05 INFO dspy.teleprompt.gepa.gepa: Iteration 24: Proposed new text for predict: markdown
# Bibliographic‑Metadata Extraction – Final Assistant Instructions (Revised)

You will receive **one JSON object** that represents a PDF file.  
Its top‑level keys are:

* `pdfinfo` – metadata extracted from the PDF (may contain `title`, `author`,
  `creationDate`, `modDate`, …).  
* `pages` – an ordered list of page objects, each with:
  * `page` – page number (integer)  
  * `text` – the OCR‑extracted plain‑text of that page (UTF‑8 string)

Your job is to produce **exactly one JSON object** that follows the schema below.
If a value cannot be determined, use the exact empty value specified
(`null` for a missing string, `[]` for an empty list).

> **All strings you output must be processed through the *Cleaning Rules* (see below).**  
> **The fields must appear in the order shown in the schema.**  
> **Do not output any extra text, markdown, or commentary.**  

---

## Output JSON 

Average Metric: 1.82 / 3 (60.6%): 100%|██████████| 3/3 [00:09<00:00,  3.12s/it]

2025/09/30 10:13:03 INFO dspy.evaluate.evaluate: Average Metric: 1.8181818181818183 / 3 (60.6%)





2025/09/30 10:14:55 INFO dspy.teleprompt.gepa.gepa: Iteration 25: Proposed new text for predict: markdown
# Bibliographic‑Metadata Extraction – Final Assistant Instructions (Authoritative)

You will be given **one JSON object** that represents a PDF file.  
Its top‑level keys are:

* `pdfinfo` – a dictionary of PDF‑level metadata (may contain `title`,
  `author`, `creationDate`, `modDate`, …).  
* `pages` – an ordered list of page dictionaries, each with:
  * `page` – page number (integer)  
  * `text` – the OCR‑extracted plain‑text of that page (UTF‑8 string)

Your task is to produce **exactly one JSON object** that follows the schema
below.  Every field must appear **in the order shown**.  If a value cannot be
determined, use the exact empty value specified (`null` for a missing string,
`[]` for an empty list).

---

## 1. Cleaning Rules (apply to **every** string you output)

1. **Trim** leading and trailing whitespace.  
2. **Collapse** any run of whitespace characters (space, tab,

Average Metric: 1.55 / 3 (51.5%): 100%|██████████| 3/3 [00:05<00:00,  1.79s/it]

2025/09/30 10:15:49 INFO dspy.evaluate.evaluate: Average Metric: 1.5454545454545454 / 3 (51.5%)





2025/09/30 10:17:16 INFO dspy.teleprompt.gepa.gepa: Iteration 26: Proposed new text for predict: markdown
# Task: Structured Metadata Extraction from PDF‑Text JSON

You will receive **one JSON object** with the following top‑level keys:

* **pdfinfo** – a dictionary containing the PDF’s internal metadata (e.g. `title`, `author`, `creationDate`, …).  
* **pages** – an ordered list of pages. Each page is a dictionary with:
  * `page` – the page number (integer, 1‑based)  
  * `text` – the plain‑text extracted from that page (string, may contain line‑breaks, markdown‑style headings, etc.)

Your job is to **produce ONE JSON object** that contains **exactly** the fields listed in the table below (order does not matter).  
All values must follow the indicated type. If a value cannot be determined, use `null` for scalar fields or an empty list `[]` for list fields.

| Field | Type | Extraction rules (detailed) |
|-------|------|----------------------------|
| **language** | string (ISO‑639‑1)

Average Metric: 2.00 / 3 (66.7%): 100%|██████████| 3/3 [00:07<00:00,  2.34s/it]

2025/09/30 10:17:31 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)





2025/09/30 10:18:28 INFO dspy.teleprompt.gepa.gepa: Iteration 27: Proposed new text for predict: **Task Overview**
You will receive a JSON object that contains the extracted text of a PDF document (`pages` → list of `{ "page": <number>, "text": <string> }`).  
Your job is to analyse this text and produce a set of bibliographic metadata fields in a **structured, machine‑readable format**.

**Required Output Fields**
Return the following fields **exactly** in the order shown, each on its own line, using valid Python‑style literals (strings in single quotes, lists in `[...]`, `None` for missing values). Do **not** add any extra text, headings, or explanations.

```
language: <ISO‑639‑2 code string, e.g. 'en' or 'fi'>
title: <primary title string>
alt_title: <list of alternative titles (may be empty) >
creator: <list of author names formatted as "LastName, FirstName MiddleName" (may be empty)>
year: <4‑digit integer year of publication (or None)>
publisher: <list of publisher names (may be

Average Metric: 0.82 / 3 (27.3%): 100%|██████████| 3/3 [00:09<00:00,  3.05s/it]

2025/09/30 10:19:11 INFO dspy.evaluate.evaluate: Average Metric: 0.8181818181818181 / 3 (27.3%)





2025/09/30 10:21:18 INFO dspy.teleprompt.gepa.gepa: Iteration 28: Proposed new text for predict: markdown
# Task Overview
You will receive a single JSON object that contains the raw text extracted from a PDF document.  
Your job is to **extract a fixed set of bibliographic metadata** from the information found in:

* `pdfinfo` – the PDF‑level metadata (may contain `title`, `author`, `creationDate`, `modDate`, `lang` etc.).
* `pages` – a list of page objects, each with a `page` number and a `text` string (line‑breaks are preserved).

Using **only** the data present in these two places, you must populate the output schema below.  
If a field cannot be determined, use the exact placeholder shown ( `null` for scalar values, `[]` for lists).

> **Important:** Do **not** guess values, fabricate identifiers, or add generic placeholders (e.g. “Plato’s Ideas – Reality”). Return *only* what you can locate explicitly.

---

## Output Schema (order must be preserved)

| Field | Type | Description 

Average Metric: 1.73 / 3 (57.6%): 100%|██████████| 3/3 [00:08<00:00,  2.80s/it]

2025/09/30 10:22:13 INFO dspy.evaluate.evaluate: Average Metric: 1.7272727272727273 / 3 (57.6%)





2025/09/30 10:23:50 INFO dspy.teleprompt.gepa.gepa: Iteration 29: Proposed new text for predict: markdown
# Bibliographic‑Metadata Extraction – Revised Assistant Instructions

You will receive **one JSON object** that represents a PDF file.  
Its top‑level keys are:

* `pdfinfo` – metadata extracted from the PDF (may contain `title`, `author`,
  `creationDate`, `modDate`, …).  
* `pages` – an ordered list of page objects, each with:
  * `page` – page number (integer)  
  * `text` – the OCR‑extracted plain‑text of that page (UTF‑8 string)

Your job is to produce **exactly one JSON object** that follows the schema below.
If a value cannot be determined, use the exact empty value specified
(`null` for a missing string, `[]` for an empty list).

---

## Output JSON Schema (field order matters)

| Field | Type | Required format / notes |
|-------|------|--------------------------|
| `language` | string | `"fi"` if the document is Finnish, otherwise `"en"`. Detect by scanning **only the firs

Average Metric: 1.00 / 3 (33.3%): 100%|██████████| 3/3 [00:07<00:00,  2.34s/it]

2025/09/30 10:24:57 INFO dspy.evaluate.evaluate: Average Metric: 1.0 / 3 (33.3%)





2025/09/30 10:26:31 INFO dspy.teleprompt.gepa.gepa: Iteration 30: Proposed new text for predict: markdown
# Revised Task: Bibliographic Metadata Extraction from PDF‑derived JSON

You will receive **one** JSON object that represents the raw text extracted from a PDF document.  
The object always contains the top‑level keys **`pdfinfo`** and **`pages`** as described below.

## Input Structure

| Key | Type | Description |
|-----|------|-------------|
| `pdfinfo` | object | Metadata that the PDF file itself provides. May include `title`, `author`, `creationDate`, `modDate`, `lang` (optional) and other entries. |
| `pages`   | list of objects | Each element represents a page and has: <br>`page` – page number (integer) <br>`text` – plain‑text of the page (line‑breaks preserved). |

## Goal

From the information in **both** `pdfinfo` **and** the page texts, produce a **single** JSON object that conforms exactly to the **Output Schema** below.  
Only the fields listed in the schema may appear

CPU times: user 28.6 s, sys: 7.59 s, total: 36.2 s
Wall time: 1h 9min 14s
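The optimizer log is long, so it can help to pull the scores out of it programmatically. The sketch below is a minimal example and not part of the original notebook: it pairs each "Proposed new text" line with the first `dspy.evaluate.evaluate` score line that follows it. That pairing is an assumption about the log ordering, and `minibatch_scores` is a hypothetical helper name. The sample lines are copied from the output above, with the proposal texts omitted.

```python
import re

# A few lines copied from the log output above (proposal texts omitted).
LOG = """\
2025/09/30 09:55:44 INFO dspy.teleprompt.gepa.gepa: Iteration 17: Proposed new text for predict: markdown
2025/09/30 09:56:01 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)
2025/09/30 10:09:43 INFO dspy.teleprompt.gepa.gepa: Iteration 23: Proposed new text for predict: markdown
2025/09/30 10:10:01 INFO dspy.evaluate.evaluate: Average Metric: 2.4545454545454546 / 3 (81.8%)
2025/09/30 10:18:28 INFO dspy.teleprompt.gepa.gepa: Iteration 27: Proposed new text for predict: **Task Overview**
2025/09/30 10:19:11 INFO dspy.evaluate.evaluate: Average Metric: 0.8181818181818181 / 3 (27.3%)
"""

PROPOSAL_RE = re.compile(r"Iteration (\d+): Proposed new text")
SCORE_RE = re.compile(r"dspy\.evaluate\.evaluate: Average Metric: ([\d.]+) / (\d+)")


def minibatch_scores(log_text):
    """Map iteration number -> mean score of the first evaluation logged after its proposal."""
    scores = {}
    pending = None  # iteration whose proposal is still waiting for a score line
    for line in log_text.splitlines():
        match = PROPOSAL_RE.search(line)
        if match:
            pending = int(match.group(1))
            continue
        match = SCORE_RE.search(line)
        if match and pending is not None:
            scores[pending] = float(match.group(1)) / int(match.group(2))
            pending = None
    return scores


scores = minibatch_scores(LOG)
best_iteration = max(scores, key=scores.get)
for iteration, score in sorted(scores.items()):
    print(f"iteration {iteration}: {score:.1%}")
print("best iteration in this sample:", best_iteration)
```

With only three examples per minibatch these scores are noisy, so a high minibatch score does not by itself say which prompt GEPA finally selects.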





In [9]:
for name, pred in optimized_program.named_predictors():
    print("================================")
    print(f"Predictor: {name}")
    print("================================")
    print("Prompt:")
    print(pred.signature.instructions)
    print("*********************************")

Predictor: predict
Prompt:
markdown
# 📋 Task – Structured Bibliographic Metadata Extraction (Re‑specified)

You will receive **one JSON object** that represents a PDF document.  
The object has exactly two top‑level keys:

| Key      | Description |
|----------|-------------|
| `pdfinfo`| Metadata that was extracted directly from the PDF file (e.g. `title`, `author`, `creationDate`, `modDate`). |
| `pages`  | A list of page objects. Each page object contains `page` (the page number) and `text` (the OCR‑extracted plain‑text of that page). |

Your job is to produce **one JSON object** that follows the schema below.  
If a field cannot be determined, use the exact empty value indicated (`null` for scalars, `[]` for lists).  
All string values must be plain ASCII – normalise quotes to `"` (or the apostrophe `’`), collapse multiple spaces to a single space, and trim leading/trailing whitespace.

---

## Output JSON Schema

| Field | Type | Required format / rules |
|-------|------|---------

In [10]:
%%time

evaluate = dspy.Evaluate(
    devset=test_set,
    metric=metadata_metric_with_feedback,
    num_threads=64,
    display_table=True,
    display_progress=True,
    provide_traceback=True
)

eval_result = evaluate(optimized_program)

2025/09/30 10:31:22 ERROR dspy.utils.parallelizer: Error for Example({'content': '{"pdfinfo": {"title": "TEAviisari", "creationDate": "D:20201219234023+02\'00\'", "modDate": "D:20201220002847+02\'00\'"}, "pages": [{"page": 1, "text": "#### TEAviisari\\n# Terveytt\\u00e4 edist\\u00e4v\\u00e4 liikunta ETEL\\u00c4-SAVON KUNNISSA 2020\\nLiikunnan edist\\u00e4misaktiivisuus Suomen kunnissa\\nTEAviisari 2020 -tiedonkeruussa\\nHyv\\u00e4 tulos*\\nParannettavaa\\nTieto puuttuu\\n\\n\\n- Pistem\\u00e4\\u00e4r\\u00e4 yli 75, kun tavoite 100. N\\u00e4in edellytykset\\n\\n\\nliikunnan edist\\u00e4miseen kunnassa ovat kaikilta osin\\n\\nhyv\\u00e4n k\\u00e4yt\\u00e4nn\\u00f6n ja laadun mukaiset.\\n\\n### sitoutuminen\\nKansallisia julkaisuja k\\u00e4sitell\\u00e4\\u00e4n kunnan liikunnan\\nedist\\u00e4misest\\u00e4 vastaavassa ty\\u00f6ryhm\\u00e4ss\\u00e4\\nValtioneuvoston selonteko liikuntapolitiikasta (OKM 2018)\\n\\n\\n57 % 43 %\\nK\\u00e4velyn ja py\\u00f6r\\u00e4ilyn edist\\u00e4misohjelma (L

Average Metric: 113.57 / 181 (62.7%): 100%|██████████| 182/182 [02:24<00:00,  1.26it/s]

2025/09/30 10:31:22 INFO dspy.evaluate.evaluate: Average Metric: 113.57045454545464 / 182 (62.4%)



CPU times: user 3.87 s, sys: 682 ms, total: 4.55 s
Wall time: 2min 24s
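The last progress-bar frame reports 62.7% while the final log line reports 62.4%. A plausible explanation, stated here as an assumption rather than confirmed DSPy behaviour, is that the example that raised an error adds nothing to the score sum but is still counted in the final denominator. The arithmetic can be checked with the numbers from the output above:

```python
total_score = 113.57045454545464  # sum of per-example scores, from the final log line
scored_examples = 181             # examples that produced a score (progress-bar denominator)
all_examples = 182                # full test set, including the example that raised an error

mean_over_scored = total_score / scored_examples
mean_over_all = total_score / all_examples

print(f"mean over scored examples: {mean_over_scored:.1%}")  # 62.7%
print(f"mean over all examples:    {mean_over_all:.1%}")     # 62.4%
```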


In [11]:
lm.inspect_history()





[2025-09-30T10:31:22.596853]

System message:

Your input fields are:
1. `content` (str):
Your output fields are:
1. `reasoning` (str): 
2. `language` (str): The language of the resource expressed as a BCP47 language tag.
3. `title` (str): The main title of the publication.
4. `alt_title` (list[str]): Alternative or parallel titles of the publication, suffixed with a BCP47 language tag in curly brackets.
5. `creator` (list[str]): The primary author(s) of the resource (order: Last Name, First Names).
6. `year` (Union[str, NoneType]): The year on which the resource was issued or made available.
7. `publisher` (list[str]): The entity/entities responsible for making the resource available.
8. `doi` (Union[str, NoneType]): The Digital Object Identifier (DOI) associated with the resource.
9. `e_isbn` (list[str]): The ISBN associated with the electronic resource.
10. `p_isbn` (list[str]): The ISBN of the printed version of this document.
11. `e_issn` (Union[str, NoneType

In [12]:
# save the optimized program state for later use (as both JSON and pickle, just in case)
optimized_program.save("gepa-optimized-module.json", save_program=False)
optimized_program.save("gepa-optimized-module.pkl", save_program=False)
# save just the prompt(s); they contain non-ASCII characters, so write them explicitly as UTF-8
for name, pred in optimized_program.named_predictors():
    with open(f"gepa-optimized-prompt-{name}.txt", "w", encoding="utf-8") as outfile:
        outfile.write(pred.signature.instructions)
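
The optimized prompt contains non-ASCII characters (an emoji, an en dash and non-breaking hyphens), so reading the prompt file back is safest with an explicit encoding. The following is a minimal round-trip sketch, not part of the original notebook: it uses a temporary directory and the first line of the prompt printed above as a stand-in for `pred.signature.instructions`.

```python
import tempfile
from pathlib import Path

# Stand-in for pred.signature.instructions: the first line of the optimized prompt shown above.
instructions = "# 📋 Task – Structured Bibliographic Metadata Extraction (Re‑specified)"

with tempfile.TemporaryDirectory() as tmpdir:
    prompt_path = Path(tmpdir) / "gepa-optimized-prompt-predict.txt"
    # Write and read with the same explicit encoding so the result does not
    # depend on the platform's default locale.
    prompt_path.write_text(instructions, encoding="utf-8")
    restored = prompt_path.read_text(encoding="utf-8")

print("round trip OK:", restored == instructions)
print("non-ASCII characters:", sorted({ch for ch in instructions if ord(ch) > 127}))
```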
