# Metadata extraction using DSPy and a local LLM using GEPA optimization

To run this, you first need to start two local vLLM servers in the backround. These commands are tested on a single A100 80GB in non-exclusive mode (e.g. Turso oversub GPU). The GPT-OSS 120B model has to be partially offloaded to CPU to preserve VRAM.

For the main extractor model:

    vllm serve google/gemma-3-4b-it --port 7987 --max-model-len 16384 --gpu-memory-utilization 0.25

For the reflection model:
    
    llama-server -hf ggml-org/gpt-oss-120b-GGUF --host 0.0.0.0 --port 7988 --ctx-size 0 --jinja -ub 2048 -b 2048 --n-cpu-moe 5


In [1]:
import dspy

MODEL_ID = "google/gemma-3-4b-it"  # should match the model vLLM is running (does it matter??)
PORT = 7987  # should match the port where vLLM is running
MAX_TOKENS = 2048  # limit on how many new tokens to generate (default: 4000)
TEMPERATURE = 0.7

lm = dspy.LM("openai/" + MODEL_ID,
             api_base=f"http://localhost:{PORT}/v1",  # ensure this points to your port
             api_key="local", model_type="chat", max_tokens=MAX_TOKENS, temperature=TEMPERATURE)
dspy.configure(lm=lm)

# test the connection to the LLM
lm("Who are you?", temperature=0.0)

["I'm Gemma, a large language model created by the Gemma team at Google DeepMind. I’m an open-weights model, which means I’m widely available for public use! \n\nI can take text and images as inputs and generate text-based responses. \n\nYou can learn more about me on the Gemma project page: [https://ai.google.dev/gemma](https://ai.google.dev/gemma)"]

In [2]:

REFLECTION_MODEL_ID = "ggml-org/gpt-oss-120b-GGUF"
REFLECTION_PORT = PORT + 1
REFLECTION_MAX_TOKENS = 16384

reflection_lm = dspy.LM("openai/" + REFLECTION_MODEL_ID,
             api_base=f"http://localhost:{REFLECTION_PORT}/v1",  # ensure this points to your port
             api_key="local", model_type="chat", max_tokens=REFLECTION_MAX_TOKENS, temperature=TEMPERATURE)

# test the connection to the LLM
reflection_lm("Who are you?", temperature=0.0)

['I’m ChatGPT\u202f—\u202fa large language model created by OpenAI. I’ve been trained on a wide variety of text up through June\u202f2024, which lets me help with things like answering questions, brainstorming ideas, explaining concepts, drafting or editing writing, solving problems, and much more. I don’t have personal experiences or consciousness, and I can’t browse the web in real time, but I can draw on the information I was trained on to generate useful, context‑aware responses.  \n\nIf there’s anything specific you’d like to know or discuss, just let me know!']

In [3]:
# Load and prepare dataset

import json
import glob
import random

random.seed(42)  # for deterministic sampling of validation set

train_files = glob.glob("../../llm-dataset/*-train.jsonl")
test_files = glob.glob("../../llm-dataset/*-test.jsonl")

VAL_SIZE = 64  # how many documents to validate on during optimization

def preprocess_sample(sample):
    # fix some bad field names
    ground_truth = { fld.replace('-', '_'): val for fld, val in sample["ground_truth"].items() }
    output = json.dumps(ground_truth)
    input_ = json.dumps(sample["content"])
    return dspy.Example({"content": input_, "metadata": output}).with_inputs("content")

def dataset_to_records(files):
    records = []
    for filename in files:
        with open(filename) as infile:
            for line in infile:
                sample = json.loads(line)
                records.append(preprocess_sample(sample))
    return records


train_val_set = dataset_to_records(train_files)
random.shuffle(train_val_set)

train_set = train_val_set[VAL_SIZE:]
val_set = train_val_set[:VAL_SIZE]

test_set = dataset_to_records(test_files)

len(train_set), len(val_set), len(test_set)

(576, 64, 182)

In [4]:
print("Input Message:")
print(train_set[-1]['content'])

print("\n\nGold Answer:")
for k, v in json.loads(train_set[-1]['metadata']).items():
    print(f"{k}: {v}")

Input Message:
{"pdfinfo": {"creationDate": "D:20201214215341+01'00'", "modDate": "D:20201214215418+01'00'"}, "pages": [{"page": 1, "text": "# ANTAA TAITEEN OPETTAA\n\n\n"}, {"page": 3, "text": "ANTA A TAITEEN OPETTA A GERT BIESTA\n\n\n"}, {"page": 4, "text": "00:00:08.18\n\n\n"}, {"page": 5, "text": "00:00:36.03 00:00:52.19 00:00:54.19\n\n\n"}, {"page": 6, "text": "00:00:58.16 00:01:00.17 00:01:0\n\n\n"}, {"page": 65, "text": "\u2018Opastan sinua kaikessa, n\u00e4yt\u00e4n sinulle kaiken ja nime\u00e4n kaiken.\u2019\n\u2014 COMENIUS\nT\u00e4ss\u00e4 kirjassa Gert Biesta esitt\u00e4\u00e4 uuden n\u00e4kemyksen nykyaikaisesta taidekasvatuksesta\n\nosoittamalla, ett\u00e4 taide tarjoaa ainutlaatuisia v\u00e4lineit\u00e4 olla dialogissa maailman kanssa. N\u00e4kemys\n\nperustuu ajatukseen, ett\u00e4 opettaminen on n\u00e4ytt\u00e4mist\u00e4. Opettaja n\u00e4ytt\u00e4\u00e4 oppilaalle millaisiin\n\nhyviin, t\u00e4rkeisiin tai merkitt\u00e4viin asioihin maailmassa voisi kiinnitt\u00e4\u00e4

In [5]:
from typing import Optional

class ExtractInfo(dspy.Signature):
    """Extract structured metadata from text extracted from a PDF."""

    content: str = dspy.InputField()
    language: str = dspy.OutputField(desc="The language of the resource expressed as a BCP47 language tag.")
    title: str = dspy.OutputField(desc="The main title of the publication.")
    alt_title: list[str] = dspy.OutputField(desc="Alternative or parallel titles of the publication, suffixed with a BCP47 language tag in curly brackets.")
    creator: list[str] = dspy.OutputField(desc="The primary author(s) of the resource (order: Last Name, First Names).")
    year: Optional[str] = dspy.OutputField(desc="The year on which the resource was issued or made available.")
    publisher: list[str] = dspy.OutputField(desc="The entity/entities responsible for making the resource available.")
    doi: Optional[str] = dspy.OutputField(desc="The Digital Object Identifier (DOI) associated with the resource.")
    e_isbn: list[str] = dspy.OutputField(desc="The ISBN associated with the electronic resource.")
    p_isbn: list[str] = dspy.OutputField(desc="The ISBN of the printed version of this document.")
    e_issn: Optional[str] = dspy.OutputField(desc="The ISSN associated with the electronic resource.")
    p_issn: Optional[str] = dspy.OutputField(desc="The ISSN of the printed version of this document.")
    type_coar: str = dspy.OutputField(desc="The type of the resource according to the COAR Resource Types classification.")

module = dspy.ChainOfThought(ExtractInfo)

text = "Apple Inc. announced its latest iPhone 14 today." \
    "The CEO, Tim Cook, highlighted its new features in a press release."
response = module(content=text)

print(response)


Prediction(
    reasoning='The text describes an announcement by Apple Inc. regarding the iPhone 14. It mentions the CEO, Tim Cook, and a press release. I will extract the key entities and information to populate the metadata fields.',
    language='en',
    title='iPhone 14 Announcement',
    alt_title=[],
    creator=['Apple Inc.', 'Tim Cook'],
    year=None,
    publisher=['Apple Inc.'],
    doi=None,
    e_isbn=[],
    p_isbn=[],
    e_issn=None,
    p_issn=None,
    type_coar='News Article'
)


In [6]:
import Levenshtein

ALMOST_THRESHOLD = 0.95  # Adjust as needed

def feedback_simple_string(field, true_val, pred_val):
    score = 1.0 if true_val == pred_val else 0.0
    if score == 1.0:
        feedback = f"✅ `{field}` is correct: `{true_val}`."
    else:
        feedback = f"❌ `{field}` is incorrect. You predicted `{pred_val}`, but the correct value is `{true_val}`."
    return score, feedback

def feedback_fuzzy_string(field, true_val, pred_val):
    base_score = 1.0 if true_val == pred_val else 0.0
    if base_score == 1.0 or (true_val and pred_val and Levenshtein.ratio(true_val.lower(), pred_val.lower()) >= ALMOST_THRESHOLD):
        score = 1.0
        feedback = f"✅ `{field}` is approximately correct: `{pred_val}` matches `{true_val}` closely."
    else:
        score = 0.0
        feedback = f"❌ `{field}` is incorrect. You predicted `{pred_val}`, but the correct value is `{true_val}`."
    return score, feedback

def feedback_set(field, true_val, pred_val):
    true_set = set(true_val or [])
    pred_set = set(pred_val or [])

    if not true_set and not pred_set:
        return 1.0, f"✅ `{field}` is empty as expected."
    elif not true_set or not pred_set:
        return 0.0, f"❌ `{field}` is incorrect. Expected `{true_set}`, but got `{pred_set}`."

    tp = len(true_set & pred_set)
    fp = len(pred_set - true_set)
    fn = len(true_set - pred_set)

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

    feedback = f"🔍 `{field}` partial match."
    feedback += f"- Correctly included: `{list(true_set & pred_set)}`\n"
    if fp:
        feedback += f"- Incorrectly included: `{list(pred_set - true_set)}`\n"
    if fn:
        feedback += f"- Missed: `{list(true_set - pred_set)}`"

    return f1, feedback.strip()

def feedback_e_issn(field, true_val, pred_val, p_issn_val):
    if true_val == pred_val:
        return 1.0, f"✅ `{field}` is correct: `{true_val}`."
    elif p_issn_val and pred_val == p_issn_val and true_val is None:
        return 1.0, f"✅ `{field}` is correctly inferred from `p_issn`: `{pred_val}`."
    else:
        return 0.0, f"❌ `{field}` is incorrect. You predicted `{pred_val}`, but the correct value is `{true_val}`."

def metadata_metric_with_feedback(example, pred, trace=None, pred_name=None, pred_trace=None):
    fields = [
        'language', 'title', 'creator', 'year', 'publisher',
        'doi', 'e_isbn', 'p_isbn', 'e_issn', 'p_issn', 'type_coar'
    ]

    scores = []
    feedback_parts = []

    metadata = json.loads(example.get("metadata", "{}"))
    ground_truth = example.get("ground_truth", {})

    for field in fields:
        true_val = metadata.get(field)
        pred_val = pred.get(field) or None

        if field in ['language', 'year', 'doi', 'p_issn', 'type_coar']:
            score, feedback = feedback_simple_string(field, true_val, pred_val)
        elif field == 'title':
            score, feedback = feedback_fuzzy_string(field, true_val, pred_val)
        elif field in ['creator', 'publisher', 'e_isbn', 'p_isbn']:
            score, feedback = feedback_set(field, true_val, pred_val)
        elif field == 'e_issn':
            p_issn_val = ground_truth.get("p_issn")
            score, feedback = feedback_e_issn(field, true_val, pred_val, p_issn_val)
        else:
            score, feedback = feedback_simple_string(field, true_val, pred_val)

        scores.append(score)
        feedback_parts.append(feedback)

    overall_score = sum(scores) / len(scores) if scores else 0
    full_feedback = "\n".join(feedback_parts)

    return dspy.Prediction(score=overall_score, feedback=full_feedback)


In [7]:
from dspy import GEPA

optimizer = GEPA(
    metric=metadata_metric_with_feedback,
#    auto="heavy",
    max_metric_calls=3200,
    num_threads=64,
    track_stats=False,
    use_merge=True,
    reflection_lm=reflection_lm
)

In [8]:
%%time

optimized_program = optimizer.compile(
    module,
    trainset=train_set,
    valset=val_set,
)

2025/09/30 15:01:06 INFO dspy.teleprompt.gepa.gepa: Running GEPA for approx 3200 metric calls of the program. This amounts to 5.00 full evals on the train+val set.
2025/09/30 15:01:06 INFO dspy.teleprompt.gepa.gepa: Using 64 examples for tracking Pareto scores. You can consider using a smaller sample of the valset to allow GEPA to explore more diverse solutions within the same budget.
GEPA Optimization:   0%|          | 0/3200 [00:00<?, ?rollouts/s]2025/09/30 15:01:07 INFO dspy.evaluate.evaluate: Average Metric: 37.57647907647908 / 64 (58.7%)
2025/09/30 15:01:07 INFO dspy.teleprompt.gepa.gepa: Iteration 0: Base program full valset score: 0.5871324855699855
GEPA Optimization:   2%|▏         | 64/3200 [00:00<00:34, 90.29rollouts/s]2025/09/30 15:01:07 INFO dspy.teleprompt.gepa.gepa: Iteration 1: Selected program 0 score: 0.5871324855699855


Average Metric: 1.97 / 3 (65.7%): 100%|██████████| 3/3 [00:00<00:00, 93.72it/s]

2025/09/30 15:01:07 INFO dspy.evaluate.evaluate: Average Metric: 1.9696969696969697 / 3 (65.7%)





2025/09/30 15:02:21 INFO dspy.teleprompt.gepa.gepa: Iteration 1: Proposed new text for predict: **Task Overview**
You will receive a JSON object that contains the result of extracting raw text from a PDF file (`pdfinfo` + a list of pages with their text).  
Your job is to *automatically* parse this information and produce a structured metadata record in JSON‑like key/value form (plain text, not a JSON object).  

**Required output fields (exact names and order)**
```
reasoning
language
title
alt_title
creator
year
publisher
doi
e_isbn
p_isbn
e_issn
p_issn
type_coar
```
Each field must be on its own line, exactly as shown above, followed by a colon, a space, and the value.  
Values must follow the formats described below. Do **not** add any extra fields, headings, or markup.

---

### 1. General parsing strategy (you may follow this step‑by‑step)

1. **Read `pdfinfo`** – use the `title`, `author`, `creationDate`, and `modDate` fields as fall‑backs only when the page text does not contai

Average Metric: 1.73 / 3 (57.6%): 100%|██████████| 3/3 [00:00<00:00, 103.32it/s]

2025/09/30 15:03:01 INFO dspy.evaluate.evaluate: Average Metric: 1.727272727272727 / 3 (57.6%)





2025/09/30 15:04:23 INFO dspy.teleprompt.gepa.gepa: Iteration 2: Proposed new text for predict: markdown
# Task: Structured metadata extraction from PDF‑derived text

You will receive a JSON object that contains two top‑level keys:

* **pdfinfo** – a dictionary with the raw PDF metadata (title, author, creationDate, …).  
* **pages** – a list of page objects, each with:
  * **page** – the page number (integer, 1‑based).  
  * **text** – the plain‑text extracted from that page (UTF‑8, may contain markdown, hyperlinks, footnotes, line‑breaks, etc.).

Your job is to **produce a single JSON object** that contains the following fields, **exactly** as shown (order does not matter).  
If a field cannot be determined, use the value prescribed in the *“Missing data handling”* section.

| Field | Expected type | Description | Extraction hints |
|-------|---------------|-------------|-----------------|
| **language** | string | ISO‑639‑1 language code of the document (e.g. `en`, `fi`, `sv`). | De

Average Metric: 1.95 / 3 (65.2%): 100%|██████████| 3/3 [00:00<00:00, 105.91it/s]

2025/09/30 15:05:08 INFO dspy.evaluate.evaluate: Average Metric: 1.9545454545454546 / 3 (65.2%)





2025/09/30 15:06:13 INFO dspy.teleprompt.gepa.gepa: Iteration 3: Proposed new text for predict: language
title
alt_title
creator
year
publisher
doi
e_isbn
p_isbn
e_issn
p_issn
type_coar
```

* **Scalar values** (`language`, `title`, `year`, `doi`, `e_issn`, `p_issn`, `type_coar`) are printed as plain strings or `None`.  
* **List values** (`alt_title`, `creator`, `publisher`, `e_isbn`, `p_isbn`) are printed as JSON arrays, e.g., `['Doe, John', 'Smith, Anna']` or `[]`.  

**Example**

```
language
fi
title
Syömishäiriötaustaisen äidin kohtaaminen lastenneuvolassa
alt_title
[]
creator
['Pentikäinen, Tytti', 'Pihkakoski, Tanja', 'Vänskä, Suvi', 'Männistö, Merja']
year
2021
publisher
['Oulun ammattikorkeakoulu']
doi
None
e_isbn
[]
p_isbn
[]
e_issn
1798-2022
p_issn
None
type_coar
journal article
2025/09/30 15:06:19 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)
2025/09/30 15:06:50 INFO dspy.evaluate.evaluate: Average Metric: 37.78484848484847 / 64 (59.0%)
2025/09/30 15:06:50 I

Average Metric: 1.55 / 3 (51.5%): 100%|██████████| 3/3 [00:07<00:00,  2.38s/it]

2025/09/30 15:06:57 INFO dspy.evaluate.evaluate: Average Metric: 1.5454545454545454 / 3 (51.5%)





2025/09/30 15:08:04 INFO dspy.teleprompt.gepa.gepa: Iteration 4: Proposed new text for predict: markdown
# Instruction for extracting structured metadata from PDF‑extracted text

You will receive a JSON object that contains the raw text extracted from a PDF (field **pages** → list of pages with a **text** string) and, optionally, some PDF‑level information (author, title, creationDate, etc.) in the **pdfinfo** object.

Your job is to parse this information and produce a **flat list of metadata fields** (one field per line, exactly as shown in the examples) that can later be turned into a JSON record.  
All fields must follow the exact naming, type and formatting rules below. If a piece of information is not present, use the values shown in the “Missing values” section.

---

## 1. Required output fields (order does not matter)

| Field name | Type | Description & format |
|------------|------|----------------------|
| **language** | string | ISO‑639‑1 language code of the document (e.g

Average Metric: 1.09 / 2 (54.5%):  67%|██████▋   | 2/3 [00:05<00:02,  2.61s/it]



Average Metric: 1.73 / 3 (57.6%): 100%|██████████| 3/3 [00:28<00:00,  9.51s/it]

2025/09/30 15:09:15 INFO dspy.evaluate.evaluate: Average Metric: 1.7272727272727273 / 3 (57.6%)





2025/09/30 15:10:32 INFO dspy.teleprompt.gepa.gepa: Iteration 5: Proposed new text for predict: markdown
**Task Overview**

You are given a JSON object that contains two top‑level keys:

* `pdfinfo` – metadata extracted from the PDF (e.g. `title`, `author`, `creationDate`, `modDate`).
* `pages` – an ordered list of pages, each with a `page` number and the plain‑text extracted from that page.

From this information you must produce a flat list of **13 fields** in the exact order shown below.  
Each field name is printed on its own line, immediately followed by its value on the next line.

```
language
<title value>
alt_title
<list value>
creator
<list value>
year
<scalar value>
publisher
<list value>
doi
<scalar value>
e_isbn
<list value>
p_isbn
<list value>
e_issn
<scalar value>
p_issn
<scalar value>
type_coar
<scalar value>
```

*Scalar values* (`language`, `title`, `year`, `doi`, `e_issn`, `p_issn`, `type_coar`) must be printed **as plain strings** or the literal word `None` when the

Average Metric: 1.09 / 3 (36.4%): 100%|██████████| 3/3 [00:07<00:00,  2.49s/it]

2025/09/30 15:10:47 INFO dspy.evaluate.evaluate: Average Metric: 1.0909090909090908 / 3 (36.4%)





2025/09/30 15:12:14 INFO dspy.teleprompt.gepa.gepa: Iteration 6: Proposed new text for predict: markdown
# Task: Precise Metadata Extraction from PDF‑derived Text

You will be given a **single JSON object** with two top‑level keys:

* **pdfinfo** – a dictionary containing the raw PDF metadata (e.g. `title`, `author`, `creationDate`, …).  
* **pages** – an ordered list of page objects, each with:
  * **page** – the page number (integer, 1‑based).  
  * **text** – the plain‑text extracted from that page (UTF‑8, may contain markdown headings, bold/italic markup, hyperlinks, footnotes, line‑breaks, etc.).

Your job is to **produce ONE JSON object** that contains **exactly** the fields listed in the table below (order does not matter).  
If a field cannot be determined, use the value prescribed in the *Missing‑Data Handling* section.

| Field | Type | Description | Extraction Details |
|-------|------|-------------|--------------------|
| **language** | string | ISO‑639‑1 code of the docume

Average Metric: 1.64 / 3 (54.5%): 100%|██████████| 3/3 [00:08<00:00,  2.73s/it]

2025/09/30 15:13:08 INFO dspy.evaluate.evaluate: Average Metric: 1.6363636363636362 / 3 (54.5%)





2025/09/30 15:14:47 INFO dspy.teleprompt.gepa.gepa: Iteration 7: Proposed new text for predict: markdown
# Task Overview
You are given a JSON object that contains the raw text extracted from a PDF file (`pages` → list of `{page, text}`) together with the PDF‑metadata (`pdfinfo`).  
Your job is to **extract structured bibliographic metadata** and return it as a single JSON object that follows the exact schema described below.

The evaluation of your answer is strict: every field must be present, data must be in the required format, and only the values that are explicitly present in the source text may be returned.  
Below you will find detailed rules, normalization conventions, and examples of how to handle edge‑cases that appeared in the training data.

---

## 1. Output Schema

| Field | Type | Description | Required? |
|-------|------|-------------|-----------|
| `language` | string | ISO‑639‑2 three‑letter code (`fi`, `sv`, `en`, …). Detect the primary language of the document from 

Average Metric: 1.06 / 3 (35.4%): 100%|██████████| 3/3 [00:07<00:00,  2.38s/it]

2025/09/30 15:15:58 INFO dspy.evaluate.evaluate: Average Metric: 1.0606060606060606 / 3 (35.4%)





2025/09/30 15:17:22 INFO dspy.teleprompt.gepa.gepa: Iteration 8: Proposed new text for predict: markdown
# Instruction for extracting structured metadata from PDF‑extracted text

You will receive a JSON object with two top‑level keys:

* **pdfinfo** – optional dictionary containing PDF‑level metadata such as `title`, `author`, `creationDate`, `modDate`, etc.
* **pages** – a list of page objects, each with a `page` number and a `text` string that is the raw OCR/clipboard text of that page.

Your task is to **scan the entire document** (all pages) and produce a **flat list of metadata fields**.  
Each field is written on its own line, followed by the value on the next line, exactly as shown in the examples below.  
The order of the fields is irrelevant, but the **field names and value formats must match the specification**.

If a piece of information cannot be found, output the *missing‑value placeholder* described in section 3.

---

## 1. Output fields

| Field name | Type | Descriptio

Average Metric: 2.23 / 3 (74.2%): 100%|██████████| 3/3 [00:06<00:00,  2.27s/it]

2025/09/30 15:18:16 INFO dspy.evaluate.evaluate: Average Metric: 2.2272727272727275 / 3 (74.2%)





2025/09/30 15:19:31 INFO dspy.teleprompt.gepa.gepa: Iteration 9: Proposed new text for predict: text
**Task Overview**

You are given a JSON representation of a PDF document.  
The JSON has two top‑level keys:

* `pdfinfo` – contains metadata such as `author`, `creationDate`, `modDate`.
* `pages` – a list of page objects, each with `page` (number) and `text` (the OCR‑extracted plain text of that page).

From this information you must extract a fixed set of bibliographic fields and output them **exactly** in the order and format shown in the examples:

```
language
<value>
title
<value>
alt_title
<value>
creator
<value>
year
<value>
publisher
<value>
doi
<value>
e_isbn
<value>
p_isbn
<value>
e_issn
<value>
p_issn
<value>
type_coar
<value>
2025/09/30 15:19:37 INFO dspy.evaluate.evaluate: Average Metric: 1.909090909090909 / 3 (63.6%)
2025/09/30 15:19:37 INFO dspy.teleprompt.gepa.gepa: Iteration 9: New subsample score is not better, skipping
GEPA Optimization:  18%|█▊        | 566/3200 [18

Average Metric: 1.73 / 3 (57.6%): 100%|██████████| 3/3 [00:05<00:00,  1.94s/it]

2025/09/30 15:19:43 INFO dspy.evaluate.evaluate: Average Metric: 1.7272727272727275 / 3 (57.6%)





2025/09/30 15:21:02 INFO dspy.teleprompt.gepa.gepa: Iteration 10: Proposed new text for predict: markdown
# Instruction for extracting structured metadata from PDF‑extracted text

You will receive a **single JSON object** that contains:

* **pdfinfo** – optional dictionary with PDF‑level metadata (e.g. `author`, `title`, `creationDate`, `modDate`).
* **pages** – list of page objects, each with a `page` number and a `text` string that is the raw OCR/clipboard text of that page.

Your task is to **parse this information and output a flat list of metadata fields** (one field per line, exactly as shown in the examples). The output must be ready to be turned into a JSON record later.

---

## 1. Required output fields (order does **not** matter)

| Field name | Type | Description & format |
|------------|------|----------------------|
| **language** | string or `None` | ISO‑639‑1 code of the **content language** (`en`, `fi`, `sv`, …). Detect from the majority of visible words, not from PDF 

Average Metric: 1.82 / 3 (60.6%): 100%|██████████| 3/3 [00:05<00:00,  1.87s/it]

2025/09/30 15:21:47 INFO dspy.evaluate.evaluate: Average Metric: 1.8181818181818181 / 3 (60.6%)





2025/09/30 15:23:28 INFO dspy.teleprompt.gepa.gepa: Iteration 11: Proposed new text for predict: markdown
# Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with the following structure:

* `pdfinfo` – optional dictionary containing PDF‑level metadata (e.g. `author`, `title`, `creationDate`, `modDate`).  
* `pages` – list of page objects, each with a numeric `page` field and a `text` field that holds the raw OCR/clipboard text of that page.

Your job is to **parse this information and output a flat list of metadata fields** (one field name on a line, the value on the next line). The output must follow the exact format shown in the “Output format example” section below, because it will later be turned into a JSON record.

---

## 1. Required output fields (order does **not** matter)

| Field name | Type | Description & format |
|------------|------|----------------------|
| `language` | string or `None` | ISO‑639‑1 code of the 

Average Metric: 2.15 / 3 (71.7%): 100%|██████████| 3/3 [00:08<00:00,  2.86s/it]

2025/09/30 15:24:17 INFO dspy.evaluate.evaluate: Average Metric: 2.1515151515151514 / 3 (71.7%)





2025/09/30 15:25:46 INFO dspy.teleprompt.gepa.gepa: Iteration 12: Proposed new text for predict: markdown
# 📄 Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with the following structure:

```json
{
  "pdfinfo": {               // optional, may be missing or empty
    "author": "...",
    "title": "...",
    "creationDate": "D:YYYYMMDD…",
    "modDate": "D:YYYYMMDD…",
    ...
  },
  "pages": [
    {"page": 1, "text": "..."},
    {"page": 2, "text": "..."},
    ...
  ]
}
```

`pages[*].text` contains the raw OCR / clipboard text of each page (line breaks are preserved).  
Your job is to **produce a flat list of metadata fields** (one field name on a line, its value on the next line) that can later be turned into a JSON record.

---

## 1️⃣ Required output fields (order does **not** matter)

| Field name | Type | Missing‑value placeholder |
|------------|------|---------------------------|
| `language` | `str` or `None` | `None

Average Metric: 2.18 / 3 (72.7%): 100%|██████████| 3/3 [00:04<00:00,  1.59s/it]

2025/09/30 15:26:02 INFO dspy.evaluate.evaluate: Average Metric: 2.1818181818181817 / 3 (72.7%)





2025/09/30 15:27:38 INFO dspy.teleprompt.gepa.gepa: Iteration 13: Proposed new text for predict: markdown
# Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** containing:

* **pdfinfo** – optional dictionary with PDF‑level metadata (e.g. `author`, `title`, `creationDate`, `modDate`).
* **pages** – list of page objects, each with a `page` number and a `text` string that is the raw OCR/clipboard text of that page.

Your task is to **parse this information and output a flat list of metadata fields** (one field per line, exactly as shown in the examples). The output must be ready to be turned into a JSON record later.

---

## 1. Required output fields (order does **not** matter)

| Field name | Type | Description & format |
|------------|------|----------------------|
| **language** | string or `None` | ISO‑639‑1 code of the **content language** (`en`, `fi`, `sv`, …). Detect from the majority of visible words, **not** from PDF meta

Average Metric: 1.82 / 3 (60.6%): 100%|██████████| 3/3 [00:07<00:00,  2.53s/it]

2025/09/30 15:27:51 INFO dspy.evaluate.evaluate: Average Metric: 1.8181818181818181 / 3 (60.6%)





2025/09/30 15:29:52 INFO dspy.teleprompt.gepa.gepa: Iteration 14: Proposed new text for predict: markdown
# Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with the following structure:

* `pdfinfo` – optional dictionary containing PDF‑level metadata (e.g. `author`, `title`, `creationDate`, `modDate`).  
* `pages` – list of page objects, each with a numeric `page` field and a `text` field that holds the raw OCR/clipboard text of that page.

Your job is to **parse this information and output a flat list of metadata fields** (one field name on a line, the value on the next line). The output must follow the exact format shown in the “Output format example” section below, because it will later be turned into a JSON record.

---

## 1. Required output fields (order does **not** matter)

| Field name | Type | Description & format |
|------------|------|----------------------|
| `language` | string or `None` | ISO‑639‑1 code of the 

Average Metric: 0.91 / 3 (30.3%): 100%|██████████| 3/3 [00:08<00:00,  2.98s/it]

2025/09/30 15:30:07 INFO dspy.evaluate.evaluate: Average Metric: 0.9090909090909092 / 3 (30.3%)





2025/09/30 15:31:52 INFO dspy.teleprompt.gepa.gepa: Iteration 15: Proposed new text for predict: markdown
# Instruction for extracting structured metadata from PDF‑extracted text

You will receive a **single JSON object** with two top‑level keys:

* **pdfinfo** – optional dictionary containing PDF‑level metadata such as `title`, `author`, `creationDate`, `modDate`, etc.
* **pages** – a list of page objects, each with a `page` number (integer) and a `text` string that is the raw OCR/clipboard text of that page.

Your job is to **scan the whole document** (all pages) and produce a **flat list of metadata fields**.  
Each field name is written on its own line, followed by its value on the next line, exactly as shown in the *Output format* section.

All list values must be valid Python‑style lists (single quotes, commas, no trailing commas).  
If a piece of information cannot be found, output the *missing‑value placeholder* described in the *Missing‑value handling* table.

Below are the co

Average Metric: 1.70 / 3 (56.6%): 100%|██████████| 3/3 [00:07<00:00,  2.62s/it]

2025/09/30 15:32:57 INFO dspy.evaluate.evaluate: Average Metric: 1.6969696969696968 / 3 (56.6%)





2025/09/30 15:34:44 INFO dspy.teleprompt.gepa.gepa: Iteration 16: Proposed new text for predict: markdown
# Revised Task Specification – Bibliographic Metadata Extraction from PDF‑Text JSON

You will receive a JSON object that contains:

* `pdfinfo` – metadata extracted by the PDF parser (may include `title`, `author`, `creationDate`, `modDate`).
* `pages` – an ordered list of objects `{ "page": <number>, "text": "<raw page text>" }`.

Your job is to **produce a single JSON object** that follows the schema below, **populated only with information that can be found in the supplied text** (or, when absolutely necessary, from `pdfinfo`).  
All fields must be present; when a value is not available use `null` (for scalar fields) or an empty list `[]` (for list fields).

---

## 1. Output Schema

| Field | Type | Description | Required |
|-------|------|-------------|----------|
| `language` | string | ISO‑639‑1 two‑letter code (`fi`, `sv`, `en`, …). Detect the primary language of the docume

Average Metric: 1.88 / 3 (62.6%): 100%|██████████| 3/3 [00:07<00:00,  2.44s/it]

2025/09/30 15:35:33 INFO dspy.evaluate.evaluate: Average Metric: 1.878787878787879 / 3 (62.6%)





2025/09/30 15:37:02 INFO dspy.teleprompt.gepa.gepa: Iteration 17: Proposed new text for predict: markdown
# Instruction for extracting structured metadata from PDF‑extracted text

You will receive a **single JSON object** with two top‑level keys:

* **pdfinfo** – optional dictionary containing PDF‑level metadata (e.g. `author`, `title`, `creationDate`, `modDate`).  
* **pages** – a list of page objects, each with a `page` number (integer) and a `text` string that is the raw OCR/clipboard text of that page.

Your job is to **scan the whole document** (all pages) and produce a **flat list of metadata fields**.  
Each field name is written on its own line, followed by its value on the next line, exactly as shown in the *Output format* section.

If a piece of information cannot be found, output the **missing‑value placeholder** described in the *Missing‑value handling* table.

---

## 1. Output fields

| Field name | Type | Required format & notes |
|------------|------|-------------------

Average Metric: 1.76 / 3 (58.8%): 100%|██████████| 3/3 [00:08<00:00,  2.74s/it]

2025/09/30 15:37:39 INFO dspy.evaluate.evaluate: Average Metric: 1.7636363636363637 / 3 (58.8%)





2025/09/30 15:39:07 INFO dspy.teleprompt.gepa.gepa: Iteration 18: Proposed new text for predict: markdown
# Task: Structured metadata extraction from PDF‑derived text

You will receive a single JSON object with two top‑level keys:

* **pdfinfo** – a dictionary containing the raw PDF metadata (e.g. `title`, `author`, `creationDate`, …).  
* **pages** – a list of page objects, each with:
  * **page** – the page number (integer, 1‑based).  
  * **text** – the plain‑text extracted from that page (UTF‑8, may contain markdown, hyperlinks, footnotes, line‑breaks, etc.).

Your job is to **produce ONE JSON object** that contains **exactly** the fields listed below (order does not matter).  
If a field cannot be determined, use the value prescribed in the *Missing‑data handling* section.

---

## Output fields

| Field | Type | Description | Extraction rules |
|-------|------|-------------|------------------|
| **language** | string | ISO‑639‑1 code of the document language (`en`, `fi`, `sv`, …)

Average Metric: 2.09 / 3 (69.7%): 100%|██████████| 3/3 [00:05<00:00,  1.98s/it]

2025/09/30 15:40:13 INFO dspy.evaluate.evaluate: Average Metric: 2.090909090909091 / 3 (69.7%)





2025/09/30 15:42:17 INFO dspy.teleprompt.gepa.gepa: Iteration 19: Proposed new text for predict: markdown
# Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with two top‑level keys:

* **pdfinfo** – optional dictionary containing PDF‑level metadata such as `title`, `author`, `creationDate`, `modDate`, etc.
* **pages** – a list of page objects, each with a `page` number (integer) and a `text` string that is the raw OCR/clipboard text of that page.

Your job is to **scan the whole document (all pages)** and produce a **flat list of metadata fields**.  
Each field name is written on its own line, followed by its value on the next line, exactly as shown in the *Output format* section below.

All list values must be valid Python‑style lists (single quotes, commas, no trailing commas).  
If a piece of information cannot be found, output the *missing‑value placeholder* described in the *Missing‑value handling* table.

Below are the c

Average Metric: 1.64 / 3 (54.5%): 100%|██████████| 3/3 [00:07<00:00,  2.48s/it]

2025/09/30 15:42:31 INFO dspy.evaluate.evaluate: Average Metric: 1.6363636363636362 / 3 (54.5%)





2025/09/30 15:44:33 INFO dspy.teleprompt.gepa.gepa: Iteration 20: Proposed new text for predict: markdown
# Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with the following structure:

* `pdfinfo` – optional dictionary containing PDF‑level metadata (e.g. `author`, `title`, `creationDate`, `modDate`).  
* `pages` – list of page objects, each with a numeric `page` field and a `text` field that holds the raw OCR/clipboard text of that page.

Your job is to **parse this information and output a flat list of metadata fields** (one field name on a line, the value on the next line). The output must follow the exact format shown in the “Output format example” section below, because it will later be turned into a JSON record.

---

## 1. Required output fields (order does **not** matter)

| Field name | Type | Description & format |
|------------|------|----------------------|
| `language` | string or `None` | ISO‑639‑1 code of the 

Average Metric: 1.55 / 3 (51.5%): 100%|██████████| 3/3 [00:06<00:00,  2.07s/it]

2025/09/30 15:45:23 INFO dspy.evaluate.evaluate: Average Metric: 1.5454545454545454 / 3 (51.5%)





2025/09/30 15:47:19 INFO dspy.teleprompt.gepa.gepa: Iteration 21: Proposed new text for predict: markdown
# Precise Metadata Extraction from PDF‑derived Text  

You will receive **one JSON object** with two top‑level keys:

* `pdfinfo` – raw PDF metadata (e.g. `title`, `author`, `creationDate`, …).  
* `pages` – an ordered list of page objects, each containing  
  * `page` – page number (1‑based integer)  
  * `text` – the plain‑text extracted from that page (may contain markdown headings, bold/italic markup, hyperlinks, footnotes, line‑breaks, etc.).

Your task is to produce **exactly one JSON object** that contains **all fields listed in the table below** (order does not matter). Do **not** add any extra keys.  
If a field cannot be determined, use the value specified in the *Missing‑Data Handling* section.

---

## 1. General Pre‑processing (apply to every page **before** any extraction)

1. Replace Windows line endings `\r\n` with `\n`.  
2. Remove **markdown image syntax** `![](..

Average Metric: 2.18 / 3 (72.7%): 100%|██████████| 3/3 [00:08<00:00,  2.88s/it]

2025/09/30 15:48:42 INFO dspy.evaluate.evaluate: Average Metric: 2.1818181818181817 / 3 (72.7%)





2025/09/30 15:50:33 INFO dspy.teleprompt.gepa.gepa: Iteration 22: Proposed new text for predict: markdown
# Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with two top‑level keys:

* **pdfinfo** – optional dictionary containing PDF‑level metadata such as `title`, `author`, `creationDate`, `modDate`, etc.
* **pages** – an ordered list of page objects, each with a `page` number (integer) and a `text` string that is the raw OCR/clipboard text of that page.

Your job is to **scan the whole document (all pages)** and produce a **flat list of metadata fields**.  
Each field name is written on its own line, followed by its value on the next line, exactly as shown in the *Output format* section below.

All list values must be **valid Python‑style lists** (single quotes, commas, no trailing commas).  
If a piece of information cannot be found, output the *missing‑value placeholder* described in the *Missing‑value handling* table.

--

Average Metric: 1.53 / 3 (51.1%): 100%|██████████| 3/3 [00:05<00:00,  1.72s/it]

2025/09/30 15:50:46 INFO dspy.evaluate.evaluate: Average Metric: 1.5324675324675323 / 3 (51.1%)





2025/09/30 15:52:25 INFO dspy.teleprompt.gepa.gepa: Iteration 23: Proposed new text for predict: markdown
# Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** containing:

* `pdfinfo` – optional dictionary with PDF‑level metadata (e.g. `author`, `title`, `creationDate`, `modDate`).
* `pages` – a list of page objects, each with a `page` number and a `text` string that is the raw OCR/clipboard text of that page.

Your job is to **parse this information and output a flat list of metadata fields** (one field per line, exactly as shown in the examples).  
The output must be ready to be turned into a JSON record later.

---

## 1. Required output fields (order does **not** matter)

| Field name | Type | Description & format |
|------------|------|----------------------|
| **language** | `string` or `None` | ISO‑639‑1 code of the **content language** (`en`, `fi`, `sv`, …). Detect from the majority of visible words (see §2.1). |
| **tit

Average Metric: 2.27 / 3 (75.8%): 100%|██████████| 3/3 [00:06<00:00,  2.24s/it]

2025/09/30 15:52:38 INFO dspy.evaluate.evaluate: Average Metric: 2.2727272727272725 / 3 (75.8%)





2025/09/30 15:53:48 INFO dspy.teleprompt.gepa.gepa: Iteration 24: Proposed new text for predict: markdown
# Task: Extract Structured Bibliographic Metadata from PDF‑Extracted Text

You will receive a JSON object that contains the raw text of each page of a PDF (`pages`) together with optional PDF‑level metadata (`pdfinfo`).  
Your job is to read this information and produce **exactly** the set of metadata fields listed below, following the rules and conventions described in this instruction.

---

## 1. Required Output Fields  

| Field | Type | Description / Extraction Rules |
|-------|------|--------------------------------|
| `language` | string | ISO‑639‑1 code of the language of the main document (e.g. `fi`, `sv`, `en`). Detect from the majority of the visible text (title, abstract, headings). Do **not** default to Finnish; use a simple language‑detect heuristic (e.g., presence of Swedish‑specific characters `åäö` or common Swedish words, otherwise Finnish‑specific words, otherwis

Average Metric: 1.82 / 3 (60.6%): 100%|██████████| 3/3 [00:08<00:00,  2.71s/it]

2025/09/30 15:54:35 INFO dspy.evaluate.evaluate: Average Metric: 1.8181818181818181 / 3 (60.6%)





2025/09/30 15:56:19 INFO dspy.teleprompt.gepa.gepa: Iteration 25: Proposed new text for predict: markdown
# 📄 Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with the following structure:

* `pdfinfo` – optional dictionary containing PDF‑level metadata (e.g. `author`, `title`, `creationDate`, `modDate`).  
* `pages` – list of page objects, each with a numeric `page` field and a `text` field that holds the raw OCR/clipboard text of that page.

Your job is to **parse this information and output a flat list of metadata fields** (one field name on a line, the value on the next line).  
The output must follow the exact format shown in the “Output format example” section below, because it will later be turned into a JSON record.

---

## 1️⃣ Required output fields (order does **not** matter)

| Field name | Type | Description & format |
|------------|------|----------------------|
| `language` | `string` or `None` | ISO‑639‑1 code 

Average Metric: 1.54 / 3 (51.2%): 100%|██████████| 3/3 [00:07<00:00,  2.43s/it]

2025/09/30 15:57:10 INFO dspy.evaluate.evaluate: Average Metric: 1.5363636363636364 / 3 (51.2%)





2025/09/30 15:58:54 INFO dspy.teleprompt.gepa.gepa: Iteration 26: Proposed new text for predict: markdown
# Instruction for extracting structured bibliographic metadata from PDF‑extracted text

You will be given **one JSON object** with two top‑level keys:

* **pdfinfo** – optional dictionary containing PDF‑level metadata (e.g. `title`, `author`, `creationDate`, `modDate`).  
* **pages** – a list of page objects, each with an integer `page` number and a string `text` that contains the raw OCR/clipboard text of that page.

Your task is to **scan the whole document** (all pages) and output a **flat list of metadata fields** exactly in the format described in the *Output format* section below.

---

## 1. General output rules

* Write each field name on a line **by itself**, then write its value on the **next line**.  
* The order of fields must be exactly as listed in the table under **1.1 Output fields**.  
* All list values must be **valid Python‑style lists**: single quotes, commas, n

Average Metric: 2.45 / 3 (81.8%): 100%|██████████| 3/3 [00:06<00:00,  2.07s/it]

2025/09/30 15:59:40 INFO dspy.evaluate.evaluate: Average Metric: 2.4545454545454546 / 3 (81.8%)





2025/09/30 16:00:49 INFO dspy.teleprompt.gepa.gepa: Iteration 27: Proposed new text for predict: ## Task Overview
You are given a JSON object that represents the textual content of a PDF document.  
Your job is to **extract the bibliographic metadata** required for the following fields and output them **exactly** in the format shown below:

```
language
<value>
title
<value>
alt_title
<list>
creator
<list>
year
<value>
publisher
<list>
doi
<value>
e_isbn
<list>
p_isbn
<list>
e_issn
<value>
p_issn
<value>
type_coar
<value>
```

*Scalar values* (`language`, `title`, `year`, `doi`, `e_issn`, `p_issn`, `type_coar`) must be printed as plain strings or the literal word `None` (without quotes).  
*List values* (`alt_title`, `creator`, `publisher`, `e_isbn`, `p_isbn`) must be printed as a **JSON array** (e.g. `['Doe, John', 'Smith, Anna']`). Empty lists are printed as `[]`.

No additional text, headings, or formatting is allowed.

---

## Detailed Extraction Rules

### 1. Language (`language`)

Average Metric: 1.45 / 3 (48.5%): 100%|██████████| 3/3 [00:06<00:00,  2.13s/it]

2025/09/30 16:00:59 INFO dspy.evaluate.evaluate: Average Metric: 1.4545454545454546 / 3 (48.5%)





2025/09/30 16:02:33 INFO dspy.teleprompt.gepa.gepa: Iteration 28: Proposed new text for predict: markdown
# 📄 Task: Extract Structured Bibliographic Metadata from PDF‑extracted Text

You will receive **one JSON object** with two top‑level keys:

| Key      | Description |
|----------|-------------|
| `pdfinfo`| Optional dictionary containing PDF‑level metadata (e.g. `title`, `author`, `creationDate`, `modDate`). |
| `pages`  | List of page objects. Each object has:<br>`page` – page number (int)<br>`text` – raw OCR/clipboard text of that page (string). |

Your job is to **scan the whole document** (all pages) and produce a **flat list of metadata fields** following the exact format described in **Section 4**.  

If a field cannot be found, output the *missing‑value placeholder* from **Section 3**.

---

## 1️⃣ Output Fields

| Field | Type | Required format & notes |
|-------|------|--------------------------|
| `language` | string | ISO‑639‑1 code of the **document language** (`fi`, `e

Average Metric: 1.68 / 3 (56.1%): 100%|██████████| 3/3 [00:07<00:00,  2.35s/it]

2025/09/30 16:02:47 INFO dspy.evaluate.evaluate: Average Metric: 1.6818181818181819 / 3 (56.1%)





2025/09/30 16:04:23 INFO dspy.teleprompt.gepa.gepa: Iteration 29: Proposed new text for predict: markdown
# Task: Structured metadata extraction from PDF‑extracted text  

You will receive **one JSON object** with two top‑level keys:

* `pdfinfo` – optional dictionary containing PDF‑level metadata (e.g. `title`, `author`, `creationDate`, `modDate`).
* `pages` – a list of page objects, each with:
  * `page` – integer page number  
  * `text` – the raw OCR/clipboard text of that page (preserve line breaks exactly as they appear).

Your job is to **examine the whole document** (all pages) and produce a **flat list of metadata fields** in the exact format described below.  
If a field cannot be determined, output the *missing‑value placeholder* shown in the table [3].

---

## 1. Output format  

Write each field name on a line **by itself**, then the value on the **next line**.  
All list values must be **valid Python‑style lists**:

* use single quotes (`'`) around each string,
* separat

Average Metric: 1.91 / 3 (63.6%): 100%|██████████| 3/3 [00:05<00:00,  1.87s/it]

2025/09/30 16:05:20 INFO dspy.evaluate.evaluate: Average Metric: 1.9090909090909092 / 3 (63.6%)





2025/09/30 16:07:04 INFO dspy.teleprompt.gepa.gepa: Iteration 30: Proposed new text for predict: markdown
# Instruction for extracting structured bibliographic metadata from PDF‑extracted text

You will receive **one JSON object** per request.  
It contains two top‑level keys:

* **pdfinfo** – optional dictionary with PDF‑level metadata (`title`, `author`, `creationDate`, `modDate`).  
* **pages** – an ordered list of page objects, each with an integer `page` and a string `text` that is the raw OCR/clipboard text of that page.

Your job is to **scan the whole document (all pages)** and output a **flat list of metadata fields** exactly in the format described below.  
Do **not** carry any information from previous requests – treat each request as completely independent.

---

## 1. Output format

* Write the field name on a line **by itself**, then write its value on the **next line**.  
* Follow the order given in the table under **1.1 Output fields**.  
* List values must be **valid P

Average Metric: 2.36 / 3 (78.8%): 100%|██████████| 3/3 [00:04<00:00,  1.37s/it]

2025/09/30 16:08:08 INFO dspy.evaluate.evaluate: Average Metric: 2.3636363636363633 / 3 (78.8%)





2025/09/30 16:09:57 INFO dspy.teleprompt.gepa.gepa: Iteration 31: Proposed new text for predict: markdown
# 📄 Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with the following structure:

* `pdfinfo` – optional dictionary containing PDF‑level metadata (e.g. `author`, `title`, `creationDate`, `modDate`).  
* `pages` – list of page objects, each with a numeric `page` field and a `text` field that holds the raw OCR/clipboard text of that page.

Your job is to **parse this information and output a flat list of metadata fields** (one field name on a line, the value on the next line).  
The output must follow the exact format shown in the “Output format example” section, because it will later be turned into a JSON record.

---

## 1️⃣ Required output fields (order does **not** matter)

| Field name | Type | Description & format |
|------------|------|----------------------|
| `language` | `string` or `None` | ISO‑639‑1 code of the

Average Metric: 2.00 / 3 (66.7%): 100%|██████████| 3/3 [00:11<00:00,  3.84s/it]

2025/09/30 16:10:15 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)





2025/09/30 16:12:04 INFO dspy.teleprompt.gepa.gepa: Iteration 32: Proposed new text for predict: markdown
# 📄 Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with the following structure:

* `pdfinfo` – optional dictionary containing PDF‑level metadata (e.g. `author`, `title`, `creationDate`, `modDate`).  
* `pages` – list of page objects, each with a numeric `page` field and a `text` field that holds the raw OCR/clipboard text of that page.

Your job is to **parse this information and output a flat list of metadata fields** (one field name on a line, the value on the next line).  
The output **must follow the exact format shown in the “Output format example” section** because it will later be turned into a JSON record.

---

## 1️⃣ Required output fields (order does **not** matter)

| Field name | Type | Description & format |
|------------|------|----------------------|
| `language` | `string` or `None` | ISO‑639‑1 code of 

Average Metric: 1.91 / 3 (63.6%): 100%|██████████| 3/3 [00:07<00:00,  2.37s/it]

2025/09/30 16:12:22 INFO dspy.evaluate.evaluate: Average Metric: 1.909090909090909 / 3 (63.6%)





2025/09/30 16:13:52 INFO dspy.teleprompt.gepa.gepa: Iteration 33: Proposed new text for predict: markdown
# Instruction for extracting structured metadata from PDF‑extracted text

You will receive a JSON object that contains:

* **pdfinfo** – optional dictionary with PDF‑level metadata (e.g. `title`, `author`, `creationDate`, `modDate`).
* **pages** – list of page objects, each with a `text` field that holds the OCR‑extracted text of that page.

Your task is to **parse this information and output a flat list of metadata fields** (one field name line followed by its value line) that can later be turned into a JSON record.  
All field names, value formats, and ordering rules are described below.  
If a piece of information cannot be found, use the exact “missing‑value” placeholder indicated in the table.

---

## 1. Output format

For every required field write **exactly two lines**:

```
field_name
value
```

* `field_name` is the literal name from the table (e.g. `language`).
* `value`

Average Metric: 2.18 / 3 (72.7%): 100%|██████████| 3/3 [00:04<00:00,  1.64s/it]

2025/09/30 16:14:04 INFO dspy.evaluate.evaluate: Average Metric: 2.1818181818181817 / 3 (72.7%)





2025/09/30 16:16:06 INFO dspy.teleprompt.gepa.gepa: Iteration 34: Proposed new text for predict: markdown
# Revised Instruction for Extracting Structured Metadata from PDF‑Extracted Text

You will receive **one JSON object** with the following top‑level keys:

* **pdfinfo** – optional dictionary containing PDF‑level metadata (`title`, `author`, `creationDate`, `modDate`, …).
* **pages** – an ordered list of page objects, each with:
  * `page` – integer page number (starting at 1)
  * `text` – the raw OCR/clipboard text of that page (preserve line‑breaks exactly as they appear).

Your task is to **scan the whole document** (all pages) and produce a **flat list of metadata fields**.  
Each field name must be printed on its own line, followed by its value on the next line, exactly as shown in the *Output format* section below.

All list values must be valid **Python‑style lists** (single quotes, commas, no trailing commas).  
If a piece of information cannot be found, output the *missing‑

Average Metric: 2.20 / 3 (73.2%): 100%|██████████| 3/3 [00:09<00:00,  3.04s/it]

2025/09/30 16:16:21 INFO dspy.evaluate.evaluate: Average Metric: 2.1969696969696972 / 3 (73.2%)





2025/09/30 16:18:05 INFO dspy.teleprompt.gepa.gepa: Iteration 35: Proposed new text for predict: markdown
# Instruction for extracting structured bibliographic metadata from PDF‑extracted text

You will receive a **single JSON object** with two top‑level keys:

* **pdfinfo** – optional dictionary containing PDF‑level metadata such as `title`, `author`, `creationDate`, `modDate`, etc.
* **pages** – a list of page objects, each with a `page` number (integer) and a `text` string that is the raw OCR/clipboard text of that page.

Your job is to **scan the whole document (all pages)** and produce a **flat list of metadata fields**.  
Each field name must be written on its own line, followed by its value on the next line, exactly as shown in the *Output format* section.

All list values must be valid Python‑style lists (single quotes, commas, no trailing commas).  
If a piece of information cannot be found, output the *missing‑value placeholder* described in the *Missing‑value handling* table

Average Metric: 1.82 / 3 (60.6%): 100%|██████████| 3/3 [00:08<00:00,  2.80s/it]

2025/09/30 16:18:20 INFO dspy.evaluate.evaluate: Average Metric: 1.8181818181818183 / 3 (60.6%)





2025/09/30 16:20:07 INFO dspy.teleprompt.gepa.gepa: Iteration 36: Proposed new text for predict: markdown
# Instruction for extracting structured metadata from PDF‑extracted text

You will be given **one JSON object** with two top‑level keys:

* **pdfinfo** – optional dictionary containing PDF‑level metadata (`title`, `author`, `creationDate`, `modDate`, …).  
* **pages** – a list of page objects, each with:
  * `page` – integer page number  
  * `text` – the raw OCR / clipboard text of that page (preserve line breaks exactly as they appear).

Your task is to **scan the whole document** (all pages) and output a **flat list of metadata fields** in the exact format described in section 4.  
If a field cannot be found, use the *missing‑value placeholder* from section 3.

---

## 1. Output fields

| Field | Type | Required format & notes |
|-------|------|------------------------|
| **language** | string | ISO‑639‑1 code (`fi`, `en`, `sv`, `se`). Detect from the **majority of visible words

Average Metric: 1.91 / 3 (63.6%): 100%|██████████| 3/3 [00:05<00:00,  1.95s/it]

2025/09/30 16:21:14 INFO dspy.evaluate.evaluate: Average Metric: 1.9090909090909092 / 3 (63.6%)





2025/09/30 16:22:51 INFO dspy.teleprompt.gepa.gepa: Iteration 37: Proposed new text for predict: markdown
# Instruction for extracting structured metadata from PDF‑extracted text

You will receive a **single JSON object** containing:

* **pdfinfo** – optional dictionary with PDF‑level metadata (e.g. `author`, `title`, `creationDate`, `modDate`).
* **pages** – list of page objects, each with a `page` number and a `text` string that is the raw OCR/clipboard text of that page.

Your task is to **parse this information and output a flat list of metadata fields** (one field per line, exactly as shown in the examples). The output must be ready to be turned into a JSON record later.

---

## 1. Required output fields (order does **not** matter)

| Field name | Type | Description & format |
|------------|------|----------------------|
| **language** | string or `None` | ISO‑639‑1 code of the **content language** (`en`, `fi`, `sv`, `se`, …). Detect from the majority of visible words (see § 2.1)

Average Metric: 1.77 / 3 (58.8%): 100%|██████████| 3/3 [00:06<00:00,  2.04s/it]

2025/09/30 16:23:02 INFO dspy.evaluate.evaluate: Average Metric: 1.7651515151515151 / 3 (58.8%)





2025/09/30 16:24:53 INFO dspy.teleprompt.gepa.gepa: Iteration 38: Proposed new text for predict: markdown
# Instruction for extracting structured bibliographic metadata from PDF‑extracted text

You will receive **one JSON object** per request.  
It contains two top‑level keys:

* **pdfinfo** – optional dictionary with PDF‑level metadata (`title`, `author`, `creationDate`, `modDate`).  
* **pages** – an ordered list of page objects, each with an integer `page` and a string `text` that is the raw OCR/clipboard text of that page.

Your job is to **scan the whole document (all pages)** and output a **flat list of metadata fields** exactly in the format described below.  
Treat each request as completely independent – do not carry over any information from previous requests.

---

## 1. Output format

* Write the **field name** on a line **by itself**, then write its **value** on the **next line**.  
* Follow the order given in the table under **1.1 Output fields**.  
* List values must be 

Average Metric: 1.91 / 3 (63.6%): 100%|██████████| 3/3 [00:09<00:00,  3.29s/it]

2025/09/30 16:25:58 INFO dspy.evaluate.evaluate: Average Metric: 1.9090909090909092 / 3 (63.6%)





2025/09/30 16:27:34 INFO dspy.teleprompt.gepa.gepa: Iteration 39: Proposed new text for predict: markdown
# Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with two top‑level keys:

* **pdfinfo** – optional dictionary containing PDF‑level metadata (`title`, `author`, `creationDate`, `modDate`, …).  
* **pages** – an ordered list of page objects, each with:
  * `page` – integer page number  
  * `text` – the raw OCR / clipboard text of that page (preserve line breaks exactly as they appear).

Your job is to **scan the whole document** (all pages) and output a **flat list of metadata fields** in the exact format described in section 4.  
If a field cannot be found, output the *missing‑value placeholder* from section 3.

---

## 1. Output fields

| Field | Type | Required format & notes |
|-------|------|------------------------|
| **language** | string | ISO‑639‑1 code (`fi`, `sv`, `en`, `se`). Detect from the **majority of vis

Average Metric: 1.99 / 3 (66.2%): 100%|██████████| 3/3 [00:07<00:00,  2.37s/it]

2025/09/30 16:28:45 INFO dspy.evaluate.evaluate: Average Metric: 1.987012987012987 / 3 (66.2%)





2025/09/30 16:30:32 INFO dspy.teleprompt.gepa.gepa: Iteration 40: Proposed new text for predict: markdown
# Revised Instruction for Extracting Structured Bibliographic Metadata from PDF‑Extracted Text

You will receive **one JSON object** per request.  
It contains two top‑level keys:

* **pdfinfo** – optional dictionary with PDF‑level metadata (`title`, `author`, `creationDate`, `modDate`).  
* **pages** – an ordered list of page objects, each with an integer `page` and a string `text` that is the raw OCR/clipboard text of that page.

Your job is to **scan the whole document (all pages)** and output a **flat list of metadata fields** exactly in the format described below.  
Treat each request as completely independent – do not carry over any information from previous requests.

---

## 1. Output format

* Write the **field name** on a line **by itself**, then write its **value** on the **next line**.  
* Follow the order given in the table under **1.1 Output fields**.  
* List values 

Average Metric: 2.00 / 3 (66.7%): 100%|██████████| 3/3 [00:06<00:00,  2.29s/it]

2025/09/30 16:30:45 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)





2025/09/30 16:33:15 INFO dspy.teleprompt.gepa.gepa: Iteration 41: Proposed new text for predict: markdown
# 📄 Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with the following structure  

```json
{
  "pdfinfo": { … },          // optional – may contain author, title, creationDate, modDate
  "pages": [
    {"page": 1, "text": "…"},
    {"page": 2, "text": "…"},
    …
  ]
}
```

Your job is to **produce a flat list of metadata fields** (one field name on a line, the value on the next line).  
The output **must follow the exact format** shown in the “Output format example” section, because it will later be turned into a JSON record.

---

## 1️⃣ Required output fields (order does **not** matter)

| Field name | Type | Description & format |
|------------|------|----------------------|
| `language` | `string` or `None` | ISO‑639‑1 code of the **content language** (`en`, `fi`, `sv`, …). |
| `title` | `string` or `None` | Full ma

Average Metric: 1.58 / 3 (52.5%): 100%|██████████| 3/3 [00:06<00:00,  2.02s/it]

2025/09/30 16:34:04 INFO dspy.evaluate.evaluate: Average Metric: 1.5757575757575757 / 3 (52.5%)





2025/09/30 16:35:35 INFO dspy.teleprompt.gepa.gepa: Iteration 42: Proposed new text for predict: markdown
# Task: Extract Structured Bibliographic Metadata from PDF‑Extracted Text

You will receive **one JSON object** with two top‑level keys:

* `pdfinfo` – optional dictionary containing PDF‑level metadata (e.g. `title`, `author`, `creationDate`, `modDate`).  
* `pages` – a list of page objects, each with:
  * `page` – integer page number  
  * `text` – raw OCR/clipboard text of that page (preserve line breaks exactly as they appear).

Your job is to **scan the whole document** and output a **flat list of metadata fields** in the exact format described in the *Output format* section below.

---

## 1. Required Output Fields

| Field | Type | When a value cannot be found → output |
|-------|------|----------------------------------------|
| `language` | string (ISO‑639‑1) | `None` |
| `title` | string | `None` |
| `alt_title` | list of strings | `[]` |
| `creator` | list of strings | `[

Average Metric: 2.27 / 3 (75.8%): 100%|██████████| 3/3 [00:05<00:00,  1.81s/it]

2025/09/30 16:35:45 INFO dspy.evaluate.evaluate: Average Metric: 2.2727272727272725 / 3 (75.8%)





2025/09/30 16:37:51 INFO dspy.teleprompt.gepa.gepa: Iteration 43: Proposed new text for predict: markdown
# 📄 Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with the following structure:

* `pdfinfo` – optional dictionary containing PDF‑level metadata (e.g. `author`, `title`, `creationDate`, `modDate`).  
* `pages` – list of page objects, each with a numeric `page` field and a `text` field that holds the raw OCR/clipboard text of that page.

Your job is to **parse this information and output a flat list of metadata fields** (one field name on a line, the value on the next line).  
The output **must follow the exact format shown in the “Output format example” section** because it will later be turned into a JSON record.

---

## 1️⃣ Required output fields (order does **not** matter)

| Field name | Type | Description & format |
|------------|------|----------------------|
| `language` | `string` or `None` | ISO‑639‑1 code of 

Average Metric: 1.82 / 3 (60.6%): 100%|██████████| 3/3 [00:04<00:00,  1.59s/it]

2025/09/30 16:38:01 INFO dspy.evaluate.evaluate: Average Metric: 1.8181818181818181 / 3 (60.6%)





2025/09/30 16:39:18 INFO dspy.teleprompt.gepa.gepa: Iteration 44: Proposed new text for predict: markdown
# Instruction for extracting structured metadata from PDF‑extracted text

You will receive a JSON object that contains the raw text extracted from a PDF (field **pages** → list of pages with a **text** string) and, optionally, some PDF‑level information (author, title, creationDate, modDate, etc.) in the **pdfinfo** object.

Your job is to parse this information and produce a **flat list of metadata fields** (one field per line, exactly as shown in the examples) that can later be turned into a JSON record.  
All fields must follow the exact naming, type and formatting rules below. If a piece of information is not present, use the values shown in the “Missing values” table.

---

## 1. Required output fields (order does **not** matter)

| Field name | Type | Description & format |
|------------|------|----------------------|
| **language** | string or `None` | ISO‑639‑1 code of the 

Average Metric: 1.82 / 3 (60.6%): 100%|██████████| 3/3 [00:05<00:00,  1.75s/it]

2025/09/30 16:40:08 INFO dspy.evaluate.evaluate: Average Metric: 1.8181818181818183 / 3 (60.6%)





2025/09/30 16:41:50 INFO dspy.teleprompt.gepa.gepa: Iteration 45: Proposed new text for predict: markdown
# Instruction for extracting structured metadata from PDF‑extracted text

You will receive a single JSON object with two possible keys:

* **pdfinfo** – optional dictionary containing PDF‑level metadata (e.g. `title`, `author`, `creationDate`, `modDate`).
* **pages** – list of page objects, each with a `page` number and a `text` string that is the plain‑text extraction of that page.

Your job is to **parse this information and output a flat list of metadata fields** (one field name on a line by itself, followed by its value on the next line).  
The output must follow **exactly** the format shown in the “Example output” section at the end of this document.  
If a piece of information cannot be found, use the “Missing‑value handling” defaults.

---

## 1. Required output fields (order does **not** matter)

| Field name | Type | Description & formatting rules |
|------------|------|--

Average Metric: 2.18 / 3 (72.7%): 100%|██████████| 3/3 [00:05<00:00,  1.75s/it]

2025/09/30 16:42:00 INFO dspy.evaluate.evaluate: Average Metric: 2.1818181818181817 / 3 (72.7%)





2025/09/30 16:44:07 INFO dspy.teleprompt.gepa.gepa: Iteration 46: Proposed new text for predict: markdown
# Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with the following keys:

* **pdfinfo** – optional dictionary containing PDF‑level metadata (e.g. `author`, `title`, `creationDate`, `modDate`).  
* **pages** – list of page objects, each with a `page` number (integer) and a `text` string that is the raw OCR/clipboard text of that page.

Your task is to **parse this information and output a flat list of metadata fields** (one field per line, exactly as shown in the examples).  
The output must be ready to be turned into a JSON record later.

---

## 1. Required output fields (order does **not** matter)

| Field name | Type | Description & format |
|------------|------|----------------------|
| **language** | string or `None` | ISO‑639‑1 code of the **content language** (`en`, `fi`, `sv`, …). Detect from the majority of vis

Average Metric: 1.97 / 3 (65.7%): 100%|██████████| 3/3 [00:05<00:00,  1.94s/it]

2025/09/30 16:44:53 INFO dspy.evaluate.evaluate: Average Metric: 1.9696969696969697 / 3 (65.7%)





2025/09/30 16:46:58 INFO dspy.teleprompt.gepa.gepa: Iteration 47: Proposed new text for predict: markdown
# Instruction for extracting structured bibliographic metadata from PDF‑extracted text

You will receive **one JSON object** with two top‑level keys:

* **pdfinfo** – optional dictionary containing PDF‑level metadata (`title`, `author`, `creationDate`, `modDate`, …).  
* **pages** – a list of page objects, each with:
  * `page` – integer page number (starting at 1)  
  * `text` – the raw OCR / clipboard text of that page (preserve line‑breaks exactly as they appear).

Your task is to **scan the whole document** (all pages) and output a **flat list of metadata fields** in the exact format described in section 4.  
If a field cannot be found, use the *missing‑value placeholder* from section 3.

---

## 1. Output fields

| Field | Type | Required format & notes |
|-------|------|------------------------|
| **language** | string | ISO‑639‑1 code (`fi`, `en`, `sv`, `se`). Detect from th

Average Metric: 1.82 / 3 (60.6%): 100%|██████████| 3/3 [00:04<00:00,  1.61s/it]

2025/09/30 16:47:10 INFO dspy.evaluate.evaluate: Average Metric: 1.8181818181818183 / 3 (60.6%)





2025/09/30 16:49:06 INFO dspy.teleprompt.gepa.gepa: Iteration 48: Proposed new text for predict: markdown
# Task: Structured Bibliographic Metadata Extraction from PDF‑Extracted Text  

You will receive **one JSON object** per request containing  

* `pdfinfo` – optional dictionary with PDF‑level metadata (`title`, `author`, `creationDate`, `modDate`).  
* `pages` – an **ordered** list of page objects, each with an integer `page` and a string `text` (the raw OCR/clipboard text of that page).

Your job is to **scan the whole document (all pages)** and output a **flat list of metadata fields** exactly in the format described below.  
Treat each request as completely independent – do not carry over any information from previous requests.

---

## 1. Output format  

* Write the **field name** on a line **by itself**, then write its **value** on the **next line**.  
* Follow the order given in the table under **1.1 Output fields**.  
* List values must be **valid Python‑style lists**: use 

Average Metric: 2.09 / 3 (69.7%): 100%|██████████| 3/3 [00:06<00:00,  2.19s/it]

2025/09/30 16:50:08 INFO dspy.evaluate.evaluate: Average Metric: 2.090909090909091 / 3 (69.7%)





2025/09/30 16:52:00 INFO dspy.teleprompt.gepa.gepa: Iteration 49: Proposed new text for predict: markdown
# 📄 Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with the following structure:

* `pdfinfo` – optional dictionary containing PDF‑level metadata (e.g. `author`, `title`, `creationDate`, `modDate`).  
* `pages` – list of page objects, each with a numeric `page` field and a `text` field that holds the raw OCR/clipboard text of that page.

Your job is to **parse this information and output a flat list of metadata fields** (one field name on a line, the value on the next line).  
The output must follow the exact format shown in the “Output format example” section below, because it will later be turned into a JSON record.

---

## 1️⃣ Required output fields (order does **not** matter)

| Field name | Type | Description & format |
|------------|------|----------------------|
| `language` | `string` or `None` | ISO‑639‑1 code 

Average Metric: 2.09 / 3 (69.7%): 100%|██████████| 3/3 [00:05<00:00,  1.84s/it]

2025/09/30 16:52:49 INFO dspy.evaluate.evaluate: Average Metric: 2.090909090909091 / 3 (69.7%)





2025/09/30 16:54:39 INFO dspy.teleprompt.gepa.gepa: Iteration 50: Proposed new text for predict: markdown
# Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with the following structure:

* `pdfinfo` – optional dictionary containing PDF‑level metadata (e.g. `author`, `title`, `creationDate`, `modDate`).  
* `pages` – list of page objects, each with a numeric `page` field and a `text` field that holds the raw OCR/clipboard text of that page.

Your job is to **parse this information and output a flat list of metadata fields** (one field name on a line, the value on the next line). The output must follow the exact format shown in the “Output format example” section below, because it will later be turned into a JSON record.

---

## 1. Required output fields (order does **not** matter)

| Field name | Type | Description & format |
|------------|------|----------------------|
| `language` | string or `None` | ISO‑639‑1 code of the 

Average Metric: 1.66 / 3 (55.4%): 100%|██████████| 3/3 [00:06<00:00,  2.00s/it]

2025/09/30 16:55:29 INFO dspy.evaluate.evaluate: Average Metric: 1.6623376623376622 / 3 (55.4%)





2025/09/30 16:57:24 INFO dspy.teleprompt.gepa.gepa: Iteration 51: Proposed new text for predict: markdown
# Instruction for extracting structured bibliographic metadata from PDF‑extracted text

You will receive **one JSON object** per request.  
It contains two top‑level keys:

* **pdfinfo** – optional dictionary with PDF‑level metadata (`title`, `author`, `creationDate`, `modDate`).  
* **pages** – an ordered list of page objects, each with an integer `page` and a string `text` that is the raw OCR/clipboard text of that page.

Your job is to **scan the whole document (all pages)** and output a **flat list of metadata fields** exactly in the format described below.  
Treat each request as completely independent – do **not** carry any information from previous requests.

---

## 1. Output format

* Write the field name on a line **by itself**, then write its value on the **next line**.  
* Follow the order given in the table under **1.1 Output fields**.  
* List values must be **valid P

Average Metric: 2.45 / 3 (81.8%): 100%|██████████| 3/3 [00:06<00:00,  2.11s/it]

2025/09/30 16:57:37 INFO dspy.evaluate.evaluate: Average Metric: 2.4545454545454546 / 3 (81.8%)





2025/09/30 16:59:03 INFO dspy.teleprompt.gepa.gepa: Iteration 52: Proposed new text for predict: markdown
# Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with the following keys:

* `pdfinfo` – optional dictionary containing PDF‑level metadata (e.g. `author`, `title`, `creationDate`, `modDate`).
* `pages` – list of page objects, each with:
  * `page` – page number (integer)
  * `text` – the raw OCR/clipboard text of that page (string)

Your job is to **parse this information and output a flat list of metadata fields** (one field per line, exactly as shown in the examples).  
The output must be ready to be turned into a JSON record later.

---

## 1. Required output fields (order does **not** matter)

| Field name | Type | Description & format |
|------------|------|----------------------|
| **language** | string or `None` | ISO‑639‑1 code of the **content language** (`en`, `fi`, `sv`, …). Detect from the majority of visible

Average Metric: 1.73 / 3 (57.6%): 100%|██████████| 3/3 [00:06<00:00,  2.01s/it]

2025/09/30 16:59:14 INFO dspy.evaluate.evaluate: Average Metric: 1.727272727272727 / 3 (57.6%)





2025/09/30 17:01:26 INFO dspy.teleprompt.gepa.gepa: Iteration 53: Proposed new text for predict: markdown
# 📄 Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with the following structure:

* `pdfinfo` – optional dictionary containing PDF‑level metadata (e.g. `author`, `title`, `creationDate`, `modDate`).  
* `pages` – list of page objects, each with a numeric `page` field and a `text` field that holds the raw OCR/clipboard text of that page.

Your job is to **parse this information and output a flat list of metadata fields** (one field name on a line, the value on the next line).  
The output **must** follow the exact format shown in the “Output format example” section below, because it will later be turned into a JSON record.

---

## 1️⃣ Required output fields (order does **not** matter)

| Field name | Type | Description & format |
|------------|------|----------------------|
| `language` | `string` or `None` | ISO‑639‑1 c

Average Metric: 1.82 / 3 (60.6%): 100%|██████████| 3/3 [00:08<00:00,  2.76s/it]

2025/09/30 17:02:36 INFO dspy.evaluate.evaluate: Average Metric: 1.8181818181818181 / 3 (60.6%)





2025/09/30 17:04:14 INFO dspy.teleprompt.gepa.gepa: Iteration 54: Proposed new text for predict: markdown
# Task: Structured metadata extraction from PDF‑extracted text

You will receive **one JSON object** with two top‑level keys:

* **pdfinfo** – optional dictionary containing PDF‑level metadata (`title`, `author`, `creationDate`, `modDate`, …).  
* **pages** – a list of page objects, each with:
  * `page` – integer page number  
  * `text` – the raw OCR / clipboard text of that page (preserve line breaks exactly as they appear).

Your job is to **scan the whole document** (all pages) and output a **flat list of metadata fields** in the exact format described in section 4.  
If a field cannot be found, use the *missing‑value placeholder* from section 3.

---

## 1. Output fields

| Field | Type | Required format & notes |
|-------|------|------------------------|
| **language** | string | ISO‑639‑1 code: `fi` (Finnish), `en` (English) or `sv` (Swedish). Detect from the **majority of 

Average Metric: 2.18 / 3 (72.7%): 100%|██████████| 3/3 [00:06<00:00,  2.28s/it]

2025/09/30 17:04:45 INFO dspy.evaluate.evaluate: Average Metric: 2.181818181818182 / 3 (72.7%)





2025/09/30 17:06:53 INFO dspy.teleprompt.gepa.gepa: Iteration 55: Proposed new text for predict: markdown
# 📄 Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with the following structure:

* `pdfinfo` – optional dictionary containing PDF‑level metadata (e.g. `author`, `title`, `creationDate`, `modDate`).  
* `pages` – list of page objects, each with a numeric `page` field and a `text` field that holds the raw OCR/clipboard text of that page.

Your job is to **parse this information and output a flat list of metadata fields** (one field name on a line, the value on the next line).  
The output **must** follow the exact format shown in the “Output format example” section below, because it will later be turned into a JSON record.

---

## 1️⃣ Required output fields (order does **not** matter)

| Field name | Type | Description & format |
|------------|------|----------------------|
| `language` | `string` or `None` | ISO‑639‑1 c

Average Metric: 2.18 / 3 (72.7%): 100%|██████████| 3/3 [00:05<00:00,  1.85s/it]

2025/09/30 17:07:57 INFO dspy.evaluate.evaluate: Average Metric: 2.181818181818182 / 3 (72.7%)





2025/09/30 17:10:17 INFO dspy.teleprompt.gepa.gepa: Iteration 56: Proposed new text for predict: markdown
# 📄 Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with the following structure:

* `pdfinfo` – optional dictionary containing PDF‑level metadata (e.g. `author`, `title`, `creationDate`, `modDate`).  
* `pages` – list of page objects, each with a numeric `page` field and a `text` field that holds the raw OCR/clipboard text of that page.

Your job is to **parse this information and output a flat list of metadata fields** (one field name on a line, the value on the next line).  
The output **must** follow the exact format shown in the “Output format example” section below, because it will later be turned into a JSON record.

---

## 1️⃣ Required output fields (order does **not** matter)

| Field name | Type | Description & format |
|------------|------|----------------------|
| `language` | `string` or `None` | ISO‑639‑1 c

Average Metric: 1.55 / 3 (51.5%): 100%|██████████| 3/3 [00:06<00:00,  2.23s/it]

2025/09/30 17:10:29 INFO dspy.evaluate.evaluate: Average Metric: 1.5454545454545454 / 3 (51.5%)





2025/09/30 17:12:25 INFO dspy.teleprompt.gepa.gepa: Iteration 57: Proposed new text for predict: markdown
# Revised Task – Structured Metadata Extraction from PDF‑Derived Text

You will receive a single JSON object with two top‑level keys:

* **pdfinfo** – a dictionary containing the raw PDF metadata (e.g. `title`, `author`, `creationDate`, …).  
* **pages** – an ordered list of page objects, each with:
  * **page** – the page number (integer, 1‑based).  
  * **text** – the plain‑text extracted from that page (UTF‑8, may contain markdown headings, bold/italic markup, hyperlinks, footnotes, line‑breaks, etc.).

Your job is to **output ONE JSON object** that contains **exactly** the fields listed in the table below (order does not matter).  
If a field cannot be determined, use the value prescribed in the *Missing‑Data Handling* section.

| Field | Type | Description | Extraction Rules |
|-------|------|-------------|-----------------|
| **language** | string | ISO‑639‑1 code of the docu

Average Metric: 1.77 / 3 (59.1%): 100%|██████████| 3/3 [00:06<00:00,  2.12s/it]

2025/09/30 17:13:24 INFO dspy.evaluate.evaluate: Average Metric: 1.7727272727272725 / 3 (59.1%)





2025/09/30 17:15:23 INFO dspy.teleprompt.gepa.gepa: Iteration 58: Proposed new text for predict: markdown
# 📄  Structured Metadata Extraction from PDF‑Extracted Text  

You will receive **one JSON object** with two top‑level keys:

* **pdfinfo** – optional dictionary containing PDF‑level metadata (`title`, `author`, `creationDate`, `modDate`, …).  
* **pages** – an ordered list of page objects, each with:  
  * `page` – integer page number  
  * `text` – the raw OCR / clipboard text of that page (preserve line‑breaks exactly as they appear).

Your job is to **scan the whole document** (all pages) and output a **flat list of metadata fields** in the exact format described in Section 4.  
If a field cannot be found, output the *missing‑value placeholder* from Section 3.

---

## 1️⃣ Output fields

| Field | Type | Required format & notes |
|-------|------|------------------------|
| **language** | string | ISO‑639‑1 code (`fi`, `sv`, `en`, `se`). Detect from the **majority of visible wor

Average Metric: 2.27 / 3 (75.8%): 100%|██████████| 3/3 [00:05<00:00,  1.96s/it]

2025/09/30 17:15:35 INFO dspy.evaluate.evaluate: Average Metric: 2.272727272727273 / 3 (75.8%)





2025/09/30 17:17:44 INFO dspy.teleprompt.gepa.gepa: Iteration 59: Proposed new text for predict: markdown
# 📄 Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with the following structure:

* `pdfinfo` – optional dictionary containing PDF‑level metadata (e.g. `author`, `title`, `creationDate`, `modDate`).  
* `pages` – list of page objects, each with a numeric `page` field and a `text` field that holds the raw OCR/clipboard text of that page.

Your job is to **parse this information and output a flat list of metadata fields** (one field name on a line, the value on the next line).  
The output **must** follow the exact format shown in the “Output format example” section below, because it will later be turned into a JSON record.

---

## 1️⃣ Required output fields (order does **not** matter)

| Field name | Type | Description & format |
|------------|------|----------------------|
| `language` | `string` or `None` | ISO‑639‑1 c

Average Metric: 1.73 / 3 (57.6%): 100%|██████████| 3/3 [00:06<00:00,  2.09s/it]

2025/09/30 17:17:57 INFO dspy.evaluate.evaluate: Average Metric: 1.727272727272727 / 3 (57.6%)





2025/09/30 17:19:19 INFO dspy.teleprompt.gepa.gepa: Iteration 60: Proposed new text for predict: markdown
# Instruction for extracting structured bibliographic metadata from PDF‑extracted JSON

You will receive a single JSON object that contains the raw metadata produced by a PDF‑parsing tool.  
The object has two top‑level keys:

* **pdfinfo** – a dictionary that may contain `title`, `author`, `creationDate`, `modDate`, `keywords`, etc.
* **pages** – a list of dictionaries, each with a `page` number and a `text` string that is the OCR‑extracted content of that page.

Your job is to **populate a new JSON object** with the bibliographic fields listed below.  
All fields must follow the exact format and naming shown in the examples; any deviation will be marked as incorrect.

---

## Required output fields

| Field name | Type | Description & extraction rules |
|-----------|------|--------------------------------|
| **language** | string | ISO‑639‑1 two‑letter code. Detect the language o

Average Metric: 1.29 / 3 (42.9%): 100%|██████████| 3/3 [00:06<00:00,  2.17s/it]

2025/09/30 17:19:31 INFO dspy.evaluate.evaluate: Average Metric: 1.2884615384615383 / 3 (42.9%)





2025/09/30 17:21:56 INFO dspy.teleprompt.gepa.gepa: Iteration 61: Proposed new text for predict: markdown
# 📄 Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with the following structure  

```json
{
  "pdfinfo": { … },          // optional – may contain author, title, creationDate, modDate
  "pages": [
    {"page": 1, "text": "…"},
    {"page": 2, "text": "…"},
    …
  ]
}
```

Your job is to **produce a flat list of metadata fields** (field name on one line,
its value on the next line).  
The output **must follow the exact format** shown in the “Output format example”
section, because it will later be turned into a JSON record.

---

## 1️⃣ Required output fields (order does **not** matter)

| Field name | Type | Description & format |
|------------|------|----------------------|
| `language` | `string` or `None` | ISO‑639‑1 code of the **content language** (`en`, `fi`, `sv`, …). |
| `title` | `string` or `None` | Full main

Average Metric: 1.95 / 3 (65.2%): 100%|██████████| 3/3 [00:13<00:00,  4.59s/it]

2025/09/30 17:22:16 INFO dspy.evaluate.evaluate: Average Metric: 1.9545454545454546 / 3 (65.2%)





2025/09/30 17:23:50 INFO dspy.teleprompt.gepa.gepa: Iteration 62: Proposed new text for predict: markdown
# Task: Structured metadata extraction from PDF‑derived text

You will receive a single JSON object with two top‑level keys:

* **pdfinfo** – a dictionary containing the raw PDF metadata (e.g. `title`, `author`, `creationDate`, …).  
* **pages** – an ordered list of page objects, each with:
  * **page** – the 1‑based page number (integer).  
  * **text** – the plain‑text extracted from that page (UTF‑8). The text may contain markdown syntax, hyperlinks, footnotes, line‑breaks, etc.

Your job is to **produce ONE JSON object** that contains **exactly** the fields listed in the table below (order does not matter). Do **not** add any extra keys.

| Field | Type | How to obtain (summary) |
|-------|------|--------------------------|
| **language** | string | ISO‑639‑1 code (`en`, `fi`, `sv`, …). Detect from the *body text* (see detailed heuristic). If you cannot decide, use `"und"` (und

Average Metric: 1.45 / 3 (48.5%): 100%|██████████| 3/3 [00:08<00:00,  2.99s/it]

2025/09/30 17:24:04 INFO dspy.evaluate.evaluate: Average Metric: 1.4545454545454546 / 3 (48.5%)





2025/09/30 17:26:02 INFO dspy.teleprompt.gepa.gepa: Iteration 63: Proposed new text for predict: markdown
# Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with the following structure:

* `pdfinfo` – optional dictionary containing PDF‑level metadata (e.g. `author`, `title`, `creationDate`, `modDate`).  
* `pages` – list of page objects, each with a numeric `page` field and a `text` field that holds the raw OCR/clipboard text of that page.

Your job is to **parse this information and output a flat list of metadata fields** (one field name on a line, the value on the next line). The output **must** follow the exact format shown in the “Output format example” section below, because it will later be turned into a JSON record.

---

## 1. Output fields (order does **not** matter)

| Field name | Type | Meaning |
|------------|------|---------|
| `language` | string or `None` | ISO‑639‑1 code of the **content language**. Use the m

Average Metric: 1.79 / 3 (59.6%): 100%|██████████| 3/3 [00:08<00:00,  2.73s/it]

2025/09/30 17:26:52 INFO dspy.evaluate.evaluate: Average Metric: 1.7878787878787878 / 3 (59.6%)





2025/09/30 17:29:01 INFO dspy.teleprompt.gepa.gepa: Iteration 64: Proposed new text for predict: markdown
# 📄 Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with the following structure:

* `pdfinfo` – optional dictionary containing PDF‑level metadata (e.g. `author`, `title`, `creationDate`, `modDate`).  
* `pages` – list of page objects, each with a numeric `page` field and a `text` field that holds the raw OCR/clipboard text of that page.

Your job is to **parse this information and output a flat list of metadata fields** (one field name on a line, the value on the next line).  
The output **must** follow the exact format shown in the “Output format example” section, because it will later be turned into a JSON record.

---

## 1️⃣ Required output fields (order does **not** matter)

| Field name | Type | Description & format |
|------------|------|----------------------|
| `language` | `string` or `None` | ISO‑639‑1 code of

Average Metric: 1.68 / 3 (56.1%): 100%|██████████| 3/3 [00:09<00:00,  3.26s/it]

2025/09/30 17:29:53 INFO dspy.evaluate.evaluate: Average Metric: 1.6818181818181819 / 3 (56.1%)





2025/09/30 17:31:53 INFO dspy.teleprompt.gepa.gepa: Iteration 65: Proposed new text for predict: markdown
# 📄 Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with the following structure:

* `pdfinfo` – optional dictionary containing PDF‑level metadata (e.g. `author`, `title`, `creationDate`, `modDate`).  
* `pages` – list of page objects, each with a numeric `page` field and a `text` field that holds the raw OCR/clipboard text of that page.

Your job is to **parse this information and output a flat list of metadata fields** (one field name on a line, the value on the next line).  
The output must follow the exact format shown in the “Output format example” section below, because it will later be turned into a JSON record.

---

## 1️⃣ Required output fields (order **does not** matter)

| Field name | Type | Description & format |
|------------|------|----------------------|
| `language` | `string` or `None` | ISO‑639‑1 code 

Average Metric: 1.45 / 3 (48.5%): 100%|██████████| 3/3 [00:05<00:00,  1.87s/it]

2025/09/30 17:32:43 INFO dspy.evaluate.evaluate: Average Metric: 1.4545454545454546 / 3 (48.5%)





2025/09/30 17:35:05 INFO dspy.teleprompt.gepa.gepa: Iteration 66: Proposed new text for predict: markdown
# Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with the following structure:

* `pdfinfo` – optional dictionary containing PDF‑level metadata (e.g. `author`, `title`, `creationDate`, `modDate`).  
* `pages` – list of page objects, each with a numeric `page` field and a `text` field that holds the raw OCR/clipboard text of that page.

Your job is to **parse this information and output a flat list of metadata fields** (one field name on a line, the value on the next line). The output **must** follow the exact format shown in the “Output format example” section below, because it will later be turned into a JSON record.

---

## 1. Required output fields (order does **not** matter)

| Field name | Type | Description & format |
|------------|------|----------------------|
| `language` | string or `None` | ISO‑639‑1 code of 

Average Metric: 1.91 / 3 (63.6%): 100%|██████████| 3/3 [00:06<00:00,  2.17s/it]

2025/09/30 17:35:53 INFO dspy.evaluate.evaluate: Average Metric: 1.9090909090909092 / 3 (63.6%)





2025/09/30 17:38:02 INFO dspy.teleprompt.gepa.gepa: Iteration 67: Proposed new text for predict: markdown
# 📄 Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with the following structure:

* `pdfinfo` – optional dictionary containing PDF‑level metadata (e.g. `author`, `title`, `creationDate`, `modDate`).  
* `pages` – list of page objects, each with a numeric `page` field and a `text` field that holds the raw OCR/clipboard text of that page.

Your job is to **parse this information and output a flat list of metadata fields** (one field name on a line, the value on the next line).  
The output **must follow the exact format shown in the “Output format example” section** because it will later be turned into a JSON record.

---

## 1️⃣ Required output fields (order **does not** matter)

| Field name | Type | Description & format |
|------------|------|----------------------|
| `language` | `string` or `None` | ISO‑639‑1 code of 

Average Metric: 2.00 / 3 (66.7%): 100%|██████████| 3/3 [00:05<00:00,  1.88s/it]

2025/09/30 17:38:15 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)





2025/09/30 17:40:23 INFO dspy.teleprompt.gepa.gepa: Iteration 68: Proposed new text for predict: markdown
# Revised Instruction for Extracting Structured Bibliographic Metadata from PDF‑extracted Text

You will receive **one JSON object** per request.  
It contains two top‑level keys:

* **pdfinfo** – optional dictionary with PDF‑level metadata (`title`, `author`, `creationDate`, `modDate`).  
* **pages** – an ordered list of page objects, each with an integer `page` and a string `text` that is the raw OCR/clipboard text of that page.

Your job is to **scan the whole document (all pages)** and output a **flat list of metadata fields** exactly in the format described below.  
Do **not** carry any information from previous requests – treat each request as completely independent.

---

## 1. Output format

* Write the field name on a line **by itself**, then write its value on the **next line**.  
* Follow the order given in the table under **1.1 Output fields**.  
* List values must be *

Average Metric: 2.14 / 3 (71.2%): 100%|██████████| 3/3 [00:07<00:00,  2.64s/it]

2025/09/30 17:40:36 INFO dspy.evaluate.evaluate: Average Metric: 2.1363636363636367 / 3 (71.2%)





2025/09/30 17:42:41 INFO dspy.teleprompt.gepa.gepa: Iteration 69: Proposed new text for predict: markdown
# 📄 Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with the following structure:

* `pdfinfo` – optional dictionary containing PDF‑level metadata (e.g. `author`, `title`, `creationDate`, `modDate`).  
* `pages` – list of page objects, each with a numeric `page` field and a `text` field that holds the raw OCR/clipboard text of that page.

Your job is to **parse this information and output a flat list of metadata fields** (one field name on a line, the value on the next line).  
The output **must** follow the exact format shown in the “Output format example” section below, because it will later be turned into a JSON record.

---

## 1️⃣ Required output fields (order does **not** matter)

| Field name | Type | Description & format |
|------------|------|----------------------|
| `language` | `string` or `None` | ISO‑639‑1 c

Average Metric: 2.18 / 3 (72.7%): 100%|██████████| 3/3 [00:05<00:00,  1.75s/it]

2025/09/30 17:42:53 INFO dspy.evaluate.evaluate: Average Metric: 2.1818181818181817 / 3 (72.7%)





2025/09/30 17:44:33 INFO dspy.teleprompt.gepa.gepa: Iteration 70: Proposed new text for predict: markdown
# Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with two top‑level keys:

* `pdfinfo` – optional dictionary containing PDF‑level metadata (e.g. `author`,
  `creationDate`, `modDate`). Dates follow the PDF format `D:YYYYMMDD…`.
* `pages` – a list of page objects, each with:
  * `page` – integer page number (starting at 1)
  * `text` – the raw OCR/clipboard text of that page (preserve line‑breaks).

Your job is to **scan the whole document** (all pages) and output a **flat list of
metadata fields** in the exact format described in the *Output format* section.
All fields must be present – if a value cannot be found, use the *missing‑value
placeholder* defined below.

---

## 1. Output fields

| Field | Type | Required format & notes |
|-------|------|--------------------------|
| **language** | string | ISO‑639‑1 code of t

Average Metric: 1.79 / 3 (59.6%): 100%|██████████| 3/3 [00:07<00:00,  2.52s/it]

2025/09/30 17:45:22 INFO dspy.evaluate.evaluate: Average Metric: 1.787878787878788 / 3 (59.6%)





2025/09/30 17:47:37 INFO dspy.teleprompt.gepa.gepa: Iteration 71: Proposed new text for predict: markdown
# 📄 Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with the following structure:

* `pdfinfo` – optional dictionary containing PDF‑level metadata (e.g. `author`, `title`, `creationDate`, `modDate`).  
* `pages` – list of page objects, each with a numeric `page` field and a `text` field that holds the raw OCR/clipboard text of that page.

Your job is to **parse this information and output a flat list of metadata fields** (one field name on a line, the value on the next line).  
The output **must** follow the exact format shown in the “Output format example” section below, because it will later be turned into a JSON record.

---

## 1️⃣ Required output fields (order does **not** matter)

| Field name | Type | Description & format |
|------------|------|----------------------|
| `language` | `string` or `None` | ISO‑639‑1 c

Average Metric: 2.00 / 3 (66.7%): 100%|██████████| 3/3 [00:06<00:00,  2.14s/it]

2025/09/30 17:47:51 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)





2025/09/30 17:49:51 INFO dspy.teleprompt.gepa.gepa: Iteration 72: Proposed new text for predict: markdown
# 📄 Instruction for extracting structured metadata from PDF‑extracted JSON

You will receive **one JSON object** with the following structure:

* `pdfinfo` – optional dictionary containing PDF‑level metadata (e.g. `author`, `title`, `creationDate`, `modDate`).  
* `pages` – list of page objects, each with a numeric `page` field and a `text` field that holds the raw OCR/clipboard text of that page.

Your job is to **parse this information and output a flat list of metadata fields** (one field name on a line, the value on the next line).  
The output must follow the exact format shown in the “Output format example” section below, because it will later be turned into a JSON record.

---

## 1️⃣ Required output fields (order does **not** matter)

| Field name | Type | Description & format |
|------------|------|----------------------|
| `language` | `string` or `None` | ISO‑639‑1 code 

Average Metric: 2.45 / 3 (81.8%): 100%|██████████| 3/3 [00:09<00:00,  3.16s/it]

2025/09/30 17:50:47 INFO dspy.evaluate.evaluate: Average Metric: 2.4545454545454546 / 3 (81.8%)





2025/09/30 17:53:01 INFO dspy.teleprompt.gepa.gepa: Iteration 73: Proposed new text for predict: markdown
# Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with the following structure:

* `pdfinfo` – optional dictionary containing PDF‑level metadata (e.g. `author`, `title`, `creationDate`, `modDate`).  
* `pages` – list of page objects, each with a numeric `page` field and a `text` field that holds the raw OCR/clipboard text of that page.

Your job is to **parse this information and output a flat list of metadata fields** (one field name on a line, the value on the next line).  
The output **must** follow the exact format shown in the “Output format example” section below, because it will later be turned into a JSON record.

---

## 1. Required output fields (order does **not** matter)

| Field name | Type | Description & format |
|------------|------|----------------------|
| `language` | string or `None` | ISO‑639‑1 code o

Average Metric: 1.91 / 3 (63.6%): 100%|██████████| 3/3 [00:06<00:00,  2.11s/it]

2025/09/30 17:53:16 INFO dspy.evaluate.evaluate: Average Metric: 1.9090909090909092 / 3 (63.6%)





2025/09/30 17:54:46 INFO dspy.teleprompt.gepa.gepa: Iteration 74: Proposed new text for predict: markdown
# Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with two top‑level keys:

* **pdfinfo** – optional dictionary containing PDF‑level metadata (e.g. `title`, `author`, `creationDate`, `modDate`).
* **pages** – a list of page objects, each with:
  * `page` – integer page number  
  * `text` – the raw OCR / clipboard text of that page (preserve line breaks as they appear).

Your job is to **scan the entire document** (all pages) and output a **flat list of metadata fields** exactly in the format described in *Output format* below.  
If a field cannot be found, output the **missing‑value placeholder** shown in the *Missing‑value handling* table.

---

## 1. Fields to extract

| Field | Type | Required format & notes |
|-------|------|--------------------------|
| **language** | string | ISO‑639‑1 code of the dominant language

Average Metric: 1.95 / 3 (65.2%): 100%|██████████| 3/3 [00:08<00:00,  2.71s/it]

2025/09/30 17:55:01 INFO dspy.evaluate.evaluate: Average Metric: 1.9545454545454546 / 3 (65.2%)





2025/09/30 17:56:52 INFO dspy.teleprompt.gepa.gepa: Iteration 75: Proposed new text for predict: markdown
# Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with two top‑level keys:

* **pdfinfo** – optional dictionary containing PDF‑level metadata (`title`, `author`, `creationDate`, `modDate`, …).  
* **pages** – an **ordered** list of page objects, each with:
  * `page` – integer page number (starting at 1)  
  * `text` – the raw OCR / clipboard text of that page (preserve line breaks exactly as they appear).

Your job is to **scan the whole document** (all pages) and output a **flat list of metadata fields** in the exact format described in section 4.  
If a field cannot be found, output the *missing‑value placeholder* from section 3.

---

## 1. Output fields

| Field | Type | Required format & notes |
|-------|------|------------------------|
| **language** | string | ISO‑639‑1 code (`fi`, `sv`, `en`, `se`). Detect from t

Average Metric: 1.64 / 3 (54.5%): 100%|██████████| 3/3 [00:05<00:00,  1.82s/it]

2025/09/30 17:57:42 INFO dspy.evaluate.evaluate: Average Metric: 1.6363636363636362 / 3 (54.5%)





2025/09/30 17:59:35 INFO dspy.teleprompt.gepa.gepa: Iteration 76: Proposed new text for predict: markdown
# Instruction for extracting structured bibliographic metadata from PDF‑extracted text

You will receive **one JSON object** per request.  
It contains two top‑level keys:

* **pdfinfo** – optional dictionary with PDF‑level metadata (`title`, `author`, `creationDate`, `modDate`).  
* **pages** – an ordered list of page objects, each with an integer `page` and a string `text` that is the raw OCR/clipboard text of that page.

Your job is to **scan the whole document (all pages)** and output a **flat list of metadata fields** exactly in the format described below.  
Treat each request as completely independent – do not carry over any information from previous requests.

---

## 1. Output format

* Write the **field name** on a line **by itself**, then write its **value** on the **next line**.  
* Follow the order given in the table under **1.1 Output fields**.  
* List values must be 

Average Metric: 1.27 / 3 (42.4%): 100%|██████████| 3/3 [00:07<00:00,  2.34s/it]

2025/09/30 18:00:18 INFO dspy.evaluate.evaluate: Average Metric: 1.2727272727272727 / 3 (42.4%)





2025/09/30 18:02:12 INFO dspy.teleprompt.gepa.gepa: Iteration 77: Proposed new text for predict: markdown
# Instruction for extracting structured bibliographic metadata from PDF‑extracted text

You will receive **one JSON object** per request.  
It contains two top‑level keys:

* **pdfinfo** – optional dictionary with PDF‑level metadata (`title`, `author`, `creationDate`, `modDate`).  
* **pages** – an ordered list of page objects, each with an integer `page` and a string `text` that is the raw OCR/clipboard text of that page.

Your job is to **scan the whole document (all pages)** and output a **flat list of metadata fields** exactly in the format described below.  
Treat each request as completely independent – do **not** carry any information from previous requests.

---

## 1. Output format

* Write the field name on a line **by itself**, then write its value on the **next line**.  
* Follow the order given in the table under **1.1 Output fields**.  
* List values must be **valid P

Average Metric: 2.12 / 3 (70.7%): 100%|██████████| 3/3 [00:06<00:00,  2.03s/it]

2025/09/30 18:03:17 INFO dspy.evaluate.evaluate: Average Metric: 2.121212121212121 / 3 (70.7%)





2025/09/30 18:04:54 INFO dspy.teleprompt.gepa.gepa: Iteration 78: Proposed new text for predict: markdown
# Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with two top‑level keys:

* **pdfinfo** – optional dictionary containing PDF‑level metadata (`title`, `author`, `creationDate`, `modDate`, …).
* **pages** – a list of page objects, each with an integer `page` number and a `text` string that is the raw OCR/clipboard text of that page.

Your job is to **scan the whole document** (all pages) and produce a **flat list of metadata fields**.  
Write each field name on its own line, followed by the extracted value on the next line (see *Output format*).  
If a value cannot be found, output the placeholder described in the *Missing‑value handling* table.

---

## 1. Required output fields

| Field name | Type | Notes |
|------------|------|-------|
| **language** | string | ISO‑639‑1 code of the document language: `fi`, `en`, `sv`

Average Metric: 1.61 / 3 (53.5%): 100%|██████████| 3/3 [00:05<00:00,  1.91s/it]

2025/09/30 18:05:42 INFO dspy.evaluate.evaluate: Average Metric: 1.606060606060606 / 3 (53.5%)





2025/09/30 18:07:14 INFO dspy.teleprompt.gepa.gepa: Iteration 79: Proposed new text for predict: markdown
# Instruction for extracting structured metadata from PDF‑extracted JSON

You will receive **one JSON object** with two top‑level keys:

* **pdfinfo** – optional dictionary containing PDF‑level metadata (`title`, `author`, `creationDate`, `modDate`, …).  
* **pages** – a list of page objects, each with an integer `page` number and a `text` string that is the raw OCR/clipboard text of that page.

Your job is to **scan the whole document** (all pages) and produce a **flat list of metadata fields**.  
Write each field name on its own line, followed by the extracted value on the next line (see *Output format*).  
If a value cannot be found, output the placeholder described in the *Missing‑value handling* table.

---

## 1. Required output fields

| Field name | Type | Notes |
|------------|------|-------|
| **language** | string | ISO‑639‑1 code of the document language: `fi`, `en`, `s

CPU times: user 1min 9s, sys: 17.5 s, total: 1min 27s
Wall time: 3h 6min 51s





In [9]:
for name, pred in optimized_program.named_predictors():
    print("================================")
    print(f"Predictor: {name}")
    print("================================")
    print("Prompt:")
    print(pred.signature.instructions)
    print("*********************************")

Predictor: predict
Prompt:
markdown
# 📄 Instruction for extracting structured metadata from PDF‑extracted text

You will receive **one JSON object** with the following structure:

* `pdfinfo` – optional dictionary containing PDF‑level metadata (e.g. `author`, `title`, `creationDate`, `modDate`).  
* `pages` – list of page objects, each with a numeric `page` field and a `text` field that holds the raw OCR/clipboard text of that page.

Your job is to **parse this information and output a flat list of metadata fields** (one field name on a line, the value on the next line).  
The output must follow the exact format shown in the “Output format example” section below, because it will later be turned into a JSON record.

---

## 1️⃣ Required output fields (order does **not** matter)

| Field name | Type | Description & format |
|------------|------|----------------------|
| `language` | `string` or `None` | ISO‑639‑1 code of the **content language** (`en`, `fi`, `sv`, `se`, …). Detect from t

In [10]:
%%time

evaluate = dspy.Evaluate(
    devset=test_set,
    metric=metadata_metric_with_feedback,
    num_threads=64,
    display_table=True,
    display_progress=True,
    provide_traceback=True
)

eval_result = evaluate(optimized_program)

Average Metric: 119.53 / 181 (66.0%):  99%|█████████▉| 181/182 [01:57<00:00,  2.07it/s]

2025/09/30 18:10:35 ERROR dspy.utils.parallelizer: Error for Example({'content': '{"pdfinfo": {"author": "ArtMedia", "creationDate": "D:20230116093016+02\'00\'", "modDate": "D:20230116093016+02\'00\'"}, "pages": [{"page": 1, "text": "### **Inskolningens och anknytningens betydelse inom sm\\u00e5barnspedagogiken**\\n\\n\\u2013\\n### **ur l\\u00e4rares synvinkel  en intervjustudie** Susanne Eklund\\nAvhandling f\\u00f6r magisterexamen\\nFakulteten f\\u00f6r pedagogik\\n\\n\\noch v\\u00e4lf\\u00e4rdsstudier\\n\\u00c5bo Akademi, Vasa 2022\\nHandledare: Ann-Katrin\\nSvensson\\n\\n\\n"}, {"page": 2, "text": "2\\n\\n\\n## **Abstrakt**\\n**F\\u00f6rfattare**\\nSusanne Eklund\\n**Arbetets titel**\\n**\\u00c5rtal**\\n\\n2022\\nInskolningen och anknytningens betydelse inom sm\\u00e5barnspedagogiken ur l\\u00e4rarens\\n\\n\\nsynvinkel \\u2013 en intervjustudie\\nOpublicerad avhandling f\\u00f6r magisterexamen i pedagogik\\nVasa: \\u00c5bo Akademi, Fakulteten f\\u00f6r pedagogik och v\\u00e4lf\\u00

Average Metric: 119.53 / 181 (66.0%): 100%|██████████| 182/182 [02:36<00:00,  1.16it/s]

2025/09/30 18:10:35 INFO dspy.evaluate.evaluate: Average Metric: 119.52768595041326 / 182 (65.7%)



CPU times: user 4.28 s, sys: 844 ms, total: 5.13 s
Wall time: 2min 36s


In [11]:
lm.inspect_history()





[34m[2025-09-30T18:10:35.498812][0m

[31mSystem message:[0m

Your input fields are:
1. `content` (str):
Your output fields are:
1. `reasoning` (str): 
2. `language` (str): The language of the resource expressed as a BCP47 language tag.
3. `title` (str): The main title of the publication.
4. `alt_title` (list[str]): Alternative or parallel titles of the publication, suffixed with a BCP47 language tag in curly brackets.
5. `creator` (list[str]): The primary author(s) of the resource (order: Last Name, First Names).
6. `year` (Union[str, NoneType]): The year on which the resource was issued or made available.
7. `publisher` (list[str]): The entity/entities responsible for making the resource available.
8. `doi` (Union[str, NoneType]): The Digital Object Identifier (DOI) associated with the resource.
9. `e_isbn` (list[str]): The ISBN associated with the electronic resource.
10. `p_isbn` (list[str]): The ISBN of the printed version of this document.
11. `e_issn` (Union[str, NoneType

In [12]:
# save the optimized program for later use (many formats, just in case)
optimized_program.save("gepa-optimized-module.json", save_program=False)
optimized_program.save("gepa-optimized-module.pkl", save_program=False)
# save just the prompt(s)
for name, pred in optimized_program.named_predictors():
    with open(f"gepa-optimized-prompt-{name}.txt", "w") as outfile:
        outfile.write(pred.signature.instructions)
