# DSPy Term Extraction for Italian Text

This notebook demonstrates term extraction using DSPy:
1. **Baseline**: Zero-shot prediction with DSPy
2. **Optimized**: Prompt optimization using MIPROv2

DSPy is a framework for programming with language models that:
- Automatically optimizes prompts
- Provides structured outputs
- Enables systematic LM pipeline development

Dataset: EvalITA 2025 ATE-IT (Automatic Term Extraction - Italian Testbed)

## Setup and Imports

In [48]:
import json
import os
import pickle
from typing import List, Dict
from tqdm import tqdm

import dspy
from dspy.teleprompt import MIPROv2

print("✓ Libraries imported")
print(f"DSPy version: {dspy.__version__}")

✓ Libraries imported
DSPy version: 3.0.3


## Data Loading and Processing

In [49]:
def load_jsonl(path: str):
    """Load a JSON lines file or JSON array file."""
    with open(path, 'r', encoding='utf-8') as f:
        text = f.read().strip()
    if not text:
        return []
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        data = []
        for line in text.splitlines():
            line = line.strip()
            if line:
                data.append(json.loads(line))
    return data


def build_sentence_gold_map(records):
    """Convert dataset rows into list of sentences with aggregated terms."""
    out = {}
    
    if isinstance(records, dict) and 'data' in records:
        rows = records['data']
    else:
        rows = records
    
    for r in rows:
        key = (r.get('document_id'), r.get('paragraph_id'), r.get('sentence_id'))
        if key not in out:
            out[key] = {
                'document_id': r.get('document_id'),
                'paragraph_id': r.get('paragraph_id'),
                'sentence_id': r.get('sentence_id'),
                'sentence_text': r.get('sentence_text', ''),
                'terms': []
            }
        
        if isinstance(r.get('term_list'), list):
            for t in r.get('term_list'):
                if t and t not in out[key]['terms']:
                    out[key]['terms'].append(t)
        else:
            term = r.get('term')
            if term and term not in out[key]['terms']:
                out[key]['terms'].append(term)
    
    return list(out.values())


print("✓ Data loading functions defined")

✓ Data loading functions defined


In [50]:
# Load training and dev data
train_data = load_jsonl('../data/subtask_a_train.json')
dev_data = load_jsonl('../data/subtask_a_dev.json')

train_sentences = build_sentence_gold_map(train_data)
dev_sentences = build_sentence_gold_map(dev_data)

print(f"Training sentences: {len(train_sentences)}")
print(f"Dev sentences: {len(dev_sentences)}")
print(f"\nExample sentence:")
print(f"  Text: {train_sentences[6]['sentence_text']}")
print(f"  Terms: {train_sentences[6]['terms']}")

Training sentences: 2308
Dev sentences: 577

Example sentence:
  Text: AFFIDAMENTO DEL “SERVIZIO DI SPAZZAMENTO, RACCOLTA, TRASPORTO E SMALTIMENTO/RECUPERO DEI RIFIUTI URBANI ED ASSIMILATI E SERVIZI COMPLEMENTARI DELLA CITTA' DI AGROPOLI” VALEVOLE PER UN QUINQUENNIO
  Terms: ['raccolta', 'recupero', 'servizio di raccolta', 'servizio di spazzamento', 'smaltimento', 'trasporto']
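
The aggregation step can be illustrated on a few toy rows. The following is a condensed sketch of the same grouping logic, not a replacement for `build_sentence_gold_map`; the helper name `aggregate_terms` and the toy rows are made up for illustration and are not taken from the dataset.

```python
def aggregate_terms(rows):
    """Group rows by (document_id, paragraph_id, sentence_id) and merge their terms."""
    grouped = {}
    for row in rows:
        key = (row.get("document_id"), row.get("paragraph_id"), row.get("sentence_id"))
        entry = grouped.setdefault(
            key, {"sentence_text": row.get("sentence_text", ""), "terms": []}
        )
        # Accept either a 'term_list' list or a single 'term' string.
        if isinstance(row.get("term_list"), list):
            candidates = row["term_list"]
        else:
            candidates = [row.get("term")]
        for term in candidates:
            # Skip empty values and terms already recorded for this sentence.
            if term and term not in entry["terms"]:
                entry["terms"].append(term)
    return list(grouped.values())


# Hypothetical rows: one sentence annotated row by row, one sentence without terms.
toy_rows = [
    {"document_id": "d1", "paragraph_id": 1, "sentence_id": 1,
     "sentence_text": "Raccolta e trasporto dei rifiuti.", "term": "raccolta"},
    {"document_id": "d1", "paragraph_id": 1, "sentence_id": 1,
     "sentence_text": "Raccolta e trasporto dei rifiuti.", "term": "trasporto"},
    {"document_id": "d1", "paragraph_id": 1, "sentence_id": 1,
     "sentence_text": "Raccolta e trasporto dei rifiuti.", "term": "raccolta"},
    {"document_id": "d1", "paragraph_id": 1, "sentence_id": 2,
     "sentence_text": "Il Comune di Agropoli.", "term_list": []},
]

toy_sentences = aggregate_terms(toy_rows)
for sentence in toy_sentences:
    print(sentence["sentence_text"], "->", sentence["terms"])
```

Duplicate terms for the same sentence are kept only once, and a sentence with an empty `term_list` still appears in the result with an empty term list.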


## Evaluation Metrics

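The functions below implement the competition's official evaluation metrics: micro-averaged precision, recall, and F1 over per-sentence term sets, and type-level scores over the set of unique terms extracted across the whole dataset.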

In [51]:
def micro_f1_score(gold_standard, system_output):
    """
    Evaluates performance using Precision, Recall, and F1 score 
    based on individual term matching (micro-average).
    """
    total_true_positives = 0
    total_false_positives = 0
    total_false_negatives = 0
    
    for gold, system in zip(gold_standard, system_output):
        gold_set = set(gold)
        system_set = set(system)
        
        true_positives = len(gold_set.intersection(system_set))
        false_positives = len(system_set - gold_set)
        false_negatives = len(gold_set - system_set)
        
        total_true_positives += true_positives
        total_false_positives += false_positives
        total_false_negatives += false_negatives
    
    precision = total_true_positives / (total_true_positives + total_false_positives) if (total_true_positives + total_false_positives) > 0 else 0
    recall = total_true_positives / (total_true_positives + total_false_negatives) if (total_true_positives + total_false_negatives) > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    
    return precision, recall, f1, total_true_positives, total_false_positives, total_false_negatives


def type_f1_score(gold_standard, system_output):
    """
    Evaluates performance using Type Precision, Type Recall, and Type F1 score
    based on the set of unique terms extracted at least once across the entire dataset.
    """
    all_gold_terms = set()
    for item_terms in gold_standard:
        all_gold_terms.update(item_terms)
    
    all_system_terms = set()
    for item_terms in system_output:
        all_system_terms.update(item_terms)
    
    type_true_positives = len(all_gold_terms.intersection(all_system_terms))
    type_false_positives = len(all_system_terms - all_gold_terms)
    type_false_negatives = len(all_gold_terms - all_system_terms)
    
    type_precision = type_true_positives / (type_true_positives + type_false_positives) if (type_true_positives + type_false_positives) > 0 else 0
    type_recall = type_true_positives / (type_true_positives + type_false_negatives) if (type_true_positives + type_false_negatives) > 0 else 0
    type_f1 = 2 * (type_precision * type_recall) / (type_precision + type_recall) if (type_precision + type_recall) > 0 else 0
    
    return type_precision, type_recall, type_f1


print("✓ Evaluation functions defined")

✓ Evaluation functions defined
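
To see how the two metric families can diverge, here is a small worked example on made-up data. It is a compact sketch of the same counting logic, with hypothetical helper names (`micro_counts`, `prf`), and is not the official scorer.

```python
def micro_counts(gold, system):
    """Sum true positives, false positives, and false negatives sentence by sentence."""
    tp = fp = fn = 0
    for gold_terms, system_terms in zip(gold, system):
        gold_set, system_set = set(gold_terms), set(system_terms)
        tp += len(gold_set & system_set)
        fp += len(system_set - gold_set)
        fn += len(gold_set - system_set)
    return tp, fp, fn


def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts, with zero-division guards."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Toy data (illustrative only): two sentences, gold vs. predicted terms.
toy_gold = [["raccolta", "rifiuti"], ["raccolta"]]
toy_pred = [["raccolta"], ["raccolta", "comune"]]

# Micro: every (sentence, term) pair counts, so "raccolta" is credited twice.
micro = prf(*micro_counts(toy_gold, toy_pred))

# Type-level: pool terms over the whole dataset first, so each term counts once.
gold_types = set().union(*toy_gold)
pred_types = set().union(*toy_pred)
type_level = prf(
    len(gold_types & pred_types),
    len(pred_types - gold_types),
    len(gold_types - pred_types),
)

print("micro (P, R, F1):", micro)
print("type  (P, R, F1):", type_level)
```

In this toy case the micro scores are each 2/3 while the type-level scores are each 0.5, because a frequent correct term raises the micro counts but contributes only one type.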


## Configure DSPy Language Model

In [None]:
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Define LLM parameters
temperature = 0.2
max_tokens = 1024
use_cache = True

# Option: Gemini (commented out)
#lm = dspy.LM(model='gemini/gemini-2.5-flash', api_key=os.getenv('GEMINI_API_KEY'), temperature=temperature, max_tokens=max_tokens, cache=use_cache)

# Active configuration: OpenAI
lm = dspy.LM(model="openai/gpt-4o-mini", api_key=os.getenv('OPENAI_API_KEY'), temperature=temperature, max_tokens=max_tokens, cache=use_cache)

# You can also use GROQ
#lm = dspy.LM(model='llama-3.1-8b-instant', api_key=os.getenv('GROQ_API_KEY'), api_base="https://api.groq.com/openai/v1", temperature=temperature, max_tokens=max_tokens, cache=use_cache)

# You can also use Azure
#lm = dspy.LM('azure/gpt-4o-mini', api_key=os.getenv('AZURE_GPT_4O_MINI_API_KEY'), api_base='https://marco-m9l7uxzf-eastus2.cognitiveservices.azure.com', temperature=temperature, max_tokens=max_tokens, cache=use_cache)

# You can also use Ollama locally
# (no trailing comma after the call: it would turn `lm` into a one-element tuple)
#lm = dspy.LM('ollama_chat/llama3.1:8b', api_base='http://localhost:11434', api_key='', temperature=temperature, max_tokens=max_tokens, cache=use_cache)

dspy.settings.configure(lm=lm)

print(f"✓ DSPy configured with {lm.model} model")

✓ DSPy configured with openai/gpt-4o-mini model


## Baseline Model: Zero-Shot DSPy

A simple DSPy signature for term extraction, used as-is without any prompt optimization.

In [53]:
class TermExtractionSignature(dspy.Signature):
    """Extract waste management terms from Italian text. 
    Return only domain-specific terms, ignore named entities and nested terms.
    Output a comma-separated list of terms."""
    
    sentence: str = dspy.InputField(desc="Italian sentence about waste management")
    terms: str = dspy.OutputField(desc="Comma-separated list of waste management terms (lowercase)")


class TermExtractor(dspy.Module):
    """Baseline term extractor using DSPy."""
    
    def __init__(self):
        super().__init__()
        self.extract = dspy.Predict(TermExtractionSignature)
        #self.extract = dspy.ChainOfThought(TermExtractionSignature)
    
    def forward(self, sentence: str) -> List[str]:
        """Extract terms from a sentence."""
        result = self.extract(sentence=sentence)
        
        # Parse output - handle None case
        terms_str = result.terms if result.terms else ""
        terms_str = terms_str.strip()
        
        if not terms_str or terms_str.lower() == 'none':
            return []
        
        # Split by comma and clean
        terms = [t.strip().lower() for t in terms_str.split(',') if t.strip()]
        return terms


print("✓ Baseline DSPy model defined")

✓ Baseline DSPy model defined
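
The output-parsing step inside `forward` can be checked in isolation, without calling a language model. The sketch below restates that logic as a standalone function; the name `parse_terms` and the sample strings are hypothetical and used only for illustration.

```python
from typing import List, Optional


def parse_terms(raw: Optional[str]) -> List[str]:
    """Turn a comma-separated model output into a list of lowercase terms."""
    text = (raw or "").strip()
    # Treat a missing value, an empty string, or the literal 'none' as no terms.
    if not text or text.lower() == "none":
        return []
    # Split on commas, trim whitespace, lowercase, and drop empty pieces.
    return [piece.strip().lower() for piece in text.split(",") if piece.strip()]


# Hypothetical model outputs covering the main cases.
samples = ["Raccolta,  Smaltimento , ", None, "None", "rifiuti, Rifiuti"]
for sample in samples:
    print(repr(sample), "->", parse_terms(sample))
```

Note that this parsing does not deduplicate terms. That does not affect the scores here, since the evaluation functions convert each prediction to a set, but it also means a term that itself contains a comma would be split in two.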


### Run and Evaluate Baseline Model

In [54]:
# Initialize baseline model
baseline_model = TermExtractor()

print("✓ Baseline model initialized")

✓ Baseline model initialized


In [55]:
# Predict on the full dev set (no sampling)
print("Running baseline predictions on dev set...")
print("Note: Processing all sentences may take time. Using full dev set.")

dev_texts = [s['sentence_text'] for s in dev_sentences]
dev_gold = [s['terms'] for s in dev_sentences]

baseline_preds = []
for i, text in tqdm(enumerate(dev_texts), desc="Predicting", total=len(dev_texts)):
    try:
        terms = baseline_model(text)
        baseline_preds.append(terms)
    except Exception as e:
        print(f"  Warning: Error processing sentence {i}: {e}")
        baseline_preds.append([])

print(f"✓ Baseline predictions completed: {len(baseline_preds)} predictions")

Running baseline predictions on dev set...
Note: Processing all sentences may take time. Using full dev set.


Predicting: 100%|██████████| 577/577 [20:11<00:00,  2.10s/it]    

✓ Baseline predictions completed: 577 predictions





In [56]:
# Evaluate baseline
precision, recall, f1, tp, fp, fn = micro_f1_score(dev_gold, baseline_preds)
type_precision, type_recall, type_f1 = type_f1_score(dev_gold, baseline_preds)

print("\nBaseline Model Results:")
print("Micro-averaged metrics:")
print(f"  Precision: {precision:.4f}")
print(f"  Recall:    {recall:.4f}")
print(f"  F1 Score:  {f1:.4f}")
print(f"  TP={tp}, FP={fp}, FN={fn}")
print("\nType-level metrics:")
print(f"  Type Precision: {type_precision:.4f}")
print(f"  Type Recall:    {type_recall:.4f}")
print(f"  Type F1 Score:  {type_f1:.4f}")

# Store metrics for comparison
baseline_metrics = {
    'precision': precision,
    'recall': recall,
    'f1': f1,
    'type_precision': type_precision,
    'type_recall': type_recall,
    'type_f1': type_f1
}


Baseline Model Results:
Micro-averaged metrics:
  Precision: 0.2153
  Recall:    0.5565
  F1 Score:  0.3105
  TP=251, FP=915, FN=200

Type-level metrics:
  Type Precision: 0.1721
  Type Recall:    0.5000
  Type F1 Score:  0.2561
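
As a sanity check, the micro-averaged scores can be recomputed directly from the counts printed above (TP=251, FP=915, FN=200). This is plain arithmetic on the reported numbers, not a new evaluation run.

```python
# Counts copied from the baseline output above.
tp, fp, fn = 251, 915, 200

predicted_total = tp + fp   # number of terms the model produced
gold_total = tp + fn        # number of terms in the gold annotations

precision = tp / predicted_total
recall = tp / gold_total
# Equivalent closed form of the harmonic mean of precision and recall.
f1 = 2 * tp / (2 * tp + fp + fn)

print(f"Predicted terms: {predicted_total}, gold terms: {gold_total}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
```

The counts show that the baseline emits 1166 terms against 451 gold terms, so the low precision comes mainly from over-generation rather than from missed terms.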


## Optimized Model: MIPROv2 Prompt Optimization

Use DSPy's MIPROv2 optimizer to automatically improve prompts based on training examples.

In [57]:
# Prepare training examples for DSPy
# Use a subset for optimization (MIPROv2 is computationally expensive)
train_subset_size = 100  # Adjust based on API limits

print(f"Preparing {train_subset_size} training examples for optimization...")

train_examples = []
for i, sent_data in enumerate(train_sentences[:train_subset_size]):
    # Create DSPy Example with input and expected output
    terms_str = ', '.join(sent_data['terms']) if sent_data['terms'] else 'none'
    example = dspy.Example(
        sentence=sent_data['sentence_text'],
        terms=terms_str
    ).with_inputs('sentence')
    train_examples.append(example)

print(f"✓ Created {len(train_examples)} training examples")

Preparing 100 training examples for optimization...
✓ Created 100 training examples
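
The optimization log further down evaluates candidates on 80 examples and bootstraps demonstrations from 20, which suggests that the 100 examples are split automatically when no validation set is supplied. If you prefer to control that split yourself, a minimal sketch is shown below. The helper `split_examples`, its 80/20 default, and the seed are assumptions made for illustration; plain integers stand in for the `dspy.Example` objects.

```python
import random


def split_examples(examples, val_fraction=0.8, seed=0):
    """Shuffle a copy of the examples and split it into (train, validation) parts."""
    shuffled = list(examples)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Stand-in data: integers instead of dspy.Example objects.
toy_examples = list(range(100))
toy_train, toy_val = split_examples(toy_examples)

print(f"train: {len(toy_train)}, validation: {len(toy_val)}")
```

The two lists could then be passed to `optimizer.compile` as `trainset=` and `valset=`. Shuffling before taking a subset may also give a more varied sample than the first 100 sentences, which are likely to come from the same few documents.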


In [63]:
# Define metric for optimization
def term_extraction_metric(example, pred, trace=None):
    """Metric for evaluating term extraction."""
    # Parse gold terms
    gold_terms_str = example.terms.strip().lower()
    if gold_terms_str == 'none' or not gold_terms_str:
        gold_terms = set()
    else:
        gold_terms = set(t.strip() for t in gold_terms_str.split(',') if t.strip())
    
    # Parse predicted terms - handle both DSPy prediction object and list
    if isinstance(pred, list):
        # pred is already a list from forward()
        pred_terms = set(t.strip().lower() for t in pred if t)
    elif hasattr(pred, 'terms'):
        # pred is a DSPy prediction object
        pred_terms_str = pred.terms.strip().lower()
        if pred_terms_str == 'none' or not pred_terms_str:
            pred_terms = set()
        else:
            pred_terms = set(t.strip() for t in pred_terms_str.split(',') if t.strip())
    else:
        # Fallback: empty set
        pred_terms = set()
    
    # Calculate F1 score
    if len(pred_terms) == 0 and len(gold_terms) == 0:
        return 1.0
    
    tp = len(gold_terms.intersection(pred_terms))
    fp = len(pred_terms - gold_terms)
    fn = len(gold_terms - pred_terms)
    
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    
    return f1


print("✓ Optimization metric defined")

✓ Optimization metric defined
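
The optimization metric is a per-sentence F1, and the optimizer averages it over examples. The toy cases below show how it behaves at the edges; `sentence_f1` is a simplified, hypothetical restatement that works on term lists rather than on DSPy objects.

```python
def sentence_f1(gold, pred):
    """F1 between two term collections; two empty collections count as a perfect match."""
    gold_set, pred_set = set(gold), set(pred)
    if not gold_set and not pred_set:
        return 1.0
    tp = len(gold_set & pred_set)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_set)
    recall = tp / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Illustrative (gold, predicted) pairs, not taken from the dataset.
cases = [
    (["raccolta", "trasporto", "smaltimento"], ["raccolta", "trasporto", "rifiuti"]),
    ([], []),          # no gold terms, nothing predicted
    ([], ["comune"]),  # no gold terms, one spurious prediction
]

scores = [sentence_f1(gold, pred) for gold, pred in cases]
mean_score = sum(scores) / len(scores)

print("per-sentence scores:", [round(score, 4) for score in scores])
print("mean score:", round(mean_score, 4))
```

Because sentences without gold terms score 1.0 when nothing is predicted, and because scores are averaged per sentence rather than pooled, the percentages reported by the optimizer are not directly comparable to the micro-averaged F1 computed on the dev set.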


In [64]:
# Initialize MIPROv2 optimizer
print("Initializing MIPROv2 optimizer...")
print("Note: This may take several minutes depending on the number of trials.")

optimizer = MIPROv2(
    metric=term_extraction_metric,
    auto=None, # Can choose between light, medium, and heavy optimization runs. Set to None for manually defining num_candidates and num_trials.
    num_candidates=5,  # Number of prompt candidates to generate
    init_temperature=1.0,
    verbose=True
)

print("✓ Optimizer initialized")

Initializing MIPROv2 optimizer...
Note: This may take several minutes depending on the number of trials.
✓ Optimizer initialized
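
The log of the next cell reports 13 trials even though `num_trials=10` is passed to `compile`. One accounting that matches the log is: one full evaluation of the default program, ten minibatch trials, and a full evaluation after every fifth minibatch trial. The sketch below encodes that reading. The function `expected_trials` and its `full_eval_every` parameter are hypothetical names inferred from the log, not DSPy's own formula, and the result may differ for other settings.

```python
def expected_trials(num_minibatch_trials: int, full_eval_every: int = 5) -> int:
    """Count trials as read from the log: default eval + minibatch trials + periodic full evals."""
    default_program_eval = 1
    periodic_full_evals = num_minibatch_trials // full_eval_every
    return default_program_eval + num_minibatch_trials + periodic_full_evals


trial_count = expected_trials(10)
print(f"Expected trial count: {trial_count}")
```

Under this reading, trials 1, 7, and 13 are full evaluations and the remaining ten are minibatch evaluations of 35 examples each.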


In [65]:
# Optimize the model
print("="*60)
print("Starting prompt optimization with MIPROv2...")
print("="*60)

# Create a fresh model for optimization
optimized_model = TermExtractor()

# Run optimization
optimized_model = optimizer.compile(
    optimized_model,
    trainset=train_examples,
    num_trials=10,  # Number of optimization trials
    max_bootstrapped_demos=3,  # Max examples to include in prompts
    max_labeled_demos=3,
    requires_permission_to_run=False
)

print("\n" + "="*60)
print("✓ OPTIMIZATION COMPLETED!")
print("="*60)

2025/11/05 14:14:42 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/11/05 14:14:42 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/11/05 14:14:42 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=5 sets of demonstrations...


Starting prompt optimization with MIPROv2...
Bootstrapping set 1/5
Bootstrapping set 2/5
Bootstrapping set 3/5


 15%|█▌        | 3/20 [00:00<00:00, 20.70it/s]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 4/5


 35%|███▌      | 7/20 [00:08<00:16,  1.25s/it]


Bootstrapped 3 full traces after 7 examples for up to 1 rounds, amounting to 7 attempts.
Bootstrapping set 5/5


 10%|█         | 2/20 [00:01<00:13,  1.31it/s]
2025/11/05 14:14:52 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/11/05 14:14:52 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.


Bootstrapped 1 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
SOURCE CODE: TermExtractionSignature(sentence -> terms
    instructions='Extract waste management terms from Italian text. \nReturn only domain-specific terms, ignore named entities and nested terms.\nOutput a comma-separated list of terms.'
    sentence = Field(annotation=str required=True json_schema_extra={'desc': 'Italian sentence about waste management', '__dspy_field_type': 'input', 'prefix': 'Sentence:'})
    terms = Field(annotation=str required=True json_schema_extra={'desc': 'Comma-separated list of waste management terms (lowercase)', '__dspy_field_type': 'output', 'prefix': 'Terms:'})
)



class TermExtractor(dspy.Module):
    """Baseline term extractor using DSPy."""

    def __init__(self):
        super().__init__()
        self.extract = dspy.Predict(TermExtractionSignature)
        #self.extract = dspy.ChainOfThought(TermExtractionSignature)

    def forward(self, sentence: str) ->

2025/11/05 14:15:03 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing N=5 instructions...



DATA SUMMARY: The dataset comprises structured administrative and legal documents related to municipal waste management services in the Alto Cilento area. It emphasizes formal naming conventions, procedural instructions, and accountability, underscoring the importance of transparency and compliance in local government operations. The organized formatting, including lists and clear headings, reflects a systematic approach to communication in public administration.
Using a randomly generated configuration for our grounded proposer.
Selected tip: description
PROGRAM DESCRIPTION: The program is designed to extract domain-specific terms related to waste management from Italian text. It accomplishes this task by implementing a term extraction module that leverages a predefined signature to process input sentences. The program specifies that only relevant waste management-related terms should be captured while ignoring named entities and nested terms. Given an Italian sentence, the module use

2025/11/05 14:15:48 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2025/11/05 14:15:48 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Extract waste management terms from Italian text. 
Return only domain-specific terms, ignore named entities and nested terms.
Output a comma-separated list of terms.

2025/11/05 14:15:48 INFO dspy.teleprompt.mipro_optimizer_v2: 1: Please extract the domain-specific waste management terms from the provided Italian sentence. Ensure that you exclude any named entities and nested terms from your extraction. Return a clean, comma-separated list of the identified terms.

2025/11/05 14:15:48 INFO dspy.teleprompt.mipro_optimizer_v2: 2: Analyze the following Italian sentence related to municipal waste management and extract any relevant waste management terms. Please ensure to focus solely on terms specific to the waste management domain, and disregard any named entities or nested terms. Return the extracted terms as a clean, comma-





[2025-11-05T14:15:03.165538]

System message:

Your input fields are:
1. `observations` (str): Observations I have made about my dataset
Your output fields are:
1. `summary` (str): Two to Three sentence summary of only the most significant highlights of my observations
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## observations ## ]]
{observations}

[[ ## summary ## ]]
{summary}

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        Given a series of observations I have made about my dataset, please summarize them into a brief 2-3 sentence summary which highlights only the most important details.


User message:

[[ ## observations ## ]]
The dataset consists of administrative and legal documents primarily related to municipal services in the Alto Cilento area, specifically focusing on waste management and related contracts. The examples show a structured format with specif

2025/11/05 14:15:49 INFO dspy.teleprompt.mipro_optimizer_v2: ==> STEP 3: FINDING OPTIMAL PROMPT PARAMETERS <==
2025/11/05 14:15:49 INFO dspy.teleprompt.mipro_optimizer_v2: We will evaluate the program over a series of trials with different combinations of instructions and few-shot examples to find the optimal combination using Bayesian Optimization.

2025/11/05 14:15:49 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 1 / 13 - Full Evaluation of Default Program ==


Average Metric: 45.00 / 80 (56.2%): 100%|██████████| 80/80 [00:11<00:00,  6.71it/s]

2025/11/05 14:16:01 INFO dspy.evaluate.evaluate: Average Metric: 44.998051948051945 / 80 (56.2%)
2025/11/05 14:16:01 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 56.25

2025/11/05 14:16:01 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 2 / 13 - Minibatch ==
2025/11/05 14:16:01 INFO dspy.teleprompt.mipro_optimizer_v2: Evaluating the following candidate program...





Predictor 0
i: Please extract the domain-specific waste management terms from the provided Italian sentence. Ensure that you exclude any named entities and nested terms from your extraction. Return a clean, comma-separated list of the identified terms.
p: Terms:


Average Metric: 24.18 / 35 (69.1%): 100%|██████████| 35/35 [00:05<00:00,  6.77it/s]

2025/11/05 14:16:06 INFO dspy.evaluate.evaluate: Average Metric: 24.175757575757576 / 35 (69.1%)
2025/11/05 14:16:06 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 69.07 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 1'].
2025/11/05 14:16:06 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [69.07]
2025/11/05 14:16:06 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.25]
2025/11/05 14:16:06 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 56.25


2025/11/05 14:16:06 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 3 / 13 - Minibatch ==


Predictor 0
i: Analyze the following Italian sentence related to municipal waste management and extract any relevant waste management terms. Please ensure to focus solely on terms specific to the waste management domain, and disregard any named entities or nested terms. Return the extracted terms as a clean, comma-separated list in lowercase.
p: Terms:


Average Metric: 28.68 / 35 (81.9%): 100%|██████████| 35/35 [00:05<00:00,  6.65it/s]

2025/11/05 14:16:11 INFO dspy.evaluate.evaluate: Average Metric: 28.67878787878788 / 35 (81.9%)
2025/11/05 14:16:11 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 81.94 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 1'].
2025/11/05 14:16:11 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [69.07, 81.94]
2025/11/05 14:16:11 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.25]
2025/11/05 14:16:12 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 56.25


2025/11/05 14:16:12 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 4 / 13 - Minibatch ==
2025/11/05 14:16:12 INFO dspy.teleprompt.mipro_optimizer_v2: Evaluating the following candidate program...



Predictor 0
i: Given an Italian sentence related to waste management, extract and return only the domain-specific terms while ignoring named entities and any nested terms. Provide the output as a cleaned, comma-separated list of terms.
p: Terms:


Average Metric: 24.68 / 35 (70.5%): 100%|██████████| 35/35 [00:05<00:00,  6.03it/s]

2025/11/05 14:16:17 INFO dspy.evaluate.evaluate: Average Metric: 24.683333333333334 / 35 (70.5%)
2025/11/05 14:16:17 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 70.52 on minibatch of size 35 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 1'].
2025/11/05 14:16:17 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [69.07, 81.94, 70.52]
2025/11/05 14:16:17 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.25]
2025/11/05 14:16:17 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 56.25


2025/11/05 14:16:17 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 5 / 13 - Minibatch ==


Predictor 0
i: Analyze the following Italian sentence related to municipal waste management and extract any relevant waste management terms. Please ensure to focus solely on terms specific to the waste management domain, and disregard any named entities or nested terms. Return the extracted terms as a clean, comma-separated list in lowercase.
p: Terms:


Average Metric: 25.00 / 35 (71.4%): 100%|██████████| 35/35 [00:04<00:00,  7.90it/s] 

2025/11/05 14:16:22 INFO dspy.evaluate.evaluate: Average Metric: 25.0 / 35 (71.4%)
2025/11/05 14:16:22 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 71.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 1'].
2025/11/05 14:16:22 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [69.07, 81.94, 70.52, 71.43]
2025/11/05 14:16:22 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.25]
2025/11/05 14:16:22 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 56.25


2025/11/05 14:16:22 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 6 / 13 - Minibatch ==


Predictor 0
i: Given an Italian sentence related to waste management, extract and return only the domain-specific terms while ignoring named entities and any nested terms. Provide the output as a cleaned, comma-separated list of terms.
p: Terms:


Average Metric: 26.98 / 35 (77.1%): 100%|██████████| 35/35 [00:09<00:00,  3.59it/s]

2025/11/05 14:16:32 INFO dspy.evaluate.evaluate: Average Metric: 26.975757575757576 / 35 (77.1%)
2025/11/05 14:16:32 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 77.07 on minibatch of size 35 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 3'].
2025/11/05 14:16:32 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [69.07, 81.94, 70.52, 71.43, 77.07]
2025/11/05 14:16:32 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.25]
2025/11/05 14:16:32 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 56.25


2025/11/05 14:16:32 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 7 / 13 - Full Evaluation =====
2025/11/05 14:16:32 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 77.07) from minibatch trials...


Average Metric: 61.90 / 80 (77.4%): 100%|██████████| 80/80 [02:06<00:00,  1.58s/it]

2025/11/05 14:18:39 INFO dspy.evaluate.evaluate: Average Metric: 61.8995670995671 / 80 (77.4%)
2025/11/05 14:18:39 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 77.37
2025/11/05 14:18:39 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.25, 77.37]
2025/11/05 14:18:39 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 77.37
2025/11/05 14:18:39 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/11/05 14:18:39 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 8 / 13 - Minibatch ==
2025/11/05 14:18:39 INFO dspy.teleprompt.mipro_optimizer_v2: Evaluating the following candidate program...



Predictor 0
i: Extract waste management terms from Italian text. 
Return only domain-specific terms, ignore named entities and nested terms.
Output a comma-separated list of terms.
p: Terms:


Average Metric: 26.07 / 35 (74.5%): 100%|██████████| 35/35 [00:05<00:00,  6.33it/s]

2025/11/05 14:18:44 INFO dspy.evaluate.evaluate: Average Metric: 26.073809523809523 / 35 (74.5%)
2025/11/05 14:18:44 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 74.5 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 1'].
2025/11/05 14:18:44 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [69.07, 81.94, 70.52, 71.43, 77.07, 74.5]
2025/11/05 14:18:44 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.25, 77.37]
2025/11/05 14:18:44 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 77.37





Predictor 0
i: Given an Italian sentence related to waste management, extract and return only the domain-specific terms while ignoring named entities and any nested terms. Provide the output as a cleaned, comma-separated list of terms.
p: Terms:


Average Metric: 21.01 / 35 (60.0%): 100%|██████████| 35/35 [00:07<00:00,  4.83it/s]

2025/11/05 14:18:52 INFO dspy.evaluate.evaluate: Average Metric: 21.007142857142856 / 35 (60.0%)
2025/11/05 14:18:52 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 60.02 on minibatch of size 35 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 4'].
2025/11/05 14:18:52 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [69.07, 81.94, 70.52, 71.43, 77.07, 74.5, 60.02]
2025/11/05 14:18:52 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.25, 77.37]
2025/11/05 14:18:52 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 77.37


2025/11/05 14:18:52 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 10 / 13 - Minibatch ==
2025/11/05 14:18:52 INFO dspy.teleprompt.mipro_optimizer_v2: Evaluating the following candidate program...



Predictor 0
i: Extract waste management terms from Italian text. 
Return only domain-specific terms, ignore named entities and nested terms.
Output a comma-separated list of terms.
p: Terms:


Average Metric: 18.50 / 35 (52.9%): 100%|██████████| 35/35 [00:00<00:00, 256.37it/s]

2025/11/05 14:18:52 INFO dspy.evaluate.evaluate: Average Metric: 18.5 / 35 (52.9%)
2025/11/05 14:18:52 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 52.86 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 0'].
2025/11/05 14:18:52 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [69.07, 81.94, 70.52, 71.43, 77.07, 74.5, 60.02, 52.86]
2025/11/05 14:18:52 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.25, 77.37]
2025/11/05 14:18:52 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 77.37


2025/11/05 14:18:52 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 11 / 13 - Minibatch ==


Predictor 0
i: Analyze the following Italian sentence related to municipal waste management and extract any relevant waste management terms. Please ensure to focus solely on terms specific to the waste management domain, and disregard any named entities or nested terms. Return the extracted terms as a clean, comma-separated list in lowercase.
p: Terms:


Average Metric: 28.86 / 35 (82.4%): 100%|██████████| 35/35 [00:05<00:00,  5.99it/s]

2025/11/05 14:18:58 INFO dspy.evaluate.evaluate: Average Metric: 28.857142857142858 / 35 (82.4%)
2025/11/05 14:18:58 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 82.45 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 2'].
2025/11/05 14:18:58 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [69.07, 81.94, 70.52, 71.43, 77.07, 74.5, 60.02, 52.86, 82.45]
2025/11/05 14:18:58 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.25, 77.37]
2025/11/05 14:18:58 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 77.37


2025/11/05 14:18:58 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 12 / 13 - Minibatch ==
2025/11/05 14:18:58 INFO dspy.teleprompt.mipro_optimizer_v2: Evaluating the following candidate program...






Predictor 0
i: Analyze the following Italian sentence related to municipal waste management and extract any relevant waste management terms. Please ensure to focus solely on terms specific to the waste management domain, and disregard any named entities or nested terms. Return the extracted terms as a clean, comma-separated list in lowercase.
p: Terms:


Average Metric: 27.20 / 35 (77.7%): 100%|██████████| 35/35 [00:06<00:00,  5.48it/s]

2025/11/05 14:19:04 INFO dspy.evaluate.evaluate: Average Metric: 27.2025974025974 / 35 (77.7%)
2025/11/05 14:19:04 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 77.72 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 2'].
2025/11/05 14:19:04 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [69.07, 81.94, 70.52, 71.43, 77.07, 74.5, 60.02, 52.86, 82.45, 77.72]
2025/11/05 14:19:04 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.25, 77.37]
2025/11/05 14:19:04 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 77.37


2025/11/05 14:19:04 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 13 / 13 - Full Evaluation =====


Average Metric: 64.67 / 80 (80.8%): 100%|██████████| 80/80 [02:05<00:00,  1.56s/it]

2025/11/05 14:21:09 INFO dspy.evaluate.evaluate: Average Metric: 64.66926406926407 / 80 (80.8%)
2025/11/05 14:21:09 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 80.84
2025/11/05 14:21:09 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [56.25, 77.37, 80.84]
2025/11/05 14:21:09 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 80.84
2025/11/05 14:21:09 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/11/05 14:21:09 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 80.84!



✓ OPTIMIZATION COMPLETED!
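
The minibatch scores in the log above vary widely from trial to trial, and the same candidate (Instruction 2 with Few-Shot Set 2) scored 82.45 and 77.72 on two different minibatches of 35 examples. As a rough illustration of that noise, the sketch below summarizes the ten minibatch scores copied by hand from the log. The variable names are illustrative and nothing here is part of the MIPROv2 API.

```python
import statistics

# Minibatch scores copied from the final "Minibatch scores so far" log line.
minibatch_scores = [69.07, 81.94, 70.52, 71.43, 77.07, 74.5, 60.02, 52.86, 82.45, 77.72]

mean_score = statistics.mean(minibatch_scores)
std_score = statistics.stdev(minibatch_scores)  # sample standard deviation
spread = max(minibatch_scores) - min(minibatch_scores)

print(f"mean={mean_score:.2f}  stdev={std_score:.2f}  range={spread:.2f}")
```

A spread this large on 35-example minibatches suggests treating individual minibatch scores as rough signals and relying on the full evaluations (56.25, 77.37, 80.84) when comparing candidates.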


### Save Optimized Model

In [66]:
# Save optimized model
os.makedirs('models', exist_ok=True)
optimized_model.save('models/dspy_optimized.pkl')

print("✓ Optimized model saved to models/dspy_optimized.pkl")

✓ Optimized model saved to models/dspy_optimized.pkl


### Evaluate Optimized Model

In [68]:
# Predict on dev set with optimized model
print("Running optimized model predictions on dev set...")

optimized_preds = []
for i, text in tqdm(enumerate(dev_texts), desc="Predicting", total=len(dev_texts)):
    try:
        terms = optimized_model(text)
        optimized_preds.append(terms)
    except Exception as e:
        print(f"  Warning: Error processing sentence {i}: {e}")
        optimized_preds.append([])

print(f"✓ Optimized predictions completed: {len(optimized_preds)} predictions")

Running optimized model predictions on dev set...


Predicting: 100%|██████████| 577/577 [07:58<00:00,  1.21it/s]

✓ Optimized predictions completed: 577 predictions





In [69]:
# Evaluate optimized model
precision, recall, f1, tp, fp, fn = micro_f1_score(dev_gold, optimized_preds)
type_precision, type_recall, type_f1 = type_f1_score(dev_gold, optimized_preds)

print("Optimized Model Results:")
print("Micro-averaged metrics:")
print(f"  Precision: {precision:.4f}")
print(f"  Recall:    {recall:.4f}")
print(f"  F1 Score:  {f1:.4f}")
print(f"  TP={tp}, FP={fp}, FN={fn}")
print("\nType-level metrics:")
print(f"  Type Precision: {type_precision:.4f}")
print(f"  Type Recall:    {type_recall:.4f}")
print(f"  Type F1 Score:  {type_f1:.4f}")

# Store metrics for comparison
optimized_metrics = {
    'precision': precision,
    'recall': recall,
    'f1': f1,
    'type_precision': type_precision,
    'type_recall': type_recall,
    'type_f1': type_f1
}

Optimized Model Results:
Micro-averaged metrics:
  Precision: 0.2502
  Recall:    0.5787
  F1 Score:  0.3494
  TP=261, FP=782, FN=190

Type-level metrics:
  Type Precision: 0.2038
  Type Recall:    0.5289
  Type F1 Score:  0.2943
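
The micro-averaged figures above follow directly from the pooled counts. As a sanity check, the sketch below recomputes them from TP, FP and FN using the standard definitions. `prf_from_counts` is a hypothetical helper written for this check and is not the notebook's `micro_f1_score`.

```python
def prf_from_counts(tp: int, fp: int, fn: int):
    """Precision, recall and F1 from pooled counts (standard definitions)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts taken from the optimized model output above.
precision, recall, f1 = prf_from_counts(tp=261, fp=782, fn=190)
print(f"Precision: {precision:.4f}  Recall: {recall:.4f}  F1: {f1:.4f}")
# Precision: 0.2502  Recall: 0.5787  F1: 0.3494
```

With 782 false positives against 190 false negatives, over-prediction is the main factor holding F1 down.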


## Results Comparison

In [76]:
import pandas as pd

# Micro-averaged comparison
results_df = pd.DataFrame([
    {
        'Model': 'Baseline (Zero-Shot)',
        'Precision': baseline_metrics['precision'],
        'Recall': baseline_metrics['recall'],
        'F1': baseline_metrics['f1']
    },
    {
        'Model': 'Optimized (MIPROv2)',
        'Precision': optimized_metrics['precision'],
        'Recall': optimized_metrics['recall'],
        'F1': optimized_metrics['f1']
    }
])

print("Micro-averaged Metrics:")
print(results_df.to_markdown(index=False))

# Type-level comparison
type_results_df = pd.DataFrame([
    {
        'Model': 'Baseline (Zero-Shot)',
        'Type Precision': baseline_metrics['type_precision'],
        'Type Recall': baseline_metrics['type_recall'],
        'Type F1': baseline_metrics['type_f1']
    },
    {
        'Model': 'Optimized (MIPROv2)',
        'Type Precision': optimized_metrics['type_precision'],
        'Type Recall': optimized_metrics['type_recall'],
        'Type F1': optimized_metrics['type_f1']
    }
])

print("\n\nType-level Metrics:")
print(type_results_df.to_markdown(index=False))

# Show improvement
if baseline_metrics['f1'] > 0:
    f1_improvement = (optimized_metrics['f1'] - baseline_metrics['f1']) / baseline_metrics['f1'] * 100
    print(f"\n\nMicro F1 Score improvement: {f1_improvement:+.1f}%")

if baseline_metrics['type_f1'] > 0:
    type_f1_improvement = (optimized_metrics['type_f1'] - baseline_metrics['type_f1']) / baseline_metrics['type_f1'] * 100
    print(f"Type F1 Score improvement: {type_f1_improvement:+.1f}%")

Micro-averaged Metrics:
| Model                |   Precision |   Recall |       F1 |
|:---------------------|------------:|---------:|---------:|
| Baseline (Zero-Shot) |    0.215266 | 0.556541 | 0.310451 |
| Optimized (MIPROv2)  |    0.25024  | 0.578714 | 0.349398 |


Type-level Metrics:
| Model                |   Type Precision |   Type Recall |   Type F1 |
|:---------------------|-----------------:|--------------:|----------:|
| Baseline (Zero-Shot) |         0.172119 |      0.5      |  0.256085 |
| Optimized (MIPROv2)  |         0.203822 |      0.528926 |  0.294253 |


Micro F1 Score improvement: +12.5%
Type F1 Score improvement: +14.9%
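
The percentages printed above are relative gains over the baseline, not percentage points. A short sketch, using the values copied from the micro-averaged table, puts the absolute and relative differences side by side. The dictionary names are illustrative.

```python
# Values copied from the micro-averaged table above.
baseline = {"precision": 0.215266, "recall": 0.556541, "f1": 0.310451}
optimized = {"precision": 0.25024, "recall": 0.578714, "f1": 0.349398}

gains = {}
for name in baseline:
    absolute = optimized[name] - baseline[name]   # difference in score
    relative = absolute / baseline[name] * 100    # percent of the baseline
    gains[name] = (absolute, relative)
    print(f"{name:<9} {absolute:+.4f} absolute  {relative:+.1f}% relative")
```

For micro F1 the +12.5% relative gain corresponds to roughly 3.9 points in absolute terms (0.3105 to 0.3494).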


## Save Predictions to Files

In [71]:
def save_predictions(predictions, sentences, output_path):
    """Save predictions in competition format."""
    output = {'data': []}
    for pred, sent in zip(predictions, sentences):
        output['data'].append({
            'document_id': sent['document_id'],
            'paragraph_id': sent['paragraph_id'],
            'sentence_id': sent['sentence_id'],
            'term_list': pred
        })
    
    os.makedirs(os.path.dirname(output_path) or '.', exist_ok=True)
    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(output, f, ensure_ascii=False, indent=2)
    print(f"✓ Saved {len(predictions)} predictions to {output_path}")


# Save both sets of predictions
save_predictions(baseline_preds, dev_sentences, 'predictions/subtask_a_dev_dspy_baseline_preds.json')
save_predictions(optimized_preds, dev_sentences, 'predictions/subtask_a_dev_dspy_optimized_preds.json')

✓ Saved 577 predictions to predictions/subtask_a_dev_dspy_baseline_preds.json
✓ Saved 577 predictions to predictions/subtask_a_dev_dspy_optimized_preds.json


## Load and Test Saved Model

In [72]:
# Test loading optimized model
loaded_model = TermExtractor()
loaded_model.load('models/dspy_optimized.pkl')

# Test on a sample - use __call__ instead of forward
test_pred = loaded_model(dev_texts[0])
assert test_pred == optimized_preds[0], "Loaded model predictions don't match"

print("✓ Optimized model saved and loaded correctly")
print("\nAll models successfully saved and can be reloaded!")

✓ Optimized model saved and loaded correctly

All models successfully saved and can be reloaded!


In [77]:
loaded_model.predictors()[0].signature.instructions

'Analyze the following Italian sentence related to municipal waste management and extract any relevant waste management terms. Please ensure to focus solely on terms specific to the waste management domain, and disregard any named entities or nested terms. Return the extracted terms as a clean, comma-separated list in lowercase.'

## Example Predictions

In [73]:
# Show example predictions from both models
example_idx = 10
example_text = dev_texts[example_idx]
example_gold = dev_gold[example_idx]
example_baseline = baseline_preds[example_idx]
example_optimized = optimized_preds[example_idx]

print(f"Sentence: {example_text}\n")
print(f"Gold terms: {example_gold}\n")
print(f"Baseline predictions: {example_baseline}")
print(f"Optimized predictions: {example_optimized}\n")

# Show what each model got right/wrong
baseline_correct = set(example_baseline) & set(example_gold)
optimized_correct = set(example_optimized) & set(example_gold)

print(f"Baseline correct: {baseline_correct}")
print(f"Optimized correct: {optimized_correct}")

Sentence: #Differenziati

Gold terms: ['differenziati']

Baseline predictions: ['differenziati']
Optimized predictions: ['differenziati']

Baseline correct: {'differenziati'}
Optimized correct: {'differenziati'}


## Inspect Optimized Prompts

DSPy lets us inspect the most recent LM call, which shows the full prompt the optimized program sends, including its rewritten instruction.

In [79]:
# Inspect the optimized prompt
print("Optimized Model Prompt:")
print("="*60)

# Get the last prompt used by the model
try:
    dspy.inspect_history(n=1)
except Exception as e:
    # Report the actual error instead of silently discarding it
    print(f"Prompt details not available in this DSPy version: {e}")

print("="*60)

Optimized Model Prompt:

[2025-11-05T14:53:41.450613]

System message:

Your input fields are:
1. `sentence` (str): Italian sentence about waste management
Your output fields are:
1. `terms` (str): Comma-separated list of waste management terms (lowercase)
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## sentence ## ]]
{sentence}

[[ ## terms ## ]]
{terms}

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        Analyze the following Italian sentence related to municipal waste management and extract any relevant waste management terms. Please ensure to focus solely on terms specific to the waste management domain, and disregard any named entities or nested terms. Return the extracted terms as a clean, comma-separated list in lowercase.


User message:

[[ ## sentence ## ]]
Unione dei Comuni “Alto Cilento”


Assistant message:

[[ ## terms ## ]]
none

[[ ## completed ##