# Notebook Summary
This notebook is for learning basic DSPy syntax and how each DSPy component works. It follows the experiments in this tutorial from the DSPy tutorials page:
- https://dspy.ai/tutorials/classification_finetuning/

# Imports

In [25]:
import dspy
from dspy.datasets import DataLoader
from typing import Literal
from datasets import load_dataset
import random

# Load Dataset

The tutorial uses the BANKING77 dataset, a collection of online-banking customer service queries annotated with 77 intent categories.

In [6]:
# Load the Banking77 dataset.
CLASSES = load_dataset("PolyAI/banking77", split="train", trust_remote_code=True).features['label'].names
kwargs = dict(fields=("text", "label"), input_keys=("text",), split="train", trust_remote_code=True)

In [19]:
print(f"Number of Classes: {len(CLASSES)}") # actually has 77 classes as the name suggests
print(CLASSES)

Number of Classes: 77
['activate_my_card', 'age_limit', 'apple_pay_or_google_pay', 'atm_support', 'automatic_top_up', 'balance_not_updated_after_bank_transfer', 'balance_not_updated_after_cheque_or_cash_deposit', 'beneficiary_not_allowed', 'cancel_transfer', 'card_about_to_expire', 'card_acceptance', 'card_arrival', 'card_delivery_estimate', 'card_linking', 'card_not_working', 'card_payment_fee_charged', 'card_payment_not_recognised', 'card_payment_wrong_exchange_rate', 'card_swallowed', 'cash_withdrawal_charge', 'cash_withdrawal_not_recognised', 'change_pin', 'compromised_card', 'contactless_not_working', 'country_support', 'declined_card_payment', 'declined_cash_withdrawal', 'declined_transfer', 'direct_debit_payment_not_recognised', 'disposable_card_limits', 'edit_personal_details', 'exchange_charge', 'exchange_rate', 'exchange_via_app', 'extra_charge_on_statement', 'failed_transfer', 'fiat_currency_support', 'get_disposable_virtual_card', 'get_physical_card', 'getting_spare_card', 

The tutorial uses 1,000 data points and provides 'hints', but I don't yet understand their purpose or why `dspy.Example` is used.

Let's take a closer look.

In [14]:
x = DataLoader().from_huggingface(dataset_name="PolyAI/banking77", **kwargs)[:1]
print(f"Raw Data : \n {x[0]}")

Raw Data : 
 Example({'text': 'I am still waiting on my card?', 'label': 11}) (input_keys={'text'})


In [21]:
check = [
    dspy.Example(x, label=CLASSES[x.label]).with_inputs("text") # what is example for?
    for x in DataLoader().from_huggingface(dataset_name="PolyAI/banking77", **kwargs)[:1]
]
print(f"DSPy Example Data : \n {check[0]}")

DSPy Example Data : 
 Example({'text': 'I am still waiting on my card?', 'label': 'card_arrival'}) (input_keys={'text'})


The result has the same structure as the raw data: the same fields and the same input keys. The only change is that the numeric label (11) is replaced by its class name (`card_arrival`).

According to the official DSPy documentation (https://dspy.ai/deep-dive/data-handling/examples/), `dspy.Example` objects play a role similar to training and test samples in traditional ML.
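To make the role of an example object concrete, here is a toy stand-in written with the standard library only. `MiniExample` and `CLASSES_SUBSET` are hypothetical names for illustration, not DSPy's implementation. The sketch only mimics what the output above shows: a record of fields plus a set of input keys, with every other field treated as a label.

```python
from dataclasses import dataclass, field

# Hypothetical three-class subset, used only for this illustration.
CLASSES_SUBSET = ["card_about_to_expire", "card_acceptance", "card_arrival"]


@dataclass
class MiniExample:
    """Toy stand-in for an example record (not DSPy's actual class)."""
    data: dict
    input_keys: frozenset = field(default_factory=frozenset)

    def with_inputs(self, *keys):
        # Return a copy that remembers which fields are model inputs.
        return MiniExample(dict(self.data), frozenset(keys))

    def inputs(self):
        # Fields the model is allowed to see.
        return {k: v for k, v in self.data.items() if k in self.input_keys}

    def labels(self):
        # Everything else is kept back as the expected answer.
        return {k: v for k, v in self.data.items() if k not in self.input_keys}


raw = {"text": "I am still waiting on my card?", "label": 2}
# Replace the integer id with its class name, as the notebook cell does.
ex = MiniExample({**raw, "label": CLASSES_SUBSET[raw["label"]]}).with_inputs("text")

print(ex.inputs())
print(ex.labels())
```

Under this reading, marking `text` as the input is what lets the same record serve both as a prompt (inputs only) and as a scoring target (labels only).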

In [24]:
# move forward with the tutorial

# note to self: the 1000 rows of data will serve as our labelled training data that will be used for 'tuning' the model
raw_data = [
    dspy.Example(x, label=CLASSES[x.label]).with_inputs("text")
    for x in DataLoader().from_huggingface(dataset_name="PolyAI/banking77", **kwargs)[:1000]
] 
# while this will serve as our unlabeled training data
unlabelled_data = [
    dspy.Example(text = x.text).with_inputs("text")
    for x in DataLoader().from_huggingface(dataset_name="PolyAI/banking77", **kwargs)[:500]
]
random.Random(0).shuffle(raw_data)
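The cell above shuffles with `random.Random(0)`, which makes the order reproducible. Note that `unlabelled_data` is built from the first 500 loaded rows, which are also among the 1,000 rows in `raw_data`, so the two lists share rows. The sketch below uses a stand-in list of integers (an assumption, not the real dataset) to show that a fixed seed gives a stable order and that slicing a single shuffled list yields disjoint splits.

```python
import random

rows = list(range(10))  # stand-in for the loaded examples

first = rows.copy()
random.Random(0).shuffle(first)

second = rows.copy()
random.Random(0).shuffle(second)

# Same seed, same order: slices of the shuffled list are stable across runs.
assert first == second

# Disjoint slices of one shuffled list never share a row.
train, dev = first[:5], first[5:8]
assert not set(train) & set(dev)
print(train, dev)
```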

# Load Model

In [29]:
# We will try a small gemma 3 model for this experiment
# Note: Make sure to have the ollama server running with the gemma3 model loaded
lm = dspy.LM('ollama_chat/gemma3:4b-it-qat', api_base='http://localhost:11434', api_key='')
dspy.configure(lm=lm)

# Creating a DSPy Signature

In [27]:
classify = dspy.ChainOfThought(f"text -> label : Literal{CLASSES}")
print(classify)

predict = Predict(StringSignature(text -> reasoning, label
    instructions='Given the fields `text`, produce the fields `label`.'
    text = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Text:', 'desc': '${text}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    label = Field(annotation=Literal['activate_my_card', 'age_limit', 'apple_pay_or_google_pay', 'atm_support', 'automatic_top_up', 'balance_not_updated_after_bank_transfer', 'balance_not_updated_after_cheque_or_cash_deposit', 'beneficiary_not_allowed', 'cancel_transfer', 'card_about_to_expire', 'card_acceptance', 'card_arrival', 'card_delivery_estimate', 'card_linking', 'card_not_working', 'card_payment_fee_charged', 'card_payment_not_recognised', 'card_payment_wrong_exchange_rate', 'card_swallowed', 'cash_withdrawal_charge', 'cash_withdrawa
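The signature string above is built with an f-string, so the list's `repr` ends up inside the text as a `Literal[...]` type expression. A minimal sketch with a three-class subset (the subset is an assumption for brevity) shows what that string looks like and the equivalent `typing` object:

```python
from typing import Literal, get_args

classes = ["card_arrival", "change_pin", "exchange_rate"]  # illustrative subset

# The f-string embeds the list's repr, producing a type expression as text.
signature = f"text -> label : Literal{classes}"
print(signature)

# The same constraint built as a real typing object.
LabelType = Literal[tuple(classes)]
print(get_args(LabelType))
```

This is why the printed signature shows the `label` field annotated with all 77 class names: the model's answer is expected to be exactly one of those strings.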

The tutorial proceeds with using a teacher student method for fine-tuning the model. As currently I don't have a paid model to use and tune, I will skip this part. Instead I will use MIPROv2

# Initial Performance

In [35]:
# this will set the language model for the classify program that we have made
classify.set_lm(lm)
# check the model's prediction using 1 example of data
print(f"Question: {raw_data[0].text}")
print(f"Answer : {raw_data[0].label}")

classify(text = raw_data[0].text) # this will return the model's prediction for the given example

Question: What if there is an error on the exchange rate?
Answer : card_payment_wrong_exchange_rate


Prediction(
    reasoning='The user is reporting an error related to the exchange rate. This falls under the category of incorrect exchange rates, specifically impacting a transaction or payment.',
    label='card_payment_wrong_exchange_rate'
)

Ok, looking at the example it's working quite well. The model is able to select the correct labels from the label list even with a small 4B "QUANTIZED" model

In [None]:
# Evaluate on the dataset
# Since I don't want to break my PC let's use a small devset of 100 examples
devset = raw_data[500:600]
metric = (lambda x, y, trace=None: x.label == y.label) # check the accuracy by comparing the label of the example with the model's prediction

# so evaluate takes 2 main arguments, the devset and the metric
evaluate = dspy.Evaluate(devset=devset, metric=metric, display_progress=True, display_table=5, num_threads=16)
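As a rough mental model of what the evaluation computes, the sketch below averages an exact-match metric over a tiny devset. It is not DSPy's implementation: `average_metric`, `toy_predict`, and `toy_devset` are hypothetical, and counting a failed prediction as a miss is an assumption based on the evaluation log, where the score stays at 61 while the denominator still reaches 100 after an error.

```python
from types import SimpleNamespace


def exact_match(example, pred, trace=None):
    # Same shape as the notebook's lambda metric.
    return example.label == pred.label


def average_metric(devset, predict, metric):
    """Percentage of examples where the metric holds; errors count as misses."""
    hits = 0
    for example in devset:
        try:
            pred = predict(example.text)
        except ValueError:
            continue  # a rejected prediction earns no credit
        hits += bool(metric(example, pred))
    return 100.0 * hits / len(devset)


def toy_predict(text):
    # Keyword rules standing in for the language model.
    if "card" in text:
        return SimpleNamespace(label="card_arrival")
    if "rate" in text:
        return SimpleNamespace(label="exchange_rate")
    raise ValueError("label not in the allowed set")


toy_devset = [
    SimpleNamespace(text="why hasnt my card come in yet?", label="card_arrival"),
    SimpleNamespace(text="Where can I find your exchange rates?", label="exchange_rate"),
    SimpleNamespace(text="what currencies do you accept?", label="fiat_currency_support"),
    SimpleNamespace(text="Is my card supported abroad?", label="country_support"),
]

score = average_metric(toy_devset, toy_predict, exact_match)
print(score)  # 50.0
```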

In [37]:
evaluate(classify)

  0%|          | 0/100 [00:00<?, ?it/s]



Average Metric: 1.00 / 1 (100.0%):   1%|          | 1/100 [00:18<30:46, 18.65s/it]



Average Metric: 9.00 / 16 (56.2%):  15%|█▌        | 15/100 [00:33<01:33,  1.09s/it]



Average Metric: 28.00 / 48 (58.3%):  48%|████▊     | 48/100 [01:16<01:18,  1.50s/it]



Average Metric: 49.00 / 84 (58.3%):  84%|████████▍ | 84/100 [02:10<00:21,  1.35s/it]



Average Metric: 61.00 / 98 (62.2%):  98%|█████████▊| 98/100 [02:23<00:01,  1.09it/s]

2025/06/05 23:57:46 ERROR dspy.utils.parallelizer: Error for Example({'text': 'Please check my payment from last Saturday as I feel I have been overcharged on the exchange rate.  Thank you.', 'label': 'card_payment_wrong_exchange_rate'}) (input_keys={'text'}): Both structured output format and JSON mode failed. Please choose a model that supports `response_format` argument. Original error: 'wrong_exchange_rate' is not one of ('activate_my_card', 'age_limit', 'apple_pay_or_google_pay', 'atm_support', 'automatic_top_up', 'balance_not_updated_after_bank_transfer', 'balance_not_updated_after_cheque_or_cash_deposit', 'beneficiary_not_allowed', 'cancel_transfer', 'card_about_to_expire', 'card_acceptance', 'card_arrival', 'card_delivery_estimate', 'card_linking', 'card_not_working', 'card_payment_fee_charged', 'card_payment_not_recognised', 'card_payment_wrong_exchange_rate', 'card_swallowed', 'cash_withdrawal_charge', 'cash_withdrawal_not_recognised', 'change_pin', 'compromised_card', 'conta

Average Metric: 61.00 / 100 (61.0%): : 101it [02:37,  1.56s/it]                       

2025/06/05 23:57:49 INFO dspy.evaluate.evaluate: Average Metric: 61 / 100 (61.0%)





| | text | example_label | reasoning | pred_label | `<lambda>` |
|---|---|---|---|---|---|
| 0 | Which fiat currencies do you currently support? Will this change i... | fiat_currency_support | The user is asking about the fiat currencies supported by the serv... | fiat_currency_support | ✔️ [True] |
| 1 | I didn't receive my money earlier and it says the transaction is s... | pending_cash_withdrawal | The user is reporting that they haven't received their money and t... | failed_transfer | |
| 2 | what currencies do you accept? | fiat_currency_support | The input text is a list of strings, each representing a potential... | Refund_not_showing_up | |
| 3 | Where can I find your exchange rates? | exchange_rate | The user is asking to find the exchange rates. This falls under th... | exchange_rate | ✔️ [True] |
| 4 | why hasnt my card come in yet? | card_arrival | The user is asking about the delivery status of their card. This d... | card_arrival | ✔️ [True] |


61.0

So without any prompt tuning, the model reaches 61% accuracy on this 100-example devset.
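Some of the misses in the log are not wrong intents but labels rejected by the `Literal` check. For example, the notebook logs show `'cash_withdrawal_not_recognized'` (American spelling) and `'refund_not_showing_up'` being rejected, while the results table shows `Refund_not_showing_up` with a capital R passing validation, which suggests the check is case-sensitive. One possible mitigation, sketched here with the standard library only, is to snap near-miss strings onto the allowed list. `snap_label`, the `allowed` subset, and the 0.9 cutoff are all assumptions for illustration, not part of DSPy or the tutorial.

```python
from difflib import get_close_matches

# Illustrative subset of the 77 classes.
allowed = [
    "Refund_not_showing_up",
    "card_arrival",
    "card_linking",
    "cash_withdrawal_charge",
    "cash_withdrawal_not_recognised",
]


def snap_label(raw, allowed, cutoff=0.9):
    """Map a raw model label onto an allowed class, or return None."""
    if raw in allowed:
        return raw
    # Second chance: ignore letter case.
    by_lower = {name.lower(): name for name in allowed}
    if raw.lower() in by_lower:
        return by_lower[raw.lower()]
    # Last chance: accept only a very close spelling variant.
    close = get_close_matches(raw, allowed, n=1, cutoff=cutoff)
    return close[0] if close else None


print(snap_label("refund_not_showing_up", allowed))
print(snap_label("cash_withdrawal_not_recognized", allowed))
print(snap_label("card_tracking", allowed))
```

The high cutoff is deliberate: an invented label such as `card_tracking` is fairly similar to `card_linking` as a string, and a looser threshold would silently map it to the wrong intent.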

# MIPROv2

In [40]:
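optimizer = dspy.MIPROv2(metric=metric, auto="light", num_threads=4) # can choose between light, medium, and heavy optimization levels
# note: the 100-example devset is reused here as the trainset, so the optimizer's scores
# are measured on the same examples as the 61% baseline above
optimized_program = optimizer.compile(classify,
                                      trainset=devset,
                                      requires_permission_to_run=False
                                      )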

2025/06/06 00:04:19 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING LIGHT AUTO RUN SETTINGS:
num_trials: 10
minibatch: True
num_fewshot_candidates: 6
num_instruct_candidates: 3
valset size: 80

2025/06/06 00:04:19 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/06/06 00:04:19 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/06/06 00:04:19 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=6 sets of demonstrations...


Bootstrapping set 1/6
Bootstrapping set 2/6
Bootstrapping set 3/6


 35%|███▌      | 7/20 [00:12<00:23,  1.79s/it]


Bootstrapped 4 full traces after 7 examples for up to 1 rounds, amounting to 7 attempts.
Bootstrapping set 4/6


 15%|█▌        | 3/20 [00:02<00:13,  1.23it/s]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 5/6


 20%|██        | 4/20 [00:03<00:12,  1.28it/s]


Bootstrapped 2 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 6/6


 10%|█         | 2/20 [00:01<00:15,  1.14it/s]
2025/06/06 00:04:39 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/06/06 00:04:39 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.


Bootstrapped 1 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Error getting source code: unhashable type: 'dict'.

Running without program aware proposer.


2025/06/06 00:04:43 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing N=3 instructions...

2025/06/06 00:04:46 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2025/06/06 00:04:46 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Given the fields `text`, produce the fields `label`.

2025/06/06 00:04:46 INFO dspy.teleprompt.mipro_optimizer_v2: 1: You are a customer support agent assisting users with inquiries about their card services and financial transactions. A user will provide a text query related to card delivery, transaction status, or fees. Your task is to determine the most relevant category for the user's question. Possible categories include: `card_arrival`, `transaction_status`, `fees`, and `other`. Respond with the label that best describes the user's query.

2025/06/06 00:04:46 INFO dspy.teleprompt.mipro_optimizer_v2: 2: You are a customer support assistant specializing in card services. Given a customer inquiry, classify the inquiry into one of

Average Metric: 50.00 / 80 (62.5%): 100%|██████████| 80/80 [00:00<00:00, 2688.27it/s]

2025/06/06 00:04:46 INFO dspy.evaluate.evaluate: Average Metric: 50 / 80 (62.5%)
2025/06/06 00:04:46 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 62.5

2025/06/06 00:04:47 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 2 / 13 - Minibatch ==



Average Metric: 2.00 / 4 (50.0%):  11%|█▏        | 4/35 [00:06<00:40,  1.32s/it] 



Average Metric: 6.00 / 9 (66.7%):  26%|██▌       | 9/35 [00:21<01:19,  3.07s/it]

2025/06/06 00:05:13 ERROR dspy.utils.parallelizer: Error for Example({'text': 'Do you also have this extra fee on your statement?', 'label': 'extra_charge_on_statement'}) (input_keys={'text'}): Both structured output format and JSON mode failed. Please choose a model that supports `response_format` argument. Original error: 'cash_withdrawal_not_recognized' is not one of ('activate_my_card', 'age_limit', 'apple_pay_or_google_pay', 'atm_support', 'automatic_top_up', 'balance_not_updated_after_bank_transfer', 'balance_not_updated_after_cheque_or_cash_deposit', 'beneficiary_not_allowed', 'cancel_transfer', 'card_about_to_expire', 'card_acceptance', 'card_arrival', 'card_delivery_estimate', 'card_linking', 'card_not_working', 'card_payment_fee_charged', 'card_payment_not_recognised', 'card_payment_wrong_exchange_rate', 'card_swallowed', 'cash_withdrawal_charge', 'cash_withdrawal_not_recognised', 'change_pin', 'compromised_card', 'contactless_not_working', 'country_support', 'declined_card_p

Average Metric: 10.00 / 13 (76.9%):  40%|████      | 14/35 [00:29<00:29,  1.39s/it]



Average Metric: 12.00 / 16 (75.0%):  49%|████▊     | 17/35 [00:34<00:26,  1.48s/it]

2025/06/06 00:05:22 ERROR dspy.utils.parallelizer: Error for Example({'text': 'Why is there a fee added to my statement?', 'label': 'extra_charge_on_statement'}) (input_keys={'text'}): Both structured output format and JSON mode failed. Please choose a model that supports `response_format` argument. Original error: 'refund_not_showing_up' is not one of ('activate_my_card', 'age_limit', 'apple_pay_or_google_pay', 'atm_support', 'automatic_top_up', 'balance_not_updated_after_bank_transfer', 'balance_not_updated_after_cheque_or_cash_deposit', 'beneficiary_not_allowed', 'cancel_transfer', 'card_about_to_expire', 'card_acceptance', 'card_arrival', 'card_delivery_estimate', 'card_linking', 'card_not_working', 'card_payment_fee_charged', 'card_payment_not_recognised', 'card_payment_wrong_exchange_rate', 'card_swallowed', 'cash_withdrawal_charge', 'cash_withdrawal_not_recognised', 'change_pin', 'compromised_card', 'contactless_not_working', 'country_support', 'declined_card_payment', 'declined

Average Metric: 12.00 / 16 (75.0%):  51%|█████▏    | 18/35 [00:35<00:22,  1.31s/it]



Average Metric: 16.00 / 25 (64.0%):  77%|███████▋  | 27/35 [01:07<00:23,  2.94s/it]



Average Metric: 22.00 / 33 (66.7%): 100%|██████████| 35/35 [01:29<00:00,  2.56s/it]

2025/06/06 00:06:16 INFO dspy.evaluate.evaluate: Average Metric: 22.0 / 35 (62.9%)
2025/06/06 00:06:16 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 62.86 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 3'].
2025/06/06 00:06:16 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [62.86]
2025/06/06 00:06:16 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [62.5]
2025/06/06 00:06:16 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 62.5


2025/06/06 00:06:16 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 3 / 13 - Minibatch ==



  0%|          | 0/35 [00:00<?, ?it/s]



Average Metric: 7.00 / 10 (70.0%):  29%|██▊       | 10/35 [00:11<00:23,  1.04it/s]



Average Metric: 10.00 / 19 (52.6%):  54%|█████▍    | 19/35 [00:20<00:13,  1.14it/s]



Average Metric: 19.00 / 35 (54.3%): 100%|██████████| 35/35 [00:35<00:00,  1.01s/it]

2025/06/06 00:06:51 INFO dspy.evaluate.evaluate: Average Metric: 19 / 35 (54.3%)
2025/06/06 00:06:51 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 54.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 0'].
2025/06/06 00:06:51 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [62.86, 54.29]
2025/06/06 00:06:51 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [62.5]
2025/06/06 00:06:51 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 62.5


2025/06/06 00:06:51 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 4 / 13 - Minibatch ==



Average Metric: 2.00 / 2 (100.0%):   6%|▌         | 2/35 [00:12<02:50,  5.18s/it]



Average Metric: 5.00 / 6 (83.3%):  17%|█▋        | 6/35 [00:17<00:51,  1.76s/it] 



Average Metric: 5.00 / 8 (62.5%):  23%|██▎       | 8/35 [00:20<00:44,  1.64s/it]

2025/06/06 00:07:14 ERROR dspy.utils.parallelizer: Error for Example({'text': 'Is there a way to track my card?', 'label': 'card_arrival'}) (input_keys={'text'}): Both structured output format and JSON mode failed. Please choose a model that supports `response_format` argument. Original error: 'card_tracking' is not one of ('activate_my_card', 'age_limit', 'apple_pay_or_google_pay', 'atm_support', 'automatic_top_up', 'balance_not_updated_after_bank_transfer', 'balance_not_updated_after_cheque_or_cash_deposit', 'beneficiary_not_allowed', 'cancel_transfer', 'card_about_to_expire', 'card_acceptance', 'card_arrival', 'card_delivery_estimate', 'card_linking', 'card_not_working', 'card_payment_fee_charged', 'card_payment_not_recognised', 'card_payment_wrong_exchange_rate', 'card_swallowed', 'cash_withdrawal_charge', 'cash_withdrawal_not_recognised', 'change_pin', 'compromised_card', 'contactless_not_working', 'country_support', 'declined_card_payment', 'declined_cash_withdrawal', 'declined_t

Average Metric: 23.00 / 33 (69.7%):  97%|█████████▋| 34/35 [00:51<00:01,  1.14s/it]



Average Metric: 24.00 / 34 (70.6%): 100%|██████████| 35/35 [01:21<00:00,  2.32s/it]

2025/06/06 00:08:13 INFO dspy.evaluate.evaluate: Average Metric: 24.0 / 35 (68.6%)
2025/06/06 00:08:13 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 68.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 5'].
2025/06/06 00:08:13 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [62.86, 54.29, 68.57]
2025/06/06 00:08:13 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [62.5]
2025/06/06 00:08:13 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 62.5


2025/06/06 00:08:13 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 5 / 13 - Minibatch ==



Average Metric: 1.00 / 3 (33.3%):   9%|▊         | 3/35 [00:02<00:28,  1.11it/s] 



Average Metric: 4.00 / 7 (57.1%):  20%|██        | 7/35 [00:13<00:52,  1.86s/it]



Average Metric: 6.00 / 10 (60.0%):  26%|██▌       | 9/35 [00:25<01:59,  4.59s/it]



Average Metric: 19.00 / 34 (55.9%):  97%|█████████▋| 34/35 [01:16<00:01,  1.10s/it]



Average Metric: 20.00 / 35 (57.1%): 100%|██████████| 35/35 [01:27<00:00,  2.49s/it]

2025/06/06 00:09:40 INFO dspy.evaluate.evaluate: Average Metric: 20 / 35 (57.1%)
2025/06/06 00:09:40 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 57.14 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 2'].
2025/06/06 00:09:40 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [62.86, 54.29, 68.57, 57.14]
2025/06/06 00:09:40 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [62.5]
2025/06/06 00:09:40 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 62.5


2025/06/06 00:09:40 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 6 / 13 - Minibatch ==



Average Metric: 5.00 / 6 (83.3%):  17%|█▋        | 6/35 [00:05<00:30,  1.04s/it] 



Average Metric: 10.00 / 11 (90.9%):  31%|███▏      | 11/35 [00:19<00:45,  1.91s/it]



Average Metric: 13.00 / 17 (76.5%):  49%|████▊     | 17/35 [00:37<00:34,  1.93s/it]



Average Metric: 18.00 / 22 (81.8%):  63%|██████▎   | 22/35 [00:49<00:26,  2.01s/it]

2025/06/06 00:10:30 ERROR dspy.utils.parallelizer: Error for Example({'text': 'I want to have multiple currencies in my account if possible.', 'label': 'fiat_currency_support'}) (input_keys={'text'}): Both structured output format and JSON mode failed. Please choose a model that supports `response_format` argument. Original error: 'refund_not_showing_up' is not one of ('activate_my_card', 'age_limit', 'apple_pay_or_google_pay', 'atm_support', 'automatic_top_up', 'balance_not_updated_after_bank_transfer', 'balance_not_updated_after_cheque_or_cash_deposit', 'beneficiary_not_allowed', 'cancel_transfer', 'card_about_to_expire', 'card_acceptance', 'card_arrival', 'card_delivery_estimate', 'card_linking', 'card_not_working', 'card_payment_fee_charged', 'card_payment_not_recognised', 'card_payment_wrong_exchange_rate', 'card_swallowed', 'cash_withdrawal_charge', 'cash_withdrawal_not_recognised', 'change_pin', 'compromised_card', 'contactless_not_working', 'country_support', 'declined_card_pay

Average Metric: 23.00 / 27 (85.2%):  80%|████████  | 28/35 [01:05<00:20,  2.87s/it]



Average Metric: 24.00 / 28 (85.7%):  83%|████████▎ | 29/35 [01:07<00:15,  2.63s/it]

2025/06/06 00:10:49 ERROR dspy.utils.parallelizer: Error for Example({'text': 'Why has my deposit in the ATM not cleared yet?', 'label': 'pending_cash_withdrawal'}) (input_keys={'text'}): Both structured output format and JSON mode failed. Please choose a model that supports `response_format` argument. Original error: 'refund_not_showing_up' is not one of ('activate_my_card', 'age_limit', 'apple_pay_or_google_pay', 'atm_support', 'automatic_top_up', 'balance_not_updated_after_bank_transfer', 'balance_not_updated_after_cheque_or_cash_deposit', 'beneficiary_not_allowed', 'cancel_transfer', 'card_about_to_expire', 'card_acceptance', 'card_arrival', 'card_delivery_estimate', 'card_linking', 'card_not_working', 'card_payment_fee_charged', 'card_payment_not_recognised', 'card_payment_wrong_exchange_rate', 'card_swallowed', 'cash_withdrawal_charge', 'cash_withdrawal_not_recognised', 'change_pin', 'compromised_card', 'contactless_not_working', 'country_support', 'declined_card_payment', 'decli

Average Metric: 27.00 / 32 (84.4%):  97%|█████████▋| 34/35 [01:14<00:01,  1.53s/it]



Average Metric: 27.00 / 33 (81.8%): 100%|██████████| 35/35 [01:33<00:00,  2.67s/it]

2025/06/06 00:11:13 INFO dspy.evaluate.evaluate: Average Metric: 27.0 / 35 (77.1%)
2025/06/06 00:11:13 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 77.14 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 5'].
2025/06/06 00:11:13 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [62.86, 54.29, 68.57, 57.14, 77.14]
2025/06/06 00:11:13 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [62.5]
2025/06/06 00:11:13 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 62.5


2025/06/06 00:11:13 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 7 / 13 - Full Evaluation =====
2025/06/06 00:11:13 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 77.14) from minibatch trials...



Average Metric: 13.00 / 17 (76.5%):  21%|██▏       | 17/80 [00:05<00:19,  3.19it/s]



Average Metric: 14.00 / 19 (73.7%):  24%|██▍       | 19/80 [00:07<00:32,  1.87it/s]

2025/06/06 00:11:23 ERROR dspy.utils.parallelizer: Error for Example({'text': 'I see a $1 charge in a transaction.', 'label': 'extra_charge_on_statement'}) (input_keys={'text'}): Both structured output format and JSON mode failed. Please choose a model that supports `response_format` argument. Original error: 'card_exchange_rate' is not one of ('activate_my_card', 'age_limit', 'apple_pay_or_google_pay', 'atm_support', 'automatic_top_up', 'balance_not_updated_after_bank_transfer', 'balance_not_updated_after_cheque_or_cash_deposit', 'beneficiary_not_allowed', 'cancel_transfer', 'card_about_to_expire', 'card_acceptance', 'card_arrival', 'card_delivery_estimate', 'card_linking', 'card_not_working', 'card_payment_fee_charged', 'card_payment_not_recognised', 'card_payment_wrong_exchange_rate', 'card_swallowed', 'cash_withdrawal_charge', 'cash_withdrawal_not_recognised', 'change_pin', 'compromised_card', 'contactless_not_working', 'country_support', 'declined_card_payment', 'declined_cash_wit

Average Metric: 16.00 / 21 (76.2%):  26%|██▋       | 21/80 [00:10<00:49,  1.19it/s]



Average Metric: 18.00 / 23 (78.3%):  30%|███       | 24/80 [00:16<01:11,  1.28s/it]



Average Metric: 23.00 / 30 (76.7%):  38%|███▊      | 30/80 [00:20<00:41,  1.22it/s]



Average Metric: 28.00 / 38 (73.7%):  49%|████▉     | 39/80 [00:36<01:07,  1.65s/it]

2025/06/06 00:11:56 ERROR dspy.utils.parallelizer: Error for Example({'text': 'I was able to find my card. How to I go about putting it into my app?', 'label': 'card_linking'}) (input_keys={'text'}): Both structured output format and JSON mode failed. Please choose a model that supports `response_format` argument. Original error: litellm.APIConnectionError: Ollama_chatException - {"error":"POST predict: Post \"http://127.0.0.1:58107/completion\": read tcp 127.0.0.1:58110-\u003e127.0.0.1:58107: wsarecv: An existing connection was forcibly closed by the remote host."}. Set `provide_traceback=True` for traceback.


Average Metric: 36.00 / 50 (72.0%):  65%|██████▌   | 52/80 [00:55<00:47,  1.69s/it]



Average Metric: 39.00 / 57 (68.4%):  72%|███████▎  | 58/80 [01:07<00:44,  2.03s/it]



Average Metric: 42.00 / 60 (70.0%):  78%|███████▊  | 62/80 [01:08<00:24,  1.34s/it]



Average Metric: 43.00 / 61 (70.5%):  78%|███████▊  | 62/80 [01:08<00:24,  1.34s/it]

2025/06/06 00:12:22 ERROR dspy.utils.parallelizer: Error for Example({'text': 'I want to have multiple currencies in my account if possible.', 'label': 'fiat_currency_support'}) (input_keys={'text'}): Both structured output format and JSON mode failed. Please choose a model that supports `response_format` argument. Original error: 'refund_not_showing_up' is not one of ('activate_my_card', 'age_limit', 'apple_pay_or_google_pay', 'atm_support', 'automatic_top_up', 'balance_not_updated_after_bank_transfer', 'balance_not_updated_after_cheque_or_cash_deposit', 'beneficiary_not_allowed', 'cancel_transfer', 'card_about_to_expire', 'card_acceptance', 'card_arrival', 'card_delivery_estimate', 'card_linking', 'card_not_working', 'card_payment_fee_charged', 'card_payment_not_recognised', 'card_payment_wrong_exchange_rate', 'card_swallowed', 'cash_withdrawal_charge', 'cash_withdrawal_not_recognised', 'change_pin', 'compromised_card', 'contactless_not_working', 'country_support', 'declined_card_pay

Average Metric: 45.00 / 64 (70.3%):  84%|████████▍ | 67/80 [01:13<00:15,  1.17s/it]



Average Metric: 45.00 / 65 (69.2%):  85%|████████▌ | 68/80 [01:14<00:15,  1.25s/it]



Average Metric: 54.00 / 77 (70.1%): 100%|██████████| 80/80 [01:34<00:00,  1.18s/it]

2025/06/06 00:12:48 INFO dspy.evaluate.evaluate: Average Metric: 54.0 / 80 (67.5%)
2025/06/06 00:12:48 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 67.5
2025/06/06 00:12:48 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [62.5, 67.5]
2025/06/06 00:12:48 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 67.5
2025/06/06 00:12:48 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/06/06 00:12:48 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 8 / 13 - Minibatch ==



  0%|          | 0/35 [00:00<?, ?it/s]



Average Metric: 4.00 / 5 (80.0%):  14%|█▍        | 5/35 [00:05<00:25,  1.17it/s] 

2025/06/06 00:12:54 ERROR dspy.utils.parallelizer: Error for Example({'text': 'How can I check on the status of my new card?', 'label': 'card_arrival'}) (input_keys={'text'}): Both structured output format and JSON mode failed. Please choose a model that supports `response_format` argument. Original error: 'transaction_status' is not one of ('activate_my_card', 'age_limit', 'apple_pay_or_google_pay', 'atm_support', 'automatic_top_up', 'balance_not_updated_after_bank_transfer', 'balance_not_updated_after_cheque_or_cash_deposit', 'beneficiary_not_allowed', 'cancel_transfer', 'card_about_to_expire', 'card_acceptance', 'card_arrival', 'card_delivery_estimate', 'card_linking', 'card_not_working', 'card_payment_fee_charged', 'card_payment_not_recognised', 'card_payment_wrong_exchange_rate', 'card_swallowed', 'cash_withdrawal_charge', 'cash_withdrawal_not_recognised', 'change_pin', 'compromised_card', 'contactless_not_working', 'country_support', 'declined_card_payment', 'declined_cash_withdr

Average Metric: 4.00 / 6 (66.7%):  20%|██        | 7/35 [00:06<00:25,  1.10it/s]



Average Metric: 6.00 / 10 (60.0%):  29%|██▊       | 10/35 [00:08<00:30,  1.20s/it]



Average Metric: 7.00 / 13 (53.8%):  40%|████      | 14/35 [00:12<00:17,  1.19it/s]



Average Metric: 7.00 / 14 (50.0%):  43%|████▎     | 15/35 [00:13<00:17,  1.17it/s]



Average Metric: 9.00 / 23 (39.1%):  66%|██████▌   | 23/35 [00:16<00:06,  1.78it/s]



Average Metric: 15.00 / 34 (44.1%): 100%|██████████| 35/35 [00:24<00:00,  1.41it/s]

2025/06/06 00:13:13 INFO dspy.evaluate.evaluate: Average Metric: 15.0 / 35 (42.9%)
2025/06/06 00:13:13 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 42.86 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 0'].
2025/06/06 00:13:13 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [62.86, 54.29, 68.57, 57.14, 77.14, 42.86]
2025/06/06 00:13:13 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [62.5, 67.5]
2025/06/06 00:13:13 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 67.5


2025/06/06 00:13:13 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 9 / 13 - Minibatch ==



Average Metric: 5.00 / 6 (83.3%):  17%|█▋        | 6/35 [00:14<00:46,  1.60s/it] 



Average Metric: 27.00 / 34 (79.4%):  97%|█████████▋| 34/35 [00:49<00:01,  1.20s/it]



Average Metric: 28.00 / 35 (80.0%): 100%|██████████| 35/35 [01:27<00:00,  2.51s/it]

2025/06/06 00:14:41 INFO dspy.evaluate.evaluate: Average Metric: 28 / 35 (80.0%)
2025/06/06 00:14:41 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 80.0 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 5'].
2025/06/06 00:14:41 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [62.86, 54.29, 68.57, 57.14, 77.14, 42.86, 80.0]
2025/06/06 00:14:41 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [62.5, 67.5]
2025/06/06 00:14:41 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 67.5


2025/06/06 00:14:41 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 10 / 13 - Minibatch ==



Average Metric: 4.00 / 4 (100.0%):  11%|█▏        | 4/35 [00:04<00:37,  1.20s/it]



Average Metric: 14.00 / 19 (73.7%):  54%|█████▍    | 19/35 [00:41<00:35,  2.24s/it]



Average Metric: 23.00 / 35 (65.7%): 100%|██████████| 35/35 [01:24<00:00,  2.42s/it]

2025/06/06 00:16:05 INFO dspy.evaluate.evaluate: Average Metric: 23 / 35 (65.7%)
2025/06/06 00:16:05 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 65.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 4'].
2025/06/06 00:16:05 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [62.86, 54.29, 68.57, 57.14, 77.14, 42.86, 80.0, 65.71]
2025/06/06 00:16:05 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [62.5, 67.5]
2025/06/06 00:16:05 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 67.5


2025/06/06 00:16:05 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 11 / 13 - Minibatch ==



Average Metric: 15.00 / 18 (83.3%):  49%|████▊     | 17/35 [00:33<00:50,  2.78s/it]



Average Metric: 30.00 / 35 (85.7%): 100%|██████████| 35/35 [01:12<00:00,  2.06s/it]

2025/06/06 00:17:17 INFO dspy.evaluate.evaluate: Average Metric: 30 / 35 (85.7%)
2025/06/06 00:17:17 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 85.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 5'].
2025/06/06 00:17:17 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [62.86, 54.29, 68.57, 57.14, 77.14, 42.86, 80.0, 65.71, 85.71]
2025/06/06 00:17:17 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [62.5, 67.5]
2025/06/06 00:17:17 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 67.5


2025/06/06 00:17:17 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 12 / 13 - Minibatch ==







Average Metric: 29.00 / 34 (85.3%):  97%|█████████▋| 34/35 [00:20<00:00,  1.71it/s]



Average Metric: 29.00 / 35 (82.9%): 100%|██████████| 35/35 [00:38<00:00,  1.10s/it]

2025/06/06 00:17:56 INFO dspy.evaluate.evaluate: Average Metric: 29 / 35 (82.9%)
2025/06/06 00:17:56 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 82.86 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 5'].
2025/06/06 00:17:56 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [62.86, 54.29, 68.57, 57.14, 77.14, 42.86, 80.0, 65.71, 85.71, 82.86]
2025/06/06 00:17:56 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [62.5, 67.5]
2025/06/06 00:17:56 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 67.5


2025/06/06 00:17:56 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 13 / 13 - Full Evaluation =====
2025/06/06 00:17:56 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 82.85666666666667) from minibatch trials...



Average Metric: 19.00 / 26 (73.1%):  31%|███▏      | 25/80 [00:02<00:05,  9.92it/s]  



Average Metric: 27.00 / 35 (77.1%):  42%|████▎     | 34/80 [00:05<00:09,  5.03it/s]



Average Metric: 33.00 / 44 (75.0%):  54%|█████▍    | 43/80 [00:05<00:05,  6.26it/s]



Average Metric: 36.00 / 47 (76.6%):  57%|█████▊    | 46/80 [00:07<00:06,  5.12it/s]



Average Metric: 37.00 / 49 (75.5%):  60%|██████    | 48/80 [00:09<00:08,  3.61it/s]

2025/06/06 00:18:07 ERROR dspy.utils.parallelizer: Error for Example({'text': 'What is this extra fee on my statement.', 'label': 'extra_charge_on_statement'}) (input_keys={'text'}): Both structured output format and JSON mode failed. Please choose a model that supports `response_format` argument. Original error: 'transaction_status' is not one of ('activate_my_card', 'age_limit', 'apple_pay_or_google_pay', 'atm_support', 'automatic_top_up', 'balance_not_updated_after_bank_transfer', 'balance_not_updated_after_cheque_or_cash_deposit', 'beneficiary_not_allowed', 'cancel_transfer', 'card_about_to_expire', 'card_acceptance', 'card_arrival', 'card_delivery_estimate', 'card_linking', 'card_not_working', 'card_payment_fee_charged', 'card_payment_not_recognised', 'card_payment_wrong_exchange_rate', 'card_swallowed', 'cash_withdrawal_charge', 'cash_withdrawal_not_recognised', 'change_pin', 'compromised_card', 'contactless_not_working', 'country_support', 'declined_card_payment', 'declined_cash

Average Metric: 49.00 / 64 (76.6%):  80%|████████  | 64/80 [00:14<00:04,  3.87it/s]



Average Metric: 59.00 / 79 (74.7%): 100%|██████████| 80/80 [00:17<00:00,  4.64it/s]

2025/06/06 00:18:13 INFO dspy.evaluate.evaluate: Average Metric: 59.0 / 80 (73.8%)
2025/06/06 00:18:13 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 73.75
2025/06/06 00:18:13 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [62.5, 67.5, 73.75]
2025/06/06 00:18:13 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 73.75
2025/06/06 00:18:13 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/06/06 00:18:13 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 73.75!





The optimizer raised the final score from 61% to 75%, a gain of about 14 percentage points (roughly 23% in relative terms). I think that is quite good!

Looking at the log, we can see the mechanism behind MIPROv2. There are 3 main steps:
- Bootstrap Few-Shot Examples
    - this searches the given data for good few-shot examples (an example is kept if the language model's answer for it passes the metric)
- Propose Instruction Candidates
    - this rewrites the instruction part of the prompt itself; according to the documentation, the proposer draws on a few things
        - a summary of the training data
        - a summary of the LM program's code (the DSPy program being optimized)
        - a tip such as 'be more creative' that steers how the instruction is reworded, with the aim of producing a higher-quality instruction
- Find an Optimized Combination of Few-Shot Examples & Instructions
    - Bayesian Optimization is used to choose which combinations of instructions and few-shot examples to evaluate
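
The third step can be pictured with a toy search loop. The sketch below is a simplified stand-in, not DSPy's implementation: it uses plain random search where MIPROv2 uses Bayesian Optimization, and `search`, `true_quality`, `noisy_minibatch_score`, and the candidate counts are all hypothetical. It only mirrors the pattern visible in the log: score (instruction, few-shot set) pairs on small minibatches, then periodically run a full evaluation on the candidate with the best average minibatch score.

```python
import random
from collections import defaultdict
from statistics import mean


def search(num_instructions, num_fewshot_sets, score_minibatch, score_full,
           num_trials=12, full_eval_every=4, seed=0):
    """Toy random search over (instruction, few-shot set) index pairs."""
    rng = random.Random(seed)
    minibatch_scores = defaultdict(list)  # combo -> list of noisy scores
    fully_evaluated = set()
    best_combo, best_full = None, float("-inf")

    for trial in range(1, num_trials + 1):
        # Random choice here; MIPROv2 would let a Bayesian optimizer pick.
        combo = (rng.randrange(num_instructions), rng.randrange(num_fewshot_sets))
        minibatch_scores[combo].append(score_minibatch(combo, rng))

        if trial % full_eval_every == 0:
            # Fully evaluate the not-yet-evaluated combo with the best average.
            candidates = [c for c in minibatch_scores if c not in fully_evaluated]
            if not candidates:
                continue
            top = max(candidates, key=lambda c: mean(minibatch_scores[c]))
            fully_evaluated.add(top)
            full_score = score_full(top)
            if full_score > best_full:
                best_combo, best_full = top, full_score

    return best_combo, best_full


# Hypothetical scoring functions standing in for real LM evaluations.
def true_quality(combo):
    instruction, fewshot_set = combo
    return 50 + 5 * instruction + 3 * fewshot_set


def noisy_minibatch_score(combo, rng):
    # A minibatch score is a noisy estimate of the full-evaluation score.
    return true_quality(combo) + rng.uniform(-10, 10)


best_combo, best_full = search(3, 6, noisy_minibatch_score, true_quality)
print(best_combo, best_full)
```

The noise in `noisy_minibatch_score` is why the same combination can score 80.0 on one minibatch and 85.71 on another in the log, and why only the full evaluations are used to pick the returned program.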

In [41]:
optimized_program.save("optimized.json")
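
A quick way to check which labels a saved program mentions, without depending on the exact layout of the file, is to scan every string in the parsed JSON. This is only a sketch: `iter_strings` and `mentioned_labels` are hypothetical helpers, and `fake_state` is a made-up stand-in for the real file contents, whose structure may differ between DSPy versions.

```python
import json
import re


def iter_strings(node):
    """Yield every string value nested anywhere inside parsed JSON."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for value in node:
            yield from iter_strings(value)


def mentioned_labels(state, classes):
    """Return the classes that appear as whole tokens in any string of `state`."""
    text = "\n".join(iter_strings(state))
    found = []
    for name in classes:
        # Lookarounds stop `card_arrival` from matching inside a longer identifier.
        pattern = rf"(?<![A-Za-z0-9_]){re.escape(name)}(?![A-Za-z0-9_])"
        if re.search(pattern, text):
            found.append(name)
    return found


# Hypothetical stand-in for the parsed file; the real layout is an assumption.
fake_state = json.loads(
    '{"predict": {"signature": {"instructions": '
    '"Classify the inquiry as `card_arrival` or `exchange_rate`."}}}'
)
classes = ["activate_my_card", "age_limit", "card_arrival", "exchange_rate"]
covered = mentioned_labels(fake_state, classes)
print(f"{len(covered)} of {len(classes)} labels mentioned: {covered}")
```

On the real file, the parsed JSON would come from `json.load` on `optimized.json` and `classes` would be `CLASSES`. Because the scan covers every string, labels that appear only in few-shot demos are counted too; passing just the instruction string narrows the check.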

Looking at the JSON file, the optimized program has clearly overfit to the 100 samples that we selected: the instruction used in the final prompt was
- classify the inquiry into one of the following categories: `card_arrival`, `exchange_rate`, `transaction_status`, or `lost_found_card`

That names only 4 categories against the 77 classes in the dataset, and at least one of them, `transaction_status`, is not a valid label at all (the error in the optimizer log above shows it being rejected). A next step would be to sample the data so that every label is represented, or to try again on a binary classification dataset to see the actual efficacy of DSPy's prompt tuning.
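
One possible way to get a subset that covers every label is stratified sampling: group the examples by label and draw the same number from each group. The sketch below is an assumption about how this could be done, not something from the tutorial; `stratified_sample`, `per_label`, and the toy data are hypothetical names chosen for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(examples, per_label, get_label=lambda ex: ex["label"], seed=0):
    """Draw up to `per_label` examples for every label present in `examples`."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in examples:
        by_label[get_label(ex)].append(ex)

    sample = []
    for label in sorted(by_label):  # sorted for a reproducible order
        group = by_label[label]
        # Take the whole group when it has fewer than `per_label` examples.
        sample.extend(rng.sample(group, min(per_label, len(group))))

    rng.shuffle(sample)  # avoid feeding the optimizer label-ordered data
    return sample


# Toy data standing in for the real dataset: 50 queries over 5 intents.
toy = [{"text": f"query {i}", "label": f"intent_{i % 5}"} for i in range(50)]
subset = stratified_sample(toy, per_label=2)
print(len(subset), sorted({ex["label"] for ex in subset}))
```

For `dspy.Example` objects, passing `get_label=lambda ex: ex.label` should work the same way. With 77 classes, even `per_label=2` gives 154 examples, so covering every label costs more LM calls than the 100-sample run above.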