# L4: Optimize DSPy Agent with DSPy Optimizer

In [3]:
import os
from dotenv import load_dotenv

load_dotenv()

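# Fail early with a clear message: assigning None to os.environ raises a TypeError.
api_key = os.getenv("OPENAI_API_KEY")
if api_key is None:
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")
os.environ["OPENAI_API_KEY"] = api_key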

In [4]:
import mlflow

In [5]:
import os
from dotenv import load_dotenv, find_dotenv

# These helpers expect to find a .env file in the directory above the lesson.
# The format for that file is (without the comment):
# API_KEYNAME=AStringThatIsTheLongAPIKeyFromSomeService
def load_env():
    _ = load_dotenv(find_dotenv())

def get_openai_api_key():
    load_env()
    openai_api_key = os.getenv("OPENAI_API_KEY")
    return openai_api_key

def get_mlflow_tracking_uri():
    load_env()
    # Try to get from environment, fallback to default
    dlai_url = os.environ.get('DLAI_LOCAL_URL', 'http://localhost:{port}')
    return dlai_url.format(port=8080)

mlflow_tracking_uri = get_mlflow_tracking_uri()
mlflow.set_tracking_uri(mlflow_tracking_uri)

In [6]:
mlflow.set_experiment("dspy_course_4")

2025/06/14 11:10:52 INFO mlflow.tracking.fluent: Experiment with name 'dspy_course_4' does not exist. Creating a new experiment.


<Experiment: artifact_location='mlflow-artifacts:/322957703646966226', creation_time=1749874252307, experiment_id='322957703646966226', last_update_time=1749874252307, lifecycle_stage='active', name='dspy_course_4', tags={}>

In [9]:
mlflow.dspy.autolog(log_evals=True, log_compiles=True, log_traces_from_compile=True)

In [10]:
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

## Build a RAG Agent

In [23]:
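def search_wikipedia(query: str) -> list[str]:
    # Query the hosted ColBERTv2 index of Wikipedia abstracts and keep the top 3 passages.
    results = dspy.ColBERTv2(url="http://20.102.90.50:2017/wiki17_abstracts")(query, k=3)
    # Each result is a dict; the agent only needs the passage text.
    return [x["text"] for x in results]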

react = dspy.ReAct("question -> answer", tools=[search_wikipedia])

In [25]:
import json

# Load trainset
trainset = []
with open("trainset.jsonl", "r") as f:
    for line in f:
        trainset.append(dspy.Example(**json.loads(line)).with_inputs("question"))

# Load valset
valset = []
with open("valset.jsonl", "r") as f:
    for line in f:
        valset.append(dspy.Example(**json.loads(line)).with_inputs("question"))
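The two loading loops above share the same shape, so they could be folded into one helper. The sketch below is a hypothetical `read_jsonl` function (not part of DSPy) that only handles the file parsing and skips blank lines; the resulting dictionaries could then be wrapped with `dspy.Example(**row).with_inputs("question")` as in the cell above. The temporary file is there only to make the sketch runnable on its own.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Return one dict per non-empty line of a JSON Lines file."""
    rows = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                rows.append(json.loads(line))
    return rows


# Demo on a throwaway file shaped like trainset.jsonl (question/answer pairs).
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    tmp.write('{"question": "Are Smyrnium and Nymania both types of plant?", "answer": "yes"}\n')
    tmp.write("\n")
    demo_path = tmp.name

demo_rows = read_jsonl(demo_path)
os.remove(demo_path)
print(demo_rows[0]["answer"])
```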

In [26]:
# Overview of the dataset.
print(trainset[0])

Example({'question': 'Are Smyrnium and Nymania both types of plant?', 'answer': 'yes'}) (input_keys={'question'})


In [27]:
tp = dspy.MIPROv2(
    metric=dspy.evaluate.answer_exact_match,
    auto="light",
    num_threads=16
)
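MIPROv2 is scored here with `dspy.evaluate.answer_exact_match`. As a rough illustration of what an exact-match metric checks, the sketch below compares answers after SQuAD-style normalization (lowercasing, dropping punctuation and English articles, collapsing whitespace). It is a simplified stand-in written for this note, not DSPy's implementation, and `normalize` and `exact_match` are hypothetical names.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(gold, pred):
    """Return True when both answers are identical after normalization."""
    return normalize(gold) == normalize(pred)


# Quoting and casing differences are forgiven...
print(exact_match('"The Great Escape"', "The Great Escape"))
# ...but a correct, verbose answer is not.
print(exact_match("yes", "Yes, both are plants."))
```

Under a metric of this kind, a correct but wordy answer still scores zero, which may be one reason an unoptimized agent scores low on short-answer datasets.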

In [29]:
dspy.cache.load_memory_cache("./memory_cache.pkl")

In [30]:
optimized_react = tp.compile(
    react,
    trainset=trainset,
    valset=valset,
    requires_permission_to_run=False,
)

2025/06/14 11:24:17 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'a26508f5320d4386af80a5a922813397', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current dspy workflow
2025/06/14 11:24:17 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING LIGHT AUTO RUN SETTINGS:
num_trials: 20
minibatch: True
num_fewshot_candidates: 6
num_instruct_candidates: 3
valset size: 100

2025/06/14 11:24:17 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/06/14 11:24:17 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/06/14 11:24:17 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=6 sets of demonstrations...


Bootstrapping set 1/6
Bootstrapping set 2/6
Bootstrapping set 3/6


Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 2519.10it/s] 
 15%|█▌        | 15/100 [02:00<11:24,  8.05s/it]


Bootstrapped 4 full traces after 15 examples for up to 1 rounds, amounting to 15 attempts.


Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 222.12it/s]


Bootstrapping set 4/6


Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 242.49it/s]
  1%|          | 1/100 [00:09<15:56,  9.66s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.


Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 42.25it/s]


Bootstrapping set 5/6


Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 89.08it/s]
  8%|▊         | 8/100 [00:41<08:01,  5.23s/it]


Bootstrapped 4 full traces after 8 examples for up to 1 rounds, amounting to 8 attempts.


Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 242.29it/s]


Bootstrapping set 6/6


Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 261.70it/s]
  2%|▏         | 2/100 [00:07<06:20,  3.89s/it]


Bootstrapped 1 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.


Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 232.13it/s]
2025/06/14 11:27:18 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/06/14 11:27:18 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.
2025/06/14 11:27:48 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing N=3 instructions...

2025/06/14 11:28:37 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2025/06/14 11:28:37 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Given the fields `question`, produce the fields `answer`.

You are an Agent. In each episode, you will be given the fields `question` as input. And you can see your past trajectory so far.
Your goal is to use one or more of the supplied tools to collect any necessary information for producing `answer`.

To do this, you 

Average Metric: 29.00 / 100 (29.0%): 100%|██████████| 100/100 [00:59<00:00,  1.67it/s]

2025/06/14 11:29:37 INFO dspy.evaluate.evaluate: Average Metric: 29 / 100 (29.0%)
2025/06/14 11:29:37 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 29.0




🏃 View run eval_full_0 at: http://localhost:8080/#/experiments/322957703646966226/runs/fc5535abea3e486e85f46495ad8b5ab1
🧪 View experiment at: http://localhost:8080/#/experiments/322957703646966226


2025/06/14 11:29:37 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 2 / 25 - Minibatch ==


Average Metric: 9.00 / 35 (25.7%): 100%|██████████| 35/35 [00:27<00:00,  1.29it/s]

2025/06/14 11:30:05 INFO dspy.evaluate.evaluate: Average Metric: 9 / 35 (25.7%)
2025/06/14 11:30:05 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 25.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 3', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 0'].
2025/06/14 11:30:05 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.71]
2025/06/14 11:30:05 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0]
2025/06/14 11:30:05 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 29.0


2025/06/14 11:30:05 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 3 / 25 - Minibatch ==



🏃 View run eval_minibatch_0 at: http://localhost:8080/#/experiments/322957703646966226/runs/e6c5ff6acf4a47ec97b4c1417508b1fb
🧪 View experiment at: http://localhost:8080/#/experiments/322957703646966226
Average Metric: 19.00 / 35 (54.3%): 100%|██████████| 35/35 [00:19<00:00,  1.81it/s]

2025/06/14 11:30:24 INFO dspy.evaluate.evaluate: Average Metric: 19 / 35 (54.3%)



🏃 View run eval_minibatch_1 at: http://localhost:8080/#/experiments/322957703646966226/runs/45aa4ec123c6461284da8e6b125c5b76
🧪 View experiment at: http://localhost:8080/#/experiments/322957703646966226


2025/06/14 11:30:24 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 54.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 2'].
2025/06/14 11:30:24 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.71, 54.29]
2025/06/14 11:30:24 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0]
2025/06/14 11:30:24 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 29.0


2025/06/14 11:30:24 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 4 / 25 - Minibatch ==


Average Metric: 14.00 / 35 (40.0%): 100%|██████████| 35/35 [00:24<00:00,  1.41it/s]

2025/06/14 11:30:49 INFO dspy.evaluate.evaluate: Average Metric: 14 / 35 (40.0%)
2025/06/14 11:30:49 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 40.0 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 0'].
2025/06/14 11:30:49 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.71, 54.29, 40.0]
2025/06/14 11:30:49 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0]
2025/06/14 11:30:49 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 29.0


2025/06/14 11:30:49 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 5 / 25 - Minibatch ==



🏃 View run eval_minibatch_2 at: http://localhost:8080/#/experiments/322957703646966226/runs/ac92c293161444008d1d30a7e590465d
🧪 View experiment at: http://localhost:8080/#/experiments/322957703646966226
Average Metric: 18.00 / 35 (51.4%): 100%|██████████| 35/35 [00:23<00:00,  1.49it/s]

2025/06/14 11:31:13 INFO dspy.evaluate.evaluate: Average Metric: 18 / 35 (51.4%)
2025/06/14 11:31:13 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 51.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 4'].
2025/06/14 11:31:13 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.71, 54.29, 40.0, 51.43]
2025/06/14 11:31:13 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0]
2025/06/14 11:31:13 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 29.0


2025/06/14 11:31:13 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 6 / 25 - Minibatch ==



🏃 View run eval_minibatch_3 at: http://localhost:8080/#/experiments/322957703646966226/runs/b6be249838e04f0085ca7b3a30d34027
🧪 View experiment at: http://localhost:8080/#/experiments/322957703646966226
Average Metric: 18.00 / 35 (51.4%): 100%|██████████| 35/35 [00:23<00:00,  1.51it/s]

2025/06/14 11:31:36 INFO dspy.evaluate.evaluate: Average Metric: 18 / 35 (51.4%)
2025/06/14 11:31:36 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 51.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 2'].
2025/06/14 11:31:36 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.71, 54.29, 40.0, 51.43, 51.43]
2025/06/14 11:31:36 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0]
2025/06/14 11:31:36 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 29.0


2025/06/14 11:31:36 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 7 / 25 - Full Evaluation =====
2025/06/14 11:31:36 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 54.29) from minibatch trials...



🏃 View run eval_minibatch_4 at: http://localhost:8080/#/experiments/322957703646966226/runs/6ac9197451a54ea296fb5a3e74418226
🧪 View experiment at: http://localhost:8080/#/experiments/322957703646966226
Average Metric: 48.00 / 100 (48.0%): 100%|██████████| 100/100 [00:30<00:00,  3.24it/s]

2025/06/14 11:32:07 INFO dspy.evaluate.evaluate: Average Metric: 48 / 100 (48.0%)
2025/06/14 11:32:07 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 48.0
2025/06/14 11:32:07 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 48.0]
2025/06/14 11:32:07 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 48.0
2025/06/14 11:32:07 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/06/14 11:32:07 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 8 / 25 - Minibatch ==



🏃 View run eval_full_1 at: http://localhost:8080/#/experiments/322957703646966226/runs/39a5da873c8b47729001662a7ff8c01e
🧪 View experiment at: http://localhost:8080/#/experiments/322957703646966226
Average Metric: 14.00 / 35 (40.0%): 100%|██████████| 35/35 [00:22<00:00,  1.53it/s]

2025/06/14 11:32:30 INFO dspy.evaluate.evaluate: Average Metric: 14 / 35 (40.0%)
2025/06/14 11:32:30 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 40.0 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 0', 'Predictor 1: Few-Shot Set 0'].
2025/06/14 11:32:30 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.71, 54.29, 40.0, 51.43, 51.43, 40.0]
2025/06/14 11:32:30 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 48.0]
2025/06/14 11:32:30 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 48.0


2025/06/14 11:32:30 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 9 / 25 - Minibatch ==



🏃 View run eval_minibatch_5 at: http://localhost:8080/#/experiments/322957703646966226/runs/8c81a1062cf243f7a11637edbb5a690e
🧪 View experiment at: http://localhost:8080/#/experiments/322957703646966226
Average Metric: 20.00 / 35 (57.1%): 100%|██████████| 35/35 [00:23<00:00,  1.52it/s]

2025/06/14 11:32:53 INFO dspy.evaluate.evaluate: Average Metric: 20 / 35 (57.1%)
2025/06/14 11:32:53 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 57.14 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 2', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 1'].
2025/06/14 11:32:53 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.71, 54.29, 40.0, 51.43, 51.43, 40.0, 57.14]
2025/06/14 11:32:53 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 48.0]
2025/06/14 11:32:53 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 48.0


2025/06/14 11:32:53 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 10 / 25 - Minibatch ==



🏃 View run eval_minibatch_6 at: http://localhost:8080/#/experiments/322957703646966226/runs/14c0738062184ca2b484b75adf4f7499
🧪 View experiment at: http://localhost:8080/#/experiments/322957703646966226
Average Metric: 9.00 / 35 (25.7%): 100%|██████████| 35/35 [00:30<00:00,  1.17it/s]

2025/06/14 11:33:23 INFO dspy.evaluate.evaluate: Average Metric: 9 / 35 (25.7%)
2025/06/14 11:33:23 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 25.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 0', 'Predictor 1: Instruction 0', 'Predictor 1: Few-Shot Set 0'].
2025/06/14 11:33:23 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.71, 54.29, 40.0, 51.43, 51.43, 40.0, 57.14, 25.71]
2025/06/14 11:33:23 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 48.0]
2025/06/14 11:33:23 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 48.0


2025/06/14 11:33:23 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 11 / 25 - Minibatch ==



🏃 View run eval_minibatch_7 at: http://localhost:8080/#/experiments/322957703646966226/runs/f1af729334be4e249c12ade04619eaba
🧪 View experiment at: http://localhost:8080/#/experiments/322957703646966226
Average Metric: 14.00 / 35 (40.0%): 100%|██████████| 35/35 [00:20<00:00,  1.73it/s]

2025/06/14 11:33:44 INFO dspy.evaluate.evaluate: Average Metric: 14 / 35 (40.0%)
2025/06/14 11:33:44 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 40.0 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 2', 'Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 1'].
2025/06/14 11:33:44 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.71, 54.29, 40.0, 51.43, 51.43, 40.0, 57.14, 25.71, 40.0]
2025/06/14 11:33:44 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 48.0]
2025/06/14 11:33:44 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 48.0


2025/06/14 11:33:44 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 12 / 25 - Minibatch ==



🏃 View run eval_minibatch_8 at: http://localhost:8080/#/experiments/322957703646966226/runs/3930bb67eab748f5a6d04d485710661c
🧪 View experiment at: http://localhost:8080/#/experiments/322957703646966226
Average Metric: 20.00 / 35 (57.1%): 100%|██████████| 35/35 [00:24<00:00,  1.45it/s]

2025/06/14 11:34:08 INFO dspy.evaluate.evaluate: Average Metric: 20 / 35 (57.1%)
2025/06/14 11:34:08 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 57.14 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 4', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 1'].
2025/06/14 11:34:08 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.71, 54.29, 40.0, 51.43, 51.43, 40.0, 57.14, 25.71, 40.0, 57.14]
2025/06/14 11:34:08 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 48.0]
2025/06/14 11:34:08 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 48.0


2025/06/14 11:34:08 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 13 / 25 - Full Evaluation =====
2025/06/14 11:34:08 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 57.14) from minibatch trials...



🏃 View run eval_minibatch_9 at: http://localhost:8080/#/experiments/322957703646966226/runs/b4682582270b49a9908b8648527d7631
🧪 View experiment at: http://localhost:8080/#/experiments/322957703646966226
Average Metric: 50.00 / 100 (50.0%): 100%|██████████| 100/100 [00:29<00:00,  3.39it/s]

2025/06/14 11:34:37 INFO dspy.evaluate.evaluate: Average Metric: 50 / 100 (50.0%)
2025/06/14 11:34:38 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 50.0
2025/06/14 11:34:38 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 48.0, 50.0]
2025/06/14 11:34:38 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.0
2025/06/14 11:34:38 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/06/14 11:34:38 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 14 / 25 - Minibatch ==



🏃 View run eval_full_2 at: http://localhost:8080/#/experiments/322957703646966226/runs/b0c75ec81bdf4049acbe8a360b616906
🧪 View experiment at: http://localhost:8080/#/experiments/322957703646966226
Average Metric: 17.00 / 35 (48.6%): 100%|██████████| 35/35 [00:20<00:00,  1.74it/s]

2025/06/14 11:34:58 INFO dspy.evaluate.evaluate: Average Metric: 17 / 35 (48.6%)
2025/06/14 11:34:58 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 48.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 4', 'Predictor 1: Instruction 0', 'Predictor 1: Few-Shot Set 1'].
2025/06/14 11:34:58 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.71, 54.29, 40.0, 51.43, 51.43, 40.0, 57.14, 25.71, 40.0, 57.14, 48.57]
2025/06/14 11:34:58 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 48.0, 50.0]
2025/06/14 11:34:58 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.0


2025/06/14 11:34:58 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 15 / 25 - Minibatch ==



🏃 View run eval_minibatch_10 at: http://localhost:8080/#/experiments/322957703646966226/runs/ff7c287e7e1a4f4ab697b8c9388f56ac
🧪 View experiment at: http://localhost:8080/#/experiments/322957703646966226
Average Metric: 16.00 / 35 (45.7%): 100%|██████████| 35/35 [00:14<00:00,  2.44it/s]

2025/06/14 11:35:12 INFO dspy.evaluate.evaluate: Average Metric: 16 / 35 (45.7%)
2025/06/14 11:35:12 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 45.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 4', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 1'].
2025/06/14 11:35:12 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.71, 54.29, 40.0, 51.43, 51.43, 40.0, 57.14, 25.71, 40.0, 57.14, 48.57, 45.71]
2025/06/14 11:35:12 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 48.0, 50.0]
2025/06/14 11:35:12 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.0


2025/06/14 11:35:12 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 16 / 25 - Minibatch ==



🏃 View run eval_minibatch_11 at: http://localhost:8080/#/experiments/322957703646966226/runs/01fad881102a480ea62a54bcaff8e4c0
🧪 View experiment at: http://localhost:8080/#/experiments/322957703646966226
Average Metric: 19.00 / 35 (54.3%): 100%|██████████| 35/35 [00:05<00:00,  6.57it/s]

2025/06/14 11:35:18 INFO dspy.evaluate.evaluate: Average Metric: 19 / 35 (54.3%)
2025/06/14 11:35:18 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 54.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 2', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 4'].
2025/06/14 11:35:18 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.71, 54.29, 40.0, 51.43, 51.43, 40.0, 57.14, 25.71, 40.0, 57.14, 48.57, 45.71, 54.29]
2025/06/14 11:35:18 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 48.0, 50.0]
2025/06/14 11:35:18 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.0


2025/06/14 11:35:18 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 17 / 25 - Minibatch ==



🏃 View run eval_minibatch_12 at: http://localhost:8080/#/experiments/322957703646966226/runs/37f4ec5128de43a4a76453742aa9e565
🧪 View experiment at: http://localhost:8080/#/experiments/322957703646966226
Average Metric: 13.00 / 35 (37.1%): 100%|██████████| 35/35 [00:18<00:00,  1.87it/s]

2025/06/14 11:35:36 INFO dspy.evaluate.evaluate: Average Metric: 13 / 35 (37.1%)
2025/06/14 11:35:37 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 37.14 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 2', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 1'].
2025/06/14 11:35:37 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.71, 54.29, 40.0, 51.43, 51.43, 40.0, 57.14, 25.71, 40.0, 57.14, 48.57, 45.71, 54.29, 37.14]
2025/06/14 11:35:37 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 48.0, 50.0]
2025/06/14 11:35:37 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.0


2025/06/14 11:35:37 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 18 / 25 - Minibatch ==



🏃 View run eval_minibatch_13 at: http://localhost:8080/#/experiments/322957703646966226/runs/99362ef736bb42f6baeae2762de8fac0
🧪 View experiment at: http://localhost:8080/#/experiments/322957703646966226
Average Metric: 18.00 / 35 (51.4%): 100%|██████████| 35/35 [00:23<00:00,  1.50it/s]

2025/06/14 11:36:00 INFO dspy.evaluate.evaluate: Average Metric: 18 / 35 (51.4%)
2025/06/14 11:36:00 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 51.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 4', 'Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 5'].
2025/06/14 11:36:00 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.71, 54.29, 40.0, 51.43, 51.43, 40.0, 57.14, 25.71, 40.0, 57.14, 48.57, 45.71, 54.29, 37.14, 51.43]
2025/06/14 11:36:00 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 48.0, 50.0]
2025/06/14 11:36:00 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.0


2025/06/14 11:36:00 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 19 / 25 - Full Evaluation =====
2025/06/14 11:36:00 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 54.29) from minibatch trials...



🏃 View run eval_minibatch_14 at: http://localhost:8080/#/experiments/322957703646966226/runs/736f9b9d7cbc4676ac931e46553af92d
🧪 View experiment at: http://localhost:8080/#/experiments/322957703646966226
Average Metric: 51.00 / 100 (51.0%): 100%|██████████| 100/100 [00:08<00:00, 11.36it/s]

2025/06/14 11:36:09 INFO dspy.evaluate.evaluate: Average Metric: 51 / 100 (51.0%)
2025/06/14 11:36:09 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 51.0
2025/06/14 11:36:09 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 48.0, 50.0, 51.0]
2025/06/14 11:36:09 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0
2025/06/14 11:36:09 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/06/14 11:36:09 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 20 / 25 - Minibatch ==



🏃 View run eval_full_3 at: http://localhost:8080/#/experiments/322957703646966226/runs/a1bc54ff52c04493b5618cf19c65eff6
🧪 View experiment at: http://localhost:8080/#/experiments/322957703646966226
Average Metric: 21.00 / 35 (60.0%): 100%|██████████| 35/35 [00:06<00:00,  5.79it/s]

2025/06/14 11:36:15 INFO dspy.evaluate.evaluate: Average Metric: 21 / 35 (60.0%)



🏃 View run eval_minibatch_15 at: http://localhost:8080/#/experiments/322957703646966226/runs/117b08ae41404b22ad0b308355a11188
🧪 View experiment at: http://localhost:8080/#/experiments/322957703646966226


2025/06/14 11:36:15 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 60.0 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 2', 'Predictor 1: Instruction 0', 'Predictor 1: Few-Shot Set 3'].
2025/06/14 11:36:15 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.71, 54.29, 40.0, 51.43, 51.43, 40.0, 57.14, 25.71, 40.0, 57.14, 48.57, 45.71, 54.29, 37.14, 51.43, 60.0]
2025/06/14 11:36:15 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 48.0, 50.0, 51.0]
2025/06/14 11:36:15 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/06/14 11:36:15 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 21 / 25 - Minibatch ==


Average Metric: 23.00 / 35 (65.7%): 100%|██████████| 35/35 [00:03<00:00,  9.75it/s]

2025/06/14 11:36:19 INFO dspy.evaluate.evaluate: Average Metric: 23 / 35 (65.7%)
2025/06/14 11:36:19 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 65.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 2', 'Predictor 1: Instruction 0', 'Predictor 1: Few-Shot Set 3'].
2025/06/14 11:36:19 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.71, 54.29, 40.0, 51.43, 51.43, 40.0, 57.14, 25.71, 40.0, 57.14, 48.57, 45.71, 54.29, 37.14, 51.43, 60.0, 65.71]
2025/06/14 11:36:19 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 48.0, 50.0, 51.0]
2025/06/14 11:36:19 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/06/14 11:36:19 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 22 / 25 - Minibatch ==



🏃 View run eval_minibatch_16 at: http://localhost:8080/#/experiments/322957703646966226/runs/b67ea09187ae477fa7566218819638bd
🧪 View experiment at: http://localhost:8080/#/experiments/322957703646966226
Average Metric: 16.00 / 35 (45.7%): 100%|██████████| 35/35 [00:19<00:00,  1.77it/s]

2025/06/14 11:36:39 INFO dspy.evaluate.evaluate: Average Metric: 16 / 35 (45.7%)
2025/06/14 11:36:39 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 45.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 2', 'Predictor 1: Instruction 0', 'Predictor 1: Few-Shot Set 3'].
2025/06/14 11:36:39 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.71, 54.29, 40.0, 51.43, 51.43, 40.0, 57.14, 25.71, 40.0, 57.14, 48.57, 45.71, 54.29, 37.14, 51.43, 60.0, 65.71, 45.71]
2025/06/14 11:36:39 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 48.0, 50.0, 51.0]
2025/06/14 11:36:39 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/06/14 11:36:39 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 23 / 25 - Minibatch ==



🏃 View run eval_minibatch_17 at: http://localhost:8080/#/experiments/322957703646966226/runs/100f033576ae469584a414e91cf7589c
🧪 View experiment at: http://localhost:8080/#/experiments/322957703646966226
Average Metric: 18.00 / 35 (51.4%): 100%|██████████| 35/35 [00:05<00:00,  6.35it/s]

2025/06/14 11:36:44 INFO dspy.evaluate.evaluate: Average Metric: 18 / 35 (51.4%)



🏃 View run eval_minibatch_18 at: http://localhost:8080/#/experiments/322957703646966226/runs/ac1cc5836a0c4462848c966c02861a3a
🧪 View experiment at: http://localhost:8080/#/experiments/322957703646966226


2025/06/14 11:36:44 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 51.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 2', 'Predictor 1: Instruction 0', 'Predictor 1: Few-Shot Set 5'].
2025/06/14 11:36:44 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.71, 54.29, 40.0, 51.43, 51.43, 40.0, 57.14, 25.71, 40.0, 57.14, 48.57, 45.71, 54.29, 37.14, 51.43, 60.0, 65.71, 45.71, 51.43]
2025/06/14 11:36:44 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 48.0, 50.0, 51.0]
2025/06/14 11:36:44 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/06/14 11:36:44 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 24 / 25 - Minibatch ==


Average Metric: 17.00 / 35 (48.6%): 100%|██████████| 35/35 [00:21<00:00,  1.60it/s]

2025/06/14 11:37:06 INFO dspy.evaluate.evaluate: Average Metric: 17 / 35 (48.6%)
2025/06/14 11:37:06 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 48.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 1', 'Predictor 1: Instruction 0', 'Predictor 1: Few-Shot Set 3'].
2025/06/14 11:37:06 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.71, 54.29, 40.0, 51.43, 51.43, 40.0, 57.14, 25.71, 40.0, 57.14, 48.57, 45.71, 54.29, 37.14, 51.43, 60.0, 65.71, 45.71, 51.43, 48.57]
2025/06/14 11:37:06 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 48.0, 50.0, 51.0]
2025/06/14 11:37:06 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/06/14 11:37:06 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 25 / 25 - Full Evaluation =====
2025/06/14 11:37:06 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 62.855) from minibatch trials...


🏃 View run eval_minibatch_19 at: http://localhost:8080/#/experiments/322957703646966226/runs/50a39ca3340d4b519bdd548dbfaf26f0
🧪 View experiment at: http://localhost:8080/#/experiments/322957703646966226
Average Metric: 56.00 / 100 (56.0%): 100%|██████████| 100/100 [00:06<00:00, 14.43it/s]

2025/06/14 11:37:13 INFO dspy.evaluate.evaluate: Average Metric: 56 / 100 (56.0%)
2025/06/14 11:37:13 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 56.0
2025/06/14 11:37:13 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 48.0, 50.0, 51.0, 56.0]
2025/06/14 11:37:13 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 56.0
2025/06/14 11:37:13 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/06/14 11:37:13 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 56.0!



🏃 View run eval_full_4 at: http://localhost:8080/#/experiments/322957703646966226/runs/7f6216681b4a41a3a76b7f5ea6930fa4
🧪 View experiment at: http://localhost:8080/#/experiments/322957703646966226


Downloading artifacts: 100%|██████████| 1/1 [00:00<00:00, 235.12it/s]

🏃 View run resilient-dove-503 at: http://localhost:8080/#/experiments/322957703646966226/runs/a26508f5320d4386af80a5a922813397
🧪 View experiment at: http://localhost:8080/#/experiments/322957703646966226
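
The log line for trial 25 reports an average score of 62.855 for the program chosen for full evaluation. That figure is consistent with averaging the minibatch scores of trials 20 and 21, which used the same parameter combination (60.0 and 65.71). The sketch below reproduces that bookkeeping in plain Python; it illustrates the idea and is not MIPROv2's code, and the tuple keys are a shorthand chosen here for (instruction, few-shot set) per predictor.

```python
from collections import defaultdict

# (config, minibatch score) pairs copied from trials 20-22 of the log above.
# Each config is ((pred0_instruction, pred0_fewshot), (pred1_instruction, pred1_fewshot)).
trials = [
    (((1, 2), (0, 3)), 60.0),   # trial 20
    (((1, 2), (0, 3)), 65.71),  # trial 21
    (((0, 2), (0, 3)), 45.71),  # trial 22
]

# Group the minibatch scores by configuration.
scores_by_config = defaultdict(list)
for config, score in trials:
    scores_by_config[config].append(score)

# Average the minibatch scores for each distinct configuration.
averages = {cfg: sum(s) / len(s) for cfg, s in scores_by_config.items()}

# The configuration with the highest average is the candidate for a full evaluation.
best_config = max(averages, key=averages.get)
print(best_config, round(averages[best_config], 3))
```

Averaging repeated minibatch runs of the same configuration presumably dampens the noise of a 35-example sample before the more expensive 100-example full evaluation is spent on it.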





In [31]:
optimized_react.react.signature

StringSignature(question, trajectory -> next_thought, next_tool_name, next_tool_args
    instructions='Given the `question` about a specific topic, your task is to iteratively determine the necessary information needed to formulate a comprehensive `answer`. You will maintain a `trajectory` of your reasoning process and actions taken. For each step, produce the following outputs: \n\n1. `next_thought`: Reflect on the current situation and reason about the next steps you need to take to gather information.\n2. `next_tool_name`: Decide which tool to use next; you can choose to either search Wikipedia for relevant information or finish the task if you believe you have enough data to provide an answer.\n3. `next_tool_args`: Provide the arguments for the selected tool in JSON format, ensuring that the query for the Wikipedia search is clearly defined.\n\nUse the information from your `trajectory` to make informed decisions at each step, ensuring that you gather all necessary information befo

In [32]:
optimized_react.react.demos

[Example({'augmented': True, 'question': 'That Darn Cat! and Never a Dull Moment were both produced by what studio?', 'trajectory': '', 'next_thought': 'I need to find out which studio produced both "That Darn Cat!" and "Never a Dull Moment." This information is likely available on Wikipedia, so I will search for it there.', 'next_tool_name': 'search_wikipedia', 'next_tool_args': {'query': 'That Darn Cat! and Never a Dull Moment production studio'}}) (input_keys=None),
 Example({'augmented': True, 'question': 'Which of Zhucheng and Pingquan has more people?', 'trajectory': '[[ ## thought_0 ## ]]\nI need to find the population data for both Zhucheng and Pingquan to determine which one has more people. I will search for the population statistics of both locations.\n\n[[ ## tool_name_0 ## ]]\nsearch_wikipedia\n\n[[ ## tool_args_0 ## ]]\n{"query": "Zhucheng population"}\n\n[[ ## observation_0 ## ]]\n[1] «Zhucheng | Zhucheng () is a county-level city in the southeast of Shandong province, P
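The `trajectory` value in the second demo is a single string in which each step is introduced by a `[[ ## name ## ]]` marker. A minimal sketch of splitting such a string into named fields is shown here for reading the demos more easily. The helper `split_trajectory` and the shortened `sample` string are hypothetical and written for illustration; they are not part of DSPy.

```python
import re

# Matches markers such as "[[ ## thought_0 ## ]]" and captures the field name.
FIELD_MARKER = re.compile(r"\[\[ ## (\w+) ## \]\]")


def split_trajectory(trajectory: str) -> dict[str, str]:
    """Split a marker-delimited trajectory string into {field_name: text}.

    Hypothetical helper for inspection only; not a DSPy API.
    """
    # With one capture group, re.split returns
    # [text_before_first_marker, name_1, body_1, name_2, body_2, ...].
    parts = FIELD_MARKER.split(trajectory)
    names = parts[1::2]
    bodies = parts[2::2]
    return {name: body.strip() for name, body in zip(names, bodies)}


# Shortened sample modeled on the demo above (assumption: the same marker
# layout holds for every step in the trajectory).
sample = (
    "[[ ## thought_0 ## ]]\n"
    "I need to find the population data for both Zhucheng and Pingquan.\n\n"
    "[[ ## tool_name_0 ## ]]\n"
    "search_wikipedia\n\n"
    "[[ ## tool_args_0 ## ]]\n"
    '{"query": "Zhucheng population"}\n'
)

fields = split_trajectory(sample)
for name, text in fields.items():
    print(f"{name}: {text}")
```

An empty trajectory, as in the first demo, yields an empty dictionary.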

In [33]:
evaluator = dspy.Evaluate(
    metric=dspy.evaluate.answer_exact_match,
    devset=valset,
    display_table=True,
    display_progress=True,
    num_threads=24,
)
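The evaluator scores each prediction with an exact-match metric, so a correct but verbose answer counts as a miss. The sketch below is a simplified stand-in that shows the idea: normalize both strings, then compare them for equality. It is not the implementation of `dspy.evaluate.answer_exact_match`; the normalization steps (lowercasing, dropping punctuation and English articles) are an assumption modeled on common question-answering evaluation practice, and the function names are hypothetical.

```python
import re
import string
import unicodedata


def normalize_text(text: str) -> str:
    """Lowercase, strip punctuation and articles, and collapse whitespace.

    Assumed normalization for illustration; the library's rules may differ.
    """
    text = unicodedata.normalize("NFD", text).lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """Return True only when the normalized strings are identical."""
    return normalize_text(prediction) == normalize_text(gold)


# (prediction, gold answer) pairs taken from the evaluation tables in this lesson.
rows = [
    ("The Great Escape", '"The Great Escape"'),
    ("No", "no"),
    ("Robert Kardashian", "their family reality television series"),
    ('"Yahweh comforts"', "comforter"),
]

results = [exact_match(pred, gold) for pred, gold in rows]
score = 100.0 * sum(results) / len(rows)
print(results, score)
```

Under this metric a short answer such as `No` matches the gold label `no`, while a full sentence that begins with "No, both ..." does not, which is one reason a program that produces terse answers can score higher.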

In [34]:
original_score = evaluator(react)
print(f"Original score: {original_score}")

Average Metric: 29.00 / 100 (29.0%): 100%|██████████| 100/100 [00:02<00:00, 48.34it/s]

2025/06/14 11:37:46 INFO dspy.evaluate.evaluate: Average Metric: 29 / 100 (29.0%)





index,question,example_answer,trajectory,reasoning,pred_answer,answer_exact_match
0,"What movie did ""the king of cool"" play in with Bud Ekins as his st...","""The Great Escape""","{'thought_0': 'I need to find out which movie ""the king of cool"" s...","The question refers to a movie featuring ""the king of cool,"" which...",The Great Escape,✔️ [True]
1,whos family had their own reality tv show. Robert Kardashian or Ma...,their family reality television series,"{'thought_0': 'I need to determine which individual, Robert Kardas...",Robert Kardashian's family is well-known for their reality TV show...,Robert Kardashian's family had their own reality TV show.,
2,Which star in Shadows in Paradise is a Russian ballerina?,Sofya Skya,"{'thought_0': 'I need to find out which star in the film ""Shadows ...",After conducting multiple searches regarding the cast of the film ...,"There is no information available about a Russian ballerina in ""Sh...",
3,What was the meaning of the name of the man who appointed Amashsai?,comforter,{'thought_0': 'I need to gather information about the man who appo...,"The man who appointed Amashsai was Nehemiah, a significant figure ...","The meaning of the name of the man who appointed Amashsai, Nehemia...",
4,"In addition to the Austrian passport, what is needed to gain acces...",national identity card,{'thought_0': 'I need to find out what additional requirements or ...,To gain access to 173 countries and territories with an Austrian p...,"In addition to the Austrian passport, Austrian citizens may need t...",
...,...,...,...,...,...,...
95,"What date did the American actress and singer-songwriter, known fo...","April 19, 1994",{'thought_0': 'I need to find out the name of the American actress...,The American actress and singer-songwriter known for her role as P...,2007,
96,What animated creatures were the title characters of the film whic...,seals,{'thought_0': 'The question refers to an animated film based on a ...,"The animated film ""Gnomeo & Juliet"" is based on William Shakespear...",The title characters are garden gnomes named Gnomeo and Juliet.,
97,The 1925 Saint Mary's Gaels football team represented what private...,Saint Mary's College of California,"{'thought_0': ""I need to find out which private, coeducational col...",The 1925 Saint Mary's Gaels football team represented Saint Mary's...,Saint Mary's College of California,✔️ [True]
98,Were Dorothy Arzner and Richard Wallace both French film directors?,no,{'thought_0': 'I need to verify the nationalities and contribution...,"Dorothy Arzner was an American film director, known for her signif...","No, both Dorothy Arzner and Richard Wallace were American film dir...",


🏃 View run eval at: http://localhost:8080/#/experiments/322957703646966226/runs/2ec04c4b98814cfe93bbc85ce783e06f
🧪 View experiment at: http://localhost:8080/#/experiments/322957703646966226
Original score: 29.0


In [35]:
optimized_score = evaluator(optimized_react)
print(f"Optimized score: {optimized_score}")

Average Metric: 56.00 / 100 (56.0%): 100%|██████████| 100/100 [00:03<00:00, 33.23it/s]

2025/06/14 11:38:00 INFO dspy.evaluate.evaluate: Average Metric: 56 / 100 (56.0%)





index,question,example_answer,trajectory,reasoning,pred_answer,answer_exact_match
0,"What movie did ""the king of cool"" play in with Bud Ekins as his st...","""The Great Escape""","{'thought_0': 'I need to find out which movie ""the king of cool"" s...","Steve McQueen, often referred to as ""the king of cool,"" starred in...",The Great Escape,✔️ [True]
1,whos family had their own reality tv show. Robert Kardashian or Ma...,their family reality television series,{'thought_0': 'I need to find out which family had their own reali...,I found that Robert Kardashian's family had their own reality TV s...,Robert Kardashian,
2,Which star in Shadows in Paradise is a Russian ballerina?,Sofya Skya,"{'thought_0': 'I need to find out which star in ""Shadows in Paradi...",The searches conducted did not yield specific information about a ...,"No specific information found regarding a Russian ballerina in ""Sh...",
3,What was the meaning of the name of the man who appointed Amashsai?,comforter,"{'thought_0': ""I need to find out who appointed Amashsai and the m...","Amashsai was appointed by Nehemiah. The name ""Nehemiah"" means ""Yah...","""Yahweh comforts""",
4,"In addition to the Austrian passport, what is needed to gain acces...",national identity card,{'thought_0': 'I need to find out what additional requirements are...,The search results indicate that Austrian citizens have visa-free ...,A valid passport,
...,...,...,...,...,...,...
95,"What date did the American actress and singer-songwriter, known fo...","April 19, 1994",{'thought_0': 'I need to find out the release date of the first al...,"Katey Sagal, known for her role as Peggy Bundy, released her first...","April 19, 1994",✔️ [True]
96,What animated creatures were the title characters of the film whic...,seals,"{'thought_0': ""I need to identify the animated creatures that were...","The animated film ""Gnomeo & Juliet"" is based on William Shakespear...",Gnomeo and Juliet,
97,The 1925 Saint Mary's Gaels football team represented what private...,Saint Mary's College of California,"{'thought_0': ""I need to find out which private, coeducational col...",The 1925 Saint Mary's Gaels football team represented Saint Mary's...,Saint Mary's College of California,✔️ [True]
98,Were Dorothy Arzner and Richard Wallace both French film directors?,no,{'thought_0': 'I need to determine the nationalities of both Dorot...,I found that Dorothy Arzner was an American film director. Additio...,No,✔️ [True]


🏃 View run eval at: http://localhost:8080/#/experiments/322957703646966226/runs/bc98b2736965487280cb20d21dc227a3
🧪 View experiment at: http://localhost:8080/#/experiments/322957703646966226
Optimized score: 56.0
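
To put the two scores side by side, the arithmetic below computes the absolute and relative improvement. The score values are copied from the printed outputs above; the devset size of 100 comes from the `100/100` progress bars. Variable names other than `original_score` and `optimized_score` are introduced here for illustration.

```python
# Scores copied from the evaluation outputs above.
original_score = 29.0
optimized_score = 56.0
devset_size = 100  # from the "100/100" progress bars

# Absolute gain in percentage points and gain relative to the baseline.
absolute_gain = optimized_score - original_score
relative_gain = absolute_gain / original_score

# Number of additional examples answered correctly on the devset.
extra_correct = round(absolute_gain / 100 * devset_size)

print(f"Absolute gain: {absolute_gain:.1f} points")
print(f"Relative gain: {relative_gain:.1%}")
print(f"Additional correct answers: {extra_correct}")
```

On this devset the optimized program answers 27 more questions correctly, which is roughly a 93% improvement over the unoptimized baseline.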
