# Exercise 1

Use DSPy (or a simplified stand-in if DSPy isn’t available) to optimize a multi-step QA pipeline. For example, the pipeline could (1) retrieve relevant text from a small corpus, then (2) ask an LLM to answer the question given the retrieved text. Define the metric as answer accuracy. Let the system tune both the retrieval prompt and the answer prompt. Observe what changes it makes (e.g., does it add “Let’s think step by step” automatically?). Report the before-and-after performance.

## Solution

In [None]:
import os
import random
import re

import numpy as np
import dspy
from openai import OpenAI

# Cheaper embeddings are fine for synthetic corpora; switch to -large if you want max recall.
EMBED_MODEL = "text-embedding-3-small"
# DSPy v3.x uses LiteLLM-style names: "provider/model"
LM_MODEL = "openai/gpt-4.1-nano"

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    raise RuntimeError(
        "Missing OPENAI_API_KEY. In a notebook, set it with `%env OPENAI_API_KEY=...` "
        "or export it in your shell before starting Jupyter."
    )

client = OpenAI(api_key=OPENAI_API_KEY)

# DSPy v3.x
lm = dspy.LM(model=LM_MODEL, max_tokens=128)

# Configure DSPy to use the LLM. Calling dspy.configure directly avoids the
# hasattr guard, which would silently skip configuration if the attribute were missing.
dspy.configure(lm=lm)

random.seed(42)


In [34]:
TOP_K = 3
NUM_DOCS = 60
QA_PER_DOC = 2
TRAIN_FRAC = 0.8

_rng = random.Random(42)

_adjs = ["Aurora","Nimbus","Orion","Kestrel","Zephyr","Raven","Maple","Helix","Osprey","Forge","Slate","Cedar","Lumen","Delta","Vesta","Atlas","Nova","Redstone","Aster","Northbridge"]
_nouns = ["Project","Protocol","Battery","Clinic","Drone","Study","Sensor","API","Route","Compiler","Festival","Satellite","Plant","Library"]
_people = ["Mara Ortiz","Jun Park","Amina Khan","Elena Petrov","Sam Rivera","Noah Chen","Ivy Patel","Luca Rossi","Fatima Ali","Owen Brooks","Hana Suzuki","Diego Silva"]
_cities = ["Portland","Austin","Berlin","Toronto","Lisbon","Oslo","Seoul","Kyoto","Nairobi","Lima","Dublin","Prague"]


def make_name(kind):
    a = _rng.choice(_adjs)
    b = _rng.choice(_adjs)
    # encourage shared tokens to make lexical matching harder
    if _rng.random() < 0.35:
        b = a
    return f"{a} {kind} {b}" if kind in {"Project","Study","Route"} else f"{a} {b} {kind}"


def generate_doc(doc_id):
    kind = _rng.choice(_nouns)

    if kind == "Project":
        name = make_name("Project")
        lead = _rng.choice(_people)
        budget = round(_rng.uniform(1.5, 9.5), 1)
        hq = _rng.choice(_cities)
        year = _rng.randint(2017, 2024)
        text = (
            f"{name}'s lead engineer is {lead}. "
            f"The project budget was {budget} million dollars. "
            f"The project started in {year} and is headquartered in {hq}."
        )
        qas = [
            (f"Who is the lead engineer for {name}?", lead),
            (f"What was the budget for {name}?", f"{budget} million dollars"),
            (f"Where is {name} headquartered?", hq),
        ]

    elif kind == "Battery":
        name = make_name("Battery")
        cap = round(_rng.uniform(2.0, 9.0), 1)
        mins = _rng.randint(12, 38)
        pct = _rng.choice([70, 75, 80, 85])
        text = (
            f"The {name} has a capacity of {cap} kWh and charges to {pct} percent in {mins} minutes."
        )
        qas = [
            (f"What is the capacity of the {name}?", f"{cap} kWh"),
            (f"How long does the {name} take to reach {pct} percent?", f"{mins} minutes"),
            (f"To what percent does the {name} charge in {mins} minutes?", f"{pct} percent"),
        ]

    elif kind == "Clinic":
        name = make_name("Clinic")
        system = _rng.choice(["Atlas","Aster","Nova","Vesta","Redstone"])
        city = _rng.choice(_cities)
        year = _rng.randint(2016, 2024)
        text = f"The {name} runs on the {system} scheduling system. The clinic opened in {city} in {year}."
        qas = [
            (f"Which scheduling system does the {name} use?", system),
            (f"In which city did the {name} open?", city),
            (f"What year did the {name} open?", str(year)),
        ]

    elif kind == "Drone":
        name = make_name("Drone")
        speed = _rng.choice([120, 130, 140, 150, 160])
        end = _rng.choice([42, 50, 55, 60, 68])
        text = f"The {name}'s top speed is {speed} kilometers per hour. Its endurance is {end} minutes."
        qas = [
            (f"What is the top speed of the {name}?", f"{speed} kilometers per hour"),
            (f"What is the endurance of the {name}?", f"{end} minutes"),
            (f"How long is the endurance of the {name}?", f"{end} minutes"),
        ]

    elif kind == "Protocol":
        name = make_name("Protocol")
        key = _rng.choice([128, 192, 256, 384])
        year = _rng.randint(2015, 2023)
        text = f"The {name} encrypts data using a {key}-bit key. It was ratified in {year}."
        qas = [
            (f"What key size does the {name} use for encryption?", f"{key}-bit"),
            (f"In what year was the {name} ratified?", str(year)),
            (f"Which year was the {name} ratified?", str(year)),
        ]

    elif kind == "Study":
        name = make_name("Study")
        sessions = _rng.choice([8, 10, 12, 14, 16])
        improve = _rng.choice([12, 15, 18, 21, 24])
        text = (
            f"In the {name}, participants completed {sessions} sessions. "
            f"The primary outcome improved by {improve} percent."
        )
        qas = [
            (f"How many sessions were completed in the {name}?", str(sessions)),
            (f"By what percent did the primary outcome improve in the {name}?", f"{improve} percent"),
            (f"What was the percent improvement in the {name}?", f"{improve} percent"),
        ]

    elif kind == "Sensor":
        name = make_name("Sensor")
        material = _rng.choice(["sapphire","quartz","ceramic","glass"])
        diam = _rng.choice([7, 8, 9, 10, 11])
        text = f"The {name} uses a {material} lens. The lens diameter is {diam} millimeters."
        qas = [
            (f"What material is the {name}'s lens made of?", material),
            (f"What is the diameter of the {name}'s lens?", f"{diam} millimeters"),
            (f"How wide is the {name}'s lens diameter?", f"{diam} millimeters"),
        ]

    elif kind == "API":
        name = make_name("API")
        limit = _rng.choice([60, 90, 120, 150, 200])
        outs = _rng.choice([
            "JSON and CSV",
            "JSON and XML",
            "CSV and Parquet",
            "JSON and YAML",
        ])
        text = f"The {name} has a default rate limit of {limit} requests per minute. It supports {outs} outputs."
        qas = [
            (f"What is the default rate limit of the {name}?", f"{limit} requests per minute"),
            (f"Which outputs does the {name} support?", outs),
            (f"What outputs does the {name} support?", outs),
        ]

    elif kind == "Route":
        name = make_name("Route")
        km = _rng.choice([420, 480, 540, 610, 690])
        day = _rng.choice(["Mondays","Tuesdays","Wednesdays","Thursdays","Fridays"])
        text = f"Cargo {name} covers {km} kilometers and departs on {day}."
        qas = [
            (f"How many kilometers does {name} cover?", f"{km} kilometers"),
            (f"On what day does {name} depart?", day),
            (f"Which day does {name} depart?", day),
        ]

    elif kind == "Compiler":
        name = make_name("Compiler")
        vm = _rng.choice(["Vesta VM","Nova VM","Atlas VM","Redstone VM"])
        ver = f"{_rng.randint(1,4)}.{_rng.randint(0,9)}"
        text = f"The {name} targets the {vm}. The latest release is version {ver}."
        qas = [
            (f"Which VM does the {name} target?", vm),
            (f"What is the latest release version of the {name}?", ver),
            (f"What version is the latest release of the {name}?", ver),
        ]

    elif kind == "Festival":
        name = make_name("Festival")
        days = _rng.choice([2, 3, 4, 5])
        month = _rng.choice(["June","July","August","September"])
        date = _rng.randint(10, 24)
        text = f"The {name} lasts {days} days and begins on {month} {date}."
        qas = [
            (f"How long does the {name} last?", f"{days} days"),
            (f"On what date does the {name} begin?", f"{month} {date}"),
            (f"When does the {name} begin?", f"{month} {date}"),
        ]

    elif kind == "Satellite":
        name = make_name("Satellite")
        alt = _rng.choice([520, 620, 710, 840])
        freq = _rng.choice(["7.6 GHz","8.2 GHz","9.1 GHz","10.4 GHz"])
        text = f"The {name} orbits at {alt} kilometers. Its downlink frequency is {freq}."
        qas = [
            (f"At what altitude does the {name} orbit?", f"{alt} kilometers"),
            (f"What is the downlink frequency of the {name}?", freq),
            (f"Which frequency is the downlink of the {name}?", freq),
        ]

    elif kind == "Plant":
        name = make_name("Plant")
        rate = _rng.choice([45, 55, 65, 75, 85])
        mold = _rng.choice(["cobalt alloy","titanium alloy","steel","ceramic"])
        text = f"The {name} produces {rate} units per hour. It uses {mold} molds."
        qas = [
            (f"How many units per hour does the {name} produce?", f"{rate} units per hour"),
            (f"What type of molds does the {name} use?", mold),
            (f"Which molds does the {name} use?", mold),
        ]

    else:  # Library
        name = make_name("Library")
        py = _rng.choice(["Python 3.9","Python 3.10","Python 3.11"])
        ser = _rng.choice(["Nova","Atlas","Aster","Redstone"]) + " serializer"
        text = f"The {name} requires {py}. It introduces the {ser}."
        qas = [
            (f"Which Python version does the {name} require?", py),
            (f"What serializer does the {name} introduce?", ser.split(' ',1)[0] if ' ' in ser else ser),
            (f"What does the {name} introduce?", ser),
        ]

    # pick QA_PER_DOC questions from this doc
    _rng.shuffle(qas)
    qas = qas[:QA_PER_DOC]

    return {"id": doc_id, "text": text}, [
        {"question": q, "answer": a, "doc_id": doc_id}
        for (q,a) in qas
    ]


# Build corpus + QAs
_docs = []
_qa = []
for i in range(NUM_DOCS):
    d, qas = generate_doc(i)
    _docs.append(d)
    _qa.extend(qas)

docs = _docs
qa_pairs = _qa

_rng.shuffle(qa_pairs)
cut = int(len(qa_pairs) * TRAIN_FRAC)
train_pairs = qa_pairs[:cut]
dev_pairs = qa_pairs[cut:]

trainset = [
    dspy.Example(question=p["question"], answer=p["answer"], doc_id=p["doc_id"]).with_inputs("question")
    for p in train_pairs
]

devset = [
    dspy.Example(question=p["question"], answer=p["answer"], doc_id=p["doc_id"]).with_inputs("question")
    for p in dev_pairs
]

print(f"docs={len(docs)} qa_pairs={len(qa_pairs)} train={len(trainset)} dev={len(devset)} TOP_K={TOP_K}")


docs=60 qa_pairs=120 train=96 dev=24 TOP_K=3


In [35]:
# Embeddings-based retrieval (top-k)

def embed_texts(texts, model=EMBED_MODEL, batch_size=64):
    # Batch to avoid provider limits when corpus grows.
    embs = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start : start + batch_size]
        resp = client.embeddings.create(model=model, input=batch)
        data = sorted(resp.data, key=lambda x: x.index)
        embs.extend([d.embedding for d in data])
    return np.array(embs)

_doc_texts = [d["text"] for d in docs]
_doc_embeddings = embed_texts(_doc_texts)
_doc_norms = np.linalg.norm(_doc_embeddings, axis=1)

_query_cache = {}

def embed_query(text):
    if text not in _query_cache:
        _query_cache[text] = embed_texts([text])[0]
    return _query_cache[text]

def retrieve(query, k=TOP_K):
    q_emb = embed_query(query)
    denom = _doc_norms * (np.linalg.norm(q_emb) + 1e-9)
    sims = (_doc_embeddings @ q_emb) / denom
    topk = np.argsort(sims)[-k:][::-1]
    context = "\n\n".join([f"[{i}] {docs[i]['text']}" for i in topk])
    return context, topk.tolist()
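
The cosine-similarity ranking inside `retrieve` can be checked offline, without any embedding API calls. The sketch below is only an illustration: `top_k_cosine` and the hand-made 3-dimensional "embeddings" are hypothetical stand-ins for `_doc_embeddings` and the query vector, not part of the notebook.

```python
import numpy as np

def top_k_cosine(doc_embeddings, query_embedding, k=3):
    """Return (indices of the k most similar rows, best first; all similarities)."""
    doc_norms = np.linalg.norm(doc_embeddings, axis=1)
    # Same formula as retrieve(): dot product divided by the product of norms.
    denom = doc_norms * (np.linalg.norm(query_embedding) + 1e-9)
    sims = (doc_embeddings @ query_embedding) / denom
    # argsort is ascending, so take the last k entries and reverse them.
    top = np.argsort(sims)[-k:][::-1]
    return top.tolist(), sims

# Toy vectors (assumed for illustration): doc 0 points almost exactly
# where the query points, doc 1 is partly aligned, docs 2 and 3 are not.
toy_docs = np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
toy_query = np.array([1.0, 0.1, 0.0])

top_ids, sims = top_k_cosine(toy_docs, toy_query, k=2)
print(top_ids)  # [0, 1]
```

Because the ranking depends only on the angle between vectors, scaling a document embedding by a constant would not change its position in the result.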


In [36]:
# DSPy module: query rewrite -> retrieve -> answer

def normalize_text(text):
    text = text.lower().strip()
    text = re.sub(r"[^a-z0-9 ]+", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

def exact_match(pred, gold):
    return normalize_text(pred) == normalize_text(gold)

class QueryRewrite(dspy.Signature):
    # Rewrite a question into a search-friendly query.
    question = dspy.InputField()
    query = dspy.OutputField(desc="concise search query with key entities")

class AnswerQuestion(dspy.Signature):
    # Answer using the provided context only.
    context = dspy.InputField(desc="retrieved passages")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="short exact answer copied from context")

class QAWithRewrite(dspy.Module):
    def __init__(self, k=TOP_K):
        super().__init__()
        self.k = k
        self.rewrite = dspy.Predict(QueryRewrite)
        self.answer = dspy.Predict(AnswerQuestion)

    def forward(self, question):
        rewritten = self.rewrite(question=question).query
        context, ids = retrieve(rewritten, k=self.k)
        pred = self.answer(context=context, question=question)
        return dspy.Prediction(
            answer=pred.answer,
            rewritten_query=rewritten,
            context_ids=ids,
        )


def evaluate(module, dataset, desc="Baseline eval"):
    """Evaluate with a progress bar (uses tqdm if installed)."""
    try:
        from tqdm.auto import tqdm  # type: ignore
        iterator = tqdm(dataset, total=len(dataset), desc=desc)
    except Exception:
        iterator = dataset

    correct = 0
    retrieval_hits = 0
    for ex in iterator:
        pred = module(question=ex.question)
        if exact_match(pred.answer, ex.answer):
            correct += 1
        if ex.doc_id in getattr(pred, "context_ids", []):
            retrieval_hits += 1
    total = len(dataset)
    return {
        "accuracy": correct / total,
        "retrieval_hit_rate": retrieval_hits / total,
    }


In [37]:
# Baseline evaluation
baseline = QAWithRewrite(k=TOP_K)
baseline_metrics = evaluate(baseline, devset)
baseline_metrics



{'accuracy': 0.6666666666666666, 'retrieval_hit_rate': 1.0}

In [38]:
# DSPy optimization (query-rewrite + answer prompt)
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# DSPy passes (example, prediction, trace) to metrics.
def combined_metric(example, pred, trace=None):
    if not pred or not hasattr(pred, "answer"):
        return 0
    answer_ok = exact_match(pred.answer, example.answer)
    context_ok = example.doc_id in getattr(pred, "context_ids", [])
    return 1 if (answer_ok and context_ok) else 0

teleprompter = BootstrapFewShotWithRandomSearch(
    metric=combined_metric,
    max_bootstrapped_demos=3,
    max_labeled_demos=3,
    num_candidate_programs=3,
)

optimized = teleprompter.compile(baseline, trainset=trainset)
optimized_metrics = evaluate(optimized, devset)
optimized_metrics


Going to sample between 1 and 3 traces per predictor.
Will attempt to bootstrap 3 candidate sets.
Average Metric: 72.00 / 96 (75.0%): 100%|██████████| 96/96 [00:19<00:00,  4.87it/s]

2026/01/08 11:44:50 INFO dspy.evaluate.evaluate: Average Metric: 72 / 96 (75.0%)



New best score: 75.0 for seed -3
Scores so far: [75.0]
Best score so far: 75.0
Average Metric: 76.00 / 96 (79.2%): 100%|██████████| 96/96 [00:11<00:00,  8.51it/s]

2026/01/08 11:45:01 INFO dspy.evaluate.evaluate: Average Metric: 76 / 96 (79.2%)



New best score: 79.17 for seed -2
Scores so far: [75.0, 79.17]
Best score so far: 79.17


  4%|▍         | 4/96 [00:00<00:13,  7.02it/s]


Bootstrapped 3 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Average Metric: 76.00 / 96 (79.2%): 100%|██████████| 96/96 [00:21<00:00,  4.43it/s]

2026/01/08 11:45:24 INFO dspy.evaluate.evaluate: Average Metric: 76 / 96 (79.2%)



Scores so far: [75.0, 79.17, 79.17]
Best score so far: 79.17


  2%|▏         | 2/96 [00:02<01:37,  1.04s/it]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Average Metric: 72.00 / 96 (75.0%): 100%|██████████| 96/96 [00:19<00:00,  4.97it/s]

2026/01/08 11:45:45 INFO dspy.evaluate.evaluate: Average Metric: 72 / 96 (75.0%)



Scores so far: [75.0, 79.17, 79.17, 75.0]
Best score so far: 79.17


  1%|          | 1/96 [00:00<01:17,  1.23it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Average Metric: 77.00 / 96 (80.2%): 100%|██████████| 96/96 [00:17<00:00,  5.46it/s]


2026/01/08 11:46:04 INFO dspy.evaluate.evaluate: Average Metric: 77 / 96 (80.2%)


New best score: 80.21 for seed 1
Scores so far: [75.0, 79.17, 79.17, 75.0, 80.21]
Best score so far: 80.21


  1%|          | 1/96 [00:00<01:17,  1.22it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Average Metric: 76.00 / 96 (79.2%): 100%|██████████| 96/96 [00:19<00:00,  4.85it/s]

2026/01/08 11:46:25 INFO dspy.evaluate.evaluate: Average Metric: 76 / 96 (79.2%)



Scores so far: [75.0, 79.17, 79.17, 75.0, 80.21, 79.17]
Best score so far: 80.21
6 candidate programs found.


Baseline eval: 100%|██████████| 24/24 [00:39<00:00,  1.66s/it]


{'accuracy': 0.875, 'retrieval_hit_rate': 1.0}

In [44]:
# What improved? (baseline vs optimized on dev)

print("Baseline metrics:", baseline_metrics)
print("Optimized metrics:", optimized_metrics)

def show_case(i, ex, base_pred, opt_pred):
    print("\n" + "="*80)
    print(f"[dev #{i}] Q:", ex.question)
    print("Gold:", ex.answer, "| gold doc:", ex.doc_id)
    print("- Baseline")
    print("  rewritten:", getattr(base_pred, "rewritten_query", None))
    print("  context_ids:", getattr(base_pred, "context_ids", None))
    print("  answer:", getattr(base_pred, "answer", None))
    print("  answer_ok:", exact_match(base_pred.answer, ex.answer))
    print("- Optimized")
    print("  rewritten:", getattr(opt_pred, "rewritten_query", None))
    print("  context_ids:", getattr(opt_pred, "context_ids", None))
    print("  answer:", getattr(opt_pred, "answer", None))
    print("  answer_ok:", exact_match(opt_pred.answer, ex.answer))

improved = []
regressed = []

for i, ex in enumerate(devset):
    base_pred = baseline(question=ex.question)
    opt_pred  = optimized(question=ex.question)

    base_ok = exact_match(base_pred.answer, ex.answer)
    opt_ok  = exact_match(opt_pred.answer, ex.answer)

    if (not base_ok) and opt_ok:
        improved.append((i, ex, base_pred, opt_pred))
    elif base_ok and (not opt_ok):
        regressed.append((i, ex, base_pred, opt_pred))

print(f"\nImproved cases (wrong→right): {len(improved)}")
print(f"Regressions (right→wrong): {len(regressed)}")

# Show a few examples
for tup in improved[:5]:
    show_case(*tup)

if regressed:
    print("\nShowing regressions:")
    for tup in regressed[:3]:
        show_case(*tup)

# Optional: inspect learned few-shot demos (“winning prompts”)
def show_demos(label, prog):
    print("\n" + "-"*80)
    print(label)
    try:
        preds = getattr(prog, "predictors", lambda: [])()
        for p in preds:
            print("\nPREDICTOR:", type(p).__name__)
            demos = getattr(p, "demos", None)
            if not demos:
                print("  (no demos)")
                continue
            print(f"  demos: {len(demos)}")
            for d in demos[:3]:
                # demos are dspy.Example-like objects
                print("   -", d)
    except Exception as e:
        print("Could not inspect demos:", e)

show_demos("Baseline demos", baseline)
show_demos("Optimized demos", optimized)

Baseline metrics: {'accuracy': 0.6666666666666666, 'retrieval_hit_rate': 1.0}
Optimized metrics: {'accuracy': 0.875, 'retrieval_hit_rate': 1.0}



Improved cases (wrong→right): 5
Regressions (right→wrong): 0

[dev #1] Q: What is the latest release version of the Northbridge Nimbus Compiler?
Gold: 3.4 | gold doc: 35
- Baseline
  rewritten: latest release version Northbridge Nimbus Compiler
  context_ids: [35, 46, 33]
  answer: version 3.4
  answer_ok: False
- Optimized
  rewritten: latest release version Northbridge Nimbus Compiler
  context_ids: [35, 46, 33]
  answer: 3.4
  answer_ok: True

[dev #2] Q: What version is the latest release of the Osprey Aurora Compiler?
Gold: 4.5 | gold doc: 3
- Baseline
  rewritten: latest release version Osprey Aurora Compiler
  context_ids: [3, 39, 20]
  answer: version 4.5
  answer_ok: False
- Optimized
  rewritten: latest version Osprey Aurora Compiler
  context_ids: [3, 39, 20]
  answer: 4.5
  answer_ok: True

[dev #3] Q: What outputs does the Redstone Redstone API support?
Gold: JSON and CSV | gold doc: 29
- Baseline
  rewritten: Redstone API supported output fields
  context_ids: [29, 40, 9

### Result Analysis

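- **Baseline vs optimized**:
  - Baseline dev metrics: accuracy = 0.6667 (16 of 24 correct), retrieval hit rate = 1.0.
  - Optimized dev metrics: accuracy = 0.875 (21 of 24 correct), retrieval hit rate = 1.0.

- **Which metric improved**:
  - Accuracy improved (0.6667 → 0.875): 5 dev questions flipped from wrong to right, with no regressions.
  - Retrieval hit rate stayed at 1.0 because the gold document was already among the top 3 retrieved passages for every dev question.

- **Why accuracy improved**:
  - Most failures were answer formatting, not missing knowledge.
  - Baseline often added extra words like "version 3.4" or "JSON and CSV outputs".
  - Our metric is strict exact match, so those extra words count as wrong.
  - The optimized program learned (via DSPy’s compiled few-shot demos) to output the short exact span (e.g., "3.4", "JSON and CSV").
  - `BootstrapFewShotWithRandomSearch` selects demos for each predictor and leaves the signature instructions as written, so it did not add a phrase such as “Let’s think step by step”. The rewritten queries in the cases shown were also nearly unchanged.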
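
The effect of the strict metric can be reproduced in isolation. The snippet below copies `normalize_text` and `exact_match` from the solution and runs them on answer strings like those discussed above; the `cases` list is assembled here for illustration and is not taken from the notebook output.

```python
import re

def normalize_text(text):
    text = text.lower().strip()
    text = re.sub(r"[^a-z0-9 ]+", "", text)  # drops punctuation, including "."
    text = re.sub(r"\s+", " ", text)
    return text

def exact_match(pred, gold):
    return normalize_text(pred) == normalize_text(gold)

# (prediction, gold) pairs: extra words fail, case and punctuation do not.
cases = [
    ("version 3.4", "3.4"),
    ("3.4", "3.4"),
    ("JSON and CSV outputs", "JSON and CSV"),
    ("json and csv.", "JSON and CSV"),
]
results = [exact_match(pred, gold) for pred, gold in cases]
print(results)  # [False, True, False, True]

# Side effect worth knowing: the decimal point is stripped, so "34" also
# matches a gold answer of "3.4".
print(exact_match("34", "3.4"))  # True
```

One possible design choice, if the collision between "34" and "3.4" matters for a task, is to keep `.` between digits during normalization. That would be a change to the metric and is not something the original solution does.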

# Exercise 2

Coding: Implement a simple version of EvoPrompt. Represent a prompt as a list of tokens or words. Define two evolutionary operators: mutate (randomly replace or insert a word) and crossover (swap a segment between two prompts). Use an LLM (or a heuristic function) to evaluate the fitness of each prompt (e.g., BLEU score or any task-specific score). Start with a few initial prompts and run a few generations of evolution. Did the prompts improve? This can be done on a trivial task, such as prompting an LLM to output a specific keyword and evolving prompts to maximize how often that keyword occurs in the response.

In [None]:
# Implementation
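
One way the loop described above could look is sketched below. It is a minimal illustration under stated assumptions, not the EvoPrompt algorithm itself: the fitness function is a heuristic stand-in for an LLM score (it rewards prompts that contain a hypothetical set of target words and penalizes length), crossover is a simple single-point tail swap that returns one child, and all names (`TARGET_WORDS`, `VOCAB`, `evolve`, the seed prompts) are invented for the example.

```python
import random

# Hypothetical task: evolve a prompt containing these words.
TARGET_WORDS = {"reply", "with", "the", "keyword", "banana"}
VOCAB = sorted(TARGET_WORDS | {"please", "write", "say", "only", "a", "story", "about", "fruit"})

def fitness(prompt):
    """Heuristic stand-in for an LLM score: distinct target words minus a length penalty."""
    return len(TARGET_WORDS & set(prompt)) - 0.1 * len(prompt)

def mutate(prompt, rng):
    """Randomly replace one word or insert a new word."""
    child = list(prompt)
    if child and rng.random() < 0.5:
        child[rng.randrange(len(child))] = rng.choice(VOCAB)
    else:
        child.insert(rng.randint(0, len(child)), rng.choice(VOCAB))
    return child

def crossover(a, b, rng):
    """Single-point crossover: a prefix of `a` joined to a suffix of `b`."""
    child = a[: rng.randint(0, len(a))] + b[rng.randint(0, len(b)):]
    return child or list(a)  # never return an empty prompt

def evolve(population, generations=20, pop_size=8, seed=0):
    rng = random.Random(seed)
    population = [list(p) for p in population]
    history = [max(fitness(p) for p in population)]
    for _ in range(generations):
        # Elitism: the fittest prompts survive unchanged and act as parents.
        parents = sorted(population, key=fitness, reverse=True)[: max(2, pop_size // 2)]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = parents + children
        history.append(max(fitness(p) for p in population))
    return max(population, key=fitness), history

seed_prompts = [
    ["please", "write", "a", "story"],
    ["say", "the", "fruit"],
    ["reply", "only"],
]
best, history = evolve(seed_prompts)
print("best prompt:", " ".join(best))
print("best fitness per generation:", [round(h, 1) for h in history])
```

Because the parents are carried over each generation, the best fitness can never decrease, which makes "did the prompts improve?" easy to read off `history`. To use a real LLM, `fitness` would be replaced by a function that sends the joined prompt to the model and scores the response, at the cost of one or more API calls per candidate.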

# Exercise 3

Compare reinforcement learning with evolutionary search for prompt optimization. Suppose the “policy” is the prompt text and the “environment” returns a reward (a quality score). RL would adjust the prompt using the gradient of the reward where one is available, or fall back to black-box optimization where it is not. Evolutionary methods such as GEPA and EvoPrompt instead treat the problem as a search over strings. List the pros and cons of each. For example, RL (with methods like RLPrompt or policy gradients) can directly optimize an objective but may get stuck in local optima or require many samples; evolutionary approaches search more globally and can incorporate heuristic knowledge (via LLM reflections in GEPA) but may be slow if the search space is huge. In practice, why might GEPA’s ability to incorporate natural-language reflections be advantageous in prompt tuning?

## Solution

# Exercise 4