## 04_06 - Using DSPY to optimize our prompts

### 0] Setting Up

DSPY is a framework for letting you optimize prompts by defining programs. Rather than focusing on the prompt, you focus on the system and let an optimizer optimize prompts for you.

Inspired and samples from the intro DSPY notebook https://github.com/stanfordnlp/dspy/blob/main/intro.ipynb

In [1]:
%pip install -U pip dspy-ai openai~=0.28.1
import dspy


Note: you may need to restart the kernel to use updated packages.


  from .autonotebook import tqdm as notebook_tqdm


In [2]:
prompt_model_name = "gpt-4o-mini"
prompt_model = dspy.OpenAI(model=prompt_model_name, max_tokens=1000, stop=["\n\n", "\n---"])
task_model = dspy.OpenAI(model=prompt_model_name, max_tokens=1000, stop=["\n\n", "\n---", "assistant"])
dspy.settings.configure(lm=task_model)

### 1] Basic Q&A

We'll use the HotpotQA dataset to get us started. It's a collection of abstracts (first paragraphs) from wikipedia https://hotpotqa.github.io/

In [3]:
from dspy.datasets import HotPotQA

dataset = HotPotQA(train_seed=1, train_size=20, eval_seed=2023, dev_size=50, test_size=0)

# parse the training set and dev set
trainset = [x.with_inputs('question') for x in dataset.train]
devset = [x.with_inputs('question') for x in dataset.dev]

len(trainset), len(devset)

(20, 50)

We loaded a training dataset called `trainset` (20 examples) and our validation set `devset` (50 examples). 

In [4]:
train_example = trainset[0]
print(f"Question: {train_example.question}")
print(f"Answer: {train_example.answer}")

Question: At My Window was released by which American singer-songwriter?
Answer: John Townes Van Zandt


### 2] DSPY Building blocks

##### Programs, Signatures & Predictors

Program - LLM task you want to accomplish

Signature - Input and output guidance for a program

Predictor - A way to run a program and generate a result

Each program needs at least one signature.

A signature consists of three simple elements:

- A minimal description of the sub-task the LM is supposed to solve.
- A description of one or more input fields (e.g., input question) that we will give to the LM.
- A description of one or more output fields (e.g., the question's answer) that we will expect from the LM.

Let's define a simple signature for basic question answering.

In [5]:
class BasicQA(dspy.Signature):
    """Answer questions with short factoid answers."""

    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

In [6]:
# Define the predictor.
generate_answer = dspy.Predict(BasicQA)
dev_example = devset[18]
# predict the answer using the signature
pred = generate_answer(question=dev_example.question)
print(f"Question: {dev_example.question}")
print(f"Predicted Answer: {pred.answer}")

Question: What is the nationality of the chef and restaurateur featured in Restaurant: Impossible?
Predicted Answer: American


 Let's inspect the history of our LM (**gpt4omini**).

In [7]:
task_model.inspect_history(n=1)




Answer questions with short factoid answers.

---

Follow the following format.

Question: ${question}
Answer: often between 1 and 5 words

---

Question: What is the nationality of the chef and restaurateur featured in Restaurant: Impossible?
Answer:[32m American[0m





'\n\n\nAnswer questions with short factoid answers.\n\n---\n\nFollow the following format.\n\nQuestion: ${question}\nAnswer: often between 1 and 5 words\n\n---\n\nQuestion: What is the nationality of the chef and restaurateur featured in Restaurant: Impossible?\nAnswer:\x1b[32m American\x1b[0m\n\n\n'

Let's define a chain of thought predictor and see how the responses change

In [8]:
# Define the predictor. Notice we're just changing the class. The signature BasicQA is unchanged.
generate_answer_with_chain_of_thought = dspy.ChainOfThought(BasicQA)

# Call the predictor on the same input.
pred = generate_answer_with_chain_of_thought(question=dev_example.question)

# Print the input, the chain of thought, and the prediction.
print(f"Question: {dev_example.question}")
print(f"Thought: {pred.rationale.split('.', 1)[1].strip()}")
print(f"Predicted Answer: {pred.answer}")

Question: What is the nationality of the chef and restaurateur featured in Restaurant: Impossible?
Thought: The chef featured in Restaurant: Impossible is Robert Irvine, who is from the United Kingdom.
Predicted Answer: British


This is indeed a better answer: the model figures out that the chef in question is **Robert Irvine** and correctly identifies that he's British.

These predictors (`dspy.Predict` and `dspy.ChainOfThought`) can be applied to _any_ signature. As we'll see below, they can also be optimized to learn from your data and validation logic.

### 3] Defining metrics and evaluator

Now that we have a basic predictor set up let's make it more systematic by defining an exact metric and running through an evaluator

In [9]:
from dspy.evaluate import Evaluate
metric = dspy.evaluate.answer_exact_match
evaluate = Evaluate(devset=devset, metric=metric)

baseline_train_score = evaluate(generate_answer,devset=trainset)
baseline_dev_score = evaluate(generate_answer, devset=devset)

In [10]:
print("Baseline train score:", baseline_train_score)
print("Baseline dev score:", baseline_dev_score)

Baseline train score: 30.0
Baseline dev score: 22.0


### 4] Compiling a program with a teleprompter

So far we've just used a regular prompt to get an answer, let's have DSPY try to optimize our prompts. We'll need to define a search technique called a **teleprompter** and a teacher model to construct the demontrations. 

In [11]:
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# we'll generate an number of programs in parralel when compiling using gpt4omini as an evaluator
bootstrap_optimizer = BootstrapFewShotWithRandomSearch(
    max_bootstrapped_demos=5,
    max_labeled_demos=5,
    num_candidate_programs=5,
    num_threads=8,
    metric=metric,
    teacher_settings=dict(lm=prompt_model))

Going to sample between 1 and 5 traces per predictor.
Will attempt to bootstrap 5 candidate sets.


In [12]:
cot_fewshot = bootstrap_optimizer.compile(generate_answer_with_chain_of_thought, trainset=trainset, valset=devset)

Average Metric: 4 / 8  (50.0):  14%|█▍        | 7/50 [00:00<00:00, 338.26it/s]

Average Metric: 17 / 50  (34.0): 100%|██████████| 50/50 [00:00<00:00, 756.61it/s]


Score: 34.0 for set: [0]
New best sscore: 34.0 for seed -3
Scores so far: [34.0]
Best score: 34.0


Average Metric: 19 / 50  (38.0): 100%|██████████| 50/50 [00:00<00:00, 576.44it/s]


Score: 38.0 for set: [5]
New best sscore: 38.0 for seed -2
Scores so far: [34.0, 38.0]
Best score: 38.0


 65%|██████▌   | 13/20 [00:00<00:00, 864.68it/s]


Bootstrapped 5 full traces after 14 examples in round 0.


Average Metric: 20 / 50  (40.0): 100%|██████████| 50/50 [00:00<00:00, 560.48it/s]


Score: 40.0 for set: [5]
New best sscore: 40.0 for seed -1
Scores so far: [34.0, 38.0, 40.0]
Best score: 40.0
Average of max per entry across top 1 scores: 0.4
Average of max per entry across top 2 scores: 0.42
Average of max per entry across top 3 scores: 0.44
Average of max per entry across top 5 scores: 0.44
Average of max per entry across top 8 scores: 0.44
Average of max per entry across top 9999 scores: 0.44


 40%|████      | 8/20 [00:00<00:00, 1126.14it/s]


Bootstrapped 4 full traces after 9 examples in round 0.


Average Metric: 19 / 50  (38.0): 100%|██████████| 50/50 [00:00<00:00, 584.76it/s]


Score: 38.0 for set: [5]
Scores so far: [34.0, 38.0, 40.0, 38.0]
Best score: 40.0
Average of max per entry across top 1 scores: 0.4
Average of max per entry across top 2 scores: 0.42
Average of max per entry across top 3 scores: 0.42
Average of max per entry across top 5 scores: 0.44
Average of max per entry across top 8 scores: 0.44
Average of max per entry across top 9999 scores: 0.44


 35%|███▌      | 7/20 [00:00<00:00, 955.80it/s]


Bootstrapped 2 full traces after 8 examples in round 0.


Average Metric: 19 / 50  (38.0): 100%|██████████| 50/50 [00:00<00:00, 548.01it/s]


Score: 38.0 for set: [5]
Scores so far: [34.0, 38.0, 40.0, 38.0, 38.0]
Best score: 40.0
Average of max per entry across top 1 scores: 0.4
Average of max per entry across top 2 scores: 0.42
Average of max per entry across top 3 scores: 0.42
Average of max per entry across top 5 scores: 0.44
Average of max per entry across top 8 scores: 0.44
Average of max per entry across top 9999 scores: 0.44


  5%|▌         | 1/20 [00:00<00:00, 743.01it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 16 / 50  (32.0): 100%|██████████| 50/50 [00:00<00:00, 604.46it/s]


Score: 32.0 for set: [5]
Scores so far: [34.0, 38.0, 40.0, 38.0, 38.0, 32.0]
Best score: 40.0
Average of max per entry across top 1 scores: 0.4
Average of max per entry across top 2 scores: 0.42
Average of max per entry across top 3 scores: 0.42
Average of max per entry across top 5 scores: 0.44
Average of max per entry across top 8 scores: 0.44
Average of max per entry across top 9999 scores: 0.44


 30%|███       | 6/20 [00:00<00:00, 867.28it/s]


Bootstrapped 2 full traces after 7 examples in round 0.


Average Metric: 19 / 50  (38.0): 100%|██████████| 50/50 [00:00<00:00, 540.50it/s]


Score: 38.0 for set: [5]
Scores so far: [34.0, 38.0, 40.0, 38.0, 38.0, 32.0, 38.0]
Best score: 40.0
Average of max per entry across top 1 scores: 0.4
Average of max per entry across top 2 scores: 0.42
Average of max per entry across top 3 scores: 0.42
Average of max per entry across top 5 scores: 0.42
Average of max per entry across top 8 scores: 0.44
Average of max per entry across top 9999 scores: 0.44


 35%|███▌      | 7/20 [00:00<00:00, 781.42it/s]


Bootstrapped 2 full traces after 8 examples in round 0.


Average Metric: 19 / 50  (38.0): 100%|██████████| 50/50 [00:00<00:00, 599.60it/s]

Score: 38.0 for set: [5]
Scores so far: [34.0, 38.0, 40.0, 38.0, 38.0, 32.0, 38.0, 38.0]
Best score: 40.0
Average of max per entry across top 1 scores: 0.4
Average of max per entry across top 2 scores: 0.42
Average of max per entry across top 3 scores: 0.42
Average of max per entry across top 5 scores: 0.42
Average of max per entry across top 8 scores: 0.44
Average of max per entry across top 9999 scores: 0.44
8 candidate programs found.





In [13]:
task_model.inspect_history(n=1)




Answer questions with short factoid answers.

---

Follow the following format.

Question: ${question}
Reasoning: Let's think step by step in order to ${produce the answer}. We ...
Answer: often between 1 and 5 words

---

Question: Who is older, Aleksandr Danilovich Aleksandrov or Anatoly Fomenko?
Reasoning: Let's think step by step in order to determine their birth years. Aleksandr Danilovich Aleksandrov was born in 1914, while Anatoly Fomenko was born in 1932. Therefore, Aleksandr Danilovich Aleksandrov is older.
Answer: Aleksandr Danilovich Aleksandrov

---

Question: "Everything Has Changed" is a song from an album released under which record label ?
Reasoning: Let's think step by step in order to identify the record label associated with the song "Everything Has Changed." We need to consider the artist and the album's release details. The song is by Taylor Swift featuring Ed Sheeran, from the album "Red," which was released under Big Machine Records.
Answer: Big Machine Record

'\n\n\nAnswer questions with short factoid answers.\n\n---\n\nFollow the following format.\n\nQuestion: ${question}\nReasoning: Let\'s think step by step in order to ${produce the answer}. We ...\nAnswer: often between 1 and 5 words\n\n---\n\nQuestion: Who is older, Aleksandr Danilovich Aleksandrov or Anatoly Fomenko?\nReasoning: Let\'s think step by step in order to determine their birth years. Aleksandr Danilovich Aleksandrov was born in 1914, while Anatoly Fomenko was born in 1932. Therefore, Aleksandr Danilovich Aleksandrov is older.\nAnswer: Aleksandr Danilovich Aleksandrov\n\n---\n\nQuestion: "Everything Has Changed" is a song from an album released under which record label ?\nReasoning: Let\'s think step by step in order to identify the record label associated with the song "Everything Has Changed." We need to consider the artist and the album\'s release details. The song is by Taylor Swift featuring Ed Sheeran, from the album "Red," which was released under Big Machine Records.

In [14]:
compiled_dev_score = evaluate(cot_fewshot, devset=devset)
print(compiled_dev_score)

40.0


### 4] Trying a different optimizer - MIPRO
Similar to different optmizers in model training like SGD, AdamW and Sophia, DSPY has a similar suite, Let's try compiling with Mipro to see if we get better results

In [15]:
from dspy.teleprompt import MIPROv2
mipro_teleprompter = MIPROv2(prompt_model=prompt_model, task_model=task_model, metric=metric, num_candidates=5, init_temperature=1.0)

In [16]:
cot_fewshot_mipro = mipro_teleprompter.compile(generate_answer_with_chain_of_thought, trainset=trainset, 
                                               valset=devset,num_batches=10, max_bootstrapped_demos=1,
                                               max_labeled_demos=2,requires_permission_to_run=False)


Please be advised that based on the parameters you have set, the maximum number of LM calls is projected as follows:


[93m- Prompt Model: [94m[1m10[0m[93m data summarizer calls + [94m[1m5[0m[93m * [94m[1m1[0m[93m lm calls in program + ([94m[1m2[0m[93m) lm calls in program aware proposer = [94m[1m17[0m[93m prompt model calls[0m
[93m- Task Model: [94m[1m25[0m[93m examples in minibatch * [94m[1m10[0m[93m batches + [94m[1m20[0m[93m examples in train set * [94m[1m1[0m[93m full evals = [94m[1m270[0m[93m task model calls[0m

[93m[1mEstimated Cost Calculation:[0m

[93mTotal Cost = (Number of calls to task model * (Avg Input Token Length per Call * Task Model Price per Input Token + Avg Output Token Length per Call * Task Model Price per Output Token) 
            + (Number of calls to prompt model * (Avg Input Token Length per Call * Task Prompt Price per Input Token + Avg Output Token Length per Call * Prompt Model Price per Output Token).[0m

Error getting source code: could not find class definition.

Running without program aware proposer.
b: 10
summary: Prediction(
    summary='Observations: The dataset includes various trivia questions related to musicians, actors, historical events, and publications, with a strong focus on American culture and notable figures. Key examples highlight the release of songs, film debuts, and sports history, alongside factual data about organizations and geographical locations.'
)
DATA SUMMARY: The dataset includes various trivia questions related to musicians, actors, historical events, and publications, with a strong focus on American culture and notable figures. Key examples highlight the release of songs, film debuts, and sports history, alongside factual data about organizations and geographical locations.


  5%|▌         | 1/20 [00:00<00:00, 631.29it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


  5%|▌         | 1/20 [00:00<00:00, 690.53it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


 10%|█         | 2/20 [00:00<00:00, 702.21it/s]


Bootstrapped 1 full traces after 3 examples in round 0.
Using a randomly generated configuration for our grounded proposer.
Selected tip: description
task_demos 



Use the information below to learn about a task that we are trying to solve using calls to an LM, then generate a new instruction that will be used to prompt a Language Model to better solve the task.

---

Follow the following format.

DATASET SUMMARY: A description of the dataset that we are using.

TASK DEMO(S): Example inputs/outputs of our module.

BASIC INSTRUCTION: Basic instruction.

TIP: A suggestion for how to go about generating the new instruction.

PROPOSED INSTRUCTION: Propose an instruction that will be used to prompt a Language Model to perform this task.

---

DATASET SUMMARY: The dataset includes various trivia questions related to musicians, actors, historical events, and publications, with a strong focus on American culture and notable figures. Key examples highlight the release of songs, film debuts, an

[I 2024-07-21 19:14:03,819] A new study created in memory with name: no-name-36dad537-7db2-44c9-bd5f-97862548ee62


task_demos Question: Which of these publications was most recently published, Who Put the Bomp or Self?
Reasoning: Let's think step by step in order to determine the publication dates of both titles. We need to check the release years of "Who Put the Bomp" and "Self" to see which one is more recent.
Answer: Self




Use the information below to learn about a task that we are trying to solve using calls to an LM, then generate a new instruction that will be used to prompt a Language Model to better solve the task.

---

Follow the following format.

DATASET SUMMARY: A description of the dataset that we are using.

TASK DEMO(S): Example inputs/outputs of our module.

BASIC INSTRUCTION: Basic instruction.

TIP: A suggestion for how to go about generating the new instruction.

PROPOSED INSTRUCTION: Propose an instruction that will be used to prompt a Language Model to perform this task.

---

DATASET SUMMARY: The dataset includes various trivia questions related to musicians, actors, histo

[I 2024-07-21 19:15:05,898] Trial 0 finished with value: 45.0 and parameters: {'0_predictor_instruction': 1, '0_predictor_demos': 1}. Best is trial 0 with value: 45.0.


CANDIDATE PROGRAM:
Predictor 0
i: Answer trivia questions by providing a concise fact-based response. When applicable, include a brief reasoning process that outlines how you arrived at the answer, especially when comparing items or dates. Use clear examples and relevant context related to American culture, musicians, actors, historical events, and publications to reinforce your answer.
p: Answer:


...


[I 2024-07-21 19:15:31,196] Trial 1 finished with value: 35.0 and parameters: {'0_predictor_instruction': 2, '0_predictor_demos': 1}. Best is trial 0 with value: 45.0.


FULL TRACE



Answer trivia questions by providing a concise fact-based response. When applicable, include a brief reasoning process that outlines how you arrived at the answer, especially when comparing items or dates. Use clear examples and relevant context related to American culture, musicians, actors, historical events, and publications to reinforce your answer.

---

Follow the following format.

Question: ${question}
Reasoning: Let's think step by step in order to ${produce the answer}. We ...
Answer: often between 1 and 5 words

---

Question: At My Window was released by which American singer-songwriter?
Answer: John Townes Van Zandt

---

Question: Who acted in the shot film The Shore and is also the youngest actress ever to play Ophelia in a Royal Shakespeare Company production of "Hamlet." ?
Answer: Kerry Condon

---

Question: Which of these publications was most recently published, Who Put the Bomp or Self?
Reasoning: Let's think step by step in order to[32m compare the 

[I 2024-07-21 19:16:17,129] Trial 2 finished with value: 30.0 and parameters: {'0_predictor_instruction': 4, '0_predictor_demos': 1}. Best is trial 0 with value: 45.0.
[I 2024-07-21 19:16:17,141] Trial 3 finished with value: 35.0 and parameters: {'0_predictor_instruction': 2, '0_predictor_demos': 1}. Best is trial 0 with value: 45.0.


FULL TRACE



Compare and contrast the heights of two given structures or objects by step-by-step reasoning. Begin by identifying the heights of each structure, then evaluate which one is taller, and provide a concise answer indicating the taller structure. Include specific numerical data to support your reasoning.

---

Follow the following format.

Question: ${question}
Reasoning: Let's think step by step in order to ${produce the answer}. We ...
Answer: often between 1 and 5 words

---

Question: At My Window was released by which American singer-songwriter?
Answer: John Townes Van Zandt

Question: Who acted in the shot film The Shore and is also the youngest actress ever to play Ophelia in a Royal Shakespeare Company production of "Hamlet." ?
Answer: Kerry Condon

Question: Which of these publications was most recently published, Who Put the Bomp or Self?
Reasoning: Let's think step by step in order to determine which publication was released last. We need to find the release dates

[I 2024-07-21 19:16:43,882] Trial 4 finished with value: 40.0 and parameters: {'0_predictor_instruction': 4, '0_predictor_demos': 3}. Best is trial 0 with value: 45.0.


FULL TRACE



Compare and contrast the heights of two given structures or objects by step-by-step reasoning. Begin by identifying the heights of each structure, then evaluate which one is taller, and provide a concise answer indicating the taller structure. Include specific numerical data to support your reasoning.

---

Follow the following format.

Question: ${question}
Reasoning: Let's think step by step in order to ${produce the answer}. We ...
Answer: often between 1 and 5 words

---

Question: Who is older, Aleksandr Danilovich Aleksandrov or Anatoly Fomenko?
Reasoning: Let's think step by step in order to determine their birth years. Aleksandr Danilovich Aleksandrov was born in 1937, while Anatoly Fomenko was born in 1945. Therefore, Aleksandr Danilovich Aleksandrov is older.
Answer: Aleksandr Danilovich Aleksandrov

---

Question: What is the code name for the German offensive that started this Second World War engagement on the Eastern Front (a few hundred kilometers from Mosc

[I 2024-07-21 19:17:04,469] Trial 5 finished with value: 50.0 and parameters: {'0_predictor_instruction': 0, '0_predictor_demos': 1}. Best is trial 5 with value: 50.0.


FULL TRACE



Answer questions with short factoid answers.

---

Follow the following format.

Question: ${question}
Reasoning: Let's think step by step in order to ${produce the answer}. We ...
Answer: often between 1 and 5 words

---

Question: At My Window was released by which American singer-songwriter?
Answer: John Townes Van Zandt

---

Question: Who acted in the shot film The Shore and is also the youngest actress ever to play Ophelia in a Royal Shakespeare Company production of "Hamlet." ?
Answer: Kerry Condon

---

Question: Which of these publications was most recently published, Who Put the Bomp or Self?
Reasoning: Let's think step by step in order to[32m determine the publication dates of both titles. We need to check the release years of "Who Put the Bomp" and "Self" to see which one is more recent. 
Answer: Self[0m






Answer questions with short factoid answers.

---

Follow the following format.

Question: ${question}
Reasoning: Let's think step by step in order to

[I 2024-07-21 19:17:32,595] Trial 6 finished with value: 35.0 and parameters: {'0_predictor_instruction': 4, '0_predictor_demos': 4}. Best is trial 5 with value: 50.0.


FULL TRACE



Compare and contrast the heights of two given structures or objects by step-by-step reasoning. Begin by identifying the heights of each structure, then evaluate which one is taller, and provide a concise answer indicating the taller structure. Include specific numerical data to support your reasoning.

---

Follow the following format.

Question: ${question}
Reasoning: Let's think step by step in order to ${produce the answer}. We ...
Answer: often between 1 and 5 words

---

Question: Which is taller, the Empire State Building or the Bank of America Tower?
Reasoning: Let's think step by step in order to compare their heights. The Empire State Building is 1,454 feet tall, while the Bank of America Tower is 1,200 feet tall. Therefore, the Empire State Building is taller.
Answer: Empire State Building

---

Question: This American guitarist best known for her work with the Iron Maidens is an ancestor of a composer who was known as what?
Answer: The Waltz King

---

Question

[I 2024-07-21 19:17:50,080] Trial 7 finished with value: 35.0 and parameters: {'0_predictor_instruction': 0, '0_predictor_demos': 0}. Best is trial 5 with value: 50.0.


FULL TRACE



Answer questions with short factoid answers.

---

Follow the following format.

Question: ${question}
Reasoning: Let's think step by step in order to ${produce the answer}. We ...
Answer: often between 1 and 5 words

---

Question: Which of these publications was most recently published, Who Put the Bomp or Self?
Reasoning: Let's think step by step in order to[32m compare the publication dates of both works. "Who Put the Bomp" was published in 1996, while "Self" was published in 2000. Therefore, "Self" is the more recent publication.  
Answer: Self[0m






Answer questions with short factoid answers.

---

Follow the following format.

Question: ${question}
Reasoning: Let's think step by step in order to ${produce the answer}. We ...
Answer: often between 1 and 5 words

---

Question: Which of these publications was most recently published, Who Put the Bomp or Self?
Reasoning: Let's think step by step in order to[32m compare the publication dates of both works. "Who 

[I 2024-07-21 19:18:34,651] Trial 8 finished with value: 40.0 and parameters: {'0_predictor_instruction': 3, '0_predictor_demos': 1}. Best is trial 5 with value: 50.0.
[I 2024-07-21 19:18:34,664] Trial 9 finished with value: 40.0 and parameters: {'0_predictor_instruction': 4, '0_predictor_demos': 3}. Best is trial 5 with value: 50.0.


FULL TRACE



Propose a detailed instruction that prompts the Language Model to analyze and reason through trivia questions by first identifying the key figures and events involved, then providing the necessary information to derive an accurate answer, while emphasizing the importance of comparing relevant facts to reach a conclusion.

---

Follow the following format.

Question: ${question}
Reasoning: Let's think step by step in order to ${produce the answer}. We ...
Answer: often between 1 and 5 words

---

Question: At My Window was released by which American singer-songwriter?
Answer: John Townes Van Zandt

Question: Who acted in the shot film The Shore and is also the youngest actress ever to play Ophelia in a Royal Shakespeare Company production of "Hamlet." ?
Answer: Kerry Condon

Question: Which of these publications was most recently published, Who Put the Bomp or Self?
Reasoning: Let's think step by step in order to determine which publication was released last. We need to id

In [20]:
compiled_dev_score = evaluate(cot_fewshot_mipro, devset=devset)

In [22]:
print(compiled_dev_score)
cot_fewshot_mipro.save("cot_fewshot_mipro.dspy")
cot_fewshot.save("cot_fewshot.dspy")

36.0
[('self', Predict(StringSignature(question -> rationale, answer
    instructions='Answer questions with short factoid answers.'
    question = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Question:', 'desc': '${question}'})
    rationale = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${produce the answer}. We ...', '__dspy_field_type': 'output'})
    answer = Field(annotation=str required=True json_schema_extra={'desc': 'often between 1 and 5 words', '__dspy_field_type': 'output', 'prefix': 'Answer:'})
)))]
[('self', Predict(StringSignature(question -> rationale, answer
    instructions='Answer questions with short factoid answers.'
    question = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Question:', 'desc': '${question}'})
    rationale = Field(annotation=str required=True json_schema_extra={'prefix':

### 5] V3 Adding RAG

To improve accuracy, we'll build a retrieval-augmented pipeline for answer generation.

Given a question, we'll search for the top-3 passages in Wikipedia and then feed them as context for answer generation.

Let's start by defining this signature: `context, question --> answer`.

##### Using the Retrieval Model

Using the retriever is pretty simple. A module `dspy.Retrieve(k)` will search for the top-`k` passages that match a given query.

Let's add our retriever with embeddings, hosted on a stanford domain.

In [23]:
colbertv2_wiki17_abstracts = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')

dspy.settings.configure(lm=task_model, rm=colbertv2_wiki17_abstracts)
retrieve = dspy.Retrieve(k=3)
topK_passages = retrieve(dev_example.question).passages

print(f"Top {retrieve.k} passages for question: {dev_example.question} \n", '-' * 30, '\n')

for idx, passage in enumerate(topK_passages):
    print(f'{idx+1}]', passage, '\n')

Top 3 passages for question: What is the nationality of the chef and restaurateur featured in Restaurant: Impossible? 
 ------------------------------ 

1] Restaurant: Impossible | Restaurant: Impossible is an American reality television series, featuring chef and restaurateur Robert Irvine, that aired on Food Network from 2011 to 2016. 

2] Jean Joho | Jean Joho is a French-American chef and restaurateur. He is chef/proprietor of Everest in Chicago (founded in 1986), Paris Club Bistro & Bar and Studio Paris in Chicago, The Eiffel Tower Restaurant in Las Vegas, and Brasserie JO in Boston. 

3] List of Restaurant: Impossible episodes | This is the list of the episodes for the American cooking and reality television series "Restaurant Impossible", produced by Food Network. The premise of the series is that within two days and on a budget of $10,000, celebrity chef Robert Irvine renovates a failing American restaurant with the goal of helping to restore it to profitability and prominence.

In [24]:
class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

Now let's define the RAG program. This is a class that inherits from `dspy.Module`.

It needs two methods:

- The `__init__` method will simply declare the sub-modules it needs: `dspy.Retrieve` and `dspy.ChainOfThought`. The latter is defined to implement our `GenerateAnswer` signature.
- The `forward` method will describe the control flow of answering the question using the modules we have.

In [26]:
class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
    
    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)

##### Compiling the RAG program

Having defined this program, let's now **compile** it. We'll define a new metric for checking both the retriever answers and generated answers

In [27]:
from dspy.teleprompt import BootstrapFewShot

# Validation logic: check that the predicted answer is correct.
# Also check that the retrieved context does actually contain that answer.
def validate_context_and_answer(example, pred, trace=None):
    answer_EM = dspy.evaluate.answer_exact_match(example, pred)
    answer_PM = dspy.evaluate.answer_passage_match(example, pred)
    return answer_EM and answer_PM

# Set up a basic teleprompter, which will compile our RAG program.
teleprompter = BootstrapFewShot(metric=validate_context_and_answer)

# Compile!
compiled_rag = teleprompter.compile(RAG(), trainset=trainset)

 55%|█████▌    | 11/20 [00:18<00:15,  1.72s/it]

Bootstrapped 4 full traces after 12 examples in round 0.





Now that we've compiled our RAG program, let's try it out.

Excellent. How about we inspect the last prompt for the LM?

In [None]:
task_model.inspect_history(n=1)





Answer questions with short factoid answers.

---

Question: At My Window was released by which American singer-songwriter?
Answer: John Townes Van Zandt

Question: "Everything Has Changed" is a song from an album released under which record label ?
Answer: Big Machine Records

Question: The Victorians - Their Story In Pictures is a documentary series written by an author born in what year?
Answer: 1950

Question: Which Pakistani cricket umpire who won 3 consecutive ICC umpire of the year awards in 2009, 2010, and 2011 will be in the ICC World Twenty20?
Answer: Aleem Sarwar Dar

Question: Having the combination of excellent foot speed and bat speed helped Eric Davis, create what kind of outfield for the Los Angeles Dodgers?
Answer: "Outfield of Dreams"

Question: Who is older, Aleksandr Danilovich Aleksandrov or Anatoly Fomenko?
Answer: Aleksandr Danilovich Aleksandrov

Question: The Organisation that allows a community to influence their operation or use and to enjoy the benefits 

Even though we haven't written any of this detailed demonstrations, we see that **DSPy** was able to bootstrap this 3,000 token prompt for **3-shot retrieval augmented generation with hard negative passages and chain of thought** from our extremely simple program.

This illustrates the power of composition and learning. Of course, this was just generated by a particular teleprompter, which may or may not be perfect in each setting. As you'll see in **DSPy**, there is a large but systematic space of options you have to optimize and validate the quality and cost of your programs.

If you're so inclined, you can easily inspect the learned objects themselves.

In [29]:
for name, parameter in compiled_rag.named_predictors():
    print(name)
    print(parameter.demos[0])
    print()

generate_answer
Example({'augmented': True, 'context': ['Battle of Kursk | The Battle of Kursk was a Second World War engagement between German and Soviet forces on the Eastern Front near Kursk (450 km south-west of Moscow) in the Soviet Union during July and August 1943. The battle began with the launch of the German offensive, Operation Citadel (German: "Unternehmen Zitadelle" ), on 5 July, which had the objective of pinching off the Kursk salient with attacks on the base of the salient from north and south simultaneously. After the German offensive stalled on the northern side of the salient, on 12 July the Soviets commenced their Kursk Strategic Offensive Operation with the launch of Operation Kutuzov (Russian: Кутузов ) against the rear of the German forces in the northern side. On the southern side, the Soviets also launched powerful counterattacks the same day, one of which led to a large armoured clash, the Battle of Prokhorovka. On 3 August, the Soviets began the second phase 

##### Evaluating the Answers

We can now evaluate our `compiled_rag` program on the dev set.

In [28]:
from dspy.evaluate.evaluate import Evaluate

metric = dspy.evaluate.answer_exact_match
evaluate(compiled_rag, metric=metric)

60.0

In [30]:
cot_fewshot.save("compiled_rag.dspy")

[('self', Predict(StringSignature(question -> rationale, answer
    instructions='Answer questions with short factoid answers.'
    question = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Question:', 'desc': '${question}'})
    rationale = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${produce the answer}. We ...', '__dspy_field_type': 'output'})
    answer = Field(annotation=str required=True json_schema_extra={'desc': 'often between 1 and 5 words', '__dspy_field_type': 'output', 'prefix': 'Answer:'})
)))]
