# Intro

[`DSPy`](https://github.com/stanfordnlp/dspy) kept popping up on my X timeline and looked pretty interesting, so I decided to take a few days to look into it. I haven't gone super deep yet, but I think I have a high-level understanding. The library is fairly new IMO (as of writing this). There is excitement around it, though, and a growing community. I am hopeful that the documentation and library will continue to improve throughout the year. If you are completely new to `DSPy`, I would suggest the following resources.

- Read through the newer documentation [here](https://dspy-docs.vercel.app/docs/intro).
- Check out the README from the [`DSPy` GitHub repo](https://github.com/stanfordnlp/dspy) and the examples there.
- Try to code up some simple examples on your own data.
- Check out the [Discord server](https://discord.gg/s7cFzpw3Mj).
- Skim through or read some of the associated papers (see the paper links on the `DSPy` repo [README](https://github.com/stanfordnlp/dspy?tab=readme-ov-file#dspy-programmingnot-promptingfoundation-models)).
- There are also some decent videos on YouTube. Simply search for `DSPy` LLM, etc.
- Follow [Omar Khattab](https://twitter.com/lateinteraction).

# ENV Setup

```
python3 -m venv env
source env/bin/activate
pip install dspy-ai
pip install openai --upgrade
pip install --upgrade notebook ipywidgets
```

```python
import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
```

# BIG-Bench Hard Dataset - Penguins In a Table - Example

The [BIG-Bench Hard dataset](https://github.com/suzgunmirac/BIG-Bench-Hard) [@suzgun2022challenging] contains various tasks. You can pass one of the following strings to `load_dataset` to load the records for the corresponding task.

```python
['tracking_shuffled_objects_seven_objects', 'salient_translation_error_detection', 'tracking_shuffled_objects_three_objects', 'geometric_shapes', 'object_counting', 'word_sorting', 'logical_deduction_five_objects', 'hyperbaton', 'sports_understanding', 'logical_deduction_seven_objects', 'multistep_arithmetic_two', 'ruin_names', 'causal_judgement', 'logical_deduction_three_objects', 'formal_fallacies', 'snarks', 'boolean_expressions', 'reasoning_about_colored_objects', 'dyck_languages', 'navigate', 'disambiguation_qa', 'temporal_sequences', 'web_of_lies', 'tracking_shuffled_objects_five_objects', 'penguins_in_a_table', 'movie_recommendation', 'date_understanding']
```

We will use the `penguins_in_a_table` task.

In [1]:
from datasets import load_dataset
import dspy

ds = load_dataset("maveriq/bigbenchhard", "penguins_in_a_table")["train"]
examples = [dspy.Example({"question": r["input"], "answer": r["target"]}).with_inputs("question") for r in ds]
print(f"There are {len(examples)} examples.")
trainset = examples[0:20]
valset = examples[20:]

There are 146 examples.
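
The split above takes the first 20 records for training and leaves the remaining 126 for validation. As a minimal sketch of the same arithmetic on stand-in data, a seeded shuffle before splitting could look like the following. The `records` list and the shuffle step are illustrative assumptions and are not part of the original notebook.

```python
import random

# Stand-in for the 146 loaded examples; the real notebook uses dspy.Example objects.
records = [{"question": f"q{i}", "answer": "(A)"} for i in range(146)]

# Hypothetical extra step: shuffle with a fixed seed so the split is reproducible.
rng = random.Random(0)
shuffled = records[:]  # copy so the original order is untouched
rng.shuffle(shuffled)

train = shuffled[:20]  # 20 examples for the optimizer to draw demos from
val = shuffled[20:]    # the remaining examples are used for scoring

print(len(train), len(val))
```

Whether shuffling matters depends on how the dataset is ordered, which I have not checked here.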


In [2]:
example = trainset[10]
for k, v in example.items():
    print(f"\n{k.upper()}:\n")
    print(v)


QUESTION:

Here is a table where the first line is a header and each subsequent line is a penguin:  name, age, height (cm), weight (kg) Louis, 7, 50, 11 Bernard, 5, 80, 13 Vincent, 9, 60, 11 Gwen, 8, 70, 15  For example: the age of Louis is 7, the weight of Gwen is 15 kg, the height of Bernard is 80 cm.  We then delete the penguin named Bernard from the table.
How many penguins are more than 8 years old?
Options:
(A) 1
(B) 2
(C) 3
(D) 4
(E) 5

ANSWER:

(A)


We will use the `DSPy` OpenAI connector to make calls to gpt-3.5. Note that `DSPy` caches API calls, so subsequent calls with the same input are read from the cache instead of calling the OpenAI API a second time.

In [3]:
llm = dspy.OpenAI(model="gpt-3.5-turbo-0125", max_tokens=250)
dspy.settings.configure(lm=llm)
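
To make the caching idea concrete, here is a small in-memory sketch using `functools.lru_cache`. It only illustrates the general behaviour (same input, no second call) and is not how `DSPy` implements its cache; `fake_llm` and `call_count` are hypothetical names made up for this example.

```python
from functools import lru_cache

call_count = 0  # counts how many times the "API" is actually hit


@lru_cache(maxsize=None)
def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a paid API call."""
    global call_count
    call_count += 1
    return f"echo: {prompt}"


first = fake_llm("Testing testing")
second = fake_llm("Testing testing")    # same input, served from the cache
third = fake_llm("A different prompt")  # new input, so a new call is made

print(call_count)
```

This is also why the progress bars in the evaluation runs further down can finish almost instantly on a rerun.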

We can test that the calls to OpenAI are working:

In [4]:
llm("Testing testing, is anyone out there?")

["Hello! I'm here to help. What can I assist you with today?"]

In [5]:
llm(example.question)

['There are 2 penguins who are more than 8 years old: Vincent (9 years old) and Gwen (8 years old). \n\nTherefore, the answer is (B) 2.']

At any point we can look at the last `n` calls to the llm:

In [6]:
llm.inspect_history(n=2)





Testing testing, is anyone out there? Hello! I'm here to help. What can I assist you with today?







Here is a table where the first line is a header and each subsequent line is a penguin:  name, age, height (cm), weight (kg) Louis, 7, 50, 11 Bernard, 5, 80, 13 Vincent, 9, 60, 11 Gwen, 8, 70, 15  For example: the age of Louis is 7, the weight of Gwen is 15 kg, the height of Bernard is 80 cm.  We then delete the penguin named Bernard from the table.
How many penguins are more than 8 years old?
Options:
(A) 1
(B) 2
(C) 3
(D) 4
(E) 5 There are 2 penguins who are more than 8 years old: Vincent (9 years old) and Gwen (8 years old).

Therefore, the answer is (B) 2.
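
Conceptually, inspecting the last `n` calls only requires remembering each prompt/response pair. The sketch below shows that idea with a hypothetical `RecordingLM` wrapper; it is an illustration in plain Python, not the `DSPy` implementation.

```python
class RecordingLM:
    """Hypothetical wrapper that remembers every prompt/response pair."""

    def __init__(self, generate):
        self.generate = generate  # any callable mapping prompt -> response
        self.history = []

    def __call__(self, prompt):
        response = self.generate(prompt)
        self.history.append((prompt, response))
        return response

    def last_calls(self, n=1):
        # Return the n most recent (prompt, response) pairs, oldest first.
        return self.history[-n:]


lm = RecordingLM(lambda p: p.upper())  # toy "model" that shouts the prompt back
lm("first")
lm("second")
lm("third")
recent = lm.last_calls(n=2)
print(recent)
```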




Our evaluation metric checks whether the LLM output contains the correct multiple-choice answer. To define an evaluation metric in `DSPy`, we create a function like the one below. The first two arguments should be instances of `dspy.Example`. The metric function can contain any logic you need to evaluate your task. You can read more about the `trace` argument in the [documentation](https://dspy-docs.vercel.app/docs/building-blocks/metrics#simple-metrics). It needs to be in the signature even if you do not use it explicitly.

In [7]:
import re


def eval_metric(true, prediction, trace=None):
    pred = prediction.answer
    matches = re.findall(r"\([A-Z]\)", pred)
    parsed_answer = matches[-1] if matches else ""
    return parsed_answer == true.answer
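
To see how this metric behaves on different output styles, here is a small self-contained check. The `SimpleNamespace` objects are stand-ins for `dspy.Example` and prediction objects (an assumption made so the snippet runs without `DSPy`), and the sample strings are made up for illustration.

```python
import re
from types import SimpleNamespace


def eval_metric(true, prediction, trace=None):
    # Same logic as the metric above: take the last "(X)" pattern in the prediction.
    matches = re.findall(r"\([A-Z]\)", prediction.answer)
    parsed_answer = matches[-1] if matches else ""
    return parsed_answer == true.answer


# SimpleNamespace objects stand in for dspy.Example / prediction objects here.
gold = SimpleNamespace(answer="(B)")

results = [
    eval_metric(gold, SimpleNamespace(answer="(B) 5")),                # correct letter present
    eval_metric(gold, SimpleNamespace(answer="Answer: (C) 3")),        # wrong letter
    eval_metric(gold, SimpleNamespace(answer="James")),                # no letter at all
    eval_metric(gold, SimpleNamespace(answer="Not (A), rather (B)")),  # last match wins
]
print(results)  # [True, False, False, True]
```

Note that an answer such as `James` is scored as wrong even when it names the right option, because the metric only looks for the letter in parentheses.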

We set up an evaluation pipeline:

In [8]:
from dspy.evaluate import Evaluate

evaluate = Evaluate(devset=valset, metric=eval_metric, num_threads=6, display_progress=True, display_table=10)

Here is a simple module in `DSPy` for basic question and answer.

In [9]:
class BasicQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.prog = dspy.Predict("question -> answer")

    def forward(self, question):
        return self.prog(question=question)


basic_qa = BasicQA()

Calling the module invokes `__call__`, which in turn calls the `forward` method, similar to how things work in PyTorch.

In [10]:
pred = basic_qa(question=example.question)
print("\nQUESTION:\n")
print(example.question)
print("\nANSWER:\n")
print(example.answer)
print("\nPREDICTION:\n")
print(pred.answer)


QUESTION:

Here is a table where the first line is a header and each subsequent line is a penguin:  name, age, height (cm), weight (kg) Louis, 7, 50, 11 Bernard, 5, 80, 13 Vincent, 9, 60, 11 Gwen, 8, 70, 15  For example: the age of Louis is 7, the weight of Gwen is 15 kg, the height of Bernard is 80 cm.  We then delete the penguin named Bernard from the table.
How many penguins are more than 8 years old?
Options:
(A) 1
(B) 2
(C) 3
(D) 4
(E) 5

ANSWER:

(A)

PREDICTION:

(B) 2
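
When a module instance is called like a function, `__call__` runs and hands the arguments on to `forward`. A stripped-down sketch of that dispatch pattern in plain Python is shown below; `TinyModule` and `Shout` are hypothetical classes for illustration and leave out everything else a real `dspy.Module` does.

```python
class TinyModule:
    """Hypothetical base class: calling an instance delegates to forward()."""

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)

    def forward(self, *args, **kwargs):
        raise NotImplementedError


class Shout(TinyModule):
    def forward(self, question):
        return question.upper() + "!"


shout = Shout()
answer = shout(question="how many penguins")  # __call__ -> forward
print(answer)  # HOW MANY PENGUINS!
```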


In [11]:
eval_metric(example, pred)

False

In [12]:
llm.inspect_history(n=1)





Given the fields `question`, produce the fields `answer`.

---

Follow the following format.

Question: ${question}
Answer: ${answer}

---

Question: Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis, 7, 50, 11 Bernard, 5, 80, 13 Vincent, 9, 60, 11 Gwen, 8, 70, 15 For example: the age of Louis is 7, the weight of Gwen is 15 kg, the height of Bernard is 80 cm. We then delete the penguin named Bernard from the table. How many penguins are more than 8 years old? Options: (A) 1 (B) 2 (C) 3 (D) 4 (E) 5
Answer: (B) 2




Now we can pass each question in the validation set through the LLM and check whether we get the correct answer:

In [13]:
# | echo: false
from tqdm.notebook import tqdm

tqdm._instances.clear()

In [14]:
# | warning: false
evaluate(basic_qa)

Average Metric: 44 / 126  (34.9): 100%|██████████| 126/126 [00:00<00:00, 1000.44it/s]

Average Metric: 44 / 126  (34.9%)





|  | question | example_answer | pred_answer | eval_metric |
|---|---|---|---|---|
| 0 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (A) | 3 | False |
| 1 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (D) | (C) 50 | False |
| 2 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (A) | Answer: (C) 3 | False |
| 3 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (A) | Answer: (B) 2 | False |
| 4 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (B) | (B) 5 | ✔️ [True] |
| 5 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (C) | (B) 2 | False |
| 6 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (E) | James | False |
| 7 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (A) | (B) 2 | False |
| 8 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (C) | Answer: Vincent | False |
| 9 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (D) | Answer: Donna | False |


34.92
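
The number reported by the evaluator is just the share of validation examples for which the metric returned `True`. A minimal sketch of that arithmetic is below; `average_metric` is a hypothetical helper, and the list of outcomes is reconstructed from the 44 / 126 result above rather than taken from the actual per-example results.

```python
def average_metric(outcomes):
    """Return (number correct, total, percentage) for a list of True/False metric results."""
    correct = sum(bool(o) for o in outcomes)
    total = len(outcomes)
    return correct, total, round(100 * correct / total, 2)


# Stand-in outcomes matching the run above: 44 correct out of 126.
outcomes = [True] * 44 + [False] * 82
correct, total, pct = average_metric(outcomes)
print(f"Average Metric: {correct} / {total}  ({pct}%)")  # Average Metric: 44 / 126  (34.92%)
```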

`DSPy` uses optimizers to optimize modules. In this example, optimization is a process that chooses which demos/examples are best to put into the prompt in order to increase the evaluation metric. At the time of writing, the optimizers are called teleprompters (prompting from a distance). I think the [name](https://dspy-docs.vercel.app/docs/building-blocks/optimizers) will change to optimizers in a future refactoring, though. The `DSPy` documentation states that an optimizer can adjust/edit:

- Demo examples in the prompt.
- Instructions of the prompt.
- Weights of the actual LLM (for example fine tuning an open source model).

I have only played around with optimizers that optimize which demos/examples are put into the prompt.
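
To give a feel for what "choosing demos to increase the metric" means, here is a toy random search in plain Python. It is only a sketch of the general idea under made-up assumptions: `score_program` is a hypothetical scorer that pretends a demo helps on validation items with the same topic, and none of this reflects how the `DSPy` optimizers are actually implemented.

```python
import random


def score_program(demos, valset):
    """Hypothetical scorer: pretend a demo helps on validation items sharing its topic."""
    topics = {d["topic"] for d in demos}
    hits = sum(1 for item in valset if item["topic"] in topics)
    return 100 * hits / len(valset)


def random_search(trainset, valset, num_candidates=5, demos_per_prompt=2, seed=0):
    rng = random.Random(seed)
    best_demos, best_score = [], score_program([], valset)  # baseline: no demos
    for _ in range(num_candidates):
        candidate = rng.sample(trainset, demos_per_prompt)  # random subset of demos
        score = score_program(candidate, valset)
        if score > best_score:  # keep the best-scoring demo set seen so far
            best_demos, best_score = candidate, score
    return best_demos, best_score


trainset = [{"topic": t} for t in ["age", "height", "weight", "name"]]
valset = [{"topic": t} for t in ["age", "age", "weight", "name", "height", "age"]]
best_demos, best_score = random_search(trainset, valset)
print(best_demos, best_score)
```

The real optimizer scores candidates by running the program with the LLM and the evaluation metric, which is why the metric function is passed in when the optimizer is created.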

In [15]:
# | echo: false
tqdm._instances.clear()

In [16]:
# | output: false
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

config = dict(max_bootstrapped_demos=2, max_labeled_demos=4, num_candidate_programs=2, num_threads=6)

teleprompter = BootstrapFewShotWithRandomSearch(metric=eval_metric, **config)
optimized_qa = teleprompter.compile(basic_qa, trainset=trainset, valset=valset)

Going to sample between 1 and 2 traces per predictor.
Will attempt to train 2 candidate sets.


Average Metric: 44 / 126  (34.9): 100%|██████████| 126/126 [00:00<00:00, 4235.58it/s]


Average Metric: 44 / 126  (34.9%)
Score: 34.92 for set: [0]
New best score: 34.92 for seed -3
Scores so far: [34.92]
Best score: 34.92


Average Metric: 47 / 126  (37.3): 100%|██████████| 126/126 [00:00<00:00, 1256.38it/s]


Average Metric: 47 / 126  (37.3%)
Score: 37.3 for set: [4]
New best score: 37.3 for seed -2
Scores so far: [34.92, 37.3]
Best score: 37.3


 50%|█████     | 10/20 [00:00<00:00, 874.43it/s]


Bootstrapped 2 full traces after 11 examples in round 0.


Average Metric: 48 / 126  (38.1): 100%|██████████| 126/126 [00:00<00:00, 1297.79it/s]


Average Metric: 48 / 126  (38.1%)
Score: 38.1 for set: [4]
New best score: 38.1 for seed -1
Scores so far: [34.92, 37.3, 38.1]
Best score: 38.1
Average of max per entry across top 1 scores: 0.38095238095238093
Average of max per entry across top 2 scores: 0.5476190476190477
Average of max per entry across top 3 scores: 0.6746031746031746
Average of max per entry across top 5 scores: 0.6746031746031746
Average of max per entry across top 8 scores: 0.6746031746031746
Average of max per entry across top 9999 scores: 0.6746031746031746


 20%|██        | 4/20 [00:00<00:00, 860.02it/s]


Bootstrapped 2 full traces after 5 examples in round 0.


Average Metric: 50 / 126  (39.7): 100%|██████████| 126/126 [00:00<00:00, 1245.68it/s]


Average Metric: 50 / 126  (39.7%)
Score: 39.68 for set: [4]
New best score: 39.68 for seed 0
Scores so far: [34.92, 37.3, 38.1, 39.68]
Best score: 39.68
Average of max per entry across top 1 scores: 0.3968253968253968
Average of max per entry across top 2 scores: 0.5079365079365079
Average of max per entry across top 3 scores: 0.6031746031746031
Average of max per entry across top 5 scores: 0.7063492063492064
Average of max per entry across top 8 scores: 0.7063492063492064
Average of max per entry across top 9999 scores: 0.7063492063492064


  5%|▌         | 1/20 [00:00<00:00, 402.56it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 54 / 126  (42.9): 100%|██████████| 126/126 [00:00<00:00, 1218.83it/s]


Average Metric: 54 / 126  (42.9%)
Score: 42.86 for set: [4]
New best score: 42.86 for seed 1
Scores so far: [34.92, 37.3, 38.1, 39.68, 42.86]
Best score: 42.86
Average of max per entry across top 1 scores: 0.42857142857142855
Average of max per entry across top 2 scores: 0.5396825396825397
Average of max per entry across top 3 scores: 0.6190476190476191
Average of max per entry across top 5 scores: 0.7619047619047619
Average of max per entry across top 8 scores: 0.7619047619047619
Average of max per entry across top 9999 scores: 0.7619047619047619
5 candidate programs found.


There is a lot of output from the above code block which I am hiding to keep things cleaner.
You can now evaluate the optimized model to see if the accuracy has improved.

In [17]:
# | warning: false
evaluate(optimized_qa)

Average Metric: 54 / 126  (42.9): 100%|██████████| 126/126 [00:00<00:00, 1685.64it/s]


Average Metric: 54 / 126  (42.9%)


|  | question | example_answer | pred_answer | eval_metric |
|---|---|---|---|---|
| 0 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (A) | (C) | False |
| 1 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (D) | (C) 50 | False |
| 2 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (A) | (B) | False |
| 3 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (A) | (B) | False |
| 4 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (B) | (B) | ✔️ [True] |
| 5 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (C) | (C) | ✔️ [True] |
| 6 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (E) | (D) | False |
| 7 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (A) | (B) | False |
| 8 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (C) | (D) Gwen | False |
| 9 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (D) | (D) Donna | ✔️ [True] |


42.86

In [18]:
llm.inspect_history()





Given the fields `question`, produce the fields `answer`.

---

Follow the following format.

Question: ${question}
Answer: ${answer}

---

Question: Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis, 7, 50, 11 Bernard, 5, 80, 13 Vincent, 9, 60, 11 Gwen, 8, 70, 15 For example: the age of Louis is 7, the weight of Gwen is 15 kg, the height of Bernard is 80 cm. And here is a similar table, but listing giraffes: name, age, height (cm), weight (kg) Jody, 5, 430, 620 Gladys, 10, 420, 590 Marian, 2, 310, 410 Donna, 9, 440, 650 How many giraffes are more than 5 years old? Options: (A) 1 (B) 2 (C) 3 (D) 4 (E) 5
Answer: (B)

---

Question: Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis, 7, 50, 11 Bernard, 5, 80, 13 Vincent, 9, 60, 11 Gwen, 8, 70, 15 For example: the age of Louis is 7, the weight of Gwen is 15 kg, the height of Ber

Now we can try a [Chain of Thought](https://arxiv.org/abs/2201.11903) [@wei2023chainofthought] prompt.

In [19]:
class CoT(dspy.Module):
    def __init__(self):
        super().__init__()
        self.prog = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        return self.prog(question=question)


cot_qa = CoT()

In [20]:
# | warning: false
evaluate(cot_qa)

Average Metric: 90 / 126  (71.4): 100%|██████████| 126/126 [00:00<00:00, 1375.34it/s]

Average Metric: 90 / 126  (71.4%)





|  | question | example_answer | rationale | pred_answer | eval_metric |
|---|---|---|---|---|---|
| 0 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (A) | produce the answer. We first identify the penguins who are less than 8 years old. From the table, we see that Louis is 7 years... | (B) 2 | False |
| 1 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (D) | produce the answer. We need to add up the weights of all the penguins in the table. Louis weighs 11 kg, Bernard weighs 13 kg,... | (D) 62 | ✔️ [True] |
| 2 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (A) | produce the answer. We need to go through each penguin's age and count how many are more than 8 years old. | (C) 3 | False |
| 3 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (A) | produce the answer. We need to identify the penguins who are both more than 5 years old and weigh more than 12 kg. Looking at... | (C) 3 | False |
| 4 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (B) | produce the answer. We can see from the table that Bernard's age is 5. | (B) 5 | ✔️ [True] |
| 5 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (C) | produce the answer. We first identify the penguins who are less than 10 years old. Louis is 7 years old, Bernard is 5 years old,... | (D) 4 | False |
| 6 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (E) | produce the answer. We need to identify the last penguin added to the table. By looking at the last entry in the penguin table, we... | James | False |
| 7 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (A) | produce the answer. We first need to identify the penguins who are more than 5 years old and weigh more than 12 kg. From the... | (A) 1 | ✔️ [True] |
| 8 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (C) | produce the answer. We need to find the penguin with a height of 60 cm. Looking at the table, we see that Vincent is the... | (C) Vincent | ✔️ [True] |
| 9 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (D) | produce the answer. We need to look at the last entry in the table listing giraffes. The last giraffe listed is Donna. | (D) Donna | ✔️ [True] |


71.43

In [21]:
llm.inspect_history()





Given the fields `question`, produce the fields `answer`.

---

Follow the following format.

Question: ${question}
Reasoning: Let's think step by step in order to ${produce the answer}. We ...
Answer: ${answer}

---

Question: Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis, 7, 50, 11 Bernard, 5, 80, 13 Vincent, 9, 60, 11 Gwen, 8, 70, 15 For example: the age of Louis is 7, the weight of Gwen is 15 kg, the height of Bernard is 80 cm. We now add a penguin to the table: James, 12, 90, 12 What is the name of the last penguin sorted by alphabetic order? Options: (A) Louis (B) Bernard (C) Vincent (D) Gwen (E) James
Reasoning: Let's think step by step in order to produce the answer. We need to first add the new penguin James to the table and then sort the penguins by alphabetical order based on their names. The last penguin in the sorted list will be the answer.
Answer: (E) James
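
The only visible difference from the basic prompt is the extra `Reasoning:` line that the model is asked to complete before giving its answer. The sketch below mimics that layout with a hypothetical `build_prompt` helper. It is a simplified imitation of the printed prompt (it skips the "Follow the following format" section) and is not how `DSPy` builds prompts internally.

```python
def build_prompt(question, chain_of_thought=False):
    """Hypothetical helper that mimics the prompt layout printed above."""
    lines = [
        "Given the fields `question`, produce the fields `answer`.",
        "",
        "---",
        "",
        f"Question: {question}",
    ]
    if chain_of_thought:
        # The model is asked to continue the reasoning before giving the answer.
        lines.append("Reasoning: Let's think step by step in order to")
    else:
        lines.append("Answer:")
    return "\n".join(lines)


question = "How many penguins are more than 8 years old?"
basic_prompt = build_prompt(question)
cot_prompt = build_prompt(question, chain_of_thought=True)
print(cot_prompt)
```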




Now we will try to optimize our chain of thought program. I am also hiding the output from this cell to keep things cleaner.

In [22]:
# | echo: false
tqdm._instances.clear()

In [23]:
# | output: false
tqdm._instances.clear()
config = dict(max_bootstrapped_demos=1, max_labeled_demos=4, num_candidate_programs=4, num_threads=6)
teleprompter = BootstrapFewShotWithRandomSearch(metric=eval_metric, **config)
optimized_cot_qa = teleprompter.compile(cot_qa, trainset=trainset, valset=valset)

Going to sample between 1 and 1 traces per predictor.
Will attempt to train 4 candidate sets.


Average Metric: 90 / 126  (71.4): 100%|██████████| 126/126 [00:00<00:00, 4723.48it/s]


Average Metric: 90 / 126  (71.4%)
Score: 71.43 for set: [0]
New best score: 71.43 for seed -3
Scores so far: [71.43]
Best score: 71.43


Average Metric: 90 / 126  (71.4): 100%|██████████| 126/126 [00:00<00:00, 691.57it/s]


Average Metric: 90 / 126  (71.4%)
Score: 71.43 for set: [4]
Scores so far: [71.43, 71.43]
Best score: 71.43


  5%|▌         | 1/20 [00:00<00:00, 746.18it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 74 / 126  (58.7): 100%|██████████| 126/126 [00:00<00:00, 1284.54it/s]


Average Metric: 74 / 126  (58.7%)
Score: 58.73 for set: [4]
Scores so far: [71.43, 71.43, 58.73]
Best score: 71.43
Average of max per entry across top 1 scores: 0.7142857142857143
Average of max per entry across top 2 scores: 0.873015873015873
Average of max per entry across top 3 scores: 0.9126984126984127
Average of max per entry across top 5 scores: 0.9126984126984127
Average of max per entry across top 8 scores: 0.9126984126984127
Average of max per entry across top 9999 scores: 0.9126984126984127


  5%|▌         | 1/20 [00:00<00:00, 762.74it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 102 / 126  (81.0): 100%|██████████| 126/126 [00:00<00:00, 892.42it/s]


Average Metric: 102 / 126  (81.0%)
Score: 80.95 for set: [4]
New best score: 80.95 for seed 0
Scores so far: [71.43, 71.43, 58.73, 80.95]
Best score: 80.95
Average of max per entry across top 1 scores: 0.8095238095238095
Average of max per entry across top 2 scores: 0.9126984126984127
Average of max per entry across top 3 scores: 0.9444444444444444
Average of max per entry across top 5 scores: 0.9682539682539683
Average of max per entry across top 8 scores: 0.9682539682539683
Average of max per entry across top 9999 scores: 0.9682539682539683


  5%|▌         | 1/20 [00:00<00:00, 539.39it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 94 / 126  (74.6): 100%|██████████| 126/126 [00:00<00:00, 1293.66it/s]


Average Metric: 94 / 126  (74.6%)
Score: 74.6 for set: [4]
Scores so far: [71.43, 71.43, 58.73, 80.95, 74.6]
Best score: 80.95
Average of max per entry across top 1 scores: 0.8095238095238095
Average of max per entry across top 2 scores: 0.9126984126984127
Average of max per entry across top 3 scores: 0.9444444444444444
Average of max per entry across top 5 scores: 0.9841269841269841
Average of max per entry across top 8 scores: 0.9841269841269841
Average of max per entry across top 9999 scores: 0.9841269841269841


  5%|▌         | 1/20 [00:00<00:00, 684.45it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 95 / 126  (75.4): 100%|██████████| 126/126 [00:00<00:00, 1383.44it/s]


Average Metric: 95 / 126  (75.4%)
Score: 75.4 for set: [4]
Scores so far: [71.43, 71.43, 58.73, 80.95, 74.6, 75.4]
Best score: 80.95
Average of max per entry across top 1 scores: 0.8095238095238095
Average of max per entry across top 2 scores: 0.9206349206349206
Average of max per entry across top 3 scores: 0.9444444444444444
Average of max per entry across top 5 scores: 0.9841269841269841
Average of max per entry across top 8 scores: 0.9841269841269841
Average of max per entry across top 9999 scores: 0.9841269841269841


  5%|▌         | 1/20 [00:00<00:00, 598.33it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 96 / 126  (76.2): 100%|██████████| 126/126 [00:00<00:00, 900.11it/s]


Average Metric: 96 / 126  (76.2%)
Score: 76.19 for set: [4]
Scores so far: [71.43, 71.43, 58.73, 80.95, 74.6, 75.4, 76.19]
Best score: 80.95
Average of max per entry across top 1 scores: 0.8095238095238095
Average of max per entry across top 2 scores: 0.9206349206349206
Average of max per entry across top 3 scores: 0.9603174603174603
Average of max per entry across top 5 scores: 0.9682539682539683
Average of max per entry across top 8 scores: 0.9920634920634921
Average of max per entry across top 9999 scores: 0.9920634920634921
7 candidate programs found.


In [24]:
# | warning: false
evaluate(optimized_cot_qa)

Average Metric: 102 / 126  (81.0): 100%|██████████| 126/126 [00:00<00:00, 4480.49it/s]


Average Metric: 102 / 126  (81.0%)


|  | question | example_answer | rationale | pred_answer | eval_metric |
|---|---|---|---|---|---|
| 0 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (A) | produce the answer. After deleting Bernard, the penguins left are Louis, Vincent, and Gwen. Among them, Louis and Gwen are less than 8 years old. | (B) 2 | False |
| 1 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (D) | produce the answer. We sum up the weights of all the penguins: 11 + 13 + 11 + 15 + 12 = 62. | (D) 62 | ✔️ [True] |
| 2 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (A) | produce the answer. We know that after deleting Bernard, the penguins left are Louis, Vincent, and Gwen. Among them, only Vincent is more than 8... | (A) 1 | ✔️ [True] |
| 3 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (A) | produce the answer. We have Louis, Vincent, Gwen, and James in the table. Among them, only James is more than 5 years old and weighs... | (A) 1 | ✔️ [True] |
| 4 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (B) | produce the answer. We know that the age of Bernard is 5 years old. | (B) 5 | ✔️ [True] |
| 5 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (C) | produce the answer. We know that after deleting Bernard, the penguins left are Louis, Vincent, and Gwen. Among them, Louis and Gwen are less than... | (B) 2 | False |
| 6 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (E) | produce the answer. We know that the last penguin added to the table is James. | (E) James | ✔️ [True] |
| 7 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (A) | produce the answer. After deleting Bernard, we are left with Louis, Vincent, and Gwen. Among them, only Gwen is more than 5 years old and... | (A) 1 | ✔️ [True] |
| 8 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (C) | produce the answer. We know that the only penguin with a height of 60 cm is Vincent. | (C) Vincent | ✔️ [True] |
| 9 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (D) | produce the answer. We know that the last giraffe listed is Donna. | (D) Donna | ✔️ [True] |


80.95

In [25]:
llm.inspect_history(n=1)





Given the fields `question`, produce the fields `answer`.

---

Follow the following format.

Question: ${question}
Reasoning: Let's think step by step in order to ${produce the answer}. We ...
Answer: ${answer}

---

Question: Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis, 7, 50, 11 Bernard, 5, 80, 13 Vincent, 9, 60, 11 Gwen, 8, 70, 15 For example: the age of Louis is 7, the weight of Gwen is 15 kg, the height of Bernard is 80 cm. We then delete the penguin named Bernard from the table. How many penguins are more than 8 years old? Options: (A) 1 (B) 2 (C) 3 (D) 4 (E) 5
Reasoning: Let's think step by step in order to produce the answer. We know that after deleting Bernard, the penguins left are Louis, Vincent, and Gwen. Among them, only Vincent is more than 8 years old.
Answer: (A) 1

---

Question: Here is a table where the first line is a header and each subsequent line is a penguin: name, age, he

It's really nice that the workflow above focused on:

- Writing small modules/programs.
- Choosing an optimizer.
- Running the compile/optimization step.
- Running an evaluation.

I really like this approach as an alternative to manually writing prompts and hoping for the best.
