# Intro

[`DSPy`](https://github.com/stanfordnlp/dspy) kept popping up on my X timeline and looked pretty interesting, so I took a few days to look into it. I haven't gone super deep yet, but I think I have a high-level understanding. The library is fairly new (as of writing this), but there is excitement around it and a growing community. I am hopeful that the documentation and library will continue to improve throughout the year. If you are completely new to `DSPy`, I would suggest the following resources.

- Read through the newer documentation [here](https://dspy-docs.vercel.app/docs/intro).
- Check out the README from the [`DSPy` GitHub repo](https://github.com/stanfordnlp/dspy) and the examples there.
- Try to code up some simple examples on your own data.
- Check out the [Discord server](https://discord.gg/s7cFzpw3Mj).
- Skim through or read some of the associated papers (see the paper links on the `DSPy` repo [README](https://github.com/stanfordnlp/dspy?tab=readme-ov-file#dspy-programmingnot-promptingfoundation-models)).
- There are also some decent videos on YouTube. Simply search for `DSPy` LLM, etc.

# ENV Setup

```
python3 -m venv env
source env/bin/activate
pip install dspy-ai
pip install openai --upgrade
```

```python
import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
```
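
Hardcoding the key in a script makes it easy to commit it by accident. As an alternative, here is a minimal sketch that only prompts for the key when the environment variable is not already set. The helper name `ensure_api_key` and its parameters are my own and are not part of `DSPy` or the OpenAI client.

```python
import os
from getpass import getpass


def ensure_api_key(env_var="OPENAI_API_KEY", prompt_fn=getpass):
    """Return the key stored in env_var, prompting for it only if it is missing."""
    key = os.environ.get(env_var)
    if not key:
        # prompt_fn defaults to getpass so the key is not echoed to the terminal
        key = prompt_fn(f"Enter {env_var}: ")
        os.environ[env_var] = key
    return key


# Typical use at the top of a notebook:
# ensure_api_key()
```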

# BIG-Bench Hard Dataset - Penguins In a Table - Example

The [BIG-Bench Hard dataset](https://github.com/suzgunmirac/BIG-Bench-Hard) [@suzgun2022challenging] contains various tasks. Pass one of the following strings to `load_dataset` to load the records for that task.

```python
['tracking_shuffled_objects_seven_objects', 'salient_translation_error_detection', 'tracking_shuffled_objects_three_objects', 'geometric_shapes', 'object_counting', 'word_sorting', 'logical_deduction_five_objects', 'hyperbaton', 'sports_understanding', 'logical_deduction_seven_objects', 'multistep_arithmetic_two', 'ruin_names', 'causal_judgement', 'logical_deduction_three_objects', 'formal_fallacies', 'snarks', 'boolean_expressions', 'reasoning_about_colored_objects', 'dyck_languages', 'navigate', 'disambiguation_qa', 'temporal_sequences', 'web_of_lies', 'tracking_shuffled_objects_five_objects', 'penguins_in_a_table', 'movie_recommendation', 'date_understanding']
```

We will use the `penguins_in_a_table` task.

In [1]:
from datasets import load_dataset
import dspy

ds = load_dataset("maveriq/bigbenchhard", "penguins_in_a_table")["train"]
examples = [dspy.Example({"question": r["input"], "answer": r["target"]}).with_inputs("question") for r in ds]
print(f"There are {len(examples)} examples.")
trainset = examples[0:20]
valset = examples[20:]

There are 146 examples.
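
The split above simply takes the first 20 records for training. If the dataset happened to be ordered in some way, a seeded shuffle before slicing would avoid that bias. The sketch below is an optional variation and not what was used for the results in this post; `split_examples` is a hypothetical helper, and plain integers stand in for the `dspy.Example` objects.

```python
import random


def split_examples(examples, n_train=20, seed=0):
    """Shuffle a copy of the examples with a fixed seed, then slice into train/val."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Integers stand in for the 146 examples loaded above.
train, val = split_examples(range(146))
print(len(train), len(val))  # 20 126
```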


In [2]:
example = trainset[10]
for k, v in example.items():
    print(f"\n{k.upper()}:\n")
    print(v)


QUESTION:

Here is a table where the first line is a header and each subsequent line is a penguin:  name, age, height (cm), weight (kg) Louis, 7, 50, 11 Bernard, 5, 80, 13 Vincent, 9, 60, 11 Gwen, 8, 70, 15  For example: the age of Louis is 7, the weight of Gwen is 15 kg, the height of Bernard is 80 cm.  We then delete the penguin named Bernard from the table.
How many penguins are more than 8 years old?
Options:
(A) 1
(B) 2
(C) 3
(D) 4
(E) 5

ANSWER:

(A)


We will use the `DSPy` OpenAI connector to make calls to gpt-3.5. Note that `DSPy` caches API calls, so subsequent calls with the same input are read from the cache instead of calling the OpenAI API a second time.

In [3]:
llm = dspy.OpenAI(model="gpt-3.5-turbo-0125", max_tokens=250)
dspy.settings.configure(lm=llm)

We can test that the calls to OpenAI are working:

In [4]:
llm("Testing testing, is anyone out there?")

["Hello! I'm here to help. What can I assist you with today?"]

In [5]:
llm(example.question)

['There are 2 penguins who are more than 8 years old: Vincent (9 years old) and Gwen (8 years old). \n\nTherefore, the answer is (B) 2.']

At any point we can look at the last `n` calls to the LLM:

In [6]:
llm.inspect_history(n=2)





Testing testing, is anyone out there? Hello! I'm here to help. What can I assist you with today?







Here is a table where the first line is a header and each subsequent line is a penguin:  name, age, height (cm), weight (kg) Louis, 7, 50, 11 Bernard, 5, 80, 13 Vincent, 9, 60, 11 Gwen, 8, 70, 15  For example: the age of Louis is 7, the weight of Gwen is 15 kg, the height of Bernard is 80 cm.  We then delete the penguin named Bernard from the table.
How many penguins are more than 8 years old?
Options:
(A) 1
(B) 2
(C) 3
(D) 4
(E) 5 There are 2 penguins who are more than 8 years old: Vincent (9 years old) and Gwen (8 years old).

Therefore, the answer is (B) 2.


Our evaluation metric checks whether the LLM output contains the correct multiple-choice answer. To define an evaluation metric in `DSPy`, we create a function like the one below. The first two arguments are the labeled `dspy.Example` and the program's prediction for it. The metric function can contain any logic you need to evaluate your task. You can read more about the `trace` argument in the [documentation](https://dspy-docs.vercel.app/docs/building-blocks/metrics#simple-metrics). It needs to be in the signature, even if you are not explicitly using it.

In [7]:
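def eval_metric(true, prediction, trace=None):
    prediction_answer = prediction.answer
    # Take the character right after the first "(" and wrap it as "(X)".
    # If there is no "(", find() returns -1, so the first character is used instead.
    parsed_answer = f"({prediction_answer[prediction_answer.find('(') + 1]})"
    return parsed_answer == true.answer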
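
The metric above relies on `str.find`, which returns -1 when there is no `(` in the prediction, so it falls back to the first character, and it raises an `IndexError` on an empty string. A stricter variant could use a regular expression instead. This is a sketch of an alternative, not the metric used for the numbers in this post; `extract_choice` and `strict_eval_metric` are hypothetical names, and `SimpleNamespace` objects stand in for the `DSPy` example and prediction.

```python
import re
from types import SimpleNamespace

# Matches a multiple choice label such as "(A)" anywhere in the text.
CHOICE_PATTERN = re.compile(r"\(([A-E])\)")


def extract_choice(text):
    """Return the first "(X)" label found in text, or None if there is none."""
    match = CHOICE_PATTERN.search(text or "")
    return f"({match.group(1)})" if match else None


def strict_eval_metric(true, prediction, trace=None):
    """True only when the prediction contains an explicit, correct "(X)" label."""
    return extract_choice(prediction.answer) == true.answer


gold = SimpleNamespace(answer="(A)")
print(strict_eval_metric(gold, SimpleNamespace(answer="Answer: (A) 1")))  # True
print(strict_eval_metric(gold, SimpleNamespace(answer="")))  # False
```

Note that this version is less forgiving: a bare `A` without parentheses no longer counts as correct, so scores could differ from the ones reported here.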

We set up an evaluation pipeline:

In [8]:
from dspy.evaluate import Evaluate

evaluate = Evaluate(devset=valset, metric=eval_metric, num_threads=6, display_progress=True, display_table=5)
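
Conceptually, an evaluator like this runs the program on every example, applies the metric, and reports the percentage that passed. The loop below is a minimal sketch of that idea, leaving out the threading, progress bar, and results table that the `num_threads`, `display_progress`, and `display_table` arguments suggest. `evaluate_program` and the toy data are made up for illustration.

```python
def evaluate_program(program, devset, metric):
    """Return the percentage of examples for which metric(example, prediction) is truthy."""
    correct = 0
    for example in devset:
        prediction = program(example["question"])
        correct += bool(metric(example, prediction))
    return round(100.0 * correct / len(devset), 2)


# Toy data: plain dicts stand in for dspy.Example objects.
devset = [
    {"question": "1 + 1", "answer": "2"},
    {"question": "2 + 2", "answer": "4"},
    {"question": "3 + 3", "answer": "6"},
]
always_four = lambda question: "4"
exact_match = lambda example, prediction: example["answer"] == prediction

print(evaluate_program(always_four, devset, exact_match))  # 33.33
```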

Here is a simple module in `DSPy` for basic question answering.

In [9]:
class BasicQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.prog = dspy.Predict("question -> answer")

    def forward(self, question):
        return self.prog(question=question)


basic_qa = BasicQA()

Calling the module invokes `__call__`, which in turn calls the `forward` method, similar to how things work in PyTorch.
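
The relationship between `__call__` and `forward` can be sketched with a toy base class: calling an instance goes through `__call__`, which delegates to whatever `forward` the subclass defines. `TinyModule` and `Echo` are made-up classes for illustration and skip everything else a real `dspy.Module` does.

```python
class TinyModule:
    """Toy base class: calling an instance delegates to forward()."""

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)

    def forward(self, *args, **kwargs):
        raise NotImplementedError("subclasses define forward()")


class Echo(TinyModule):
    def forward(self, question):
        return f"echo: {question}"


print(Echo()(question="hi"))  # echo: hi
```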

In [10]:
pred = basic_qa(question=example.question)
print("\nQUESTION:\n")
print(example.question)
print("\nANSWER:\n")
print(example.answer)
print("\nPREDICTION:\n")
print(pred.answer)


QUESTION:

Here is a table where the first line is a header and each subsequent line is a penguin:  name, age, height (cm), weight (kg) Louis, 7, 50, 11 Bernard, 5, 80, 13 Vincent, 9, 60, 11 Gwen, 8, 70, 15  For example: the age of Louis is 7, the weight of Gwen is 15 kg, the height of Bernard is 80 cm.  We then delete the penguin named Bernard from the table.
How many penguins are more than 8 years old?
Options:
(A) 1
(B) 2
(C) 3
(D) 4
(E) 5

ANSWER:

(A)

PREDICTION:

(B) 2


In [16]:
eval_metric(example, pred)

False

In [17]:
llm.inspect_history(n=1)





Given the fields `question`, produce the fields `answer`.

---

Follow the following format.

Question: ${question}
Answer: ${answer}

---

Question: Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis, 7, 50, 11 Bernard, 5, 80, 13 Vincent, 9, 60, 11 Gwen, 8, 70, 15 For example: the age of Louis is 7, the weight of Gwen is 15 kg, the height of Bernard is 80 cm. We now add a penguin to the table: James, 12, 90, 12 We then delete the penguin named Bernard from the table. How many penguins are less than 8 years old? Options: (A) 1 (B) 2 (C) 3 (D) 4 (E) 5
Answer: (A)

---

Question: Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis, 7, 50, 11 Bernard, 5, 80, 13 Vincent, 9, 60, 11 Gwen, 8, 70, 15 For example: the age of Louis is 7, the weight of Gwen is 15 kg, the height of Bernard is 80 cm. We now add a penguin to the table: James

Now we can pass each question in the validation set through the LLM and check if we get the correct answer:

In [19]:
# | warning: false
from tqdm.notebook import tqdm
tqdm._instances.clear()
evaluate(basic_qa)

Average Metric: 46 / 126  (36.5): 100%|██████████| 126/126 [00:00<00:00, 3893.46it/s]

Average Metric: 46 / 126  (36.5%)





| | question | example_answer | pred_answer | eval_metric |
|---|---|---|---|---|
| 0 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (A) | 3 | False |
| 1 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (D) | (C) 50 | False |
| 2 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (A) | Answer: (C) 3 | False |
| 3 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (A) | Answer: (B) 2 | False |
| 4 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (B) | (B) 5 | ✔️ [True] |


36.51

`DSPy` uses optimizers to optimize modules. In this example, optimization is a process that chooses which demos/examples are best to put into the prompt in order to increase the evaluation metric. At the time of writing, the optimizers are called teleprompters (prompting from a distance), though I think they will be renamed to [optimizers](https://dspy-docs.vercel.app/docs/building-blocks/optimizers) in a future refactoring.
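
As a rough mental model of choosing demos by random search, the sketch below samples a few candidate demo sets from the training data, scores each on the validation data, and keeps the best one. It is a simplification under my own assumptions and not the `DSPy` implementation; in particular it skips the bootstrapping step where the program generates its own demos. `random_search_demos` and `toy_score` are hypothetical names.

```python
import random


def random_search_demos(trainset, valset, score_fn, num_demos=4, num_candidates=5, seed=0):
    """Try several random demo subsets and return the best (demos, score) pair."""
    rng = random.Random(seed)
    best_demos, best_score = [], score_fn([], valset)  # zero-shot baseline
    for _ in range(num_candidates):
        demos = rng.sample(trainset, k=min(num_demos, len(trainset)))
        score = score_fn(demos, valset)
        if score > best_score:
            best_demos, best_score = demos, score
    return best_demos, best_score


def toy_score(demos, valset):
    """Toy stand-in for "run the program and apply the metric": a validation
    item counts as solved when its label appears among the demo labels."""
    seen = {label for _, label in demos}
    return sum(label in seen for _, label in valset) / len(valset)


trainset = [(f"q{i}", "ABCDE"[i % 5]) for i in range(20)]
valset = [(f"v{i}", "ABCDE"[i % 5]) for i in range(10)]
demos, score = random_search_demos(trainset, valset, toy_score)
print(len(demos), score)
```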

In [20]:
# | output: false
tqdm._instances.clear()

from dspy.teleprompt import BootstrapFewShotWithRandomSearch

config = dict(max_bootstrapped_demos=2, max_labeled_demos=4, num_candidate_programs=2, num_threads=6)

teleprompter = BootstrapFewShotWithRandomSearch(metric=eval_metric, **config)
optimized_qa = teleprompter.compile(basic_qa, trainset=trainset, valset=valset)

Going to sample between 1 and 2 traces per predictor.
Will attempt to train 2 candidate sets.


Average Metric: 46 / 126  (36.5): 100%|██████████| 126/126 [00:00<00:00, 4592.34it/s]


Average Metric: 46 / 126  (36.5%)
Score: 36.51 for set: [0]
New best score: 36.51 for seed -3
Scores so far: [36.51]
Best score: 36.51


Average Metric: 47 / 126  (37.3): 100%|██████████| 126/126 [00:00<00:00, 4659.89it/s]


Average Metric: 47 / 126  (37.3%)
Score: 37.3 for set: [4]
New best score: 37.3 for seed -2
Scores so far: [36.51, 37.3]
Best score: 37.3


 50%|█████     | 10/20 [00:00<00:00, 5966.29it/s]


Bootstrapped 2 full traces after 11 examples in round 0.


Average Metric: 48 / 126  (38.1): 100%|██████████| 126/126 [00:09<00:00, 12.95it/s]


Average Metric: 48 / 126  (38.1%)
Score: 38.1 for set: [4]
New best score: 38.1 for seed -1
Scores so far: [36.51, 37.3, 38.1]
Best score: 38.1
Average of max per entry across top 1 scores: 0.38095238095238093
Average of max per entry across top 2 scores: 0.5476190476190477
Average of max per entry across top 3 scores: 0.6825396825396826
Average of max per entry across top 5 scores: 0.6825396825396826
Average of max per entry across top 8 scores: 0.6825396825396826
Average of max per entry across top 9999 scores: 0.6825396825396826


 20%|██        | 4/20 [00:00<00:00, 1012.87it/s]


Bootstrapped 2 full traces after 5 examples in round 0.


Average Metric: 50 / 126  (39.7): 100%|██████████| 126/126 [00:09<00:00, 13.27it/s]


Average Metric: 50 / 126  (39.7%)
Score: 39.68 for set: [4]
New best score: 39.68 for seed 0
Scores so far: [36.51, 37.3, 38.1, 39.68]
Best score: 39.68
Average of max per entry across top 1 scores: 0.3968253968253968
Average of max per entry across top 2 scores: 0.5079365079365079
Average of max per entry across top 3 scores: 0.6031746031746031
Average of max per entry across top 5 scores: 0.7222222222222222
Average of max per entry across top 8 scores: 0.7222222222222222
Average of max per entry across top 9999 scores: 0.7222222222222222


  5%|▌         | 1/20 [00:00<00:00, 504.30it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 54 / 126  (42.9): 100%|██████████| 126/126 [00:00<00:00, 1320.48it/s]

Average Metric: 54 / 126  (42.9%)
Score: 42.86 for set: [4]
New best score: 42.86 for seed 1
Scores so far: [36.51, 37.3, 38.1, 39.68, 42.86]
Best score: 42.86
Average of max per entry across top 1 scores: 0.42857142857142855
Average of max per entry across top 2 scores: 0.5396825396825397
Average of max per entry across top 3 scores: 0.6190476190476191
Average of max per entry across top 5 scores: 0.7777777777777778
Average of max per entry across top 8 scores: 0.7777777777777778
Average of max per entry across top 9999 scores: 0.7777777777777778
5 candidate programs found.
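
The "Average of max per entry across top k scores" lines in the log are not explained there. My reading, which is an assumption, is that for the k best candidate programs the optimizer takes, for each validation example, the best score any of them achieved, and then averages those. That would make it an upper bound on what a perfect ensemble of those programs could reach, and it would explain why the value stops changing once k exceeds the number of candidates. A small sketch of that calculation, with a made-up function name and toy scores:

```python
def avg_of_max_per_entry(per_program_scores, k):
    """Average, over examples, of the best score among the top-k programs.

    per_program_scores holds one list of per-example scores per program.
    """
    ranked = sorted(per_program_scores, key=lambda s: sum(s) / len(s), reverse=True)[:k]
    # zip(*ranked) groups the scores that different programs got on the same example
    return sum(max(entry) for entry in zip(*ranked)) / len(ranked[0])


scores = [
    [1, 0, 0, 1],  # program A: 50% correct
    [0, 1, 0, 1],  # program B: 50% correct
    [0, 0, 0, 1],  # program C: 25% correct
]
print(avg_of_max_per_entry(scores, 1))  # 0.5
print(avg_of_max_per_entry(scores, 2))  # 0.75
print(avg_of_max_per_entry(scores, 3))  # 0.75
```

This reading is consistent with the log above, where the top 1 value always equals the best score so far.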





In [23]:
# | warning: false
evaluate(optimized_qa)

Average Metric: 54 / 126  (42.9): 100%|██████████| 126/126 [00:00<00:00, 3928.89it/s]

Average Metric: 54 / 126  (42.9%)





| | question | example_answer | pred_answer | eval_metric |
|---|---|---|---|---|
| 0 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (A) | (C) | False |
| 1 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (D) | (C) 50 | False |
| 2 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (A) | (B) | False |
| 3 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (A) | (B) | False |
| 4 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (B) | (B) | ✔️ [True] |


42.86

In [24]:
llm.inspect_history()





Given the fields `question`, produce the fields `answer`.

---

Follow the following format.

Question: ${question}
Answer: ${answer}

---

Question: Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis, 7, 50, 11 Bernard, 5, 80, 13 Vincent, 9, 60, 11 Gwen, 8, 70, 15 For example: the age of Louis is 7, the weight of Gwen is 15 kg, the height of Bernard is 80 cm. And here is a similar table, but listing giraffes: name, age, height (cm), weight (kg) Jody, 5, 430, 620 Gladys, 10, 420, 590 Marian, 2, 310, 410 Donna, 9, 440, 650 How many giraffes are more than 5 years old? Options: (A) 1 (B) 2 (C) 3 (D) 4 (E) 5
Answer: (B)

---

Question: Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis, 7, 50, 11 Bernard, 5, 80, 13 Vincent, 9, 60, 11 Gwen, 8, 70, 15 For example: the age of Louis is 7, the weight of Gwen is 15 kg, the height of Ber

Now we can try a [Chain of Thought](https://arxiv.org/abs/2201.11903) [@wei2023chainofthought] prompt.

In [25]:
class CoT(dspy.Module):
    def __init__(self):
        super().__init__()
        self.prog = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        return self.prog(question=question)


cot_qa = CoT()

In [26]:
# | warning: false
evaluate(cot_qa)

Average Metric: 91 / 126  (72.2): 100%|██████████| 126/126 [00:00<00:00, 1324.21it/s]

Average Metric: 91 / 126  (72.2%)





| | question | example_answer | rationale | pred_answer | eval_metric |
|---|---|---|---|---|---|
| 0 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (A) | produce the answer. We first identify the penguins who are less than 8 years old. From the table, we see that Louis is 7 years... | (B) 2 | False |
| 1 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (D) | produce the answer. We need to add up the weights of all the penguins in the table. Louis weighs 11 kg, Bernard weighs 13 kg,... | (D) 62 | ✔️ [True] |
| 2 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (A) | produce the answer. We need to go through each penguin's age and count how many are more than 8 years old. | (C) 3 | False |
| 3 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (A) | produce the answer. We need to identify the penguins who are both more than 5 years old and weigh more than 12 kg. Looking at... | (C) 3 | False |
| 4 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (B) | produce the answer. We can see from the table that Bernard's age is 5. | (B) 5 | ✔️ [True] |


72.22

In [27]:
llm.inspect_history()





Given the fields `question`, produce the fields `answer`.

---

Follow the following format.

Question: ${question}
Reasoning: Let's think step by step in order to ${produce the answer}. We ...
Answer: ${answer}

---

Question: Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis, 7, 50, 11 Bernard, 5, 80, 13 Vincent, 9, 60, 11 Gwen, 8, 70, 15 For example: the age of Louis is 7, the weight of Gwen is 15 kg, the height of Bernard is 80 cm. We now add a penguin to the table: James, 12, 90, 12 What is the name of the last penguin sorted by alphabetic order? Options: (A) Louis (B) Bernard (C) Vincent (D) Gwen (E) James
Reasoning: Let's think step by step in order to produce the answer. We need to first add the new penguin James to the table and then sort the penguins by alphabetical order based on their names. The last penguin in the sorted list will be the answer.
Answer: (E) James
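
The history above shows how the `question -> answer` signature ends up as a prompt with an extra reasoning line. To make the layout explicit, here is a small string-building sketch that imitates it. It is only my imitation of the printed prompt, not the template code `DSPy` uses, and `build_cot_prompt` is a hypothetical helper.

```python
def build_cot_prompt(question, demos=()):
    """Assemble a chain of thought prompt that imitates the layout printed above."""
    prefix = "Reasoning: Let's think step by step in order to"
    parts = [
        "Given the fields `question`, produce the fields `answer`.",
        "Follow the following format.\n\n"
        "Question: ${question}\n"
        f"{prefix} ${{produce the answer}}. We ...\n"
        "Answer: ${answer}",
    ]
    # Labeled demos are added as plain question/answer pairs.
    for demo in demos:
        parts.append(f"Question: {demo['question']}\nAnswer: {demo['answer']}")
    # The prompt stops right after the reasoning prefix so the model continues it.
    parts.append(f"Question: {question}\n{prefix}")
    return "\n\n---\n\n".join(parts)


prompt = build_cot_prompt("How many penguins are more than 8 years old?")
print(prompt)
```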



Now we will try to optimize our chain of thought program.

In [28]:
# | output: false
tqdm._instances.clear()
config = dict(max_bootstrapped_demos=1, max_labeled_demos=4, num_candidate_programs=4, num_threads=6)
teleprompter = BootstrapFewShotWithRandomSearch(metric=eval_metric, **config)
optimized_cot_qa = teleprompter.compile(cot_qa, trainset=trainset, valset=valset)

Going to sample between 1 and 1 traces per predictor.
Will attempt to train 4 candidate sets.


Average Metric: 91 / 126  (72.2): 100%|██████████| 126/126 [00:00<00:00, 4615.08it/s]


Average Metric: 91 / 126  (72.2%)
Score: 72.22 for set: [0]
New best score: 72.22 for seed -3
Scores so far: [72.22]
Best score: 72.22


Average Metric: 90 / 126  (71.4): 100%|██████████| 126/126 [00:00<00:00, 1166.23it/s]


Average Metric: 90 / 126  (71.4%)
Score: 71.43 for set: [4]
Scores so far: [72.22, 71.43]
Best score: 72.22


  5%|▌         | 1/20 [00:00<00:00, 651.09it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 74 / 126  (58.7): 100%|██████████| 126/126 [00:00<00:00, 1205.81it/s]


Average Metric: 74 / 126  (58.7%)
Score: 58.73 for set: [4]
Scores so far: [72.22, 71.43, 58.73]
Best score: 72.22
Average of max per entry across top 1 scores: 0.7222222222222222
Average of max per entry across top 2 scores: 0.873015873015873
Average of max per entry across top 3 scores: 0.9126984126984127
Average of max per entry across top 5 scores: 0.9126984126984127
Average of max per entry across top 8 scores: 0.9126984126984127
Average of max per entry across top 9999 scores: 0.9126984126984127


  5%|▌         | 1/20 [00:00<00:00, 684.45it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 102 / 126  (81.0): 100%|██████████| 126/126 [00:00<00:00, 1222.16it/s]


Average Metric: 102 / 126  (81.0%)
Score: 80.95 for set: [4]
New best score: 80.95 for seed 0
Scores so far: [72.22, 71.43, 58.73, 80.95]
Best score: 80.95
Average of max per entry across top 1 scores: 0.8095238095238095
Average of max per entry across top 2 scores: 0.9126984126984127
Average of max per entry across top 3 scores: 0.9444444444444444
Average of max per entry across top 5 scores: 0.9682539682539683
Average of max per entry across top 8 scores: 0.9682539682539683
Average of max per entry across top 9999 scores: 0.9682539682539683


  5%|▌         | 1/20 [00:00<00:00, 589.25it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 94 / 126  (74.6): 100%|██████████| 126/126 [00:00<00:00, 1164.05it/s]


Average Metric: 94 / 126  (74.6%)
Score: 74.6 for set: [4]
Scores so far: [72.22, 71.43, 58.73, 80.95, 74.6]
Best score: 80.95
Average of max per entry across top 1 scores: 0.8095238095238095
Average of max per entry across top 2 scores: 0.9126984126984127
Average of max per entry across top 3 scores: 0.9444444444444444
Average of max per entry across top 5 scores: 0.9841269841269841
Average of max per entry across top 8 scores: 0.9841269841269841
Average of max per entry across top 9999 scores: 0.9841269841269841


  5%|▌         | 1/20 [00:00<00:00, 513.13it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 95 / 126  (75.4): 100%|██████████| 126/126 [00:00<00:00, 928.26it/s]


Average Metric: 95 / 126  (75.4%)
Score: 75.4 for set: [4]
Scores so far: [72.22, 71.43, 58.73, 80.95, 74.6, 75.4]
Best score: 80.95
Average of max per entry across top 1 scores: 0.8095238095238095
Average of max per entry across top 2 scores: 0.9206349206349206
Average of max per entry across top 3 scores: 0.9444444444444444
Average of max per entry across top 5 scores: 0.9841269841269841
Average of max per entry across top 8 scores: 0.9841269841269841
Average of max per entry across top 9999 scores: 0.9841269841269841


  5%|▌         | 1/20 [00:00<00:00, 654.85it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 96 / 126  (76.2): 100%|██████████| 126/126 [00:00<00:00, 1184.73it/s]

Average Metric: 96 / 126  (76.2%)
Score: 76.19 for set: [4]
Scores so far: [72.22, 71.43, 58.73, 80.95, 74.6, 75.4, 76.19]
Best score: 80.95
Average of max per entry across top 1 scores: 0.8095238095238095
Average of max per entry across top 2 scores: 0.9206349206349206
Average of max per entry across top 3 scores: 0.9603174603174603
Average of max per entry across top 5 scores: 0.9682539682539683
Average of max per entry across top 8 scores: 0.9920634920634921
Average of max per entry across top 9999 scores: 0.9920634920634921
7 candidate programs found.





In [29]:
# | warning: false
evaluate(optimized_cot_qa)

Average Metric: 102 / 126  (81.0): 100%|██████████| 126/126 [00:00<00:00, 4143.11it/s]

Average Metric: 102 / 126  (81.0%)





| | question | example_answer | rationale | pred_answer | eval_metric |
|---|---|---|---|---|---|
| 0 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (A) | produce the answer. After deleting Bernard, the penguins left are Louis, Vincent, and Gwen. Among them, Louis and Gwen are less than 8 years old. | (B) 2 | False |
| 1 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (D) | produce the answer. We sum up the weights of all the penguins: 11 + 13 + 11 + 15 + 12 = 62. | (D) 62 | ✔️ [True] |
| 2 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (A) | produce the answer. We know that after deleting Bernard, the penguins left are Louis, Vincent, and Gwen. Among them, only Vincent is more than 8... | (A) 1 | ✔️ [True] |
| 3 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (A) | produce the answer. We have Louis, Vincent, Gwen, and James in the table. Among them, only James is more than 5 years old and weighs... | (A) 1 | ✔️ [True] |
| 4 | Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis,... | (B) | produce the answer. We know that the age of Bernard is 5 years old. | (B) 5 | ✔️ [True] |


80.95

In [30]:
llm.inspect_history(n=1)





Given the fields `question`, produce the fields `answer`.

---

Follow the following format.

Question: ${question}
Reasoning: Let's think step by step in order to ${produce the answer}. We ...
Answer: ${answer}

---

Question: Here is a table where the first line is a header and each subsequent line is a penguin: name, age, height (cm), weight (kg) Louis, 7, 50, 11 Bernard, 5, 80, 13 Vincent, 9, 60, 11 Gwen, 8, 70, 15 For example: the age of Louis is 7, the weight of Gwen is 15 kg, the height of Bernard is 80 cm. We then delete the penguin named Bernard from the table. How many penguins are more than 8 years old? Options: (A) 1 (B) 2 (C) 3 (D) 4 (E) 5
Reasoning: Let's think step by step in order to produce the answer. We know that after deleting Bernard, the penguins left are Louis, Vincent, and Gwen. Among them, only Vincent is more than 8 years old.
Answer: (A) 1

---

Question: Here is a table where the first line is a header and each subsequent line is a penguin: name, age, he

# Language Detection Example

In [31]:
import pandas as pd
llm = dspy.OpenAI(model="gpt-3.5-turbo-1106", max_tokens=250)
dspy.settings.configure(lm=llm)

df = pd.read_csv("lang_df.csv")
train_set = [dspy.Example(text=row.text, language=row.language).with_inputs("text") for _, row in df[:100].iterrows()]
val_set = [dspy.Example(text=row.text, language=row.language).with_inputs("text") for _, row in df[100:200].iterrows()]

In [32]:
example = train_set[10]
for k, v in example.items():
    print(f"\n{k.upper()}:\n")
    print(v)


TEXT:

Repost from organichomebg Лесно, бързо и мнооого вкусно! Петък вечер с аромат на пица! Приготвихме я с готов био блат без глутен и комбинирахме с доматен сос по провансалски, веган сирене delishu, ароматен риган, маслини, пресен босилек и чери домати. Запечена до хрупкавост и поднесена с кетчуп и майонеза без яйца. . . . . . #ilovedelishu #delishu #rawvegancheese #nutcheese #cashewcheese #vegancheese #plantcheese #rawcheese #probiotic #probiotics #slowfood #soulfood #delicious #vegan #veganpower #vegansecrets #whatveganseat #bestofvegan #veganfoodlovers #veganfood #veganeats #veganrecipes #veganfoodie #veganfoodshare #vegansofig #veganfoodporn #veganfood #veganisme #vegansofinstagram #veganonabudget

LANGUAGE:

bg


In [33]:
class Text2Language(dspy.Signature):
    """Detect the language of the text and only return the 2 letter iso639 code."""

    text = dspy.InputField()
    language = dspy.OutputField(desc="The 2 letter iso639 code of the detected language.")


class Classifier(dspy.Module):
    def __init__(self):
        super().__init__()
        self.prog = dspy.ChainOfThought(Text2Language)

    def forward(self, text):
        return self.prog(text=text)


classifier = Classifier()

In [34]:
classifier(text=example.text)

Prediction(
    rationale='produce the language. We see that the text contains Cyrillic characters, which are commonly used in Bulgarian.',
    language='bg'
)

In [35]:
llm.inspect_history(n=1)





Detect the language of the text and only return the 2 letter iso639 code.

---

Follow the following format.

Text: ${text}
Reasoning: Let's think step by step in order to ${produce the language}. We ...
Language: The 2 letter iso639 code of the detected language.

---

Text: Repost from organichomebg Лесно, бързо и мнооого вкусно! Петък вечер с аромат на пица! Приготвихме я с готов био блат без глутен и комбинирахме с доматен сос по провансалски, веган сирене delishu, ароматен риган, маслини, пресен босилек и чери домати. Запечена до хрупкавост и поднесена с кетчуп и майонеза без яйца. . . . . . #ilovedelishu #delishu #rawvegancheese #nutcheese #cashewcheese #vegancheese #plantcheese #rawcheese #probiotic #probiotics #slowfood #soulfood #delicious #vegan #veganpower #vegansecrets #whatveganseat #bestofvegan #veganfoodlovers #veganfood #veganeats #veganrecipes #veganfoodie #veganfoodshare #vegansofig #veganfoodporn #veganfood #veganisme #vegansofinstagram #veganonabudget
Reasoning:

In [36]:
def eval_metric(true, pred, trace=None):
    return true.language == pred.language

evaluate = Evaluate(devset=val_set, metric=eval_metric, num_threads=6, display_progress=True, display_table=5)
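
The metric above is an exact string match, so a prediction like `BG` or `zh-TW` would count as wrong even when the language is right. If that turns out to matter, the codes could be normalized before comparing. The sketch below is one possible variant and was not used for the scores in this post; `normalize_code` and `lenient_eval_metric` are hypothetical names, and `SimpleNamespace` stands in for the `DSPy` objects.

```python
import re
from types import SimpleNamespace


def normalize_code(raw):
    """Lowercase, trim whitespace and stray quotes, and drop any region suffix."""
    cleaned = (raw or "").strip().strip("\"'.").lower()
    return re.split(r"[-_]", cleaned)[0]


def lenient_eval_metric(true, pred, trace=None):
    """Compare language codes after normalization instead of as raw strings."""
    return normalize_code(true.language) == normalize_code(pred.language)


gold = SimpleNamespace(language="zh")
print(lenient_eval_metric(gold, SimpleNamespace(language="zh-TW")))  # True
print(lenient_eval_metric(gold, SimpleNamespace(language="en")))  # False
```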

In [37]:
evaluate(classifier)

Average Metric: 79 / 100  (79.0): 100%|██████████| 100/100 [00:00<00:00, 1027.28it/s]

Average Metric: 79 / 100  (79.0%)





| | text | example_language | rationale | pred_language | eval_metric |
|---|---|---|---|---|---|
| 0 | A rainbow appeared in the sky.. #rainbow #rainbowmagic #shadesinthesky #colors #soulfood #bluewhite #clouds #landscapephotography #nature #naturemagic #naturephotography #naturelovers #lovephotographing #haveaphotographyday #myhelsinki #visitfinland #thisisfinland #beautifulfinland #finland4seasons... | fi | detect the language. We will analyze the text and look for common words and patterns to determine the language. | en | False |
| 1 | Αθήνα - Πόρος σε χρόνο ρεκόρ με τον καλύτερο οδηγό. #roadtrip #ontheroad #greece #poros #greeceislands #greece_travel #traveler #audi #audiq3 #visitgreece #argosaronikos #lovestory #lovecars #carsofinstagram #travelcar... | el | detect the language of the text. We will analyze the characters and patterns to determine the language. | el | ✔️ [True] |
| 2 | Természetes hatású szálas szemöldöktetoválás bájos vendégemnek! Microblading technikával készült! Bejelentkezés:Mariann 06-30-332-4011 Árak ,infók: u #smink #tartossmink #sminktetovalas #természetessminktetoválás #szálasszemöldöktetoválás #microblading #microbladingeyebrows #szemöldök #szemoldoktetovalas #sminktetováló #pmu... | hu | produce the language. We see that the text contains a lot of Hungarian words and hashtags, so it is likely written in Hungarian. | hu | ✔️ [True] |
| 3 | #bnw #blackandwhitephotography #turkiye #selcuk #historicturkey #blackandwhite #streetsofselcuk #attaturk #ageancoast #architecture #cntraveler #palmtrees #turkey | tr | detect the language. We can see that the hashtags include "turkiye" and "attaturk" which are related to Turkey, so the language is likely Turkish. | tr | ✔️ [True] |
| 4 | Chapter-495, అత్తvsకోడలు, Punchకి Reverse Punch, #justforfun #telugucomedy #ytshorts #instareels #reels | te | produce the language. We see that the text contains characters from the Telugu script, which is primarily used for the Telugu language. | te | ✔️ [True] |


79.0

In [38]:
llm.inspect_history(n=1)





Detect the language of the text and only return the 2 letter iso639 code.

---

Follow the following format.

Text: ${text}
Reasoning: Let's think step by step in order to ${produce the language}. We ...
Language: The 2 letter iso639 code of the detected language.

---

Text: 農曆新年week 兔氣揚眉 虎盡甘來
Reasoning: Let's think step by step in order to detect the language. The characters used in the text are commonly used in Chinese languages, so it is likely to be Chinese.
Language: zh


In [39]:
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

config = dict(max_bootstrapped_demos=1, max_labeled_demos=4, num_candidate_programs=1, num_threads=6)
teleprompter = BootstrapFewShotWithRandomSearch(metric=eval_metric, **config)
optimized_classifier = teleprompter.compile(classifier, trainset=train_set, valset=val_set)

Going to sample between 1 and 1 traces per predictor.
Will attempt to train 1 candidate sets.


Average Metric: 79 / 100  (79.0): 100%|██████████| 100/100 [00:00<00:00, 3793.07it/s]


Average Metric: 79 / 100  (79.0%)
Score: 79.0 for set: [0]
New best score: 79.0 for seed -3
Scores so far: [79.0]
Best score: 79.0


Average Metric: 89 / 100  (89.0): 100%|██████████| 100/100 [00:00<00:00, 688.51it/s]


Average Metric: 89 / 100  (89.0%)
Score: 89.0 for set: [4]
New best score: 89.0 for seed -2
Scores so far: [79.0, 89.0]
Best score: 89.0


  1%|          | 1/100 [00:00<00:00, 303.30it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 87 / 100  (87.0): 100%|██████████| 100/100 [00:00<00:00, 695.76it/s]


Average Metric: 87 / 100  (87.0%)
Score: 87.0 for set: [4]
Scores so far: [79.0, 89.0, 87.0]
Best score: 89.0
Average of max per entry across top 1 scores: 0.89
Average of max per entry across top 2 scores: 0.95
Average of max per entry across top 3 scores: 0.95
Average of max per entry across top 5 scores: 0.95
Average of max per entry across top 8 scores: 0.95
Average of max per entry across top 9999 scores: 0.95


  1%|          | 1/100 [00:00<00:00, 756.96it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 94 / 100  (94.0): 100%|██████████| 100/100 [00:00<00:00, 625.86it/s]

Average Metric: 94 / 100  (94.0%)
Score: 94.0 for set: [4]
New best score: 94.0 for seed 0
Scores so far: [79.0, 89.0, 87.0, 94.0]
Best score: 94.0
Average of max per entry across top 1 scores: 0.94
Average of max per entry across top 2 scores: 0.96
Average of max per entry across top 3 scores: 0.96
Average of max per entry across top 5 scores: 0.96
Average of max per entry across top 8 scores: 0.96
Average of max per entry across top 9999 scores: 0.96
4 candidate programs found.
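The "Average of max per entry across top k scores" lines in the log above are easy to misread. As far as I can tell from the optimizer's source, it takes the k best candidate programs, keeps the best score any of them got on each validation example, and averages those maxima. So it reads more like an upper bound for an ensemble of the top k than the score of any single program. Here is a minimal sketch of that calculation; the helper name and the per-example scores are made up for illustration and are not from the run above.

```python
def average_of_max_per_entry(subscores_per_program, k):
    """Average, over examples, of the best score among the top-k programs.

    `subscores_per_program` is assumed to be sorted best program first,
    with one score per validation example in each inner list.
    """
    top_k = subscores_per_program[:k]
    # zip(*...) regroups the scores so each tuple holds one example's
    # score under every selected program.
    best_per_example = [max(scores) for scores in zip(*top_k)]
    return sum(best_per_example) / len(best_per_example)


# Hypothetical scores for three programs on four examples.
subscores = [
    [1, 1, 0, 1],  # best program: 3/4 correct
    [1, 0, 1, 1],  # second best: 3/4 correct, but misses a different example
    [0, 1, 0, 1],  # third: 2/4 correct
]

print(average_of_max_per_entry(subscores, k=1))  # 0.75
print(average_of_max_per_entry(subscores, k=2))  # 1.0
```

This is why the top 2 number in the log (0.95) can be higher than the best single score (0.89): the two programs fail on different examples.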





In [40]:
llm.inspect_history(n=1)





Detect the language of the text and only return the 2 letter iso639 code.

---

Follow the following format.

Text: ${text}
Reasoning: Let's think step by step in order to ${produce the language}. We ...
Language: The 2 letter iso639 code of the detected language.

---

Text: 必嫵UIU假體隆乳，成就完美曲線 選擇必嫵： 安全 舒適 自然 疼痛感少 在術前醫生會對乳房的凸度、高度、寬度、挺度、聚攏度設計，根據不同求美者的基礎情況調整各項數據！ WhatsApp：+86 17620756861 LINE： #plasticsurgery #beauty #eyelidsurgery #rhinoplasty #DoubleJaw #Breastsurgery #viewplasticsurgery #korea #seoul #shenzhen #韓國整形 #韓國整容 #整形醫院 #韓國醫美 #顴骨縮小 #輪廓手術 #無痛隆胸 #必嫵隆胸 #假體隆胸 #隆胸手術 #隆乳手術
Reasoning: Let's think step by step in order to produce the language. We see that the text contains Chinese characters and mentions Shenzhen and Korea, which are locations in China and South Korea.
Language: zh

---

Text: כל המותגים במקום אחד במחירים הכי זולים שיש הרצל 21 לוד להזמנות 0547557707 שעות פתיחה 9:00 עד 21:00 #diesel #boss #lacoste #desquared2 #paulandshark #wjeans #robertovino #gant #recardo #hugo #a

In [41]:
evaluate(optimized_classifier)

Average Metric: 94 / 100  (94.0): 100%|██████████| 100/100 [00:00<00:00, 4019.42it/s]

Average Metric: 94 / 100  (94.0%)





Unnamed: 0,text,example_language,rationale,pred_language,eval_metric
0,A rainbow appeared in the sky.. #rainbow #rainbowmagic #shadesinthesky #colors #soulfood #bluewhite #clouds #landscapephotography #nature #naturemagic #naturephotography #naturelovers #lovephotographing #haveaphotographyday #myhelsinki #visitfinland #thisisfinland #beautifulfinland #finland4seasons...,fi,"produce the language. We see that the text contains English hashtags and mentions Helsinki, Finland, and the Finnish language.",en,False
1,Αθήνα - Πόρος σε χρόνο ρεκόρ με τον καλύτερο οδηγό. #roadtrip #ontheroad #greece #poros #greeceislands #greece_travel #traveler #audi #audiq3 #visitgreece #argosaronikos #lovestory #lovecars #carsofinstagram #travelcar...,el,"produce the language. We see that the text contains Greek characters and mentions Athens, Poros, and Greece, which are locations in Greece.",el,✔️ [True]
2,"Természetes hatású szálas szemöldöktetoválás bájos vendégemnek! Microblading technikával készült! Bejelentkezés:Mariann 06-30-332-4011 Árak ,infók: u #smink #tartossmink #sminktetovalas #természetessminktetoválás #szálasszemöldöktetoválás #microblading #microbladingeyebrows #szemöldök #szemoldoktetovalas #sminktetováló #pmu...",hu,produce the language. We see that the text contains Hungarian words and mentions a phone number with a Hungarian format.,hu,✔️ [True]
3,#bnw #blackandwhitephotography #turkiye #selcuk #historicturkey #blackandwhite #streetsofselcuk #attaturk #ageancoast #architecture #cntraveler #palmtrees #turkey,tr,produce the language. We see that the text contains Turkish words and mentions Turkey.,tr,✔️ [True]
4,"Chapter-495, అత్తvsకోడలు, Punchకి Reverse Punch, #justforfun #telugucomedy #ytshorts #instareels #reels",te,"produce the language. We see that the text contains Telugu characters and mentions a chapter number, which indicates that the language is Telugu.",te,✔️ [True]


94.0

# Signatures

In [None]:
sentence = "I love that shirt!"
classify = dspy.Predict("sentence -> sentiment")
classify(sentence=sentence)

In [None]:
llm.inspect_history()
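The string `"sentence -> sentiment"` is the shorthand form of a signature: field names to the left of the arrow are inputs and names to the right are outputs. The sketch below only illustrates that convention. It is not how `DSPy` parses signatures internally, and `parse_signature` is a hypothetical helper.

```python
def parse_signature(signature: str):
    """Split 'a, b -> c' into (['a', 'b'], ['c'])."""
    if signature.count("->") != 1:
        raise ValueError(f"expected exactly one '->' in {signature!r}")
    left, right = signature.split("->")
    # Field names are comma separated; surrounding whitespace is ignored.
    inputs = [name.strip() for name in left.split(",") if name.strip()]
    outputs = [name.strip() for name in right.split(",") if name.strip()]
    return inputs, outputs


print(parse_signature("sentence -> sentiment"))
# (['sentence'], ['sentiment'])
print(parse_signature("context, question -> answer"))
# (['context', 'question'], ['answer'])
```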

In [None]:
# Example from the XSum dataset.
document = """The 21-year-old made seven appearances for the Hammers and netted his only goal for them in a Europa League qualification round match against Andorran side FC Lustrains last season. Lee had two loan spells in League One last term, with Blackpool and then Colchester United. He scored twice for the U's but was unable to save them from relegation. The length of Lee's contract with the promoted Tykes has not been revealed. Find all the latest football transfers on our dedicated page."""

summarize = dspy.ChainOfThought("document -> summary")
response = summarize(document=document)

print(response.summary)

In [None]:
llm.inspect_history()

In [None]:
class Emotion(dspy.Signature):
    """Classify emotion among sadness, joy, love, anger, fear, surprise."""

    sentence = dspy.InputField()
    sentiment = dspy.OutputField()


sentence = "i started feeling a little vulnerable when the giant spotlight started blinding me"

classify = dspy.Predict(Emotion)
classify(sentence=sentence)

In [None]:
llm.inspect_history()

In [None]:
class CheckCitationFaithfulness(dspy.Signature):
    """Verify that the text is based on the provided context."""

    context = dspy.InputField(desc="facts here are assumed to be true")
    text = dspy.InputField()
    faithfulness = dspy.OutputField(desc="True/False indicating if text is faithful to context")


context = "The 21-year-old made seven appearances for the Hammers and netted his only goal for them in a Europa League qualification round match against Andorran side FC Lustrains last season. Lee had two loan spells in League One last term, with Blackpool and then Colchester United. He scored twice for the U's but was unable to save them from relegation. The length of Lee's contract with the promoted Tykes has not been revealed. Find all the latest football transfers on our dedicated page."

text = "Lee scored 3 goals for Colchester United."

faithfulness = dspy.ChainOfThought(CheckCitationFaithfulness)
faithfulness(context=context, text=text)

In [None]:
llm.inspect_history()

Note that you can use a different LM as the teacher during optimization. For example, here GPT-4 Turbo is the teacher while GPT-3.5 Turbo stays the default LM:

```python
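import dspy
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# Teacher and student models.
gpt4T = dspy.OpenAI(model='gpt-4-1106-preview', max_tokens=350, model_type='chat')
turbo = dspy.OpenAI(model='gpt-3.5-turbo-1106', max_tokens=250, model_type='chat')

# The student model is the default LM.
dspy.settings.configure(lm=turbo)

# `scone_accuracy` is a metric function defined elsewhere (not shown here).
bootstrap_optimizer = BootstrapFewShotWithRandomSearch(
    max_bootstrapped_demos=8,
    max_labeled_demos=8,
    num_candidate_programs=10,
    num_threads=8,
    metric=scone_accuracy,
    teacher_settings=dict(lm=gpt4T))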
```

# Modules

In [None]:
question = "What's something great about the ColBERT retrieval model?"

# 1) Declare with a signature, and pass some config.
classify = dspy.ChainOfThought("question -> answer", n=5)

# 2) Call with input argument.
response = classify(question=question)

# 3) Access the outputs.
response.completions.answer

In [None]:
print(f"Rationale: {response.rationale}")
print(f"Answer: {response.answer}")

In [None]:
import dspy

colbertv2_wiki17_abstracts = dspy.ColBERTv2(url="http://20.102.90.50:2017/wiki17_abstracts")

dspy.settings.configure(lm=llm, rm=colbertv2_wiki17_abstracts)

In [None]:
from dspy.datasets import HotPotQA

# Load the dataset.
dataset = HotPotQA(train_seed=1, train_size=20, eval_seed=2023, dev_size=50, test_size=0)

# Tell DSPy that the 'question' field is the input. Any other fields are labels and/or metadata.
trainset = [x.with_inputs("question") for x in dataset.train]
devset = [x.with_inputs("question") for x in dataset.dev]

len(trainset), len(devset)

In [None]:
trainset[0]

In [None]:
class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

In [None]:
class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()

        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)


rag = RAG()

In [None]:
rag("At My Window was released by which American singer-songwriter?")

In [None]:
llm.inspect_history()

In [None]:
from dspy.teleprompt import BootstrapFewShot


# Validation logic: check that the predicted answer is correct.
# Also check that the retrieved context does actually contain that answer.
def validate_context_and_answer(example, pred, trace=None):
    answer_EM = dspy.evaluate.answer_exact_match(example, pred)
    answer_PM = dspy.evaluate.answer_passage_match(example, pred)
    return answer_EM and answer_PM


# Set up a basic teleprompter, which will compile our RAG program.
teleprompter = BootstrapFewShot(metric=validate_context_and_answer)

# Compile!
compiled_rag = teleprompter.compile(rag, trainset=trainset)
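The metric above only passes when two things hold: the predicted answer matches the gold answer, and at least one retrieved passage actually contains that answer. To make the logic concrete, here is a rough pure-Python approximation. The normalization is the usual SQuAD-style one (lowercase, strip punctuation and articles), which I am assuming is close to what the `DSPy` helpers do; the function names and the toy data are hypothetical.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(gold: str, predicted: str) -> bool:
    return normalize(gold) == normalize(predicted)


def passage_match(gold: str, passages: list) -> bool:
    # True if any retrieved passage contains the normalized gold answer.
    target = normalize(gold)
    return any(target in normalize(passage) for passage in passages)


def validate(gold: str, predicted: str, passages: list) -> bool:
    return exact_match(gold, predicted) and passage_match(gold, passages)


# Toy example: the answer is right and a passage supports it.
print(validate("Blue Whale", "the blue whale.",
               ["The blue whale is the largest animal known to have existed."]))  # True

# Same answer, but no retrieved passage supports it.
print(validate("Blue Whale", "the blue whale.",
               ["Whales are marine mammals."]))  # False
```

Requiring both checks keeps the bootstrapped demos honest: a correct answer that the retrieved context does not support is not kept as a demonstration.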

In [None]:
# Ask any question you like to this simple RAG program.
my_question = "What castle did David Gregory inherit?"

# Get the prediction. This contains `pred.context` and `pred.answer`.
pred = compiled_rag(my_question)

# Print the contexts and the answer.
print(f"Question: {my_question}")
print(f"Predicted Answer: {pred.answer}")
print(f"Retrieved Contexts (truncated): {[c[:200] + '...' for c in pred.context]}")

In [None]:
llm.inspect_history()

In [None]:
from dspy.evaluate.evaluate import Evaluate

# Set up the `evaluate_on_hotpotqa` function. We'll use this many times below.
evaluate_on_hotpotqa = Evaluate(devset=devset, num_threads=1, display_progress=True, display_table=5)

# Evaluate the `compiled_rag` program with the `answer_exact_match` metric.
metric = dspy.evaluate.answer_exact_match
evaluate_on_hotpotqa(compiled_rag, metric=metric)
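Conceptually, the evaluation call above runs the program on every example in the dev set, applies the metric to each prediction, and reports the percentage that pass. Here is a stripped-down sketch of that loop, without threading, progress bars or the results table. The `evaluate` function, the toy program and the toy dev set are all hypothetical stand-ins, not the `DSPy` implementation.

```python
def evaluate(program, devset, metric):
    """Return the percentage of examples for which the metric is truthy."""
    passed = sum(
        bool(metric(example, program(example["question"])))
        for example in devset
    )
    return round(100.0 * passed / len(devset), 2)


# Toy dev set and a toy "program" that answers from a lookup table.
devset = [
    {"question": "2 + 2?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
    {"question": "Largest planet?", "answer": "Jupiter"},
    {"question": "3 * 3?", "answer": "9"},
]
canned = {"2 + 2?": "4", "Capital of France?": "Paris", "Largest planet?": "Saturn"}


def toy_program(question):
    return canned.get(question, "unknown")


def toy_exact_match(example, prediction):
    return example["answer"] == prediction


print(evaluate(toy_program, devset, toy_exact_match))  # 50.0
```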