# Best Demo Ensemble

###### In this notebook, we look at how to use BestDemoEnsemble, an optimizer that uses an LLM as a judge to choose among programs compiled by different automatic few-shot optimizers, based on the demos each program carries. We will test it on a simple QA program over GSM8K.

In [None]:
%pip install evaluate
%pip install --upgrade pybind11
%pip install --force-reinstall torch evaluate

In [None]:
import dspy
import os
import importlib

from dspy.datasets.gsm8k import GSM8K
from dspy.evaluate.evaluate import Evaluate

###### Setup

In [None]:
os.environ["OPENAI_API_KEY"] = ""
os.environ["OPENAI_API_BASE"] = ""

In [3]:
lm = dspy.LM("openai/gpt-4o-mini")
lm("testing")

["It looks like you're testing the system. How can I assist you today?"]

In [16]:
dspy.settings.configure(lm=lm)

In [2]:
gsm8k = GSM8K()
gsm8k_trainset, gsm8k_devset = gsm8k.train, gsm8k.dev
gsm8k_trainset = gsm8k_trainset[:50]
gsm8k_devset = gsm8k_devset[:50]

100%|██████████| 7473/7473 [00:00<00:00, 12740.80it/s]
100%|██████████| 1319/1319 [00:00<00:00, 10745.89it/s]


In [None]:
class GenerateAnswer(dspy.Signature):
    """Answer the math problem with a single number."""

    question = dspy.InputField(desc='a math word problem')
    answer = dspy.OutputField(desc='a single number')

class QA(dspy.Module):
    def __init__(self):
        super().__init__()

        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        prediction = self.generate_answer(question=question)
        return dspy.Prediction(answer=prediction.answer)

In [26]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

In [105]:
from dspy.teleprompt import LabeledFewShot
from dspy.teleprompt import BootstrapFewShot
from dspy.teleprompt import ClusterFewShot

def validate_answer(example, pred, trace=None):
    return dspy.evaluate.answer_exact_match(example, pred)

vanilla_teleprompter = LabeledFewShot(k=6)
bootstrap_teleprompter = BootstrapFewShot(max_labeled_demos=6, metric=validate_answer)
cluster_teleprompter = ClusterFewShot(num_labeled_demos=6)

compiled_vanilla = vanilla_teleprompter.compile(QA(), trainset=gsm8k_trainset)
compiled_bootstrap = bootstrap_teleprompter.compile(QA(), trainset=gsm8k_trainset)
compiled_cluster = cluster_teleprompter.compile(QA(), trainset=gsm8k_trainset)

Fetching 5 files:   0%|          | 0/5 [00:00<?, ?it/s]

  2%|▏         | 4/200 [00:00<00:01, 108.71it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


In [151]:
importlib.reload(dspy.teleprompt)
importlib.reload(dspy.teleprompt.cluster_fewshot)
importlib.reload(dspy.teleprompt.best_demo_ensemble)

<module 'dspy.teleprompt.best_demo_ensemble' from 'c:\\Users\\Hamza\\dspy\\dspy\\teleprompt\\best_demo_ensemble.py'>

###### The optimizer is initialized as usual. When you compile, you pass in the candidate programs that BestDemoEnsemble should choose between and the LM that will act as the judge. In this case, we pass in programs compiled by LabeledFewShot, BootstrapFewShot, and ClusterFewShot, respectively.
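###### As a rough illustration of this kind of selection step (not BestDemoEnsemble's actual implementation), the sketch below picks one candidate from a list of demo sets using a judge callable that returns a zero-based index and a justification. The names `select_by_judge` and `toy_judge` are hypothetical, and the toy judge simply prefers the demo set with the most distinct questions instead of calling an LM.

```python
from typing import Callable, Dict, List, Tuple

Demo = Dict[str, str]
Judge = Callable[[List[List[Demo]]], Tuple[int, Dict[str, str]]]


def select_by_judge(demo_sets: List[List[Demo]], judge: Judge) -> Tuple[int, Dict[str, str]]:
    """Ask the judge to pick one demo set; return (zero-based index, justification)."""
    if not demo_sets:
        raise ValueError("need at least one candidate demo set")
    index, justification = judge(demo_sets)
    if not 0 <= index < len(demo_sets):
        raise ValueError(f"judge returned out-of-range index {index}")
    return index, justification


def toy_judge(demo_sets: List[List[Demo]]) -> Tuple[int, Dict[str, str]]:
    """Hypothetical stand-in for an LM judge: prefer the set with the most distinct questions."""
    counts = [len({d["question"] for d in demos}) for demos in demo_sets]
    best = max(range(len(demo_sets)), key=counts.__getitem__)
    return best, {"coverage": f"Program {best} has {counts[best]} distinct questions."}


# Two made-up demo sets standing in for the demos of two compiled programs.
demo_sets = [
    [{"question": "2 + 2?", "answer": "4"}],
    [{"question": "3 * 3?", "answer": "9"}, {"question": "10 / 2?", "answer": "5"}],
]
chosen, why = select_by_judge(demo_sets, toy_judge)
print(chosen, why)
```

###### Swapping `toy_judge` for a function that prompts an LM with the demos would keep the same shape: the judge sees every candidate's demos and returns an index plus its reasons.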

In [158]:
from dspy.teleprompt import BestDemoEnsemble

optimizer = BestDemoEnsemble()
programs = [compiled_vanilla, compiled_bootstrap, compiled_cluster] # define the programs you want the optimizer to choose between

compiled_program = optimizer.compile(programs, lm) # pass in the programs and the LM that will judge between program demos

###### Now let's run it on a single example from the devset.

In [123]:
print(gsm8k_devset[10])
compiled_program(gsm8k_devset[10].question)

Example({'question': 'Vanessa wants to buy a dress she saw at the mall, which costs $80, and she already has $20 in savings. Her parents give her $30 every week, but she also spends $10 each weekend at the arcades. How many weeks will she have to wait until she can gather enough money to buy the dress?', 'gold_reasoning': 'Vanessa needs $80 – $20 = $<<80-20=60>>60 to buy the dress. She manages to gather $30 - $10 = $<<30-10=20>>20 each week The number of weeks she has to wait is 60 ÷ 20 = <<60/20=3>>3 weeks.', 'answer': '3'}) (input_keys={'question'})


Prediction(
    answer='3'
)

###### We see that it runs properly and predicts the correct answer.
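###### For reference, the gold reasoning for this example can be checked with plain arithmetic. The variable names below are ours and only restate the numbers from the question.

```python
# Numbers taken from the devset question above.
dress_cost = 80
savings = 20
weekly_allowance = 30
weekly_spending = 10

remaining = dress_cost - savings                    # amount still needed
net_per_week = weekly_allowance - weekly_spending   # amount saved each week
weeks = remaining // net_per_week                   # whole weeks to wait

# The division is exact here, so no rounding up is needed.
assert remaining % net_per_week == 0
print(weeks)
```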

###### We can also see which program was chosen and why.

In [159]:
compiled_program.chosen_program, compiled_program.justification

(1,
 {'difficulty': 'Program 1 presents problems that are moderately challenging, requiring a good understanding of basic arithmetic and logical reasoning without being overly complex.',
  'coverage': 'This program covers a variety of scenarios, including exam scores, distance calculations, and comparisons, which provides a broad range of mathematical concepts and real-world applications.',
  'semantic_shift': 'The examples in Program 1 maintain a low semantic shift, as they all revolve around straightforward calculations and comparisons, making it easier for users to follow the logic and reasoning across different problems.'})

###### It looks like the optimizer chose Program 1 (i.e. the second program in the list, compiled_bootstrap), and it provides a justification for its choice. Neat!
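###### The outputs in this notebook suggest that `chosen_program` is a zero-based index into the `programs` list. Under that assumption, a small lookup like the one below turns the index back into a readable name. The `program_names` list is our own helper, not part of the optimizer.

```python
# Labels mirror the order of the `programs` list passed to compile().
program_names = ["compiled_vanilla", "compiled_bootstrap", "compiled_cluster"]

chosen_program = 1  # the value reported by compiled_program.chosen_program above
chosen_name = program_names[chosen_program]
print(chosen_name)
```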
###### Now let's try it with a different set of programs.

In [None]:
cluster_teleprompter = ClusterFewShot(num_labeled_demos=16)
compiled_cluster = cluster_teleprompter.compile(QA(), trainset=gsm8k_trainset)
programs = [compiled_cluster, compiled_vanilla]

compiled_program = optimizer.compile(programs, lm)
compiled_program.chosen_program, compiled_program.justification

Fetching 5 files:   0%|          | 0/5 [00:00<?, ?it/s]

(0,
 {'difficulty': 'Program 0 presents a variety of problems that are moderately difficult, requiring multi-step reasoning and basic arithmetic skills, which is appropriate for a wide range of learners.',
  'coverage': 'Program 0 covers a broad range of topics including basic arithmetic, percentages, discounts, and averages, providing a comprehensive set of examples that can appeal to different interests and contexts.',
  'semantic_shift': 'The examples in Program 0 maintain a low semantic shift, as they all revolve around practical, real-world scenarios involving calculations, making it easier for learners to follow the logic and reasoning across different examples.'})

###### This time the optimizer chooses between two programs: one compiled with ClusterFewShot using num_labeled_demos=16 (compiled_cluster) and one compiled with LabeledFewShot (compiled_vanilla). It picks the ClusterFewShot-compiled program (Program 0).