## 04_07 - Using DSPy to optimize our critique/judge prompt
We'll create a judge prompt that checks whether our responses are strong.

### 0] Setting Up

In [1]:
%pip install -U pip dspy-ai openai~=0.28.1 transformers==4.42.0 datasets==2.20.0
import dspy
import pandas as pd
from datasets import load_dataset

Note: you may need to restart the kernel to use updated packages.




In [2]:
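# Assumes the OpenAI API key is already available, e.g. via the OPENAI_API_KEY environment variable.
prompt_model_name = "gpt-4o-mini"
# prompt_model acts as the teacher during compilation; task_model answers the judge prompt itself.
prompt_model = dspy.OpenAI(model=prompt_model_name, max_tokens=1000, stop=["\n\n", "\n---"])
task_model = dspy.OpenAI(model=prompt_model_name, max_tokens=1000, stop=["\n\n", "\n---", "assistant"])
dspy.settings.configure(lm=task_model)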

### 1] Creating our dataset
We'll use the ratings in "nvidia/HelpSteer" from Hugging Face as our human-annotated answers.

In [3]:
from datasets import load_dataset
from dspy.datasets.dataloader import DataLoader

# NOTE: the "-0" end bound is kept as in the original run; "train[-30:]" was probably intended.
train_ds, validation_ds, test_ds = load_dataset(
    "nvidia/HelpSteer",
    split=["train[0:100]+train[-50:-30]", "train[100:200]+train[-30:-0]", "validation[:100]+validation[-25:]"],
)
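
The split strings use Python-like slice notation, so a plain list is a reasonable way to reason about how many rows each one selects. The sketch below is only an analogy: `rows` is a hypothetical stand-in for a split, and the Hugging Face parser is not guaranteed to behave exactly like list slicing. That said, the counts it prints line up with the 120 training examples and 100 validation examples reported in the progress bars later in this notebook.

```python
# Plain-list stand-in for a dataset split (hypothetical; no real HelpSteer rows are used).
rows = list(range(1000))

train_like = rows[0:100] + rows[-50:-30]    # 100 rows from the start plus 20 from near the end
val_like = rows[100:200] + rows[-30:-0]     # -0 == 0, so the second slice is empty
val_intended = rows[100:200] + rows[-30:]   # what was probably intended: the last 30 rows

print(len(train_like), len(val_like), len(val_intended))  # 120 100 130
```

If the extra 30 validation rows are wanted, dropping the `-0` end bound is the likely fix, at the cost of scores that no longer match the outputs recorded here.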
 

In [4]:
dataloader = DataLoader()

def prepare_datasets(datasets: list) -> list:
    """Convert each HelpSteer split into DSPy examples with a binary yes/no label."""
    new_datasets = []
    for dataset in datasets:
        dataset = pd.DataFrame(dataset)
        dataset["question"] = dataset["prompt"]
        # Label a response "yes" only if correctness is above 3 and both helpfulness and coherence are above 2.
        dataset["answer"] = dataset.apply(
            lambda row: "yes" if row["correctness"] > 3 and all(row[col] > 2 for col in ["helpfulness", "coherence"]) else "no",
            axis=1,
        )
        # Drop the raw rating columns so only question, response, and answer remain.
        dataset = dataset.drop(columns=["helpfulness", "coherence", "complexity", "verbosity", "correctness", "prompt"])
        # Mark question and response as inputs; answer is the label the judge must predict.
        dataset = dataloader.from_pandas(df=dataset, fields=["question", "response", "answer"], input_keys=["question", "response"])
        new_datasets.append(dataset)
    return new_datasets


In [5]:
train_ds, validation_ds, test_ds = prepare_datasets([train_ds, validation_ds, test_ds])
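
To make the labeling rule concrete, here is a minimal sketch that applies the same thresholds to a few plain dictionaries. The helper name `to_label` and the ratings are hypothetical and are not taken from HelpSteer. Because HelpSteer attributes are rated from 0 to 4, `correctness > 3` only accepts a top score of 4.

```python
def to_label(row: dict) -> str:
    """Mirror the notebook's rule: correctness > 3, and helpfulness and coherence > 2."""
    strong = row["correctness"] > 3 and all(row[col] > 2 for col in ("helpfulness", "coherence"))
    return "yes" if strong else "no"

# Hypothetical ratings for illustration only.
examples = [
    {"correctness": 4, "helpfulness": 4, "coherence": 3},  # passes every threshold
    {"correctness": 3, "helpfulness": 4, "coherence": 4},  # correctness is not above 3
    {"correctness": 4, "helpfulness": 2, "coherence": 4},  # helpfulness is not above 2
]

labels = [to_label(row) for row in examples]
print(labels)  # ['yes', 'no', 'no']
```

Note that the label blends correctness with helpfulness and coherence, while the judge signature in the next section only asks about factual correctness.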

### 2] Defining our LLM Judge
We'll define our LLM judge as a DSPy program and compile it using our dataset.

In [6]:
class FactJudgeSignature(dspy.Signature):
    """Judge if the answer is factually correct based on the question."""

    question = dspy.InputField(desc="Question")
    response = dspy.InputField(desc="LLM response for the question")
    answer = dspy.OutputField(desc="Is the response answer factually correct based on the question? Yes or No.")


class Judge(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_answer = dspy.ChainOfThought(FactJudgeSignature)
    
    def forward(self, question, response):
        return self.generate_answer(question=question, response=response)

In [7]:
train_ds[0:1]

[Example({'question': 'What are the three most important things to consider when deciding what technology to use to build an assist device to help an elderly person with basic needs?', 'response': "To build an assistive device to help an elderly person with basic needs, one must consider three crucial things: safety, compatibility, and ease of use. Safety is paramount, as the device must not cause harm to the user. Compatibility with the user's environment and other devices is also essential. Finally, the device must be simple enough for the elderly person to operate.", 'answer': 'yes'}) (input_keys={'response', 'question'})]

### 3] Running our program

Now let's evaluate the uncompiled judge to get a baseline before we compile it.

In [27]:
from dspy.evaluate import Evaluate

def validate_answer(example, pred, trace=None):
    # DSPy passes the gold example first and the prediction second; compare their `answer` fields.
    return dspy.evaluate.answer_exact_match(example, pred)

judge = Judge()
evaluate = Evaluate(devset=train_ds, metric=validate_answer, num_threads=4)
baseline_train_score = evaluate(judge)
baseline_val_score = evaluate(judge, devset=validation_ds)

In [28]:
print("Baseline train score:", baseline_train_score)
print("Baseline val score:", baseline_val_score)

Baseline train score: 45.0
Baseline val score: 44.0
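
The scores above are percentages of examples whose predicted answer matches the gold label. The following is a rough sketch of how a normalized exact-match metric can be turned into such a percentage. It is an approximation written for illustration, not DSPy's implementation; `normalize`, `exact_match`, and `accuracy_percent` are hypothetical helpers, and the predictions are made up.

```python
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())

def exact_match(gold: str, predicted: str) -> bool:
    """True only when the normalized strings are identical."""
    return normalize(gold) == normalize(predicted)

def accuracy_percent(golds: list, preds: list) -> float:
    """Share of exact matches, expressed as a percentage."""
    hits = sum(exact_match(g, p) for g, p in zip(golds, preds))
    return round(100.0 * hits / len(golds), 1)

# Hypothetical gold labels and model outputs.
golds = ["yes", "no", "yes", "no"]
preds = ["Yes.", "no", "No", "Yes, it is correct"]

print(accuracy_percent(golds, preds))  # 50.0
```

Under a metric like this, a judge that answers with a full sentence instead of a bare yes or no is scored as wrong even when its verdict is right, so it is worth inspecting raw outputs when a score looks low.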


### 4] Compiling a program with a teleprompter

So far we've used a plain, unoptimized prompt to get an answer. Now let's have DSPy optimize it. We'll need a search technique, called a **teleprompter**, and a teacher model to construct the demonstrations.

In [29]:
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# We'll generate a number of candidate programs in parallel while compiling, using gpt-4o-mini as the teacher model.
bootstrap_optimizer = BootstrapFewShotWithRandomSearch(
    max_bootstrapped_demos=5,
    max_labeled_demos=5,
    num_candidate_programs=5,
    num_threads=8,
    metric=validate_answer,
    teacher_settings=dict(lm=prompt_model))

Going to sample between 1 and 5 traces per predictor.
Will attempt to bootstrap 5 candidate sets.
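
The random-search idea behind this optimizer can be sketched in a few lines: draw several candidate sets of demonstrations, score each on the validation set, and keep the best one. The toy below is an assumption-heavy illustration, not DSPy's implementation. It replaces the LLM with a hypothetical nearest-demonstration classifier over numbers, and it skips the bootstrapping step in which the teacher model generates traces that are kept only if they pass the metric.

```python
import random

def predict(demos, x):
    """Toy 'program': copy the label of the closest demonstration."""
    nearest = min(demos, key=lambda demo: abs(demo[0] - x))
    return nearest[1]

def score(demos, dataset):
    """Percentage of examples the toy program labels correctly."""
    hits = sum(predict(demos, x) == y for x, y in dataset)
    return 100.0 * hits / len(dataset)

def random_search(trainset, valset, num_candidates=5, max_demos=5, seed=0):
    """Sample candidate demo sets and keep the one that scores best on valset."""
    rng = random.Random(seed)
    best_score, best_demos, scores = -1.0, None, []
    for _ in range(num_candidates):
        k = rng.randint(1, max_demos)    # between 1 and max_demos demonstrations
        demos = rng.sample(trainset, k)  # one candidate demo set
        candidate_score = score(demos, valset)
        scores.append(candidate_score)
        if candidate_score > best_score:
            best_score, best_demos = candidate_score, demos
    return best_demos, best_score, scores

# Hypothetical data: the label is "yes" when the number is at least 50.
trainset = [(x, "yes" if x >= 50 else "no") for x in range(0, 100, 5)]
valset = [(x, "yes" if x >= 50 else "no") for x in range(2, 100, 7)]

best_demos, best_score, scores = random_search(trainset, valset)
print(scores, best_score)
```

Because the winner is chosen by its validation score, that score is an optimistic estimate; a separate held-out set gives a fairer picture of the selected program.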


In [30]:
cot_fewshot = bootstrap_optimizer.compile(Judge(), trainset=train_ds, valset=validation_ds)

Average Metric: 44 / 100  (44.0): 100%|██████████| 100/100 [00:00<00:00, 1165.45it/s]


Score: 44.0 for set: [0]
New best sscore: 44.0 for seed -3
Scores so far: [44.0]
Best score: 44.0


Average Metric: 47 / 100  (47.0): 100%|██████████| 100/100 [00:24<00:00,  4.03it/s]


Score: 47.0 for set: [5]
New best sscore: 47.0 for seed -2
Scores so far: [44.0, 47.0]
Best score: 47.0


 14%|█▍        | 17/120 [00:39<04:01,  2.34s/it]


Bootstrapped 5 full traces after 18 examples in round 0.


Average Metric: 54 / 100  (54.0): 100%|██████████| 100/100 [00:30<00:00,  3.32it/s]


Score: 54.0 for set: [5]
New best sscore: 54.0 for seed -1
Scores so far: [44.0, 47.0, 54.0]
Best score: 54.0
Average of max per entry across top 1 scores: 0.54
Average of max per entry across top 2 scores: 0.59
Average of max per entry across top 3 scores: 0.6
Average of max per entry across top 5 scores: 0.6
Average of max per entry across top 8 scores: 0.6
Average of max per entry across top 9999 scores: 0.6


 16%|█▌        | 19/120 [00:40<03:33,  2.11s/it]


Bootstrapped 4 full traces after 20 examples in round 0.


Average Metric: 52 / 100  (52.0): 100%|██████████| 100/100 [00:27<00:00,  3.58it/s]


Score: 52.0 for set: [5]
Scores so far: [44.0, 47.0, 54.0, 52.0]
Best score: 54.0
Average of max per entry across top 1 scores: 0.54
Average of max per entry across top 2 scores: 0.58
Average of max per entry across top 3 scores: 0.6
Average of max per entry across top 5 scores: 0.61
Average of max per entry across top 8 scores: 0.61
Average of max per entry across top 9999 scores: 0.61


  2%|▎         | 3/120 [00:06<04:13,  2.17s/it]


Bootstrapped 2 full traces after 4 examples in round 0.


Average Metric: 59 / 100  (59.0): 100%|██████████| 100/100 [00:28<00:00,  3.57it/s]


Score: 59.0 for set: [5]
New best sscore: 59.0 for seed 1
Scores so far: [44.0, 47.0, 54.0, 52.0, 59.0]
Best score: 59.0
Average of max per entry across top 1 scores: 0.59
Average of max per entry across top 2 scores: 0.62
Average of max per entry across top 3 scores: 0.64
Average of max per entry across top 5 scores: 0.65
Average of max per entry across top 8 scores: 0.65
Average of max per entry across top 9999 scores: 0.65


  1%|          | 1/120 [00:01<03:10,  1.60s/it]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 57 / 100  (57.0): 100%|██████████| 100/100 [00:23<00:00,  4.21it/s]


Score: 57.0 for set: [5]
Scores so far: [44.0, 47.0, 54.0, 52.0, 59.0, 57.0]
Best score: 59.0
Average of max per entry across top 1 scores: 0.59
Average of max per entry across top 2 scores: 0.63
Average of max per entry across top 3 scores: 0.65
Average of max per entry across top 5 scores: 0.66
Average of max per entry across top 8 scores: 0.66
Average of max per entry across top 9999 scores: 0.66


  3%|▎         | 4/120 [00:07<03:40,  1.90s/it]


Bootstrapped 2 full traces after 5 examples in round 0.


Average Metric: 52 / 100  (52.0): 100%|██████████| 100/100 [00:26<00:00,  3.83it/s]


Score: 52.0 for set: [5]
Scores so far: [44.0, 47.0, 54.0, 52.0, 59.0, 57.0, 52.0]
Best score: 59.0
Average of max per entry across top 1 scores: 0.59
Average of max per entry across top 2 scores: 0.63
Average of max per entry across top 3 scores: 0.65
Average of max per entry across top 5 scores: 0.67
Average of max per entry across top 8 scores: 0.67
Average of max per entry across top 9999 scores: 0.67


  2%|▎         | 3/120 [00:05<03:32,  1.81s/it]


Bootstrapped 2 full traces after 4 examples in round 0.


Average Metric: 50 / 100  (50.0): 100%|██████████| 100/100 [00:27<00:00,  3.64it/s]

Score: 50.0 for set: [5]
Scores so far: [44.0, 47.0, 54.0, 52.0, 59.0, 57.0, 52.0, 50.0]
Best score: 59.0
Average of max per entry across top 1 scores: 0.59
Average of max per entry across top 2 scores: 0.63
Average of max per entry across top 3 scores: 0.65
Average of max per entry across top 5 scores: 0.67
Average of max per entry across top 8 scores: 0.69
Average of max per entry across top 9999 scores: 0.69
8 candidate programs found.
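
The "Average of max per entry across top k scores" lines can be read as an oracle-style statistic: for each validation example, take the best result among the top k programs, then average over examples. The sketch below shows that calculation on made-up data. The function name and the per-example scores are hypothetical, and the reading of the log line is an interpretation rather than a quote of DSPy's code.

```python
def avg_of_max_per_entry(candidates, k):
    """candidates: list of (overall_score, per_example_scores) tuples."""
    # Keep the k programs with the highest overall score.
    top = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    # Transpose so each entry holds one example's results across the top programs.
    per_example = zip(*(subscores for _, subscores in top))
    best_each = [max(entry) for entry in per_example]
    return sum(best_each) / len(best_each)

# Hypothetical per-example results (1 = correct, 0 = wrong) for three programs.
candidates = [
    (50.0, [1, 1, 0, 0]),
    (50.0, [0, 1, 1, 0]),
    (25.0, [0, 0, 0, 1]),
]

for k in (1, 2, 3):
    print(k, avg_of_max_per_entry(candidates, k))
# 1 0.5
# 2 0.75
# 3 1.0
```

Read this way, the gap in the log between 0.59 for the single best program and 0.69 across all eight suggests the candidates get different examples right, so no single demo set covers every case.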





In [31]:
task_model.inspect_history(n=1)




Judge if the answer is factually correct based on the question.

---

Question: What is the most congested traffic intersection in the world? How bad is the traffic there?
Response: The most congested traffic intersection in the world is the Shibuya scramble crossing in Tokyo, Japan. It is estimated that over 2,500 people cross the intersection at once during peak hours.
Answer: no

Question: Background: <start of reference> As the smoke cleared, Enzi saw the Storm Queen laying on the ground. Miraculously she was still in one piece. Enzi wondered what kind of protections it took to survive such a blast. It did not matter, however. She was obviously unconscious and no longer of any help. Aldebaran grunted as he saw the new foes enter the fray. His fur was already matted with blood and sweat. Usually bovine did not noticeably sweat, but obviously something in the creation of the minotaur gave him a more human reaction to severe exertion. Eurysa had run out of arrows. Any potential sup


In [33]:
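# Note: this re-scores the compiled judge on validation_ds, the same set the optimizer used to select it.
# test_ds is still unused and could serve as a held-out check.
val_score = evaluate(cot_fewshot, devset=validation_ds, num_threads=4)
print(val_score)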

59.0


In [35]:
cot_fewshot.save("compiled_cot.dspy")

[('generate_answer', Predict(StringSignature(question, response -> rationale, answer
    instructions='Judge if the answer is factually correct based on the question.'
    question = Field(annotation=str required=True json_schema_extra={'desc': 'Question', '__dspy_field_type': 'input', 'prefix': 'Question:'})
    response = Field(annotation=str required=True json_schema_extra={'desc': 'LLM response for the question', '__dspy_field_type': 'input', 'prefix': 'Response:'})
    rationale = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${produce the answer}. We ...', '__dspy_field_type': 'output'})
    answer = Field(annotation=str required=True json_schema_extra={'desc': 'Is the response answer factually correct based on the question? Yes or No.', '__dspy_field_type': 'output', 'prefix': 'Answer:'})
)))]
