# DSPy Playground

In this document, I go through the minimal working example from the DSPy documentation.

# Role Agent

The goal of this agent is to extract the role of a country from a given text snippet: it receives the snippet together with a country keyword and returns the role.

Examples: Text Snippet -> Role

In [9]:
# Standard library
import json
import os

# Third-party
import groq
import numpy as np
import pandas as pd
from dotenv import load_dotenv
from IPython.display import Markdown, display

# DSPy
import dspy
from dspy import Example
from dspy.evaluate import Evaluate
from dspy.evaluate.metrics import answer_exact_match
from dspy.teleprompt import BootstrapFewShot, BootstrapFewShotWithRandomSearch

# Project modules
from module_v002 import FullLLMChain
from optimize import (
    custom_evaluation_function,
    evaluate_expectations_metric,
    passage_similarity_metric,
    similar_score_metric,
)

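# Read API keys from a local .env file.
load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")

# Local Llama 3 8B served through Ollama.
llama3_8b = dspy.OllamaLocal(model="llama3:8b",
                             temperature=0,
                             max_tokens=800)

# GPT-3.5 Turbo as an alternative LM (configured here, not selected below).
gpt35turbo = dspy.OpenAI(model="gpt-3.5-turbo-0125",
                         api_key=openai_api_key,
                         temperature=0,
                         max_tokens=800,
                         model_type="chat")

# Use the local Llama model as the default LM for all DSPy modules.
dspy.settings.configure(lm=llama3_8b)

full_llm_chain = FullLLMChain()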


In [3]:
raise ValueError("Check the LM!")
    

ValueError: Check the LM!

In [4]:
data = pd.read_excel("/Users/nicolasroever/Dropbox/5_Bond yield spreads/project_python/bond_yields_llm/data/300_snippets_transcripts_all_labeled_v002.xlsx")
snippet_list = data["Snippet"].tolist()
keyword_list = data["Keyword"].tolist()
data["Relevance_Score_Yes_No"] = np.where(data["Final Relevance Score"] == 1, 'no', 'yes')

# One dspy.Example per labeled row; excerpt and country_keyword are the inputs.
examples = [
    dspy.Example(
        excerpt=row["Snippet"],
        country_keyword=row["Keyword"],
        answer=str(row["Final Combined"]),
    ).with_inputs("excerpt", "country_keyword")
    for _, row in data.iloc[0:300].iterrows()
]
trainset = examples[0:5]
valset = examples[51:55]
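
The optimizer and evaluation cells in this notebook score predictions with `evaluate_expectations_metric`, which is imported from `optimize` and not shown here. DSPy metrics are plain callables that take `(example, pred, trace=None)` and return a number or a boolean. The following is a minimal sketch of that shape, not the metric I actually use: `exact_match_metric` is a hypothetical name, the `SimpleNamespace` objects stand in for `dspy.Example` and the prediction, and the label string is made up. The real metric appears to award partial credit (the logs show scores of 0.5), which this sketch does not reproduce.

```python
from types import SimpleNamespace


def exact_match_metric(example, pred, trace=None):
    """Hypothetical metric: 1.0 on a case-insensitive exact match, else 0.0."""
    predicted = getattr(pred, "answer", None)
    # A missing answer counts as a miss instead of raising an error.
    if predicted is None:
        return 0.0
    gold = str(example.answer).strip().lower()
    return float(str(predicted).strip().lower() == gold)


# Stand-ins for a labeled example and two predictions (made-up label).
gold = SimpleNamespace(answer="Example Label")
print(exact_match_metric(gold, SimpleNamespace(answer=" example label ")))  # 1.0
print(exact_match_metric(gold, SimpleNamespace(answer=None)))               # 0.0
```

Returning 0.0 for a missing answer keeps a long optimization run going when the model fails to produce a parsable output.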

In [5]:
config = dict(max_bootstrapped_demos=2, max_labeled_demos=2, num_candidate_programs = 3)
teleprompter = BootstrapFewShotWithRandomSearch(metric=evaluate_expectations_metric, **config)
optimized_llm = teleprompter.compile(full_llm_chain, trainset=trainset, valset=valset)

Going to sample between 1 and 4 traces per predictor.
Will attempt to train 5 candidate sets.


Average Metric: 2 / 4  (50.0): 100%|██████████| 4/4 [00:42<00:00, 10.74s/it] 


Average Metric: 2 / 4  (50.0%)
Score: 50.0 for set: [0, 0, 0]
New best score: 50.0 for seed -3
Scores so far: [50.0]
Best score: 50.0


Average Metric: 1 / 4  (25.0): 100%|██████████| 4/4 [00:58<00:00, 14.74s/it]


Average Metric: 1 / 4  (25.0%)
Score: 25.0 for set: [2, 2, 2]
Scores so far: [50.0, 25.0]
Best score: 50.0


100%|██████████| 5/5 [01:30<00:00, 18.15s/it]


Bootstrapped 1 full traces after 5 examples in round 0.


Average Metric: 0 / 1  (0.0):  25%|██▌       | 1/4 [00:33<01:41, 33.80s/it]

None value in prediction


Average Metric: 0 / 2  (0.0):  50%|█████     | 2/4 [00:51<00:48, 24.31s/it]

None value in prediction


Average Metric: 0 / 3  (0.0):  75%|███████▌  | 3/4 [00:53<00:14, 14.14s/it]

None value in prediction


Average Metric: 0.5 / 4  (12.5): 100%|██████████| 4/4 [01:02<00:00, 15.62s/it]


Average Metric: 0.5 / 4  (12.5%)
Score: 12.5 for set: [2, 2, 2]
Scores so far: [50.0, 25.0, 12.5]
Best score: 50.0
Average of max per entry across top 1 scores: 0.5
Average of max per entry across top 2 scores: 0.5
Average of max per entry across top 3 scores: 0.625
Average of max per entry across top 5 scores: 0.625
Average of max per entry across top 8 scores: 0.625
Average of max per entry across top 9999 scores: 0.625


100%|██████████| 5/5 [01:27<00:00, 17.52s/it]


Bootstrapped 3 full traces after 5 examples in round 0.


Average Metric: 1 / 4  (25.0): 100%|██████████| 4/4 [00:53<00:00, 13.48s/it]


Average Metric: 1 / 4  (25.0%)
Score: 25.0 for set: [3, 3, 3]
Scores so far: [50.0, 25.0, 12.5, 25.0]
Best score: 50.0
Average of max per entry across top 1 scores: 0.5
Average of max per entry across top 2 scores: 0.5
Average of max per entry across top 3 scores: 0.5
Average of max per entry across top 5 scores: 0.625
Average of max per entry across top 8 scores: 0.625
Average of max per entry across top 9999 scores: 0.625


 60%|██████    | 3/5 [01:03<00:42, 21.08s/it]


Bootstrapped 2 full traces after 4 examples in round 0.


Average Metric: 1 / 4  (25.0): 100%|██████████| 4/4 [00:53<00:00, 13.49s/it] 


None value in prediction
Average Metric: 1 / 4  (25.0%)
Score: 25.0 for set: [2, 2, 2]
Scores so far: [50.0, 25.0, 12.5, 25.0, 25.0]
Best score: 50.0
Average of max per entry across top 1 scores: 0.5
Average of max per entry across top 2 scores: 0.5
Average of max per entry across top 3 scores: 0.5
Average of max per entry across top 5 scores: 0.625
Average of max per entry across top 8 scores: 0.625
Average of max per entry across top 9999 scores: 0.625


 40%|████      | 2/5 [00:37<00:56, 18.71s/it]


Bootstrapped 1 full traces after 3 examples in round 0.


Average Metric: 0.5 / 4  (12.5): 100%|██████████| 4/4 [00:37<00:00,  9.49s/it]


None value in prediction
Average Metric: 0.5 / 4  (12.5%)
Score: 12.5 for set: [2, 2, 2]
Scores so far: [50.0, 25.0, 12.5, 25.0, 25.0, 12.5]
Best score: 50.0
Average of max per entry across top 1 scores: 0.5
Average of max per entry across top 2 scores: 0.5
Average of max per entry across top 3 scores: 0.5
Average of max per entry across top 5 scores: 0.625
Average of max per entry across top 8 scores: 0.625
Average of max per entry across top 9999 scores: 0.625


 80%|████████  | 4/5 [01:06<00:16, 16.72s/it]


Bootstrapped 2 full traces after 5 examples in round 0.


Average Metric: 2.5 / 4  (62.5): 100%|██████████| 4/4 [00:52<00:00, 13.04s/it]


Average Metric: 2.5 / 4  (62.5%)
Score: 62.5 for set: [2, 2, 2]
New best score: 62.5 for seed 3
Scores so far: [50.0, 25.0, 12.5, 25.0, 25.0, 12.5, 62.5]
Best score: 62.5
Average of max per entry across top 1 scores: 0.625
Average of max per entry across top 2 scores: 0.875
Average of max per entry across top 3 scores: 0.875
Average of max per entry across top 5 scores: 0.875
Average of max per entry across top 8 scores: 0.875
Average of max per entry across top 9999 scores: 0.875


 60%|██████    | 3/5 [00:49<00:34, 17.18s/it]

None value in prediction


 80%|████████  | 4/5 [01:01<00:15, 15.45s/it]


Bootstrapped 2 full traces after 5 examples in round 0.


Average Metric: 2.5 / 4  (62.5): 100%|██████████| 4/4 [00:40<00:00, 10.18s/it]

Average Metric: 2.5 / 4  (62.5%)
Score: 62.5 for set: [2, 2, 2]
Scores so far: [50.0, 25.0, 12.5, 25.0, 25.0, 12.5, 62.5, 62.5]
Best score: 62.5
Average of max per entry across top 1 scores: 0.625
Average of max per entry across top 2 scores: 0.875
Average of max per entry across top 3 scores: 0.875
Average of max per entry across top 5 scores: 0.875
Average of max per entry across top 8 scores: 0.875
Average of max per entry across top 9999 scores: 0.875
8 candidate programs found.
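
The `Average of max per entry across top k scores` lines are easy to misread. As far as I can tell, the optimizer takes the k best candidate programs, keeps the best score any of them achieved on each validation example, and averages those per-example maxima. This matches the log above, where the top-1 value always equals the best score divided by 100. The sketch below illustrates that calculation under this assumption; `avg_of_max_per_entry` is a hypothetical helper and the per-example scores are made up, not taken from this run.

```python
def avg_of_max_per_entry(candidates, k):
    """Average, over validation examples, of the best score among the top-k programs.

    candidates: list of (overall_score, per_example_scores) tuples.
    """
    top_k = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    # Transpose so that each entry holds one example's scores across programs.
    per_example = zip(*(scores for _, scores in top_k))
    best = [max(entry) for entry in per_example]
    return sum(best) / len(best)


# Made-up scores for three candidate programs on four validation examples.
candidates = [
    (50.0, [1.0, 1.0, 0.0, 0.0]),
    (25.0, [0.0, 0.0, 1.0, 0.0]),
    (12.5, [0.0, 0.0, 0.0, 0.5]),
]

for k in (1, 2, 3):
    print(k, avg_of_max_per_entry(candidates, k))
```

A gap between the top-1 value and the top-k values suggests that the candidate programs succeed on different examples.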





In [11]:
llama_history = llama3_8b.inspect_history(n=10)
with open("llama_prompts.json", "w") as file:
    json.dump(llama_history, file, indent=4)
optimized_llm.save("optimized_llama_chain_v001.json")

In [19]:
custom_evaluation_function(validation_set=valset,
                           llm=full_llm_chain,
                           metric_for_evaluation="evaluate_expectations",
                           show_examples=2)

Average Metric: 11.5 / 32  (35.9):  65%|██████▌   | 32/49 [19:05:55<10:08:46, 2148.62s/it]
Average Metric: 11.5 / 32  (35.9):  65%|██████▌   | 32/49 [10:52<05:46, 20.40s/it]


KeyboardInterrupt: 

In [None]:
evaluate = Evaluate(devset=valset,
                    metric=evaluate_expectations_metric,
                    num_threads=4,
                    display_progress=True,
                    display_table=4)

In [None]:
evaluate(full_llm_chain)

In [4]:
config = dict(max_bootstrapped_demos=4, max_labeled_demos=2, num_candidate_programs = 5)
teleprompter = BootstrapFewShotWithRandomSearch(metric=evaluate_expectations_metric, **config)
optimized_llm = teleprompter.compile(full_llm_chain, trainset=trainset, valset=valset)

Going to sample between 1 and 4 traces per predictor.
Will attempt to train 5 candidate sets.


Average Metric: 20.0 / 49  (40.8): 100%|██████████| 49/49 [09:35<00:00, 11.75s/it]


Average Metric: 20.0 / 49  (40.8%)
Score: 40.82 for set: [0, 0, 0]
New best score: 40.82 for seed -3
Scores so far: [40.82]
Best score: 40.82


Average Metric: 19.5 / 49  (39.8): 100%|██████████| 49/49 [16:00<00:00, 19.60s/it]


Average Metric: 19.5 / 49  (39.8%)
Score: 39.8 for set: [2, 2, 2]
Scores so far: [40.82, 39.8]
Best score: 40.82


 16%|█▌        | 8/50 [02:33<13:23, 19.13s/it]


Bootstrapped 4 full traces after 9 examples in round 0.


Average Metric: 0 / 1  (0.0):   2%|▏         | 1/49 [01:57<1:33:59, 117.48s/it]

None value in prediction


Average Metric: 1.0 / 7  (14.3):  14%|█▍        | 7/49 [04:10<26:15, 37.51s/it]

None value in prediction


Average Metric: 1.0 / 8  (12.5):  16%|█▋        | 8/49 [04:30<21:54, 32.07s/it]

None value in prediction


In [16]:
custom_evaluation_function(valset, optimized_llm, show_examples=5,
                           metric_for_evaluation="evaluate_expectations")

NameError: name 'optimized_llm' is not defined

In [None]:
llama3_8b.inspect_history(n = 2)

In [None]:
display(Markdown(optimized_llm(excerpt="Let's now turn to Judy, who is a Senior Vice President of Investor Relations. Judy is joining us from Italy.", country_keyword="Italy").answer))

In [None]:
# Markdown expects a string, so pass the prediction's answer field.
display(Markdown(optimized_llm(
    excerpt="Now, we turn to Judy who is joining us from Belgium. She is our lead representative for sales in that country.",
    country_keyword="belgium").answer))