# Optimization Code

In this sandbox, I optimize the prompts for the LLM chain using DSPy.


## Settings

This cell imports the required packages and the helper functions from the project's own modules, and defines the LLMs that the chain from the main module can run on.

In [1]:
import groq
import os
from dotenv import load_dotenv
import dspy
import numpy as np
from dspy.evaluate.metrics import answer_exact_match
import pandas as pd
from IPython.core.display import Markdown
from dspy import Example
from dspy.teleprompt import BootstrapFewShot
from dspy.teleprompt import BootstrapFewShotWithRandomSearch
from dspy.evaluate import Evaluate
from module_v002 import FullLLMChain
from optimize import passage_similarity_metric, custom_evaluation_function, similar_score_metric, evaluate_expectations_metric
from data.preprocess import create_dspy_examples_train_test_validation_sets
import json

load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")


llama3_8b = dspy.OllamaLocal(model = "llama3:8b",
                             temperature = 0,
                             max_tokens = 800)

gpt35turbo = dspy.OpenAI(model = "gpt-3.5-turbo-0125",
                         api_key = openai_api_key,
                         temperature = 0,
                         max_tokens = 800,
                         model_type = "chat")    


## Preprocess Data

In [2]:
data = pd.read_excel("data/300_snippets_transcripts_all_labeled_v002.xlsx")

In [3]:
train_set, test_set, validation_set = create_dspy_examples_train_test_validation_sets(
    data=data
)
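
The helper `create_dspy_examples_train_test_validation_sets` lives in `data/preprocess.py` and is not shown in this notebook. As a rough illustration of what a three-way split involves, here is a minimal sketch under stated assumptions: the function name `split_three_ways`, the 60/20/20 proportions, and the fixed seed are all hypothetical, and plain dicts stand in for the `dspy.Example` objects the real helper presumably returns.

```python
import random


def split_three_ways(rows, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle rows reproducibly and cut them into train/test/validation lists.

    The fractions are illustrative defaults, not the ones used by the project.
    """
    shuffled = list(rows)                      # copy so the input stays untouched
    random.Random(seed).shuffle(shuffled)      # seeded shuffle for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]   # whatever remains
    return train, test, validation


# Toy rows standing in for labeled transcript snippets.
rows = [{"id": i} for i in range(10)]
train, test, validation = split_three_ways(rows)
print(len(train), len(test), len(validation))
```

Keeping the three sets disjoint matters here because the optimizer below scores candidates on `test_set`, so only `validation_set` remains unseen for the final check.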

## Optimization

This section optimizes the prompts, then saves the optimized LLM chain together with the last 10 prompts sent to the LLM.

In [4]:
dspy.settings.configure(lm=gpt35turbo)
full_llm_chain = FullLLMChain()

In [5]:
config = dict(max_bootstrapped_demos=1, max_labeled_demos=2, num_candidate_programs = 3, num_threads = 2)
teleprompter = BootstrapFewShotWithRandomSearch(metric=evaluate_expectations_metric, **config)
optimized_llm = teleprompter.compile(full_llm_chain, trainset=train_set, valset=test_set)

Going to sample between 1 and 1 traces per predictor.
Will attempt to train 3 candidate sets.


Average Metric: 9.0 / 25  (36.0): 100%|██████████| 25/25 [01:14<00:00,  2.97s/it]


Average Metric: 9.0 / 25  (36.0%)
Score: 36.0 for set: [0, 0, 0]
New best score: 36.0 for seed -3
Scores so far: [36.0]
Best score: 36.0


Average Metric: 9.0 / 25  (36.0): 100%|██████████| 25/25 [01:07<00:00,  2.70s/it]


Average Metric: 9.0 / 25  (36.0%)
Score: 36.0 for set: [2, 2, 2]
Scores so far: [36.0, 36.0]
Best score: 36.0


  4%|▍         | 1/25 [00:05<02:07,  5.33s/it]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 12.5 / 25  (50.0): 100%|██████████| 25/25 [01:16<00:00,  3.06s/it]


Average Metric: 12.5 / 25  (50.0%)
Score: 50.0 for set: [2, 2, 2]
New best score: 50.0 for seed -1
Scores so far: [36.0, 36.0, 50.0]
Best score: 50.0
Average of max per entry across top 1 scores: 0.5
Average of max per entry across top 2 scores: 0.5
Average of max per entry across top 3 scores: 0.5
Average of max per entry across top 5 scores: 0.5
Average of max per entry across top 8 scores: 0.5
Average of max per entry across top 9999 scores: 0.5


  4%|▍         | 1/25 [00:06<02:32,  6.34s/it]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 11.5 / 25  (46.0): 100%|██████████| 25/25 [01:21<00:00,  3.26s/it]


Average Metric: 11.5 / 25  (46.0%)
Score: 46.0 for set: [2, 2, 2]
Scores so far: [36.0, 36.0, 50.0, 46.0]
Best score: 50.0
Average of max per entry across top 1 scores: 0.5
Average of max per entry across top 2 scores: 0.58
Average of max per entry across top 3 scores: 0.58
Average of max per entry across top 5 scores: 0.58
Average of max per entry across top 8 scores: 0.58
Average of max per entry across top 9999 scores: 0.58


  8%|▊         | 2/25 [00:12<02:20,  6.09s/it]


Bootstrapped 1 full traces after 3 examples in round 0.


Average Metric: 10.5 / 25  (42.0): 100%|██████████| 25/25 [01:17<00:00,  3.08s/it]


Average Metric: 10.5 / 25  (42.0%)
Score: 42.0 for set: [2, 2, 2]
Scores so far: [36.0, 36.0, 50.0, 46.0, 42.0]
Best score: 50.0
Average of max per entry across top 1 scores: 0.5
Average of max per entry across top 2 scores: 0.58
Average of max per entry across top 3 scores: 0.68
Average of max per entry across top 5 scores: 0.68
Average of max per entry across top 8 scores: 0.68
Average of max per entry across top 9999 scores: 0.68


 16%|█▌        | 4/25 [00:21<01:55,  5.48s/it]


Bootstrapped 1 full traces after 5 examples in round 0.


Average Metric: 11.0 / 25  (44.0): 100%|██████████| 25/25 [01:12<00:00,  2.90s/it]

Average Metric: 11.0 / 25  (44.0%)
Score: 44.0 for set: [2, 2, 2]
Scores so far: [36.0, 36.0, 50.0, 46.0, 42.0, 44.0]
Best score: 50.0
Average of max per entry across top 1 scores: 0.5
Average of max per entry across top 2 scores: 0.58
Average of max per entry across top 3 scores: 0.62
Average of max per entry across top 5 scores: 0.7
Average of max per entry across top 8 scores: 0.7
Average of max per entry across top 9999 scores: 0.7
6 candidate programs found.
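
The "Average of max per entry across top k scores" lines in the log above can be read as an oracle-style figure: for each evaluation example, take the best score achieved by any of the k highest-scoring candidate programs, then average over examples. The sketch below reproduces that calculation as I understand it from the log. It is not DSPy's own code, and the per-example scores are made up for illustration.

```python
def average_of_max_per_entry(score_data, k):
    """score_data: list of (overall_score, per_example_scores) tuples.

    Keeps the k candidates with the highest overall score, takes the best
    per-example score among them, and averages over the examples.
    """
    top_k = sorted(score_data, key=lambda item: item[0], reverse=True)[:k]
    per_example = zip(*(subscores for _, subscores in top_k))
    best_per_example = [max(entry) for entry in per_example]
    return sum(best_per_example) / len(best_per_example)


# Illustrative data: three candidates evaluated on four examples.
score_data = [
    (50.0, [1, 0, 1, 0]),
    (25.0, [0, 1, 0, 0]),
    (25.0, [1, 0, 0, 0]),
]

for k in (1, 2, 9999):
    print(f"top {k}: {average_of_max_per_entry(score_data, k)}")
```

A value that rises with k, as in the later rounds of the log (0.5 for the top 1 versus 0.7 for the top 5), suggests the candidates succeed on different examples, so the single best program does not cover everything the others get right.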





In [6]:
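# Capture the last 10 prompts sent to the LLM and store them as JSON
gpt_history = gpt35turbo.inspect_history(n=10)
with open("optimized_llm_chains/gpt_prompts_v004.json", "w") as file:
    json.dump(gpt_history, file, indent=4)

# Persist the optimized chain so it can be reloaded later
optimized_llm.save("optimized_llm_chains/optimized_gpt_chain_v001.json")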





---CONTEXT---
    You are an experienced financial analyst known for your ability to assess and interpret subtle indicators of financial stability and solvency in various countries.

    ---TASK---
    Your task is to assess whether or not the given text excerpt potentially reveals any expectation towards the solvency of the country mentioned.

    ---GUIDELINES---
    - Focus specifically on potential implications regarding the country's financial stability and ability to meet its obligations.
    - Financial stability and the country's ability to meet its obligations do not need to be discussed explicitly but can be inferred from the text.
    - Answer either 'yes' for relevant or 'no' for irrelevant.

---

Follow the following format.

Country Keyword: keyword that represents a country
Country Role: role of the country in the text excerpt of a finacial services company's earnings call transcript
Answer: one of: ['yes', 'no']

---

Country Keyword: stockholm
Country Role: Country

## Custom Evaluation

Here, I manually inspect individual predictions and prompts to check that the workflow behaves as intended.

In [None]:
custom_evaluation_function(validation_set=validation_set,
                           llm=full_llm_chain,
                           metric_for_evaluation="evaluate_expectations",
                           show_examples=2)

In [None]:
gpt35turbo.inspect_history(n = 2)

In [None]:
evaluate = Evaluate(devset=validation_set, metric=evaluate_expectations_metric, num_threads=4, display_progress=True, display_table=4)
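
The `Evaluate` object above is only constructed. It produces a score once it is called on a program, for example `evaluate(optimized_llm)`. Conceptually, that call runs the program on every example and averages the metric, which is where the "Average Metric: 12.5 / 25 (50.0%)" lines in the optimization log come from. A minimal sketch of that loop follows, assuming a toy program, metric, and dev set. All three are hypothetical stand-ins and not the project's `FullLLMChain` or `evaluate_expectations_metric`.

```python
def average_metric(devset, program, metric):
    """Return (sum of per-example scores, percentage) for program on devset."""
    total = 0.0
    for example in devset:
        prediction = program(example["text"])
        total += metric(example, prediction)
    return total, round(100 * total / len(devset), 1)


def toy_program(text):
    # Stand-in for the LLM chain: a naive keyword rule.
    return "yes" if "debt" in text else "no"


def toy_metric(example, prediction):
    # Stand-in metric: exact match on the yes/no label.
    return 1.0 if prediction == example["answer"] else 0.0


toy_devset = [
    {"text": "sovereign debt outlook worsens", "answer": "yes"},
    {"text": "new branch opened downtown", "answer": "no"},
    {"text": "rating agencies remain cautious", "answer": "yes"},
    {"text": "debt office holds a reception", "answer": "no"},
]

total, percent = average_metric(toy_devset, toy_program, toy_metric)
print(f"Average Metric: {total} / {len(toy_devset)}  ({percent}%)")
```

A metric may also return fractional values, which would explain totals such as 12.5 in the log above.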