# DSPY
DSPy is a framework for algorithmically optimizing LM prompts and weights. It can help you define your tasks more precisely and optimize your prompts for your specific use case.

Read more about it [here](https://dspy-docs.vercel.app/docs/intro).


Let's explore different components of DSPY with the help of the below examples.

## DSPY Components 

### Signatures 

A signature declares the inputs and outputs of your application/pipeline: it describes the task you want to solve along with the input and output fields of that task. Signatures can be defined in two ways:

- Inline Signature 
- Class Based Signature 
    
[Reference](https://dspy-docs.vercel.app/docs/building-blocks/signatures)
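To make the inline style more concrete, here is a small, library-free sketch of what an inline signature string such as `"context, question -> answer"` encodes. The `parse_inline_signature` helper is hypothetical and exists only for illustration; DSPy does its own parsing internally, and this is not its implementation.

```python
def parse_inline_signature(signature: str) -> dict:
    """Split an inline signature like 'context, question -> answer' into field names.

    Hypothetical helper for illustration only; not part of DSPy.
    """
    if "->" not in signature:
        raise ValueError("an inline signature needs '->' between inputs and outputs")
    inputs_part, outputs_part = signature.split("->", 1)
    inputs = [name.strip() for name in inputs_part.split(",") if name.strip()]
    outputs = [name.strip() for name in outputs_part.split(",") if name.strip()]
    return {"inputs": inputs, "outputs": outputs}


fields = parse_inline_signature("context, question -> answer")
print(fields)
```

A class-based signature states the same fields as class attributes, which leaves room for a docstring and per-field descriptions. The `GenerateAnswer` class defined later in this notebook is the class-based form of this signature.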

### LM and RM Pipeline 

LM modules/clients are the language models on which you want to base your pipeline (e.g. `gpt-3.5`, `gpt-4`, or any open-source LLM).

[Reference](https://dspy-docs.vercel.app/docs/building-blocks/language_models)

RM modules are optional and consist of a retrieval module that can provide necessary context to your LM for answer generation. 

[Reference](https://dspy-docs.vercel.app/docs/category/retrieval-model-clients)



### Metrics 

Metrics are the evaluation criteria for your pipeline. They can be used to evaluate the performance of your pipeline and optimize the pipeline based on your defined set of metrics.

[Reference](https://dspy-docs.vercel.app/docs/building-blocks/metrics)
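As a minimal sketch of what a metric can look like, the function below takes a gold example and a prediction and returns a score, using the same `(gold, pred, trace=None)` argument order as the LLM-based metric defined later in this notebook. The name `exact_match_metric` and the `SimpleNamespace` stand-ins are assumptions made for illustration; in a real pipeline the arguments would be DSPy example and prediction objects.

```python
from types import SimpleNamespace


def exact_match_metric(gold, pred, trace=None) -> float:
    """Return 1.0 when the predicted answer equals the gold answer,
    ignoring case and surrounding whitespace; otherwise 0.0."""
    return float(gold.answer.strip().lower() == pred.answer.strip().lower())


# Stand-in objects that only carry the attributes the metric reads.
gold = SimpleNamespace(question="Is Annexin V an apoptotic marker?", answer="Yes")
pred = SimpleNamespace(answer=" yes ")
score = exact_match_metric(gold, pred)
print(score)
```

Exact match is too strict for the long free-text answers in this dataset, which is why the notebook uses an LLM-based assessment instead.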

### Optimizer/Teleprompter

It is an algorithm designed to tune the prompts and signatures of your LM modules so that your pipeline performs better on your defined metrics.

[Reference](https://dspy-docs.vercel.app/docs/building-blocks/optimizers)


Let's see how we can leverage DSPY to optimize our prompts and generate better answers.

In [16]:
# Standard library
import json
import os
import random
from typing import Optional, List, Union

# Third-party
import pandas as pd
import dspy
from dotenv import load_dotenv
from dsp.utils import dotdict
from fastembed import TextEmbedding
from pydantic import BaseModel
from qdrant_client import QdrantClient
from qdrant_client.http import models
from tqdm import tqdm

In [None]:
load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")

### Load the Training and Testing Data

In [17]:
test_data = pd.read_csv('../data/QA/test.csv')
train_data = pd.read_csv('../data/QA/train.csv')
train_data.head()

Unnamed: 0,question,contexts,ground_truth,exact_answer
0,What is Snord116?,"['Further analysis with array-CGH identified a mosaic 847\u2009kb deletion in 15q11-q13, including SNURF-SNRPN, the snoRNA gene clusters SNORD116 (HBII-85), SNORD115, (HBII-52), SNORD109 A and B (HBII-438A and B), SNORD64 (HBII-13), and NPAP1 (C15ORF2).', 'All three deletions included SNORD116, but only two encompassed parts of SNURF-SNRPN, implicating SNORD116 as the major contributor to the Prader-Willi phenotype. Our case adds further information about genotype-phenotype correlation and s...","['SNORD116 is a small nucleolar (sno) RNA gene cluster (HBII-85) implicated as a major contributor the Prader-Willi phenotype. \nSNORD116 genes appears to be responsible for the major features of PWS. \nSNORD116 is a paternally expressed box C/D snoRNA gene cluster.\nThe mouse C/D box snoRNA MBII-85 (SNORD116) is processed into at least five shorter RNAs using processing sites near known functional elements of C/D box snoRNAs.\nSnord116 expression in the medial hypothalamus, particularly wit...",[]
1,Are ultraconserved elements often transcribed?,"['Starting from a genome-wide expression profiling, we demonstrate for the first time a functional link between oxygen deprivation and the modulation of long noncoding transcripts from ultraconserved regions, termed transcribed-ultraconserved regions (T-UCRs)', 'Our data gives a first glimpse of a novel functional hypoxic network comprising protein-coding transcripts and noncoding RNAs (ncRNAs) from the T-UCRs category', 'Highly conserved elements discovered in vertebrates are present in non...","['Yes. Especially, a large fraction of non-exonic UCEs is transcribed across all developmental stages examined from only one DNA strand.']",['yes']
2,List metalloenzyme inhibitors.,"[' Clinically approved inhibitors were selected as well as several other reported metalloprotein inhibitors in order to represent a broad range of metal binding groups (MBGs), including hydroxamic acid, carboxylate, hydroxypyridinonate, thiol, and N-hydroxyurea functional groups.', 'A total of 21 different raltegravir-chelator derivative (RCD) compounds were prepared that differed only in the nature of the MBG. ', 'At least two compounds (RCD-4, RCD-5) containing a hydroxypyrone MBG were fou...",['Foscarnet\nVT-1129\nVT-1161 \nBB-3497\nhydroxamate molecules\nsiderophores'],"['VT-1129', 'VT-1161', 'BB-3497', 'hydroxamate molecules', 'siderophores', 'Foscarnet']"
3,"Which protein phosphatase has been found to interact with the heat shock protein, HSP20?","[' Moreover, protein phosphatase-1 activity is regulated by two binding partners, inhibitor-1 and the small heat shock protein 20, Hsp20. Indeed, human genetic variants of inhibitor-1 (G147D) or Hsp20 (P20L) result in reduced binding and inhibition of protein phosphatase-1, suggesting aberrant enzymatic regulation in human carriers. ', 'Small heat shock protein 20 interacts with protein phosphatase-1 and enhances sarcoplasmic reticulum calcium cycling.', ' Hsp20 overexpression in intact anim...","['Protein phosphatase-1 activity is regulated by two binding partners, inhibitor-1 and the small heat shock protein 20, Hsp20. Cell fractionation, coimmunoprecipitation, and coimmunolocalization studies, revealed an association between Hsp20 and PP1. Small heat shock protein 20 interacts with protein phosphatase-1 and enhances sarcoplasmic reticulum calcium cycling.', 'Moreover, protein phosphatase-1 activity is regulated by two binding partners, inhibitor-1 and the small heat shock protein ...","['Protein phosphatase 1', 'PP1']"
4,Do DNA double-strand breaks play a causal role in carcinogenesis?,"['The DNA non-homologous end-joining repair gene XRCC6/Ku70 plays an important role in the repair of DNA double-strand breaks (DSBs) induced by both exogenous and endogenous DNA-damaging agents. Defects in overall DSB repair capacity can lead to genomic instability and carcinogenesis.', 'The tumor suppressor breast cancer susceptibility protein 1 (BRCA1) protects our cells from genomic instability in part by facilitating the efficient repair of DNA double-strand breaks (DSBs). BRCA1 promotes...","['Yes. It has been demonstrated that induction of DNA double-strand breaks (DSBs) and defects in overall DSBs repair capacity can lead to an accumulation of mutations, resulting in genomic instability of cells. Given that genomic instability is the hallmark of cancer, DSBs play a causal role in carcinogenesis.']",['yes']


Each question in the training data is associated with a `ground_truth` label. We will use these labels to train our model and optimize the prompts.

In [18]:
pd.set_option('display.max_colwidth', 500)
train_data[['question', 'contexts', 'ground_truth']]

Unnamed: 0,question,contexts,ground_truth
0,What is Snord116?,"['Further analysis with array-CGH identified a mosaic 847\u2009kb deletion in 15q11-q13, including SNURF-SNRPN, the snoRNA gene clusters SNORD116 (HBII-85), SNORD115, (HBII-52), SNORD109 A and B (HBII-438A and B), SNORD64 (HBII-13), and NPAP1 (C15ORF2).', 'All three deletions included SNORD116, but only two encompassed parts of SNURF-SNRPN, implicating SNORD116 as the major contributor to the Prader-Willi phenotype. Our case adds further information about genotype-phenotype correlation and s...","['SNORD116 is a small nucleolar (sno) RNA gene cluster (HBII-85) implicated as a major contributor the Prader-Willi phenotype. \nSNORD116 genes appears to be responsible for the major features of PWS. \nSNORD116 is a paternally expressed box C/D snoRNA gene cluster.\nThe mouse C/D box snoRNA MBII-85 (SNORD116) is processed into at least five shorter RNAs using processing sites near known functional elements of C/D box snoRNAs.\nSnord116 expression in the medial hypothalamus, particularly wit..."
1,Are ultraconserved elements often transcribed?,"['Starting from a genome-wide expression profiling, we demonstrate for the first time a functional link between oxygen deprivation and the modulation of long noncoding transcripts from ultraconserved regions, termed transcribed-ultraconserved regions (T-UCRs)', 'Our data gives a first glimpse of a novel functional hypoxic network comprising protein-coding transcripts and noncoding RNAs (ncRNAs) from the T-UCRs category', 'Highly conserved elements discovered in vertebrates are present in non...","['Yes. Especially, a large fraction of non-exonic UCEs is transcribed across all developmental stages examined from only one DNA strand.']"
2,List metalloenzyme inhibitors.,"[' Clinically approved inhibitors were selected as well as several other reported metalloprotein inhibitors in order to represent a broad range of metal binding groups (MBGs), including hydroxamic acid, carboxylate, hydroxypyridinonate, thiol, and N-hydroxyurea functional groups.', 'A total of 21 different raltegravir-chelator derivative (RCD) compounds were prepared that differed only in the nature of the MBG. ', 'At least two compounds (RCD-4, RCD-5) containing a hydroxypyrone MBG were fou...",['Foscarnet\nVT-1129\nVT-1161 \nBB-3497\nhydroxamate molecules\nsiderophores']
3,"Which protein phosphatase has been found to interact with the heat shock protein, HSP20?","[' Moreover, protein phosphatase-1 activity is regulated by two binding partners, inhibitor-1 and the small heat shock protein 20, Hsp20. Indeed, human genetic variants of inhibitor-1 (G147D) or Hsp20 (P20L) result in reduced binding and inhibition of protein phosphatase-1, suggesting aberrant enzymatic regulation in human carriers. ', 'Small heat shock protein 20 interacts with protein phosphatase-1 and enhances sarcoplasmic reticulum calcium cycling.', ' Hsp20 overexpression in intact anim...","['Protein phosphatase-1 activity is regulated by two binding partners, inhibitor-1 and the small heat shock protein 20, Hsp20. Cell fractionation, coimmunoprecipitation, and coimmunolocalization studies, revealed an association between Hsp20 and PP1. Small heat shock protein 20 interacts with protein phosphatase-1 and enhances sarcoplasmic reticulum calcium cycling.', 'Moreover, protein phosphatase-1 activity is regulated by two binding partners, inhibitor-1 and the small heat shock protein ..."
4,Do DNA double-strand breaks play a causal role in carcinogenesis?,"['The DNA non-homologous end-joining repair gene XRCC6/Ku70 plays an important role in the repair of DNA double-strand breaks (DSBs) induced by both exogenous and endogenous DNA-damaging agents. Defects in overall DSB repair capacity can lead to genomic instability and carcinogenesis.', 'The tumor suppressor breast cancer susceptibility protein 1 (BRCA1) protects our cells from genomic instability in part by facilitating the efficient repair of DNA double-strand breaks (DSBs). BRCA1 promotes...","['Yes. It has been demonstrated that induction of DNA double-strand breaks (DSBs) and defects in overall DSBs repair capacity can lead to an accumulation of mutations, resulting in genomic instability of cells. Given that genomic instability is the hallmark of cancer, DSBs play a causal role in carcinogenesis.']"
...,...,...,...
1654,Is thyroid hormone therapy indicated in patients with heart failure?,"['Patients with chronic heart failure and subclinical hypothyroidism significantly improved their physical performance when normal TSH levels were reached.', 'Early and sustained physiological restoration of circulating L-T3 levels after MI halves infarct scar size and prevents the progression towards heart failure. This beneficial effect is likely due to enhanced capillary formation and mitochondrial protection.', 'These data indicate that T(3) replacement to euthyroid levels improves systo...",['There are several experimental and clinical evidences of the potential benefits of Thyroid hormone replacement therapy in heart failure. Initial clinical data showed also a good safety profile and tolerance of TH replacement therapy in patients withheart failure. \nHowever currently there is no indication to treat patients with heart failure withTHreplacementtherapy.']
1655,Is protein Fbw7 a SCF type of E3 ubiquitin ligase?,"['FBW7 (F-box and WD repeat domain-containing 7) is the substrate recognition component of an evolutionary conserved SCF (complex of SKP1, CUL1 and F-box protein)-type ubiquitin ligase.', 'However, very few E3 ubiquitin ligases are known to target G-CSFR for ubiquitin-proteasome pathway. Here we identified F-box and WD repeat domain-containing 7 (Fbw7), a substrate recognizing component of Skp-Cullin-F box (SCF) E3 ubiquitin Ligase physically associates with G-CSFR and promotes its ubiquitin...","['Fbxw7 (also known as Fbw7, SEL-10, hCdc4, or hAgo) is the F-box protein subunit of an Skp1-Cul1-F-box protein (SCF)-type ubiquitin ligase complex that plays a central role in the degradation of Notch family members.The F-box protein Fbw7 (also known as Fbxw7, hCdc4 and Sel-10) functions as a substrate recognition component of a SCF-type E3 ubiquitin ligase. SCF(Fbw7) facilitates polyubiquitination and subsequent degradation of various proteins such as Notch, cyclin E, c-Myc and c-Jun.', 'T..."
1656,Is Annexin V an apoptotic marker?,"['The apoptosis of the MSCs was induced by subjecting the cells to OGD conditions for 4 h and was detected by Annexin V/PI and Hoechst 33258 staining. ', 'In addition to the antimicrobial activity, we found that treatment of the cancer cell lines, Jurkat T-cells, Granta cells, and melanoma cells, with the Pseudomonas sp. In5 crude extract increased staining with the apoptotic marker Annexin V while no staining of healthy normal cells, i.e., naïve or activated CD4 T-cells, was observed.', 'At...","['Yes, annexin V is an early apoptotic marker.', 'Yes, Annexin V is an apoptotic marker.']"
1657,Which are the clinical characteristics of Tuberous Sclerosis?,"['Prevalence and long-term outcome of epilepsy in tuberous sclerosis complex (TSC) is reported to be variable', 'Subependymal giant cell astrocytomas (SEGAs) are benign tumors, most commonly associated with tuberous sclerosis complex (TSC).', 'Lymphangioleiomyomatosis (LAM) is a rare, progressive, frequently lethal cystic lung disease that almost exclusively affects women.', 'Rhabdomyoma is the most common type of cardiac tumor in fetuses and is often associated with tuberous sclerosis compl...","['The clinical characteristics of Tuberous Sclerosis include epilepsy, subependymal giant cell astrocytomas, lymphangioleiomyomatosis, rhabdomyoma, renal angiomyolipomas, cortical tubers, neurofibromas, angiofibromas, mental retardation, and behavioral disorders.']"


In [19]:
test_data

Unnamed: 0,id,question
0,1,Which are the Proprotein Convertase Subtilisin Kexin 9 (PCSK9) inhibitors that are FDA approved?
1,2,What is the relationship between nucleosomes and exons?
2,3,What is the name for anorexia in gymnasts?
3,4,How is Lambert-Eaton myasthenic syndrome (LEMS) associated with small cell lung cancer?
4,5,Is Fibroblast Growth Factor 23 a phosphaturic hormone?
...,...,...
706,707,Is alternative splicing of apoptotic genes playing a role in the response to DNA or mitochondrial damage?
707,708,Which NADPH oxidase family member requires interaction with NOXO1 for function?
708,709,List genes that have been found mutated in CMT1A (Charcot-Marie-Tooth disease type 1 A).
709,710,What is BioASQ?


### Upload Contexts to Qdrant Vector Store

In [30]:
contexts = pd.read_csv('../data/QA/contexts.csv')
contexts.head()

Unnamed: 0,text
0,"Both 7SL genes and Alu elements are transcribed by RNA polymerase III, and we show here that the internal 7SL promoter lies within the Alu-like part of the 7SL gene"
1,"We performed a comparative analysis in vitro and in vivo of the antitumor effects of three different antibodies targeting different epitopes of ErbB2: Herceptin (trastuzumab), 2C4 (pertuzumab) and Erb-hcAb (human anti-ErbB2-compact antibody), a novel fully human compact antibody produced in our laboratory. Herein, we demonstrate that the growth of both androgen-dependent and independent prostate cancer cells was efficiently inhibited by Erb-hcAb. The antitumor effects induced by Erb-hcAb on ..."
2,"The weight-reducing property of molindone, a recently introduced antipsychotic drug, was tested in 9 hospitalized chronic schizophrenic patients. There was an average weight loss of 7.6 kg after 3 months on molindone; most of the loss occurred during the first month."
3,"Our study identifies a unique heterochromatin state marked by the presence of both H3.3 and H3K9me3, and establishes an important role for H3.3 in control of ERV retrotransposition in embryonic stem cells."
4,"Polyneuropathy, organomegaly, endocrinopathy, monoclonal gammopathy, and skin changes (POEMS) syndrome is an uncommon condition related to a paraneoplastic syndrome secondary to an underlying plasma cell disorder."


In [22]:
embedding_model = TextEmbedding("BAAI/bge-base-en-v1.5")
qdrant_client = QdrantClient(":memory:") # spin up a local instance if you require more advanced features
# qdrant_client = QdrantClient("http://localhost:6333") # uncomment if you want to use your local instance 
qdrant_client.recreate_collection('rag_contexts',vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE))
# Create and upload points to Qdrant
points = []
# contexts_sample = contexts.sample(100)
for idx, row in tqdm(contexts.iterrows(),total=contexts.shape[0]):
    point = models.PointStruct(
        id=idx,  # Use the dataframe index as the point ID
        vector=list(embedding_model.embed(row['text']))[0],  # embed() yields embeddings lazily; take the single embedding for this text
        payload={'id': idx , "text":row['text']}  # Store the row id and the passage text as the payload
    )
    points.append(point)
qdrant_client.upload_points(collection_name='rag_contexts', points=points)



### Custom Retriever that Searches the Contexts in the Qdrant Vector Store

In [23]:
# use any embedding model 
def generate_embeddings(text):
    return list(embedding_model.embed(text))[0]


class QdrantRetriever(dspy.Retrieve):
    def __init__(self,qdrant_collection_name,qdrant_client,k=10):
        super().__init__(k=k)
        self.client = qdrant_client
        self.collection_name = qdrant_collection_name

    def forward(self, query, k: Optional[int] = None):
        # When k is not passed, fall back to the k given at construction time
        # Generate embedding for the query
        query_embedding = generate_embeddings(query)
        search_results = self.client.search(
            collection_name=self.collection_name,
            query_vector=query_embedding,
            limit=k if k else self.k
        )
        passages = [result.payload['text'] for result in search_results]
        passages = [dotdict({"long_text": passage}) for passage in passages]
        return passages
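To build some intuition for what the vector search inside the retriever does, here is a library-free sketch of top-k retrieval by cosine similarity (the distance configured for the collection above). The toy vectors, `cosine_similarity`, and `top_k` are illustrative assumptions; Qdrant performs this search itself, far more efficiently, and this is not how it is implemented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=2):
    """stored is a list of (text, vector) pairs; return the k most similar texts."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy 3-dimensional "embeddings"; real embeddings here have 768 dimensions.
toy_store = [
    ("passage about PCSK9", [1.0, 0.0, 0.0]),
    ("passage about exons", [0.0, 1.0, 0.0]),
    ("passage about PCSK9 inhibitors", [0.9, 0.1, 0.0]),
]
results = top_k([1.0, 0.05, 0.0], toy_store, k=2)
print(results)
```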


In [25]:
turbo = dspy.OpenAI(model="gpt-3.5-turbo-0125",api_key=openai_api_key,max_tokens=1000)
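# NOTE: this connects to a Qdrant server at localhost:6333, where the 'rag_contexts' collection must already exist.
# If you uploaded the points with the in-memory client above, skip this line and reuse that client instead.
qdrant_client = QdrantClient(url="http://localhost:6333")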

rm = QdrantRetriever("rag_contexts",qdrant_client)

# Configure DSPy with a retrieval model (RM) and a language model (LM)
dspy.settings.configure(lm=turbo,rm=rm)


In [26]:
sample = test_data['question'].iloc[0]
dspy.Retrieve(k=10)(sample).passages

['Two proprotein convertase subtilisin/kexin type 9 (PCSK9) inhibitors, evolocumab and alirocumab, have recently been approved by both the Food and Drug Administration (FDA) and the European Medicines Agency (EMA) for the treatment of hypercholesterolemia.',
 'Food and Drug Administration approved the first two proprotein convertase subtilisin/kexin type 9 (PCSK9) inhibitors, alirocumab (Praluent®; Sanofi/ Regeneron) and evolocumab (Repatha®; Amgen), for use in patients with heterozygous and homozygous familial hypercholesterolemia and for patients intolerant of statins or those with a major risk of cardiovascular disease (CVD) but unable to lower their LDL cholesterol (LDL-C) to optimal levels with statins and ezetimibe.',
 'In 2015 the U.S. Food and Drug Administration approved the first two proprotein convertase subtilisin/kexin type 9 (PCSK9) inhibitors, alirocumab (Praluent®; Sanofi/ Regeneron) and evolocumab (Repatha®; Amgen), for use in patients with heterozygous and homozygous 

### Signature Definition for Q/A System

In [None]:
# Define the signature for the QA system
class GenerateAnswer(dspy.Signature):
    """Answer questions based on the context."""
    
    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField()

### RM(Retrieval Model) Pipeline Creation

In [27]:

# Define a Custom RAG Pipeline
class RAG(dspy.Module):
    def __init__(self, collection_name= "rag_contexts",num_passages=10):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)


    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)



In [28]:
uncompiled_rag = RAG()

In [None]:
uncompiled_rag(sample)

In [16]:
turbo.inspect_history(n=1)





Answer questions based on the context.

---

Follow the following format.

Context: may contain relevant facts

Question: ${question}

Reasoning: Let's think step by step in order to ${produce the answer}. We ...

Answer: ${answer}

---

Context:
[1] «Two proprotein convertase subtilisin/kexin type 9 (PCSK9) inhibitors, evolocumab and alirocumab, have recently been approved by both the Food and Drug Administration (FDA) and the European Medicines Agency (EMA) for the treatment of hypercholesterolemia.»
[2] «Food and Drug Administration approved the first two proprotein convertase subtilisin/kexin type 9 (PCSK9) inhibitors, alirocumab (Praluent®; Sanofi/ Regeneron) and evolocumab (Repatha®; Amgen), for use in patients with heterozygous and homozygous familial hypercholesterolemia and for patients intolerant of statins or those with a major risk of cardiovascular disease (CVD) but unable to lower their LDL cholesterol (LDL-C) to optimal levels with statins and ezetimibe.»
[3] «In 201

### Metrics Definition and Assessment Signatures

In [17]:
metricLM = dspy.OpenAI(model='gpt-4',api_key=openai_api_key,max_tokens=1000, model_type='chat')

# Signature for LLM assessments.
class Assess(dspy.Signature):
    """Assess the quality of an answer to a question."""
    context = dspy.InputField(desc="The context for answering the question.")
    assessed_question = dspy.InputField(desc="The evaluation criterion.")
    assessed_answer = dspy.InputField(desc="The answer to the question.")
    correct_answer = dspy.InputField(desc="The correct answer to the question.")
    assessment_answer = dspy.OutputField(desc="A rating between 1 and 5. Only output the rating and nothing else.")

def llm_metric(gold, pred, trace=None):
    predicted_answer = pred.answer
    gold_question = gold.question
    gold_answer = gold.answer

    detail = "Is the assessed answer detailed?"
    faithful = "Is the assessed text grounded in the context? Say no if it includes significant facts not in the context."
    correctness = f"Compare the given {predicted_answer} and {gold_answer} and assess how correct the answer is"

    with dspy.context(lm=metricLM):
        context = dspy.Retrieve(k=10)(gold_question).passages
        detail = dspy.ChainOfThought(Assess)(context="N/A", assessed_question=detail, assessed_answer=predicted_answer,correct_answer=gold_answer)
        faithful = dspy.ChainOfThought(Assess)(context=context, assessed_question=faithful, assessed_answer=predicted_answer,correct_answer=gold_answer)
        correctness = dspy.ChainOfThought(Assess)(context=context,assessed_question=correctness, assessed_answer=predicted_answer,correct_answer=gold_answer)

    print(f"Faithful: {faithful.assessment_answer}")
    print(f"Detail: {detail.assessment_answer}")
    print(f"Correctness: {correctness.assessment_answer}")
    
    total = float(detail.assessment_answer) + float(faithful.assessment_answer) + float(correctness.assessment_answer)
    return total / 5.0
    

The metric above is adapted from the following sources:
- [Reference_1](https://dspy-docs.vercel.app/docs/building-blocks/metrics#intermediate-using-ai-feedback-for-your-metric)
- [Reference_2](https://github.com/weaviate/recipes/blob/main/integrations/dspy/1.Getting-Started-with-RAG-in-DSPy.ipynb)
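One fragile spot in `llm_metric` is that it calls `float()` directly on the model's reply, which raises a `ValueError` if the model returns something like `Rating: 4` instead of a bare number. A defensive parsing helper could look like the sketch below. The name `parse_rating`, the regular expression, and the choice to fall back to the lowest rating are all assumptions for illustration, not part of the referenced sources.

```python
import re


def parse_rating(raw: str, default: float = 1.0) -> float:
    """Extract the first number from a model reply and clamp it to [1, 5].

    Falls back to `default` (an assumed choice: the lowest rating)
    when the reply contains no number at all.
    """
    match = re.search(r"\d+(?:\.\d+)?", raw)
    if match is None:
        return default
    return min(5.0, max(1.0, float(match.group())))


print(parse_rating("4"))
print(parse_rating("Rating: 3/5"))
print(parse_rating("I cannot rate this answer."))
```

Inside the metric, each `float(x.assessment_answer)` call could then be replaced with `parse_rating(x.assessment_answer)`.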

Let's format the data the way the DSPY modules expect it, and then use part of it for training and part for evaluation.

In [18]:
trainset_dspy = train_data.sample(frac=0.8)
valset_dspy = train_data.drop(trainset_dspy.index)

In [19]:
from ast import literal_eval
import dspy

def read_list_from_string(s):
    try:
        return literal_eval(s)
    except (ValueError, SyntaxError):
        return s.split() if isinstance(s, str) else []

def stringify_list_elements(lst):
    lst = read_list_from_string(lst)
    return " ".join(str(e) for e in lst)

trainset = [dspy.Example(question=row['question'],
                        #  contexts=stringify_list_elements(row['contexts']),
                         answer=stringify_list_elements(row['ground_truth']))
            .with_inputs("question") for i, row in trainset_dspy.iterrows()]

valset = [dspy.Example(question=row['question'],
                        # contexts=stringify_list_elements(row['contexts']),
                        answer=stringify_list_elements(row['ground_truth']))
           .with_inputs("question") for i, row in valset_dspy.iterrows()]

In [20]:
# For demonstration purposes, keep the devset to 20 examples. Remember to size it wisely, since every evaluation/training step is tied to API calls
devset = valset[:20]

In [None]:
from dspy.evaluate.evaluate import Evaluate

evaluate = Evaluate(devset=devset, num_threads=8, display_progress=True, display_table=5)
uncompile_k_10 = RAG(num_passages=10) 
uncompiled_10_metrics = evaluate(uncompile_k_10, metric=llm_metric,return_all_scores=True,
        return_outputs=True)

In [50]:
def create_score_dataframe(eval_output):
    # Extract questions and answers from the examples
    questions = [ex[0].question for ex in eval_output]
    answers = [ex[1].answer for ex in eval_output]
    scores = [ex[2] for ex in eval_output]
    # Create a DataFrame with questions, answers, and scores
    score_dataframe = pd.DataFrame({'question': questions, 'predicted_answer': answers, 'score': scores})
    return score_dataframe


In [55]:
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.max_rows', 500)
eval_outs = uncompiled_10_metrics[1]
eval_outs_df = create_score_dataframe(eval_outs)
print(f"Mean Score for the devset is {eval_outs_df['score'].mean()}")
eval_outs_df

Mean Score for the devset is 2.0


Unnamed: 0,question,predicted_answer,score
0,Can protein coding exons originate from ALU sequences?,Yes,2.2
1,List metalloenzyme inhibitors.,"VT-1129, VT-1161, hydroxamic acid, carboxylate, hydroxypyridinonate, thiol, N-hydroxyurea, BB-3497, 8-halo-4-(3-chloro-4-fluoro-phenylamino)-6-[(1H-[1,2,3]triazol-4-ylmethyl)-amino]-quinoline-3-carbonitriles, 2,2'-dipyridylamine (DPA), triazacyclononane (TACN), 8-hydroxyquinoline, thapsigargin.",1.8
2,Are long non coding RNAs as conserved in sequence as protein coding genes?,"No, long noncoding RNAs are not as conserved in sequence as protein-coding genes.",2.4
3,Is Fibroblast Growth Factor 23 a phosphaturic hormone?,Yes,2.0
4,Do Conserved noncoding elements act as enhancers?,"Yes, conserved noncoding elements act as enhancers.",2.2
5,What is the name of the stem loop present in the 3' end of genes encoding for selenoproteins?,The name of the stem loop present in the 3' end of genes encoding for selenoproteins is the Sec insertion sequence (SECIS) element or selenocysteine incorporating sequence (SECIS).,3.0
6,Are ultraconserved elements often transcribed?,Yes,1.8
7,What is being measured with an accelerometer in back pain patients,"The accelerometer is used to measure overall physical activity, time spent in different static trunk postures, standing time, lying time, nighttime activity data, daily activities by measuring body movement, time spent upright (standing or walking), time standing, time walking, and step count in patients with chronic lower back pain.",3.0
8,Can NXY-059 be used for treatment of acute ischemic stroke patients?,"NXY-059 has shown some potential for the treatment of acute ischemic stroke patients, but it is not conclusively effective and additional research is needed.",2.2
9,Which proteins control the degradation of cryptic unstable transcripts (CUTs) in yeast?,"The proteins that control the degradation of cryptic unstable transcripts (CUTs) in yeast are Nrd1, Nab3, and the TRAMP complex.",2.2


The mean score of the uncompiled RAG is 2.0. The maximum possible score is 3.0, because the metric sums three ratings (detail, faithfulness, correctness) of at most 5 each and divides the total by 5.
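The arithmetic behind this scale can be checked with a few lines. The sketch below mirrors the final step of `llm_metric` (three ratings between 1 and 5, summed and divided by 5). The `combined_score` name and the example ratings are hypothetical, and the normalization to a 0-1 range is an optional extra rather than something the notebook does.

```python
def combined_score(detail: float, faithful: float, correctness: float) -> float:
    """Sum three 1-5 ratings and divide by 5, as llm_metric does."""
    return (detail + faithful + correctness) / 5.0


lowest = combined_score(1, 1, 1)    # all three ratings at their minimum
highest = combined_score(5, 5, 5)   # all three ratings at their maximum
example = combined_score(4, 3, 4)   # hypothetical ratings for one answer

# Optional: rescale so that the best possible score is 1.0.
normalized = example / highest
print(lowest, highest, example, round(normalized, 3))
```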

### Prompt from the Uncompiled Model

In [56]:
turbo.inspect_history(n=1)





Answer questions based on the context.

---

Follow the following format.

Context: may contain relevant facts

Question: ${question}

Reasoning: Let's think step by step in order to ${produce the answer}. We ...

Answer: ${answer}

---

Context:
[1] «The metabolic syndrome is a complex association of several risk factors including insulin resistance, dyslipidemia, and essential hypertension.»
[2] «The metabolic syndrome is a constellation of risk factors including glucose dysregulation, central obesity, dyslipidemia, and hypertension.»
[3] «The metabolic syndrome is a clustering of risk factors which predispose an individual to cardiovascular morbidity and mortality. There is general consensus regarding the main components of the syndrome (glucose intolerance, obesity, raised blood pressure and dyslipidaemia [elevated triglycerides, low levels of high-density lipoprotein cholesterol])»
[4] «Metabolic syndrome is a constellation of interrelated risk factors of metabolic origin incl

In [57]:
# Let's check the metric LLM's prompt as well
metricLM.inspect_history(n=3)





Assess the quality of an answer to a question.

---

Follow the following format.

Context: The context for answering the question.

Assessed Question: The evaluation criterion.

Assessed Answer: The answer to the question.

Correct Answer: The correct answer to the question.

Reasoning: Let's think step by step in order to ${produce the assessment_answer}. We ...

Assessment Answer: A rating between 1 and 5. Only output the rating and nothing else.

---

Context:
[1] «The metabolic syndrome is a complex association of several risk factors including insulin resistance, dyslipidemia, and essential hypertension.»
[2] «The metabolic syndrome is a constellation of risk factors including glucose dysregulation, central obesity, dyslipidemia, and hypertension.»
[3] «The metabolic syndrome is a clustering of risk factors which predispose an individual to cardiovascular morbidity and mortality. There is general consensus regarding the main components of the syndrome (glucose intolerance, ob

In [64]:
import random

# 'trainset' is a plain list with no 'sample' method, so we define a helper to sample from it
def sample_from_list(lst, fraction):
    """Return a random sample containing `fraction` of the items in `lst`."""
    sample_size = int(len(lst) * fraction)
    return random.sample(lst, sample_size)

# Now we use the function to sample 2% of the dataset
trainset_truncated = sample_from_list(trainset, 0.02)
len(trainset_truncated)

26

### Optimizer: Bootstrap Random Search Optimization

In [65]:
from dspy.teleprompt import BootstrapFewShotWithRandomSearch
teleprompter = BootstrapFewShotWithRandomSearch(metric=llm_metric, 
                                                max_bootstrapped_demos=2,
                                                max_labeled_demos=4, 
                                                max_rounds=1,
                                                num_candidate_programs=2,
                                                num_threads=8)

few_shot_bootstrap_compiled_rag = teleprompter.compile(uncompile_k_10, trainset=trainset_truncated)

Going to sample between 1 and 2 traces per predictor.
Will attempt to train 2 candidate sets.


Average Metric: 4.0 / 2  (200.0):   8%|▊         | 2/26 [00:28<04:42, 11.77s/it]

Faithful: 5
Detail: 1
Correctness: 5
Faithful: 5
Detail: 2
Correctness: 2


Average Metric: 6.7 / 3  (223.3):  12%|█▏        | 3/26 [00:29<02:33,  6.68s/it]

Faithful: 5
Detail: 4
Correctness: 4.5


Average Metric: 8.7 / 4  (217.5):  15%|█▌        | 4/26 [00:30<01:40,  4.55s/it]

Faithful: 5
Detail: 2
Correctness: 3


Average Metric: 10.899999999999999 / 5  (218.0):  19%|█▉        | 5/26 [00:30<01:03,  3.00s/it]

Faithful: 4
Detail: 3
Correctness: 4


Average Metric: 12.7 / 6  (211.7):  23%|██▎       | 6/26 [00:31<00:44,  2.24s/it]              

Faithful: 5
Detail: 3
Correctness: 1


Average Metric: 14.899999999999999 / 7  (212.9):  27%|██▋       | 7/26 [00:37<01:08,  3.62s/it]

Faithful: 5
Detail: 3
Correctness: 3


Average Metric: 16.299999999999997 / 8  (203.7):  31%|███       | 8/26 [00:38<00:46,  2.56s/it]

Faithful: 5
Detail: 1
Correctness: 1


Average Metric: 20.099999999999998 / 10  (201.0):  38%|███▊      | 10/26 [00:51<01:03,  4.00s/it]

Faithful: 5
Detail: 3
Correctness: 3
Faithful: 5
Detail: 1
Correctness: 2


Average Metric: 21.9 / 11  (199.1):  42%|████▏     | 11/26 [00:55<01:02,  4.13s/it]              

Faithful: 5
Detail: 1
Correctness: 3


Average Metric: 24.299999999999997 / 12  (202.5):  46%|████▌     | 12/26 [00:56<00:44,  3.17s/it]

Faithful: 5
Detail: 3
Correctness: 4


Average Metric: 26.699999999999996 / 13  (205.4):  50%|█████     | 13/26 [00:57<00:31,  2.42s/it]

Faithful: 5
Detail: 3
Correctness: 4


Average Metric: 28.099999999999994 / 14  (200.7):  54%|█████▍    | 14/26 [00:59<00:29,  2.49s/it]

Faithful: 3
Detail: 2
Correctness: 2


Average Metric: 30.299999999999994 / 15  (202.0):  58%|█████▊    | 15/26 [01:02<00:28,  2.63s/it]

Faithful: 5
Detail: 3
Correctness: 3


Average Metric: 33.3 / 16  (208.1):  62%|██████▏   | 16/26 [01:02<00:19,  1.92s/it]              

Faithful: 5
Detail: 5
Correctness: 5


Average Metric: 35.3 / 17  (207.6):  65%|██████▌   | 17/26 [01:05<00:18,  2.07s/it]

Faithful: 5
Detail: 2
Correctness: 3


Average Metric: 36.099999999999994 / 18  (200.6):  69%|██████▉   | 18/26 [01:16<00:38,  4.75s/it]

Faithful: 2
Detail: 1
Correctness: 1


Average Metric: 38.99999999999999 / 19  (205.3):  73%|███████▎  | 19/26 [01:17<00:26,  3.73s/it] 

Faithful: 5
Detail: 5
Correctness: 4.5


Average Metric: 41.39999999999999 / 20  (207.0):  77%|███████▋  | 20/26 [01:22<00:24,  4.01s/it]

Faithful: 2
Detail: 5
Correctness: 5


Average Metric: 42.599999999999994 / 21  (202.9):  81%|████████  | 21/26 [01:24<00:16,  3.32s/it]

Faithful: 3
Detail: 2
Correctness: 1


Average Metric: 44.599999999999994 / 22  (202.7):  85%|████████▍ | 22/26 [01:25<00:10,  2.75s/it]

Faithful: 5
Detail: 2
Correctness: 3


Average Metric: 45.39999999999999 / 23  (197.4):  88%|████████▊ | 23/26 [01:30<00:10,  3.34s/it] 

Faithful: 2
Detail: 1
Correctness: 1


Average Metric: 47.39999999999999 / 24  (197.5):  92%|█████████▏| 24/26 [01:31<00:05,  2.78s/it]

Faithful: 4
Detail: 3
Correctness: 3


Average Metric: 49.599999999999994 / 25  (198.4):  96%|█████████▌| 25/26 [01:36<00:03,  3.30s/it]

Faithful: 5
Detail: 1
Correctness: 5


Average Metric: 50.39999999999999 / 26  (193.8): 100%|██████████| 26/26 [01:44<00:00,  4.01s/it] 


Faithful: 2
Detail: 1
Correctness: 1
Average Metric: 50.39999999999999 / 26  (193.8%)
Score: 193.85 for set: [0]
New best score: 193.85 for seed -3
Scores so far: [193.85]
Best score: 193.85


Average Metric: 2.2 / 1  (220.0):   4%|▍         | 1/26 [00:05<02:14,  5.36s/it]

Faithful: 4
Detail: 3
Correctness: 4


Average Metric: 4.0 / 2  (200.0):   8%|▊         | 2/26 [00:06<01:14,  3.10s/it]

Faithful: 5
Detail: 3
Correctness: 1


Average Metric: 6.0 / 3  (200.0):  12%|█▏        | 3/26 [00:11<01:24,  3.66s/it]

Faithful: 5
Detail: 2
Correctness: 3


Average Metric: 8.4 / 4  (210.0):  15%|█▌        | 4/26 [00:11<00:53,  2.43s/it]

Faithful: 5
Detail: 3
Correctness: 4


Average Metric: 10.600000000000001 / 5  (212.0):  19%|█▉        | 5/26 [00:29<02:46,  7.93s/it]

Faithful: 5
Detail: 3
Correctness: 3


Average Metric: 13.600000000000001 / 6  (226.7):  23%|██▎       | 6/26 [00:31<01:59,  5.97s/it]

Faithful: 5
Detail: 5
Correctness: 5


Average Metric: 16.0 / 7  (228.6):  27%|██▋       | 7/26 [00:34<01:35,  5.04s/it]              

Faithful: 5
Detail: 4
Correctness: 3


Average Metric: 18.2 / 8  (227.5):  31%|███       | 8/26 [00:35<01:07,  3.74s/it]

Faithful: 3
Detail: 5
Correctness: 3


Average Metric: 19.4 / 9  (215.6):  35%|███▍      | 9/26 [00:39<01:03,  3.71s/it]

Faithful: 3
Detail: 1
Correctness: 2


Average Metric: 22.299999999999997 / 10  (223.0):  38%|███▊      | 10/26 [00:39<00:42,  2.68s/it]

Faithful: 5
Detail: 5
Correctness: 4.5


Average Metric: 24.699999999999996 / 11  (224.5):  42%|████▏     | 11/26 [00:43<00:46,  3.13s/it]

Faithful: 5
Detail: 3
Correctness: 4


Average Metric: 27.099999999999994 / 12  (225.8):  46%|████▌     | 12/26 [00:44<00:32,  2.34s/it]

Faithful: 5
Detail: 3
Correctness: 4


Average Metric: 29.499999999999993 / 13  (226.9):  50%|█████     | 13/26 [00:46<00:27,  2.15s/it]

Faithful: 5
Detail: 4
Correctness: 3


Average Metric: 31.89999999999999 / 14  (227.9):  54%|█████▍    | 14/26 [00:54<00:48,  4.00s/it] 

Faithful: 5
Detail: 2
Correctness: 5


Average Metric: 34.29999999999999 / 15  (228.7):  58%|█████▊    | 15/26 [00:58<00:44,  4.02s/it]

Faithful: 5
Detail: 4
Correctness: 3


Average Metric: 36.69999999999999 / 16  (229.4):  62%|██████▏   | 16/26 [01:01<00:37,  3.78s/it]

Faithful: 5
Detail: 3
Correctness: 4


Average Metric: 38.69999999999999 / 17  (227.6):  65%|██████▌   | 17/26 [01:04<00:30,  3.36s/it]

Faithful: 5
Detail: 2
Correctness: 3


Average Metric: 40.89999999999999 / 18  (227.2):  69%|██████▉   | 18/26 [01:06<00:24,  3.00s/it]

Faithful: 5
Detail: 1
Correctness: 5


Average Metric: 43.89999999999999 / 19  (231.1):  73%|███████▎  | 19/26 [01:06<00:15,  2.22s/it]

Faithful: 5
Detail: 5
Correctness: 5


Average Metric: 45.69999999999999 / 20  (228.5):  77%|███████▋  | 20/26 [01:14<00:23,  3.84s/it]

Faithful: 4
Detail: 2
Correctness: 3


Average Metric: 48.29999999999999 / 21  (230.0):  81%|████████  | 21/26 [01:14<00:14,  2.92s/it]

Faithful: 4
Detail: 5
Correctness: 4


Average Metric: 50.49999999999999 / 22  (229.5):  85%|████████▍ | 22/26 [01:21<00:15,  3.94s/it]

Faithful: 5
Detail: 3
Correctness: 3


Average Metric: 52.699999999999996 / 23  (229.1):  88%|████████▊ | 23/26 [01:24<00:11,  3.70s/it]

Faithful: 5
Detail: 4
Correctness: 2


Average Metric: 54.699999999999996 / 24  (227.9):  92%|█████████▏| 24/26 [01:25<00:05,  2.75s/it]

Faithful: 4
Detail: 3
Correctness: 3


Average Metric: 56.699999999999996 / 25  (226.8):  96%|█████████▌| 25/26 [01:26<00:02,  2.40s/it]

Faithful: 5
Detail: 2
Correctness: 3


Average Metric: 59.3 / 26  (228.1): 100%|██████████| 26/26 [01:42<00:00,  3.93s/it]              


Faithful: 5
Detail: 3
Correctness: 5
Average Metric: 59.3 / 26  (228.1%)
Score: 228.08 for set: [4]
New best score: 228.08 for seed -2
Scores so far: [193.85, 228.08]
Best score: 228.08


  4%|▍         | 1/26 [00:00<00:09,  2.68it/s]

Faithful: 3
Detail: 5
Correctness: 3


  8%|▊         | 2/26 [00:06<01:21,  3.39s/it]


Faithful: 5
Detail: 3
Correctness: 1
Bootstrapped 2 full traces after 3 examples in round 0.


Average Metric: 1.8 / 1  (180.0):   4%|▍         | 1/26 [00:06<02:54,  7.00s/it]

Faithful: 5
Detail: 3
Correctness: 1


Average Metric: 3.8 / 2  (190.0):   8%|▊         | 2/26 [00:09<01:41,  4.23s/it]

Faithful: 5
Detail: 2
Correctness: 3


Average Metric: 6.8 / 3  (226.7):  12%|█▏        | 3/26 [00:10<01:08,  2.96s/it]

Faithful: 5
Detail: 5
Correctness: 5


Average Metric: 9.0 / 4  (225.0):  15%|█▌        | 4/26 [00:13<01:00,  2.74s/it]

Faithful: 3
Detail: 5
Correctness: 3


Average Metric: 11.4 / 5  (228.0):  19%|█▉        | 5/26 [00:20<01:30,  4.31s/it]

Faithful: 5
Detail: 3
Correctness: 4


Average Metric: 13.4 / 6  (223.3):  23%|██▎       | 6/26 [00:23<01:16,  3.81s/it]

Faithful: 4
Detail: 3
Correctness: 3


Average Metric: 15.8 / 7  (225.7):  27%|██▋       | 7/26 [00:24<00:54,  2.89s/it]

Faithful: 5
Detail: 3
Correctness: 4


Average Metric: 18.400000000000002 / 8  (230.0):  31%|███       | 8/26 [00:26<00:51,  2.84s/it]

Faithful: 5
Detail: 5
Correctness: 3


Average Metric: 21.000000000000004 / 9  (233.3):  35%|███▍      | 9/26 [00:30<00:52,  3.09s/it]

Faithful: 5
Detail: 4
Correctness: 4


Average Metric: 23.400000000000002 / 10  (234.0):  38%|███▊      | 10/26 [00:31<00:37,  2.32s/it]

Faithful: 5
Detail: 3
Correctness: 4


Average Metric: 25.6 / 11  (232.7):  42%|████▏     | 11/26 [00:33<00:35,  2.35s/it]              

Faithful: 5
Detail: 3
Correctness: 3


Average Metric: 28.0 / 12  (233.3):  46%|████▌     | 12/26 [00:36<00:37,  2.67s/it]

Faithful: 5
Detail: 3
Correctness: 4


Average Metric: 32.0 / 14  (228.6):  54%|█████▍    | 14/26 [00:39<00:21,  1.78s/it]

Faithful: 5
Detail: 3
Correctness: 3
Faithful: 4
Detail: 2
Correctness: 3


Average Metric: 34.8 / 15  (232.0):  58%|█████▊    | 15/26 [00:42<00:26,  2.42s/it]

Faithful: 5
Detail: 4
Correctness: 5


Average Metric: 37.199999999999996 / 16  (232.5):  62%|██████▏   | 16/26 [00:46<00:26,  2.66s/it]

Faithful: 2
Detail: 5
Correctness: 5


Average Metric: 38.599999999999994 / 17  (227.1):  65%|██████▌   | 17/26 [00:46<00:18,  2.06s/it]

Faithful: 3
Detail: 2
Correctness: 2


Average Metric: 40.599999999999994 / 18  (225.6):  69%|██████▉   | 18/26 [00:48<00:15,  1.99s/it]

Faithful: 5
Detail: 2
Correctness: 3


Average Metric: 42.99999999999999 / 19  (226.3):  73%|███████▎  | 19/26 [00:50<00:12,  1.85s/it] 

Faithful: 5
Detail: 4
Correctness: 3


Average Metric: 44.99999999999999 / 20  (225.0):  77%|███████▋  | 20/26 [00:50<00:08,  1.43s/it]

Faithful: 5
Detail: 2
Correctness: 3


Average Metric: 46.599999999999994 / 21  (221.9):  81%|████████  | 21/26 [00:55<00:11,  2.40s/it]

Faithful: 3
Detail: 2
Correctness: 3


Average Metric: 49.599999999999994 / 22  (225.5):  85%|████████▍ | 22/26 [00:57<00:09,  2.36s/it]

Faithful: 5
Detail: 5
Correctness: 5


Average Metric: 52.599999999999994 / 23  (228.7):  88%|████████▊ | 23/26 [01:07<00:13,  4.49s/it]

Faithful: 5
Detail: 5
Correctness: 5


Average Metric: 54.8 / 24  (228.3):  92%|█████████▏| 24/26 [01:08<00:07,  3.54s/it]              

Faithful: 5
Detail: 3
Correctness: 3


Average Metric: 57.4 / 25  (229.6):  96%|█████████▌| 25/26 [01:11<00:03,  3.37s/it]

Faithful: 5
Detail: 3
Correctness: 5


Average Metric: 59.4 / 26  (228.5): 100%|██████████| 26/26 [01:23<00:00,  3.21s/it]


Faithful: 5
Detail: 2
Correctness: 3
Average Metric: 59.4 / 26  (228.5%)
Score: 228.46 for set: [4]
New best score: 228.46 for seed -1
Scores so far: [193.85, 228.08, 228.46]
Best score: 228.46
Average of max per entry across top 1 scores: 2.2846153846153845
Average of max per entry across top 2 scores: 2.519230769230769
Average of max per entry across top 3 scores: 2.5576923076923075
Average of max per entry across top 5 scores: 2.5576923076923075
Average of max per entry across top 8 scores: 2.5576923076923075
Average of max per entry across top 9999 scores: 2.5576923076923075


  4%|▍         | 1/26 [00:05<02:20,  5.63s/it]

Faithful: 5
Detail: 3
Correctness: 4


  8%|▊         | 2/26 [00:34<06:52, 17.17s/it]


Faithful: 3
Detail: 2
Correctness: 3
Bootstrapped 2 full traces after 3 examples in round 0.


Average Metric: 2.0 / 1  (200.0):   4%|▍         | 1/26 [00:08<03:38,  8.75s/it]

Faithful: 5
Detail: 2
Correctness: 3


Average Metric: 4.2 / 2  (210.0):   8%|▊         | 2/26 [00:09<01:33,  3.88s/it]

Faithful: 4
Detail: 3
Correctness: 4


Average Metric: 5.800000000000001 / 3  (193.3):  12%|█▏        | 3/26 [00:10<01:06,  2.90s/it]

Faithful: 3
Detail: 2
Correctness: 3


Average Metric: 8.200000000000001 / 4  (205.0):  15%|█▌        | 4/26 [00:20<02:01,  5.50s/it]

Faithful: 5
Detail: 3
Correctness: 4


Average Metric: 10.600000000000001 / 5  (212.0):  19%|█▉        | 5/26 [00:22<01:26,  4.14s/it]

Faithful: 5
Detail: 3
Correctness: 4


Average Metric: 13.200000000000001 / 6  (220.0):  23%|██▎       | 6/26 [00:23<01:05,  3.30s/it]

Faithful: 5
Detail: 3
Correctness: 5


Average Metric: 14.4 / 7  (205.7):  27%|██▋       | 7/26 [00:26<00:59,  3.16s/it]              

Faithful: 3
Detail: 1
Correctness: 2


Average Metric: 16.8 / 8  (210.0):  31%|███       | 8/26 [00:28<00:46,  2.61s/it]

Faithful: 5
Detail: 3
Correctness: 4


Average Metric: 18.8 / 9  (208.9):  35%|███▍      | 9/26 [00:28<00:32,  1.92s/it]

Faithful: 5
Detail: 2
Correctness: 3


Average Metric: 21.8 / 10  (218.0):  38%|███▊      | 10/26 [00:29<00:25,  1.59s/it]

Faithful: 5
Detail: 5
Correctness: 5


Average Metric: 24.0 / 11  (218.2):  42%|████▏     | 11/26 [00:32<00:28,  1.92s/it]

Faithful: 5
Detail: 3
Correctness: 3


Average Metric: 26.4 / 12  (220.0):  46%|████▌     | 12/26 [00:33<00:24,  1.72s/it]

Faithful: 5
Detail: 3
Correctness: 4


Average Metric: 29.4 / 13  (226.2):  50%|█████     | 13/26 [00:34<00:18,  1.45s/it]

Faithful: 5
Detail: 5
Correctness: 5


Average Metric: 31.799999999999997 / 14  (227.1):  54%|█████▍    | 14/26 [00:35<00:15,  1.32s/it]

Faithful: 5
Detail: 3
Correctness: 4


Average Metric: 35.8 / 16  (223.7):  62%|██████▏   | 16/26 [00:38<00:14,  1.41s/it]              

Faithful: 4
Detail: 2
Correctness: 3
Faithful: 5
Detail: 3
Correctness: 3


Average Metric: 38.199999999999996 / 17  (224.7):  65%|██████▌   | 17/26 [00:43<00:22,  2.53s/it]

Faithful: 5
Detail: 3
Correctness: 4


Average Metric: 40.199999999999996 / 18  (223.3):  69%|██████▉   | 18/26 [00:48<00:24,  3.02s/it]

Faithful: 5
Detail: 2
Correctness: 3


Average Metric: 42.8 / 19  (225.3):  73%|███████▎  | 19/26 [00:54<00:29,  4.20s/it]              

Faithful: 5
Detail: 3
Correctness: 5


Average Metric: 45.0 / 20  (225.0):  77%|███████▋  | 20/26 [00:56<00:20,  3.36s/it]

Faithful: 5
Detail: 1
Correctness: 5


Average Metric: 46.6 / 21  (221.9):  81%|████████  | 21/26 [00:56<00:12,  2.45s/it]

Faithful: 3
Detail: 2
Correctness: 3


Average Metric: 48.6 / 22  (220.9):  85%|████████▍ | 22/26 [01:00<00:10,  2.73s/it]

Faithful: 5
Detail: 2
Correctness: 3


Average Metric: 51.2 / 23  (222.6):  88%|████████▊ | 23/26 [01:00<00:06,  2.07s/it]

Faithful: 4
Detail: 5
Correctness: 4


Average Metric: 53.2 / 24  (221.7):  92%|█████████▏| 24/26 [01:02<00:03,  2.00s/it]

Faithful: 5
Detail: 2
Correctness: 3


Average Metric: 56.2 / 25  (224.8):  96%|█████████▌| 25/26 [01:02<00:01,  1.50s/it]

Faithful: 5
Detail: 5
Correctness: 5


Average Metric: 58.2 / 26  (223.8): 100%|██████████| 26/26 [01:08<00:00,  2.62s/it]


Faithful: 4
Detail: 3
Correctness: 3
Average Metric: 58.2 / 26  (223.8%)
Score: 223.85 for set: [4]
Scores so far: [193.85, 228.08, 228.46, 223.85]
Best score: 228.46
Average of max per entry across top 1 scores: 2.2846153846153845
Average of max per entry across top 2 scores: 2.519230769230769
Average of max per entry across top 3 scores: 2.5923076923076924
Average of max per entry across top 5 scores: 2.6230769230769226
Average of max per entry across top 8 scores: 2.6230769230769226
Average of max per entry across top 9999 scores: 2.6230769230769226


  4%|▍         | 1/26 [00:04<01:58,  4.76s/it]


Faithful: 5
Detail: 1
Correctness: 5
Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 2.2 / 1  (220.0):   4%|▍         | 1/26 [00:05<02:22,  5.69s/it]

Faithful: 4
Detail: 3
Correctness: 4


Average Metric: 4.0 / 2  (200.0):   8%|▊         | 2/26 [00:06<01:13,  3.05s/it]

Faithful: 5
Detail: 3
Correctness: 1


Average Metric: 6.0 / 3  (200.0):  12%|█▏        | 3/26 [00:08<00:58,  2.55s/it]

Faithful: 5
Detail: 2
Correctness: 3


Average Metric: 8.4 / 4  (210.0):  15%|█▌        | 4/26 [00:15<01:35,  4.35s/it]

Faithful: 5
Detail: 3
Correctness: 4


Average Metric: 10.600000000000001 / 5  (212.0):  19%|█▉        | 5/26 [00:25<02:08,  6.11s/it]

Faithful: 5
Detail: 3
Correctness: 3


Average Metric: 16.6 / 7  (237.1):  27%|██▋       | 7/26 [00:28<01:04,  3.40s/it]              

Faithful: 5
Detail: 5
Correctness: 5
Faithful: 5
Detail: 5
Correctness: 5


Average Metric: 19.6 / 8  (245.0):  31%|███       | 8/26 [00:31<01:02,  3.48s/it]

Faithful: 5
Detail: 5
Correctness: 5


Average Metric: 22.6 / 9  (251.1):  35%|███▍      | 9/26 [00:34<00:54,  3.18s/it]

Faithful: 5
Detail: 5
Correctness: 5


Average Metric: 25.0 / 10  (250.0):  38%|███▊      | 10/26 [00:34<00:38,  2.39s/it]

Faithful: 5
Detail: 3
Correctness: 4


Average Metric: 27.6 / 11  (250.9):  42%|████▏     | 11/26 [00:35<00:27,  1.83s/it]

Faithful: 5
Detail: 4
Correctness: 4


Average Metric: 29.400000000000002 / 12  (245.0):  46%|████▌     | 12/26 [00:38<00:32,  2.34s/it]

Faithful: 4
Detail: 2
Correctness: 3


Average Metric: 31.8 / 13  (244.6):  50%|█████     | 13/26 [00:40<00:25,  1.98s/it]              

Faithful: 5
Detail: 4
Correctness: 3


Average Metric: 34.2 / 14  (244.3):  54%|█████▍    | 14/26 [00:40<00:17,  1.47s/it]

Faithful: 5
Detail: 3
Correctness: 4


Average Metric: 36.0 / 15  (240.0):  58%|█████▊    | 15/26 [00:47<00:34,  3.12s/it]

Faithful: 4
Detail: 2
Correctness: 3


Average Metric: 39.0 / 16  (243.8):  62%|██████▏   | 16/26 [00:47<00:23,  2.32s/it]

Faithful: 5
Detail: 5
Correctness: 5


Average Metric: 41.4 / 17  (243.5):  65%|██████▌   | 17/26 [00:51<00:23,  2.66s/it]

Faithful: 5
Detail: 4
Correctness: 3


Average Metric: 43.4 / 18  (241.1):  69%|██████▉   | 18/26 [00:51<00:15,  1.97s/it]

Faithful: 5
Detail: 2
Correctness: 3


Average Metric: 45.6 / 19  (240.0):  73%|███████▎  | 19/26 [00:52<00:11,  1.59s/it]

Faithful: 5
Detail: 1
Correctness: 5


Average Metric: 47.6 / 20  (238.0):  77%|███████▋  | 20/26 [00:54<00:10,  1.76s/it]

Faithful: 5
Detail: 2
Correctness: 3


Average Metric: 52.6 / 22  (239.1):  81%|████████  | 21/26 [01:01<00:16,  3.34s/it]

Faithful: 5
Detail: 5
Correctness: 5
Faithful: 5
Detail: 2
Correctness: 3


Average Metric: 57.800000000000004 / 24  (240.8):  92%|█████████▏| 24/26 [01:06<00:04,  2.20s/it]

Faithful: 5
Detail: 3
Correctness: 3
Faithful: 5
Detail: 5
Correctness: 5


Average Metric: 62.400000000000006 / 26  (240.0): 100%|██████████| 26/26 [01:14<00:00,  2.87s/it]

Faithful: 4
Detail: 4
Correctness: 3
Faithful: 5
Detail: 3
Correctness: 4
Average Metric: 62.400000000000006 / 26  (240.0%)
Score: 240.0 for set: [4]
New best score: 240.0 for seed 1
Scores so far: [193.85, 228.08, 228.46, 223.85, 240.0]
Best score: 240.0
Average of max per entry across top 1 scores: 2.4000000000000004
Average of max per entry across top 2 scores: 2.6076923076923078
Average of max per entry across top 3 scores: 2.665384615384615
Average of max per entry across top 5 scores: 2.730769230769231
Average of max per entry across top 8 scores: 2.730769230769231
Average of max per entry across top 9999 scores: 2.730769230769231
5 candidate programs found.
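The log reports one score per candidate program, plus lines of the form "Average of max per entry across top k scores". One plausible reading of those lines is: rank the candidates by their mean score, keep the best k, and for each example take the highest score any of those k programs reached. The sketch below follows that reading. The function name and the sample scores are hypothetical and are not taken from the run above.

```python
from typing import List


def average_of_max_per_entry(subscores_per_program: List[List[float]], k: int) -> float:
    """Average, over examples, of the best score reached by any of the top-k programs."""
    # Rank programs by their mean score, best first.
    ranked = sorted(subscores_per_program, key=lambda s: sum(s) / len(s), reverse=True)
    top_k = ranked[:k]
    # For each example (column), keep the best score across the selected programs.
    best_per_example = [max(column) for column in zip(*top_k)]
    return sum(best_per_example) / len(best_per_example)


# Hypothetical per-example scores: three candidate programs, four examples each.
candidates = [
    [2.2, 1.8, 2.7, 2.0],
    [2.4, 2.0, 2.2, 3.0],
    [1.0, 2.6, 2.4, 2.2],
]

for k in (1, 2, 3):
    print(k, average_of_max_per_entry(candidates, k))
```

With k = 1 the value is just the best program's mean. It can only stay level or rise as k grows, which matches the shape of the logged numbers.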





In [66]:
# Let's check the prompt for this compiled model
turbo.inspect_history(n=1)





Answer questions based on the context.

---

Question: Are CD44 variants (CD44v) associated with poor prognosis of metastasis?
Answer: Yes, several isoforms (obtained by by usage of ten variant exons in various combinations) have been causally related to metastasis.

Question: Does triiodothyronine stimulate red blood cell sodium potassium pump?
Answer: An inverse correlation between this enzymatic action and free triiodothyronine (FT3) levels. The effect of triiodothyronine (T3) on Na+,K(+)-ATPase activity in red blood cells may be different in vivo and in vitro.

Question: What systems have been developed for the numbering of antibody residues?
Answer: The most prevalent antibody numbering systems are the Kabat system, the Chothia system as well as the IMGT numbering system.

---

Follow the following format.

Context: may contain relevant facts

Question: ${question}

Reasoning: Let's think step by step in order to ${produce the answer}. We ...

Answer: ${answer}

---

Context:


Notice how the compiled prompt has become more specific: it now includes few-shot question-answer demonstrations ahead of the format instructions. Let's now evaluate it on the `devset` we created and see how the model performs.
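To make the difference between the two prompts concrete, here is a minimal sketch of how demonstrations can be placed between the instruction and the format description, separated by `---` as in the printed prompts. The helper `build_few_shot_prompt` is hypothetical and is not DSPy's internal prompt builder. The format block is shortened for illustration.

```python
from typing import List, Tuple


def build_few_shot_prompt(instruction: str, demos: List[Tuple[str, str]], format_block: str) -> str:
    """Join the instruction, optional demos, and format description with '---' separators."""
    sections = [instruction]
    if demos:
        # Each demo is rendered as a Question/Answer pair, as in the compiled prompt.
        sections.append("\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in demos))
    sections.append(format_block)
    return "\n\n---\n\n".join(sections)


instruction = "Answer questions based on the context."
format_block = (
    "Follow the following format.\n\n"
    "Context: may contain relevant facts\n\n"
    "Question: ${question}\n\n"
    "Answer: ${answer}"
)
demos = [
    (
        "What systems have been developed for the numbering of antibody residues?",
        "The most prevalent antibody numbering systems are the Kabat system, "
        "the Chothia system as well as the IMGT numbering system.",
    )
]

zero_shot = build_few_shot_prompt(instruction, [], format_block)
few_shot = build_few_shot_prompt(instruction, demos, format_block)
print(few_shot)
```

The zero-shot variant corresponds to the uncompiled prompt shape, and the few-shot variant to the compiled one.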

In [70]:
few_shot_bootstrap_compiled_rag_evals = evaluate(few_shot_bootstrap_compiled_rag, metric=llm_metric, return_all_scores=True, return_outputs=True)

  0%|          | 0/20 [00:00<?, ?it/s]

Average Metric: 2.2 / 1  (220.0):   5%|▌         | 1/20 [00:13<04:23, 13.85s/it]

Faithful: 5
Detail: 3
Correctness: 3


Average Metric: 5.2 / 2  (260.0):  10%|█         | 2/20 [00:14<01:45,  5.88s/it]

Faithful: 5
Detail: 5
Correctness: 5


Average Metric: 9.6 / 4  (240.0):  15%|█▌        | 3/20 [00:14<00:56,  3.35s/it]

Faithful: 5Faithful: 3
Detail: 4
Correctness: 3

Detail: 3
Correctness: 4


Average Metric: 14.7 / 6  (245.0):  25%|██▌       | 5/20 [00:14<00:23,  1.54s/it]

Faithful: 5Faithful: 5
Detail: 5
Correctness: 4.5

Detail: 3
Correctness: 3


Average Metric: 19.299999999999997 / 8  (241.2):  35%|███▌      | 7/20 [00:15<00:12,  1.06it/s]

Faithful: 5Faithful: 5
Detail: 3
Correctness: 4

Detail: 3
Correctness: 3


Average Metric: 22.899999999999995 / 10  (229.0):  45%|████▌     | 9/20 [00:20<00:18,  1.65s/it]

Faithful: 5Faithful: 1
Detail: 2
Correctness: 3

Detail: 3
Correctness: 4


Average Metric: 27.899999999999995 / 12  (232.5):  60%|██████    | 12/20 [00:21<00:08,  1.02s/it]

Faithful: 5
Detail: 3
Correctness: 4
Faithful: 5
Detail: 4
Correctness: 4


Average Metric: 34.5 / 15  (230.0):  70%|███████   | 14/20 [00:22<00:04,  1.49it/s]              

Faithful: 5
Detail: 4
Correctness: 3
Faithful: 5
Detail: 2
Correctness: 3
Faithful: 4
Detail: 3
Correctness: 4


Average Metric: 36.9 / 16  (230.6):  80%|████████  | 16/20 [00:22<00:01,  2.27it/s]

Faithful: 5
Detail: 4
Correctness: 3


Average Metric: 39.3 / 17  (231.2):  85%|████████▌ | 17/20 [00:23<00:01,  1.70it/s]

Faithful: 5
Detail: 3
Correctness: 4


Average Metric: 45.3 / 20  (226.5): 100%|██████████| 20/20 [00:24<00:00,  1.21s/it]

Faithful: 5
Detail: 2
Correctness: 3
Faithful: 5
Detail: 2
Correctness: 3
Faithful: 5
Detail: 2
Correctness: 3
Average Metric: 45.3 / 20  (226.5%)





Unnamed: 0,question,example_answer,context,pred_answer,llm_metric
0,Are ultraconserved elements often transcribed?,"Yes. Especially, a large fraction of non-exonic UCEs is transcribed across all developmental stages examined from only one DNA strand.","['Ultraconserved elements in insect genomes', 'ultraconserved elements in plants tend to occur in clusters and locate at noncoding regions', 'Transcribed ultraconserved region in human cancers.',...","Yes, some ultraconserved elements, specifically a subset known as transcribed ultraconserved regions (T-UCRs), are often transcribed.",2.2
1,List metalloenzyme inhibitors.,Foscarnet VT-1129 VT-1161 BB-3497 hydroxamate molecules siderophores,"['the metalloenzyme inhibitors VT-1129 and VT-1161 (both Viamet Pharmaceuticals)', 'Clinically approved inhibitors were selected as well as several other reported metalloprotein inhibitors in order to...","The metalloenzyme inhibitors mentioned in the context are VT-1129, VT-1161, the N-formyl-hydroxylamine derivative BB-3497, 8-halo-4-(3-chloro-4-fluoro-phenylamino)-6-[(1H-[1,2,3]triazol-4-ylmethyl)-amino]-quinoline-3-carbonitriles, 2,2'-dipyridylamine (DPA), triazacyclononane (TACN), and the 8-hydroxyquinoline.",2.0
2,Which proteins control the degradation of cryptic unstable transcripts (CUTs) in yeast?,Termination of cryptic unstable transcripts is directed by yeast RNA-binding proteins Nrd1 and Nab3. These cryptic unstable transcripts (CUTs) are rapidly degraded by the nuclear...,"['Cryptic unstable transcripts (CUTs) were recently described as a principal class of RNA polymerase II transcripts in Saccharomyces cerevisiae', 'Cryptic unstable transcripts (CUTs) were recently...","The proteins that control the degradation of cryptic unstable transcripts (CUTs) in yeast are the Nrd1-exosome-TRAMP complexes, the RNA-binding proteins Nrd1 and Nab3, and the...",2.4
3,Can protein coding exons originate from ALU sequences?,Yes. Intronic ALUs can evolve into exons by the activation of splice signals residing within the ALU sequence. While most ALU exons do not add...,"['Transcribed Alu sequences can alter splicing patterns by generating new exons', 'Transcribed Alu sequences can alter splicing patterns by generating new exons, but other impacts...","Yes, protein coding exons can originate from ALU sequences.",2.2
4,Is Fibroblast Growth Factor 23 a phosphaturic hormone?,"Yes, fbroblast growth factor 23 (FGF23) is a phosphaturic hormone.","['fibroblast growth factor-23 (FGF23), a bone-derived phosphaturic hormone.', 'serum levels of a phosphaturic hormone, fibroblast growth factor 23 (Fgf23),', 'circulating phosphaturic hormone fibroblast growth factor-23...","Yes, Fibroblast Growth Factor 23 is a phosphaturic hormone.",2.4


In [75]:
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.max_rows', 500)
evals_outs_df = create_score_dataframe(few_shot_bootstrap_compiled_rag_evals[1])
print(evals_outs_df['score'].mean())
evals_outs_df

2.265


Unnamed: 0,question,predicted_answer,score
0,Can protein coding exons originate from ALU sequences?,"Yes, protein coding exons can originate from ALU sequences.",2.2
1,What is the name of the stem loop present in the 3' end of genes encoding for selenoproteins?,The stem loop present in the 3' end of genes encoding for selenoproteins is called the Sec insertion sequence (SECIS) element.,3.0
2,List metalloenzyme inhibitors.,"The metalloenzyme inhibitors mentioned in the context are VT-1129, VT-1161, the N-formyl-hydroxylamine derivative BB-3497, 8-halo-4-(3-chloro-4-fluoro-phenylamino)-6-[(1H-[1,2,3]triazol-4-ylmethyl)-amino]-quinoline-3-carbonitriles, 2,2'-dipyridylamine (DPA), triazacyclononane (TACN), and the 8-hydroxyquinoline.",2.0
3,Which proteins control the degradation of cryptic unstable transcripts (CUTs) in yeast?,"The proteins that control the degradation of cryptic unstable transcripts (CUTs) in yeast are the Nrd1-exosome-TRAMP complexes, the RNA-binding proteins Nrd1 and Nab3, and the Trf4/5-Air1/2-Mtr4 polyadenylation (TRAMP) complex.",2.4
4,What is being measured with an accelerometer in back pain patients,"An accelerometer is used to measure overall physical activity, time spent in different static trunk postures, standing time, lying time, time upright, time walking, step count, and the number of constant postures in back pain patients.",2.9
5,Are ultraconserved elements often transcribed?,"Yes, some ultraconserved elements, specifically a subset known as transcribed ultraconserved regions (T-UCRs), are often transcribed.",2.2
6,Is Fibroblast Growth Factor 23 a phosphaturic hormone?,"Yes, Fibroblast Growth Factor 23 is a phosphaturic hormone.",2.4
7,Can NXY-059 be used for treatment of acute ischemic stroke patients?,"NXY-059 has been tested and shown some promise in the treatment of acute ischemic stroke, but its effectiveness is not definitively confirmed and additional research is needed.",2.2
8,Are there Conserved Noncoding Elements (CNEs) in plant genomes?,"Yes, there are Conserved Noncoding Elements (CNEs) in plant genomes.",1.2
9,Are long non coding RNAs as conserved in sequence as protein coding genes?,"No, long noncoding RNAs are not as conserved in sequence as protein-coding genes.",2.4


Notice how the mean score has increased from 2.0 to 2.265, so the optimized model now performs better on the devset.
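As a quick worked check of what that change means, the sketch below expresses both means as a share of the 3.0 maximum and computes the relative gain. The helper names are hypothetical. Since the devset has only 20 examples and the scores come from an LLM judge, the figures are best read as indicative.

```python
MAX_SCORE = 3.0  # maximum possible score stated earlier in the notebook


def share_of_max(score: float, max_score: float = MAX_SCORE) -> float:
    """Express a score as a fraction of the maximum possible score."""
    return score / max_score


def relative_improvement(before: float, after: float) -> float:
    """Relative change from `before` to `after`."""
    return (after - before) / before


uncompiled_mean = 2.0
compiled_mean = 2.265

print(f"uncompiled: {share_of_max(uncompiled_mean):.1%} of max")
print(f"compiled:   {share_of_max(compiled_mean):.1%} of max")
print(f"relative improvement: {relative_improvement(uncompiled_mean, compiled_mean):.2%}")
```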

In [76]:
few_shot_bootstrap_compiled_rag(sample)

Prediction(
    context=['Two proprotein convertase subtilisin/kexin type 9 (PCSK9) inhibitors, evolocumab and alirocumab, have recently been approved by both the Food and Drug Administration (FDA) and the European Medicines Agency (EMA) for the treatment of hypercholesterolemia.', 'Food and Drug Administration approved the first two proprotein convertase subtilisin/kexin type 9 (PCSK9) inhibitors, alirocumab (Praluent®; Sanofi/ Regeneron) and evolocumab (Repatha®; Amgen), for use in patients with heterozygous and homozygous familial hypercholesterolemia and for patients intolerant of statins or those with a major risk of cardiovascular disease (CVD) but unable to lower their LDL cholesterol (LDL-C) to optimal levels with statins and ezetimibe.', 'In 2015 the U.S. Food and Drug Administration approved the first two proprotein convertase subtilisin/kexin type 9 (PCSK9) inhibitors, alirocumab (Praluent®; Sanofi/ Regeneron) and evolocumab (Repatha®; Amgen), for use in patients with hetero

### Signature Optimizer

Optimizing the signature is another way to try to improve the performance of your model. You can pass either the bootstrapped compiled model from above or the uncompiled model to this optimizer; here we use the uncompiled model.

In [77]:

from dspy.teleprompt import BayesianSignatureOptimizer

llm_prompter = dspy.OpenAI(model='gpt-4', max_tokens=2000, model_type='chat')

teleprompter = BayesianSignatureOptimizer(task_model=dspy.settings.lm,
                                          metric=llm_metric,
                                          prompt_model=llm_prompter,
                                          n=5,
                                          verbose=False)
kwargs = dict(num_threads=8, display_progress=True, display_table=0)
signature_compiled_rag = teleprompter.compile(uncompile_k_10, devset=trainset_truncated,
                                         optuna_trials_num=3,
                                         max_bootstrapped_demos=4,
                                         max_labeled_demos=4,
                                         eval_kwargs=kwargs)


Please be advised that based on the parameters you have set, the maximum number of LM calls is projected as follows:

- Task Model: 26 examples in dev set * 3 trials * # of LM calls in your program = (78 * # of LM calls in your program) task model calls
- Prompt Model: # data summarizer calls (max 10) + 5 * 1 lm calls in program = 15 prompt model calls

Estimated Cost Calculation:

Total Cost = (Number of calls to task model * (Avg Input Token Length per Call * Task Model Price per Input Token + Avg Output Token Length per Call * Task Model Price per Output Token) 
            + (Number of calls to prompt model * (Avg Input Token Length per Call * Task Prompt Price per Input Token + Avg Output Token Length per Call * Prompt Model Price per Output Token).

For a preliminary estimate of potential costs, we

  0%|          | 0/26 [00:00<?, ?it/s]

  4%|▍         | 1/26 [00:01<00:32,  1.28s/it]

Faithful: 5
Detail: 1
Correctness: 5


  8%|▊         | 2/26 [00:05<01:08,  2.86s/it]

Faithful: 5
Detail: 2
Correctness: 3


 12%|█▏        | 3/26 [00:11<01:43,  4.50s/it]

Faithful: 3
Detail: 2
Correctness: 2


 15%|█▌        | 4/26 [00:21<01:56,  5.30s/it]


Faithful: 5
Detail: 3
Correctness: 4
Bootstrapped 4 full traces after 5 examples in round 0.


  4%|▍         | 1/26 [00:04<01:54,  4.57s/it]

Faithful: 4
Detail: 3
Correctness: 4


  8%|▊         | 2/26 [00:33<07:35, 18.97s/it]

Faithful: 5
Detail: 5
Correctness: 5


 12%|█▏        | 3/26 [01:02<08:56, 23.32s/it]

Faithful: 5
Detail: 4
Correctness: 4.5


 15%|█▌        | 4/26 [01:09<06:22, 17.41s/it]


Faithful: 5
Detail: 2
Correctness: 3
Bootstrapped 4 full traces after 5 examples in round 0.


  4%|▍         | 1/26 [00:10<04:13, 10.14s/it]

Faithful: 5
Detail: 5
Correctness: 5


  8%|▊         | 2/26 [00:18<03:41,  9.22s/it]

Faithful: 5
Detail: 3
Correctness: 4


 12%|█▏        | 3/26 [00:24<02:50,  7.43s/it]

Faithful: 5
Detail: 1
Correctness: 5


 15%|█▌        | 4/26 [00:31<02:50,  7.77s/it]


Faithful: 5
Detail: 3
Correctness: 4
Bootstrapped 4 full traces after 5 examples in round 0.


  4%|▍         | 1/26 [00:28<11:56, 28.66s/it]

Faithful: 4
Detail: 3
Correctness: 3


  8%|▊         | 2/26 [00:36<06:28, 16.19s/it]

Faithful: 3
Detail: 1
Correctness: 2


 12%|█▏        | 3/26 [00:41<04:20, 11.31s/it]

Faithful: 2
Detail: 1
Correctness: 1


 15%|█▌        | 4/26 [00:45<04:09, 11.33s/it]

Faithful: 4
Detail: 2
Correctness: 3
Bootstrapped 4 full traces after 5 examples in round 0.



[I 2024-04-30 01:02:24,785] A new study created in memory with name: no-name-acca11e2-8e9a-41ab-8437-ad50b2e5f5e0


Starting trial #0


Average Metric: 2.2 / 1  (220.0):   4%|▍         | 1/26 [00:10<04:17, 10.31s/it]

Faithful: 4
Detail: 3
Correctness: 4


Average Metric: 4.2 / 2  (210.0):   8%|▊         | 2/26 [00:11<01:58,  4.93s/it]

Faithful: 5
Detail: 2
Correctness: 3


Average Metric: 7.2 / 3  (240.0):  12%|█▏        | 3/26 [00:12<01:12,  3.17s/it]

Faithful: 5
Detail: 5
Correctness: 5


Average Metric: 9.6 / 4  (240.0):  15%|█▌        | 4/26 [00:18<01:36,  4.39s/it]

Faithful: 5
Detail: 3
Correctness: 4


Average Metric: 12.0 / 5  (240.0):  19%|█▉        | 5/26 [00:21<01:22,  3.94s/it]

Faithful: 5
Detail: 3
Correctness: 4


Average Metric: 17.6 / 7  (251.4):  27%|██▋       | 7/26 [00:31<01:16,  4.04s/it]

Faithful: 5
Detail: 5
Correctness: 4
Faithful: 5
Detail: 5
Correctness: 4


Average Metric: 20.6 / 8  (257.5):  31%|███       | 8/26 [00:36<01:14,  4.12s/it]

Faithful: 5
Detail: 5
Correctness: 5


Average Metric: 23.0 / 9  (255.6):  35%|███▍      | 9/26 [00:38<01:02,  3.69s/it]

Faithful: 5
Detail: 3
Correctness: 4


Average Metric: 24.6 / 10  (246.0):  38%|███▊      | 10/26 [00:39<00:44,  2.75s/it]

Faithful: 3
Detail: 4
Correctness: 1


Average Metric: 26.8 / 11  (243.6):  42%|████▏     | 11/26 [00:41<00:36,  2.46s/it]

Faithful: 5
Detail: 3
Correctness: 3


Average Metric: 28.8 / 12  (240.0):  46%|████▌     | 12/26 [00:43<00:31,  2.27s/it]

Faithful: 5
Detail: 2
Correctness: 3


Average Metric: 31.0 / 13  (238.5):  50%|█████     | 13/26 [00:43<00:22,  1.72s/it]

Faithful: 5
Detail: 3
Correctness: 3


Average Metric: 33.4 / 14  (238.6):  54%|█████▍    | 14/26 [00:44<00:16,  1.34s/it]

Faithful: 5
Detail: 4
Correctness: 3


Average Metric: 35.8 / 15  (238.7):  58%|█████▊    | 15/26 [00:47<00:21,  1.97s/it]

Faithful: 5
Detail: 3
Correctness: 4


Average Metric: 38.5 / 16  (240.6):  62%|██████▏   | 16/26 [00:48<00:15,  1.56s/it]

Faithful: 5
Detail: 4
Correctness: 4.5


Average Metric: 40.7 / 17  (239.4):  65%|██████▌   | 17/26 [00:48<00:11,  1.30s/it]

Faithful: 4
Detail: 4
Correctness: 3


Average Metric: 42.7 / 18  (237.2):  69%|██████▉   | 18/26 [00:52<00:15,  1.92s/it]

Faithful: 5
Detail: 2
Correctness: 3


Average Metric: 45.7 / 19  (240.5):  73%|███████▎  | 19/26 [00:54<00:14,  2.03s/it]

Faithful: 5
Detail: 5
Correctness: 5


Average Metric: 48.300000000000004 / 20  (241.5):  77%|███████▋  | 20/26 [00:59<00:16,  2.78s/it]

Faithful: 5
Detail: 3
Correctness: 5


Average Metric: 49.7 / 21  (236.7):  81%|████████  | 21/26 [01:11<00:28,  5.73s/it]              

Faithful: 3
Detail: 1
Correctness: 3


Average Metric: 51.7 / 22  (235.0):  85%|████████▍ | 22/26 [01:14<00:19,  4.93s/it]

Faithful: 4
Detail: 3
Correctness: 3


Average Metric: 53.900000000000006 / 23  (234.3):  88%|████████▊ | 23/26 [01:15<00:11,  3.83s/it]

Faithful: 5
Detail: 3
Correctness: 3


Average Metric: 55.300000000000004 / 24  (230.4):  92%|█████████▏| 24/26 [01:19<00:07,  3.58s/it]

Faithful: 3
Detail: 2
Correctness: 2


Average Metric: 57.7 / 25  (230.8):  96%|█████████▌| 25/26 [01:19<00:02,  2.71s/it]              

Faithful: 5
Detail: 4
Correctness: 3


Average Metric: 59.7 / 26  (229.6): 100%|██████████| 26/26 [01:45<00:00,  4.06s/it]
[I 2024-04-30 01:04:10,553] Trial 0 finished with value: 229.62 and parameters: {'6311971984_predictor_instruction': 1, '6311971984_predictor_demos': 2}. Best is trial 0 with value: 229.62.


Faithful: 4
Detail: 5
Correctness: 1
Average Metric: 59.7 / 26  (229.6%)
Starting trial #1


Average Metric: 4.0 / 2  (200.0):   4%|▍         | 1/26 [00:00<00:15,  1.66it/s]

Faithful: 3
Detail: 4
Correctness: 1
Faithful: 5
Detail: 4
Correctness: 3


Average Metric: 8.600000000000001 / 4  (215.0):  15%|█▌        | 4/26 [00:01<00:04,  4.63it/s]

Faithful: 5
Detail: 3
Correctness: 4
Faithful: 5
Detail: 3
Correctness: 3


Average Metric: 10.8 / 5  (216.0):  19%|█▉        | 5/26 [00:01<00:04,  4.53it/s]             

Faithful: 4
Detail: 3
Correctness: 4


Average Metric: 16.6 / 7  (237.1):  23%|██▎       | 6/26 [00:02<00:11,  1.80it/s]              

Faithful: 5
Detail: 5
Correctness: 4
Faithful: 5
Detail: 5
Correctness: 5


Average Metric: 18.8 / 8  (235.0):  31%|███       | 8/26 [00:04<00:12,  1.49it/s]

Faithful: 4
Detail: 4
Correctness: 3


Average Metric: 24.2 / 10  (242.0):  35%|███▍      | 9/26 [00:05<00:14,  1.15it/s]

Faithful: 5
Detail: 3
Correctness: 4
Faithful: 5
Detail: 5
Correctness: 5
Faithful: 5
Detail: 2
Correctness: 3


Average Metric: 28.2 / 12  (235.0):  46%|████▌     | 12/26 [00:06<00:07,  1.93it/s]

Faithful: 5
Detail: 2
Correctness: 3


Average Metric: 33.3 / 14  (237.9):  50%|█████     | 13/26 [00:06<00:06,  2.14it/s]

Faithful: 5
Detail: 4
Correctness: 4.5
Faithful: 5
Detail: 3
Correctness: 4


Average Metric: 41.099999999999994 / 17  (241.8):  62%|██████▏   | 16/26 [00:08<00:05,  1.68it/s]

Faithful: 5
Detail: 5
Correctness: 4
Faithful: 4
Detail: 3
Correctness: 3
Faithful: 5
Detail: 5
Correctness: 5


Average Metric: 44.89999999999999 / 19  (236.3):  69%|██████▉   | 18/26 [00:09<00:03,  2.38it/s] 

Faithful: 3
Detail: 1
Correctness: 3
Faithful: 5
Detail: 3
Correctness: 4


Average Metric: 49.29999999999999 / 21  (234.8):  81%|████████  | 21/26 [00:09<00:01,  3.09it/s]

Faithful: 5
Detail: 4
Correctness: 3
Faithful: 4
Detail: 5
Correctness: 1


Average Metric: 52.69999999999999 / 23  (229.1):  85%|████████▍ | 22/26 [00:09<00:01,  3.62it/s]

Faithful: 5
Detail: 2
Correctness: 3
Faithful: 3
Detail: 2
Correctness: 2


Average Metric: 59.699999999999996 / 26  (229.6): 100%|██████████| 26/26 [00:10<00:00,  2.40it/s]
[I 2024-04-30 01:04:21,408] Trial 1 finished with value: 229.62 and parameters: {'6311971984_predictor_instruction': 1, '6311971984_predictor_demos': 2}. Best is trial 0 with value: 229.62.


Faithful: 5
Detail: 3
Correctness: 3
Faithful: 5
Detail: 3
Correctness: 3
Faithful: 5
Detail: 3
Correctness: 5
Average Metric: 59.699999999999996 / 26  (229.6%)
Starting trial #2


Average Metric: 4.0 / 2  (200.0):   4%|▍         | 1/26 [00:00<00:09,  2.53it/s]

Faithful: 5Faithful: 4
Detail: 3
Correctness: 4

Detail: 3
Correctness: 1


Average Metric: 6.2 / 3  (206.7):  12%|█▏        | 3/26 [00:01<00:12,  1.84it/s]

Faithful: 5
Detail: 3
Correctness: 3


Average Metric: 8.0 / 4  (200.0):  15%|█▌        | 4/26 [00:03<00:22,  1.01s/it]

Faithful: 5
Detail: 2
Correctness: 2


Average Metric: 10.4 / 5  (208.0):  19%|█▉        | 5/26 [00:04<00:18,  1.11it/s]

Faithful: 5
Detail: 3
Correctness: 4


Average Metric: 11.8 / 6  (196.7):  23%|██▎       | 6/26 [00:05<00:23,  1.16s/it]

Faithful: 3
Detail: 2
Correctness: 2


Average Metric: 14.200000000000001 / 7  (202.9):  27%|██▋       | 7/26 [00:06<00:19,  1.02s/it]

Faithful: 5
Detail: 3
Correctness: 4


Average Metric: 16.200000000000003 / 8  (202.5):  31%|███       | 8/26 [00:07<00:15,  1.19it/s]

Faithful: 5
Detail: 2
Correctness: 3


Average Metric: 22.8 / 11  (207.3):  42%|████▏     | 11/26 [00:07<00:05,  2.53it/s]            

Faithful: 5
Detail: 3
Correctness: 3
Faithful: 5
Detail: 3
Correctness: 3
Faithful: 5
Detail: 1
Correctness: 5


Average Metric: 24.2 / 12  (201.7):  46%|████▌     | 12/26 [00:07<00:05,  2.45it/s]

Faithful: 5
Detail: 1
Correctness: 1


Average Metric: 27.2 / 13  (209.2):  50%|█████     | 13/26 [00:08<00:06,  1.98it/s]

Faithful: 5
Detail: 5
Correctness: 5


Average Metric: 29.2 / 14  (208.6):  54%|█████▍    | 14/26 [00:09<00:07,  1.52it/s]

Faithful: 5
Detail: 2
Correctness: 3


Average Metric: 35.699999999999996 / 17  (210.0):  62%|██████▏   | 16/26 [00:11<00:08,  1.19it/s]

Faithful: 4
Detail: 3
Correctness: 3
Faithful: 5
Detail: 4
Correctness: 4.5
Faithful: 5
Detail: 1
Correctness: 3


Average Metric: 39.69999999999999 / 20  (198.5):  73%|███████▎  | 19/26 [00:12<00:03,  1.78it/s] 

Faithful: 2
Detail: 1
Correctness: 1
Faithful: 2
Detail: 5
Correctness: 5
Faithful: 2
Detail: 1
Correctness: 1


Average Metric: 47.99999999999999 / 24  (200.0):  88%|████████▊ | 23/26 [00:13<00:01,  1.91it/s]

Faithful: 5Faithful: 3
Detail: 2
Correctness: 1

Detail: 2
Correctness: 3
Faithful: 5
Detail: 5
Correctness: 4.5
Faithful: 5
Detail: 1
Correctness: 5


Average Metric: 50.39999999999999 / 26  (193.8): 100%|██████████| 26/26 [00:13<00:00,  1.86it/s] 

Faithful: 5
Detail: 1
Correctness: 2
Faithful: 2
Detail: 1
Correctness: 1
Average Metric: 50.39999999999999 / 26  (193.8%)



[I 2024-04-30 01:04:35,641] Trial 2 finished with value: 193.84999999999997 and parameters: {'6311971984_predictor_instruction': 0, '6311971984_predictor_demos': 0}. Best is trial 0 with value: 229.62.


Returning generate_answer = ChainOfThought(GenerateAnswer(context, question -> answer
    instructions='Answer questions based on the context.'
    context = Field(annotation=str required=True json_schema_extra={'desc': 'may contain relevant facts', '__dspy_field_type': 'input', 'prefix': 'Context:'})
    question = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Question:', 'desc': '${question}'})
    answer = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'output', 'prefix': 'Answer:', 'desc': '${answer}'})
)) from continue_program
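
The call projection printed at the start of the optimizer run can be reproduced by hand. The sketch below is only an illustration: the counts (26 dev examples, 3 trials, up to 10 data-summarizer calls, 5 candidates, 1 LM call in the program) are read from the log above, while the function names, token lengths, and per-token prices are hypothetical placeholders rather than real pricing.

```python
def projected_calls(dev_examples, trials, lm_calls_in_program,
                    max_summarizer_calls, n_candidates):
    """Upper bound on LM calls, following the formula printed in the log."""
    task_calls = dev_examples * trials * lm_calls_in_program
    prompt_calls = max_summarizer_calls + n_candidates * lm_calls_in_program
    return task_calls, prompt_calls


def estimated_cost(calls, avg_in_tokens, avg_out_tokens, price_in, price_out):
    """Cost of `calls` requests given average token lengths and per-token prices."""
    return calls * (avg_in_tokens * price_in + avg_out_tokens * price_out)


# Values taken from the optimizer log (the RAG program makes 1 LM call).
task_calls, prompt_calls = projected_calls(
    dev_examples=26, trials=3, lm_calls_in_program=1,
    max_summarizer_calls=10, n_candidates=5,
)
print(task_calls, prompt_calls)  # 78 15

# Hypothetical token lengths and prices, purely for illustration.
total_cost = (
    estimated_cost(task_calls, avg_in_tokens=1500, avg_out_tokens=200,
                   price_in=5e-7, price_out=1.5e-6)
    + estimated_cost(prompt_calls, avg_in_tokens=1000, avg_out_tokens=300,
                     price_in=3e-5, price_out=6e-5)
)
print(f"Estimated total cost: ${total_cost:.4f}")
```

Substituting your own average token lengths and your provider's current prices gives a rough upper bound before you start a longer run.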


In [79]:
signature_compiled_rag_eval = evaluate(signature_compiled_rag, metric=llm_metric,return_all_scores=True, return_outputs=True)





Faithful: 5
Detail: 5
Correctness: 5



Average Metric: 15.2 / 6  (253.3):  30%|███       | 6/20 [00:01<00:02,  6.31it/s]

Faithful: 5
Detail: 5
Correctness: 5
Faithful: 5
Detail: 2
Correctness: 4
Faithful: 5
Detail: 3
Correctness: 3
Faithful: 5
Detail: 3
Correctness: 4
Faithful: 5
Detail: 3
Correctness: 4
Faithful: 4
Detail: 3
Correctness: 3




Faithful: 5
Detail: 5
Correctness: 5
Faithful: 5
Detail: 4
Correctness: 3
Faithful: 5
Detail: 3
Correctness: 3


Average Metric: 24.799999999999997 / 10  (248.0):  45%|████▌     | 9/20 [00:01<00:01,  8.04it/s]

Faithful: 5
Detail: 5
Correctness: 4




Faithful: 5
Detail: 2
Correctness: 3




Faithful: 5
Detail: 3
Correctness: 4
Faithful: 5
Detail: 2
Correctness: 3
Faithful: 5
Detail: 2
Correctness: 3




Faithful: 5
Detail: 2
Correctness: 3
Faithful: 4
Detail: 4
Correctness: 3
Faithful: 5
Detail: 4
Correctness: 3


Average Metric: 46.2 / 20  (231.0): 100%|██████████| 20/20 [00:04<00:00,  4.15it/s]

Faithful: 5
Detail: 3
Correctness: 4
Faithful: 1
Detail: 2
Correctness: 3
Average Metric: 46.2 / 20  (231.0%)





| | question | example_answer | context | pred_answer | llm_metric |
|---|---|---|---|---|---|
| 0 | Are ultraconserved elements often transcribed? | Yes. Especially, a large fraction of non-exonic UCEs is transcribed across all developmental stages examined from only one DNA strand. | ['Ultraconserved elements in insect genomes', 'ultraconserved elements in plants tend to occur in clusters and locate at noncoding regions', 'Transcribed ultraconserved region in human cancers.',... | Yes, ultraconserved elements are often transcribed. | 2.2 |
| 1 | List metalloenzyme inhibitors. | Foscarnet VT-1129 VT-1161 BB-3497 hydroxamate molecules siderophores | ['the metalloenzyme inhibitors VT-1129 and VT-1161 (both Viamet Pharmaceuticals)', 'Clinically approved inhibitors were selected as well as several other reported metalloprotein inhibitors in order to... | Metalloenzyme inhibitors include VT-1129, VT-1161, hydroxamate molecules, siderophores, the N-formyl-hydroxylamine derivative BB-3497, tissue inhibitor of metalloproteinases-1 (Timp-1), 8-halo-4-(3-chloro-4-fluoro-phenylamino)-6-[(1H-[1,2,3]triazol-4-ylmethyl)-amino]-quinoline-3-carbonitriles, 2,2'-dipyridylamine (DPA), triazacyclononane (TACN), 8-hydroxyquinoline, and the... | 3.0 |
| 2 | Which proteins control the degradation of cryptic unstable transcripts (CUTs) in yeast? | Termination of cryptic unstable transcripts is directed by yeast RNA-binding proteins Nrd1 and Nab3. These cryptic unstable transcripts (CUTs) are rapidly degraded by the nuclear... | ['Cryptic unstable transcripts (CUTs) were recently described as a principal class of RNA polymerase II transcripts in Saccharomyces cerevisiae', 'Cryptic unstable transcripts (CUTs) were recently... | The proteins that control the degradation of cryptic unstable transcripts (CUTs) in yeast are Nrd1, Nab3, and the TRAMP complex. | 2.2 |
| 3 | Can protein coding exons originate from ALU sequences? | Yes. Intronic ALUs can evolve into exons by the activation of splice signals residing within the ALU sequence. While most ALU exons do not add... | ['Transcribed Alu sequences can alter splicing patterns by generating new exons', 'Transcribed Alu sequences can alter splicing patterns by generating new exons, but other impacts... | Yes, protein coding exons can originate from Alu sequences. | 2.0 |
| 4 | Is Fibroblast Growth Factor 23 a phosphaturic hormone? | Yes, fbroblast growth factor 23 (FGF23) is a phosphaturic hormone. | ['fibroblast growth factor-23 (FGF23), a bone-derived phosphaturic hormone.', 'serum levels of a phosphaturic hormone, fibroblast growth factor 23 (Fgf23),', 'circulating phosphaturic hormone fibroblast growth factor-23... | Yes, Fibroblast Growth Factor 23 is a phosphaturic hormone. | 2.4 |


In [82]:
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.max_rows', 500)

signature_compiled_rag_eval_evals_df = create_score_dataframe(signature_compiled_rag_eval[1])
print(signature_compiled_rag_eval_evals_df['score'].mean())
signature_compiled_rag_eval_evals_df

2.31


| | question | predicted_answer | score |
|---|---|---|---|
| 0 | List metalloenzyme inhibitors. | Metalloenzyme inhibitors include VT-1129, VT-1161, hydroxamate molecules, siderophores, the N-formyl-hydroxylamine derivative BB-3497, tissue inhibitor of metalloproteinases-1 (Timp-1), 8-halo-4-(3-chloro-4-fluoro-phenylamino)-6-[(1H-[1,2,3]triazol-4-ylmethyl)-amino]-quinoline-3-carbonitriles, 2,2'-dipyridylamine (DPA), triazacyclononane (TACN), 8-hydroxyquinoline, and the sarcoplasmic/endoplasmic Ca(2+)-ATPase (SERCA) inhibitor thapsigargin. | 3.0 |
| 1 | What is being measured with an accelerometer in back pain patients | In back pain patients, an accelerometer is used to measure various aspects of physical activity, including overall physical activity levels, time spent in different static trunk postures, time spent upright, standing, or walking, and step count. | 3.0 |
| 2 | Which proteins control the degradation of cryptic unstable transcripts (CUTs) in yeast? | The proteins that control the degradation of cryptic unstable transcripts (CUTs) in yeast are Nrd1, Nab3, and the TRAMP complex. | 2.2 |
| 3 | Are ultraconserved elements often transcribed? | Yes, ultraconserved elements are often transcribed. | 2.2 |
| 4 | Is Fibroblast Growth Factor 23 a phosphaturic hormone? | Yes, Fibroblast Growth Factor 23 is a phosphaturic hormone. | 2.4 |
| 5 | Are long non coding RNAs as conserved in sequence as protein coding genes? | No, long noncoding RNAs are not as conserved in sequence as protein-coding genes. | 2.4 |
| 6 | Can protein coding exons originate from ALU sequences? | Yes, protein coding exons can originate from Alu sequences. | 2.0 |
| 7 | What is the name of the stem loop present in the 3' end of genes encoding for selenoproteins? | The stem-loop present in the 3' end of genes encoding for selenoproteins is called the Sec insertion sequence (SECIS) or selenocysteine insertion sequence (SECIS). | 3.0 |
| 8 | What are prions? | Prions are proteins that can exist in a misfolded form, which is capable of converting normally folded proteins into the misfolded state. These misfolded proteins form aggregates and are associated with fatal neurodegenerative diseases in mammals. Prions are self-propagating infectious protein isoforms. | 2.4 |
| 9 | Do Conserved noncoding elements act as enhancers? | Yes, Conserved noncoding elements (CNEs) do act as enhancers. | 2.2 |


Compared with the uncompiled model, the mean metric score on the devset increased from 2.0 to 2.3.
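
The "Average Metric" lines in the logs can look odd because they report values above 100%. The printed numbers are consistent with each example scoring (Faithful + Detail + Correctness) / 5, and with the bracketed figure being the mean score multiplied by 100. This is an inference from the log output, not the definition of `llm_metric`, and the helper names below are hypothetical.

```python
def example_score(faithful, detail, correctness):
    """Per-example score as inferred from the log (an assumption, not llm_metric itself)."""
    return (faithful + detail + correctness) / 5


# First five judge outputs printed in trial #0 of the optimizer run.
judged = [(4, 3, 4), (5, 2, 3), (5, 5, 5), (5, 3, 4), (5, 3, 4)]

running_total = 0.0
running_totals = []
for faithful, detail, correctness in judged:
    running_total += example_score(faithful, detail, correctness)
    running_totals.append(round(running_total, 1))

# The log reports 2.2, 4.2, 7.2, 9.6 and 12.0 for these five steps.
print(running_totals)


def summarize(total, n_examples):
    """Return (mean score, value shown in brackets by the evaluator)."""
    mean = total / n_examples
    return round(mean, 2), round(mean * 100, 1)


# Final evaluation of the signature-optimized program: 46.2 over 20 examples.
print(summarize(46.2, 20))  # (2.31, 231.0)
```

Read this way, "231.0%" is a mean score of 2.31. If the three judge ratings each top out at 5, the maximum possible score under this reading is 3.0.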