# DSPy
DSPy is a framework for algorithmically optimizing LM prompts and weights. It helps you define your tasks more precisely and optimize your prompts for your use case.
Read more about it [here](https://dspy-docs.vercel.app/docs/intro).

In [1]:
import pandas as pd
from fastembed import TextEmbedding
from qdrant_client import QdrantClient
from qdrant_client.http import models
from tqdm import tqdm
from datasets import load_dataset

from typing import Optional
import os
import random

In [2]:
import dspy
from dspy.utils import dotdict

### Load the Training and Testing Data

In [3]:
corpora = load_dataset("nirantk/geneticsQA-corpus", split="train").to_pandas()

In [4]:
train = load_dataset("nirantk/geneticsQA-train", split="train").to_pandas()

Each question in the training data is associated with a `ground_truth` label. We will use these labels to train our model and optimize the prompts.

In [5]:
pd.set_option("display.max_colwidth", 500)
train_data = train[["question", "contexts", "ground_truth"]]
train_data

Unnamed: 0,question,contexts,ground_truth
0,What is Snord116?,"['Further analysis with array-CGH identified a mosaic 847\u2009kb deletion in 15q11-q13, including SNURF-SNRPN, the snoRNA gene clusters SNORD116 (HBII-85), SNORD115, (HBII-52), SNORD109 A and B (HBII-438A and B), SNORD64 (HBII-13), and NPAP1 (C15ORF2).', 'All three deletions included SNORD116, but only two encompassed parts of SNURF-SNRPN, implicating SNORD116 as the major contributor to the Prader-Willi phenotype. Our case adds further information about genotype-phenotype correlation and s...","['SNORD116 is a small nucleolar (sno) RNA gene cluster (HBII-85) implicated as a major contributor the Prader-Willi phenotype. \nSNORD116 genes appears to be responsible for the major features of PWS. \nSNORD116 is a paternally expressed box C/D snoRNA gene cluster.\nThe mouse C/D box snoRNA MBII-85 (SNORD116) is processed into at least five shorter RNAs using processing sites near known functional elements of C/D box snoRNAs.\nSnord116 expression in the medial hypothalamus, particularly wit..."
1,Are ultraconserved elements often transcribed?,"['Starting from a genome-wide expression profiling, we demonstrate for the first time a functional link between oxygen deprivation and the modulation of long noncoding transcripts from ultraconserved regions, termed transcribed-ultraconserved regions (T-UCRs)', 'Our data gives a first glimpse of a novel functional hypoxic network comprising protein-coding transcripts and noncoding RNAs (ncRNAs) from the T-UCRs category', 'Highly conserved elements discovered in vertebrates are present in non...","['Yes. Especially, a large fraction of non-exonic UCEs is transcribed across all developmental stages examined from only one DNA strand.']"
2,List metalloenzyme inhibitors.,"[' Clinically approved inhibitors were selected as well as several other reported metalloprotein inhibitors in order to represent a broad range of metal binding groups (MBGs), including hydroxamic acid, carboxylate, hydroxypyridinonate, thiol, and N-hydroxyurea functional groups.', 'A total of 21 different raltegravir-chelator derivative (RCD) compounds were prepared that differed only in the nature of the MBG. ', 'At least two compounds (RCD-4, RCD-5) containing a hydroxypyrone MBG were fou...",['Foscarnet\nVT-1129\nVT-1161 \nBB-3497\nhydroxamate molecules\nsiderophores']
3,"Which protein phosphatase has been found to interact with the heat shock protein, HSP20?","[' Moreover, protein phosphatase-1 activity is regulated by two binding partners, inhibitor-1 and the small heat shock protein 20, Hsp20. Indeed, human genetic variants of inhibitor-1 (G147D) or Hsp20 (P20L) result in reduced binding and inhibition of protein phosphatase-1, suggesting aberrant enzymatic regulation in human carriers. ', 'Small heat shock protein 20 interacts with protein phosphatase-1 and enhances sarcoplasmic reticulum calcium cycling.', ' Hsp20 overexpression in intact anim...","['Protein phosphatase-1 activity is regulated by two binding partners, inhibitor-1 and the small heat shock protein 20, Hsp20. Cell fractionation, coimmunoprecipitation, and coimmunolocalization studies, revealed an association between Hsp20 and PP1. Small heat shock protein 20 interacts with protein phosphatase-1 and enhances sarcoplasmic reticulum calcium cycling.', 'Moreover, protein phosphatase-1 activity is regulated by two binding partners, inhibitor-1 and the small heat shock protein ..."
4,Do DNA double-strand breaks play a causal role in carcinogenesis?,"['The DNA non-homologous end-joining repair gene XRCC6/Ku70 plays an important role in the repair of DNA double-strand breaks (DSBs) induced by both exogenous and endogenous DNA-damaging agents. Defects in overall DSB repair capacity can lead to genomic instability and carcinogenesis.', 'The tumor suppressor breast cancer susceptibility protein 1 (BRCA1) protects our cells from genomic instability in part by facilitating the efficient repair of DNA double-strand breaks (DSBs). BRCA1 promotes...","['Yes. It has been demonstrated that induction of DNA double-strand breaks (DSBs) and defects in overall DSBs repair capacity can lead to an accumulation of mutations, resulting in genomic instability of cells. Given that genomic instability is the hallmark of cancer, DSBs play a causal role in carcinogenesis.']"
...,...,...,...
1654,Is thyroid hormone therapy indicated in patients with heart failure?,"['Patients with chronic heart failure and subclinical hypothyroidism significantly improved their physical performance when normal TSH levels were reached.', 'Early and sustained physiological restoration of circulating L-T3 levels after MI halves infarct scar size and prevents the progression towards heart failure. This beneficial effect is likely due to enhanced capillary formation and mitochondrial protection.', 'These data indicate that T(3) replacement to euthyroid levels improves systo...",['There are several experimental and clinical evidences of the potential benefits of Thyroid hormone replacement therapy in heart failure. Initial clinical data showed also a good safety profile and tolerance of TH replacement therapy in patients withheart failure. \nHowever currently there is no indication to treat patients with heart failure withTHreplacementtherapy.']
1655,Is protein Fbw7 a SCF type of E3 ubiquitin ligase?,"['FBW7 (F-box and WD repeat domain-containing 7) is the substrate recognition component of an evolutionary conserved SCF (complex of SKP1, CUL1 and F-box protein)-type ubiquitin ligase.', 'However, very few E3 ubiquitin ligases are known to target G-CSFR for ubiquitin-proteasome pathway. Here we identified F-box and WD repeat domain-containing 7 (Fbw7), a substrate recognizing component of Skp-Cullin-F box (SCF) E3 ubiquitin Ligase physically associates with G-CSFR and promotes its ubiquitin...","['Fbxw7 (also known as Fbw7, SEL-10, hCdc4, or hAgo) is the F-box protein subunit of an Skp1-Cul1-F-box protein (SCF)-type ubiquitin ligase complex that plays a central role in the degradation of Notch family members.The F-box protein Fbw7 (also known as Fbxw7, hCdc4 and Sel-10) functions as a substrate recognition component of a SCF-type E3 ubiquitin ligase. SCF(Fbw7) facilitates polyubiquitination and subsequent degradation of various proteins such as Notch, cyclin E, c-Myc and c-Jun.', 'T..."
1656,Is Annexin V an apoptotic marker?,"['The apoptosis of the MSCs was induced by subjecting the cells to OGD conditions for 4 h and was detected by Annexin V/PI and Hoechst 33258 staining. ', 'In addition to the antimicrobial activity, we found that treatment of the cancer cell lines, Jurkat T-cells, Granta cells, and melanoma cells, with the Pseudomonas sp. In5 crude extract increased staining with the apoptotic marker Annexin V while no staining of healthy normal cells, i.e., naïve or activated CD4 T-cells, was observed.', 'At...","['Yes, annexin V is an early apoptotic marker.', 'Yes, Annexin V is an apoptotic marker.']"
1657,Which are the clinical characteristics of Tuberous Sclerosis?,"['Prevalence and long-term outcome of epilepsy in tuberous sclerosis complex (TSC) is reported to be variable', 'Subependymal giant cell astrocytomas (SEGAs) are benign tumors, most commonly associated with tuberous sclerosis complex (TSC).', 'Lymphangioleiomyomatosis (LAM) is a rare, progressive, frequently lethal cystic lung disease that almost exclusively affects women.', 'Rhabdomyoma is the most common type of cardiac tumor in fetuses and is often associated with tuberous sclerosis compl...","['The clinical characteristics of Tuberous Sclerosis include epilepsy, subependymal giant cell astrocytomas, lymphangioleiomyomatosis, rhabdomyoma, renal angiomyolipomas, cortical tubers, neurofibromas, angiofibromas, mental retardation, and behavioral disorders.']"


In [6]:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(train_data, test_size=0.2, random_state=42)

### Upload Contexts to Qdrant Vector Store

In [9]:
corpora.head()

Unnamed: 0,text
0,"Both 7SL genes and Alu elements are transcribed by RNA polymerase III, and we show here that the internal 7SL promoter lies within the Alu-like part of the 7SL gene"
1,"We performed a comparative analysis in vitro and in vivo of the antitumor effects of three different antibodies targeting different epitopes of ErbB2: Herceptin (trastuzumab), 2C4 (pertuzumab) and Erb-hcAb (human anti-ErbB2-compact antibody), a novel fully human compact antibody produced in our laboratory. Herein, we demonstrate that the growth of both androgen-dependent and independent prostate cancer cells was efficiently inhibited by Erb-hcAb. The antitumor effects induced by Erb-hcAb on ..."
2,"The weight-reducing property of molindone, a recently introduced antipsychotic drug, was tested in 9 hospitalized chronic schizophrenic patients. There was an average weight loss of 7.6 kg after 3 months on molindone; most of the loss occurred during the first month."
3,"Our study identifies a unique heterochromatin state marked by the presence of both H3.3 and H3K9me3, and establishes an important role for H3.3 in control of ERV retrotransposition in embryonic stem cells."
4,"Polyneuropathy, organomegaly, endocrinopathy, monoclonal gammopathy, and skin changes (POEMS) syndrome is an uncommon condition related to a paraneoplastic syndrome secondary to an underlying plasma cell disorder."


# For demonstration purposes, we use a very small subset of the dataset

In [11]:
corpora = corpora.sample(100)
train_data = train_data.sample(100)
test_data = test_data.sample(20)

In [13]:
embedding_model = TextEmbedding("BAAI/bge-base-en-v1.5")
qdrant_client = QdrantClient(
    ":memory:"
)  # spin up a local instance if you require more advanced features
# qdrant_client = QdrantClient("http://localhost:6333") # uncomment if you want to use your local instance

if qdrant_client.collection_exists("rag_contexts"):
    qdrant_client.delete_collection("rag_contexts")

qdrant_client.create_collection(
    "rag_contexts",
    vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE),
)

Fetching 5 files:   0%|          | 0/5 [00:00<?, ?it/s]

True

In [14]:
# Create and upload points to Qdrant
points = []
for idx, row in tqdm(corpora.iterrows(), total=corpora.shape[0]):
    point = models.PointStruct(
        id=idx,  # Use the dataframe index as the point ID
        vector=list(embedding_model.embed(row["text"]))[
            0
        ],  # embed() yields one vector per input text; take the first
        payload={"id": idx, "text": row["text"]},  # Store the passage text as the payload
    )
    points.append(point)
qdrant_client.upload_points(collection_name="rag_contexts", points=points)

100%|██████████| 100/100 [00:03<00:00, 26.74it/s]


### Custom Retriever that searches the contexts from Qdrant Vector Store

In [15]:
# use any embedding model
def generate_embeddings(text):
    return list(embedding_model.embed(text))[0]


class QdrantRetriever(dspy.Retrieve):
    def __init__(self, qdrant_collection_name, qdrant_client, k=10):
        super().__init__(k=k)
        self.client = qdrant_client
        self.collection_name = qdrant_collection_name

    def forward(self, query, k: Optional[int] = None):
        # Generate embedding for the query
        query_embedding = generate_embeddings(query)
        search_results = self.client.search(
            collection_name=self.collection_name,
            query_vector=query_embedding,
            limit=k if k else self.k,  # fall back to the k set in __init__
        )
        passages = [result.payload["text"] for result in search_results]
        passages = [dotdict({"long_text": passage}) for passage in passages]
        return passages
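To make the retrieval step concrete, here is a minimal sketch of what a cosine-similarity search does conceptually. It is a brute-force illustration in plain Python, not Qdrant's implementation, and the names `cosine_similarity`, `top_k`, and the toy two-dimensional "embeddings" are assumptions made up for this example.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    corpus: List[Tuple[str, Sequence[float]]],
    k: int,
) -> List[str]:
    """Return the texts of the k corpus entries most similar to the query."""
    ranked = sorted(
        corpus,
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Hypothetical 2-dimensional "embeddings", for illustration only.
toy_corpus = [
    ("passage about genes", [1.0, 0.0]),
    ("passage about proteins", [0.7, 0.7]),
    ("unrelated passage", [0.0, 1.0]),
]
top_two = top_k([1.0, 0.1], toy_corpus, k=2)
print(top_two)
```

In the notebook itself, the vectors come from the embedding model and the ranking is delegated to the Qdrant collection created with `Distance.COSINE`.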

In [16]:
openai_api_key = os.environ["OPENAI_API_KEY"]

In [17]:
turbo = dspy.OpenAI(model="gpt-4o", api_key=openai_api_key, max_tokens=1000)
rm = QdrantRetriever("rag_contexts", qdrant_client)

# Configure DSPy with a retrieval model (RM) and a language model (LM)
dspy.settings.configure(lm=turbo, rm=rm)

In [18]:
sample = test_data["question"].iloc[0]
dspy.Retrieve(k=10)(sample).passages

['endostatin peptide, a potent inhibitor of angiogenesis derived from type XVIII collagen,',
 'Finally, in contrast to most other ERM-binding proteins, ELMO1 binding occurred independently of the state of radixin C-terminal phosphorylation, suggesting an ELMO1 interaction with both the active and inactive forms of ERM proteins and implying a possible role of ELMO in localizing or retaining ERM proteins in certain cellular sites. Together these data suggest that ELMO1-mediated cytoskeletal changes may be coordinated with ERM protein crosslinking activity during dynamic cellular functions.',
 'Muscle LIM protein (MLP) has been proposed to be a central player in the pathogenesis of heart muscle disease.',
 'During T3-dependent amphibian metamorphosis, the digestive tract is extensively remodeled from the larval to the adult form for the adaptation of the amphibian from its aquatic herbivorous lifestyle to that of a terrestrial carnivorous frog. This involves de novo formation of ASCs that

### Signature Definition for Q/A System

In [19]:
# Define Signature for the QA system
class GenerateAnswer(dspy.Signature):
    """Answer questions based on the context."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField()

### RAG (Retrieval-Augmented Generation) Pipeline Creation

In [21]:
# Define a Custom RAG Pipeline
class RAG(dspy.Module):
    def __init__(self, collection_name="rag_contexts", num_passages=10):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)

In [22]:
uncompiled_rag = RAG()

In [23]:
uncompiled_rag(sample)

Prediction(
    context=['endostatin peptide, a potent inhibitor of angiogenesis derived from type XVIII collagen,', 'Finally, in contrast to most other ERM-binding proteins, ELMO1 binding occurred independently of the state of radixin C-terminal phosphorylation, suggesting an ELMO1 interaction with both the active and inactive forms of ERM proteins and implying a possible role of ELMO in localizing or retaining ERM proteins in certain cellular sites. Together these data suggest that ELMO1-mediated cytoskeletal changes may be coordinated with ERM protein crosslinking activity during dynamic cellular functions.', 'Muscle LIM protein (MLP) has been proposed to be a central player in the pathogenesis of heart muscle disease.', 'During T3-dependent amphibian metamorphosis, the digestive tract is extensively remodeled from the larval to the adult form for the adaptation of the amphibian from its aquatic herbivorous lifestyle to that of a terrestrial carnivorous frog. This involves de novo f

In [24]:
turbo.inspect_history(n=1)





Answer questions based on the context.

---

Follow the following format.

Context: may contain relevant facts

Question: ${question}

Reasoning: Let's think step by step in order to ${produce the answer}. We ...

Answer: ${answer}

---

Context:
[1] «endostatin peptide, a potent inhibitor of angiogenesis derived from type XVIII collagen,»
[2] «Finally, in contrast to most other ERM-binding proteins, ELMO1 binding occurred independently of the state of radixin C-terminal phosphorylation, suggesting an ELMO1 interaction with both the active and inactive forms of ERM proteins and implying a possible role of ELMO in localizing or retaining ERM proteins in certain cellular sites. Together these data suggest that ELMO1-mediated cytoskeletal changes may be coordinated with ERM protein crosslinking activity during dynamic cellular functions.»
[3] «Muscle LIM protein (MLP) has been proposed to be a central player in the pathogenesis of heart muscle disease.»
[4] «During T3-dependent amphib
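The logged prompt above shows the retrieved passages rendered as a numbered list, each wrapped in « ». The sketch below reproduces that layout with a hypothetical helper, `format_passages`, purely to illustrate how a list of passages ends up in the `Context:` field; it is not DSPy's own formatting code.

```python
from typing import List


def format_passages(passages: List[str]) -> str:
    """Render passages as '[1] «...»' lines, mirroring the logged prompt layout."""
    return "\n".join(
        f"[{i}] «{passage}»" for i, passage in enumerate(passages, start=1)
    )


# Hypothetical passages, for illustration only.
example_context = format_passages(
    ["first retrieved passage", "second retrieved passage"]
)
print("Context:\n" + example_context)
```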

### Metrics Definition and Assessment Signatures

In [30]:
metricLM = dspy.OpenAI(
    model="gpt-4o", api_key=openai_api_key, max_tokens=1000, model_type="chat"
)


# Signature for LLM assessments.
class Assess(dspy.Signature):
    """Assess the quality of an answer to a question."""

    context = dspy.InputField(desc="The context for answering the question.")
    assessed_question = dspy.InputField(desc="The evaluation criterion.")
    assessed_answer = dspy.InputField(desc="The answer to the question.")
    correct_answer = dspy.InputField(desc="The correct answer to the question.")
    assessment_answer = dspy.OutputField(
        desc="A rating between 1 and 5. Only output the rating and nothing else."
    )


def llm_metric(gold, pred, trace=None):
    predicted_answer = pred.answer
    gold_question = gold.question
    gold_answer = gold.answer

    detail = "Is the assessed answer detailed?"
    faithful = "Is the assessed text grounded in the context? Say no if it includes significant facts not in the context."
    correctness = f"Compare the given {predicted_answer} and {gold_answer} and assess how correct the answer is"

    with dspy.context(lm=metricLM):
        context = dspy.Retrieve(k=10)(gold_question).passages
        detail = dspy.ChainOfThought(Assess)(
            context="N/A",
            assessed_question=detail,
            assessed_answer=predicted_answer,
            correct_answer=gold_answer,
        )
        faithful = dspy.ChainOfThought(Assess)(
            context=context,
            assessed_question=faithful,
            assessed_answer=predicted_answer,
            correct_answer=gold_answer,
        )
        correctness = dspy.ChainOfThought(Assess)(
            context=context,
            assessed_question=correctness,
            assessed_answer=predicted_answer,
            correct_answer=gold_answer,
        )

    print(f"Faithful: {faithful.assessment_answer}")
    print(f"Detail: {detail.assessment_answer}")
    print(f"Correctness: {correctness.assessment_answer}")

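    # Each of the three ratings is at most 5, so dividing their sum by 10
    # gives a score of at most 1.5 (it is not normalized to [0, 1]).
    total = (
        float(detail.assessment_answer)
        + float(faithful.assessment_answer)
        + float(correctness.assessment_answer)
    )
    return total / 10.0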

The metric above is adapted from the following sources:
- [Reference_1](https://dspy-docs.vercel.app/docs/building-blocks/metrics#intermediate-using-ai-feedback-for-your-metric)
- [Reference_2](https://github.com/weaviate/recipes/blob/main/integrations/dspy/1.Getting-Started-with-RAG-in-DSPy.ipynb)
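One fragile spot in an LLM-based metric is converting the model's reply to a number: `float(...)` raises if the reply contains anything besides the rating. The sketch below shows one possible, more defensive approach. The helpers `parse_rating` and `combine_ratings` are hypothetical and not part of DSPy, and the normalization to [0, 1] is an alternative to the notebook's division by 10, not a description of it.

```python
import re
from typing import List, Optional


def parse_rating(raw: str, low: float = 1.0, high: float = 5.0) -> Optional[float]:
    """Extract the first number from an LLM reply and clamp it to [low, high].

    Returns None when the reply contains no number at all.
    Assumption: ratings are expected on a 1-5 scale.
    """
    match = re.search(r"-?\d+(?:\.\d+)?", raw)
    if match is None:
        return None
    value = float(match.group())
    return min(max(value, low), high)


def combine_ratings(ratings: List[Optional[float]], high: float = 5.0) -> float:
    """Average the ratings that parsed and scale the result to [0, 1]."""
    valid = [r for r in ratings if r is not None]
    if not valid:
        return 0.0
    return sum(valid) / (high * len(valid))


score = combine_ratings(
    [parse_rating("5"), parse_rating("Rating: 3/5"), parse_rating("n/a")]
)
print(score)
```

Whether to skip unparseable ratings, as done here, or to count them as the lowest rating is a design choice that depends on how strict the metric should be.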

Let's format the data the way the DSPy modules expect, then use part of it for training and part for evaluation.

In [31]:
trainset_dspy = train_data.sample(frac=0.8)
valset_dspy = train_data.drop(trainset_dspy.index)

In [32]:
from ast import literal_eval
import dspy


def read_list_from_string(s):
    try:
        return literal_eval(s)
    except (ValueError, SyntaxError):
        return s.split() if isinstance(s, str) else []


def stringify_list_elements(lst):
    lst = read_list_from_string(lst)
    return " ".join(str(e) for e in lst)


trainset = [
    dspy.Example(
        question=row["question"],
        #  contexts=stringify_list_elements(row['contexts']),
        answer=stringify_list_elements(row["ground_truth"]),
    ).with_inputs("question")
    for i, row in trainset_dspy.iterrows()
]

valset = [
    dspy.Example(
        question=row["question"],
        # contexts=stringify_list_elements(row['contexts']),
        answer=stringify_list_elements(row["ground_truth"]),
    ).with_inputs("question")
    for i, row in valset_dspy.iterrows()
]
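As a quick check of what the two helpers above do, the sketch below repeats their definitions so that it runs on its own and applies them to a stringified list like the ones in the `ground_truth` column, plus a string that is not a Python literal. The variable names `raw`, `joined`, and `fallback` are only for this illustration.

```python
from ast import literal_eval


def read_list_from_string(s):
    # Parse a string such as "['a', 'b']" into a real list;
    # fall back to whitespace splitting if it is not a valid literal.
    try:
        return literal_eval(s)
    except (ValueError, SyntaxError):
        return s.split() if isinstance(s, str) else []


def stringify_list_elements(lst):
    # Join all elements into a single space-separated string.
    lst = read_list_from_string(lst)
    return " ".join(str(e) for e in lst)


raw = "['Yes, annexin V is an early apoptotic marker.', 'Yes, Annexin V is an apoptotic marker.']"
joined = stringify_list_elements(raw)
fallback = stringify_list_elements("not a Python literal")

print(joined)
print(fallback)
```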

In [33]:
# For demonstration purposes, let's keep it to 20 examples. Remember to choose this size wisely, as evaluation and training are all tied to API calls
devset = valset[:20]

In [34]:
from dspy.evaluate.evaluate import Evaluate

evaluate = Evaluate(
    devset=devset, num_threads=8, display_progress=True, display_table=5
)
uncompile_k_10 = RAG(num_passages=10)
uncompiled_10_metrics = evaluate(
    uncompile_k_10, metric=llm_metric, return_all_scores=True, return_outputs=True
)

  0%|          | 0/20 [00:00<?, ?it/s]

Average Metric: 1.6 / 2  (80.0):  10%|█         | 2/20 [00:00<00:03,  4.68it/s]

Faithful: 5
Detail: 2
Correctness: 1
Faithful: 5
Detail: 2
Correctness: 1


Average Metric: 2.0 / 3  (66.7):  15%|█▌        | 3/20 [00:05<00:42,  2.49s/it]

Faithful: 1
Detail: 2
Correctness: 1


Average Metric: 3.5999999999999996 / 7  (51.4):  35%|███▌      | 7/20 [00:06<00:09,  1.44it/s]

Faithful: 1
Detail: 1
Correctness: 1
Faithful: 1
Detail: 1
Correctness: 1
Faithful: 5
Detail: 1
Correctness: 1
Faithful: 1
Detail: 1
Correctness: 1


Average Metric: 5.799999999999999 / 9  (64.4):  45%|████▌     | 9/20 [00:07<00:05,  1.89it/s] 

Faithful: 5
Detail: 1
Correctness: 5
Faithful: 5
Detail: 1
Correctness: 5


Average Metric: 6.499999999999999 / 10  (65.0):  50%|█████     | 10/20 [00:08<00:07,  1.38it/s]

Faithful: 5
Detail: 1
Correctness: 1


Average Metric: 6.799999999999999 / 11  (61.8):  55%|█████▌    | 11/20 [00:12<00:13,  1.50s/it]

Faithful: 1
Detail: 1
Correctness: 1


Average Metric: 9.1 / 15  (60.7):  75%|███████▌  | 15/20 [00:13<00:03,  1.57it/s]              

Faithful: 1
Detail: 1
Correctness: 1
Faithful: 1
Detail: 5
Correctness: 3
Faithful: 1
Detail: 1
Correctness: 1
Faithful: 1
Detail: 3
Correctness: 4


Average Metric: 10.100000000000001 / 18  (56.1):  85%|████████▌ | 17/20 [00:13<00:01,  1.56it/s]

Faithful: 1
Detail: 1
Correctness: 1
Faithful: 1
Detail: 1
Correctness: 1
Faithful: 1
Detail: 2
Correctness: 1


Average Metric: 10.900000000000002 / 19  (57.4):  95%|█████████▌| 19/20 [00:19<00:01,  1.21s/it]

Faithful: 5
Detail: 2
Correctness: 1


Average Metric: 11.500000000000002 / 20  (57.5): 100%|██████████| 20/20 [00:20<00:00,  1.01s/it]

Faithful: 1
Detail: 3
Correctness: 2
Average Metric: 11.500000000000002 / 20  (57.5%)





Unnamed: 0,question,example_answer,context,pred_answer,llm_metric
0,Is Rheumatoid Arthritis related to myopathy?,"Vacuolar myopathy and statin-induced myopathy have been reported in rheumatoid arthritis patients, but this association may be due to the anti-malarial treatment received. An increased...","['Muscle LIM protein (MLP) has been proposed to be a central player in the pathogenesis of heart muscle disease.', 'The Ehlers-Danlos syndromes (EDS) comprise a...","No, the provided context does not mention a relationship between Rheumatoid Arthritis and myopathy.",0.8
1,Has the protein SETMAR (Metnase) a transposase domain?,"Yes, the protein SETMAR (Metnase) has a transposase domain.","['The enzyme was capable of transferring methyl groups to selected lysine residues in a substrate protein using S-adenosyl-l-methionine (SAM) as the methyl donor.', 'FBW7 is...","The provided context does not contain information about the protein SETMAR (Metnase) or its domains. Therefore, we cannot determine from this context whether SETMAR (Metnase)...",0.4
2,What is the number of long non coding RNAs in the human genome,"Different estimates put currently the number of human long non coding RNAs between 10,000 and 20,000",['In this paper we have explored the benefits of combining predictions from already existing gene prediction programs. We have introduced three novel methods for combining...,The context does not provide information about the number of long non-coding RNAs in the human genome.,0.3
3,How does miR-1 overexpression worsen arrhythmias in coronary artery disease patients and what are the implications for antiarrhythmic treatments?,miR-1 overexpression worsens arrhythmias in coronary artery disease patients by regulating pacemaker channel genes and contributing to arrhythmogenesis. This up-regulation of miR-1 in patients with...,"['Moreover, miR-27a was demonstrated to modulate β-MHC gene regulation via thyroid hormone signaling and to be upregulated during the differentiation of mouse embryonic stem (ES)...","The provided context does not contain information about miR-1 overexpression, its effects on arrhythmias in coronary artery disease patients, or the implications for antiarrhythmic treatments.",0.8
4,Which post-translational histone modifications are characteristic of constitutive heterochromatin?,"H3K9me3 is the major marker of constitutive heterochromatin. Other histone methylation marks usually found in constitutive heterochromatin, are H4K20me3 and H3K79me3. Classical histone modifications associated...","['Covalent histone modifications (e.g. ubiquitination, phosphorylation, methylation, acetylation) and H2A variants (H2A.Z, H2A.X and H2A.W) are also discussed in view of their crucial importance in...",The context does not provide specific information about which post-translational histone modifications are characteristic of constitutive heterochromatin.,1.1


In [35]:
def create_score_dataframe(eval_output):
    # Extract questions and answers from the examples
    questions = [ex[0].question for ex in eval_output]
    answers = [ex[1].answer for ex in eval_output]
    scores = [ex[2] for ex in eval_output]
    # Create a DataFrame with questions, answers, and scores
    score_dataframe = pd.DataFrame(
        {"question": questions, "predicted_answer": answers, "score": scores}
    )
    return score_dataframe


In [36]:
pd.set_option("display.max_colwidth", 500)
pd.set_option("display.max_rows", 500)
eval_outs = uncompiled_10_metrics[1]
eval_outs_df = create_score_dataframe(eval_outs)
print(f"Mean Score for the devset is {eval_outs_df['score'].mean()}")
eval_outs_df

Mean Score for the devset is 0.5750000000000001


Unnamed: 0,question,predicted_answer,score
0,Is Rheumatoid Arthritis related to myopathy?,"No, the provided context does not mention a relationship between Rheumatoid Arthritis and myopathy.",0.8
1,How does miR-1 overexpression worsen arrhythmias in coronary artery disease patients and what are the implications for antiarrhythmic treatments?,"The provided context does not contain information about miR-1 overexpression, its effects on arrhythmias in coronary artery disease patients, or the implications for antiarrhythmic treatments.",0.8
2,Has the protein SETMAR (Metnase) a transposase domain?,"The provided context does not contain information about the protein SETMAR (Metnase) or its domains. Therefore, we cannot determine from this context whether SETMAR (Metnase) has a transposase domain.",0.4
3,What is the role of RhoA in bladder cancer?,The provided context does not contain information about the role of RhoA in bladder cancer.,0.3
4,Is cystatin C or cystatin 3 used as a biomarker of kidney function?,The context does not provide information on cystatin C or cystatin 3 as biomarkers of kidney function.,0.3
5,Are thyroid hormone receptor alpha1 mutations implicated in thyroid hormone resistance syndrome?,The context does not provide any information about thyroid hormone receptor alpha1 mutations or their implication in thyroid hormone resistance syndrome.,0.7
6,What is the number of long non coding RNAs in the human genome,The context does not provide information about the number of long non-coding RNAs in the human genome.,0.3
7,Which post-translational histone modifications are characteristic of constitutive heterochromatin?,The context does not provide specific information about which post-translational histone modifications are characteristic of constitutive heterochromatin.,1.1
8,Which are the known human transmembrane nucleoporins?,The context does not provide information about known human transmembrane nucleoporins.,1.1
9,Which gene is associated with Muenke syndrome?,The context does not provide information about the gene associated with Muenke syndrome.,0.7


### Prompt from the Uncompiled Model

In [37]:
turbo.inspect_history(n=1)





Answer questions based on the context.

---

Follow the following format.

Context: may contain relevant facts

Question: ${question}

Reasoning: Let's think step by step in order to ${produce the answer}. We ...

Answer: ${answer}

---

Context:
[1] «In the present study, we tested the hypothesis that having migraine in middle age is related to late-life parkinsonism and a related disorder, restless legs syndrome (RLS), also known as Willis-Ekbom disease (WED).The AGES-Reykjavik cohort (born 1907-1935) has been followed since 1967.»
[2] «These traits are controlled by neurotransmitters like dopamine, serotonin and norepinephrine. Monoamine oxidase A (MAOA), a mitochondrial enzyme involved in the degradation of amines, has been reported to be associated with aggression, impulsivity, depression, and mood changes.»
[3] «PTEN-induced putative kinase 1 (PINK1) is a causative gene for autosomal recessive early onset parkinsonism.»
[4] «To determine the effectiveness of gabapentin as an 

In [38]:
# Let's check the metric LM's prompt as well
metricLM.inspect_history(n=3)





Assess the quality of an answer to a question.

---

Follow the following format.

Context: The context for answering the question.

Assessed Question: The evaluation criterion.

Assessed Answer: The answer to the question.

Correct Answer: The correct answer to the question.

Reasoning: Let's think step by step in order to ${produce the assessment_answer}. We ...

Assessment Answer: A rating between 1 and 5. Only output the rating and nothing else.

---

Context:
[1] «Finally, in contrast to most other ERM-binding proteins, ELMO1 binding occurred independently of the state of radixin C-terminal phosphorylation, suggesting an ELMO1 interaction with both the active and inactive forms of ERM proteins and implying a possible role of ELMO in localizing or retaining ERM proteins in certain cellular sites. Together these data suggest that ELMO1-mediated cytoskeletal changes may be coordinated with ERM protein crosslinking activity during dynamic cellular functions.»
[2] «using proximity 

In [39]:
# Since 'trainset' is a list and doesn't have a 'sample' method, we will define a function to sample from it
def sample_from_list(lst, fraction):
    sample_size = int(len(lst) * fraction)
    return random.sample(lst, sample_size)


# Now we use the function to sample 2% of the dataset
trainset_truncated = sample_from_list(trainset, 0.02)
len(trainset_truncated)

1
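With 80 training examples, `int(80 * 0.02)` truncates 1.6 down to 1, which is why the sample above contains a single example. On a smaller list the same arithmetic would yield an empty sample. The variant below is a sketch of one way to guard against that; the `minimum` and `seed` parameters are assumptions added for this illustration and are not part of the notebook's function.

```python
import random
from typing import List, Optional, TypeVar

T = TypeVar("T")


def sample_from_list(
    lst: List[T],
    fraction: float,
    minimum: int = 1,
    seed: Optional[int] = None,
) -> List[T]:
    """Sample roughly `fraction` of `lst`, but never fewer than `minimum` items.

    The size is capped at len(lst), so an empty list yields an empty sample.
    Passing a seed makes the sample reproducible.
    """
    rng = random.Random(seed)
    size = max(minimum, int(len(lst) * fraction))
    size = min(size, len(lst))
    return rng.sample(lst, size)


print(len(sample_from_list(list(range(80)), 0.02)))  # int(1.6) truncates to 1
print(len(sample_from_list(list(range(10)), 0.02)))  # int(0.2) is 0, raised to the minimum
```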

### Optimizer: Bootstrap Random Search Optimization
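Roughly speaking, this optimizer builds several candidate few-shot prompts from different sets of demonstrations, scores each candidate with the metric, and keeps the best one. The toy sketch below illustrates only that random-search idea in plain Python. It is not DSPy's implementation, and `random_search`, `toy_score`, and the demo strings are hypothetical names invented for this example.

```python
import random
from typing import Callable, List, Sequence, Tuple


def random_search(
    demo_pool: Sequence[str],
    score_fn: Callable[[List[str]], float],
    num_candidates: int,
    demos_per_candidate: int,
    seed: int = 0,
) -> Tuple[List[str], float]:
    """Try random demo subsets and return the best-scoring one with its score."""
    rng = random.Random(seed)
    # Baseline candidate: a prompt with no demonstrations at all.
    best_demos: List[str] = []
    best_score = score_fn(best_demos)
    size = min(demos_per_candidate, len(demo_pool))
    for _ in range(num_candidates):
        candidate = rng.sample(list(demo_pool), size)
        score = score_fn(candidate)
        if score > best_score:
            best_demos, best_score = candidate, score
    return best_demos, best_score


def toy_score(demos: List[str]) -> float:
    # Hypothetical stand-in for an expensive metric: longer demos score higher.
    return sum(len(d) for d in demos) / 100


best_demos, best_score = random_search(
    ["a", "bbb", "cc"], toy_score, num_candidates=5, demos_per_candidate=2
)
print(best_demos, best_score)
```

In the real optimizer every call to the scoring function means running the program and the LLM-based metric over a set of examples, which is why the number of candidates and the training set size are kept small in this notebook.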

In [40]:
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

teleprompter = BootstrapFewShotWithRandomSearch(
    metric=llm_metric,
    max_bootstrapped_demos=2,
    max_labeled_demos=4,
    max_rounds=1,
    num_candidate_programs=2,
    num_threads=8,
)

few_shot_bootstrap_compiled_rag = teleprompter.compile(
    uncompile_k_10, trainset=trainset_truncated
)

Going to sample between 1 and 2 traces per predictor.
Will attempt to train 2 candidate sets.


Average Metric: 0.4 / 1  (40.0): 100%|██████████| 1/1 [00:00<00:00,  4.42it/s]


Faithful: 1
Detail: 2
Correctness: 1
Average Metric: 0.4 / 1  (40.0%)
Score: 40.0 for set: [0]
New best score: 40.0 for seed -3
Scores so far: [40.0]
Best score: 40.0


Average Metric: 0.7 / 1  (70.0): 100%|██████████| 1/1 [00:06<00:00,  6.52s/it]


Faithful: 3
Detail: 2
Correctness: 2
Average Metric: 0.7 / 1  (70.0%)
Score: 70.0 for set: [1]
New best score: 70.0 for seed -2
Scores so far: [40.0, 70.0]
Best score: 70.0


100%|██████████| 1/1 [00:00<00:00,  4.30it/s]


Faithful: 1
Detail: 2
Correctness: 1
Bootstrapped 1 full traces after 1 examples in round 0.


Average Metric: 0.4 / 1  (40.0): 100%|██████████| 1/1 [00:01<00:00,  1.65s/it]


Faithful: 1
Detail: 2
Correctness: 1
Average Metric: 0.4 / 1  (40.0%)
Score: 40.0 for set: [1]
Scores so far: [40.0, 70.0, 40.0]
Best score: 70.0
Average of max per entry across top 1 scores: 0.7
Average of max per entry across top 2 scores: 0.7
Average of max per entry across top 3 scores: 0.7
Average of max per entry across top 5 scores: 0.7
Average of max per entry across top 8 scores: 0.7
Average of max per entry across top 9999 scores: 0.7


100%|██████████| 1/1 [00:00<00:00,  2.85it/s]


Faithful: 1
Detail: 2
Correctness: 1
Bootstrapped 1 full traces after 1 examples in round 0.


Average Metric: 0.4 / 1  (40.0): 100%|██████████| 1/1 [00:00<00:00,  2.92it/s]


Faithful: 1
Detail: 2
Correctness: 1
Average Metric: 0.4 / 1  (40.0%)
Score: 40.0 for set: [1]
Scores so far: [40.0, 70.0, 40.0, 40.0]
Best score: 70.0
Average of max per entry across top 1 scores: 0.7
Average of max per entry across top 2 scores: 0.7
Average of max per entry across top 3 scores: 0.7
Average of max per entry across top 5 scores: 0.7
Average of max per entry across top 8 scores: 0.7
Average of max per entry across top 9999 scores: 0.7


100%|██████████| 1/1 [00:00<00:00,  3.15it/s]


Faithful: 1
Detail: 2
Correctness: 1
Bootstrapped 1 full traces after 1 examples in round 0.


Average Metric: 0.4 / 1  (40.0): 100%|██████████| 1/1 [00:00<00:00,  3.30it/s]

Faithful: 1
Detail: 2
Correctness: 1
Average Metric: 0.4 / 1  (40.0%)
Score: 40.0 for set: [1]
Scores so far: [40.0, 70.0, 40.0, 40.0, 40.0]
Best score: 70.0
Average of max per entry across top 1 scores: 0.7
Average of max per entry across top 2 scores: 0.7
Average of max per entry across top 3 scores: 0.7
Average of max per entry across top 5 scores: 0.7
Average of max per entry across top 8 scores: 0.7
Average of max per entry across top 9999 scores: 0.7
5 candidate programs found.
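
The three ratings printed for each example appear to combine into the reported score as their sum divided by 10: ratings of 1, 2 and 1 are logged with a score of 0.4, and ratings of 3, 2 and 2 with 0.7. This is inferred from the log output above, not taken from the definition of `llm_metric`, so treat the sketch below as an assumption. `combine_ratings` is a hypothetical name.

```python
def combine_ratings(faithful, detail, correctness):
    """Combine three 1-5 ratings into one score (assumed formula: sum / 10)."""
    return (faithful + detail + correctness) / 10


# Ratings copied from the log lines above.
print(combine_ratings(1, 2, 1))  # first candidate set
print(combine_ratings(3, 2, 2))  # second candidate set
```

Under this assumption the highest possible score is 1.5 (all three ratings at 5), so the percentages shown in the progress bars are not capped at 100.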





In [41]:
# Let's check the prompt for this compiled model
turbo.inspect_history(n=1)





Answer questions based on the context.

---

Follow the following format.

Context: may contain relevant facts

Question: ${question}

Reasoning: Let's think step by step in order to ${produce the answer}. We ...

Answer: ${answer}

---

Context:
[1] «An incidence peak for aneurysm rupture (28 patients) was seen during the phase of new moon, which was statistically significant (p < 0.001)»
[2] «Everolimus for subependymal giant-cell astrocytomas in tuberous sclerosis.»
[3] «Results from a phase 1 study of nusinersen (ISIS-SMN(Rx)) in children with spinal muscular atrophy.»
[4] «Hydrochlorothiazide 25-200 mg daily, chlorothiazide 500 mg twice daily, and indapamide 2.5 mg daily provided long-term blood pressure reduction in patients with severe renal disease who were not on dialysis»
[5] «Cocaine use and hypertension are major risk factors for intracerebral hemorrhage in young African Americans.»
[6] «endostatin peptide, a potent inhibitor of angiogenesis derived from type XVIII coll

Notice how the prompt has become somewhat more specific in how it handles the examples, and that extra instructions have been added. Let's now evaluate it on the `devset` we created and see how the model performs.

In [42]:
few_shot_bootstrap_compiled_rag_evals = evaluate(
    few_shot_bootstrap_compiled_rag,
    metric=llm_metric,
    return_all_scores=True,
    return_outputs=True,
)

Average Metric: 0.3 / 1  (30.0):   5%|▌         | 1/20 [00:01<00:37,  1.97s/it]

Faithful: 1
Detail: 1
Correctness: 1


Average Metric: 1.4000000000000001 / 2  (70.0):  10%|█         | 2/20 [00:03<00:30,  1.69s/it]

Faithful: 5
Detail: 1
Correctness: 5


Average Metric: 2.5 / 3  (83.3):  15%|█▌        | 3/20 [00:04<00:20,  1.18s/it]               

Faithful: 5
Detail: 1
Correctness: 5


Average Metric: 3.3 / 4  (82.5):  20%|██        | 4/20 [00:05<00:19,  1.24s/it]

Faithful: 5
Detail: 2
Correctness: 1


Average Metric: 4.1 / 5  (82.0):  25%|██▌       | 5/20 [00:06<00:15,  1.03s/it]

Faithful: 5
Detail: 2
Correctness: 1


Average Metric: 5.2 / 7  (74.3):  30%|███       | 6/20 [00:06<00:11,  1.22it/s]

Faithful: 1
Detail: 2
Correctness: 1
Faithful: 5
Detail: 1
Correctness: 1


Average Metric: 5.9 / 8  (73.8):  40%|████      | 8/20 [00:07<00:06,  1.87it/s]

Faithful: 1
Detail: 2
Correctness: 4


Average Metric: 6.7 / 9  (74.4):  45%|████▌     | 9/20 [00:10<00:14,  1.34s/it]

Faithful: 1
Detail: 3
Correctness: 4


Average Metric: 8.1 / 11  (73.6):  55%|█████▌    | 11/20 [00:10<00:06,  1.35it/s]

Faithful: 5
Detail: 1
Correctness: 1
Faithful: 1
Detail: 2
Correctness: 4


Average Metric: 8.799999999999999 / 12  (73.3):  60%|██████    | 12/20 [00:10<00:05,  1.53it/s]

Faithful: 5
Detail: 1
Correctness: 1


Average Metric: 9.6 / 13  (73.8):  65%|██████▌   | 13/20 [00:12<00:05,  1.29it/s]              

Faithful: 5
Detail: 2
Correctness: 1


Average Metric: 9.9 / 14  (70.7):  70%|███████   | 14/20 [00:12<00:04,  1.24it/s]

Faithful: 1
Detail: 1
Correctness: 1


Average Metric: 10.700000000000001 / 15  (71.3):  75%|███████▌  | 15/20 [00:13<00:03,  1.46it/s]

Faithful: 1
Detail: 3
Correctness: 4


Average Metric: 11.000000000000002 / 16  (68.8):  80%|████████  | 16/20 [00:13<00:02,  1.58it/s]

Faithful: 1
Detail: 1
Correctness: 1


Average Metric: 11.700000000000001 / 17  (68.8):  85%|████████▌ | 17/20 [00:14<00:01,  1.68it/s]

Faithful: 1
Detail: 3
Correctness: 3


Average Metric: 12.100000000000001 / 18  (67.2):  90%|█████████ | 18/20 [00:15<00:01,  1.41it/s]

Faithful: 1
Detail: 2
Correctness: 1


Average Metric: 12.600000000000001 / 19  (66.3):  95%|█████████▌| 19/20 [00:16<00:00,  1.03it/s]

Faithful: 1
Detail: 2
Correctness: 2


Average Metric: 13.3 / 20  (66.5): 100%|██████████| 20/20 [00:17<00:00,  1.15it/s]              

Faithful: 1
Detail: 3
Correctness: 3
Average Metric: 13.3 / 20  (66.5%)





Unnamed: 0,question,example_answer,context,pred_answer,llm_metric
0,Is Rheumatoid Arthritis related to myopathy?,"Vacuolar myopathy and statin-induced myopathy have been reported in rheumatoid arthritis patients, but this association may be due to the anti-malarial treatment received. An increased...","['Muscle LIM protein (MLP) has been proposed to be a central player in the pathogenesis of heart muscle disease.', 'The Ehlers-Danlos syndromes (EDS) comprise a...",The provided context does not contain information about a relationship between Rheumatoid Arthritis and myopathy.,0.8
1,Has the protein SETMAR (Metnase) a transposase domain?,"Yes, the protein SETMAR (Metnase) has a transposase domain.","['The enzyme was capable of transferring methyl groups to selected lysine residues in a substrate protein using S-adenosyl-l-methionine (SAM) as the methyl donor.', 'FBW7 is...","The context provided does not contain any information about the protein SETMAR (Metnase) or its domains, including a transposase domain. Therefore, we cannot determine from...",0.4
2,What is the number of long non coding RNAs in the human genome,"Different estimates put currently the number of human long non coding RNAs between 10,000 and 20,000",['In this paper we have explored the benefits of combining predictions from already existing gene prediction programs. We have introduced three novel methods for combining...,The provided context does not contain information about the number of long non-coding RNAs in the human genome.,0.7
3,How does miR-1 overexpression worsen arrhythmias in coronary artery disease patients and what are the implications for antiarrhythmic treatments?,miR-1 overexpression worsens arrhythmias in coronary artery disease patients by regulating pacemaker channel genes and contributing to arrhythmogenesis. This up-regulation of miR-1 in patients with...,"['Moreover, miR-27a was demonstrated to modulate β-MHC gene regulation via thyroid hormone signaling and to be upregulated during the differentiation of mouse embryonic stem (ES)...","The context does not provide specific information about miR-1 overexpression and its effects on arrhythmias in coronary artery disease patients. However, based on general knowledge,...",0.8
4,Which post-translational histone modifications are characteristic of constitutive heterochromatin?,"H3K9me3 is the major marker of constitutive heterochromatin. Other histone methylation marks usually found in constitutive heterochromatin, are H4K20me3 and H3K79me3. Classical histone modifications associated...","['Covalent histone modifications (e.g. ubiquitination, phosphorylation, methylation, acetylation) and H2A variants (H2A.Z, H2A.X and H2A.W) are also discussed in view of their crucial importance in...",The context does not provide specific information about which post-translational histone modifications are characteristic of constitutive heterochromatin.,1.1


In [44]:
pd.set_option("display.max_colwidth", 500)
pd.set_option("display.max_rows", 500)
evals_outs_df = create_score_dataframe(few_shot_bootstrap_compiled_rag_evals[1])
print(evals_outs_df["score"].mean())
evals_outs_df

0.6649999999999999


Unnamed: 0,question,predicted_answer,score
0,What is the role of RhoA in bladder cancer?,The provided context does not contain information about the role of RhoA in bladder cancer.,0.3
1,Which post-translational histone modifications are characteristic of constitutive heterochromatin?,The context does not provide specific information about which post-translational histone modifications are characteristic of constitutive heterochromatin.,1.1
2,Which are the known human transmembrane nucleoporins?,The context does not provide information about known human transmembrane nucleoporins.,1.1
3,Are thyroid hormone receptor alpha1 mutations implicated in thyroid hormone resistance syndrome?,The context does not provide information on thyroid hormone receptor alpha1 mutations being implicated in thyroid hormone resistance syndrome.,0.8
4,Is Rheumatoid Arthritis related to myopathy?,The provided context does not contain information about a relationship between Rheumatoid Arthritis and myopathy.,0.8
5,Has the protein SETMAR (Metnase) a transposase domain?,"The context provided does not contain any information about the protein SETMAR (Metnase) or its domains, including a transposase domain. Therefore, we cannot determine from this context whether SETMAR (Metnase) has a transposase domain.",0.4
6,What is the number of long non coding RNAs in the human genome,The provided context does not contain information about the number of long non-coding RNAs in the human genome.,0.7
7,Is cystatin C or cystatin 3 used as a biomarker of kidney function?,Cystatin C is used as a biomarker of kidney function.,0.7
8,How does miR-1 overexpression worsen arrhythmias in coronary artery disease patients and what are the implications for antiarrhythmic treatments?,"The context does not provide specific information about miR-1 overexpression and its effects on arrhythmias in coronary artery disease patients. However, based on general knowledge, miR-1 is known to play a role in cardiac electrophysiology and its overexpression can lead to arrhythmias by affecting ion channel expression and function. This can worsen arrhythmias in coronary artery disease patients by promoting abnormal electrical activity in the heart. The implications for antiarrhythmic tr...",0.8
9,Which are the common symptoms of Cushing's syndrome?,The context does not provide information about the common symptoms of Cushing's syndrome.,0.7


In [45]:
few_shot_bootstrap_compiled_rag(sample)

Prediction(
    context=['endostatin peptide, a potent inhibitor of angiogenesis derived from type XVIII collagen,', 'Finally, in contrast to most other ERM-binding proteins, ELMO1 binding occurred independently of the state of radixin C-terminal phosphorylation, suggesting an ELMO1 interaction with both the active and inactive forms of ERM proteins and implying a possible role of ELMO in localizing or retaining ERM proteins in certain cellular sites. Together these data suggest that ELMO1-mediated cytoskeletal changes may be coordinated with ERM protein crosslinking activity during dynamic cellular functions.', 'Muscle LIM protein (MLP) has been proposed to be a central player in the pathogenesis of heart muscle disease.', 'During T3-dependent amphibian metamorphosis, the digestive tract is extensively remodeled from the larval to the adult form for the adaptation of the amphibian from its aquatic herbivorous lifestyle to that of a terrestrial carnivorous frog. This involves de novo f

### Signature Optimizer

Optimizing the signature is another way to try to improve your model's performance. You can plug in the bootstrapped compiled model from above, or start from the uncompiled model.

In [56]:
from dspy.teleprompt import MIPRO

llm_prompter = dspy.OpenAI(model="gpt-4o", max_tokens=2000, model_type="chat")

teleprompter = MIPRO(
    task_model=dspy.settings.lm,
    metric=llm_metric,
    prompt_model=llm_prompter,
    verbose=False,
)
kwargs = dict(num_threads=8, display_progress=True, display_table=0)
mipro_compiled_rag = teleprompter.compile(
    uncompile_k_10,
    eval_kwargs=kwargs,
    trainset=trainset_truncated,
    num_trials=20,
    max_bootstrapped_demos=2,
    max_labeled_demos=8,
    requires_permission_to_run=False,
)


Please be advised that based on the parameters you have set, the maximum number of LM calls is projected as follows:

- Task Model: 1 examples in dev set * 20 trials * # of LM calls in your program = (20 * # of LM calls in your program) task model calls
- Prompt Model: # data summarizer calls (max 10) + 10 * 1 lm calls in program = 20 prompt model calls

Estimated Cost Calculation:

Total Cost = (Number of calls to task model * (Avg Input Token Length per Call * Task Model Price per Input Token + Avg Output Token Length per Call * Task Model Price per Output Token)
            + (Number of calls to prompt model * (Avg Input Token Length per Call * Task Prompt Price per Input Token + Avg Output Token Length per Call * Prompt Model Price per Output Token).

For a preliminary estimate of potential costs, w

100%|██████████| 1/1 [00:00<00:00,  4.07it/s]


Faithful: 1
Detail: 2
Correctness: 1
Bootstrapped 1 full traces after 1 examples in round 0.


100%|██████████| 1/1 [00:00<00:00,  2.72it/s]


Faithful: 1
Detail: 2
Correctness: 1
Bootstrapped 1 full traces after 1 examples in round 0.


100%|██████████| 1/1 [00:00<00:00,  2.96it/s]


Faithful: 1
Detail: 2
Correctness: 1
Bootstrapped 1 full traces after 1 examples in round 0.


100%|██████████| 1/1 [00:00<00:00,  2.43it/s]


Faithful: 1
Detail: 2
Correctness: 1
Bootstrapped 1 full traces after 1 examples in round 0.


100%|██████████| 1/1 [00:00<00:00,  2.82it/s]


Faithful: 1
Detail: 2
Correctness: 1
Bootstrapped 1 full traces after 1 examples in round 0.


100%|██████████| 1/1 [00:00<00:00,  2.64it/s]


Faithful: 1
Detail: 2
Correctness: 1
Bootstrapped 1 full traces after 1 examples in round 0.


100%|██████████| 1/1 [00:00<00:00,  2.55it/s]


Faithful: 1
Detail: 2
Correctness: 1
Bootstrapped 1 full traces after 1 examples in round 0.


100%|██████████| 1/1 [00:00<00:00,  2.73it/s]


Faithful: 1
Detail: 2
Correctness: 1
Bootstrapped 1 full traces after 1 examples in round 0.


100%|██████████| 1/1 [00:00<00:00,  2.71it/s]
[I 2024-09-21 16:24:56,123] A new study created in memory with name: no-name-b725fb45-fe96-4e41-b527-66bfae47c947


Faithful: 1
Detail: 2
Correctness: 1
Bootstrapped 1 full traces after 1 examples in round 0.
Starting trial #0


Average Metric: 0.4 / 1  (40.0): 100%|██████████| 1/1 [00:00<00:00,  3.23it/s]
[I 2024-09-21 16:24:56,441] Trial 0 finished with value: 40.0 and parameters: {'15104646032_predictor_instruction': 1, '15104646032_predictor_demos': 1}. Best is trial 0 with value: 40.0.


Faithful: 1
Detail: 2
Correctness: 1
Average Metric: 0.4 / 1  (40.0%)
Starting trial #1


Average Metric: 0.4 / 1  (40.0): 100%|██████████| 1/1 [00:00<00:00,  3.48it/s]
[I 2024-09-21 16:24:56,735] Trial 1 finished with value: 40.0 and parameters: {'15104646032_predictor_instruction': 5, '15104646032_predictor_demos': 4}. Best is trial 0 with value: 40.0.


Faithful: 1
Detail: 2
Correctness: 1
Average Metric: 0.4 / 1  (40.0%)
Starting trial #2


Average Metric: 0.9 / 1  (90.0): 100%|██████████| 1/1 [00:12<00:00, 12.72s/it]
[I 2024-09-21 16:25:09,468] Trial 2 finished with value: 90.0 and parameters: {'15104646032_predictor_instruction': 3, '15104646032_predictor_demos': 0}. Best is trial 2 with value: 90.0.


Faithful: 1
Detail: 4
Correctness: 4
Average Metric: 0.9 / 1  (90.0%)
Starting trial #3


Average Metric: 0.4 / 1  (40.0): 100%|██████████| 1/1 [00:00<00:00,  4.40it/s]
[I 2024-09-21 16:25:09,701] Trial 3 finished with value: 40.0 and parameters: {'15104646032_predictor_instruction': 9, '15104646032_predictor_demos': 3}. Best is trial 2 with value: 90.0.


Faithful: 1
Detail: 2
Correctness: 1
Average Metric: 0.4 / 1  (40.0%)
Starting trial #4


Average Metric: 0.4 / 1  (40.0): 100%|██████████| 1/1 [00:00<00:00,  2.83it/s]
[I 2024-09-21 16:25:10,065] Trial 4 finished with value: 40.0 and parameters: {'15104646032_predictor_instruction': 8, '15104646032_predictor_demos': 4}. Best is trial 2 with value: 90.0.


Faithful: 1
Detail: 2
Correctness: 1
Average Metric: 0.4 / 1  (40.0%)
Starting trial #5


Average Metric: 0.4 / 1  (40.0): 100%|██████████| 1/1 [00:00<00:00,  3.01it/s]
[I 2024-09-21 16:25:10,407] Trial 5 finished with value: 40.0 and parameters: {'15104646032_predictor_instruction': 4, '15104646032_predictor_demos': 2}. Best is trial 2 with value: 90.0.


Faithful: 1
Detail: 2
Correctness: 1
Average Metric: 0.4 / 1  (40.0%)
Starting trial #6


Average Metric: 0.4 / 1  (40.0): 100%|██████████| 1/1 [00:00<00:00,  3.15it/s]
[I 2024-09-21 16:25:10,732] Trial 6 finished with value: 40.0 and parameters: {'15104646032_predictor_instruction': 1, '15104646032_predictor_demos': 9}. Best is trial 2 with value: 90.0.


Faithful: 1
Detail: 2
Correctness: 1
Average Metric: 0.4 / 1  (40.0%)
Starting trial #7


Average Metric: 0.4 / 1  (40.0): 100%|██████████| 1/1 [00:00<00:00,  3.90it/s]
[I 2024-09-21 16:25:10,996] Trial 7 finished with value: 40.0 and parameters: {'15104646032_predictor_instruction': 0, '15104646032_predictor_demos': 4}. Best is trial 2 with value: 90.0.


Faithful: 1
Detail: 2
Correctness: 1
Average Metric: 0.4 / 1  (40.0%)
Starting trial #8


Average Metric: 0.4 / 1  (40.0): 100%|██████████| 1/1 [00:00<00:00,  2.67it/s]
[I 2024-09-21 16:25:11,377] Trial 8 finished with value: 40.0 and parameters: {'15104646032_predictor_instruction': 5, '15104646032_predictor_demos': 8}. Best is trial 2 with value: 90.0.


Faithful: 1
Detail: 2
Correctness: 1
Average Metric: 0.4 / 1  (40.0%)
Starting trial #9


Average Metric: 0.4 / 1  (40.0): 100%|██████████| 1/1 [00:00<00:00,  3.61it/s]
[I 2024-09-21 16:25:11,662] Trial 9 finished with value: 40.0 and parameters: {'15104646032_predictor_instruction': 2, '15104646032_predictor_demos': 2}. Best is trial 2 with value: 90.0.


Faithful: 1
Detail: 2
Correctness: 1
Average Metric: 0.4 / 1  (40.0%)
Starting trial #10


Average Metric: 0.9 / 1  (90.0): 100%|██████████| 1/1 [00:00<00:00,  3.76it/s]
[I 2024-09-21 16:25:11,941] Trial 10 finished with value: 90.0 and parameters: {'15104646032_predictor_instruction': 3, '15104646032_predictor_demos': 0}. Best is trial 2 with value: 90.0.


Faithful: 1
Detail: 4
Correctness: 4
Average Metric: 0.9 / 1  (90.0%)
Starting trial #11


Average Metric: 0.9 / 1  (90.0): 100%|██████████| 1/1 [00:00<00:00,  3.55it/s]
[I 2024-09-21 16:25:12,231] Trial 11 finished with value: 90.0 and parameters: {'15104646032_predictor_instruction': 3, '15104646032_predictor_demos': 0}. Best is trial 2 with value: 90.0.


Faithful: 1
Detail: 4
Correctness: 4
Average Metric: 0.9 / 1  (90.0%)
Starting trial #12


Average Metric: 0.9 / 1  (90.0): 100%|██████████| 1/1 [00:00<00:00,  3.32it/s]
[I 2024-09-21 16:25:12,542] Trial 12 finished with value: 90.0 and parameters: {'15104646032_predictor_instruction': 3, '15104646032_predictor_demos': 0}. Best is trial 2 with value: 90.0.


Faithful: 1
Detail: 4
Correctness: 4
Average Metric: 0.9 / 1  (90.0%)
Starting trial #13


Average Metric: 0.9 / 1  (90.0): 100%|██████████| 1/1 [00:00<00:00,  3.23it/s]
[I 2024-09-21 16:25:12,861] Trial 13 finished with value: 90.0 and parameters: {'15104646032_predictor_instruction': 3, '15104646032_predictor_demos': 0}. Best is trial 2 with value: 90.0.


Faithful: 1
Detail: 4
Correctness: 4
Average Metric: 0.9 / 1  (90.0%)
Starting trial #14


Average Metric: 0.4 / 1  (40.0): 100%|██████████| 1/1 [00:00<00:00,  3.44it/s]
[I 2024-09-21 16:25:13,159] Trial 14 finished with value: 40.0 and parameters: {'15104646032_predictor_instruction': 6, '15104646032_predictor_demos': 6}. Best is trial 2 with value: 90.0.


Faithful: 1
Detail: 2
Correctness: 1
Average Metric: 0.4 / 1  (40.0%)
Starting trial #15


Average Metric: 0.4 / 1  (40.0): 100%|██████████| 1/1 [00:00<00:00,  3.05it/s]
[I 2024-09-21 16:25:13,496] Trial 15 finished with value: 40.0 and parameters: {'15104646032_predictor_instruction': 7, '15104646032_predictor_demos': 5}. Best is trial 2 with value: 90.0.


Faithful: 1
Detail: 2
Correctness: 1
Average Metric: 0.4 / 1  (40.0%)
Starting trial #16


Average Metric: 0.4 / 1  (40.0): 100%|██████████| 1/1 [00:00<00:00,  3.82it/s]
[I 2024-09-21 16:25:13,771] Trial 16 finished with value: 40.0 and parameters: {'15104646032_predictor_instruction': 3, '15104646032_predictor_demos': 7}. Best is trial 2 with value: 90.0.


Faithful: 1
Detail: 2
Correctness: 1
Average Metric: 0.4 / 1  (40.0%)
Starting trial #17


Average Metric: 0.9 / 1  (90.0): 100%|██████████| 1/1 [00:00<00:00,  3.60it/s]
[I 2024-09-21 16:25:14,061] Trial 17 finished with value: 90.0 and parameters: {'15104646032_predictor_instruction': 3, '15104646032_predictor_demos': 0}. Best is trial 2 with value: 90.0.


Faithful: 1
Detail: 4
Correctness: 4
Average Metric: 0.9 / 1  (90.0%)
Starting trial #18


Average Metric: 0.9 / 1  (90.0): 100%|██████████| 1/1 [00:00<00:00,  4.35it/s]
[I 2024-09-21 16:25:14,300] Trial 18 finished with value: 90.0 and parameters: {'15104646032_predictor_instruction': 6, '15104646032_predictor_demos': 0}. Best is trial 2 with value: 90.0.


Faithful: 1
Detail: 4
Correctness: 4
Average Metric: 0.9 / 1  (90.0%)
Starting trial #19


Average Metric: 0.4 / 1  (40.0): 100%|██████████| 1/1 [00:00<00:00,  4.12it/s]
[I 2024-09-21 16:25:14,551] Trial 19 finished with value: 40.0 and parameters: {'15104646032_predictor_instruction': 8, '15104646032_predictor_demos': 1}. Best is trial 2 with value: 90.0.


Faithful: 1
Detail: 2
Correctness: 1
Average Metric: 0.4 / 1  (40.0%)
Returning generate_answer = ChainOfThought(GenerateAnswer(context, question -> answer
    instructions='Answer questions based on the context.'
    context = Field(annotation=str required=True json_schema_extra={'desc': 'may contain relevant facts', '__dspy_field_type': 'input', 'prefix': 'Context:'})
    question = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Question:', 'desc': '${question}'})
    answer = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'output', 'prefix': 'Answer:', 'desc': '${answer}'})
)) from continue_program
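
The Optuna trial log above can be summarized by grouping trial values by the instruction and demo indices that each trial selected. The sketch below is one way to do that. It assumes the log-line format shown above, and `summarize_trials` is a hypothetical helper, not part of DSPy or Optuna.

```python
import re
from collections import defaultdict

# Matches lines such as:
# "Trial 2 finished with value: 90.0 and parameters: {'..._predictor_instruction': 3, '..._predictor_demos': 0}"
TRIAL_RE = re.compile(
    r"Trial (\d+) finished with value: ([\d.]+) and parameters: "
    r"\{'\d+_predictor_instruction': (\d+), '\d+_predictor_demos': (\d+)\}"
)


def summarize_trials(log_text):
    """Map (instruction_index, demos_index) to the list of trial values seen."""
    summary = defaultdict(list)
    for match in TRIAL_RE.finditer(log_text):
        _trial, value, instruction, demos = match.groups()
        summary[(int(instruction), int(demos))].append(float(value))
    return dict(summary)
```

In the run above, every trial that scored 90.0 selected demo set 0. With a single training example this is weak evidence, so it is worth re-checking on a larger `trainset`.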


In [57]:
mipro_compiled_rag_eval = evaluate(
    mipro_compiled_rag,
    metric=llm_metric,
    return_all_scores=True,
    return_outputs=True,
)

Average Metric: 0.5 / 1  (50.0):   5%|▌         | 1/20 [00:13<04:07, 13.05s/it]

Faithful: 1
Detail: 3
Correctness: 1


Average Metric: 2.5 / 3  (83.3):  15%|█▌        | 3/20 [00:13<00:53,  3.14s/it]

Faithful: 1
Detail: 5
Correctness: 4
Faithful: 3
Detail: 4
Correctness: 3


Average Metric: 4.2 / 5  (84.0):  20%|██        | 4/20 [00:14<00:36,  2.25s/it]

Faithful: 3
Detail: 4
Correctness: 4
Faithful: 1
Detail: 2
Correctness: 3


Average Metric: 5.300000000000001 / 6  (88.3):  30%|███       | 6/20 [00:15<00:17,  1.24s/it]

Faithful: 1
Detail: 5
Correctness: 5


Average Metric: 6.200000000000001 / 7  (88.6):  35%|███▌      | 7/20 [00:16<00:15,  1.16s/it]

Faithful: 2
Detail: 3
Correctness: 4


Average Metric: 6.800000000000001 / 8  (85.0):  40%|████      | 8/20 [00:16<00:11,  1.07it/s]

Faithful: 1
Detail: 2
Correctness: 3


Average Metric: 8.700000000000001 / 10  (87.0):  45%|████▌     | 9/20 [00:27<00:40,  3.66s/it]

Faithful: 1
Detail: 5
Correctness: 4
Faithful: 3
Detail: 3
Correctness: 3


Average Metric: 9.000000000000002 / 11  (81.8):  55%|█████▌    | 11/20 [00:27<00:19,  2.15s/it]

Faithful: 1
Detail: 1
Correctness: 1


Average Metric: 9.300000000000002 / 12  (77.5):  60%|██████    | 12/20 [00:28<00:14,  1.78s/it]

Faithful: 1
Detail: 1
Correctness: 1


Average Metric: 11.300000000000002 / 14  (80.7):  70%|███████   | 14/20 [00:29<00:07,  1.22s/it]

Faithful: 4
Detail: 3
Correctness: 5
Faithful: 2
Detail: 3
Correctness: 3


Average Metric: 12.100000000000003 / 15  (80.7):  75%|███████▌  | 15/20 [00:31<00:06,  1.38s/it]

Faithful: 1
Detail: 3
Correctness: 4


Average Metric: 12.800000000000002 / 16  (80.0):  80%|████████  | 16/20 [00:32<00:05,  1.32s/it]

Faithful: 1
Detail: 1
Correctness: 5


Average Metric: 13.500000000000002 / 17  (79.4):  85%|████████▌ | 17/20 [00:40<00:09,  3.15s/it]

Faithful: 1
Detail: 3
Correctness: 3


Average Metric: 13.800000000000002 / 18  (76.7):  90%|█████████ | 18/20 [00:41<00:05,  2.53s/it]

Faithful: 1
Detail: 1
Correctness: 1


Average Metric: 14.300000000000002 / 19  (75.3):  95%|█████████▌| 19/20 [00:42<00:02,  2.16s/it]

Faithful: 1
Detail: 3
Correctness: 1


Average Metric: 14.800000000000002 / 20  (74.0): 100%|██████████| 20/20 [00:43<00:00,  2.20s/it]

Faithful: 1
Detail: 1
Correctness: 3
Average Metric: 14.800000000000002 / 20  (74.0%)





Unnamed: 0,question,example_answer,context,pred_answer,llm_metric
0,Is Rheumatoid Arthritis related to myopathy?,"Vacuolar myopathy and statin-induced myopathy have been reported in rheumatoid arthritis patients, but this association may be due to the anti-malarial treatment received. An increased...","['Muscle LIM protein (MLP) has been proposed to be a central player in the pathogenesis of heart muscle disease.', 'The Ehlers-Danlos syndromes (EDS) comprise a...",${answer} --- Context: [1] «Muscle LIM protein (MLP) has been proposed to be a central player in the pathogenesis of heart muscle disease.» [2] «The...,1.0
1,Has the protein SETMAR (Metnase) a transposase domain?,"Yes, the protein SETMAR (Metnase) has a transposase domain.","['The enzyme was capable of transferring methyl groups to selected lysine residues in a substrate protein using S-adenosyl-l-methionine (SAM) as the methyl donor.', 'FBW7 is...","Answer based on the provided context: The provided context does not directly mention SETMAR (Metnase) or its domains. However, SETMAR, also known as Metnase, is...",1.0
2,What is the number of long non coding RNAs in the human genome,"Different estimates put currently the number of human long non coding RNAs between 10,000 and 20,000",['In this paper we have explored the benefits of combining predictions from already existing gene prediction programs. We have introduced three novel methods for combining...,Answer based on the provided context: The provided context does not contain specific information about the number of long non-coding RNAs (lncRNAs) in the human...,1.1
3,How does miR-1 overexpression worsen arrhythmias in coronary artery disease patients and what are the implications for antiarrhythmic treatments?,miR-1 overexpression worsens arrhythmias in coronary artery disease patients by regulating pacemaker channel genes and contributing to arrhythmogenesis. This up-regulation of miR-1 in patients with...,"['Moreover, miR-27a was demonstrated to modulate β-MHC gene regulation via thyroid hormone signaling and to be upregulated during the differentiation of mouse embryonic stem (ES)...","**Proposed Instruction:** Your task is to generate thorough, accurate, and evidence-based answers to medical and pharmaceutical questions using the provided context. Each answer should be...",0.6
4,Which post-translational histone modifications are characteristic of constitutive heterochromatin?,"H3K9me3 is the major marker of constitutive heterochromatin. Other histone methylation marks usually found in constitutive heterochromatin, are H4K20me3 and H3K79me3. Classical histone modifications associated...","['Covalent histone modifications (e.g. ubiquitination, phosphorylation, methylation, acetylation) and H2A variants (H2A.Z, H2A.X and H2A.W) are also discussed in view of their crucial importance in...","Answer based on the provided context: Constitutive heterochromatin is typically characterized by specific post-translational histone modifications, including histone H3 lysine 9 trimethylation (H3K9me3) and histone...",0.9
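
Some predicted answers in the table above begin with prompt fragments such as `${answer}` or `**Proposed Instruction:**` rather than an actual answer, yet they still receive scores. A rough way to surface such rows before trusting the mean score is sketched below. The marker list and the helper name `flag_prompt_leaks` are assumptions made for illustration, and the sketch expects a DataFrame with a `pred_answer` column like the one shown.

```python
import pandas as pd

# Hypothetical markers, chosen from the leaked outputs visible in the table.
LEAK_MARKERS = ("${answer}", "**Proposed Instruction:**")


def flag_prompt_leaks(df, column="pred_answer"):
    """Return a boolean Series marking rows whose answer starts with a prompt fragment."""
    answers = df[column].fillna("")
    return answers.map(lambda s: str(s).lstrip().startswith(LEAK_MARKERS))
```

Rows flagged this way can be inspected by hand or excluded before averaging, for example with `df[~flag_prompt_leaks(df)]`.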


In [58]:
pd.set_option("display.max_colwidth", 500)
pd.set_option("display.max_rows", 500)

mipro_compiled_rag_eval_evals_df = create_score_dataframe(
    mipro_compiled_rag_eval[1]
)
print(mipro_compiled_rag_eval_evals_df["score"].mean())
mipro_compiled_rag_eval_evals_df


0.74


Unnamed: 0,question,predicted_answer,score
0,Are thyroid hormone receptor alpha1 mutations implicated in thyroid hormone resistance syndrome?,"**Proposed Instruction:** Your task is to generate thorough, accurate, and evidence-based answers to medical and pharmaceutical questions using the provided context. Each answer should be clear and concise, prioritizing direct responses first. Ensure to include supporting information from the context, referencing clinical studies, guidelines, or related evidence, particularly when the direct response is not available or applicable. Your objective is to create a comprehensive medical knowledg...",0.5
1,Has the protein SETMAR (Metnase) a transposase domain?,"Answer based on the provided context: The provided context does not directly mention SETMAR (Metnase) or its domains. However, SETMAR, also known as Metnase, is known in the scientific literature to possess a transposase domain. This domain is derived from a mariner-family transposase, which is fused to a SET domain that has histone methyltransferase activity. This fusion allows SETMAR to participate in DNA repair and other genomic functions. For a more detailed and specific confirmation, fu...",1.0
2,Is Rheumatoid Arthritis related to myopathy?,"${answer}\n\n---\n\nContext:\n[1] «Muscle LIM protein (MLP) has been proposed to be a central player in the pathogenesis of heart muscle disease.»\n[2] «The Ehlers-Danlos syndromes (EDS) comprise a heterogenous group of heritable disorders of connective tissue, characterized by joint hypermobility, skin hyperextensibility and tissue fragility.»\n[3] «Serum CRP levels were related to severity of aSAH. Patients with lower GCS scores and higher Hunt and Hess and Fisher grades presented statisti...",1.0
3,What is the number of long non coding RNAs in the human genome,"Answer based on the provided context: The provided context does not contain specific information about the number of long non-coding RNAs (lncRNAs) in the human genome. To provide a comprehensive answer based on general knowledge in genomics: The human genome contains thousands of long non-coding RNAs (lncRNAs). According to recent estimates, there are over 16,000 lncRNAs identified in the human genome, although the exact number may vary as new lncRNAs are discovered and annotated. These lnc...",1.1
4,How does miR-1 overexpression worsen arrhythmias in coronary artery disease patients and what are the implications for antiarrhythmic treatments?,"**Proposed Instruction:** Your task is to generate thorough, accurate, and evidence-based answers to medical and pharmaceutical questions using the provided context. Each answer should be clear and concise, prioritizing direct responses first. Ensure to include supporting information from the context, referencing clinical studies, guidelines, or related evidence, particularly when the direct response is not available or applicable. Your objective is to create a comprehensive medical knowledg...",0.6
5,Is cystatin C or cystatin 3 used as a biomarker of kidney function?,"Answer based on the provided context: Cystatin C is used as a biomarker of kidney function. It is a protein that is produced by all nucleated cells and is found in body fluids. Its levels in the blood are used to assess kidney function because it is filtered out of the blood by the kidneys. Elevated levels of cystatin C in the blood can indicate impaired kidney function. There is no mention of cystatin 3 in the provided context, and it is not commonly recognized as a biomarker for kidney fun...",1.1
6,Which post-translational histone modifications are characteristic of constitutive heterochromatin?,"Answer based on the provided context: Constitutive heterochromatin is typically characterized by specific post-translational histone modifications, including histone H3 lysine 9 trimethylation (H3K9me3) and histone H4 lysine 20 trimethylation (H4K20me3). These modifications are crucial for the formation and maintenance of heterochromatin, contributing to its transcriptionally repressive state. While the provided context does not explicitly mention these specific modifications, it does highli...",0.9
7,What is the role of RhoA in bladder cancer?,"Answer based on the provided context: The provided context does not directly address the role of RhoA in bladder cancer. However, RhoA is a small GTPase known to regulate the actin cytoskeleton, cell motility, and proliferation. In the context of cancer, including bladder cancer, RhoA is often implicated in processes such as tumor cell invasion, metastasis, and angiogenesis. Dysregulation of RhoA signaling pathways can contribute to the aggressive behavior of cancer cells. For specific detai...",0.6
8,Which are the common symptoms of Cushing's syndrome?,"Answer based on the provided context: The common symptoms of Cushing's syndrome include weight gain, particularly around the abdomen and face (leading to a ""moon face"" appearance), thinning of the skin that bruises easily, purple or pink stretch marks (striae) on the skin, muscle weakness, fatigue, high blood pressure, and high blood sugar levels. Other symptoms may include osteoporosis, mood changes, and irregular menstrual periods in women. These symptoms are due to prolonged exposure to h...",1.0
9,Describe Hot water reflex epilepsy.,"**Proposed Instruction:** Your task is to generate thorough, accurate, and evidence-based answers to medical and pharmaceutical questions using the provided context. Each answer should be clear and concise, prioritizing direct responses first. Ensure to include supporting information from the context, referencing clinical studies, guidelines, or related evidence, particularly when the direct response is not available or applicable. Your objective is to create a comprehensive medical knowledg...",0.9
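
Several of the rows above (0, 4 and 9) hold text beginning with `**Proposed Instruction:**` where an answer is expected, and row 2 begins with the unfilled placeholder `${answer}`. These rows still receive non-zero scores, so a quick check for this kind of leaked prompt scaffolding may be worth running before trusting the averages. The following is a minimal sketch using only the standard library. The column name `answer`, the helper `flag_leaked_answers`, and the marker list are assumptions made for illustration and are not part of DSPy or of this notebook.

```python
# Hypothetical markers of leaked prompt scaffolding, taken from the rows shown above.
LEAK_MARKERS = ("**Proposed Instruction:**", "${answer}")


def flag_leaked_answers(rows):
    """Return the indices of rows whose answer starts with a leak marker.

    `rows` is assumed to be a list of dicts with an "answer" key
    (for a DataFrame, something like df.to_dict("records")).
    """
    flagged = []
    for index, row in enumerate(rows):
        answer = row["answer"].lstrip()
        # str.startswith accepts a tuple and matches any of its entries.
        if answer.startswith(LEAK_MARKERS):
            flagged.append(index)
    return flagged


# Shortened stand-ins for three of the rows above (not the full outputs).
sample_rows = [
    {"answer": "**Proposed Instruction:** Your task is to generate ..."},
    {"answer": "Answer based on the provided context: The provided context ..."},
    {"answer": "${answer}\n\n---\n\nContext:\n[1] ..."},
]

print(flag_leaked_answers(sample_rows))
```

Rows flagged this way can then be inspected by hand or excluded before the scores are aggregated. Which markers to look for depends on the prompt template in use.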
