In [1]:
from pathlib import Path
import sys
hotpot_root = Path.cwd().parents[1]
sys.path.insert(0, str(hotpot_root))
from application import generate_output, answer_dataset



In [None]:
meta_prompt = """
You are an expert in multi-hop question-answering and prompt optimization. 
Your job is to improve a question-answering system that operates on the HotpotQA benchmark, a dataset that requires multi-hop reasoning across multiple Wikipedia articles.

====================
🏗️ SYSTEM DESCRIPTION
====================
The system answers complex factual questions by chaining together multiple reasoning and retrieval steps. 
It operates as follows:

1. **Input Question:** The system begins with a natural language question from the HotpotQA dataset.
2. **Query Generation:** The system generates a first query ({query_1}) to retrieve relevant documents from a fixed corpus.
3. **Retrieval (BM25-based Search):**
   The search tool is a static BM25 retriever implemented as:

stemmer = Stemmer.Stemmer("english")
retriever = bm25s.BM25.load("/Users/priyanjindal/prompt-learning/benchmarks/hotpotQA/wiki17_abstracts", corpus_name="wiki17_abstracts_corpus.jsonl", load_corpus=True)
corpus = retriever.corpus

def search(query: str, k: int) -> list[dict]:
    tokens = bm25s.tokenize(query, stopwords="en", stemmer=stemmer, show_progress=False)
    results, scores = retriever.retrieve(tokens, k=k, n_threads=1, show_progress=False)
    formatted_results = []
    for doc in results[0]:
        text = doc['text']
        if " | " not in text:
            return []
        title, content = text.split(" | ", 1)
        formatted_results.append({"title": title, "content": content})
    return formatted_results

- This search function retrieves top-{k} Wikipedia abstracts.
- Each document has a `title` (the Wikipedia page title) and `content` (the text of the abstract).
- The retriever is *static*, meaning only the **queries** can be improved, not the retrieval algorithm itself.

4. **Summarization:** The retrieved passages are summarized to highlight key facts ({summary_1}, {summary_2}).
5. **Second-Hop Query Generation:** The system generates a follow-up query ({query_2}) based on the question and first summary to gather additional evidence.
6. **Final Answer Generation:** The model combines all retrieved evidence to produce a final answer ({final_answer}).

====================
🎯 OPTIMIZATION GOAL
====================
Your task is to optimize the *prompts* used in each reasoning component (query generation, summarization, and answer synthesis) so that the system retrieves the correct evidence and produces accurate final answers. 

In particular:
- The search component cannot change; only the *language and structure of the queries* affect retrieval quality.
- Good prompts guide the model to generate **precise, entity-rich, multi-hop-aware queries** that retrieve all *supporting facts*.
- Summaries should **preserve factual links** across hops, not just paraphrase content.
- The final answer prompt should encourage **faithful synthesis** of retrieved information.

====================
📄 YOUR INPUTS
====================
Below are the baseline prompts currently being used, along with example runs and feedback. 
Use these to identify weaknesses and propose improvements.

************* start prompts *************
{baseline_prompt}
************* end prompts *************

************* start example data *************
{examples}
************* end example data *************

HERE ARE SOME ANNOTATIONS THAT MAY BE HELPFUL:
{annotations}

====================
🔧 FINAL INSTRUCTIONS
====================
Iterate on the baseline prompts to produce **new, improved prompts** that:
- Retain all variable placeholders (e.g., {question}, {query_1}, {summary_1}, etc.).
- Produce clearer, more factually grounded reasoning and retrieval.
- Encourage entity completeness (e.g., names, dates, titles, relations) and multi-hop connections.
- Remain faithful to the output schema and return format of the original prompts.
- Include short, high-quality few-shot examples or guidelines if relevant.

Return the prompts in the same formatting in which they were given.
Note: Make sure to include the variables from the original prompts, which are wrapped in either single or double curly braces (e.g., {var}). If you fail to include these variables, the LLM will not be able to access the required data.

NEW PROMPTS:
"""
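
The meta-prompt asks the optimizer to keep every variable placeholder, but nothing in the notebook verifies that it did. A minimal sketch of such a check follows; the helper names `placeholders` and `missing_placeholders` and the candidate prompt are hypothetical and not part of the original code.

```python
import re

# Matches {name} as well as {{name}} (single or double curly braces).
PLACEHOLDER = re.compile(r"\{\{?\s*([A-Za-z_]\w*)\s*\}?\}")

def placeholders(prompt: str) -> set[str]:
    """Return the set of placeholder names that appear in a prompt template."""
    return set(PLACEHOLDER.findall(prompt))

def missing_placeholders(original: dict[str, str], rewritten: dict[str, str]) -> dict[str, set[str]]:
    """Map each prompt name to the placeholders the rewrite dropped."""
    report = {}
    for name, template in original.items():
        lost = placeholders(template) - placeholders(rewritten.get(name, ""))
        if lost:
            report[name] = lost
    return report

# Baseline prompt copied from this notebook; the candidate is a made-up rewrite.
baseline = {"create_query_2_prompt": "Given the fields {question}, {summary_1}, produce a query."}
candidate = {"create_query_2_prompt": "Write a follow-up BM25 query for {question}."}

report = missing_placeholders(baseline, candidate)
print(report)
```

A non-empty report would be a reason to reject the rewrite, or to re-run the optimizer, before spending a dev-set evaluation on it.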

In [3]:
eval_prompt = """
You are evaluating a multi-hop question-answering system composed of several reasoning and retrieval components.

Below is the full execution trace of the system for one example.

=====================
🟩 INPUT QUESTION
{question}

=====================
🟦 STEP 1: FIRST QUERY & PASSAGES
Query #1:
{query_1}

Retrieved Passages (Hop 1):
{passages_1}

Summary #1 (based on the above passages and question):
{summary_1}

=====================
🟪 STEP 2: SECOND QUERY & PASSAGES
Query #2:
{query_2}

Retrieved Passages (Hop 2):
{passages_2}

Summary #2 (based on the above passages, question, and previous summary):
{summary_2}

=====================
🟨 FINAL ANSWER
{final_answer}

=====================
🟥 GROUND TRUTH
Supporting Facts (Wikipedia titles): {supporting_facts}
Gold Answer: {gold_answer}

=====================
🧠 EVALUATION TASK

Your task is to carefully analyze the full reasoning chain and provide *diagnostic feedback* about this system's performance.

Please:
1. **Assess correctness:** Is the final answer correct relative to the gold answer and evidence?
2. **Explain reasoning quality:** Did the system logically connect relevant facts across both hops?
3. **Identify failure points:** If wrong, where did the error arise (query formulation, retrieval precision/recall, summarization accuracy, or synthesis quality)?
4. **Propose actionable improvements:** Offer *specific, constructive* feedback for each editable module:
   - **Query Generation:** Clarity, completeness, relevance, or missing entities/relations.
   - **Passage Retrieval:** Recall or precision gaps (even if retrieval itself is static, point out how better queries could fix it).
   - **Summary Generation:** Information loss, factual errors, or misinterpretations.
   - **Final Answer Generation:** Completeness, factual grounding, consistency with summaries.

Output your response strictly in this JSON-like format (no markdown, no extra text):
"correctness": "<whether the final answer and passage retrieval are correct or incorrect>",
"explanation": "<A detailed, structured analysis of why the system was correct or incorrect, highlighting causal chains and reasoning quality.>",
"suggestions": "<Actionable improvement ideas for each component, concise but specific enough for an optimizer to learn from.>"
"""
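
The evaluator is asked for a "JSON-like" reply made of three bare key/value pairs, so its output is not guaranteed to parse as JSON. The sketch below shows one way the reply could be turned into a dict. `parse_eval_output` and the sample reply are hypothetical, and the fallback behaviour is an assumption rather than anything the notebook does (it stores the raw string in the `evals` column).

```python
import json

def parse_eval_output(raw: str) -> dict:
    """Parse the evaluator's reply into a dict; fall back to the raw text if it is not valid JSON."""
    text = raw.strip()
    # The template shows bare key/value pairs, so add the enclosing braces when they are missing.
    if not text.startswith("{"):
        text = "{" + text.rstrip(",") + "}"
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Keep the unparsed reply so that no feedback is silently lost.
        return {"correctness": None, "explanation": raw, "suggestions": None}
    return {key: parsed.get(key) for key in ("correctness", "explanation", "suggestions")}

# Made-up reply in the format requested by eval_prompt.
sample = '''"correctness": "incorrect",
"explanation": "Query #1 omitted the director name, so hop 1 missed a supporting article.",
"suggestions": "Query Generation: include every named entity from the question."'''

parsed_eval = parse_eval_output(sample)
print(parsed_eval["correctness"])
```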


In [4]:
import os
import sys
from pathlib import Path

import pandas as pd

# Add project root to path
project_root = Path.cwd().parents[3]
sys.path.insert(0, str(project_root))


# Paths
hotpot_root = Path.cwd().parents[1]  # .../hotpotqa
sys.path.insert(0, str(hotpot_root))
questions_path = Path.cwd() / "questions_train_150.json"

In [5]:
import importlib
import hotpot_evaluate_v1

# Reload so that edits to hotpot_evaluate_v1.py are picked up without restarting the kernel.
importlib.reload(hotpot_evaluate_v1)


In [None]:
import json

from openai import OpenAI
from hotpot_evaluate_v1 import eval as hotpot_eval  # aliased to avoid shadowing the built-in eval()

def run_train(prompts) -> tuple[pd.DataFrame, dict]:
    """
    Runs the prompts on a 150-question sample of the training set.
    Returns the per-question results and the evaluation metrics.
    """
    hotpot_root = Path.cwd().parents[1]
    train_data = pd.read_json(hotpot_root / "data/hotpot_train_v1.json")
    # Fixed seed, so every optimization loop sees the same 150 questions.
    train_dataset = train_data.sample(150, random_state=42)
    train_sample_path = hotpot_root / "data/hotpot_train_sample_150.json"
    train_dataset.to_json(train_sample_path, orient="records")

    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

    results, predictions = answer_dataset(
        train_dataset,
        client,
        prompts["create_query_1_prompt"],
        prompts["summarize_1_prompt"],
        prompts["create_query_2_prompt"],
        prompts["summarize_2_prompt"],
        prompts["final_answer_prompt"],
    )

    predictions_path = hotpot_root / "predictions/predictions_train_150.json"
    with open(predictions_path, "w") as f:
        json.dump(predictions, f, indent=2)

    return results, hotpot_eval(predictions_path, train_sample_path)

def run_dev(prompts, dev_dataset, dev_path) -> dict:
    """
    Runs the prompts on the dev set and returns the evaluation metrics.
    """
    dev_dataset.to_json(dev_path, orient="records")
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

    # Only the predictions are needed here; the per-question results are discarded.
    _, predictions = answer_dataset(
        dev_dataset,
        client,
        prompts["create_query_1_prompt"],
        prompts["summarize_1_prompt"],
        prompts["create_query_2_prompt"],
        prompts["summarize_2_prompt"],
        prompts["final_answer_prompt"],
    )

    # hotpot_root is the module-level path defined in an earlier cell.
    predictions_path = hotpot_root / "predictions/predictions_dev_300.json"
    with open(predictions_path, "w") as f:
        json.dump(predictions, f, indent=2)

    return hotpot_eval(predictions_path, dev_path)
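
The metrics returned by the evaluation script are reported as `em`, `f1`, `prec` and `recall`. The sketch below shows how answer-level exact match and token F1 are conventionally computed for this kind of benchmark (SQuAD-style normalization followed by bag-of-tokens overlap). It is an illustration under that assumption, not the code of `hotpot_evaluate_v1`, which may differ in details such as the handling of yes/no answers.

```python
import re
import string
from collections import Counter

def normalize_answer(s: str) -> str:
    """Lowercase, drop punctuation and English articles, and collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))

def f1_score(prediction: str, gold: str) -> tuple[float, float, float]:
    """Return (f1, precision, recall) over the multiset of normalized tokens."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0, 0.0, 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall), precision, recall

em = exact_match("The Eiffel Tower.", "Eiffel Tower")
f1, prec, rec = f1_score("Paris, France", "Paris")
print(em, round(f1, 4), prec, rec)
```

Under this scoring an over-long answer keeps full recall but loses precision, which is consistent with the baseline `final_answer_prompt` pushing so hard for minimal answers.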
    

In [7]:
from phoenix.evals import OpenAIModel, llm_generate
from optimizer_sdk.prompt_learning_optimizer import PromptLearningOptimizer


def evaluate_optimize(questions_df, prompts):
    model = OpenAIModel(
        model="gpt-4.1",
    )

    eval_results = llm_generate(
        dataframe=questions_df,
        template=eval_prompt,
        model=model,
        concurrency=40,
        verbose=True
    )

    questions_df["evals"] = eval_results["output"]

    # Serialize the prompt dict as "name = text" blocks, the same layout that parse_prompts() reads back.
    prompts_text = f"""
        create_query_1_prompt = {prompts["create_query_1_prompt"]}
        summarize_1_prompt = {prompts["summarize_1_prompt"]}
        create_query_2_prompt = {prompts["create_query_2_prompt"]}
        summarize_2_prompt = {prompts["summarize_2_prompt"]}
        final_answer_prompt = {prompts["final_answer_prompt"]}
    """
    questions_df["queries"] = questions_df["query_1"] + questions_df["query_2"]

    optimizer = PromptLearningOptimizer(
        prompt=prompts_text,
        model_choice="gpt-5",
        openai_api_key=os.getenv("OPENAI_API_KEY"),
        meta_prompt=meta_prompt
    )

    optimized_prompt = optimizer.optimize(
        dataset=questions_df,
        output_column="final_answer",
        feedback_columns=["query_1", "query_2", "passages_1", "passages_2", "evals"],
        context_size_k=100000
    )

    print(optimized_prompt)
    return optimized_prompt
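
The `context_size_k=100000` argument corresponds to the "Creating batches with 100,000 token limit" message in the output, where the 150 examples are split into 3 or 4 batches. How the SDK forms those batches is not shown in this notebook. The sketch below is only a guess at the general idea (greedy packing under a token budget), with a crude four-characters-per-token estimate standing in for a real tokenizer; `batch_by_token_budget` is a hypothetical helper.

```python
def batch_by_token_budget(texts, budget, count_tokens=lambda t: max(1, len(t) // 4)):
    """Greedily pack texts into batches whose estimated token total stays within budget."""
    batches, current, used = [], [], 0
    for text in texts:
        cost = count_tokens(text)
        # Start a new batch when adding this text would exceed the budget.
        if current and used + cost > budget:
            batches.append(current)
            current, used = [], 0
        current.append(text)
        used += cost
    if current:
        batches.append(current)
    return batches

# Five rows of about 100 estimated tokens each, packed under a 250-token budget.
rows = ["x" * 400] * 5
batch_sizes = [len(batch) for batch in batch_by_token_budget(rows, budget=250)]
print(batch_sizes)
```

Longer traces (for example, more verbose summaries after an optimization step) would raise the per-example cost and therefore the batch count, which may be why later iterations report 4 batches instead of 3.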

In [8]:
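import re
import textwrap

def parse_prompts(s: str, keys=None) -> dict[str, str]:
    """Split a string of 'name = prompt' blocks (as returned by the optimizer) into a dict keyed by prompt name."""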
    if keys is None:
        keys = [
            "create_query_1_prompt",
            "summarize_1_prompt",
            "create_query_2_prompt",
            "summarize_2_prompt",
            "final_answer_prompt",
        ]
    key_group = "|".join(map(re.escape, keys))
    pattern = rf'(?sm)^\s*({key_group})\s*=\s*(.*?)(?=^\s*(?:{key_group})\s*=|\Z)'
    out = {}
    for key, block in re.findall(pattern, s):
        b = block.strip()
        if (b.startswith(('"""',"'''")) and b.endswith(('"""',"'''"))):
            b = b[3:-3]
        elif (b.startswith('"') and b.endswith('"')) or (b.startswith("'") and b.endswith("'")):
            b = b[1:-1]
        out[key] = textwrap.dedent(b).strip()
    return out
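
To see what the regular expression in `parse_prompts` captures, the same pattern can be run on a small hand-written string. The two-prompt sample below is made up for illustration; real optimizer output is longer and may wrap each prompt in quotes, which `parse_prompts` strips afterwards.

```python
import re

keys = ["create_query_1_prompt", "summarize_1_prompt"]
key_group = "|".join(map(re.escape, keys))
# Same pattern as in parse_prompts: a key at the start of a line, "=", then everything
# up to the next key line or the end of the string.
pattern = rf'(?sm)^\s*({key_group})\s*=\s*(.*?)(?=^\s*(?:{key_group})\s*=|\Z)'

sample = """
create_query_1_prompt = Given {question}, write a search query.
Prefer named entities.
summarize_1_prompt = Given {question} and {passages_1}, summarize.
"""

parsed = {key: block.strip() for key, block in re.findall(pattern, sample)}
for key, block in parsed.items():
    print(key, "->", repr(block))
```

Because a block only ends where another known key starts a line, a multi-line prompt stays intact. The flip side is that a prompt whose own text contains a line such as `summarize_1_prompt = ...` would be split at that point.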


In [9]:
import nest_asyncio
nest_asyncio.apply()

In [10]:
prompts = {
    "create_query_1_prompt": "Given the fields {question}, produce a query.",
    "summarize_1_prompt": "Given the fields {question}, {passages_1}, produce a summary.",
    "create_query_2_prompt": "Given the fields {question}, {summary_1}, produce a query.",
    "summarize_2_prompt": "Given the fields {question}, {summary_1}, {passages_2}, produce a summary.",
    "final_answer_prompt": """Given the fields {question}, {summary_1}, {summary_2}, produce a concise and precise answer that directly and minimally responds to the question.  
    - Provide the shortest possible answer that fully addresses the question.  
    - If the question expects a specific name, date, number, phrase, or list, output only that without any additional explanation, context, or full sentences.  
    - For yes/no questions, answer with \"yes\" or \"no\" only, without elaboration.  
    - When a comparative judgment is required but explicit counts are missing, prefer the best-supported conclusion from the summaries (e.g., one entity has 'numerous named subsidiaries' vs a single example), and output only the winning entity.  
    - Avoid adding extra context, background information, or verbose sentences.  
    ...
    - Keep answers minimal (e.g., for 'Where is X located about 80 km southwest of ?', answer 'Paris' only).  

    Answer:"""
}
dev_path = hotpot_root / "data/hotpot_dev_sample_300.json"
dev_dataset = pd.read_json(dev_path).sample(300, random_state=42)
# loops[0] holds the baseline prompts; loops[i + 1] holds the prompts produced by iteration i.
loops = {0: {"prompts": prompts, "train_results": None, "train_metrics": None, "dev_metrics": None}}
for i in range(5):
    # Train metrics are measured on the prompts going into this iteration (before optimization).
    train_results, train_metrics = run_train(prompts)
    optimized_prompt_string = evaluate_optimize(train_results, prompts)
    optimized_prompts = parse_prompts(optimized_prompt_string)
    # Dev metrics are measured on the newly optimized prompts.
    dev_metrics = run_dev(optimized_prompts, dev_dataset, dev_path)
    loops[i+1] = {"prompts": optimized_prompts, "train_results": train_results, "train_metrics": train_metrics, "dev_metrics": dev_metrics}

    prompts = optimized_prompts
    

llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:39<00:00 |  3.77it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:17<00:00 |  8.79it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:15<00:00 |  9.60it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:25<00:00 |  5.97it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:32<00:00 |  4.57it/s


f1: 0.658888222888223, em: 0.5866666666666667, prec: 0.6722433862433863, recall: 0.6628232323232323


llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:34<00:00 |  4.32it/s


['question', 'question_id', 'gold_answer', 'supporting_facts', 'query_1', 'passages_1', 'summary_1', 'query_2', 'passages_2', 'summary_2', 'final_answer', 'evals', 'queries']

🔧 Creating batches with 100,000 token limit
📊 Processing 150 examples in 3 batches

You are an expert in multi-hop question-answering and prompt optimization. 
Your job is to improve a question-answering system that operates on the HotpotQA benchmark, a dataset that requires multi-hop reasoning across multiple Wikipedia articles.

🏗️ SYSTEM DESCRIPTION
The system answers complex factual questions by chaining together multiple reasoning and retrieval steps. 
It operates as follows:

1. **Input Question:** The system begins with a natural language question from the HotpotQA dataset.
2. **Query Generation:** The system generates a first query ({query_1}) to retrieve relevant documents from a fixed corpus.
3. **Retrieval (BM25-based Search):**
   The search tool is a static BM25 retriever implemented as

llm_generate |█████████▉| 299/300 (99.7%) | ⏳ 00:42<00:00 | 10.44it/s

Exception in worker on attempt 1: raised APITimeoutError('Request timed out.')
Requeuing...


llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:38<00:00 |  7.82it/s
llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:27<00:00 | 10.74it/s
llm_generate |██████████| 300/300 (100.0%) | ⏳ 02:33<00:00 | 10.44it/s

Exception in worker on attempt 1: raised APITimeoutError('Request timed out.')
Requeuing...



llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:28<00:00 | 10.52it/s


f1: 0.5788519791755087, em: 0.47333333333333333, prec: 0.5767152657284238, recall: 0.6198631553631554


llm_generate |██████████| 300/300 (100.0%) | ⏳ 03:10<00:00 |  1.58it/s
llm_generate |██████████| 300/300 (100.0%) | ⏳ 01:16<00:00 |  3.94it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:15<00:00 |  9.67it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:18<00:00 |  8.24it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:19<00:00 |  7.75it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:21<00:00 |  6.99it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:15<00:00 |  9.61it/s


f1: 0.6935193325193325, em: 0.58, prec: 0.6891111111111111, recall: 0.7247222222222222


llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:36<00:00 |  4.10it/s


['question', 'question_id', 'gold_answer', 'supporting_facts', 'query_1', 'passages_1', 'summary_1', 'query_2', 'passages_2', 'summary_2', 'final_answer', 'evals', 'queries']

🔧 Creating batches with 100,000 token limit
📊 Processing 150 examples in 3 batches

llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:28<00:00 | 10.34it/s
llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:32<00:00 |  9.33it/s
llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:27<00:00 | 10.72it/s
llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:32<00:00 |  9.34it/s
llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:27<00:00 | 10.82it/s


f1: 0.5163326036392959, em: 0.42, prec: 0.5122939421689422, recall: 0.5511450216450218


llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:16<00:00 |  9.25it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:25<00:00 |  5.82it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:16<00:00 |  9.36it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:22<00:00 |  6.64it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:15<00:00 |  9.49it/s


f1: 0.6146190476190476, em: 0.54, prec: 0.6160331890331892, recall: 0.6360555555555556


llm_generate |█████████▉| 149/150 (99.3%) | ⏳ 00:54<00:02 |  2.30s/it

Exception in worker on attempt 1: raised APITimeoutError('Request timed out.')
Requeuing...


llm_generate |██████████| 150/150 (100.0%) | ⏳ 01:00<00:00 |  5.90s/it

['question', 'question_id', 'gold_answer', 'supporting_facts', 'query_1', 'passages_1', 'summary_1', 'query_2', 'passages_2', 'summary_2', 'final_answer', 'evals', 'queries']

🔧 Creating batches with 100,000 token limit
📊 Processing 150 examples in 4 batches

llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:28<00:00 | 10.62it/s
llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:33<00:00 |  9.02it/s
llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:30<00:00 |  9.79it/s
llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:31<00:00 |  9.40it/s

llm_generate |██████████| 150/150 (100.0%) | ⏳ 12:26<00:00 |  5.90s/it

Exception in worker on attempt 1: raised APITimeoutError('Request timed out.')
Requeuing...




f1: 0.5420433112884726, em: 0.4533333333333333, prec: 0.5433691678691679, recall: 0.5714175084175086


llm_generate |██████████| 150/150 (100.0%) | ⏳ 12:29<00:00 |  5.00s/it
llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:53<00:00 |  5.62it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:16<00:00 |  9.08it/s
llm_generate |█████████▉| 149/150 (99.3%) | ⏳ 00:39<00:00 |  1.51it/s

Exception in worker on attempt 1: raised APITimeoutError('Request timed out.')
Requeuing...


llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:15<00:00 |  9.46it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:21<00:00 |  6.83it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:15<00:00 |  9.71it/s


f1: 0.642608309990663, em: 0.5666666666666667, prec: 0.6423174603174603, recall: 0.6632777777777777



llm_generate |██████████| 150/150 (100.0%) | ⏳ 02:30<00:00 |  4.98s/it

Exception in worker on attempt 1: raised APITimeoutError('Request timed out.')
Requeuing...




['question', 'question_id', 'gold_answer', 'supporting_facts', 'query_1', 'passages_1', 'summary_1', 'query_2', 'passages_2', 'summary_2', 'final_answer', 'evals', 'queries']

🔧 Creating batches with 100,000 token limit
📊 Processing 150 examples in 4 batches


llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:28<00:00 | 10.61it/s



f1: 0.48325742446144165, em: 0.37666666666666665, prec: 0.480700176865618, recall: 0.5227323232323234


llm_generate |██████████| 150/150 (100.0%) | ⏳ 12:45<00:00 |  5.10s/it
llm_generate |██████████| 150/150 (100.0%) | ⏳ 11:06<00:00 |  4.45s/it
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:16<00:00 |  9.29it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:23<00:00 |  6.49it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:16<00:00 |  9.19it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:25<00:00 |  5.84it/s
llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:15<00:00 |  9.64it/s


f1: 0.5967907647907649, em: 0.5, prec: 0.5957293806411453, recall: 0.6321666666666667


llm_generate |█████████▉| 149/150 (99.3%) | ⏳ 00:50<00:00 |  1.31it/s

Exception in worker on attempt 1: raised APITimeoutError('Request timed out.')
Requeuing...


llm_generate |██████████| 150/150 (100.0%) | ⏳ 00:58<00:00 |  7.55s/it

['question', 'question_id', 'gold_answer', 'supporting_facts', 'query_1', 'passages_1', 'summary_1', 'query_2', 'passages_2', 'summary_2', 'final_answer', 'evals', 'queries']

🔧 Creating batches with 100,000 token limit
📊 Processing 150 examples in 4 batches

llm_generate |██████████| 150/150 (100.0%) | ⏳ 08:24<00:00 |  3.37s/it
llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:27<00:00 | 10.96it/s
llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:36<00:00 |  8.17it/s
llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:28<00:00 | 10.56it/s
llm_generate |██████████| 300/300 (100.0%) | ⏳ 00:31<00:00 |  9.44it/s
llm_generate |███████   | 213/300 (71.0%) | ⏳ 00:19<00:07 | 11.33it/s

Exception in worker on attempt 1: raised APIConnectionError('Connection error.')
Requeuing...


llm_generate |█████████▉| 299/300 (99.7%) | ⏳ 00:26<00:00 | 13.03it/s

f1: 0.5242068887586792, em: 0.41, prec: 0.5213418354521295, recall: 0.5724848484848486
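
The metric lines printed during the run can be collected into one structure to compare iterations. The sketch below hard-codes the F1 values printed above, rounded to four decimals. It assumes, based on the 150- and 300-item progress bars, that the lines alternate between a training run on the prompts entering an iteration and a dev run on the prompts that iteration produced. The dictionary layout is hypothetical; in the notebook the same information lives in `loops`.

```python
# F1 per optimization iteration, rounded from the values printed in the output above.
# train_f1: prompts entering the iteration, scored on the 150 training questions.
# dev_f1: prompts produced by the iteration, scored on the 300 dev questions.
f1_by_iteration = {
    1: {"train_f1": 0.6589, "dev_f1": 0.5789},
    2: {"train_f1": 0.6935, "dev_f1": 0.5163},
    3: {"train_f1": 0.6146, "dev_f1": 0.5420},
    4: {"train_f1": 0.6426, "dev_f1": 0.4833},
    5: {"train_f1": 0.5968, "dev_f1": 0.5242},
}

# Pick the iteration whose optimized prompts scored highest on the dev set.
best_iteration = max(f1_by_iteration, key=lambda i: f1_by_iteration[i]["dev_f1"])
print(best_iteration, f1_by_iteration[best_iteration]["dev_f1"])
```

Read this way, dev F1 does not improve after the first iteration. The baseline prompts are never scored on the dev set, though, so the run gives no dev reference point for the unoptimized system.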
