In [1]:
meta_prompt = """
You are an expert in prompt optimization for fact verification and multi-hop reasoning tasks. 
Your goal is to improve the end-to-end performance of a claim verification system that operates on the **HoVer benchmark**,
a dataset designed to test multi-hop evidence retrieval and reasoning over Wikipedia.

====================
🏗️ SYSTEM DESCRIPTION
====================
The HoVer system verifies factual claims by chaining together several reasoning and retrieval components.

1. **Input Claim:** 
   A natural-language factual claim that must be verified as either SUPPORTED or NOT_SUPPORTED.

2. **Multi-Hop Query Generation and Retrieval:**
   - The system performs three iterative hops of reasoning and retrieval:
     - Hop 1 → query_1 → passages_1 → summary_1
     - Hop 2 → query_2 → passages_2 → summary_2
     - Hop 3 → query_3 → passages_3 → summary_3
   - Each query is executed by a **static BM25 retriever** that searches the Wikipedia abstracts corpus:

stemmer = Stemmer.Stemmer("english")
retriever = bm25s.BM25.load("/Users/priyanjindal/prompt-learning/benchmarks/hover/wiki17_abstracts",
corpus_name="wiki17_abstracts_corpus.jsonl",
load_corpus=True)
corpus = retriever.corpus

def search(query: str, k: int) -> list[dict]:
    tokens = bm25s.tokenize(query, stopwords="en", stemmer=stemmer, show_progress=False)
    results, scores = retriever.retrieve(tokens, k=k, n_threads=1, show_progress=False)
    formatted_results = []
    for doc in results[0]:
        text = doc['text']
        if " | " not in text:
            return []
        title, content = text.split(" | ", 1)
        formatted_results.append({"title": title, "content": content})
    return formatted_results

   - The retriever is **static**; it cannot be trained or modified.
   - Improvements must come from **better query, summary, and reasoning prompts** that guide the model to retrieve the right evidence.

3. **Summarization (Evidence Aggregation):**
   - Each hop produces summaries ({summary_1}, {summary_2}, {summary_3}) that consolidate evidence from retrieved passages.
   - These summaries feed into the next query, so factual precision and context retention are critical.

4. **Final Verdict Generation:**
   - After three hops, the model produces {final_answer}, classifying the claim as:
     - **SUPPORTED**
     - **NOT_SUPPORTED**

====================
🎯 OPTIMIZATION OBJECTIVE
====================
You are optimizing **all prompts together**, i.e. the entire reasoning chain, including:
- Query generation prompts
- Summarization prompts
- Final verdict prompt

Your task is to propose improved versions that work *coherently* across hops.  
The goal is to maximize factual accuracy, evidence recall, and logical consistency across the whole pipeline.

Specifically, your improved prompt set should:
- Encourage **entity-complete and relation-aware queries** that retrieve all supporting evidence.
- Ensure **summaries** preserve factual links between entities and accurately reflect retrieved passages.
- Guide the **final verdict** toward the correct SUPPORTED/NOT_SUPPORTED classification based on evidence strength.
- Maintain tight alignment between hops: information extracted or clarified in earlier hops should directly inform later hops.

The retriever is fixed; optimization must come from *how the model expresses reasoning and structures retrieval requests*.

====================
📄 YOUR INPUTS
====================
Below are the **current baseline prompts** for each module in the system, followed by example runs and evaluation feedback.

************* start prompts *************
{baseline_prompt}
************* end prompts *************

************* start example data *************
{examples}
************* end example data *************

HERE ARE SOME ANNOTATIONS THAT MAY BE HELPFUL:
{annotations}

====================
🔧 FINAL INSTRUCTIONS
====================
Iterate on the entire prompt set and produce **new improved versions** that:
- Preserve all variable placeholders (e.g., {claim}, {passages_1}, {summary_2}, etc.).
- Strengthen inter-module coordination: ensure each hop builds meaningfully on prior ones.
- Promote explicit reasoning about factual relationships, entities, and temporal or causal links.
- Encourage concise but evidence-rich summaries that enhance retrieval quality in later hops.
- Maintain consistent formatting and output schema across all modules.
- Optionally include a few short, high-quality examples to illustrate improved behavior.
- Return the new prompt set in the **same structure and formatting** as the baseline prompt block.

Do **not** wrap anything other than variable names in curly braces.  
Do **not** modify or remove any output-format sections from the original prompts.

NEW PROMPTS:
"""


In [2]:
from application import run_pipeline
import pandas as pd
from hover_evaluate import compute_attach_correctness, attach_evals


def run_train(prompts, df):
    df = run_pipeline(prompts, df)
    df, accuracy = compute_attach_correctness(df)
    df = attach_evals(df)
    return df, accuracy

def run_dev(prompts, df):
    df = run_pipeline(prompts, df)
    df, accuracy = compute_attach_correctness(df)
    return df, accuracy
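
`run_pipeline` is imported from `application`, and its body is not shown in this notebook. As a rough sketch of the three-hop chain that the meta prompt describes, the control flow might look like the following. The names `run_three_hops`, `fake_llm`, and `fake_search` are hypothetical stand-ins for the real model call and the BM25 `search` function, and `k=3` passages per hop is an assumption.

```python
from typing import Callable

# Baseline templates copied from the notebook's `prompts` dict.
PROMPTS = {
    "create_query_1_prompt": "Given the fields {claim}, produce the field 'query_1'.",
    "summarize_1_prompt": "Given the fields {claim}, {passages_1}, produce the field 'summary_1'.",
    "create_query_2_prompt": "Given the fields {claim}, {summary_1}, produce the field 'query_2'.",
    "summarize_2_prompt": "Given the fields {claim}, {summary_1}, {passages_2}, produce the field 'summary_2'.",
    "create_query_3_prompt": "Given the fields {claim}, {summary_1}, {summary_2}, produce the field 'query_3'.",
    "summarize_3_prompt": "Given the fields {claim}, {summary_1}, {summary_2}, {passages_3}, produce the field 'summary_3'.",
    "final_answer_prompt": "Given the fields {claim}, {summary_1}, {summary_2}, {summary_3}, return either 'SUPPORTED' or 'NOT_SUPPORTED'.",
}


def run_three_hops(
    claim: str,
    prompts: dict[str, str],
    llm: Callable[[str], str],
    search: Callable[[str, int], list[dict]],
    k: int = 3,  # assumed number of passages per hop
) -> dict[str, str]:
    """Hypothetical sketch of the query -> retrieve -> summarize chain."""
    state = {"claim": claim}
    for hop in (1, 2, 3):
        # Build this hop's query from the claim and any earlier summaries.
        query = llm(prompts[f"create_query_{hop}_prompt"].format(**state))
        state[f"query_{hop}"] = query
        # Retrieve passages and flatten them into "title | content" lines.
        passages = search(query, k)
        state[f"passages_{hop}"] = "\n".join(
            f"{p['title']} | {p['content']}" for p in passages
        )
        # Summarize the new passages together with what is already known.
        state[f"summary_{hop}"] = llm(prompts[f"summarize_{hop}_prompt"].format(**state))
    state["final_answer"] = llm(prompts["final_answer_prompt"].format(**state))
    return state


# Stand-ins so the sketch runs without a model or a BM25 index.
def fake_llm(prompt: str) -> str:
    return "SUPPORTED" if "'SUPPORTED'" in prompt else "stub text"


def fake_search(query: str, k: int) -> list[dict]:
    return [{"title": f"Doc {i}", "content": query} for i in range(k)]


state = run_three_hops("Example claim.", PROMPTS, fake_llm, fake_search)
print(state["final_answer"])  # SUPPORTED (from the stub, not a real verdict)
```

Each summary is written back into `state`, so later query and summary templates can reference it. This is why the meta prompt stresses that summaries must keep entity links intact.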





In [5]:
from pathlib import Path
import sys
project_root = Path.cwd().parents[1]
sys.path.insert(0, str(project_root))
from optimizer_sdk.prompt_learning_optimizer import PromptLearningOptimizer
import ast

def parse_prompts(s: str) -> dict:
    # First parse: removes the outer quotes if present
    obj = ast.literal_eval(s)
    # If the first parse yielded a string (i.e., still quoted), parse again
    if isinstance(obj, str):
        obj = ast.literal_eval(obj)
    if not isinstance(obj, dict):
        raise ValueError("Parsed object is not a dict")
    return obj

def optimize_prompts(prompts, train_df, meta_prompt):
    prompts_concatenated = str(prompts)
    optimizer = PromptLearningOptimizer(
        prompt=prompts_concatenated,
        model_choice="gpt-5",
        meta_prompt=meta_prompt,
    )
    optimized_prompts = optimizer.optimize(
        dataset=train_df,
        output_column="final_answer",
        feedback_columns=["ground_truth_wikipedia_titles", "ground_truth_label", "evaluation"],
    )
    return parse_prompts(optimized_prompts)
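
`parse_prompts` calls `ast.literal_eval` up to twice, so it accepts either the repr of a dict or that repr wrapped in one more layer of quotes. The exact shape of what `optimizer.optimize` returns is not shown here, so the self-contained check below only demonstrates those two cases on an illustrative dict.

```python
import ast


def parse_prompts(s: str) -> dict:
    # First parse: removes the outer quotes if present.
    obj = ast.literal_eval(s)
    # If the first parse yielded a string, it was quoted twice; parse again.
    if isinstance(obj, str):
        obj = ast.literal_eval(obj)
    if not isinstance(obj, dict):
        raise ValueError("Parsed object is not a dict")
    return obj


# Illustrative data, not an actual optimizer response.
sample = {"final_answer_prompt": "Given {claim}, return 'SUPPORTED' or 'NOT_SUPPORTED'."}

once = str(sample)   # same form as `prompts_concatenated` in optimize_prompts
twice = repr(once)   # the same text wrapped in one more layer of quotes

assert parse_prompts(once) == sample
assert parse_prompts(twice) == sample
```

Text that is not a Python literal, such as a reply wrapped in Markdown fences, makes `ast.literal_eval` raise `SyntaxError` or `ValueError`. The caller may therefore want to catch both.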


In [4]:
import nest_asyncio
nest_asyncio.apply()

In [None]:
prompts = {
    "create_query_1_prompt": "Given the fields {claim}, produce the field 'query_1'.",
    "summarize_1_prompt": "Given the fields {claim}, {passages_1}, produce the field 'summary_1'.",
    "create_query_2_prompt": "Given the fields {claim}, {summary_1}, produce the field 'query_2'.",
    "summarize_2_prompt": "Given the fields {claim}, {summary_1}, {passages_2}, produce the field 'summary_2'.",
    "create_query_3_prompt": "Given the fields {claim}, {summary_1}, {summary_2}, produce the field 'query_3'.",
    "summarize_3_prompt": "Given the fields {claim}, {summary_1}, {summary_2}, {passages_3}, produce the field 'summary_3'.",
    "final_answer_prompt": "Given the fields {claim}, {summary_1}, {summary_2}, {summary_3}, return either 'SUPPORTED' or 'NOT_SUPPORTED'."
}

train_df = pd.read_json("hover_train_release_v1.1.json")
train_df = train_df.sample(150, random_state=42)

dev_df = pd.read_json("hover_dev_release_v1.1.json")
dev_df = dev_df.sample(300, random_state=42)
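
The meta prompt asks the optimizer to preserve every variable placeholder. One way to check this before running an optimized prompt set is to compare the field names in each template. The sketch below uses `string.Formatter` from the standard library. `placeholders` and `missing_placeholders` are hypothetical helpers, not part of the notebook's code.

```python
from string import Formatter


def placeholders(template: str) -> set[str]:
    """Return the names used as {name} fields in a format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def missing_placeholders(
    baseline: dict[str, str], candidate: dict[str, str]
) -> dict[str, set[str]]:
    """Map each prompt key to the baseline placeholders the candidate dropped."""
    report = {}
    for key, template in baseline.items():
        lost = placeholders(template) - placeholders(candidate.get(key, ""))
        if lost:
            report[key] = lost
    return report


baseline = {
    "create_query_2_prompt": "Given the fields {claim}, {summary_1}, produce the field 'query_2'."
}
good = {"create_query_2_prompt": "Use {claim} and {summary_1} to write query_2."}
bad = {"create_query_2_prompt": "Use {claim} to write query_2."}

print(missing_placeholders(baseline, good))  # {}
print(missing_placeholders(baseline, bad))   # {'create_query_2_prompt': {'summary_1'}}
```

`Formatter().parse` raises `ValueError` on unbalanced braces, which also surfaces optimized prompts that wrap non-variable text in curly braces.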





In [7]:
results = []

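for i in range(5):
    # Score the current prompts on the training sample and attach evaluations.
    train_df_run, train_accuracy = run_train(prompts, train_df)
    print(train_accuracy)

    # Ask the optimizer for a revised prompt set based on the training run.
    optimized_prompts = optimize_prompts(prompts, train_df_run, meta_prompt)

    # Score the revised prompts on the dev sample.
    dev_df_run, dev_accuracy = run_dev(optimized_prompts, dev_df)
    print(dev_accuracy)

    results.append({
        "iteration": i,
        "train_accuracy": train_accuracy,
        "dev_accuracy": dev_accuracy,
        "prompts": prompts,                # before optimization
        "optimized_prompts": optimized_prompts,  # after optimization
    })

    # The optimized prompts become the starting point for the next iteration.
    prompts = optimized_prompts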



0.6666666666666666
['claim', 'uid', 'ground_truth_label', 'ground_truth_wikipedia_titles', 'query_1', 'passages_1', 'summary_1', 'query_2', 'passages_2', 'summary_2', 'query_3', 'passages_3', 'summary_3', 'final_answer', 'correctness', 'evaluation']

🔧 Creating batches with 128,000 token limit
📊 Processing 150 examples in 4 batches
   ✅ Batch 1/4: Optimized
   ✅ Batch 2/4: Optimized
   ✅ Batch 3/4: Optimized
   ✅ Batch 4/4: Optimized

0.47

0.5133333333333333
['claim', 'uid', 'ground_truth_label', 'ground_truth_wikipedia_titles', 'query_1', 'passages_1', 'summary_1', 'query_2', 'passages_2', 'summary_2', 'query_3', 'passages_3', 'summary_3', 'final_answer', 'correctness', 'evaluation']

🔧 Creating batches with 128,000 token limit
📊 Processing 150 examples in 5 batches
   ✅ Batch 1/5: Optimized
   ✅ Batch 2/5: Optimized
   ✅ Batch 3/5: Optimized
   ✅ Batch 4/5: Optimized
   ✅ Batch 5/5: Optimized

0.5933333333333334

0.58
['claim', 'uid', 'ground_truth_label', 'ground_truth_wikipedia_titles', 'query_1', 'passages_1', 'summary_1', 'query_2', 'passages_2', 'summary_2', 'query_3', 'passages_3', 'summary_3', 'final_answer', 'correctness', 'evaluation']

🔧 Creating batches with 128,000 token limit
📊 Processing 150 examples in 5 batches
   ✅ Batch 1/5: Optimized
   ✅ Batch 2/5: Optimized
   ✅ Batch 3/5: Optimized
   ✅ Batch 4/5: Optimized
   ✅ Batch 5/5: Optimized

0.54

0.5466666666666666
['claim', 'uid', 'ground_truth_label', 'ground_truth_wikipedia_titles', 'query_1', 'passages_1', 'summary_1', 'query_2', 'passages_2', 'summary_2', 'query_3', 'passages_3', 'summary_3', 'final_answer', 'correctness', 'evaluation']

🔧 Creating batches with 128,000 token limit
📊 Processing 150 examples in 5 batches
   ✅ Batch 1/5: Optimized
   ✅ Batch 2/5: Optimized
   ✅ Batch 3/5: Optimized
   ✅ Batch 4/5: Optimized
   ✅ Batch 5/5: Optimized

0.55

0.6133333333333333
['claim', 'uid', 'ground_truth_label', 'ground_truth_wikipedia_titles', 'query_1', 'passages_1', 'summary_1', 'query_2', 'passages_2', 'summary_2', 'query_3', 'passages_3', 'summary_3', 'final_answer', 'correctness', 'evaluation']

🔧 Creating batches with 128,000 token limit
📊 Processing 150 examples in 5 batches
   ✅ Batch 1/5: Optimized
   ✅ Batch 2/5: Optimized
   ✅ Batch 3/5: Optimized
   ✅ Batch 4/5: Optimized
   ✅ Batch 5/5: Optimized

0.5633333333333334
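
Dev accuracy does not rise monotonically across the five iterations, so the prompt set left in `prompts` after the loop is not necessarily the best one. The sketch below picks the best entry from a list shaped like `results`. It uses the train and dev accuracies printed by the loop, rounded to four places, and omits the `prompts` and `optimized_prompts` fields for brevity.

```python
# Accuracies as printed by the loop (train first, then dev, per iteration).
results = [
    {"iteration": 0, "train_accuracy": 0.6667, "dev_accuracy": 0.4700},
    {"iteration": 1, "train_accuracy": 0.5133, "dev_accuracy": 0.5933},
    {"iteration": 2, "train_accuracy": 0.5800, "dev_accuracy": 0.5400},
    {"iteration": 3, "train_accuracy": 0.5467, "dev_accuracy": 0.5500},
    {"iteration": 4, "train_accuracy": 0.6133, "dev_accuracy": 0.5633},
]

# Dev accuracy was measured on each iteration's optimized prompts,
# so the best entry identifies which optimized prompt set to keep.
best = max(results, key=lambda r: r["dev_accuracy"])
print(best["iteration"], best["dev_accuracy"])  # 1 0.5933
```

With the full `results` list from the notebook, `best["optimized_prompts"]` would be the prompt set to carry forward.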





In [8]:
results

[{'iteration': 0,
  'train_accuracy': np.float64(0.6666666666666666),
  'dev_accuracy': np.float64(0.47),
  'prompts': {'create_query_1_prompt': "Given the fields {claim}, produce the field 'query_1'.",
   'summarize_1_prompt': "Given the fields {claim}, {passages_1}, produce the field 'summary_1'.",
   'create_query_2_prompt': "Given the fields {claim}, {summary_1}, produce the field 'query_2'.",
   'summarize_2_prompt': "Given the fields {claim}, {summary_1}, {passages_2}, produce the field 'summary_2'.",
   'create_query_3_prompt': "Given the fields {claim}, {summary_1}, {summary_2}, produce the field 'query_3'.",
   'summarize_3_prompt': "Given the fields {claim}, {summary_1}, {summary_2}, {passages_3}, produce the field 'summary_3'.",
   'final_answer_prompt': "Given the fields {claim}, {summary_1}, {summary_2}, {summary_3}, return either 'SUPPORTED' or 'NOT_SUPPORTED'."},
  'optimized_prompts': {'create_query_1_prompt': 'Task: Using {claim}, produce the field \'query_1\' as a sin