## Evaluate Reports

An LLM judge compares the reports head-to-head.

In [1]:
import json
import asyncio
from kruppe.llm import OpenAILLM
from kruppe.prompts.experiments import EVALUATE_REPORT_USER, EVALUATE_REPORT_SYSTEM

llm = OpenAILLM(model="gpt-4.1")


## Define Evaluate Report Method

In [7]:
async def evaluate_reports(query: str, report1_loc: str, report2_loc: str):
    # report 1 is assumed to be Kruppe's JSON file, which may hold several reports
    with open(report1_loc, "r") as f:
        data = json.load(f)
        reports_1 = data['research_reports']

    # report 2 is a vanilla LLM report (JSON) or a human report (.txt); for JSON we use only the first report
    if report2_loc.endswith(".txt"):
        with open(report2_loc, "r") as f:
            report_2_txt = f.read()
    else:
        with open(report2_loc, "r") as f:
            data = json.load(f)
            reports_2 = data['research_reports']
            report_2 = reports_2[0]
            report_2_txt = report_2['text']
        

    async with asyncio.TaskGroup() as tg:
        tasks = []
        for report_1 in reports_1:

            user_message=EVALUATE_REPORT_USER.format(
                query=query,
                answer1=report_1['text'],
                answer2=report_2_txt
            )

            messages = [
                {"role": "system", "content": EVALUATE_REPORT_SYSTEM},
                {"role": "user", "content": user_message},
            ]

            task = tg.create_task(llm.async_generate(messages))
            tasks.append(task)

    results = [task.result() for task in tasks]

    
    return {
        "query": query,
        "evals": [result.text for result in results],
    }

In [3]:
await evaluate_reports(
    "How is ConocoPhillips positioned to deliver sustained free cash flow growth and shareholder returns in the coming years amid its current investment cycle and market conditions?",
    "./kruppe_report/report_0.json",
    "./vanilla_report_4.1/report_0.json",
)

{'query': 'How is ConocoPhillips positioned to deliver sustained free cash flow growth and shareholder returns in the coming years amid its current investment cycle and market conditions?',
 'evals': ['{\n    "Empowerment": { \n        "Winner": "Answer 2", \n        "Explanation": "Answer 2 most effectively empowers the reader by clearly breaking down ConocoPhillips’ positioning into key themes and sub-components (asset portfolio, capital allocation, shareholder returns, risk management) with concise supporting evidence and industry context. The direct use of metrics, references to management actions and external validations, as well as links to primary sources, make it easier for the reader to internalize the information and form an informed judgment about the company’s outlook. Answer 1, while thorough and insightful, is written in a more abstract, narrative style and is slightly less actionable for non-expert readers seeking a quick but robust understanding." \n    },\n    "Cohesiv

## Pair Up Comparison

In [4]:
import pandas as pd

df = pd.read_csv("./reports.csv", index_col=False)
df

| Unnamed: 0 | category | human_report_loc | question |
|---|---|---|---|
| 0 | Energy (Oil) | /Users/danielliu/Workspace/fin-rag/experiments... | How is ConocoPhillips positioned to deliver su... |
| 1 | Energy (Oil) | /Users/danielliu/Workspace/fin-rag/experiments... | What is the outlook for Chevron Corporation’s ... |
| 2 | Energy (Oil) | /Users/danielliu/Workspace/fin-rag/experiments... | What are the updated financial prospects and i... |
| 3 | Energy (Oil) | /Users/danielliu/Workspace/fin-rag/experiments... | What is the investment outlook for Exxon Mobil... |
| 4 | Energy (Oil) | /Users/danielliu/Workspace/fin-rag/experiments... | What is the investment outlook for ConocoPhill... |
| ... | ... | ... | ... |
| 68 | NVDA | /Users/danielliu/Workspace/fin-rag/experiments... | What are the key expectations, product announc... |
| 69 | NVDA | /Users/danielliu/Workspace/fin-rag/experiments... | What are the current trends, risks, and invest... |
| 70 | NVDA | /Users/danielliu/Workspace/fin-rag/experiments... | What is the current and projected financial an... |
| 71 | NVDA | /Users/danielliu/Workspace/fin-rag/experiments... | What are the current trends, challenges, and o... |


In [8]:
async with asyncio.TaskGroup() as tg:
    kruppe_4o_tasks = []
    kruppe_41_tasks = []
    kruppe_o3_tasks = []
    kruppe_human_tasks = []

    for i, row in df.iterrows():
        kruppe_loc = f"./kruppe_report/report_{i}.json"
        # vanilla_4o_loc = f"./vanilla_report_4o_search/report_{i}.json"
        # vanilla_41_loc = f"./vanilla_report_4.1/report_{i}.json"
        vanilla_o3_loc = f"./vanilla_report_o3/report_{i}.json"
        human_loc = row["human_report_loc"]

        # kruppe_4o_task = tg.create_task(evaluate_reports(row["question"], kruppe_loc, vanilla_4o_loc))
        # kruppe_4o_tasks.append(kruppe_4o_task)
        
        # kruppe_41_task = tg.create_task(evaluate_reports(row["question"], kruppe_loc, vanilla_41_loc))
        # kruppe_41_tasks.append(kruppe_41_task)
        
        kruppe_o3_task = tg.create_task(evaluate_reports(row["question"], kruppe_loc, vanilla_o3_loc))
        kruppe_o3_tasks.append(kruppe_o3_task)
        
        kruppe_human_task = tg.create_task(evaluate_reports(row["question"], kruppe_loc, human_loc))
        kruppe_human_tasks.append(kruppe_human_task)
        

# kruppe_4o_results = [task.result() for task in kruppe_4o_tasks]
# kruppe_41_results = [task.result() for task in kruppe_41_tasks]
kruppe_o3_results = [task.result() for task in kruppe_o3_tasks]
kruppe_human_results = [task.result() for task in kruppe_human_tasks]

# Save the results to a JSON file
# with open("kruppe_4o_results.json", "w") as f:
#     json.dump(kruppe_4o_results, f, indent=4)
# with open("kruppe_41_results.json", "w") as f:
#     json.dump(kruppe_41_results, f, indent=4)
with open("kruppe_o3_results.json", "w") as f:
    json.dump(kruppe_o3_results, f, indent=4)
with open("kruppe_human_results.json", "w") as f:
    json.dump(kruppe_human_results, f, indent=4)
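
The cell above schedules every comparison at once, so the number of LLM requests in flight grows with the number of rows and reports. If rate limits become a problem, one option is to cap concurrency with a semaphore. The following is only a sketch under that assumption: `bounded_gather` and its `limit` parameter are hypothetical names, not part of `kruppe`, and it uses `asyncio.gather` rather than the `TaskGroup` used in this notebook.

```python
import asyncio


async def bounded_gather(coro_factories, limit=8):
    """Run zero-argument coroutine factories with at most `limit` in flight.

    Hypothetical helper: results come back in the same order as the input.
    """
    semaphore = asyncio.Semaphore(limit)

    async def run_one(factory):
        # Wait for a free slot before creating and awaiting the coroutine
        async with semaphore:
            return await factory()

    return await asyncio.gather(*(run_one(f) for f in coro_factories))
```

In a notebook this could be called as `await bounded_gather([lambda q=q: evaluate_reports(q, a, b) for q, a, b in jobs], limit=8)`, where `jobs` is an assumed list of `(question, kruppe_loc, other_loc)` tuples.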


### Turn into JSON

In [24]:
# with open("kruppe_4o_results.json", "r") as f:
#     kruppe_4o_results = json.load(f)
#     for item in kruppe_4o_results:
#         new_evals = []
#         for eval in item["evals"]:
#             new_eval = json.loads(eval)
#             new_evals.append(new_eval)
#         item["evals"] = new_evals
# with open("kruppe_4o_results.json", "w") as f:
#     json.dump(kruppe_4o_results, f, indent=4)

# with open("kruppe_41_results.json", "r") as f:
#     kruppe_41_results = json.load(f)
#     for item in kruppe_41_results:
#         new_evals = []
#         for eval in item["evals"]:
#             new_eval = json.loads(eval)
#             new_evals.append(new_eval)
#         item["evals"] = new_evals
# with open("kruppe_41_results.json", "w") as f:
#     json.dump(kruppe_41_results, f, indent=4)

# with open("kruppe_o3_results.json", "r") as f:
#     kruppe_o3_results = json.load(f)
#     for item in kruppe_o3_results:
#         new_evals = []
#         for eval in item["evals"]:
#             new_eval = json.loads(eval)
#             new_evals.append(new_eval)
#         item["evals"] = new_evals
# with open("kruppe_o3_results.json", "w") as f:
#     json.dump(kruppe_o3_results, f, indent=4)

with open("kruppe_human_results.json", "r") as f:
    kruppe_human_results = json.load(f)
    for item in kruppe_human_results:
        new_evals = []
        for eval in item["evals"]:
            print(eval)
            new_eval = json.loads(eval)
            new_evals.append(new_eval)
        item["evals"] = new_evals
with open("kruppe_human_results.json", "w") as f:
    json.dump(kruppe_human_results, f, indent=4)

{
    "Empowerment": { 
        "Winner": "Answer 1", 
        "Explanation": "Answer 1 provides a clear narrative intended to educate the reader, synthesizing not just factual data but also the underlying logic of ConocoPhillips' strategy. It explains how the company's capital discipline, risk mitigation, and portfolio diversification contribute to long-term value, while explicitly acknowledging areas needing continued attention. The tone is neutral and focused on equipping the reader to make informed judgments, rather than just presenting forecasts." 
    },
    "Cohesiveness": { 
        "Winner": "Answer 1", 
        "Explanation": "Answer 1 presents a highly integrated analysis, starting with a hypothesis and delivering sequential, logical discussion from portfolio strategy to financial resilience, risks, peer comparison, and concluding insights, all underpinned by a clear narrative arc. In contrast, Answer 2 is more fragmented, interspersing numerical tables and analytical sectio
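
The conversion cell calls `json.loads` directly on each judge reply, which raises as soon as one reply is wrapped in a code fence, preceded by a sentence, or cut off. A tolerant parser is one way to handle that. The sketch below is an assumption about how such replies might look, not something observed in this run; `parse_judge_json` is a hypothetical helper name.

```python
import json
from typing import Optional


def parse_judge_json(raw: str) -> Optional[dict]:
    """Return the JSON object embedded in a judge reply, or None if unrecoverable."""
    # Take everything from the first "{" to the last "}" so that code fences
    # or short preambles around the object are ignored.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    # Only a JSON object matches the per-metric verdict structure
    return parsed if isinstance(parsed, dict) else None
```

Replies that come back as `None` could be logged and re-requested instead of aborting the whole conversion.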

## Count Winners

In [27]:
def count_winners(results):
    """Attach per-query win counts for each metric to every result item."""
    metrics = ["Empowerment", "Cohesiveness", "Comprehensiveness", "Diversity", "Overall Winner"]

    for item in results:
        wincounts = {metric: {} for metric in metrics}

        for evaluation in item["evals"]:  # avoid shadowing the built-in eval()
            for key, value in evaluation.items():
                if key not in wincounts:
                    continue  # ignore any unexpected top-level key from the judge
                winner = value["Winner"]
                # A verdict that names both answers (or neither) counts as a tie
                if "1" in winner and "2" not in winner:
                    label = "Answer 1"
                elif "2" in winner and "1" not in winner:
                    label = "Answer 2"
                else:
                    label = "tie"
                wincounts[key][label] = wincounts[key].get(label, 0) + 1
        item["wincounts"] = wincounts
    return results
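
To make the shape of `wincounts` concrete, here is a small standalone sketch that tallies made-up verdicts for a single query. The data and the helper name `tally` are illustrative only, and unlike `count_winners` it matches the winner label exactly rather than by substring.

```python
from collections import Counter


def tally(evals):
    """Count per-metric winners across several judge verdicts for one query."""
    counts = {}
    for verdict in evals:
        for metric, value in verdict.items():
            winner = value["Winner"]
            # Anything other than an exact "Answer 1"/"Answer 2" is a tie
            label = winner if winner in ("Answer 1", "Answer 2") else "tie"
            counts.setdefault(metric, Counter())[label] += 1
    return {metric: dict(c) for metric, c in counts.items()}


# Two made-up verdicts for the same query
toy_evals = [
    {"Empowerment": {"Winner": "Answer 1"}, "Diversity": {"Winner": "Answer 2"}},
    {"Empowerment": {"Winner": "Answer 1"}, "Diversity": {"Winner": "Tie"}},
]
print(tally(toy_evals))
# {'Empowerment': {'Answer 1': 2}, 'Diversity': {'Answer 2': 1, 'tie': 1}}
```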

In [None]:
with open("kruppe_4o_results.json", "r") as f:
    kruppe_4o_results = json.load(f)
kruppe_4o_results = count_winners(kruppe_4o_results)

with open("kruppe_41_results.json", "r") as f:
    kruppe_41_results = json.load(f)
kruppe_41_results = count_winners(kruppe_41_results)

with open("kruppe_o3_results.json", "r") as f:
    kruppe_o3_results = json.load(f)
kruppe_o3_results = count_winners(kruppe_o3_results)
    
with open("kruppe_human_results.json", "r") as f:
    kruppe_human_results = json.load(f)
kruppe_human_results = count_winners(kruppe_human_results)

In [None]:
# count total winners
def count_total_winners(results):
    total_wincounts = {
        "Empowerment": {"Answer 1": 0, "Answer 2": 0, "tie": 0},
        "Cohesiveness": {"Answer 1": 0, "Answer 2": 0, "tie": 0},
        "Comprehensiveness": {"Answer 1": 0, "Answer 2": 0, "tie": 0},
        "Diversity": {"Answer 1": 0, "Answer 2": 0, "tie": 0},
        "Overall Winner": {"Answer 1": 0, "Answer 2": 0, "tie": 0},
    }

    for item in results:
        wincounts = item["wincounts"]
        for key in total_wincounts.keys():
            total_wincounts[key]["Answer 1"] += wincounts[key].get("Answer 1", 0)
            total_wincounts[key]["Answer 2"] += wincounts[key].get("Answer 2", 0)
            total_wincounts[key]["tie"] += wincounts[key].get("tie", 0)

    return total_wincounts
total_4o_wincounts = count_total_winners(kruppe_4o_results)
total_41_wincounts = count_total_winners(kruppe_41_results)
total_o3_wincounts = count_total_winners(kruppe_o3_results)
total_human_wincounts = count_total_winners(kruppe_human_results)

4o wincounts:  {'Empowerment': {'Answer 1': 407, 'Answer 2': 25, 'tie': 0}, 'Cohesiveness': {'Answer 1': 414, 'Answer 2': 18, 'tie': 0}, 'Comprehensiveness': {'Answer 1': 369, 'Answer 2': 63, 'tie': 0}, 'Diversity': {'Answer 1': 260, 'Answer 2': 172, 'tie': 0}, 'Overall Winner': {'Answer 1': 398, 'Answer 2': 34, 'tie': 0}}


In [32]:
# per query, the answer with more judge wins takes the query; equal counts are a tie
def count_total_winners_relative(results):
    total_wincounts = {
        "Empowerment": {"Answer 1": 0, "Answer 2": 0, "tie": 0},
        "Cohesiveness": {"Answer 1": 0, "Answer 2": 0, "tie": 0},
        "Comprehensiveness": {"Answer 1": 0, "Answer 2": 0, "tie": 0},
        "Diversity": {"Answer 1": 0, "Answer 2": 0, "tie": 0},
        "Overall Winner": {"Answer 1": 0, "Answer 2": 0, "tie": 0},
    }

    for item in results:
        wincounts = item["wincounts"]
        for key in total_wincounts.keys():
            if wincounts[key].get("Answer 1", 0) > wincounts[key].get("Answer 2", 0):
                total_wincounts[key]["Answer 1"] += 1
            elif wincounts[key].get("Answer 2", 0) > wincounts[key].get("Answer 1", 0):
                total_wincounts[key]["Answer 2"] += 1
            else:
                total_wincounts[key]["tie"] += 1

    return total_wincounts

total_4o_wincounts_relative = count_total_winners_relative(kruppe_4o_results)
total_41_wincounts_relative = count_total_winners_relative(kruppe_41_results)
total_o3_wincounts_relative = count_total_winners_relative(kruppe_o3_results)
total_human_wincounts_relative = count_total_winners_relative(kruppe_human_results)

In [33]:
# per query, Kruppe (Answer 1) takes the query if it wins at least ONE evaluation; otherwise Answer 2 does
def count_total_winners_any(results):
    total_wincounts = {
        "Empowerment": {"Answer 1": 0, "Answer 2": 0, "tie": 0},
        "Cohesiveness": {"Answer 1": 0, "Answer 2": 0, "tie": 0},
        "Comprehensiveness": {"Answer 1": 0, "Answer 2": 0, "tie": 0},
        "Diversity": {"Answer 1": 0, "Answer 2": 0, "tie": 0},
        "Overall Winner": {"Answer 1": 0, "Answer 2": 0, "tie": 0},
    }

    for item in results:
        wincounts = item["wincounts"]
        for key in total_wincounts.keys():
            if wincounts[key].get("Answer 1", 0) > 0:
                total_wincounts[key]["Answer 1"] += 1
            else:
                total_wincounts[key]["Answer 2"] += 1

    return total_wincounts

total_4o_wincounts_any = count_total_winners_any(kruppe_4o_results)
total_41_wincounts_any = count_total_winners_any(kruppe_41_results)
total_o3_wincounts_any = count_total_winners_any(kruppe_o3_results)
total_human_wincounts_any = count_total_winners_any(kruppe_human_results)
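
The three aggregation schemes can rank the same data quite differently. The sketch below applies all three to made-up per-query counts for a single metric; `aggregate` and the toy numbers are hypothetical and only mirror the logic of the three functions above.

```python
def aggregate(per_query, scheme):
    """Aggregate per-query win counts for one metric under a given scheme."""
    totals = {"Answer 1": 0, "Answer 2": 0, "tie": 0}
    for counts in per_query:
        a1, a2 = counts.get("Answer 1", 0), counts.get("Answer 2", 0)
        if scheme == "sum":  # every individual judge verdict counts
            totals["Answer 1"] += a1
            totals["Answer 2"] += a2
            totals["tie"] += counts.get("tie", 0)
        elif scheme == "relative":  # majority of verdicts takes the query
            key = "Answer 1" if a1 > a2 else "Answer 2" if a2 > a1 else "tie"
            totals[key] += 1
        elif scheme == "any":  # one Answer 1 win is enough to take the query
            totals["Answer 1" if a1 > 0 else "Answer 2"] += 1
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    return totals


# Made-up counts for three queries
toy = [
    {"Answer 1": 5, "Answer 2": 1},
    {"Answer 1": 1, "Answer 2": 5},
    {"Answer 2": 6},
]
print(aggregate(toy, "sum"))       # {'Answer 1': 6, 'Answer 2': 12, 'tie': 0}
print(aggregate(toy, "relative"))  # {'Answer 1': 1, 'Answer 2': 2, 'tie': 0}
print(aggregate(toy, "any"))       # {'Answer 1': 2, 'Answer 2': 1, 'tie': 0}
```

In this toy case Answer 1 loses under the first two schemes but leads under "any", which is worth keeping in mind when reading `total_wincounts_any`.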

In [34]:
with open("kruppe_4o_results.json", "w") as f:
    new_data = {
        "results": kruppe_4o_results,
        "total_wincounts": total_4o_wincounts,
        "total_wincounts_relative": total_4o_wincounts_relative,
        "total_wincounts_any": total_4o_wincounts_any
    }
    json.dump(new_data, f, indent=4)
with open("kruppe_41_results.json", "w") as f:
    new_data = {
        "results": kruppe_41_results,
        "total_wincounts": total_41_wincounts,
        "total_wincounts_relative": total_41_wincounts_relative,
        "total_wincounts_any": total_41_wincounts_any
    }
    json.dump(new_data, f, indent=4)
with open("kruppe_o3_results.json", "w") as f:
    new_data = {
        "results": kruppe_o3_results,
        "total_wincounts": total_o3_wincounts,
        "total_wincounts_relative": total_o3_wincounts_relative,
        "total_wincounts_any": total_o3_wincounts_any
    }
    json.dump(new_data, f, indent=4)
with open("kruppe_human_results.json", "w") as f:
    new_data = {
        "results": kruppe_human_results,
        "total_wincounts": total_human_wincounts,
        "total_wincounts_relative": total_human_wincounts_relative,
        "total_wincounts_any": total_human_wincounts_any
    }
    json.dump(new_data, f, indent=4)

## Generate Explanations

In [43]:
llm = OpenAILLM(model="gpt-4.1")

In [56]:
from textwrap import dedent

async def explain_metric(metric: str, results):
    system_message = "You explain what makes one model's output better than the other using the explanations"

    kruppe_wins = results["total_wincounts"][metric]["Answer 1"]
    benchmark_wins = results["total_wincounts"][metric]["Answer 2"]

    explanations = []
    for item in results["results"]:
        for eval in item["evals"]:
            explanation = f"Winner: {eval[metric]['Winner']}\nExplanation: {eval[metric]['Explanation']}"
            explanations.append(explanation)
        
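    # Dedent the template before filling it in: interpolating the unindented
    # explanations first would leave dedent() with no common indentation to strip,
    # and a backslash inside an f-string expression requires Python 3.12+.
    template = dedent(
        """\
        Out of {total} games, a judge decided that, using the metric {metric}, the first model's response was better on {kruppe_wins} counts, and the second model's response was better on {benchmark_wins} counts.
        Below are all the justifications that the judge made for deciding why either answer 1 or answer 2's response was better than the other. Summarize the justifications and explain what makes one answer better than the other on this metric.

        ALL EXPLANATIONS:
        {explanations}
        """)
    user_message = template.format(
        total=kruppe_wins + benchmark_wins,
        metric=metric,
        kruppe_wins=kruppe_wins,
        benchmark_wins=benchmark_wins,  # was kruppe_wins, which misreported the second model's count
        explanations="\n\n".join(explanations),
    )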

    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message},
    ]

    llm_response = await llm.async_generate(messages)
    llm_string = llm_response.text

    return llm_string

async def explain_all_metrics(results):
    metrics = [
        "Empowerment",
        "Cohesiveness",
        "Comprehensiveness",
        "Diversity",
        "Overall Winner"
    ]

    async with asyncio.TaskGroup() as tg:
        tasks = []
        for metric in metrics:
            task = tg.create_task(explain_metric(metric, results))
            tasks.append(task)
    explanations = [task.result() for task in tasks]

    return {metric: explanation for metric, explanation in zip(metrics, explanations)}

### Kruppe 4o w/ Internet Empowerment

In [49]:
with open("kruppe_4o_results.json", "r") as f:
    kruppe_4o_results = json.load(f)

In [57]:
kruppe_4o_results_explanation = await explain_all_metrics(kruppe_4o_results)
kruppe_4o_results_explanation

{'Empowerment': "Certainly! Here's a structured summary and analysis of the justifications given for empowerment—what makes one model's answer more empowering than the other, drawing from and synthesizing the explanations you've provided.\n\n---\n\n## Summary of Justifications: Empowerment Metric\n\n### 1. **Analytical Depth and Synthesis**\n- **Empowering answers consistently go beyond merely listing facts and figures.**\n- They **synthesize data points, strategic context, risks, and implications** into a coherent analytical narrative.\n- This approach helps readers not just know what is happening, but **understand why it matters** and how various factors interact.\n\n### 2. **Contextualization and Cause-Effect Reasoning**\n- Strong explanations **connect facts to broader industry trends, market dynamics, and comparative positioning** (e.g., peer comparisons, macroeconomic forces, regulatory risks).\n- **Explanation of cause-and-effect relationships** (how actions or trends lead to ce

In [58]:
for metric, explanation in kruppe_4o_results_explanation.items():
    kruppe_4o_results['total_wincounts'][metric]['Explanation'] = explanation
with open("kruppe_4o_results.json", "w") as f:
    json.dump(kruppe_4o_results, f, indent=4)

In [68]:
for metric, explanation in kruppe_4o_results_explanation.items():
    print(metric)
    print(explanation)
    print('--'*20)

Empowerment
Certainly! Here's a structured summary and analysis of the justifications given for empowerment—what makes one model's answer more empowering than the other, drawing from and synthesizing the explanations you've provided.

---

## Summary of Justifications: Empowerment Metric

### 1. **Analytical Depth and Synthesis**
- **Empowering answers consistently go beyond merely listing facts and figures.**
- They **synthesize data points, strategic context, risks, and implications** into a coherent analytical narrative.
- This approach helps readers not just know what is happening, but **understand why it matters** and how various factors interact.

### 2. **Contextualization and Cause-Effect Reasoning**
- Strong explanations **connect facts to broader industry trends, market dynamics, and comparative positioning** (e.g., peer comparisons, macroeconomic forces, regulatory risks).
- **Explanation of cause-and-effect relationships** (how actions or trends lead to certain outcomes) is

### Kruppe 4.1

In [59]:
with open("kruppe_41_results.json", "r") as f:
    kruppe_41_results = json.load(f)

In [60]:
kruppe_41_results_explanation = await explain_all_metrics(kruppe_41_results)
kruppe_41_results_explanation

{'Empowerment': 'Certainly! Based on all the explanations and outcomes provided, here\'s a summary of what makes one model’s output better than the other in terms of **Empowerment**—i.e., enabling the reader to make informed, independent judgments—and the key patterns in the justifications:\n\n---\n\n## **Patterns in the Judge\'s Justifications**\n\n### **When Answer 2 Won (Empowerment):**\n- **Clear Structure & Organization:** Answer 2 excels when it is clearly structured, often using explicit sections, bullet points, summaries, and headings. This organization makes the analysis more easily digestible, especially for non-expert readers.\n- **Actionable Insights & Summary Sections:** The presence of actionable advice (e.g., what to monitor, explicit investment theses, risks, upside/downside scenarios) is frequently cited as empowering. Answer 2 often ends with takeaways or practical next steps, which help the reader apply the information.\n- **Balanced View of Risks and Opportunities:*

In [61]:
for metric, explanation in kruppe_41_results_explanation.items():
    kruppe_41_results['total_wincounts'][metric]['Explanation'] = explanation
with open("kruppe_41_results.json", "w") as f:
    json.dump(kruppe_41_results, f, indent=4)

In [69]:
for metric, explanation in kruppe_41_results_explanation.items():
    print(metric)
    print(explanation)
    print('--'*20)

Empowerment
Certainly! Based on all the explanations and outcomes provided, here's a summary of what makes one model’s output better than the other in terms of **Empowerment**—i.e., enabling the reader to make informed, independent judgments—and the key patterns in the justifications:

---

## **Patterns in the Judge's Justifications**

### **When Answer 2 Won (Empowerment):**
- **Clear Structure & Organization:** Answer 2 excels when it is clearly structured, often using explicit sections, bullet points, summaries, and headings. This organization makes the analysis more easily digestible, especially for non-expert readers.
- **Actionable Insights & Summary Sections:** The presence of actionable advice (e.g., what to monitor, explicit investment theses, risks, upside/downside scenarios) is frequently cited as empowering. Answer 2 often ends with takeaways or practical next steps, which help the reader apply the information.
- **Balanced View of Risks and Opportunities:** The explanatio

### o3

In [62]:
with open("kruppe_o3_results.json", "r") as f:
    kruppe_o3_results = json.load(f)

In [63]:
kruppe_o3_results_explanation = await explain_all_metrics(kruppe_o3_results)
kruppe_o3_results_explanation

{'Empowerment': '**Summary of Judge Explanations (Empowerment Metric)**\n\nAcross the justifications, judges consistently prefer Answer 2, except in two cases where Answer 1 is chosen. The explanations for each decision illuminate what judges value for the Empowerment metric and why one response is considered more "empowering" for the reader than the other.\n\n**What Makes One Answer More Empowering (According to Judges):**\n\n1. **Actionable Insights and Decision Frameworks:**  \n   The preferred answers (almost always Answer 2) give the reader not only information, but tools and frameworks to apply that information, make decisions, or form independent judgments. This includes:\n   - Scenario analyses with explicit base/bull/bear cases.\n   - Forward-looking metrics and quantifiable projections.\n   - Sensitivity analysis (how different variables impact outcomes).\n   - Concrete and specific recommendations—what to watch, when to act, what milestones matter.\n   - Guidance for differe

In [64]:
for metric, explanation in kruppe_o3_results_explanation.items():
    kruppe_o3_results['total_wincounts'][metric]['Explanation'] = explanation
with open("kruppe_o3_results.json", "w") as f:
    json.dump(kruppe_o3_results, f, indent=4)

In [70]:
for metric, explanation in kruppe_o3_results_explanation.items():
    print(metric)
    print(explanation)
    print('--'*20)

Empowerment
**Summary of Judge Explanations (Empowerment Metric)**

Across the justifications, judges consistently prefer Answer 2, except in two cases where Answer 1 is chosen. The explanations for each decision illuminate what judges value for the Empowerment metric and why one response is considered more "empowering" for the reader than the other.

**What Makes One Answer More Empowering (According to Judges):**

1. **Actionable Insights and Decision Frameworks:**  
   The preferred answers (almost always Answer 2) give the reader not only information, but tools and frameworks to apply that information, make decisions, or form independent judgments. This includes:
   - Scenario analyses with explicit base/bull/bear cases.
   - Forward-looking metrics and quantifiable projections.
   - Sensitivity analysis (how different variables impact outcomes).
   - Concrete and specific recommendations—what to watch, when to act, what milestones matter.
   - Guidance for different types of reade

### ERP

In [65]:
with open("kruppe_human_results.json", "r") as f:
    kruppe_human_results = json.load(f)

In [66]:
kruppe_human_results_explanation = await explain_all_metrics(kruppe_human_results)
kruppe_human_results_explanation

{'Empowerment': 'Certainly! Here’s a synthesized summary of the justifications and an explanation of what makes one answer better than the other on the metric of **Empowerment**:\n\n---\n\n### Core Themes from the Justifications\n\nThe judge repeatedly prefers answers that:\n\n- **Synthesize and Interpret**: They transform complex or dense information (financials, strategy, risks) into clear explanations, integrating both data and context. They don’t just list facts or numbers, but interpret what they mean.\n- **Narrative Clarity**: They offer structured, logical narratives with hypotheses, evidence, and clear conclusions. This structure guides the reader through reasoning rather than dropping them into a sea of undigested data.\n- **Accessibility**: They use plain, accessible language suitable for non-expert or moderately sophisticated readers, avoiding excessive jargon or insider shorthand common in professional analyst reports.\n- **Analytical Context**: They explain not just **what

In [67]:
for metric, explanation in kruppe_human_results_explanation.items():
    kruppe_human_results['total_wincounts'][metric]['Explanation'] = explanation
with open("kruppe_human_results.json", "w") as f:
    json.dump(kruppe_human_results, f, indent=4)

In [71]:
for metric, explanation in kruppe_human_results_explanation.items():
    print(metric)
    print(explanation)
    print('--'*20)

Empowerment
Certainly! Here’s a synthesized summary of the justifications and an explanation of what makes one answer better than the other on the metric of **Empowerment**:

---

### Core Themes from the Justifications

The judge repeatedly prefers answers that:

- **Synthesize and Interpret**: They transform complex or dense information (financials, strategy, risks) into clear explanations, integrating both data and context. They don’t just list facts or numbers, but interpret what they mean.
- **Narrative Clarity**: They offer structured, logical narratives with hypotheses, evidence, and clear conclusions. This structure guides the reader through reasoning rather than dropping them into a sea of undigested data.
- **Accessibility**: They use plain, accessible language suitable for non-expert or moderately sophisticated readers, avoiding excessive jargon or insider shorthand common in professional analyst reports.
- **Analytical Context**: They explain not just **what** is happening,