## Our Framework
### API Setup

In [None]:
import openai
import time
import re
import os
import subprocess
import graphviz

# 🧠 GPT Client for all Judges
gpt_client = openai.OpenAI(api_key="your own api-key")

# 🧠 DeepSeek Client for Pro and Con agents
deepseek_client = openai.OpenAI(
    api_key="your own api key",
    base_url="https://api.deepseek.com/v1")
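
Hard-coded keys are easy to leak when a notebook is shared. Below is a minimal sketch of reading them from environment variables instead. The helper `load_api_key` and the variable names `OPENAI_API_KEY` and `DEEPSEEK_API_KEY` are assumptions made for illustration; they are not part of the original notebook.

```python
import os


def load_api_key(env_var: str) -> str:
    """Return the API key stored in `env_var`, or raise a clear error if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before creating the client.")
    return key


# Hypothetical usage with the two clients defined above:
# gpt_client = openai.OpenAI(api_key=load_api_key("OPENAI_API_KEY"))
# deepseek_client = openai.OpenAI(
#     api_key=load_api_key("DEEPSEEK_API_KEY"),
#     base_url="https://api.deepseek.com/v1")
```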

### Pro + Con Agents

In [None]:
#first-round pro agent
def generate_first_round_response_pro(case_text):
    user_prompt = f"""
Based on the following case, provide your most likely diagnosis and concise reasoning in 3-4 medically sound sentences.

{case_text}

Please state:
1. What diagnosis you suggest.
2. Your reasoning behind it (3-4 concise sentences).
"""
    messages = [
        {"role": "system", "content": "You are a supportive physician tasked with defending your proposed diagnosis."},
        {"role": "user", "content": user_prompt}
    ]

    response = deepseek_client.chat.completions.create(
        model="deepseek-chat",
        messages=messages,
        temperature=0.3,
        max_tokens=600
    )
    return response.choices[0].message.content.strip()


# First-round Con agent
def generate_first_round_response_con(case_text, pro_text):
    # Include the case itself so the Con physician does not argue from the Pro summary alone.
    user_prompt = f"""
Case:
{case_text}

Your colleague believes the diagnosis is as follows:

=== Pro Physician Diagnosis ===
{pro_text}

However, you propose a different likely diagnosis. Please state:
1. What diagnosis you suggest instead.
2. Your reasoning behind it (3–4 concise sentences).
"""
    messages = [
        {"role": "system", "content": "You are a challenging physician tasked with proposing a medically sound alternative diagnosis."},
        {"role": "user", "content": user_prompt}
    ]
    response = deepseek_client.chat.completions.create(
        model="deepseek-chat",
        messages=messages,
        temperature=0.5,
        max_tokens=600
    )
    return response.choices[0].message.content.strip()


# Pro/Con agent for the rounds after the first (`delta` is accepted but currently unused)
def generate_agent_response(role, case_text, delta=1.0, last_opponent_text=None, your_diagnosis=None):
    if last_opponent_text and your_diagnosis:
        user_prompt = f"""
You are a {role}, participating in a structured clinical reasoning debate.

Please reply using the following two-section format:

B.x.1 {role}’s Defense:
Restate your suggested diagnosis (**{your_diagnosis}**) and provide **new or deeper reasoning** compared to the previous round. Do **not repeat earlier arguments**. Instead, add more medically detailed rationale (e.g., symptom timing, pathophysiology, prevalence, risk factors, diagnostic accuracy, clinical guidelines).

B.x.2 Refutation of the Opponent’s Diagnosis:
Acknowledge that the opponent's diagnosis is plausible, but explain **why your own is more likely**. Provide at least **two specific comparative points** (e.g., diagnostic specificity, typical symptom course, epidemiology, testing sensitivity).

---

Case:
{case_text}

Opponent's Previous Argument:
{last_opponent_text}
"""
        messages = [
            {"role": "system", "content": f"You are a {role} engaged in structured medical debate."},
            {"role": "user", "content": user_prompt}
        ]
        response = deepseek_client.chat.completions.create(
            model="deepseek-chat",
            messages=messages,
            temperature=0.5,
            max_tokens=600
        )
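        return response.choices[0].message.content.strip()
    # Without an opponent argument and a diagnosis to defend there is nothing to debate;
    # fail loudly instead of silently returning None.
    raise ValueError("generate_agent_response requires last_opponent_text and your_diagnosis")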

### Agreement Checking

In [None]:
def check_agreement(role, your_diag, opponent_text):
    agree_prompt = f"""
You are a {role} reviewing your opponent's clinical argument below. Your original diagnosis was: "{your_diag}".

Opponent's Argument:
{opponent_text}

Do you now believe the opponent's diagnosis is more reasonable than yours?
Please answer with "Yes" or "No" and briefly justify your stance in 1–2 lines.
"""
    response = deepseek_client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": agree_prompt}],
        temperature=0.5,
        max_tokens=200
    )
    return response.choices[0].message.content.strip()
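
The reply from `check_agreement` is free text, so it has to be reduced to a yes/no decision before the debate loop can use it. Below is a minimal sketch of a conservative parser; the helper name `parse_agreement` is hypothetical. It looks only at the first word, because a plain substring test for "yes" also fires on words such as "eyes", which appear in the sample case.

```python
import re


def parse_agreement(reply: str) -> bool:
    """Return True only when the first word of the reply is 'yes' (case-insensitive)."""
    # \W* skips leading punctuation or Markdown markers such as "**".
    match = re.match(r"\W*(yes|no)\b", reply.strip(), flags=re.IGNORECASE)
    return bool(match) and match.group(1).lower() == "yes"
```

Replies that do not start with "Yes" or "No" are treated as disagreement, so the debate continues by default.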

### CRIT Judgement Agents Group

In [None]:
#CRIT Agent
def evaluate_with_critic(pro_text, con_text):
    eval_prompt = f"""
Please evaluate the following two physicians’ arguments using the CRIT framework: Claim, Reasoning, Informativeness, Trustworthiness. Each aspect should be scored from 0.0 to 1.0 with fine-grained differentiation:

Pro:
- Claim: X
- Reasoning: X
- Informativeness: X
- Trustworthiness: X

Con:
- Claim: X
- Reasoning: X
- Informativeness: X
- Trustworthiness: X

---

Here are the physician arguments:

=== Pro Physician ===
{pro_text}

=== Con Physician ===
{con_text}
"""
    judge_models = ["gpt-4o", "gpt-4o-mini", "gpt-4-1106-preview"]
    all_judge_scores = []
    for model_name in judge_models:
        response = gpt_client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": eval_prompt}],
            temperature=0.3,
            max_tokens=800
        )
        all_judge_scores.append((model_name, response.choices[0].message.content.strip()))
        time.sleep(1.0)
    return all_judge_scores
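
`evaluate_with_critic` returns the raw text of each judge. Below is a minimal sketch of turning those replies into numbers, assuming the judges follow the `- Aspect: score` layout requested in the prompt. The helpers `parse_crit_scores` and `average_crit` are hypothetical names, and the unweighted mean is an illustrative choice rather than the aggregation used by the authors.

```python
import re
from statistics import mean

CRIT_ASPECTS = ("Claim", "Reasoning", "Informativeness", "Trustworthiness")


def parse_crit_scores(judge_text: str) -> dict:
    """Parse one judge reply into {"Pro": {aspect: score}, "Con": {aspect: score}}."""
    scores = {"Pro": {}, "Con": {}}
    side = None
    for line in judge_text.splitlines():
        # Drop Markdown decoration such as "**Pro:**" or "### Con".
        cleaned = line.replace("*", "").replace("#", "").strip()
        header = re.match(r"(Pro|Con)\b", cleaned, flags=re.IGNORECASE)
        if header:
            side = header.group(1).capitalize()
            continue
        item = re.match(r"-?\s*(\w+)\s*:\s*([01](?:\.\d+)?)", cleaned)
        if item and side and item.group(1) in CRIT_ASPECTS:
            scores[side][item.group(1)] = float(item.group(2))
    return scores


def average_crit(all_judge_scores) -> dict:
    """Average each side's mean aspect score over the judges that produced scores."""
    per_side = {"Pro": [], "Con": []}
    for _model, text in all_judge_scores:
        for side, aspects in parse_crit_scores(text).items():
            if aspects:
                per_side[side].append(mean(aspects.values()))
    return {side: (mean(vals) if vals else None) for side, vals in per_side.items()}
```

A side with no parsable scores yields `None`, which makes malformed judge replies visible instead of counting them as zero.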

### Consensus Agent

In [None]:
#consensus agent
def generate_consensus(pro_text, con_text):
    user_prompt = f"""
You are a senior medical consultant evaluating the diagnostic arguments of two physicians.

Please do the following in your answer:
1. Summarize the key strengths and weaknesses of each physician's argument in 2–3 sentences each.
2. Decide which physician's diagnosis is more reasonable based on clinical logic, specificity, and evidence.
3. Justify your choice clearly in 2–3 sentences using medical reasoning.

End with this exact format:
Final Diagnosis: <your selected diagnosis>

=== Pro Physician ===
{pro_text}

=== Con Physician ===
{con_text}
"""
    messages = [
        {"role": "system", "content": "You are a senior medical consultant synthesizing arguments and delivering a clear judgment."},
        {"role": "user", "content": user_prompt}
    ]
    response = deepseek_client.chat.completions.create(
        model="deepseek-chat",
        messages=messages,
        temperature=0.5,
        max_tokens=700
    )
    return response.choices[0].message.content.strip()


### Mermaid Agent

In [None]:

#mermaid agent
def generate_simplified_mermaid_with_consensus_diagnosis(deepseek_client, pro_text, con_text, consensus_diagnosis):
    prompt = f"""
You are a medical visualization expert. Generate a clean and interpretable Mermaid flowchart using `graph TD` that illustrates the diagnostic reasoning paths of both physicians (Pro and Con) based on the debate.

Follow these strict instructions:
1. Use exactly 4 subgraphs in this order: Symptoms, Possible Diagnoses, Supporting Evidence, Final Decision.
2. Begin all diagnostic paths from the Symptoms subgraph.
3. Use two nodes under “Possible Diagnoses” — one for each physician's suggested diagnosis.
4. All reasoning and evidence should flow logically from Symptoms → Diagnoses → Evidence → Final Decision.
5. The Final Decision subgraph must contain exactly two nodes:
   - One for the Pro’s suggested diagnosis.
   - One for the Con’s suggested diagnosis.
6. Highlight the **correct consensus diagnosis node** using:  
   `style NODEID fill:#a3f7bf,stroke:#333,stroke-width:2px`  
   Ensure the NODEID matches the node ID used in the graph.
7. Keep the total number of nodes under 14 by merging redundant content.
8. Use full medical terms (no abbreviations, no emojis).
9. Output only raw Mermaid code. Do NOT include markdown, backticks, or explanations.

=== Pro Physician ===
{pro_text}

=== Con Physician ===
{con_text}

=== Consensus Diagnosis ===
{consensus_diagnosis}
"""
    response = deepseek_client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
        max_tokens=1000
    )
    return response.choices[0].message.content.strip()


def clean_mermaid_code(raw_code: str) -> str:
    # Remove any triple-backtick and optional mermaid marker
    code = re.sub(r"^```(?:mermaid)?", "", raw_code.strip(), flags=re.IGNORECASE).strip()
    code = re.sub(r"```$", "", code.strip())

    # Remove any stray HTML tags like <br> (sometimes returned by models)
    code = re.sub(r"<br\s*/?>", "\n", code)

    # Normalize line endings
    code = code.replace("\r\n", "\n").replace("\r", "\n")

    # Remove leading/trailing empty lines
    lines = [line.rstrip() for line in code.split("\n")]
    cleaned_lines = [line for line in lines if line.strip() != ""]

    return "\n".join(cleaned_lines).strip()

def render_graphviz_from_mermaid_text(mermaid_code: str, output_name="diagnosis_flowchart"):
    dot = graphviz.Digraph(format='png')
    dot.attr(rankdir='TB')

    nodes = {}
    edges = []

    for line in mermaid_code.splitlines():
        line = line.strip()
        if not line or line.startswith("graph") or line.startswith("subgraph") or line == "end":
            continue
        if "-->" in line:
            # Split on every arrow so chained edges (A --> B --> C) also parse,
            # and drop optional edge labels written as |text| after the arrow.
            parts = [re.sub(r"^\|[^|]*\|", "", s.strip()).strip() for s in line.split("-->")]
            for part in parts:
                node_id = part.split("[")[0].strip()
                if "[" in part and "]" in part:
                    label = part.split("[", 1)[1].rsplit("]", 1)[0].strip()
                    nodes[node_id] = label
            ids = [part.split("[")[0].strip() for part in parts]
            edges.extend(zip(ids, ids[1:]))
        elif "[" in line and "]" in line:
            node_id = line.split("[")[0].strip()
            label = line.split("[", 1)[1].rsplit("]", 1)[0].strip()
            nodes[node_id] = label

    for node_id, label in nodes.items():
        dot.node(node_id, label=label, shape="box")

    for src, dst in edges:
        dot.edge(src, dst)

    from IPython.display import Image, display

    filepath = dot.render(filename=output_name, cleanup=True)
    print(f"✅ Saved image: {filepath}")
    display(Image(filename=filepath))

### Debate Run

In [None]:
def setup_case():
    return "Patient presents with itching, fatigue, lethargy, yellowish skin, dark urine, loss of appetite, abdominal pain, yellowing of the eyes, malaise, history of receiving a blood transfusion, and exposure to unsterile injections. Please determine the most likely diagnosis and explain your reasoning."

#Whole Debate Process
def run_debate(case_text=None):
    result = {
        "rounds": [],
        "consensus": "",
        "mermaid_code": "",
    }

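    # Fall back to the built-in sample case when no case text is given.
    case = case_text if case_text is not None else setup_case()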
    delta = 1.0
    rounds = 3
    last_pro, last_con = "", ""
    first_pro, first_con = "", ""

    for r in range(rounds):
        round_info = {"round": r + 1}

        if r == 0:
            pro = generate_first_round_response_pro(case)
            con = generate_first_round_response_con(case, pro)
            first_pro, first_con = pro, con
        else:
            pro_agree = check_agreement("pro physician", last_pro, last_con)
            con_agree = check_agreement("con physician", last_con, last_pro)
            round_info["pro_agree"] = pro_agree
            round_info["con_agree"] = con_agree

            # Match a leading "yes" only; a plain substring test also fires on words such as "eyes".
            pro_yes = re.match(r"\W*yes\b", pro_agree, flags=re.IGNORECASE) is not None
            con_yes = re.match(r"\W*yes\b", con_agree, flags=re.IGNORECASE) is not None

            if pro_yes and con_yes:
                pro = "I now agree with the Con physician's diagnosis based on the updated reasoning."
                con = "I now agree with the Pro physician's diagnosis based on the updated reasoning."
            elif pro_yes:
                pro = "I now agree with the Con physician's diagnosis based on the updated reasoning."
                con = generate_agent_response("con physician", case, delta, last_pro, last_con)
            elif con_yes:
                con = "I now agree with the Pro physician's diagnosis based on the updated reasoning."
                pro = generate_agent_response("pro physician", case, delta, last_con, last_pro)
            else:
                pro = generate_agent_response("pro physician", case, delta, last_con, first_pro)
                con = generate_agent_response("con physician", case, delta, last_pro, first_con)
        round_info["pro"] = pro
        round_info["con"] = con

        all_judges = evaluate_with_critic(pro, con)
        round_info["evaluation"] = [
            {"model": model, "score": score}
            for model, score in all_judges
        ]

        result["rounds"].append(round_info)
        last_pro, last_con = pro, con

    result["consensus"] = generate_consensus(last_pro, last_con)

    mermaid_code = generate_simplified_mermaid_with_consensus_diagnosis(
        deepseek_client, last_pro, last_con, result["consensus"]
    )
    cleaned_code = clean_mermaid_code(mermaid_code)
    result["mermaid_code"] = cleaned_code

    return result


# Simplified debate process (no agreement check, CRIT judges, or Mermaid chart)
def run_inference_minimal_debate(case_text):
    result = {
        "case": case_text,
        "rounds": [],
        "consensus": "",
        "diagnosis": "" 
    }

    delta = 1.0
    rounds = 3
    last_pro, last_con = "", ""
    first_pro, first_con = "", ""

    for r in range(rounds):
        round_info = {"round": r + 1}

        if r == 0:
            pro = generate_first_round_response_pro(case_text)
            con = generate_first_round_response_con(case_text, pro)
            first_pro, first_con = pro, con
        else:
            pro = generate_agent_response("pro physician", case_text, delta, last_con, first_pro)
            con = generate_agent_response("con physician", case_text, delta, last_pro, first_con)

        round_info["pro"] = pro
        round_info["con"] = con
        result["rounds"].append(round_info)

        last_pro, last_con = pro, con

    # generate_consensus expects (pro_text, con_text) in that order.
    consensus_text = generate_consensus(last_pro, last_con)
    result["consensus"] = consensus_text

    # Extract the diagnosis; strip Markdown bold markers such as "**Final Diagnosis:** X".
    match = re.search(r"Final Diagnosis:\s*(.+)", consensus_text)
    if match:
        result["diagnosis"] = match.group(1).strip().strip("*").strip()

    return result

## Symptom to Disease Prediction

In [None]:
import pandas as pd
import csv

# Load the prompt file. You need to create it yourself, with `Patient_Prompt` and `Disease` columns.
df = pd.read_csv("patient_prompts.csv")

results = []
correct_count = 0

for idx, row in df.iterrows():
    case_text = row["Patient_Prompt"]
    gt_disease = row["Disease"].strip().lower()

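    # Run the three-round debate
    output = run_inference_minimal_debate(case_text)
    pred_diag = output["diagnosis"].strip().lower()

    # Exact-match scoring is disabled here; predictions are saved below for later review.
    # is_correct = (pred_diag == gt_disease)
    # if is_correct:
    #     correct_count += 1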

    print(f"[{idx+1}] ✅ GT: {gt_disease} | 🔍 Predicted: {pred_diag} ")

    results.append({
        "CaseID": idx + 1,
        "GroundTruth": row["Disease"],
        "PatientPrompt": case_text,
        "PredictedDiagnosis": output["diagnosis"],
    })

# Save file
output_df = pd.DataFrame(results)
output_df.to_csv("patient_predictions.csv", index=False)
print("📁 Saved: patient_predictions.csv")



[1] ✅ GT: fungal infection | 🔍 Predicted: tinea versicolor 
[2] ✅ GT: allergy | 🔍 Predicted: nonallergic rhinitis (vasomotor rhinitis) 
[3] ✅ GT: gerd | 🔍 Predicted: peptic ulcer disease (pud) 
[4] ✅ GT: chronic cholestasis | 🔍 Predicted: obstructive jaundice due to choledocholithiasis 
[5] ✅ GT: drug reaction | 🔍 Predicted: allergic reaction (likely drug-induced) with secondary urinary tract irritation or infection. 
[6] ✅ GT: peptic ulcer diseae | 🔍 Predicted: intestinal parasitic infection (e.g., giardiasis or ascariasis) 
[7] ✅ GT: aids | 🔍 Predicted: hiv/aids (likely with oral candidiasis and systemic opportunistic infection) 
[8] ✅ GT: diabetes | 🔍 Predicted: ** type 2 diabetes mellitus 
[9] ✅ GT: gastroenteritis | 🔍 Predicted: food poisoning (e.g., staphylococcus aureus or bacillus cereus toxin-mediated) with moderate dehydration. 
[10] ✅ GT: bronchial asthma | 🔍 Predicted: acute bronchitis (viral etiology) 
[11] ✅ GT: hypertension | 🔍 Predicted: iron deficiency anemia 
[12] ✅ G
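
The predictions above are free text, so an exact string comparison with the ground-truth label almost never succeeds. Below is a minimal sketch of a looser check that counts a prediction as a hit when the normalized label appears in it as whole words. The helpers `normalize` and `is_loose_match` are hypothetical, and this heuristic is an assumption for illustration, not the scoring used by the authors. It still misses synonyms (for example "tinea versicolor" against "fungal infection"), which need manual review or a label mapping.

```python
import re


def normalize(text: str) -> str:
    """Lowercase the text and keep only letters and digits separated by single spaces."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def is_loose_match(predicted: str, ground_truth: str) -> bool:
    """True when the normalized ground-truth label occurs as whole words in the prediction."""
    pred, gt = normalize(predicted), normalize(ground_truth)
    if not pred or not gt:
        return False
    # Pad with spaces so that only whole-word occurrences count.
    return f" {gt} " in f" {pred} "
```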

## MedQA Test
### Because this task is slightly different, the prompts differ as well. Here we provide a sample framework for the MedQA question-answering task.

In [None]:
# ✅ Modified script using system + user prompts for each agent
import openai
import time
import json
import re

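# === Clients ===

# GPT client, used in this section for the Con agent; fill in your own API key
gpt_client = openai.OpenAI(api_key="")

# 🧠 DeepSeek client, used in this section for the Pro and consensus agents;
# fill in your own API key and the DeepSeek base URL from the API Setup cell
deepseek_client = openai.OpenAI(
    api_key="",
    base_url="")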


# === Round 1: Pro ===
def generate_first_round_response_pro(case_text, options):
    system_prompt = "You are a helpful physician tasked with choosing the most likely diagnosis."
    user_prompt = f"""
Case:
{case_text}

Options:
{options}

Please select the most likely diagnosis from the options above and explain your reasoning in 3–4 medically accurate sentences. Return your answer in the following format:
Answer: <Option Letter>
Reasoning: <your step-by-step explanation>
"""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt.strip()}
    ]
    response = deepseek_client.chat.completions.create(
        model="deepseek-chat",
        messages=messages,
        temperature=0.7,
        max_tokens=600
    )
    return response.choices[0].message.content.strip()

# === Round 1: Con ===
def generate_first_round_response_con(case_text, pro_text, options):
    system_prompt = "You are a critical physician challenging your colleague's diagnosis."
    user_prompt = f"""
Case:
{case_text}

Options:
{options}

Your colleague provided the following diagnosis and reasoning:
{pro_text}

Please propose a different option (choose a different letter from above) and explain your reasoning in 3–4 medically accurate sentences. Return your answer in the following format:
Answer: <Option Letter>
Reasoning: <your rebuttal and justification>
"""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt.strip()}
    ]
    response = gpt_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0.7,
        max_tokens=600
    )
    return response.choices[0].message.content.strip()

# === Round 2/3: Debate ===
def generate_agent_response(role, case_text, last_opponent_text=None, your_diagnosis=None):
    system_prompt = f"You are a {role} engaged in a clinical diagnostic debate."
    user_prompt = f"""
Case: {case_text}
Opponent's Previous Argument: {last_opponent_text}

Please continue the debate by defending your original diagnosis (**{your_diagnosis}**) with new reasoning.
Then, refute your opponent’s diagnosis with at least two comparative medical points.
"""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ]
    # The Pro agent runs on DeepSeek and the Con agent on GPT-4o mini.
    if role == "pro physician":
        client, model = deepseek_client, "deepseek-chat"
    elif role == "con physician":
        client, model = gpt_client, "gpt-4o-mini"
    else:
        # Previously an unknown role silently returned None.
        raise ValueError(f"Unknown role: {role!r}")
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.7,
        max_tokens=600
    )
    return response.choices[0].message.content.strip()

# === Final: Consensus ===
def generate_consensus(case_text, pro_text, con_text):
    system_prompt = "You are a neutral senior medical consultant tasked with evaluating two diagnostic arguments and determining the better-supported answer."

    user_prompt = f"""
Case:
{case_text}

Pro Physician's Argument:
{pro_text}

Con Physician's Argument:
{con_text}

Please do the following in your answer:
1. **Summarize** the key strengths and weaknesses of each physician's argument in 2–3 sentences each.
2. Then, **decide** which physician's diagnosis is more reasonable **based on clinical logic, specificity, and evidence.**
3. **Justify** your choice clearly in 2–3 sentences using medical reasoning (e.g., transmission route fit, symptom alignment, diagnostic yield).

Do not hedge or say both are equally valid. Pick the one that is currently more convincing based on available information.

Return your final answer starting with:  
Answer: X (where X is one of A/B/C/D)
"""

    messages = [
        {"role": "system", "content": system_prompt.strip()},
        {"role": "user", "content": user_prompt.strip()}
    ]

    response = deepseek_client.chat.completions.create(
        model="deepseek-chat",
        messages=messages,
        temperature=0.7,
        max_tokens=800
    )
    # Alternative judge (disabled): GPT-4o instead of DeepSeek.
    # response = gpt_client.chat.completions.create(
    #     model="gpt-4o",
    #     messages=messages,
    #     temperature=0.7,
    #     max_tokens=600
    # )
    return response.choices[0].message.content.strip()

# === Inference Wrapper ===

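def extract_answer(text):
    """
    Extracts the option letter from model output such as 'Answer: B'.
    Tolerates Markdown bold ('**Answer:** B') and returns '?' when nothing matches.
    """
    # \b keeps a word such as "Disclose" from being read as option D.
    match = re.search(r"Answer:[\s*]*([A-D])\b", text)
    return match.group(1) if match else "?"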

def run_all_benchmark_cases(json_path, pro_fn, con_fn, debate_fn, consensus_fn, max_cases=None):
    with open(json_path, 'r') as f:
        benchmark = json.load(f)

    results = {}
    count = 0
    correct = 0
    half = 0
    for qid, entry in benchmark["medqa"].items():
        if max_cases is not None and count >= max_cases:
            break

        question = entry["question"]
        options = "\n".join([f"{k}: {v}" for k, v in entry["options"].items()])
        ground_truth = entry["answer"]

        print(f"\n=== Case {qid} ===")
        print(question)
        print(options)
        # Round 1
        pro_text = pro_fn(question, options)
        con_text = con_fn(question, pro_text, options)

        pro_answer = extract_answer(pro_text)
        con_answer = extract_answer(con_text)

        print(f"Pro: {pro_answer} | {pro_text}")
        print(f"Con: {con_answer} | {con_text}")

        if ground_truth in [pro_answer, con_answer]:
            half += 1
        # Round 2
        pro_text2 = debate_fn("pro physician", question, con_text, pro_answer)
        con_text2 = debate_fn("con physician", question, pro_text, con_answer)

        # Round 3
        pro_text3 = debate_fn("pro physician", question, con_text2, pro_answer)
        con_text3 = debate_fn("con physician", question, pro_text2, con_answer)

        # Consensus
        consensus_text = consensus_fn(question, pro_text3, con_text3)
        consensus_answer = extract_answer(consensus_text)

        if consensus_answer == ground_truth:
            correct += 1
        print(f"✅ Consensus: {consensus_answer} | {consensus_text}")
        print(f"🧪 Ground Truth: {ground_truth}")

        results[qid] = {
            "ID":qid,
            "question": question,
            "pro_answer": pro_answer,
            "con_answer": con_answer,
            "consensus_answer": consensus_answer,
            "ground_truth": ground_truth,
            "pro_text_1": pro_text,
            "con_text_1": con_text,
            "pro_text_2": pro_text2,
            "con_text_2": con_text2,
            "pro_text_3": pro_text3,
            "con_text_3": con_text3,
            "consensus_text": consensus_text
        }

        count += 1
    print("correct / count", correct, '/', count)
    print("half / count", half, '/', count)
    return results
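
The two counters printed at the end can also be recomputed from the returned dictionary, which is handy when results from several runs are merged. Below is a minimal sketch; the helper name `summarize_results` and the key names in its return value are hypothetical. "Coverage" here is the `half` counter above: the share of questions where the ground truth is among the two first-round answers.

```python
def summarize_results(results: dict) -> dict:
    """Compute consensus accuracy and first-round coverage from a results dict."""
    total = len(results)
    if total == 0:
        return {"total": 0, "accuracy": None, "coverage": None}
    correct = sum(r["consensus_answer"] == r["ground_truth"] for r in results.values())
    covered = sum(
        r["ground_truth"] in (r["pro_answer"], r["con_answer"]) for r in results.values()
    )
    return {"total": total, "accuracy": correct / total, "coverage": covered / total}
```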

### Demo: 50 examples

In [18]:
results = run_all_benchmark_cases(
    json_path="/Users/davinkey/Desktop/AIBigData/final/data/benchmark.json",
    pro_fn=generate_first_round_response_pro,
    con_fn=generate_first_round_response_con,
    debate_fn=generate_agent_response,
    consensus_fn=generate_consensus,
    max_cases=50
)


=== Case 0000 ===
A junior orthopaedic surgery resident is completing a carpal tunnel repair with the department chairman as the attending physician. During the case, the resident inadvertently cuts a flexor tendon. The tendon is repaired without complication. The attending tells the resident that the patient will do fine, and there is no need to report this minor complication that will not harm the patient, as he does not want to make the patient worry unnecessarily. He tells the resident to leave this complication out of the operative report. Which of the following is the correct next action for the resident to take?
A: Disclose the error to the patient and put it in the operative report
B: Tell the attending that he cannot fail to disclose this mistake
C: Report the physician to the ethics committee
D: Refuse to dictate the operative report
Pro: A | Answer: A
Reasoning: The resident has an ethical and professional obligation to disclose the error to the patient and document it in t

In [20]:
def run_all_benchmark_cases(json_path, pro_fn, con_fn, debate_fn, consensus_fn, max_cases=None, start_index=0):
    with open(json_path, 'r') as f:
        benchmark = json.load(f)

    results = {}
    count = 0
    correct = 0
    half = 0

    qids = list(benchmark["medqa"].keys())
    total = len(qids)

    for idx in range(start_index, total):
        if max_cases is not None and count >= max_cases:
            break

        qid = qids[idx]
        entry = benchmark["medqa"][qid]
        question = entry["question"]
        options = "\n".join([f"{k}: {v}" for k, v in entry["options"].items()])
        ground_truth = entry["answer"]

        print(f"\n=== Case {qid} ({idx+1}) ===")
        print(question)
        print(options)

        # Round 1
        pro_text = pro_fn(question, options)
        con_text = con_fn(question, pro_text, options)

        pro_answer = extract_answer(pro_text)
        con_answer = extract_answer(con_text)

        print(f"Pro: {pro_answer} | {pro_text}")
        print(f"Con: {con_answer} | {con_text}")

        if ground_truth in [pro_answer, con_answer]:
            half += 1

        # Round 2 & 3
        pro_text2 = debate_fn("pro physician", question, con_text, pro_answer)
        con_text2 = debate_fn("con physician", question, pro_text, con_answer)

        pro_text3 = debate_fn("pro physician", question, con_text2, pro_answer)
        con_text3 = debate_fn("con physician", question, pro_text2, con_answer)

        # Consensus
        consensus_text = consensus_fn(question, pro_text3, con_text3)
        consensus_answer = extract_answer(consensus_text)

        if consensus_answer == ground_truth:
            correct += 1

        print(f"✅ Consensus: {consensus_answer} | {consensus_text}")
        print(f"🧪 Ground Truth: {ground_truth}")

        results[qid] = {
            "ID": qid,
            "question": question,
            "pro_answer": pro_answer,
            "con_answer": con_answer,
            "consensus_answer": consensus_answer,
            "ground_truth": ground_truth,
            "pro_text_1": pro_text,
            "con_text_1": con_text,
            "pro_text_2": pro_text2,
            "con_text_2": con_text2,
            "pro_text_3": pro_text3,
            "con_text_3": con_text3,
            "consensus_text": consensus_text
        }

        count += 1

    print("correct / count", correct, '/', count)
    print("half / count", half, '/', count)
    return results
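
`run_all_benchmark_cases` returns its dictionary only when the loop finishes, so an interrupted run loses every case processed in that call. Below is a minimal sketch of saving results to disk as the run progresses. The helpers `save_checkpoint` and `load_checkpoint` and any file path passed to them are hypothetical, not part of the original notebook.

```python
import json
import os
import tempfile


def save_checkpoint(results: dict, path: str) -> None:
    """Write results to `path` atomically so an interrupt cannot leave a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)
    # os.replace swaps the finished temporary file into place in one step.
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> dict:
    """Return previously saved results, or an empty dict when no checkpoint exists."""
    if not os.path.exists(path):
        return {}
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```

One way to use this would be to call `save_checkpoint(results, path)` right after `results[qid] = {...}` inside the loop, and to skip any `qid` already present in `load_checkpoint(path)` when resuming.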

### Another 215 Questions (interrupted early; 300 were planned)

In [22]:
results1 = run_all_benchmark_cases(
    json_path="/Users/davinkey/Desktop/AIBigData/final/data/benchmark.json",
    pro_fn=generate_first_round_response_pro,
    con_fn=generate_first_round_response_con,
    debate_fn=generate_agent_response,
    consensus_fn=generate_consensus,
    start_index=50,  # start from the 51st question
    max_cases=350     # cap on how many more cases to run
)


=== Case 0050 (51) ===
A 65-year-old male is treated for anal carcinoma with therapy including external beam radiation. How does radiation affect cancer cells?
A: Induces the formation of thymidine dimers
B: Induces the formation of disulfide bonds
C: Induces deamination of cytosine
D: Induces breaks in double-stranded DNA
Pro: D | Answer: D
Reasoning: Radiation therapy primarily works by damaging the DNA of cancer cells, leading to their death. The high-energy radiation induces breaks in the double-stranded DNA, which are particularly lethal because they disrupt the integrity of the genetic material. While thymidine dimers (Option A) are caused by UV radiation, and deamination of cytosine (Option C) is a spontaneous or chemical mutagenesis process, ionizing radiation (like external beam radiation) directly causes double-stranded DNA breaks (Option D). Disulfide bond formation (Option B) is unrelated to radiation's mechanism of action.
Con: A | Answer: A  
Reasoning: While it is true 

KeyboardInterrupt: 

In [23]:
results

{'0000': {'ID': '0000',
  'question': 'A junior orthopaedic surgery resident is completing a carpal tunnel repair with the department chairman as the attending physician. During the case, the resident inadvertently cuts a flexor tendon. The tendon is repaired without complication. The attending tells the resident that the patient will do fine, and there is no need to report this minor complication that will not harm the patient, as he does not want to make the patient worry unnecessarily. He tells the resident to leave this complication out of the operative report. Which of the following is the correct next action for the resident to take?',
  'pro_answer': 'A',
  'con_answer': 'B',
  'consensus_answer': 'A',
  'ground_truth': 'B',
  'pro_text_1': 'Answer: A\nReasoning: The resident has an ethical and professional obligation to disclose the error to the patient and document it in the operative report. Transparency is critical in patient care, and omitting complications breaches trust a