# Final Project Starting Guide

Hello everyone, welcome to the final project! This notebook is provided to you to reiterate the rules and guidelines, and give you some starting points.

### What we provide

In this project, we will provide you with:
- This starting guide
- A working API that you can access from the ASU network (i.e., on campus or with VPN)
- A starting development dataset that you can use to develop your agent. It contains 1,000 instances, each with {domain, input, expected_output}

### Your goal

In this project, you will implement an inference-time agent to solve reasoning requests like those provided in the development data. Grading is effort-based: you will get full credit if you produce the minimum deliverables below, subject to the rules and requirements below.

#### Minimum Deliverables

1. A working agent loop (in the form of a GitHub project) that the TA can run, and that implements *at least three* inference-time algorithms or techniques.
2. Outputs from your agent on the released test data (see important dates).
3. A short one-page report on how your agent works, with pointers to important techniques (references to code blocks).

#### Rules and Requirements
1. You must use only our provided API to access LLMs; you cannot use any other LLM in any other way within your agent loop. Some exceptions may be made if you call certain external tools (e.g., Google search) that use LLMs internally. Please discuss any exceptions with us to avoid penalties of up to 50% of the project grade.
2. You must not hardcode a full delegation to an external tool (e.g., google_search(input_problem)). Such delegations must be automatically selected/decided by your agent. Hardcoded delegations will lead to a zero.
3. You cannot use Cursor or any AI coding aids to implement the final project. You can, however, ask LLMs (or other online resources) for conceptual clarification or code examples. Your final project should not contain any blocks of code (i.e., > 3 lines) that are written by AI. Violations will lead to a zero.
4. Your agent should be able to run efficiently, with <20 LLM calls per question. Exceptions may be made if you have a complicated agent, but please discuss them with us. Up to 10% of the project grade may be deducted if we observe very inefficient LLM usage that does not clearly benefit performance.
5. Your agent must run without any requests to paid services (a service counts as paid if the TA has to pay to run it, regardless of whether you actually pay for it). Violations will lead to a zero.
6. You must submit a GitHub project link as your code submission. All changes must be tracked, and each commit should be within 100 lines of +/- with a good message. Up to 25% of the project grade will be deducted if we observe "magic commits" or too few commits.
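
Rule 4 caps the agent at fewer than 20 LLM calls per question. One way to make that limit hard to exceed by accident is to count calls per question. The sketch below is only an illustration: `CallBudget`, `BudgetExceeded`, and `fake_llm` are hypothetical names that are not part of the provided starter code, and the default of 19 is our reading of "<20".

```python
class BudgetExceeded(RuntimeError):
    """Raised when a question uses more LLM calls than allowed."""


class CallBudget:
    """Wraps any callable and counts how many times it is invoked."""

    def __init__(self, fn, max_calls=19):
        self.fn = fn
        self.max_calls = max_calls
        self.calls = 0

    def reset(self):
        # Call this before starting each new question.
        self.calls = 0

    def __call__(self, *args, **kwargs):
        if self.calls >= self.max_calls:
            raise BudgetExceeded(f"more than {self.max_calls} LLM calls")
        self.calls += 1
        return self.fn(*args, **kwargs)


def fake_llm(prompt):
    # Stand-in for the real API call so the sketch runs offline.
    return f"echo: {prompt}"


budgeted = CallBudget(fake_llm, max_calls=3)
for i in range(3):
    budgeted(f"step {i}")
print(budgeted.calls)
```

In the notebook, the same wrapper could be placed around `call_model_chat_completions`, with `reset()` called at the start of each question.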


### Suggestions
1. Start early, please.
2. Consider how you can evaluate whether your output is good enough compared to the provided expected_outputs. We will not release how we actually evaluate your outputs, so you have to try to predict how we will evaluate them.
3. Start with a basic implementation, and iterate based on mistakes and feedback.
4. Find more development data, or create your own cases to stress-test your agent.
5. You are free to modify any code provided in this starting guide, or to not use it at all.

### Important dates
- **Release of final test data**: 11/25/2025
- **Deadline for submitting all deliverables**: 12/05/2025

### Extra Credit
The top 20 projects (ranked by performance metrics on the test data and, at the TA's discretion, implementation quality) will be given extra credit. The actual credit will be between 1% and 7.5%, depending on the ranking.

In [85]:
# %% Minimal setup
# If needed (uncomment in a notebook):
# !pip install requests python-dotenv

import os
import json
import textwrap
import re
import time
import requests

API_KEY = os.getenv("OPENAI_API_KEY", "cse476")
API_BASE = os.getenv("API_BASE", "http://10.4.58.53:41701/v1")
MODEL = os.getenv("MODEL_NAME", "bens_model")


def call_model_chat_completions(prompt: str,
                                system: str = "You are a helpful assistant. Reply with only the final answer—no explanation.",
                                model: str = MODEL,
                                max_tokens: int = 128,
                                temperature: float = 0.0,
                                timeout: int = 60) -> dict:
    """
    Calls an OpenAI-style /v1/chat/completions endpoint and returns:
    { 'ok': bool, 'text': str or None, 'raw': dict or None, 'status': int, 'error': str or None, 'headers': dict }
    """
    url = f"{API_BASE}/chat/completions"
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type":  "application/json",
    }
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user",   "content": prompt}
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

    try:
        resp = requests.post(url, headers=headers,
                             json=payload, timeout=timeout)
        status = resp.status_code
        hdrs = dict(resp.headers)
        if status == 200:
            data = resp.json()
            text = data.get("choices", [{}])[0].get(
                "message", {}).get("content", "")
            return {"ok": True, "text": text, "raw": data, "status": status, "error": None, "headers": hdrs}
        else:
            # try best-effort to surface error text
            err_text = None
            try:
                err_text = resp.json()
            except Exception:
                err_text = resp.text
            return {"ok": False, "text": None, "raw": None, "status": status, "error": str(err_text), "headers": hdrs}
    except requests.RequestException as e:
        return {"ok": False, "text": None, "raw": None, "status": -1, "error": str(e), "headers": {}}


## 1) Smoke test: direct inference

We’ll do a single request with a strict instruction to answer briefly.  
*If you see an auth error, set `OPENAI_API_KEY` and (if needed) `API_BASE`/`MODEL_NAME`.*


In [86]:
# %% Direct call example
demo_prompt = "What is 17 + 28? Answer with just the number."
result = call_model_chat_completions(demo_prompt)
print("OK:", result["ok"], "HTTP:", result["status"])
print("MODEL SAYS:", (result["text"] or "").strip())

# Optional: Inspect rate-limit headers if your provider exposes them
for k in ["x-ratelimit-remaining-requests", "x-ratelimit-limit-requests", "x-request-id"]:
    if k in result["headers"]:
        print(f"{k}: {result['headers'][k]}")


OK: True HTTP: 200
MODEL SAYS: 45
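
If the smoke test fails intermittently (for example, when the VPN connection drops), a small retry wrapper can help. The following is a minimal sketch under our own assumptions: `call_with_retries` is a hypothetical helper, not part of the starter code, and it works with any zero-argument callable that returns a dict with an `ok` key, which is the shape `call_model_chat_completions` returns.

```python
import time


def call_with_retries(call_fn, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call call_fn() until it returns a dict with ok=True or attempts run out.

    call_fn is any zero-argument callable returning a dict with an 'ok' key.
    """
    result = {"ok": False, "error": "not called"}
    for attempt in range(attempts):
        result = call_fn()
        if result.get("ok"):
            return result
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * (2 ** attempt))
    return result


# Offline demonstration with a fake call that fails twice, then succeeds.
outcomes = iter([
    {"ok": False, "error": "timeout"},
    {"ok": False, "error": "timeout"},
    {"ok": True, "text": "45"},
])
delays = []  # records the requested sleeps instead of actually sleeping
final = call_with_retries(lambda: next(outcomes), sleep=delays.append)
print(final["text"], delays)
```

With the real API, usage could look like `call_with_retries(lambda: call_model_chat_completions(demo_prompt))`. Keep in mind that each retry counts toward the per-question call limit.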


## 2) A tiny test set (3 questions)

We’ll cover:
1. **Math reasoning** — inequality solving,
2. **Common sense** — buoyancy/ice & water,
3. **Logic** — a classic race-position puzzle.

We also tightly constrain the required answer forms to enable simple auto‑grading.


In [87]:
# %% Define three tests: input + expected
tests = [
    {
        "id": "math_inequality",
        "type": "numeric",  # grader will prefer numeric extraction
        "prompt": "Solve for the smallest integer n such that 3n + 5 > 26. Answer with just the integer.",
        "expected": "8",    # Because 3n > 21 => n > 7, smallest integer is 8
    },
    {
        "id": "commonsense_ice",
        "type": "text",
        "prompt": (
            "You place an ice cube in a glass of water and mark the water level. "
            "After the ice melts, does the water level rise, fall, or stay the same? "
            "Answer with exactly one of: 'rise', 'fall', 'stay the same'."
        ),
        "expected": "stay the same",
    },
    {
        "id": "logic_race",
        "type": "text",
        "prompt": (
            "In a race, you pass the person in second place. What position are you now in? "
            "Answer with a single word like 'first', 'second', 'third'."
        ),
        "expected": "second",
    },
]


## 3) Minimal evaluator

We provide some example code to decide whether the agent outputs match the expected outputs, just to give you an idea of how evaluations can be done. You are free to use this code, or not.

In [88]:
# %% Simple normalization and evaluation helpers
def normalize_text(s: str) -> str:
    s = (s or "").strip().lower()
    # Replace punctuation (except hyphens and apostrophes) with spaces, then collapse whitespace
    s = re.sub(r"[^\w\s\-']", " ", s)
    s = re.sub(r"\s+", " ", s).strip()

    # Map common synonyms used in these tests
    synonyms = {
        "unchanged": "stay the same",
        "no change": "stay the same",
        "same": "stay the same",
        "second place": "second",
        "2nd": "second",
        "first place": "first",
        "third place": "third",
    }
    return synonyms.get(s, s)

def extract_number(s: str):
    # Returns first number occurrence as string if found, else None
    if not s:
        return None
    m = re.search(r"[-+]?\d+(\.\d+)?", s)
    return m.group(0) if m else None

def grade(expected: str, got: str, kind: str) -> bool:
    if kind == "numeric":
        exp_num = extract_number(expected)
        got_num = extract_number(got)
        return (exp_num is not None) and (got_num == exp_num)
    else:
        return normalize_text(got) == normalize_text(expected)

def evaluate_tests(tests, model=MODEL):
    rows = []
    for t in tests:
        r = call_model_chat_completions(
            t["prompt"],
            system="You are a careful solver. Reply ONLY with the final answer, nothing else.",
            model=model,
            temperature=0.0,
        )
        got = (r["text"] or "").strip()
        is_correct = grade(t["expected"], got, t["type"])
        rows.append({
            "id": t["id"],
            "expected": t["expected"],
            "got": got,
            "correct": is_correct,
            "status": r["status"],
            "error": r["error"],
        })
        # Tiny pacing to be polite to the API
        time.sleep(0.2)

    # Print a small report
    correct = sum(1 for x in rows if x["correct"])
    print(f"Score: {correct}/{len(rows)} correct")
    for x in rows:
        mark = "✅" if x["correct"] else "❌"
        print(f"{mark} {x['id']}: expected={x['expected']!r}, got={x['got']!r} (HTTP {x['status']})")
        if x["error"]:
            print("   error:", x["error"])
    return rows

results = evaluate_tests(tests)


Score: 2/3 correct
❌ math_inequality: expected='8', got='4' (HTTP 200)
✅ commonsense_ice: expected='stay the same', got='stay the same' (HTTP 200)
✅ logic_race: expected='second', got='second' (HTTP 200)
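
Note that `grade` compares the extracted numbers as strings, so an answer of `8.0` would not match an expected `8`. A possible alternative is sketched below, assuming that answers within a small tolerance should count as equal. `first_number`, `numbers_match`, and the tolerance are illustrative choices, not the course's grading rule.

```python
import math
import re


def first_number(s):
    """Return the first number in s as a float, or None if there is none."""
    m = re.search(r"[-+]?\d+(?:\.\d+)?", s or "")
    return float(m.group(0)) if m else None


def numbers_match(expected, got, tol=1e-9):
    """Compare the first number in each string by value, not by spelling."""
    e, g = first_number(expected), first_number(got)
    if e is None or g is None:
        return False
    return math.isclose(e, g, rel_tol=tol, abs_tol=tol)


print(numbers_match("8", "8.0"))
print(numbers_match("8", "n = 8"))
print(numbers_match("8", "4"))
```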


In [89]:
def self_evaluate(question, prediction, expected_answer, model=MODEL):
    """
    Use the model itself as a strict grader.
    Returns True if the model says the prediction matches the expected answer; else False.
    Falls back to a simple normalized string compare if the model's reply is malformed.
    """
    import re

    system = "You are a strict grader. Reply with exactly True or False. No punctuation. No explanation."
    prompt = f"""You are grading a question-answer pair.

Return exactly True if the PREDICTION would be accepted as correct for the EXPECTED_ANSWER.
Otherwise, return False.

QUESTION:
{question}

PREDICTION:
{prediction}

EXPECTED_ANSWER:
{expected_answer}

Answer with exactly: True or False
"""

    r = call_model_chat_completions(
        prompt,
        system=system,
        model=model,
        temperature=0.0,
    )

    reply = (r.get("text") or "").strip().lower()
    if reply.startswith("true"):
        return True
    if reply.startswith("false"):
        return False

    # Fallback: simple normalization-based equality
    norm = lambda s: re.sub(r"\s+", " ", (s or "").strip().lower())
    return norm(prediction) == norm(expected_answer)


In [90]:
def self_evaluate_tests(tests, model=MODEL, grader_model=None, sleep_sec=0.2, verbose=True):
    """
    Run the tests by querying the model for each prompt, then use LLM-as-a-judge
    (self_evaluate) to determine correctness.

    Args:
        tests: list of dicts with keys: id, prompt, expected (and optionally type)
        model: model used to generate predictions
        grader_model: model used to judge correctness (defaults to `model` if None)
        sleep_sec: small delay between calls to be polite to the API
        verbose: if True, print a summary line per test

    Returns:
        rows: list of dicts with fields:
              id, expected, got, correct, status, error
    """
    import time

    judge_model = grader_model or model
    rows = []

    for t in tests:
        # 1) Get model prediction
        r = call_model_chat_completions(
            t["prompt"],
            system="You are a careful solver. Reply ONLY with the final answer, nothing else.",
            model=model,
            temperature=0.0,
        )
        got = (r.get("text") or "").strip()

        # 2) LLM-as-a-judge: strict True/False
        is_correct = self_evaluate(
            question=t["prompt"],
            prediction=got,
            expected_answer=t["expected"],
            model=judge_model,
        )

        row = {
            "id": t.get("id", "<unnamed>"),
            "expected": t["expected"],
            "got": got,
            "correct": bool(is_correct),
            "status": r.get("status"),
            "error": r.get("error"),
        }
        rows.append(row)

        if verbose:
            mark = "✅" if is_correct else "❌"
            print(f"{mark} {row['id']}: expected={row['expected']!r}, got={row['got']!r} (HTTP {row['status']})")
            if row["error"]:
                print("   error:", row["error"])

        if sleep_sec:
            time.sleep(sleep_sec)

    return rows

# Example:
results_llm_judge = self_evaluate_tests(tests, verbose=True, model=MODEL, grader_model=MODEL)


❌ math_inequality: expected='8', got='4' (HTTP 200)
✅ commonsense_ice: expected='stay the same', got='stay the same' (HTTP 200)
✅ logic_race: expected='second', got='second' (HTTP 200)


## My Agent Implementation


In [91]:
# load development data from json
import json

with open("cse476_final_project_dev_data.json", "r", encoding="utf-8") as f:
    dev_data = json.load(f)

print("Number of dev examples:", len(dev_data))
print("First dev item:")
print(dev_data[0])


Number of dev examples: 1000
First dev item:
{'input': 'Let $ABCD$ be a convex quadrilateral with $AB = CD = 10$ , $BC = 14$ , and $AD = 2\\sqrt{65}$ . Assume that the diagonals of $ABCD$ intersect at point $P$ , and that the sum of the areas of triangles $APB$ and $CPD$ equals the sum of the areas of triangles $BPC$ and $APD$ . Find the area of quadrilateral $ABCD$ .', 'output': '112', 'domain': 'math'}
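
Since every item carries a `domain` field, it may be worth checking how the development data is distributed before tuning the agent. A small sketch follows, using a hand-made stand-in list rather than the real file; the `logic` domain in it is a made-up example value, not something confirmed to exist in the data.

```python
from collections import Counter

# Tiny stand-in for dev_data; the real file has 1,000 items with these keys.
sample_dev_data = [
    {"input": "q1", "output": "112", "domain": "math"},
    {"input": "q2", "output": "True", "domain": "logic"},  # hypothetical domain
    {"input": "q3", "output": "7", "domain": "math"},
]

# Count items per domain, treating a missing field as "unknown".
domain_counts = Counter(item.get("domain", "unknown") for item in sample_dev_data)
for domain, count in domain_counts.most_common():
    print(f"{domain}: {count}")
```

Replacing `sample_dev_data` with `dev_data` would give the real distribution.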


In [92]:
# load test data + answer template
with open("cse_476_final_project_test_data.json", "r", encoding="utf-8") as f:
    test_data = json.load(f)

with open("cse_476_final_project_answers.json", "r", encoding="utf-8") as f:
    test_answers = json.load(f)

print("Test questions:", len(test_data))
print("Template rows:", len(test_answers))

Test questions: 6208
Template rows: 6208


### Inference technique #1: Chain of thought

In [93]:
def chain_of_thought(
    task_text: str,
    system_message: str = (
        "You are a careful but VERY concise reasoner.\n"
        "Don't explain what you're doing; just do the steps you need to do.\n"
        "Solve the problem in at most 5 short steps.\n"
        "Do NOT restate the question.\n"
        "Use as few words as possible.\n"
        "On the last line, write exactly: FINAL ANSWER: <answer>"
    ),
    temp: float = 0.4,
):
    """
    Single chain-of-thought call.
    Returns (raw_text, raw_response_dict).
    The prompt is optimized to keep reasoning short and avoid token overflow.
    """

    res = call_model_chat_completions(
        prompt=task_text,
        system=system_message,
        model=MODEL,
        temperature=temp,
        max_tokens=512,
    )

    raw_text = (res.get("text") or "").strip()
    return raw_text, res


In [94]:
txt, _ = chain_of_thought("what is 13+29? Just give the number")
print((txt))

13 + 29 = 42  
FINAL ANSWER: 42


In [95]:
def extract_final_answer(model_output) -> str:
    """Extract a clean final answer string from the model's output."""
    import re

    # Normalize to string
    if model_output is None:
        return ""
    if not isinstance(model_output, str):
        model_output = str(model_output)

    text = model_output.strip()
    lower_output = text.lower()

    # Look for "final answer:" (case-insensitive, use last occurrence)
    if "final answer:" in lower_output:
        idx = lower_output.rfind("final answer:")
        # slice from the original-cased text
        final_answer = text[idx + len("final answer:"):].strip()
    else:
        # Fallback: use the whole output
        final_answer = text

    # Remove \boxed{...} if present
    if "\\boxed{" in final_answer:
        start = final_answer.find("\\boxed{") + len("\\boxed{")
        end = final_answer.rfind("}")
        if end > start:
            final_answer = final_answer[start:end].strip()

    # Strip surrounding $...$ or \( \) LaTeX math wrappers
    final_answer = final_answer.strip()
    if final_answer.startswith("$$") and final_answer.endswith("$$"):
        final_answer = final_answer[2:-2].strip()
    elif final_answer.startswith("$") and final_answer.endswith("$"):
        final_answer = final_answer[1:-1].strip()
    elif final_answer.startswith(r"\(") and final_answer.endswith(r"\)"):
        final_answer = final_answer[2:-2].strip()

    # Normalize True/False style answers
    lower_answer = final_answer.lower()
    words = lower_answer.split()
    first_word = words[0] if words else ""

    if "false" in lower_answer and "true" not in lower_answer:
        final_answer = "False"
    elif "true" in lower_answer and "false" not in lower_answer:
        final_answer = "True"
    elif first_word in ["no", "no,", "no."]:
        final_answer = "False"
    elif first_word in ["yes", "yes,", "yes."]:
        final_answer = "True"

    return final_answer.strip()


In [96]:
print(extract_final_answer("some steps\nFINAL ANSWER: 42"))

42
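
One limitation of the `\boxed{...}` handling in `extract_final_answer` is that it uses the last `}` in the string as the end of the box, so any braces that appear after the box are pulled into the answer. A brace-matching variant is sketched below; `extract_boxed` is a hypothetical helper, offered as one possible approach rather than a required change.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None.

    Braces are matched by depth, so nested groups such as \\frac{1}{2} and
    trailing text after the box are handled.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Found the brace that closes \boxed{
                return "".join(out).strip()
        out.append(ch)
        i += 1
    return None  # unbalanced braces


print(extract_boxed(r"so \boxed{\frac{1}{2}} (see {note})"))
```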


### Inference technique #2: Self-consistency

In [97]:
from collections import Counter

# self consistency
# reduce number steps from 5 to best 2


def self_con(question_text: str,
             sample_count: int = 2,
             temp: float = 0.4,
             ):
    """Self-consistency: run CoT multiple times and pick the most common final answer.
    Returns (best_answer, all_answers, all_raw_texts)."""

    answers = []
    raw_texts = []

    for _ in range(sample_count):
        raw, _ = chain_of_thought(question_text, temp=temp)
        final = extract_final_answer(raw)
        answers.append(final)
        raw_texts.append(raw)

    non_empty = [a for a in answers if a]
    if non_empty:
        counts = Counter(non_empty)
        best = counts.most_common(1)[0][0]
    else:
        best = ""

    return best, answers, raw_texts


In [98]:
# test
best, all_ans, all_raw = self_con("what is 13+29? just give number", sample_count=3)
print("Best:" , best)
print("All:", all_ans)


Best: 42
All: ['42', '42', '42']
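
With the default `sample_count=2`, two different answers produce a tie, and `Counter.most_common` then returns whichever answer was seen first. The sketch below illustrates that behavior and adds a light normalization step before voting. `majority_vote` is a hypothetical helper, and lowercasing/stripping is only one possible normalization.

```python
from collections import Counter


def majority_vote(answers):
    """Return (winner, votes) after light normalization.

    Ties go to the answer that appeared first.
    """
    normalized = [a.strip().lower() for a in answers if a and a.strip()]
    if not normalized:
        return "", 0
    counts = Counter(normalized)
    # Equal counts keep their first-seen order, so ties favor earlier answers.
    winner, votes = counts.most_common(1)[0]
    return winner, votes


print(majority_vote(["42", " 42 ", "41"]))
print(majority_vote(["41", "42"]))
```

An odd sample count (for example 3) avoids two-way ties, at the cost of one extra LLM call per question.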


### Inference technique #3: Reflection / self-correction

In [99]:
# Inference technique #3: Reflection/self_correction
def refl(question_text: str,
         candidate_answer: str,
         temp: float = 0.0):
    """Reflection: ask the model to act as a strict grader for the candidate answer.
    If the answer is deemed wrong, solve again and return the fixed answer."""

    # 1) Judge the candidate: True or False
    judge_system = (
        "You are a strict grader. "
        "Reply with True or False, no explanation."
    )

    judge_prompt = f"""You are grading a question-answer pair.
    Return True if CANDIDATE_ANSWER is acceptable; otherwise return False.

    QUESTION:
    {question_text}

    CANDIDATE_ANSWER:
    {candidate_answer}

    Answer only True or False."""

    judge_res = call_model_chat_completions(
        prompt=judge_prompt,
        system=judge_system,
        model=MODEL,
        max_tokens=128,
        temperature=temp,
    )

    judge_reply = (judge_res.get("text") or "").strip().lower()

    if judge_reply.startswith("true"):
        return candidate_answer

    # If judged wrong, ask for a corrected answer
    fix_system = (
        "You are a careful problem solver. "
        "Think step by step and give the final answer on the last line, "
        "starting with 'FINAL ANSWER:'."
    )

    fix_prompt = f"""The previous answer was judged incorrect.

    QUESTION:
    {question_text}

    PREVIOUS_ANSWER:
    {candidate_answer}

    Solve the question correctly, think step by step, and end with: FINAL ANSWER: <correct_answer>
    """

    fix_raw, _ = chain_of_thought(
        fix_prompt, system_message=fix_system, temp=0.5)
    fixed = extract_final_answer(fix_raw)
    return fixed or candidate_answer


In [100]:
#test
q = "what is 5 + 7? Just give the number."
best, all_ans, all_raw = self_con(q, sample_count=3)
print("self_con best:", best)
final = refl(q, best)
print("after reflection:", final)

self_con best: 12
after reflection: 12
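
In `refl`, any judge reply that does not start with "true" triggers a re-solve, including an empty reply from a failed API call. One way to tell a real "False" apart from an unusable reply is sketched below. `parse_verdict` is a hypothetical helper, and what to do with an ambiguous result (keep the candidate, retry the judge, and so on) is left as a design decision.

```python
import re


def parse_verdict(reply):
    """Map a grader reply to True, False, or None when it is ambiguous."""
    words = re.findall(r"[a-z]+", (reply or "").lower())
    has_true = "true" in words
    has_false = "false" in words
    if has_true and not has_false:
        return True
    if has_false and not has_true:
        return False
    return None  # empty reply, or one that mentions both or neither


print(parse_verdict("True."))
print(parse_verdict("Answer: false"))
print(parse_verdict("I am not sure"))
```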


### Top-level function that calls the three techniques

In [101]:
def direct_chat_answer(question: str, model=MODEL) -> str:
    """
    For easier questions, get a single-shot answer.
    The model should include a clear 'Final answer:' marker.
    """
    system = (
        "You are a concise problem solver. "
        "Give the correct answer. Include the phrase 'Final answer:' before your final result."
    )
    prompt = f"""Answer the following question correctly.

QUESTION:
{question}

Show brief reasoning if needed, but clearly mark the final result with:
Final answer: <answer here>
"""

    r = call_model_chat_completions(
        prompt,
        system=system,
        model=model,
        temperature=0.0,
        max_tokens=128,  # small limit as requested
    )
    return (r.get("text") or "").strip()


In [102]:
def estimate_difficulty(question: str, model=MODEL) -> int:
    """
    Ask the model to rate the difficulty of the question from 1 (very easy)
    to 10 (extremely hard). Returns an int.
    """
    system = (
        "You are a difficulty grader for reasoning questions. "
        "Reply with exactly one integer from 1 to 10 and nothing else."
    )
    prompt = f"""Rate the DIFFICULTY of the following question on a scale from 1 to 10.

1 = very easy, can be answered directly with little or no step-by-step reasoning.
10 = extremely difficult, requires deep multi-step reasoning or advanced problem solving.

QUESTION:
{question}

Answer with only a single integer from 1 to 10.
"""

    r = call_model_chat_completions(
        prompt,
        system=system,
        model=model,
        temperature=0.0,
        max_tokens=16,   # small number, as requested
    )

    txt = (r.get("text") or "").strip()
    # Parse the first integer in the reply and check that it is in [1, 10]
    # (joining all digits would turn a reply like "7/10" into 710)
    m = re.search(r"\d+", txt)
    if m:
        diff = int(m.group(0))
        if 1 <= diff <= 10:
            return diff

    # Fallback if parsing fails
    return 5


In [103]:
#test
for i in range(3):
    ex=dev_data[i]
    print("\nExample", i)
    print("INPUT", ex["input"])
    print("GOLD", ex["output"])
    print("AGENT", agent_solve(ex, sc_samples=3))



Example 0
INPUT Let $ABCD$ be a convex quadrilateral with $AB = CD = 10$ , $BC = 14$ , and $AD = 2\sqrt{65}$ . Assume that the diagonals of $ABCD$ intersect at point $P$ , and that the sum of the areas of triangles $APB$ and $CPD$ equals the sum of the areas of triangles $BPC$ and $APD$ . Find the area of quadrilateral $ABCD$ .
GOLD 112
AGENT We are given a convex quadrilateral $ABCD$ with the following side lengths:

- $AB = 10$
- $CD = 10$
- $BC = 14$
- $AD = 2\sqrt{65}$

Also, the diagonals intersect at point $P$, and we are told that:

$$
\text{Area}(\triangle APB) + \text{Area}(\triangle CPD) = \text{Area}(\triangle BPC) + \text{Area}(\triangle APD)
$$

This condition implies that the **sum of the areas of opposite triangles formed by the diagonals is equal**, which is a known geometric property that occurs **only when the diagonals are perpendicular**.

---

### Step 1: Use the property of perpendicular diagonals

If the diagonals $AC$ and $BD$ intersect at $P$ and are **perpend

In [105]:
def agent_solve(example: dict, sc_samples: int = 2, use_reflection: bool = True) -> str:
    """Run full agent on example.

    1) First, estimate difficulty with a short chat call.
    2) If difficulty >= 5: use self_con (CoT + self-consistency) + refl (self-correction).
    3) Otherwise: use a single direct chat response.
    Returns final answer as plain string.
    """
    question = example.get("input", "")

    # 1) Estimate difficulty (1–10)
    difficulty = estimate_difficulty(question, model=MODEL)
    # print(f"[DEBUG] Estimated difficulty: {difficulty}")  # optional

    # 2a) Hard question → CoT + self-consistency + reflection
    if difficulty >= 5:
        best, all_ans, all_raw = self_con(
            question_text=question,
            sample_count=sc_samples,
        )

        if use_reflection:
            final = refl(
                question_text=question,
                candidate_answer=best,
            )
        else:
            final = best

    # 2b) Easy question → single direct chat response
    else:
        final = direct_chat_answer(question, model=MODEL)

    # 3) Return clean final answer string
    return extract_final_answer(final)
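

To check the pipeline against the limit of fewer than 20 LLM calls per question, the worst case can be counted directly from the functions above: one call for `estimate_difficulty`, one `chain_of_thought` call per self-consistency sample, and up to two calls in `refl` (the judge plus one possible re-solve). The helper `worst_case_calls` below is a hypothetical name used only for this arithmetic.

```python
def worst_case_calls(sc_samples, use_reflection):
    """Upper bound on LLM calls for one question in the pipeline above."""
    difficulty_call = 1              # estimate_difficulty
    hard_path = sc_samples           # one chain_of_thought call per sample
    if use_reflection:
        hard_path += 2               # judge call + one possible re-solve
    easy_path = 1                    # direct_chat_answer
    return difficulty_call + max(hard_path, easy_path)


for samples in (2, 3, 5):
    print(samples, worst_case_calls(samples, use_reflection=True))
```

With the default `sc_samples=2` and reflection enabled, the bound is 5 calls, which leaves room for more samples if they improve accuracy.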


In [106]:
#!/usr/bin/env python3
"""
Generate a placeholder answer file that matches the expected auto-grader format.

Replace the placeholder logic inside `build_answers()` with your own agent loop
before submitting so the ``output`` fields contain your real predictions.

Reads the input questions from cse_476_final_project_test_data.json and writes
an answers JSON file where each entry contains a string under the "output" key.
"""

from __future__ import annotations

import json
import math

from pathlib import Path
from typing import Any, Dict, List, Tuple
from concurrent.futures import ThreadPoolExecutor, as_completed


INPUT_PATH = Path("cse_476_final_project_test_data.json")
OUTPUT_PATH = Path("cse_476_final_project_answers.json")


def load_questions(path: Path) -> List[Dict[str, Any]]:
    with path.open("r") as fp:
        data = json.load(fp)
    if not isinstance(data, list):
        raise ValueError("Input file must contain a list of question objects.")
    return data


def build_answers(questions: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Use the agent to generate answers.
    The questions are split into 8 chunks that are processed by a thread pool."""

    n = len(questions)
    num_chunks = 8
    chunk_size = math.ceil(n/num_chunks)
    
    jobs: List[Tuple[int,int,int]]=[]
    for chunk_id in range(num_chunks):
        start_idx = chunk_id * chunk_size
        end_idx = min((chunk_id+1)* chunk_size , n)
        if start_idx >= n:
            break
        jobs.append((start_idx, end_idx, chunk_id))
        
    print(f"Total questoins: {n}")
    print("Jobs (start_idx, end_idx, chunk_id):")
    for j in jobs:
        print(" ", j)


    def worker(args: Tuple[int,int,int])-> List[Tuple[int,str]]:
        """Worker for a single chunk: run the agent on its questions."""
        start_idx, end_idx, chunk_id = args
        print(f"[chunk {chunk_id}] running indices {start_idx}..{end_idx-1}(total {end_idx-start_idx})")
        local_results: List[Tuple[int, str]]=[]
        for i in range(start_idx, end_idx):
            ex = questions[i]

            # use agent to solve questions
            raw_ans = agent_solve(ex, sc_samples=2, use_reflection=False)

            # clean
            out_str = extract_final_answer(raw_ans)
            local_results.append((i, out_str))

            # progress log
            if (i - start_idx) % 50 == 0:
                print(f"[chunk {chunk_id}] done global index{i}")
        
        print(f"[chunk {chunk_id}] finished with {len(local_results)} answers")
        return local_results


    # run the chunks concurrently (at most max_workers at a time)
    max_workers = min(4, len(jobs))
    print(f"running max workers = {max_workers}")

    all_pairs: List[Tuple[int, str]] =[]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, job): job for job in jobs}
        for fut in as_completed(futures):
            chunk_result = fut.result()
            all_pairs.extend(chunk_result)

    all_pairs.sort(key = lambda x: x[0])
    if len(all_pairs) != n:
        raise ValueError(f"Expected {n} answers, got {len(all_pairs)} from chunks")

    answers: List[Dict[str,str]] = [{"output": ""} for _ in range(n)]
    for idx, out_str in all_pairs:
        answers[idx]["output"] = out_str

    return answers


def validate_results(
    questions: List[Dict[str, Any]], answers: List[Dict[str, Any]]
) -> None:
    if len(questions) != len(answers):
        raise ValueError(
            f"Mismatched lengths: {len(questions)} questions vs {len(answers)} answers."
        )
    for idx, answer in enumerate(answers):
        if "output" not in answer:
            raise ValueError(f"Missing 'output' field for answer index {idx}.")
        if not isinstance(answer["output"], str):
            raise TypeError(
                f"Answer at index {idx} has non-string output: {type(answer['output'])}"
            )
        if len(answer["output"]) >= 5000:
            raise ValueError(
                f"Answer at index {idx} exceeds 5000 characters "
                f"({len(answer['output'])} chars). Please make sure your answer does not include any intermediate results."
            )


def main() -> None:
    questions = load_questions(INPUT_PATH)
    answers = build_answers(questions)

    with OUTPUT_PATH.open("w") as fp:
        json.dump(answers, fp, ensure_ascii=False, indent=2)

    with OUTPUT_PATH.open("r") as fp:
        saved_answers = json.load(fp)
    validate_results(questions, saved_answers)
    print(
        f"Wrote {len(answers)} answers to {OUTPUT_PATH} "
        "and validated format successfully."
    )


if __name__ == "__main__":
    main()


Total questoins: 6208
Jobs (start_idx, end_idx, chunk_id):
  (0, 776, 0)
  (776, 1552, 1)
  (1552, 2328, 2)
  (2328, 3104, 3)
  (3104, 3880, 4)
  (3880, 4656, 5)
  (4656, 5432, 6)
  (5432, 6208, 7)
running max workers = 4
[chunk 0] running indices 0..775(total 776)
[chunk 1] running indices 776..1551(total 776)
[chunk 2] running indices 1552..2327(total 776)
[chunk 3] running indices 2328..3103(total 776)
[chunk 1] done global index776
[chunk 0] done global index0
[chunk 2] done global index1552
[chunk 3] done global index2328
[chunk 0] done global index50
[chunk 2] done global index1602
[chunk 1] done global index826
[chunk 2] done global index1652
[chunk 1] done global index876
[chunk 0] done global index100
[chunk 2] done global index1702
[chunk 1] done global index926
[chunk 0] done global index150
[chunk 2] done global index1752
[chunk 0] done global index200
[chunk 1] done global index976
[chunk 2] done global index1802
[chunk 0] done global index250
[chunk 1] done global index10
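
The chunking code above collects `(index, answer)` pairs and sorts them afterwards. A more compact alternative is `ThreadPoolExecutor.map`, which returns results in input order regardless of which call finishes first. The sketch below uses a stand-in `solve` function instead of the real agent, so it only illustrates the ordering behavior.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def solve(question):
    # Stand-in for agent_solve: sleeps so that later items finish first.
    time.sleep(0.01 * (3 - question["id"]))
    return f"answer-{question['id']}"


questions = [{"id": i} for i in range(3)]

with ThreadPoolExecutor(max_workers=3) as pool:
    # map() yields results in the order of the inputs, not completion order.
    outputs = list(pool.map(solve, questions))

answers = [{"output": out} for out in outputs]
print(answers)
```

The trade-off is that per-chunk progress logging is lost, so the chunked version may still be preferable for long runs.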

In [109]:
# check that the files were merged correctly
with open("final_project_result2.json", "r", encoding="utf-8") as f:
    data = json.load(f)

print("Rows:", len(data))
print("First row:", data[0])
print("Last row:", data[-1])





Rows: 6208
First row: {'output': '2'}
Last row: {'output': '- **Stack:** Yellow block → Orange block → Blue block'}
