You are an expert evaluator of autonomous driving reasoning.

You will be given:
- A driving scenario description.
- A question about the scenario.
- A candidate answer produced by a model.

Your job is to evaluate how strong the answer's **causal reasoning** is: how clearly, correctly, and completely it explains *why* things happen, *how* objects and events influence each other, and *why* the autonomous vehicle would choose certain actions.

Scoring guidelines (1–10):

1–3: Very poor causality
- Little or no cause-effect explanation.
- Mostly restates the question or scenario without explaining “why”.
- Important causal factors are missing or wrong.

4–6: Weak / partial causality
- Some causal links, but shallow or incomplete.
- Misses several important factors or mixes up causes and effects.
- Reasoning may be partially correct but not coherent.

7–8: Good causality
- Mostly correct, coherent chain of cause-effect.
- Identifies key objects, risks, and how they influence AV actions.
- Minor omissions or small mistakes, but overall logically sound.

9–10: Excellent causality
- Clear, detailed, and technically accurate chain of cause-effect.
- Correctly explains how the AV perceives, predicts, and decides actions.
- Explicitly links objects, risks, and traffic rules to the AV’s decisions.

Return your evaluation ONLY as a JSON object in this exact format:

{
  "causality_score": <integer from 1 to 10>
}

Now here is the data:

SCENARIO:
{scenario_text}

QUESTION:
{question_text}

MODEL_ANSWER:
{model_answer}
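
Judges do not always return the bare JSON object requested above; replies sometimes arrive wrapped in prose or a Markdown fence, or with a score outside 1–10. The following is a minimal sketch of a tolerant parser, assuming the reply contains at most one flat JSON object. The helper name `parse_causality_reply` is hypothetical and not part of the original notebook.

```python
import json
import re


def parse_causality_reply(content):
    """Return the integer score (1-10) from a judge reply, or None if invalid.

    Hypothetical helper: tolerates prose or Markdown fences around the JSON
    object, which a strict json.loads(content) would reject.
    """
    # Grab the first {...} span; the expected object has no nested braces.
    match = re.search(r"\{.*?\}", content, flags=re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    score = data.get("causality_score")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        return None
    # Enforce the 1-10 range stated in the scoring guidelines.
    return score if 1 <= score <= 10 else None
```

Returning `None` instead of a default number keeps a malformed reply distinguishable from a genuinely low score.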


In [None]:
import os
import json
import time
import openai  # legacy SDK interface (openai<1.0): uses openai.ChatCompletion and openai.error

openai.api_key = os.getenv("OPENAI_API_KEY")  # use env var, not hard-coded

EVAL_MODEL = "gpt-4o-mini"  # or "gpt-4o"

def build_scenario_text(caption, speed, steering, objects_info, relations_info):
    # Reuse your extracted info to form a compact text block for the judge
    objects_lines = []
    for obj in objects_info:
        objects_lines.append(
            f"- {obj['name']} (type: {obj['type']}, "
            f"status: {', '.join(obj['status']) if obj['status'] else 'N/A'}, "
            f"safety: {', '.join(obj['safety']) if obj['safety'] else 'N/A'}, "
            f"positions: {', '.join(obj['positions']) if obj['positions'] else 'N/A'}, "
            f"importance: {obj['importance']})"
        )
    objects_text = "\n".join(objects_lines) if objects_lines else "No objects described."

    if relations_info:
        relations_text = "\n".join(
            [f"- {src} --> {tgt}: {rel}" for (src, tgt, rel) in relations_info]
        )
    else:
        relations_text = "No explicit relations found."

    scenario_text = (
        f"Caption: {caption}\n"
        f"Speed: {speed}\n"
        f"Steering: {steering}\n\n"
        f"Objects:\n{objects_text}\n\n"
        f"Relations:\n{relations_text}"
    )
    return scenario_text


def build_causality_prompt(scenario_text, question_text, model_answer):
    prompt = f"""
You are an expert evaluator of autonomous driving reasoning.

You will be given:
- A driving scenario description.
- A question about the scenario.
- A candidate answer produced by a model.

Your job is to evaluate how strong the answer's **causal reasoning** is: how clearly, correctly, and completely it explains *why* things happen, *how* objects and events influence each other, and *why* the autonomous vehicle would choose certain actions.

Scoring guidelines (1–10):

1–3: Very poor causality
- Little or no cause-effect explanation.
- Mostly restates the question or scenario without explaining “why”.
- Important causal factors are missing or wrong.

4–6: Weak / partial causality
- Some causal links, but shallow or incomplete.
- Misses several important factors or mixes up causes and effects.
- Reasoning may be partially correct but not coherent.

7–8: Good causality
- Mostly correct, coherent chain of cause-effect.
- Identifies key objects, risks, and how they influence AV actions.
- Minor omissions or small mistakes, but overall logically sound.

9–10: Excellent causality
- Clear, detailed, and technically accurate chain of cause-effect.
- Correctly explains how the AV perceives, predicts, and decides actions.
- Explicitly links objects, risks, and traffic rules to the AV’s decisions.

Return your evaluation ONLY as a JSON object in this exact format:

{{
  "causality_score": <integer from 1 to 10>
}}

Now here is the data:

SCENARIO:
{scenario_text}

QUESTION:
{question_text}

MODEL_ANSWER:
{model_answer}
""".strip()
    return prompt


def score_causality(caption, speed, steering, objects_info, relations_info,
                    question_text, model_answer,
                    model=EVAL_MODEL, temperature=0.0, max_tokens=300):
    """
    Score the causal reasoning of `model_answer` with an LLM judge.

    Returns a dict: {"causality_score": int}. A score of 0 is a sentinel
    meaning the evaluation failed (API error or unparseable reply) after
    all retries; valid scores are 1-10.
    """
    scenario_text = build_scenario_text(caption, speed, steering, objects_info, relations_info)
    prompt = build_causality_prompt(scenario_text, question_text, model_answer)

    retries = 3
    for attempt in range(retries):
        try:
            response = openai.ChatCompletion.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                temperature=temperature,
                max_tokens=max_tokens,
            )
            content = response.choices[0].message.content
            data = json.loads(content)
            return {"causality_score": int(data["causality_score"])}
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            # Reply was not the expected JSON object (invalid JSON, missing
            # key, or non-integer score); fall through and retry.
            pass
        except openai.error.OpenAIError:
            # API-level failure such as a rate limit or timeout; retry.
            pass
        if attempt < retries - 1:
            time.sleep(2)

    # All attempts failed: return the sentinel score.
    return {"causality_score": 0}
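
Because `score_causality` returns a score of 0 when evaluation fails, averaging raw results would pull the mean down. A minimal sketch of an aggregation step that sets failures aside is shown here; the helper name `summarize_scores` and the sample data are made up for illustration.

```python
from statistics import mean


def summarize_scores(results):
    """Aggregate score dicts, treating a score of 0 as "evaluation failed".

    `results` is a list of dicts shaped like {"causality_score": int}.
    """
    valid = [r["causality_score"] for r in results if r.get("causality_score", 0) > 0]
    return {
        "n_total": len(results),
        "n_failed": len(results) - len(valid),
        # No valid scores means there is nothing to average.
        "mean_score": mean(valid) if valid else None,
    }


# Made-up results for illustration only.
example = [{"causality_score": 8}, {"causality_score": 0}, {"causality_score": 6}]
print(summarize_scores(example))
```

Reporting the failure count alongside the mean makes it visible when a low-looking average is really an API or parsing problem.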
