# LLM Recall Scoring – Sample Sandbox
This notebook scaffolds Method 2 for recall scoring using the sampled assets and the guidance defined in `INSTRUCTIONS_RECALL.md`.

## 1. Load Project Dependencies and Environment Configuration
We import the key libraries, locate the project root and the sample asset directory by walking up from the current working directory, and load the environment variables needed to call the OpenAI API.

In [73]:
import os
from pathlib import Path
import json
import re
import textwrap
from typing import Dict, List, Tuple, Optional

import pandas as pd

try:
    from openai import OpenAI
except ImportError:  # Notebook can still run data prep without OpenAI SDK
    OpenAI = None

# Locate the project root regardless of the notebook launch directory.
PROJECT_ROOT: Optional[Path] = None
SAMPLE_DIR: Optional[Path] = None
for candidate in [Path.cwd(), *Path.cwd().parents]:
    potential = candidate / "recall_openended" / "sample"
    if potential.exists():
        PROJECT_ROOT = candidate
        SAMPLE_DIR = potential
        break

if PROJECT_ROOT is None or SAMPLE_DIR is None:
    raise FileNotFoundError("Could not locate 'recall_openended/sample' relative to the current working directory.")


def load_env_file(path: Path) -> None:
    if not path.exists():
        return
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        if value:
            os.environ[key] = value


load_env_file(PROJECT_ROOT / ".env")
load_env_file(SAMPLE_DIR / ".env")

INSTRUCTIONS_PATH = SAMPLE_DIR / "INSTRUCTIONS_RECALL.md"
MODEL_EVENTS_PATH = SAMPLE_DIR / "sample_model_answers_events.md"
OPEN_ENDED_PATH = SAMPLE_DIR / "sample_open_ended_madmax_long.csv"
OUTPUT_SAMPLE_PATH = SAMPLE_DIR / "coded_responses_full_method2.csv"

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key=OPENAI_API_KEY) if (OpenAI and OPENAI_API_KEY) else None

pd.set_option("display.max_colwidth", 160)

print(f"Project root: {PROJECT_ROOT}")
print(f"Sample asset directory: {SAMPLE_DIR}")
print(f"OpenAI client initialised: {client is not None}")

Project root: c:\Users\ashra\Documents\NeuralSense\NeuralData\clients\544_WBD_CXCU
Sample asset directory: c:\Users\ashra\Documents\NeuralSense\NeuralData\clients\544_WBD_CXCU\recall_openended\sample
OpenAI client initialised: True
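
To make the `.env` parsing rules concrete, here is a minimal sketch that applies the same line rules as `load_env_file` to an in-memory string and returns a dictionary instead of writing to `os.environ`. The function name `parse_env_text` and the sample keys and values are hypothetical and exist only for illustration.

```python
from typing import Dict


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines with the same rules as load_env_file (illustrative helper)."""
    parsed: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may themselves contain "=".
        key, value = line.split("=", 1)
        value = value.strip().strip('"').strip("'")
        if value:  # empty values are ignored, mirroring the notebook helper
            parsed[key.strip()] = value
    return parsed


# Placeholder content; none of these values come from the project.
sample_env = """
# comment line
OPENAI_API_KEY="sk-placeholder"
EMPTY_VALUE=
URL=https://example.com/?a=b
"""

env_preview = parse_env_text(sample_env)
print(env_preview)
```

Note that the notebook helper assigns into `os.environ` unconditionally, so a value found in a later `.env` file replaces one loaded earlier.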


## 2. Parse INSTRUCTIONS_RECALL.md for Experiment Metadata
We load the context document to surface the scoring rubric, input expectations, and output schema within the notebook session.

In [74]:
def load_instruction_sections(path: Path, headers: List[str]) -> Tuple[str, Dict[str, str]]:
    """Return the full markdown text and selected sections keyed by header."""
    text = path.read_text(encoding="utf-8")
    sections: Dict[str, str] = {}
    for header in headers:
        pattern = re.compile(rf"^## {re.escape(header)}\n(.*?)(?=^## |\Z)", re.MULTILINE | re.DOTALL)
        match = pattern.search(text)
        if match:
            sections[header] = match.group(1).strip()
    return text, sections

sections_to_pull = [
    "1. Project Goal",
    "2. Input Data",
    "3. Scoring Targets (What the LLM Should Output)",
    "4. Prompt Construction (Per Row / Batch)",
    "5. Pipeline Requirements",
    "6. OpenAI API Integration Notes",
    "8. Quality Checks"
]

instructions_text, instruction_sections = load_instruction_sections(INSTRUCTIONS_PATH, sections_to_pull)
print(f"Loaded instructions summary with {len(instructions_text.splitlines())} lines")
pd.Series(instruction_sections).to_frame("detail").head()

Loaded instructions summary with 348 lines


| | detail |
|---|---|
| 1. Project Goal | We have a CSV of open-ended recall responses to video content. \nEach row = 1 participant’s response to a scene they watched in **long** or **short** forma... |
| 2. Input Data | ### 2.1 Main CSV\n\nFile: `sample_open_ended_madmax_long.csv`\n\nColumns (at minimum):\n\n- `id` – unique integer per row (0, 1, 2, …). \n - If not presen... |
| 3. Scoring Targets (What the LLM Should Output) | For each row in the CSV, the LLM must output:\n\n- `recall_score` – **integer from 0 to 100**\n - 0 = no correct recall or completely off-topic.\n - 100 =... |
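
The section extraction relies on a multiline regex that captures everything between one `## ` header and the next. The sketch below applies the same pattern to a toy markdown string so its behaviour can be checked in isolation. The helper name `extract_section` and the toy document are assumptions made for this example.

```python
import re
from typing import Optional

# Toy document; the real INSTRUCTIONS_RECALL.md is not reproduced here.
toy_markdown = (
    "# Title\n\n"
    "## 1. Project Goal\nScore recall.\n\n### 1.1 Detail\nMore text.\n\n"
    "## 2. Input Data\nA CSV file.\n"
)


def extract_section(text: str, header: str) -> Optional[str]:
    """Return the body under an exact '## <header>' line, or None when it is absent."""
    pattern = re.compile(
        rf"^## {re.escape(header)}\n(.*?)(?=^## |\Z)",
        re.MULTILINE | re.DOTALL,
    )
    match = pattern.search(text)
    return match.group(1).strip() if match else None


goal = extract_section(toy_markdown, "1. Project Goal")
inputs = extract_section(toy_markdown, "2. Input Data")
absent = extract_section(toy_markdown, "9. Not Present")
print(repr(goal))
print(repr(inputs))
print(absent)
```

Two properties follow from the pattern: `### ` subsections stay inside their parent section because the lookahead requires exactly `## ` at the start of a line, and a header must match character for character (a trailing space before the newline is enough to prevent a match).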


## 3. Inspect Sample Data Assets
We load the sampled open-ended responses and the model answer reference to confirm schemas, cardinalities, and basic data hygiene before running any scoring passes.

In [75]:
from io import StringIO

try:
    open_ended_df = pd.read_csv(OPEN_ENDED_PATH)
except UnicodeDecodeError:
    with OPEN_ENDED_PATH.open("r", encoding="cp1252", errors="ignore") as fh:
        buffer = StringIO(fh.read())
    open_ended_df = pd.read_csv(buffer)

if "id" not in open_ended_df.columns:
    open_ended_df.insert(0, "id", range(len(open_ended_df)))

model_events_text = MODEL_EVENTS_PATH.read_text(encoding="utf-8")

print(f"Open-ended sample shape: {open_ended_df.shape}")
print(f"Unique titles: {open_ended_df['title'].nunique()} | Unique forms: {open_ended_df['form'].unique().tolist()}")
open_ended_df.head()

Open-ended sample shape: (6, 9)
Unique titles: 1 | Unique forms: ['Short']


| | id | respondent | group | questionnaire | question_code | question | form | title | response |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2 | A | Post | Q18 | Recall | Short | Mad Max | This is a high speed chase scene from the movie mad max furry road. The actor jumps from his vehicle with his spears and kills the actor driving the car. Th... |
| 1 | 1 | 4 | A | Post | Q18 | Recall | Short | Mad Max | In this post apocalyptical wasteland fuel and machines are instruments of war between waring factions. The buzzards control their area and lay traps for wou... |
| 2 | 2 | 5 | F | Post | Q18 | Recall | Short | Mad Max | I remember the woman Furiosa speeding anddriving and everyone else fighting off everyone who was chasing them, including the guy in the passenger seat.They ... |
| 3 | 3 | 3 | F | Post | Q18 | Recall | Short | Mad Max | It looked like theres some sort of chase going on between two factions and there was a lady driving one of the cars who I guess was in charge of the one fac... |
| 4 | 4 | 6 | A | Post | Q18 | Recall | Short | Mad Max | A couple guys in a car put on the creepy nun masks and then rob a money truck. Theres a shootout and a chase with the cops. During the chase, an accomplice ... |


In [76]:
missing_response_rows = open_ended_df[open_ended_df["response"].isna() | (open_ended_df["response"].str.strip() == "")]
print(f"Rows with blank responses: {len(missing_response_rows)}")
open_ended_df.groupby(["title", "form"]).size().to_frame("count")

Rows with blank responses: 0


| title | form | count |
|---|---|---|
| Mad Max | Short | 6 |


In [77]:
print(model_events_text.splitlines()[:20])

['', '', '## Mad Max - Long Form', '', '1. In a post-apocalyptic desert wasteland, petrol (gasoline) and water are extremely scarce and precious resources.', '', '2. Max Rockatansky, a lone drifter, wanders this wasteland as a hardened survivor.', '', '3. Max is suddenly ambushed and captured by the War Boys, a group of feral, fanatical soldiers devoted to the warlord Immortan Joe.', '', '4. The War Boys drag Max to Immortan Joe’s desert stronghold, a fortress known as the Citadel.', '', '5. At the Citadel, Max is identified as a universal blood donor and is imprisoned.', '', '6. Max is strapped into medical apparatus and used as a living “blood bag,” providing continuous blood transfusions to a sickly War Boy named Nux.', '', '7. Max makes an early, frantic attempt to escape from the Citadel by fleeing through its twisting underground tunnels and caves.', '', '8. The War Boys quickly recapture Max and chain him up again, reinforcing his status as Nux’s captive blood donor.', '']


## 4. Define LLM Scoring Utilities
Helper utilities prepare model events, craft prompts, and wrap LLM calls with retry logic required for the scoring workflow.

In [78]:
def normalise_title(title: str) -> str:
    cleaned = re.sub(r"\s+", " ", title.strip()).lower()
    return cleaned.replace(":", "")


def normalise_form(form: str) -> str:
    form_norm = form.strip().lower().replace("-", " ")
    form_norm = form_norm.replace(" form", "").strip()
    alias_map = {
        "lf": "long",
        "longform": "long",
        "shortform": "short",
    }
    return alias_map.get(form_norm, form_norm)


def parse_model_events(markdown_text: str) -> Dict[Tuple[str, str], List[str]]:
    sections = {}
    pattern = re.compile(r"^##\s*(.+?)\s*-\s*(.+?)\s*$", re.MULTILINE)
    matches = list(pattern.finditer(markdown_text))
    for idx, match in enumerate(matches):
        title_raw, form_raw = match.group(1), match.group(2)
        start = match.end()
        end = matches[idx + 1].start() if (idx + 1) < len(matches) else len(markdown_text)
        section_text = markdown_text[start:end]
        events = [evt.strip() for evt in re.findall(r"^\s*\d+\.\s+(.*)$", section_text, re.MULTILINE) if evt.strip()]
        sections[(normalise_title(title_raw), normalise_form(form_raw))] = events
    return sections

model_events_lookup = parse_model_events(model_events_text)
print(f"Loaded events for {len(model_events_lookup)} title/format combinations")
list(model_events_lookup.keys())

Loaded events for 2 title/format combinations


[('mad max', 'long'), ('mad max', 'short')]
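
Because prompts are built from a `(title, form)` lookup, it can help to confirm that every key present in the data has at least one parsed event before any API call is made. The following is a minimal sketch of such a check on toy data. `find_uncovered_keys`, the toy lookup, and the toy row keys are all hypothetical names and values.

```python
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str]


def find_uncovered_keys(row_keys: Iterable[Key], lookup: Dict[Key, List[str]]) -> List[Key]:
    """Return the sorted (title, form) keys that have no events in the lookup."""
    return sorted({key for key in row_keys if not lookup.get(key)})


# Toy lookup shaped like model_events_lookup; the event text is illustrative only.
toy_lookup: Dict[Key, List[str]] = {
    ("mad max", "long"): ["Event one.", "Event two."],
    ("mad max", "short"): [],  # header present but no numbered events parsed
}
toy_row_keys = [("mad max", "long"), ("mad max", "short"), ("example title", "long")]

uncovered = find_uncovered_keys(toy_row_keys, toy_lookup)
print(uncovered)
```

A key counts as uncovered both when it is missing from the lookup and when its event list is empty, which matches how `build_batch_prompt` treats an empty list.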

In [79]:
SYSTEM_PROMPT = textwrap.dedent(
    """
    You are an expert qualitative coder in applied neuroscience. Evaluate each participant recall response.
    For every row you receive:
    - Compare the participant response against the chronological MODEL EVENTS list.
    - Judge which events are recalled and in how much detail (names, locations, actions, chronology).
    - Produce:
      - "recall_score": integer 0-100 (0 = off-topic or blank, 100 = richly detailed and highly accurate).
      - "confidence_score": integer 0-100 where higher means the mapping from response to events is clear.
      - "rationale": 1-3 sentences summarising the recall quality.
    - Be conservative when uncertain. If the response is blank or clearly states the participant does not remember, set recall_score to 0 and confidence_score to 95.
    - Never invent events beyond the provided MODEL EVENTS.
    - Answer in strict JSON only.
    """
).strip()


def build_prompt_block(row: pd.Series, events: List[str]) -> str:
    events_text = "\n".join(f"{idx + 1}. {event}" for idx, event in enumerate(events)) if events else "(No model events found.)"
    return textwrap.dedent(
        f"""
        Title: {row['title']}
        Format: {row['form']}
        Question code: {row['question_code']}
        Row ID: {row['id']}

        MODEL EVENTS (chronological):
        {events_text}

        PARTICIPANT RESPONSE:
        {row['response']}

        Please evaluate this response and return a JSON object with keys id, recall_score, confidence_score, rationale.
        """
    ).strip()


def build_batch_prompt(batch_rows: pd.DataFrame, events_lookup: Dict[Tuple[str, str], List[str]]) -> Tuple[str, List[Tuple[str, str]]]:
    blocks = []
    missing_keys: List[Tuple[str, str]] = []
    for _, row in batch_rows.iterrows():
        key = (normalise_title(row["title"]), normalise_form(row["form"]))
        events = events_lookup.get(key, [])
        if not events:
            missing_keys.append(key)
        blocks.append(build_prompt_block(row, events))
    return "\n\n".join(blocks), missing_keys


def call_llm_batch(prompt: str, client_obj=client, model: str = "gpt-4.1", max_retries: int = 3, sleep_seconds: float = 2.0) -> str:
    """Call the OpenAI Responses API with basic retry support."""
    if client_obj is None:
        raise RuntimeError("OpenAI client is not initialised. Set OPENAI_API_KEY before calling the model.")

    last_error: Optional[Exception] = None
    payload = [
        {"role": "system", "content": [{"type": "input_text", "text": SYSTEM_PROMPT}]},
        {"role": "user", "content": [{"type": "input_text", "text": prompt}]}
    ]

    for attempt in range(1, max_retries + 1):
        try:
            response = client_obj.responses.create(
                model=model,
                input=payload,
                temperature=0.0,
            )
            return response.output_text
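        except Exception as exc:  # noqa: BLE001 broad to keep notebook lightweight
            last_error = exc
            if attempt == max_retries:
                break  # no further attempts remain, so skip the final sleep
            wait_for = sleep_seconds * (2 ** (attempt - 1))
            print(f"Attempt {attempt} failed: {exc}. Retrying in {wait_for:.1f}s")
            import time  # local import keeps this cell self-contained
            time.sleep(wait_for)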
    raise RuntimeError("Failed to retrieve LLM response") from last_error
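
The retry loop in `call_llm_batch` doubles the wait after each failed attempt. The sketch below isolates that exponential backoff pattern in a generic helper with an injectable sleep function, so the schedule can be exercised without a network call or real delays. `retry_with_backoff`, `flaky_call`, and the simulated failure are assumptions for illustration, not part of the notebook.

```python
import time
from typing import Callable, Dict, List, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_retries: int = 3,
    sleep_seconds: float = 2.0,
    sleep_fn: Callable[[float], object] = time.sleep,
) -> T:
    """Call func until it succeeds, doubling the wait after each failure."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except Exception as exc:  # broad on purpose for a notebook-level sketch
            last_error = exc
            if attempt < max_retries:
                sleep_fn(sleep_seconds * (2 ** (attempt - 1)))
    raise RuntimeError("All attempts failed") from last_error


# Simulated call that fails twice, then succeeds (hypothetical stand-in for the API).
recorded_waits: List[float] = []
call_count: Dict[str, int] = {"n": 0}


def flaky_call() -> str:
    call_count["n"] += 1
    if call_count["n"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


result = retry_with_backoff(flaky_call, sleep_fn=recorded_waits.append)
print(result, recorded_waits)
```

Passing `recorded_waits.append` as `sleep_fn` records the waits that would have been slept, which keeps the demonstration instantaneous.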

In [80]:
def parse_llm_json(raw_output: str) -> List[Dict[str, object]]:
    raw_output = raw_output.strip()
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Model returned non-JSON payload: {raw_output[:200]}") from exc
    if isinstance(parsed, dict):
        parsed = [parsed]
    if not isinstance(parsed, list):
        raise ValueError("Expected list of JSON objects from model output")
    cleaned = []
    for entry in parsed:
        if not isinstance(entry, dict):
            continue
        required = ("id", "recall_score", "confidence_score", "rationale")  # tuple keeps key order stable across runs
        if not all(key in entry for key in required):
            continue
        cleaned.append({key: entry[key] for key in required})
    return cleaned


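def enrich_dataframe_with_scores(df: pd.DataFrame, scored_rows: List[Dict[str, object]]) -> pd.DataFrame:
    # Explicit columns let an empty result list merge cleanly instead of raising KeyError on "id".
    score_fields = ["id", "recall_score", "confidence_score", "rationale"]
    scored_df = pd.DataFrame(scored_rows, columns=score_fields).drop_duplicates(subset="id", keep="last").set_index("id")
    merged = df.set_index("id").join(scored_df, how="left")
    return merged.reset_index()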


def heuristic_stub_scores(batch_rows: pd.DataFrame, events_lookup: Dict[Tuple[str, str], List[str]]) -> List[Dict[str, object]]:
    """Fallback deterministic scorer when no API key is available."""
    results: List[Dict[str, object]] = []
    for _, row in batch_rows.iterrows():
        key = (normalise_title(row["title"]), normalise_form(row["form"]))
        events = events_lookup.get(key, [])
        response = row["response"] if isinstance(row["response"], str) else ""  # NaN is truthy, so `or ""` would not catch it
        token_count = len(response.split())
        coverage_ratio = min(len(events), max(response.lower().count(" "), 1)) / (len(events) or 1)
        recall_score = int(min(100, coverage_ratio * 40 + min(token_count, 120) * 0.3))
        confidence_score = int(max(30, min(95, 60 + coverage_ratio * 30)))
        rationale = "Stub score generated locally; replace with real LLM call."
        results.append(
            {
                "id": int(row["id"]),
                "recall_score": recall_score,
                "confidence_score": confidence_score,
                "rationale": rationale,
            }
        )
    return results
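
`parse_llm_json` raises as soon as the model output is not bare JSON. Chat models sometimes wrap JSON in a markdown code fence, so one possible pre-processing step is to strip such a fence before parsing. The sketch below shows that idea. `strip_code_fence` is a hypothetical helper, and the sample replies are invented for the example.

```python
import json
import re

FENCE = "`" * 3  # built programmatically to avoid a literal fence inside this example
FENCE_PATTERN = re.compile(rf"^{FENCE}(?:json)?\s*\n(.*?)\n{FENCE}$", re.DOTALL)


def strip_code_fence(raw_output: str) -> str:
    """Return the payload inside a surrounding code fence, or the text unchanged."""
    text = raw_output.strip()
    match = FENCE_PATTERN.match(text)
    return match.group(1).strip() if match else text


# Invented replies: one fenced, one already bare JSON.
fenced_reply = f'{FENCE}json\n[{{"id": 0, "recall_score": 55}}]\n{FENCE}'
plain_reply = '[{"id": 1, "recall_score": 40}]'

parsed_fenced = json.loads(strip_code_fence(fenced_reply))
parsed_plain = json.loads(strip_code_fence(plain_reply))
print(parsed_fenced, parsed_plain)
```

The helper only handles a single fence wrapping the whole reply. Any other deviation from JSON would still surface as a parse error, which is usually the safer outcome for a scoring pipeline.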

## 5. Execute Full Scoring Workflow
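We split the dataset into batches of `BATCH_SIZE` rows, call the LLM on each batch (falling back to the heuristic stub when no client is available), merge the resulting scores back onto the input rows, and save the final coded file.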

In [81]:
MODEL_NAME = "gpt-4.1"
BATCH_SIZE = 5  # adjust based on token budget and latency requirements

print(f"Total responses to score: {len(open_ended_df)}")

Total responses to score: 6
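
With 6 responses and `BATCH_SIZE = 5`, the `range(0, n, BATCH_SIZE)` loop in the next cell produces one full batch and one partial batch. The short sketch below makes the slice boundaries explicit. `batch_bounds` is a hypothetical helper used only to illustrate the arithmetic.

```python
from typing import List, Tuple


def batch_bounds(n_rows: int, batch_size: int) -> List[Tuple[int, int]]:
    """Return (start, stop) pairs equivalent to df.iloc[start : start + batch_size]."""
    return [(start, min(start + batch_size, n_rows)) for start in range(0, n_rows, batch_size)]


bounds = batch_bounds(6, 5)
print(bounds)  # [(0, 5), (5, 6)]
```

Each pair corresponds to one model call, so the number of pairs is also the number of API requests for a given batch size.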


In [82]:
all_results: List[Dict[str, object]] = []
missing_keys_overall: set[Tuple[str, str]] = set()

if open_ended_df.empty:
    raise ValueError("No responses found in the input dataset.")

for start in range(0, len(open_ended_df), BATCH_SIZE):
    batch_df = open_ended_df.iloc[start : start + BATCH_SIZE]
    prompt_text, missing_keys = build_batch_prompt(batch_df, model_events_lookup)
    if missing_keys:
        missing_keys_overall.update(missing_keys)

    if client is not None:
        raw_response = call_llm_batch(prompt_text, client_obj=client, model=MODEL_NAME)
        batch_results = parse_llm_json(raw_response)
    else:
        print("OPENAI_API_KEY not detected; using heuristic stub for demonstration.")
        batch_results = heuristic_stub_scores(batch_df, model_events_lookup)

    for entry in batch_results:
        entry["id"] = int(entry["id"])
        entry["recall_score"] = int(entry["recall_score"])
        entry["confidence_score"] = int(entry["confidence_score"])

    all_results.extend(batch_results)

if missing_keys_overall:
    print(f"Warning: missing model events for {sorted(missing_keys_overall)}")

scored_full_df = enrich_dataframe_with_scores(open_ended_df, all_results)
scored_full_df.head()

| | id | respondent | group | questionnaire | question_code | question | form | title | response | recall_score | confidence_score | rationale |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2 | A | Post | Q18 | Recall | Short | Mad Max | This is a high speed chase scene from the movie mad max furry road. The actor jumps from his vehicle with his spears and kills the actor driving the car. Th... | 55 | 80 | The participant recalls the high-speed chase, Furiosa and her crew defending the rig, and a character jumping from a vehicle with spears, which loosely maps... |
| 1 | 1 | 4 | A | Post | Q18 | Recall | Short | Mad Max | In this post apocalyptical wasteland fuel and machines are instruments of war between waring factions. The buzzards control their area and lay traps for wou... | 45 | 75 | The response captures the general setting (post-apocalyptic, warring factions, Buzzards laying traps) and mentions a fight involving spikes and bombs, as we... |
| 2 | 2 | 5 | F | Post | Q18 | Recall | Short | Mad Max | I remember the woman Furiosa speeding anddriving and everyone else fighting off everyone who was chasing them, including the guy in the passenger seat.They ... | 40 | 70 | The participant recalls Furiosa driving, a high-speed chase, fighting, and general chaos, which aligns with the overall scene. However, details are vague, a... |
| 3 | 3 | 3 | F | Post | Q18 | Recall | Short | Mad Max | It looked like theres some sort of chase going on between two factions and there was a lady driving one of the cars who I guess was in charge of the one fac... | 80 | 90 | This response accurately recalls several key events: the chase between factions, Furiosa as the leader, a man tied to a car, vehicles being blown up, and a ... |
| 4 | 4 | 6 | A | Post | Q18 | Recall | Short | Mad Max | A couple guys in a car put on the creepy nun masks and then rob a money truck. Theres a shootout and a chase with the cops. During the chase, an accomplice ... | 0 | 95 | The response is entirely off-topic, describing a bank robbery and police chase unrelated to the provided model events. No relevant details are recalled. |
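
Since several rows are sent in one prompt, the model could in principle omit a row from its reply. One way to detect that per batch is to compare the ids sent with the ids returned, as in this minimal sketch. `find_unscored_ids` and the toy results are hypothetical and not part of the notebook.

```python
from typing import Dict, Iterable, List


def find_unscored_ids(expected_ids: Iterable[int], results: List[Dict[str, object]]) -> List[int]:
    """Return the expected ids that have no matching entry in the model results."""
    returned = {int(entry["id"]) for entry in results if "id" in entry}
    return sorted(set(expected_ids) - returned)


# Toy batch: id 1 is missing, and id 2 arrives as a string, as a model might return it.
toy_results: List[Dict[str, object]] = [
    {"id": 0, "recall_score": 55},
    {"id": "2", "recall_score": 40},
]

unscored = find_unscored_ids([0, 1, 2], toy_results)
print(unscored)
```

Ids found this way could be re-sent in a follow-up batch. Without such a check they only become visible later as missing scores after the merge.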


In [83]:
score_columns = ["recall_score", "confidence_score"]
valid_scores = scored_full_df.dropna(subset=score_columns)
missing_count = len(scored_full_df) - len(valid_scores)

scored_full_df.to_csv(OUTPUT_SAMPLE_PATH, index=False)
print(f"Rows scored: {len(valid_scores)} | Rows missing scores: {missing_count}")
print(f"Saved scored dataset to {OUTPUT_SAMPLE_PATH.name}")
valid_scores[score_columns].describe()

Rows scored: 6 | Rows missing scores: 0
Saved scored dataset to coded_responses_full_method2.csv


| | recall_score | confidence_score |
|---|---|---|
| count | 6.0 | 6.0 |
| mean | 53.333333 | 85.0 |
| std | 34.59287 | 11.83216 |
| min | 0.0 | 70.0 |
| 25% | 41.25 | 76.25 |
| 50% | 50.0 | 85.0 |
| 75% | 73.75 | 93.75 |
| max | 100.0 | 100.0 |
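
The system prompt asks for both scores as integers from 0 to 100, so a simple range check is one plausible sanity test to run on the coded rows before the file is used downstream. The sketch below works on plain dictionaries to stay dependency-free. `out_of_range_rows` and the toy rows are assumptions for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def out_of_range_rows(
    rows: List[Dict[str, int]],
    columns: Sequence[str] = ("recall_score", "confidence_score"),
    low: int = 0,
    high: int = 100,
) -> List[Tuple[int, List[str]]]:
    """Return (id, offending columns) for every row with a score outside [low, high]."""
    flagged: List[Tuple[int, List[str]]] = []
    for row in rows:
        bad = [col for col in columns if not (low <= row[col] <= high)]
        if bad:
            flagged.append((row["id"], bad))
    return flagged


# Toy rows; the second one deliberately violates the 0-100 range.
toy_rows = [
    {"id": 0, "recall_score": 55, "confidence_score": 80},
    {"id": 1, "recall_score": 120, "confidence_score": -5},
]

flagged = out_of_range_rows(toy_rows)
print(flagged)
```

With a DataFrame such as `scored_full_df`, the same rows could be passed via `to_dict("records")` after dropping rows with missing scores.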
