# Method 1: Deterministic scoring

* **Inputs**

  * Raw data: open-ended recall responses with `respondent`, `group`, `questionnaire`, `question_code`, `question`, `form` (Long/Short), `title`, `response`.
  * Model answers: markdown file with **chronological numbered events** for *Mad Max – Long* and *Mad Max – Short*.

* **Event mapping**

  * Extract numbered events from the `Mad Max - Long Form` section (up to 46 events).
  * Extract numbered events from the `Mad Max: Fury Road – Short Form Scene Events` section (up to 46 events; padded with `None` where fewer).
  * For each row:

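    * If `title` contains “Mad Max” and `form` is “Short” (both checks are case-insensitive) → use short-form events.
    * If `title` contains “Mad Max” and `form` is anything else → use long-form events.
    * If `title` does not contain “Mad Max” → no events are defined for the row, so its `T_i` columns are `NaN`.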
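
As a rough illustration of the extraction and padding step, the sketch below runs on a made-up two-event section. The names `extract_numbered_events` and `pad_events` and the sample text are hypothetical; they are not taken from the model-answer file.

```python
import re

MAX_EVENTS = 46  # same cap as in the method description

# Hypothetical section, for illustration only.
sample_section = """## Example Section
1. First event text.
2. Second event text.
"""

def extract_numbered_events(section):
    """Return the text of every line that starts with '<number>.'."""
    pattern = r"^\s*\d+\.\s+(.+)$"
    return [m.group(1).strip() for m in re.finditer(pattern, section, flags=re.MULTILINE)]

def pad_events(events, size=MAX_EVENTS):
    """Truncate to `size` items, then pad with None so all lists share one length."""
    events = events[:size]
    return events + [None] * (size - len(events))

events = pad_events(extract_numbered_events(sample_section))
print(events[:3])   # ['First event text.', 'Second event text.', None]
print(len(events))  # 46
```

Padding to a fixed length keeps the `T1`–`T46` column layout identical for both forms, at the cost of `NaN` cells where a form has fewer events.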

* **Text preprocessing**

  * Convert both responses and events to lowercase.
  * Replace ASCII punctuation with spaces, so hyphenated words such as “high-speed” split into separate tokens.
  * Split into tokens and remove common stopwords.
  * Represent each event and each response as a **set of tokens**.
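
The preprocessing steps above can be sketched as a single function. This is a minimal version with an abbreviated, illustrative stopword list; `to_token_set` and `STOPWORDS` are hypothetical names.

```python
import string

# Abbreviated stopword list, for illustration only.
STOPWORDS = {"the", "and", "a", "to", "of", "in", "on", "is", "was"}

def to_token_set(text):
    """Lowercase, replace ASCII punctuation with spaces, drop stopwords."""
    text = text.lower()
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return {tok for tok in text.translate(table).split() if tok not in STOPWORDS}

tokens = to_token_set("Max escapes, and the War Rig is chased!")
print(sorted(tokens))  # ['chased', 'escapes', 'max', 'rig', 'war']
```

Because `string.punctuation` covers ASCII characters only, a typographic apostrophe survives: `to_token_set("Furiosa’s rig")` keeps `furiosa’s` as one token, which would not match an event token `furiosa`. Using a set also discards word order and repetition.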

* **Per-event scoring (T1–T46)**

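  * For each event (T_i), compute:

    * Token overlap between response and event: the size of the intersection, and the **event-coverage ratio** `|intersection| / |event_tokens|`. This is not a true Jaccard index, which would divide by `|response_tokens ∪ event_tokens|`; dividing by the event size means a long response is not penalized for its extra tokens.
  * **Accuracy (T_i_Accuracy)**:

    * Set to `1` if:

      * At least 2 overlapping tokens, **and**
      * Event-coverage ratio ≥ 0.12.
    * Otherwise `0`.
    * If no event is defined for that index (padded), set `NaN`. Blank or missing responses are the exception: they receive `0` at every index.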
  * **Detail (T_i_Detail)**:

    * Only considered if `T_i_Accuracy == 1`.
    * Set to `1` if:

      * The response contains **any detail keyword** anywhere in its text, whether or not the keyword relates to event T_i (keywords are character names, locations, and key objects such as “Furiosa”, “Immortan Joe”, “War Rig”, “Citadel”), **or**
      * There are ≥ 4 overlapping tokens with the event.
    * Otherwise `0`.
    * If no event defined: `NaN`.
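
A worked example may help here. The sketch below scores one response against one event using the thresholds listed above; `score_event` is a hypothetical helper, the token sets are invented, and matching keywords as whole tokens is an assumption of this sketch.

```python
def score_event(resp_tokens, event_tokens, detail_keywords,
                min_overlap=2, min_ratio=0.12, detail_overlap=4):
    """Return (accuracy, detail, ratio) for one response/event pair."""
    if not event_tokens:  # padded slot: no event defined
        return float("nan"), float("nan"), 0.0
    inter = resp_tokens & event_tokens
    ratio = len(inter) / len(event_tokens)  # share of event tokens recalled
    accuracy = int(len(inter) >= min_overlap and ratio >= min_ratio)
    # Assumption: keywords are compared as whole tokens, response-wide.
    has_keyword = bool(resp_tokens & detail_keywords)
    detail = int(accuracy == 1 and (has_keyword or len(inter) >= detail_overlap))
    return accuracy, detail, ratio

# Invented token sets, for illustration only.
event = {"furiosa", "drives", "war", "rig", "citadel"}
response = {"woman", "drives", "big", "rig", "fast"}
keywords = {"furiosa", "citadel"}

acc, det, ratio = score_event(response, event, keywords)
true_jaccard = len(response & event) / len(response | event)
print(acc, det, ratio)  # 1 0 0.4
print(true_jaccard)     # 0.25
```

Two of the five event tokens appear in the response, so the ratio is 0.4 and the event counts as recalled. No keyword is present and the overlap is below 4, so Detail stays 0. The true Jaccard index for the same pair is 0.25, because its denominator also counts the three response-only tokens.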

* **Recall score**

  * `recall_score` = sum of all `T*_Accuracy` and `T*_Detail` columns (T1–T46), treating `NaN` as 0.
  * Higher scores = more events recalled and/or recalled with detail.
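
To make the `NaN` handling concrete, here is a small plain-Python sketch of the sum for one row; `recall_score` as a function and the three-slot row are hypothetical.

```python
import math

NAN = float("nan")

def recall_score(flags):
    """Sum 0/1 flags, counting NaN (no event defined) as 0."""
    return int(sum(0 if math.isnan(f) else f for f in flags))

# Hypothetical row with three event slots; the third slot has no event.
accuracy = [1, 1, NAN]
detail = [1, 0, NAN]

score = recall_score(accuracy + detail)
print(score)  # 3
```

Each defined event contributes at most 2 points (one for Accuracy, one for Detail), so the maximum score for a row is twice the number of events defined for its form.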

* **Confidence score**

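  * For each response, collect the event-coverage ratios (`|intersection| / |event_tokens|`) of all events where `Accuracy == 1`.
  * If no events matched, or the response is blank → `confidence_score = 40` (low-moderate).
  * Otherwise:

    * Compute the mean ratio across matched events.
    * Set `confidence_score = int(60 + min(40, 40 * mean / 0.4))`, so the score rises linearly from 60 and saturates at 100 once the mean reaches 0.4. The result is then clamped to [30, 100]; because every matched event already has a ratio ≥ 0.12, matched responses score roughly 72 or higher and the clamp never changes the value.
  * Interpreted as a **heuristic confidence** in how well the automatic method captured the event structure of the recall. The scale is nominally 0–100, but in practice the score is either 40 or a value from about 72 to 100.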
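
The mapping can be written as a small standalone function. This is a sketch of the heuristic as described above; the name `confidence_from_ratios` and the sample inputs are hypothetical.

```python
def confidence_from_ratios(ratios):
    """Map the mean ratio of matched events to a confidence score."""
    if not ratios:
        return 40  # nothing matched: fixed low-moderate value
    mean_ratio = sum(ratios) / len(ratios)
    # Start at 60 and add up to 40 points, saturating at a mean of 0.4.
    conf = int(60 + min(40, 40 * (mean_ratio / 0.4)))
    return min(100, max(30, conf))  # clamp to [30, 100]

print(confidence_from_ratios([]))          # 40
print(confidence_from_ratios([0.2]))       # 80
print(confidence_from_ratios([0.4, 0.4]))  # 100
print(confidence_from_ratios([1.0]))       # 100
```

Note that `int()` truncates rather than rounds, and that any mean at or above 0.4 is indistinguishable from a perfect match, so the score separates weak matches better than strong ones.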

* **Outputs**

  * Original columns preserved.
  * Added 92 event-level columns: `T1_Accuracy`–`T46_Accuracy`, `T1_Detail`–`T46_Detail`.
  * Added `recall_score` and `confidence_score`.
  * Exported as:

    * `coded_responses_full_method1.csv` (all rows).
    * `coded_responses_preview.csv` (first 20 rows).


In [10]:
import pandas as pd
import numpy as np
from pathlib import Path
import re
import string

# --------------------------------------------------
# CONFIG: file paths (edit if needed)
# --------------------------------------------------
RAW_CSV = "sample_open_ended_madmax_long.csv"
MODEL_MD = "sample_model_answers_events.md"
FULL_OUT = "coded_responses_full_method1.csv"
PREVIEW_OUT = "coded_responses_preview.csv"

# --------------------------------------------------
# Helpers
# --------------------------------------------------
def resolve_file(filename: str) -> Path:
    """Return the first matching path for filename across common notebook roots."""
    search_dirs = []
    cwd = Path.cwd()
    search_dirs.append(cwd)
    sample_dir = cwd / "recall_openended" / "sample"
    if sample_dir != cwd:
        search_dirs.append(sample_dir)
    if "__file__" in globals():
        try:
            search_dirs.append(Path(__file__).resolve().parent)
        except Exception:
            pass
    resolved_roots = []
    for directory in search_dirs:
        if directory is None:
            continue
        try:
            resolved_dir = directory.resolve()
        except FileNotFoundError:
            continue
        if resolved_dir in resolved_roots:
            continue
        resolved_roots.append(resolved_dir)
        candidate = resolved_dir / filename
        if candidate.exists():
            return candidate
    raise FileNotFoundError(
        f"Could not locate {filename!r}. Checked: " + ", ".join(str(p) for p in resolved_roots)
    )

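def read_csv_safely(path: Path) -> pd.DataFrame:
    """Try a couple of common encodings so Windows-generated CSVs load cleanly."""
    last_error = None
    # "utf-8-sig" also reads plain UTF-8 and skips a leading BOM if present.
    for encoding in ("utf-8-sig", "cp1252"):
        try:
            return pd.read_csv(path, encoding=encoding)
        except UnicodeDecodeError as exc:
            last_error = exc
    # A bare `raise` here would fail: no exception is active outside `except`.
    raise last_error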

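def read_text_safely(path: Path) -> str:
    """Read text using the same fallback encodings as the CSV helper."""
    last_error = None
    # "utf-8-sig" also reads plain UTF-8 and skips a leading BOM if present.
    for encoding in ("utf-8-sig", "cp1252"):
        try:
            return path.read_text(encoding=encoding)
        except UnicodeDecodeError as exc:
            last_error = exc
    # A bare `raise` here would fail: no exception is active outside `except`.
    raise last_error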

# --------------------------------------------------
# Load data
# --------------------------------------------------
raw_path = resolve_file(RAW_CSV)
md_path = resolve_file(MODEL_MD)
output_dir = raw_path.parent

df = read_csv_safely(raw_path)
md_text = read_text_safely(md_path)

# --------------------------------------------------
# Parse model answers into event lists
# (matches the logic used in the ChatGPT run)
# --------------------------------------------------

def extract_numbered(section: str):
    """Extract lines starting with '1.', '2.', ... as event strings."""
    events = []
    for m in re.finditer(r"\n\s*\d+\.\s*(.+)", section):
        events.append(m.group(1).strip())
    return events

text = md_text

# Long form section: between "## Mad Max - Long Form" and
# "Mad Max: Fury Road – Short Form Scene Events" (note the en dash)
start_long = text.find("## Mad Max - Long Form")
start_short_label = text.find("Mad Max: Fury Road – Short Form Scene Events")

if start_long == -1:
    raise ValueError("Could not find '## Mad Max - Long Form' in model answers markdown.")

if start_short_label == -1:
    raise ValueError("Could not find 'Mad Max: Fury Road – Short Form Scene Events' label in model answers markdown.")

long_section = text[start_long:start_short_label]
short_section = text[start_short_label:]

events_long = extract_numbered(long_section)
events_short = extract_numbered(short_section)

# Sanity: cap each list at MAX_EVENTS (slicing is a no-op for shorter lists)
MAX_EVENTS = 46
events_long = events_long[:MAX_EVENTS]
events_short = events_short[:MAX_EVENTS]

# For indexing convenience later, pad with None up to MAX_EVENTS
events_long_padded = events_long + [None] * (MAX_EVENTS - len(events_long))
events_short_padded = events_short + [None] * (MAX_EVENTS - len(events_short))

# --------------------------------------------------
# Text normalization and tokenization
# --------------------------------------------------
stopwords = {
    "the","and","a","an","to","of","in","on","at","by","for","from","with","as",
    "is","are","was","were","be","been","being","it","its","this","that","these",
    "those","into","their","his","her","they","them","he","she","or","but","so",
    "up","out","about","over","under","through","across","off","not","no","than",
    "then","there","here","when","while","after","before","also","just","very",
    "own","still","now","back"
}

def normalize(text: str):
    """Lowercase, strip punctuation, remove stopwords, return token set."""
    text = text.lower()
    text = text.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    tokens = [t for t in text.split() if t and t not in stopwords]
    return set(tokens)

# Precompute event token sets
event_tokens_long = [normalize(e) if e is not None else set() for e in events_long_padded]
event_tokens_short = [normalize(e) if e is not None else set() for e in events_short_padded]

# Keywords for "detail"
detail_keywords = {
    "max","rockatansky","nux","immortan","joe","furiosa","war","rig","warboys","war","boys",
    "citadel","buzzards","gas","town","bullet","farm","people","eater","rictus","sandstorm",
    "storm","desert","wasteland","blood","bag","wives","breeders","convoy","tanker","truck"
}

# --------------------------------------------------
# Scoring function for a single row
# --------------------------------------------------
def score_response(row):
    """
    Returns:
        acc_list: list of length MAX_EVENTS with 0/1/NaN
        det_list: list of length MAX_EVENTS with 0/1/NaN
        conf: int, 0–100
    """
    text = row.get("response", "")
    if isinstance(text, float) and np.isnan(text):
        text = ""
    resp = str(text).strip()

    # Blank / missing
    if not resp:
        return [0] * MAX_EVENTS, [0] * MAX_EVENTS, 40

    resp_tokens = normalize(resp)
    resp_lower = resp.lower()
    form = str(row.get("form", "")).strip().lower()
    title = str(row.get("title", "")).lower()

    # Choose event list based on title + form
    if "mad max" in title:
        if form == "short":
            ev_texts = events_short_padded
            ev_tokens_list = event_tokens_short
        else:
            # default to Long if anything else
            ev_texts = events_long_padded
            ev_tokens_list = event_tokens_long
    else:
        # Unknown title: treat as no defined events
        ev_texts = [None] * MAX_EVENTS
        ev_tokens_list = [set()] * MAX_EVENTS

    acc_list = []
    det_list = []
    sim_scores = []

    for i in range(MAX_EVENTS):
        ev_text = ev_texts[i]
        ev_tokens = ev_tokens_list[i]

        if ev_text is None or not ev_tokens:
            acc_list.append(np.nan)
            det_list.append(np.nan)
            continue

        inter = resp_tokens.intersection(ev_tokens)
        # Event-coverage ratio: share of the event's tokens found in the response.
        # (Not a true Jaccard index, which would divide by the size of the union.)
        coverage = len(inter) / len(ev_tokens)

        # Conservative accuracy rule:
        # Accuracy = 1 if there's enough overlap (both absolute and relative)
        if len(inter) >= 2 and coverage >= 0.12:
            acc = 1
            sim_scores.append(coverage)
        else:
            acc = 0

        # Detail only if event is accurate
        if acc == 1:
            # Matching detail keywords OR substantial lexical overlap.
            # Compare whole tokens: a substring test would let "war" match "toward".
            has_detail_kw = not resp_tokens.isdisjoint(detail_keywords)
            if has_detail_kw or len(inter) >= 4:
                det = 1
            else:
                det = 0
        else:
            det = 0

        acc_list.append(acc)
        det_list.append(det)

    # Confidence heuristic
    if not sim_scores:
        conf = 40
    else:
        mean_sim = float(np.mean(sim_scores))
        # Start at 60 and increase up to +40 as similarity improves
        conf = int(60 + min(40, 40 * (mean_sim / 0.4)))
        # clip
        conf = min(100, max(30, conf))

    return acc_list, det_list, conf

# --------------------------------------------------
# Apply scoring to all rows
# --------------------------------------------------
acc_cols = [f"T{i}_Accuracy" for i in range(1, MAX_EVENTS + 1)]
det_cols = [f"T{i}_Detail" for i in range(1, MAX_EVENTS + 1)]

acc_data = []
det_data = []
conf_data = []

for _, row in df.iterrows():
    acc_list, det_list, conf = score_response(row)
    acc_data.append(acc_list)
    det_data.append(det_list)
    conf_data.append(conf)

acc_df = pd.DataFrame(acc_data, columns=acc_cols)
det_df = pd.DataFrame(det_data, columns=det_cols)

coded_df = pd.concat([df.reset_index(drop=True), acc_df, det_df], axis=1)

# recall_score: sum of all Accuracy + Detail, treating NaN as 0
event_cols = acc_cols + det_cols
coded_df["recall_score"] = coded_df[event_cols].fillna(0).sum(axis=1).astype(int)
coded_df["confidence_score"] = conf_data

# --------------------------------------------------
# Save outputs
# --------------------------------------------------
full_out_path = output_dir / FULL_OUT
preview_out_path = output_dir / PREVIEW_OUT
coded_df.to_csv(full_out_path, index=False)
coded_df.head(20).to_csv(preview_out_path, index=False)

print(f"Saved full coded dataset to: {full_out_path}")
print(f"Saved preview (first 20 rows) to: {preview_out_path}")
print("\nHead of coded data:")
print(coded_df.head())
print("\nRecall score summary by question_code:")
print(coded_df.groupby("question_code")["recall_score"].describe())


Saved full coded dataset to: C:\Users\ashra\Documents\NeuralSense\NeuralData\clients\544_WBD_CXCU\recall_openended\sample\coded_responses_full_method1.csv
Saved preview (first 20 rows) to: C:\Users\ashra\Documents\NeuralSense\NeuralData\clients\544_WBD_CXCU\recall_openended\sample\coded_responses_preview.csv

Head of coded data:
   respondent group questionnaire question_code question   form    title  \
0           2     A          Post           Q18   Recall  Short  Mad Max   
1           4     A          Post           Q18   Recall  Short  Mad Max   
2           5     F          Post           Q18   Recall  Short  Mad Max   
3           3     F          Post           Q18   Recall  Short  Mad Max   
4           6     A          Post           Q18   Recall  Short  Mad Max   

                                            response  T1_Accuracy  \
0  This is a high speed chase scene from the movi...            1   
1  In this post apocalyptical wasteland fuel and ...            1   
2  I 