# User Study Results Analysis

This notebook analyzes the results from the LLM consistency visualization user study.

**Data sources:**
- **Comparisons**: Exit interview questions comparing graph vs list interfaces (1-7 scale: 1=graph, 7=list)
- **Single Distribution**: Per-interface questionnaire responses (one for list+dataset A, one for graph+dataset B per participant)
- **Outputs**: Telemetry (timing analysis; accuracy to be added)

**Key filtering**: We only include participants who completed the Comparisons survey (have prolific_pid in metadata). All other data is filtered to these PIDs.

## 1. Setup & Load Data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import seaborn as sns
import re
import numpy as np
import textwrap
from pathlib import Path
from scipy.stats import ttest_ind

# Seaborn styling for prettier plots (no grid background)
sns.set_theme(style="white", palette="muted", font_scale=1.1)

# Paths: works if run from repo root or from user_study_results/
DATA_DIR = Path("user_study_results")
if not (DATA_DIR / "user_study_logging - Comparison.csv").exists():
    DATA_DIR = Path(".")

COMPARISONS_PATH = DATA_DIR / "user_study_logging - Comparison.csv"
SINGLE_DIST_PATH = DATA_DIR / "user_study_logging - Single Distribution.csv"
OUTPUTS_PATH = DATA_DIR / "user_study_logging - outputs.csv"

comparisons_df = pd.read_csv(COMPARISONS_PATH)
single_dist_df = pd.read_csv(SINGLE_DIST_PATH)
outputs_df = pd.read_csv(OUTPUTS_PATH)

### 1.1 Extract Prolific IDs from Comparisons

Parse the "Metadata - no need to edit this" column which contains URLs with `prolific_pid=...`.
These PIDs define our analysis cohort—we filter ALL other data to only include these participants.

In [None]:
def extract_prolific_pids(metadata_series):
    """Extract prolific_pid values from metadata URL strings."""
    pids = set()
    for val in metadata_series.dropna():
        s = str(val)
        match = re.search(r'prolific_pid=([a-f0-9]+)', s, re.IGNORECASE)
        if match:
            pids.add(match.group(1))
    return pids

# Get metadata column - handle possible variations in column name
meta_col = [c for c in comparisons_df.columns if 'Metadata' in c and 'no need' in c.lower()]
if not meta_col:
    meta_col = [c for c in comparisons_df.columns if 'Metadata' in c]
meta_col = meta_col[0] if meta_col else 'Metadata - no need to edit this'

valid_prolific_pids = extract_prolific_pids(comparisons_df[meta_col])
print(f"Found {len(valid_prolific_pids)} unique Prolific IDs from Comparisons survey")
print(f"Sample PIDs: {list(valid_prolific_pids)[:5]}")
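
As a quick illustration of what the pattern captures, here is a minimal sketch on a made-up metadata value. The URL, the PID, and the helper name `extract_pid` are hypothetical; only the regular expression is taken from the cell above.

```python
import re

# Hypothetical metadata value; the real column holds a URL logged by the survey.
sample_metadata = (
    "https://example.org/study?prolific_pid=5F3a9c0b1d2e4f5a6b7c8d9e"
    "&study_id=abc123&vis_type=graph"
)

PID_PATTERN = re.compile(r"prolific_pid=([a-f0-9]+)", re.IGNORECASE)

def extract_pid(text):
    """Return the hex PID following 'prolific_pid=', or None if absent."""
    match = PID_PATTERN.search(text)
    return match.group(1) if match else None

pid = extract_pid(sample_metadata)
missing = extract_pid("https://example.org/study?study_id=abc123")
print(pid, missing)
```

The match is case-insensitive, but the captured text keeps its original case. If the same PID could be logged with different casing in different sheets, lowercasing it before the `isin` filter may be worth considering.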

### 1.2 Filter Comparisons to Valid PIDs

In [None]:
def get_pid_from_metadata(val):
    if pd.isna(val):
        return None
    match = re.search(r'prolific_pid=([a-f0-9]+)', str(val), re.IGNORECASE)
    return match.group(1) if match else None

def get_study_id_from_metadata(val):
    if pd.isna(val):
        return None
    match = re.search(r'study_id=([^&]+)', str(val))
    return match.group(1) if match else None

comparisons_df['prolific_pid'] = comparisons_df[meta_col].apply(get_pid_from_metadata)
comparisons_filtered = comparisons_df[comparisons_df['prolific_pid'].isin(valid_prolific_pids)].copy()

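# If the same PID appears multiple times (e.g. different study runs), keep the most recent.
# Sort on parsed datetimes: raw timestamp strings do not reliably sort chronologically.
comparisons_filtered = (
    comparisons_filtered
    .sort_values('Timestamp', key=lambda s: pd.to_datetime(s, errors='coerce'), na_position='first')
    .drop_duplicates(subset=['prolific_pid'], keep='last')
)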

print(f"Comparisons after filtering: {len(comparisons_filtered)} rows (one per participant)")
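
Keeping the most recent row per PID only works if timestamps are compared chronologically. The sketch below uses plain Python and made-up rows to show how string comparison can pick the wrong row. The `M/D/YYYY H:MM:SS` format is an assumption and should be checked against the exported CSV.

```python
from datetime import datetime

# Hypothetical rows: (prolific_pid, Timestamp, answer).
rows = [
    ("aaa111", "1/9/2025 14:00:00", 3),
    ("aaa111", "1/10/2025 9:30:00", 5),   # later submission from the same PID
    ("bbb222", "1/9/2025 15:00:00", 6),
]

def parse_ts(text):
    """Parse the assumed timestamp format into a datetime."""
    return datetime.strptime(text, "%m/%d/%Y %H:%M:%S")

# Keep, for each PID, the row with the latest parsed timestamp.
latest = {}
for pid, ts, answer in rows:
    if pid not in latest or parse_ts(ts) > parse_ts(latest[pid][0]):
        latest[pid] = (ts, answer)

# Comparing the raw strings instead selects the earlier submission for aaa111,
# because "9" sorts after "1" character by character.
string_pick = max(r[1] for r in rows if r[0] == "aaa111")
print(latest)
print(string_pick)
```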

## 2. Comparisons: Histograms for 1–7 Scale Questions

Scale: **1 = Graph preferred**, **7 = List preferred**

Questions:
- Which interface made the task easier overall?
- Which interface felt more overwhelming to use?
- Which interface better supported understanding the range and distribution of outputs?
- Which interface made you feel more confident in your answers?
- Overall, which interface do you prefer?

In [None]:
COMPARISON_QUESTIONS = [
    "Which interface made the task easier overall?",
    "Which interface felt more overwhelming to use?",
    "Which interface better supported understanding the range and distribution of outputs?",
    "Which interface made you feel more confident in your answers?",
    "Overall, which interface do you prefer?",
]

fig, axes = plt.subplots(2, 3, figsize=(14, 8))
axes = axes.flatten()

for i, q in enumerate(COMPARISON_QUESTIONS):
    ax = axes[i]
    col = [c for c in comparisons_filtered.columns if q in c or c == q]
    if not col:
        ax.text(0.5, 0.5, f"Column not found: {q[:40]}...", ha='center', va='center')
        continue
    vals = pd.to_numeric(comparisons_filtered[col[0]], errors='coerce').dropna()
    vals = vals[(vals >= 1) & (vals <= 7)]
    ax.hist(vals, bins=np.arange(0.5, 8.5, 1), edgecolor='black', alpha=0.7)
    ax.set_ylabel('Count')
    ax.yaxis.set_major_locator(MaxNLocator(integer=True))
    ax.set_xticks(range(1, 8))
    ax.set_xticklabels(['1\n(graph)', '2', '3', '4', '5', '6', '7\n(list)'])
    title = '\n'.join(textwrap.wrap(q, width=50))
    ax.set_title(title)

axes[-1].axis('off')
fig.suptitle('Comparisons: Graph vs List (1=Graph, 7=List)', fontsize=12, y=1.02)
plt.tight_layout()
plt.show()

## 3. Single Distribution

### 3.1 Filter to Valid PIDs

Filter rows to participants in our cohort.

In [None]:
sd_meta_col = 'Metadata - no need to edit this'
single_dist_df['prolific_pid'] = single_dist_df[sd_meta_col].apply(get_pid_from_metadata)
single_dist_df['study_id'] = single_dist_df[sd_meta_col].apply(get_study_id_from_metadata)
single_dist_filtered = single_dist_df[single_dist_df['prolific_pid'].isin(valid_prolific_pids)].copy()

print(f"Single Distribution: {len(single_dist_df)} total → {len(single_dist_filtered)} after PID filter")

### 3.2 Split: Graph vs List

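Assign each response to an interface using `vis_type` from the Metadata URL, falling back to the "Which interface were you using?" answer when `vis_type` is missing or unrecognized.
- **Graph**: `vis_type=graph` in metadata; fallback: the interface answer is "Graph"
- **List**: `vis_type=raw_outputs` in metadata; fallback: the interface answer contains "list" (e.g. "List of outputs")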

In [None]:
# Parse vis_type from metadata URL (format: ...&vis_type=raw_outputs or vis_type=graph)
def parse_vis_from_metadata(val):
    if pd.isna(val): return None
    m = re.search(r'vis_type=([^,&]+)', str(val))
    return m.group(1) if m else None

single_dist_filtered['vis_type'] = single_dist_filtered[sd_meta_col].apply(parse_vis_from_metadata)

# Map to graph vs list
single_dist_filtered['interface'] = single_dist_filtered['vis_type'].map({
    'graph': 'graph',
    'raw_outputs': 'list'
})

# If vis_type parsing failed, use the question
mask = single_dist_filtered['interface'].isna()
single_dist_filtered.loc[mask, 'interface'] = single_dist_filtered.loc[mask, 'Which interface were you using?'].map(
    lambda x: 'graph' if str(x).strip().lower() == 'graph' else ('list' if 'list' in str(x).lower() else None)
)

sd_graph = single_dist_filtered[single_dist_filtered['interface'] == 'graph'].copy()
sd_list = single_dist_filtered[single_dist_filtered['interface'] == 'list'].copy()

print(f"Graph responses: {len(sd_graph)}")
print(f"List responses: {len(sd_list)}")
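
The same resolution rule can be written as a standalone function, which also makes it easy to flag rows where the URL and the self-reported answer disagree. This is a sketch in plain Python; `resolve_interface` and `sources_conflict` are hypothetical helper names, and the example inputs are invented.

```python
import re

VIS_TO_INTERFACE = {"graph": "graph", "raw_outputs": "list"}

def resolve_interface(metadata, self_report):
    """Prefer vis_type from the metadata URL; fall back to the self-reported answer."""
    m = re.search(r"vis_type=([^,&]+)", metadata or "")
    if m and m.group(1) in VIS_TO_INTERFACE:
        return VIS_TO_INTERFACE[m.group(1)]
    answer = (self_report or "").strip().lower()
    if answer == "graph":
        return "graph"
    if "list" in answer:
        return "list"
    return None

def sources_conflict(metadata, self_report):
    """True when both sources resolve to an interface and they differ."""
    from_url = resolve_interface(metadata, None)
    from_answer = resolve_interface(None, self_report)
    return from_url is not None and from_answer is not None and from_url != from_answer

print(resolve_interface("https://example.org/?vis_type=raw_outputs", "Graph"))
print(resolve_interface("https://example.org/?pid=1", "List of outputs"))
print(sources_conflict("https://example.org/?vis_type=graph", "List of outputs"))
```

Counting the rows for which `sources_conflict` is true would give a cheap sanity check on the split before any per-interface comparison.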

### 3.2.1 Condition counts: Interface × Task × Order

Number of responses (and participants) in each combination.

In [None]:
sd_valid = single_dist_filtered[single_dist_filtered['interface'].notna()].copy()
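sd_valid = sd_valid.sort_values(
    ['prolific_pid', 'study_id', 'Timestamp'],
    # Parse Timestamp so ordering is chronological rather than lexicographic
    key=lambda s: pd.to_datetime(s, errors='coerce') if s.name == 'Timestamp' else s,
)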
sd_valid['order'] = sd_valid.groupby(['prolific_pid', 'study_id']).cumcount() + 1  # 1 = Task 1, 2 = Task 2
sd_valid['task_type'] = sd_valid["What were the questions about"]
sd_valid['interface_label'] = sd_valid['interface'].map({'graph': 'Graph', 'list': 'List'})

print("Response-level counts: Interface × Task type × Order")
print("(Each row = one response; each participant has 2 responses: Task 1 and Task 2)")
counts = sd_valid.groupby(['interface_label', 'task_type', 'order']).size().unstack(fill_value=0)
counts = counts.rename(columns={i: f'Task {i}' for i in counts.columns})
print(counts.to_string())
print()

print("Participant-level conditions: Interface and dataset for Task 1")
print("(Determines full 2-task sequence per person)")
sd_first = sd_valid[sd_valid['order'] == 1]
cond = sd_first['interface_label'].astype(str) + " + " + sd_first['task_type'].astype(str) + " first"
for c, n in cond.value_counts().sort_index().items():
    print(f"  {c}: {n}")

### 3.3 Survey Question Columns

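We match columns by (partial) question text rather than by exact header, to tolerate quoting and encoding variations in the CSV export. **Exclude** "This study is about elephant behavior"; it is simply omitted from `SECTIONS`.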

In [None]:
# Column names for survey questions (by section). Skip elephant column.
SECTIONS = {
    "Understanding model behavior for individual prompts": [
        "\"Using this interface, I understood how diverse (ie, how narrow or broad) the output space was for a given prompt\"",
        "\"I understood what a typical (or 'average') output looked like for a given prompt. \"",
        "\"I had a good sense of what rare or unusual outputs looked like for a given prompt.\"",
        "\"I felt I had seen enough of the model's behavior for a prompt to make a good decision.\"",
        "\"I could see if there were recurring patterns in the outputs, and if so, what types\"",
    ],
    "Workload and Effort": [
        "\"Using this interface required a lot of mental effort.\"",
        "\"I had to work hard to complete the task using this interface.\"",
        "\"I felt frustrated while using this interface.\"",
        "\"I felt rushed while using this interface.\"",
    ],
    "Usability and satisfaction": [
        "\"I found the interface easy to use.\"",
        "\"I felt confident in the decisions I made using this interface.\"",
        "\"I would want to use this interface for similar tasks in my own work.\"",
    ],
}

def find_column(df, partial_name):
    """Find column that contains the partial name (handles encoding variations)."""
    partial_clean = partial_name.replace('\\"', '"').strip()
    for c in df.columns:
        if partial_clean in c or c.strip() == partial_clean:
            return c
    return None

# Build list of (section, question, column_name)
survey_cols = []
for section, questions in SECTIONS.items():
    for q in questions:
        col = find_column(single_dist_filtered, q.replace('\\"', '"'))
        if col:
            survey_cols.append((section, q[:60] + ('...' if len(q) > 60 else ''), col))
        else:
            # Try matching by key phrase
            key = q.split(",")[0][:50] if ',' in q else q[:50]
            for c in single_dist_filtered.columns:
                if key.replace('\\"', '').replace('"', '') in c.replace('"', ''):
                    survey_cols.append((section, q[:60], c))
                    break

print("Resolved columns:")
for sec, q, col in survey_cols:
    print(f"  {sec}: {col[:50]}...")

### 3.4 Horizontal Stacked Bar Charts (Graph vs List)

The per-interface survey responses (columns resolved in 3.3), shown as horizontal stacked bars. Each question is a row with two paired bars (Graph, List); response levels 1–7 are stacked using a diverging palette.
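
To make the stacking concrete, the sketch below computes the segment positions for one bar from a list of 1–7 responses. It is plain Python with invented responses, and `stacked_segments` is a hypothetical helper, not the notebook's plotting code.

```python
from collections import Counter

LEVELS = range(1, 8)  # response levels 1..7

def stacked_segments(responses):
    """Return (level, left_edge, width) tuples, in proportion units, for one bar."""
    counts = Counter(responses)
    total = sum(counts[level] for level in LEVELS)
    segments, left = [], 0.0
    for level in LEVELS:
        width = counts[level] / total if total else 0.0
        segments.append((level, left, width))
        left += width
    return segments

# Hypothetical responses to one question under each interface.
graph_responses = [6, 7, 5, 6, 4, 7, 6, 3]
list_responses = [4, 5, 3, 4, 2, 5, 4, 6]

graph_segments = stacked_segments(graph_responses)
list_segments = stacked_segments(list_responses)
for level, left, width in graph_segments:
    print(level, round(left, 3), round(width, 3))
```

Each tuple maps onto one horizontal bar segment, for example a matplotlib `ax.barh(y, width, left=left_edge)` call coloured by level from a seven-step diverging palette.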

## 4. Accuracy

### 4.1 Ground Truth for Accuracy Questions

The Single Distribution survey includes accuracy questions (not usability). We compute ground truth from the **first 20 outputs** of each dataset:
- **user_study_monsters**: First prompt (child in magical world); outputs from `examples_user_study_monsters`
- **user_study_places**: First prompt (explore magical world); outputs from `examples_user_study_places`

We exclude "write one sentence" free-text questions. Ground truth answers are computed below for later accuracy comparison.
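
As a worked example of how a count over 20 outputs becomes a survey bucket, the sketch below uses invented output strings. Only the creature names and bucket boundaries come from this notebook; the counts are illustrative, not the study's ground truth.

```python
def pct_to_bucket(pct):
    """Map a percentage (0-100) to the survey's bucket label."""
    if pct == 0: return "0%"
    if pct <= 5: return "1–5%"
    if pct <= 15: return "6–15%"
    if pct <= 30: return "16–30%"
    if pct <= 50: return "31–50%"
    return ">50%"

# Hypothetical stand-in for the first 20 outputs of a dataset.
outputs = (
    ["The Lumivine glides through the forest."] * 9
    + ["The Luminara hums above the lake."] * 4
    + ["The Mistwraith drifts at dusk."] * 1
    + ["A nameless spirit watches."] * 6
)

names = ["Lumisprite", "Luminara", "Mistwraith", "Lumivine"]
buckets = {}
for name in names:
    # Case-insensitive substring count, then percentage of all outputs.
    count = sum(1 for o in outputs if name.lower() in o.lower())
    buckets[name] = pct_to_bucket(100 * count / len(outputs))

print(buckets)
```

With 20 outputs every percentage is a multiple of 5, so values land exactly on bucket edges (1 of 20 is 5%, which belongs to 1–5%). Multiplying by 100 before dividing, as done here, keeps those edge values exact in floating point.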

### 4.2 Accuracy Analysis (by Interface)

We compare user responses to ground truth for each accuracy question. **Filtering**: Only rows where the user answered about that dataset (Monsters vs Places from "What were the questions about").

#### Methodology

1. **Exact-match questions** (creature/location most frequent, theme, phrase, impossible, most likely sentence): 1 if correct, 0 if wrong. Theme and phrase use flexible string matching (strip quotes, case-insensitive) to handle minor variations.

2. **Bucket questions** (percentage ranges: 0%, 1–5%, 6–15%, 16–30%, 31–50%, >50%): **Partial credit** based on bucket distance.
   - Buckets ordered: `[0%, 1–5%, 6–15%, 16–30%, 31–50%, >50%]` (indices 0–5).
   - Distance = `|truth_index - user_index|` (max 5).
   - **Score = max(0, 1 - distance/5)**. So: exact match = 1.0; off by 1 bucket = 0.8; off by 2 = 0.6; off by 3 = 0.4; off by 4 = 0.2; off by 5 = 0.0.
   - Rationale: Saying 6–15% when truth is 0% is less wrong than saying >50%; partial credit reflects that.

3. **Normalization**: Bucket strings normalized (replace hyphens with en-dash, strip whitespace) before comparison.

4. **Theme matching**: Flexible—exact match after normalizing, or ≥2 key words from the ground-truth theme present in the user answer (to handle paraphrasing).

5. **Phrase matching**: Ground-truth phrase substring in user answer or vice versa (case-insensitive, quotes stripped). Note: Places ground truth is "ancient druids" (6/20 outputs); survey may offer "ancient forest" (4/20)—we use strict data-derived truth.

6. **Impossible / Most likely**: Match by string prefix or substantial overlap (first 40–60 chars) to allow for minor truncation in logged responses.
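
The partial-credit rule in step 2 can be checked with a few lines of plain Python. This sketch assumes labels are already normalized to the en-dash form; `partial_credit` is a name used only here.

```python
BUCKET_ORDER = ["0%", "1–5%", "6–15%", "16–30%", "31–50%", ">50%"]

def partial_credit(truth_bucket, user_bucket):
    """Score = max(0, 1 - distance / 5), where distance is the gap in bucket index."""
    distance = abs(BUCKET_ORDER.index(truth_bucket) - BUCKET_ORDER.index(user_bucket))
    return max(0.0, 1.0 - distance / (len(BUCKET_ORDER) - 1))

# Scores for every possible answer when the truth is "0%".
scores_vs_zero = {b: round(partial_credit("0%", b), 1) for b in BUCKET_ORDER}
print(scores_vs_zero)

# The rule is symmetric: overshooting and undershooting by one bucket score the same.
print(partial_credit("16–30%", "6–15%") == partial_credit("16–30%", "31–50%"))
```

Because the score is linear in index distance, it treats all bucket steps as equally large even though the buckets cover different percentage widths.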

In [None]:
# Column indices for accuracy questions (from Single Distribution CSV)
MONSTERS_COLS = {
    "creature_most_freq": 2,
    "creature_buckets": [3, 4, 5, 6],  # Lumisprite, Luminara, Mistwraith, Lumivine
    "char_buckets": [7, 8, 9, 10],     # Glowing, Guides, Moon/stars, Aggressive
    "theme": 11,
    "phrase": 12,
    "impossible": 13,
    "most_likely": 14,
}
PLACES_COLS = {
    "location_most_freq": 16,
    "location_buckets": [17, 18, 19, 20],  # Crystal Caverns, Whispering Glade, Shimmering Vale, Celestial Library
    "npc_buckets": [21, 22, 23, 24],       # Merchants, Wizards, Druids, Elves
    "theme": 25,
    "phrase": 26,
    "impossible": 27,
    "most_likely": 28,
}

BUCKET_ORDER = ["0%", "1–5%", "6–15%", "16–30%", "31–50%", ">50%"]

def normalize_bucket(s):
    """Normalize a bucket string for comparison (e.g. '1-5%' -> '1–5%'); None if unrecognized."""
    if pd.isna(s) or str(s).strip() == "": return None
    s = str(s).strip().replace("-", "–")  # normalize hyphen to en-dash
    if s in BUCKET_ORDER: return s
    # Fall back to substring matching for answers with extra text. The lookbehind
    # rejects matches preceded by a digit, so "0%" cannot match inside "16–30%" or "50%".
    for b in BUCKET_ORDER:
        if re.search(r"(?<!\d)" + re.escape(b), s):
            return b
    return None

def bucket_index(bucket):
    """Get index of bucket in BUCKET_ORDER, or None if unknown."""
    b = normalize_bucket(bucket)
    return BUCKET_ORDER.index(b) if b in BUCKET_ORDER else None

def bucket_partial_credit(truth_bucket, user_bucket):
    """
    Partial credit for bucket questions. Score = max(0, 1 - distance/5).
    """
    ti, ui = bucket_index(truth_bucket), bucket_index(user_bucket)
    if ti is None or ui is None: return np.nan
    distance = abs(ti - ui)
    return max(0.0, 1.0 - distance / 5.0)

def normalize_text(s):
    """Lowercase, trim whitespace, and map curly quotes to straight ones for flexible matching."""
    if pd.isna(s): return ""
    s = str(s).strip().lower()
    s = s.replace("\u201c", '"').replace("\u201d", '"').replace("\u2019", "'")
    return s

def theme_match(user_val, truth_val):
    """Flexible theme matching: exact after normalize, or key phrases present."""
    u, t = normalize_text(user_val), normalize_text(truth_val)
    if u == t: return True
    # Key phrases from truth (monsters: ethereal/glowing; places: ancient/druids)
    key_words = t.split()[:5]  # first 5 words as proxy for main idea
    return sum(1 for w in key_words if len(w) > 3 and w in u) >= 2

def phrase_match(user_val, truth_val):
    """Phrase match: after normalizing, truth phrase is in the user answer or vice versa."""
    u, t = normalize_text(user_val), normalize_text(truth_val)
    if not u or not t: return False  # an empty string is a substring of everything
    return t in u or u in t

# Split by dataset and interface
sd = single_dist_filtered.copy()
cols = list(sd.columns)
m_df = sd[sd["What were the questions about"] == "Monsters"]
p_df = sd[sd["What were the questions about"] == "Locations"]

def compute_accuracy(df, col_map, truth, dataset_name):
    """Compute accuracy scores for a dataset. Returns dict mapping '<question>_<interface>' -> list of per-response scores."""
    results = {}
    if len(df) == 0: return results

    # Creature/location most frequent
    col = cols[col_map["creature_most_freq" if "monsters" in dataset_name else "location_most_freq"]]
    gt_val = truth["creature_most_frequent" if "monsters" in dataset_name else "location_most_frequent"]
    for iface in ["graph", "list"]:
        subset = df[df["interface"] == iface]
        if len(subset) == 0: continue
        vals = subset[col].apply(lambda x: normalize_text(x) == normalize_text(gt_val) if pd.notna(x) else False)
        key = f"most_frequent_entity_{iface}"
        results[key] = vals.astype(float).tolist()

    # Bucket questions: (key in col_map, key in truth) for each group of sub-questions
    if "monsters" in dataset_name:
        bucket_specs = [("creature_buckets", "creature_buckets"), ("char_buckets", "characteristic_buckets")]
    else:
        bucket_specs = [("location_buckets", "location_buckets"), ("npc_buckets", "npc_buckets")]
    for col_key, truth_key in bucket_specs:
        gt_buckets = truth[truth_key]
        # Column order must match the insertion order of the ground-truth dict
        for col_idx, entity in zip(col_map[col_key], gt_buckets):
            col_name = cols[col_idx]
            gt_b = gt_buckets[entity]
            for iface in ["graph", "list"]:
                subset = df[df["interface"] == iface]
                if len(subset) == 0: continue
                scores = subset[col_name].apply(lambda u: bucket_partial_credit(gt_b, u))
                key = f"bucket_{entity}_{iface}"
                results[key] = scores.tolist()

    # Theme
    col = cols[col_map["theme"]]
    gt_theme = truth["theme"]
    for iface in ["graph", "list"]:
        subset = df[df["interface"] == iface]
        if len(subset) == 0: continue
        vals = subset[col].apply(lambda x: theme_match(x, gt_theme) if pd.notna(x) else False)
        results[f"theme_{iface}"] = vals.astype(float).tolist()

    # Phrase
    col = cols[col_map["phrase"]]
    gt_phrase = truth["phrase_most_frequent"]
    for iface in ["graph", "list"]:
        subset = df[df["interface"] == iface]
        if len(subset) == 0: continue
        vals = subset[col].apply(lambda x: phrase_match(x, gt_phrase) if pd.notna(x) else False)
        results[f"phrase_{iface}"] = vals.astype(float).tolist()

    # Impossible (exact or startswith)
    col = cols[col_map["impossible"]]
    gt_imp = truth["impossible"]
    for iface in ["graph", "list"]:
        subset = df[df["interface"] == iface]
        if len(subset) == 0: continue
        vals = subset[col].apply(lambda x: str(x).strip().startswith(gt_imp[:40]) if pd.notna(x) else False)
        results[f"impossible_{iface}"] = vals.astype(float).tolist()

    # Most likely sentence
    col = cols[col_map["most_likely"]]
    gt_sent = truth["most_likely_sentence"]
    for iface in ["graph", "list"]:
        subset = df[df["interface"] == iface]
        if len(subset) == 0: continue
        vals = subset[col].apply(lambda x: normalize_text(str(x)[:60]) in normalize_text(gt_sent) or normalize_text(gt_sent)[:60] in normalize_text(str(x)) if pd.notna(x) else False)
        results[f"most_likely_{iface}"] = vals.astype(float).tolist()

    return results
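
The theme and phrase rules are easiest to judge on concrete inputs. The sketch below restates them in plain Python and runs them on a few invented answers. The ground-truth strings appear elsewhere in this notebook; the empty-answer guard in `phrase_match` is an addition made for this sketch.

```python
def normalize_text(s):
    """Lowercase, trim, and map curly quotes to straight quotes."""
    s = str(s).strip().lower()
    return s.replace("\u201c", '"').replace("\u201d", '"').replace("\u2019", "'")

def theme_match(user_val, truth_val):
    """Exact match after normalizing, or at least 2 of the first 5 truth words present."""
    u, t = normalize_text(user_val), normalize_text(truth_val)
    if u == t:
        return True
    key_words = t.split()[:5]
    return sum(1 for w in key_words if len(w) > 3 and w in u) >= 2

def phrase_match(user_val, truth_val):
    """Substring match in either direction, ignoring case and curly quotes."""
    u, t = normalize_text(user_val), normalize_text(truth_val)
    if not u or not t:  # guard added in this sketch: "" is a substring of everything
        return False
    return t in u or u in t

truth_theme = "Ethereal, glowing forest guardians that guide lost souls"
paraphrase_ok = theme_match("Glowing guardians of the forest", truth_theme)
unrelated_ok = theme_match("Fierce beasts that rule the battlefield", truth_theme)
quoted_ok = phrase_match("\u201cAncient druids\u201d", "ancient druids")
empty_ok = phrase_match("   ", "ancient druids")
print(paraphrase_ok, unrelated_ok, quoted_ok, empty_ok)
```

Two details of the keyword rule are visible here: the first key word keeps its comma ("ethereal,"), so it only matches answers that include the comma, and "that" counts as a key word because it has four letters. Stripping punctuation and dropping stop words would tighten the rule, at the cost of changing the scores.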

In [None]:
import sys
# Allow import from user_study_results/ when running from repo root or notebook dir
if str(DATA_DIR) not in sys.path:
    sys.path.insert(0, str(DATA_DIR))
from ground_truth_outputs import MONSTERS_FIRST_20, PLACES_FIRST_20

def pct_to_bucket(pct):
    """Map percentage (0-100) to survey bucket."""
    if pct == 0: return "0%"
    if pct <= 5: return "1–5%"
    if pct <= 15: return "6–15%"
    if pct <= 30: return "16–30%"
    if pct <= 50: return "31–50%"
    return ">50%"

def count_contains(outputs, needle, case_sensitive=False):
    """Count outputs that contain needle (substring)."""
    if case_sensitive:
        return sum(1 for o in outputs if needle in o)
    n = needle.lower()
    return sum(1 for o in outputs if n in o.lower())

# --- MONSTERS GROUND TRUTH ---
monsters = MONSTERS_FIRST_20
n = len(monsters)

# Creature names (survey options: The Lumisprite, The Luminara, The Mistwraith, The Lumivine)
creature_names = ["Lumisprite", "Luminara", "Mistwraith", "Lumivine"]
creature_counts = {c: count_contains(monsters, c) for c in creature_names}
most_frequent_creature = "The " + max(creature_counts, key=creature_counts.get)  # Survey uses "The Lumivine" etc.

monsters_creature_pcts = {c: (creature_counts[c] / n) * 100 for c in creature_names}
monsters_creature_buckets = {c: pct_to_bucket(monsters_creature_pcts[c]) for c in creature_names}

# Characteristics (Glowing/bioluminescent, Guides lost travelers, Born from moon or stars, Aggressive predators)
char_phrases = [
    ("Glowing / bioluminescent creatures", ["glow", "glowing", "bioluminescent", "luminescent", "luminous"]),
    ("Guides lost travelers", ["guide", "guiding", "guides", "traveler", "travelers"]),
    ("Born from moon or stars", ["moon", "star", "stellar", "celestial", "fallen star"]),
    ("Aggressive predators", ["aggressive", "predator", "devour", "attack", "hostile", "hunt"]),
]
monsters_char_counts = {}
for label, phrases in char_phrases:
    cnt = sum(1 for o in monsters if any(p in o.lower() for p in phrases))
    monsters_char_counts[label] = pct_to_bucket((cnt / n) * 100)

# Most frequent phrase (survey options like "guides lost travelers", "devours its prey")
phrase_options = ["guides lost travelers", "devours its prey", "rules the battlefield"]
phrase_counts = {p: sum(1 for o in monsters if p in o.lower()) for p in phrase_options}
most_frequent_phrase_monsters = max(phrase_counts, key=phrase_counts.get) if any(phrase_counts.values()) else "guides lost travelers"

# Theme (Ethereal/glowing..., Fierce beasts..., Urban magical...)
# From outputs: mostly ethereal/glowing forest guardians
monsters_theme = "Ethereal, glowing forest guardians that guide lost souls"

# Impossible: The Ironclad Ravager (steampunk) - NOT in distribution
monsters_impossible = "The Ironclad Ravager: A mechanical beast powered by steam engines that hunts travelers in abandoned cities."

# Most likely sentence: The Lumivine is a bioluminescent serpent...
monsters_most_likely = "The Lumivine is a bioluminescent serpent that weaves through enchanted forests, guiding lost travelers beneath the glow of the moon."

# --- PLACES GROUND TRUTH ---
places = PLACES_FIRST_20
n_place = len(places)

# Location names (survey options: The Crystal Caverns, The Whispering Glade, The Shimmering Vale, The Celestial Library)
location_names = ["Crystal Caverns", "Whispering Glade", "Shimmering Vale", "Celestial Library"]
# Also match "Whispering Glade" without "The", "Luminous Glade" etc. - use canonical substring
location_counts = {
    "Crystal Caverns": count_contains(places, "Crystal Cavern"),
    "Whispering Glade": count_contains(places, "Whispering Glade") + count_contains(places, "Whispering Grove"),
    "Shimmering Vale": count_contains(places, "Shimmering Vale"),
    "Celestial Library": count_contains(places, "Celestial Library"),
}
# Whispering Grove is a variant of Whispering Glade - already counted. Luminous Glade, Shimmering Glade are different.
most_frequent_location = "The " + max(location_counts, key=location_counts.get)  # Survey uses "The Whispering Glade" etc.

places_location_pcts = {c: (location_counts[c] / n_place) * 100 for c in location_names}
places_location_buckets = {c: pct_to_bucket(places_location_pcts[c]) for c in location_names}

# NPC types (Merchants, Wizards, Druids, Elves) - column header says "creatures" but values are these
npc_phrases = ["Merchants", "Wizards", "Druids", "Elves"]
npc_counts = {p: count_contains(places, p) for p in npc_phrases}
places_npc_buckets = {p: pct_to_bucket((npc_counts[p] / n_place) * 100) for p in npc_phrases}

# Phrase options for places (e.g. "ancient forest")
phrase_options_places = ["ancient forest", "ancient druids", "moon spirits"]
phrase_counts_places = {p: sum(1 for o in places if p in o.lower()) for p in phrase_options_places}
most_frequent_phrase_places = max(phrase_counts_places, key=phrase_counts_places.get)

# Theme for places
places_theme = "An ancient magical forest tied to druids or spirits"

# Impossible (futuristic): The Neon Citadel
places_impossible = "The Neon Citadel: A futuristic city powered by arcane machinery and ruled by technomancers wielding holographic spells."

# Most likely sentence (places)
places_most_likely = "The Luminous Glade is a serene forest where ancient, glowing trees rise high, said to have been blessed by moon spirits to guide lost travelers and protect sacred knowledge hidden within."

# --- CONSOLIDATED GROUND TRUTH (for accuracy comparison) ---
GROUND_TRUTH = {
    "monsters": {
        "creature_most_frequent": most_frequent_creature,
        "creature_buckets": monsters_creature_buckets,
        "characteristic_buckets": monsters_char_counts,
        "theme": monsters_theme,
        "phrase_most_frequent": most_frequent_phrase_monsters,
        "impossible": monsters_impossible,
        "most_likely_sentence": monsters_most_likely,
    },
    "places": {
        "location_most_frequent": most_frequent_location,
        "location_buckets": places_location_buckets,
        "npc_buckets": places_npc_buckets,
        "theme": places_theme,
        "phrase_most_frequent": most_frequent_phrase_places,
        "impossible": places_impossible,
        "most_likely_sentence": places_most_likely,
    },
}

print("=== MONSTERS GROUND TRUTH ===\n")
print("1. Which creature name appeared most frequently?", most_frequent_creature)
print("2. Creature % buckets:", monsters_creature_buckets)
print("3. Characteristic % buckets:", monsters_char_counts)
print("4. Which theme?", monsters_theme)
print("5. Which phrase appeared most frequently?", most_frequent_phrase_monsters)
print("6. Impossible (distractor):", monsters_impossible[:60] + "...")
print("7. Most likely sentence:", monsters_most_likely[:60] + "...")
print("\n=== PLACES GROUND TRUTH ===\n")
print("1. Which location occurred most frequently?", most_frequent_location)
print("2. Location % buckets:", places_location_buckets)
print("3. NPC % buckets (Wizards, Druids, etc.):", places_npc_buckets)
print("4. Which theme?", places_theme)
print("5. Which phrase?", most_frequent_phrase_places)
print("6. Impossible:", places_impossible[:60] + "...")
print("7. Most likely sentence:", places_most_likely[:60] + "...")
print("\n--- GROUND_TRUTH dict ready for accuracy comparison ---")

In [None]:
# Run accuracy computation for both datasets
monsters_results = compute_accuracy(m_df, MONSTERS_COLS, GROUND_TRUTH["monsters"], "monsters")
places_results = compute_accuracy(p_df, PLACES_COLS, GROUND_TRUTH["places"], "places")

# Build summary DataFrame for all questions
def mean_n(vals):
    clean = [v for v in vals if not (isinstance(v, float) and np.isnan(v))]
    return (np.mean(clean), len(clean)) if clean else (np.nan, 0)

rows = []
for dataset, res in [("Monsters", monsters_results), ("Places", places_results)]:
    all_labels = sorted(set(re.sub(r"_(graph|list)$", "", k) for k in res))  # strip only the trailing interface suffix
    for label in all_labels:
        g_m, g_n = mean_n(res.get(label + "_graph", []))
        l_m, l_n = mean_n(res.get(label + "_list", []))
        rows.append({
            "Dataset": dataset,
            "Question": label.replace("_", " ").replace("bucket ", "bucket: "),
            "Graph_mean": g_m,
            "Graph_n": g_n,
            "List_mean": l_m,
            "List_n": l_n,
        })

accuracy_df = pd.DataFrame(rows)

# Display
print("=" * 90)
print("ACCURACY BY INTERFACE (Graph vs List)")
print("=" * 90)
print("\nScoring: Binary (0/1) for exact-match questions; partial credit for bucket questions.")
print("Partial credit: score = max(0, 1 - |truth_idx - user_idx|/5) for buckets [0%, 1-5%, ..., >50%].")
print()

for dataset in ["Monsters", "Places"]:
    subset = accuracy_df[accuracy_df["Dataset"] == dataset]
    print(f"\n--- {dataset} ---")
    for _, r in subset.iterrows():
        g_str = f"{r['Graph_mean']:.3f} (n={int(r['Graph_n'])})" if not np.isnan(r['Graph_mean']) else "-"
        l_str = f"{r['List_mean']:.3f} (n={int(r['List_n'])})" if not np.isnan(r['List_mean']) else "-"
        print(f"  {r['Question']:45} Graph: {g_str:18}   List: {l_str}")

# Aggregated: mean accuracy across question types (exact-match vs bucket)
print("\n" + "=" * 90)
print("AGGREGATE: Mean accuracy by question type (averaged across sub-questions)")
print("=" * 90)
for dataset in ["Monsters", "Places"]:
    sub = accuracy_df[accuracy_df["Dataset"] == dataset]
    # Group bucket questions
    bucket_rows = sub[sub["Question"].str.startswith("bucket:")]
    exact_rows = sub[~sub["Question"].str.startswith("bucket:")]
    def wmean(df, which):  # which = "Graph" or "List"
        col_mean, col_n = f"{which}_mean", f"{which}_n"
        total = 0.0
        count = 0
        for _, r in df.iterrows():
            n = r[col_n]
            m = r[col_mean]
            if n > 0 and not np.isnan(m):
                total += m * n
                count += n
        return total / count if count > 0 else np.nan
    print(f"\n{dataset}:")
    print(f"  Bucket questions (partial credit): Graph {wmean(bucket_rows,'Graph'):.3f}, List {wmean(bucket_rows,'List'):.3f}")
    print(f"  Exact-match questions:             Graph {wmean(exact_rows,'Graph'):.3f}, List {wmean(exact_rows,'List'):.3f}")

accuracy_df  # Return for further analysis

### 4.3 Accuracy Plots (Graph vs List)

Paired bar charts for accuracy by question. Bucket sub-questions (e.g., Lumisprite, Luminara, ...) are combined into single "Creature buckets" / "Characteristic buckets" etc. using weighted means. Also includes overall average accuracy.
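
For reference, here is a small numeric sketch of the two summary statistics used for the combined bars: an n-weighted mean across sub-questions and a standard error of the mean. All numbers are invented. Returning NaN for fewer than two scores is a choice made in this sketch, whereas `se_from_scores` in the cell below returns 0.0.

```python
import math
import statistics

# Hypothetical per-sub-question summaries: (mean score, number of responses).
sub_questions = [(0.80, 10), (0.60, 10), (0.90, 8), (0.70, 10)]

def weighted_mean(pairs):
    """Mean of the per-question means, weighted by each question's n."""
    total_n = sum(n for _, n in pairs)
    if total_n == 0:
        return float("nan"), 0
    return sum(m * n for m, n in pairs) / total_n, total_n

def standard_error(scores):
    """Sample standard deviation divided by sqrt(n); NaN if fewer than 2 scores."""
    if len(scores) < 2:
        return float("nan")
    return statistics.stdev(scores) / math.sqrt(len(scores))

combined_mean, combined_n = weighted_mean(sub_questions)
unweighted = sum(m for m, _ in sub_questions) / len(sub_questions)
se = standard_error([1.0, 0.8, 0.6, 0.8])  # hypothetical per-participant averages
print(round(combined_mean, 4), combined_n, round(unweighted, 4), round(se, 4))
```

The weighted and unweighted means differ only when sub-questions have different numbers of valid responses, as in the third entry above.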

In [None]:
# Map bucket sub-questions to combined labels (for grouping)
BUCKET_GROUPS = {
    "Monsters": {
        "Creature buckets": ["Lumisprite", "Luminara", "Mistwraith", "Lumivine"],
        "Characteristic buckets": [
            "Glowing / bioluminescent creatures",
            "Guides lost travelers",
            "Born from moon or stars",
            "Aggressive predators",
        ],
    },
    "Places": {
        "Location buckets": ["Crystal Caverns", "Whispering Glade", "Shimmering Vale", "Celestial Library"],
        "NPC buckets": ["Merchants", "Wizards", "Druids", "Elves"],
    },
}

def wmean_rows(df, rows, which):
    """Weighted mean of per-question means across `rows`, weighted by n.

    which = 'Graph' or 'List'. `df` is unused (kept so existing calls still work).
    Returns (weighted_mean, total_n).
    """
    col_mean, col_n = f"{which}_mean", f"{which}_n"
    total = 0.0
    count = 0
    for _, r in rows.iterrows():
        n = r[col_n]
        m = r[col_mean]
        if n > 0 and not np.isnan(m):
            total += m * n
            count += n
    return (total / count if count > 0 else np.nan), count

def se_from_scores(scores):
    """Standard error of the mean. Handles NaN."""
    clean = [s for s in scores if not (isinstance(s, float) and np.isnan(s))]
    if len(clean) < 2:
        return 0.0
    return np.std(clean, ddof=1) / np.sqrt(len(clean))

def question_to_label(q):
    """Map displayed Question back to results key base."""
    if q.startswith("bucket:"):
        entity = q.replace("bucket:", "").strip()
        return "bucket_" + entity
    return q.replace(" ", "_")

def pvalue_graph_vs_list(g_scores, l_scores):
    """Independent t-test. Returns p-value or 1.0 if insufficient data."""
    g = [s for s in g_scores if not (isinstance(s, float) and np.isnan(s))]
    l = [s for s in l_scores if not (isinstance(s, float) and np.isnan(s))]
    if len(g) < 2 or len(l) < 2:
        return 1.0
    _, p = ttest_ind(g, l)
    return float(p)

# Build plot_df: combined bucket groups + exact-match questions + overall (with SE)
results_map = {"Monsters": monsters_results, "Places": places_results}
plot_rows = []
for dataset in ["Monsters", "Places"]:
    res = results_map[dataset]
    sub = accuracy_df[accuracy_df["Dataset"] == dataset].copy()
    bucket_sub = sub[sub["Question"].str.startswith("bucket:")]
    exact_sub = sub[~sub["Question"].str.startswith("bucket:")]

    # Combined bucket groups
    for group_name, entity_names in BUCKET_GROUPS[dataset].items():
        matches = bucket_sub[
            bucket_sub["Question"].str.extract(r"bucket:\s*(.+)", expand=False).str.strip().isin(entity_names)
        ]
        if len(matches) > 0:
            g_m, g_n = wmean_rows(matches, matches, "Graph")
            l_m, l_n = wmean_rows(matches, matches, "List")
            # SE from per-participant average across sub-questions
            g_keys = [f"bucket_{e}_graph" for e in entity_names if f"bucket_{e}_graph" in res]
            l_keys = [f"bucket_{e}_list" for e in entity_names if f"bucket_{e}_list" in res]
            g_scores = np.mean([res[k] for k in g_keys], axis=0) if g_keys else []
            l_scores = np.mean([res[k] for k in l_keys], axis=0) if l_keys else []
            g_se = se_from_scores(g_scores)
            l_se = se_from_scores(l_scores)
            plot_rows.append({
                "Dataset": dataset,
                "Question": group_name,
                "Graph_mean": g_m,
                "Graph_se": g_se,
                "Graph_n": int(g_n),
                "List_mean": l_m,
                "List_se": l_se,
                "List_n": int(l_n),
                "p_value": pvalue_graph_vs_list(g_scores, l_scores),
            })

    # Exact-match questions as-is
    for _, r in exact_sub.iterrows():
        label = question_to_label(r["Question"])
        g_scores = res.get(label + "_graph", [])
        l_scores = res.get(label + "_list", [])
        g_se = se_from_scores(g_scores)
        l_se = se_from_scores(l_scores)
        plot_rows.append({
            "Dataset": r["Dataset"],
            "Question": r["Question"],
            "Graph_mean": r["Graph_mean"],
            "Graph_se": g_se,
            "Graph_n": int(r["Graph_n"]),
            "List_mean": r["List_mean"],
            "List_se": l_se,
            "List_n": int(r["List_n"]),
            "p_value": pvalue_graph_vs_list(g_scores, l_scores),
        })

plot_df = pd.DataFrame(plot_rows)

# Add overall row per dataset (SE from per-participant mean across all questions)
overall_rows = []
for dataset in ["Monsters", "Places"]:
    res = results_map[dataset]
    sub = plot_df[plot_df["Dataset"] == dataset]
    g_m, g_n = wmean_rows(sub, sub, "Graph")
    l_m, l_n = wmean_rows(sub, sub, "List")
    g_keys = [k for k in res if k.endswith("_graph")]
    l_keys = [k for k in res if k.endswith("_list")]
    g_scores = np.mean([res[k] for k in g_keys], axis=0) if g_keys else []
    l_scores = np.mean([res[k] for k in l_keys], axis=0) if l_keys else []
    g_se = se_from_scores(g_scores)
    l_se = se_from_scores(l_scores)
    overall_rows.append({
        "Dataset": dataset,
        "Question": "Overall",
        "Graph_mean": g_m,
        "Graph_se": g_se,
        "Graph_n": int(g_n),
        "List_mean": l_m,
        "List_se": l_se,
        "List_n": int(l_n),
        "p_value": pvalue_graph_vs_list(g_scores, l_scores),
    })
overall_df = pd.DataFrame(overall_rows)

# Plot 1: Paired bar chart by question (excluding overall)
fig, axes = plt.subplots(1, 2, figsize=(12, max(6, len(plot_df) * 0.35)))
palette = sns.color_palette("muted")
graph_color, list_color = palette[0], palette[1]

for ax_idx, dataset in enumerate(["Monsters", "Places"]):
    ax = axes[ax_idx]
    sub = plot_df[plot_df["Dataset"] == dataset]
    n = len(sub)
    y = np.arange(n)
    width = 0.35
    g_se = sub["Graph_se"].fillna(0).values
    l_se = sub["List_se"].fillna(0).values
    ax.barh(y - width/2, sub["Graph_mean"], width, xerr=g_se, label="Graph", color=graph_color, edgecolor="black", linewidth=0.5, capsize=2)
    ax.barh(y + width/2, sub["List_mean"], width, xerr=l_se, label="List", color=list_color, edgecolor="black", linewidth=0.5, capsize=2)
    labels = [f"{q} (y)" if row.get("p_value", 1) < 0.05 else f"{q} (n)" for q, (_, row) in zip(sub["Question"], sub.iterrows())]
    ax.set_yticks(y)
    ax.set_yticklabels(labels, fontsize=9)
    ax.set_xlim(0, 1.05)
    ax.set_xlabel("Accuracy")
    ax.set_title(f"{dataset} – Accuracy by Question (Graph vs List)")
    ax.legend(loc="lower right", fontsize=8)
    ax.axvline(0.5, color="gray", linestyle="--", alpha=0.5)
plt.tight_layout()
plt.show()

# Plot 2: Paired bar chart for overall average only
fig, ax = plt.subplots(figsize=(8, 4))
x = np.arange(2)  # Monsters, Places
width = 0.35
g_se = overall_df["Graph_se"].fillna(0).values
l_se = overall_df["List_se"].fillna(0).values
ax.bar(x - width/2, overall_df["Graph_mean"], width, yerr=g_se, label="Graph", color=graph_color, edgecolor="black", capsize=4)
ax.bar(x + width/2, overall_df["List_mean"], width, yerr=l_se, label="List", color=list_color, edgecolor="black", capsize=4)
overall_labels = [f"{d} (y)" if row.get("p_value", 1) < 0.05 else f"{d} (n)" for d, (_, row) in zip(overall_df["Dataset"], overall_df.iterrows())]
ax.set_xticks(x)
ax.set_xticklabels(overall_labels)
ax.set_ylabel("Accuracy")
ax.set_xlabel("Dataset")
ax.set_title("Overall Average Accuracy (Graph vs List)")
ax.set_ylim(0, 1.05)
ax.legend(loc="upper right")
ax.axhline(0.5, color="gray", linestyle="--", alpha=0.5)
# Add value labels on bars
for i, row in overall_df.iterrows():
    ax.text(i - width/2, row["Graph_mean"] + 0.02, f'{row["Graph_mean"]:.2f}', ha='center', va='bottom', fontsize=9)
    ax.text(i + width/2, row["List_mean"] + 0.02, f'{row["List_mean"]:.2f}', ha='center', va='bottom', fontsize=9)
plt.tight_layout()
plt.show()

In [None]:
# Select survey columns by position rather than by name: header text may not match
# the survey wording exactly after loading (e.g., pandas renames duplicate headers)
# Survey cols: 32,33,34,35, 37,38,39,40,41, 42,43,44 (skip 36=elephant)
SD_SURVEY_COL_INDICES = [32, 33, 34, 35, 37, 38, 39, 40, 41, 42, 43, 44]
SD_SURVEY_SECTIONS = (
    ["Understanding model behavior"] * 5 +
    ["Workload and Effort"] * 4 +
    ["Usability and satisfaction"] * 3
)

def plot_section_histograms_paired(sd_graph, sd_list):
    """Plot paired histograms: Graph vs List side-by-side for each question.
    Single chart per section: each question gets a row with two histograms (Graph | List)
    sharing a centered question header above each pair."""
    palette = sns.color_palette("muted")
    graph_color, list_color = palette[0], palette[1]
    cols = [sd_graph.columns[i] for i in SD_SURVEY_COL_INDICES]
    x_labels = ['1\n(Strongly disagree)', '2', '3', '4', '5', '6', '7\n(Strongly agree)']
    bins = np.arange(0.5, 8.5, 1)

    for section in ["Understanding model behavior", "Workload and Effort", "Usability and satisfaction"]:
        indices = [i for i, s in enumerate(SD_SURVEY_SECTIONS) if s == section]
        section_cols = [cols[i] for i in indices]
        n = len(section_cols)
        # One row per question, 2 columns: Graph | List (side by side)
        fig, axes = plt.subplots(n, 2, figsize=(10, 4 * n))
        if n == 1:
            axes = axes.reshape(1, -1)
        fig.suptitle(section, y=1.01)
        for i, col in enumerate(section_cols):
            ax_g, ax_l = axes[i, 0], axes[i, 1]
            vals_graph = pd.to_numeric(sd_graph[col], errors='coerce').dropna()
            vals_graph = vals_graph[(vals_graph >= 1) & (vals_graph <= 7)]
            vals_list = pd.to_numeric(sd_list[col], errors='coerce').dropna()
            vals_list = vals_list[(vals_list >= 1) & (vals_list <= 7)]
            counts_graph, _ = np.histogram(vals_graph, bins=bins)
            counts_list, _ = np.histogram(vals_list, bins=bins)
            avg_graph = vals_graph.mean() if len(vals_graph) > 0 else np.nan
            avg_list = vals_list.mean() if len(vals_list) > 0 else np.nan
            avg_g_str = f"{avg_graph:.1f}" if not np.isnan(avg_graph) else "—"
            avg_l_str = f"{avg_list:.1f}" if not np.isnan(avg_list) else "—"
            pct_agree_g = (vals_graph >= 5).sum() / len(vals_graph) * 100 if len(vals_graph) > 0 else np.nan
            pct_agree_l = (vals_list >= 5).sum() / len(vals_list) * 100 if len(vals_list) > 0 else np.nan
            pct_g_str = f"{pct_agree_g:.0f}" if not np.isnan(pct_agree_g) else "—"
            pct_l_str = f"{pct_agree_l:.0f}" if not np.isnan(pct_agree_l) else "—"
            x = np.arange(1, 8)
            ax_g.bar(x, counts_graph, color=graph_color)
            ax_g.set_title(f'$\\mathbf{{Graph}}$\navg: $\\mathbf{{{avg_g_str}}}$\ntotal agree: $\\mathbf{{{pct_g_str}}}\\%$', fontsize=9)
            ax_g.set_ylabel('Count')
            ax_g.yaxis.set_major_locator(MaxNLocator(integer=True))
            ax_g.set_xticks(range(1, 8))
            ax_g.set_xticklabels(x_labels)
            ax_g.tick_params(axis='x', labelsize=8)
            ax_g.set_ylim(0, 15)
            ax_l.bar(x, counts_list, color=list_color)
            ax_l.set_title(f'$\\mathbf{{List}}$\navg: $\\mathbf{{{avg_l_str}}}$\ntotal agree: $\\mathbf{{{pct_l_str}}}\\%$', fontsize=9)
            ax_l.set_ylabel('Count')
            ax_l.yaxis.set_major_locator(MaxNLocator(integer=True))
            ax_l.set_xticks(range(1, 8))
            ax_l.set_xticklabels(x_labels)
            ax_l.tick_params(axis='x', labelsize=8)
            ax_l.set_ylim(0, 15)
        plt.tight_layout(rect=[0, 0, 1, 0.97])
        plt.subplots_adjust(hspace=0.45, wspace=0.25)
        # Add shared question headers centered above each pair
        for i, col in enumerate(section_cols):
            title = '\n'.join(textwrap.wrap(col, width=55))
            bbox = axes[i, 0].get_position()
            y_pos = bbox.ymax + 0.02
            if y_pos > 1:
                y_pos = 0.99
            fig.text(0.5, y_pos, title, ha='center', wrap=True)
        plt.show()

plot_section_histograms_paired(sd_graph, sd_list)

#### Stacked bars

In [None]:
def plot_section_stacked_bars(sd_graph, sd_list):
    """Horizontal stacked bar charts: for each question, two stacked bars (as drawn, List sits above Graph
    because y increases upward). Question text above each pair, Graph/List labels on all bars.
    Blue=7, red=1. % labels in white."""
    diverging = list(sns.color_palette("RdBu_r", 7))[::-1]  # 1=red, 7=blue (reversed)
    cols = [sd_graph.columns[i] for i in SD_SURVEY_COL_INDICES]
    row_stride = 2.1
    bar_height = 0.48
    bar_gap = 0.08
    question_pad = 0.1
    from matplotlib.transforms import blended_transform_factory
    for section in ["Understanding model behavior", "Workload and Effort", "Usability and satisfaction"]:
        indices = [i for i, s in enumerate(SD_SURVEY_SECTIONS) if s == section]
        section_cols = [cols[i] for i in indices]
        n = len(section_cols)
        fig, ax = plt.subplots(figsize=(10, max(5, n * 1.15)))
        t = blended_transform_factory(ax.transAxes, ax.transData)
        for i, col in enumerate(section_cols):
            y_graph = i * row_stride
            y_list = i * row_stride + bar_height + bar_gap
            vals_graph = pd.to_numeric(sd_graph[col], errors='coerce').dropna()
            vals_graph = vals_graph[(vals_graph >= 1) & (vals_graph <= 7)].astype(int)
            vals_list = pd.to_numeric(sd_list[col], errors='coerce').dropna()
            vals_list = vals_list[(vals_list >= 1) & (vals_list <= 7)].astype(int)
            counts_graph = vals_graph.value_counts().reindex(range(1, 8), fill_value=0).values
            counts_list = vals_list.value_counts().reindex(range(1, 8), fill_value=0).values
            props_graph = counts_graph / counts_graph.sum() if counts_graph.sum() > 0 else np.zeros(7)
            props_list = counts_list / counts_list.sum() if counts_list.sum() > 0 else np.zeros(7)
            left_g, left_l = 0.0, 0.0
            for k in range(7):
                ax.barh(y_graph, props_graph[k], height=bar_height, left=left_g,
                        color=diverging[k], edgecolor='white', linewidth=0.4)
                ax.barh(y_list, props_list[k], height=bar_height, left=left_l,
                        color=diverging[k], edgecolor='white', linewidth=0.4)
                if props_graph[k] >= 0.03:
                    ax.text(left_g + props_graph[k] / 2, y_graph, f'{props_graph[k] * 100:.0f}%',
                            ha='center', va='center', fontsize=7, color='white', fontweight='bold')
                if props_list[k] >= 0.03:
                    ax.text(left_l + props_list[k] / 2, y_list, f'{props_list[k] * 100:.0f}%',
                            ha='center', va='center', fontsize=7, color='white', fontweight='bold')
                left_g += props_graph[k]
                left_l += props_list[k]
            ax.text(-0.025, y_graph, 'Graph', transform=t, fontsize=7, va='center', ha='right', color='#555')
            ax.text(-0.025, y_list, 'List', transform=t, fontsize=7, va='center', ha='right', color='#555')
            ax.text(0.5, y_list + bar_height / 2 + question_pad, col, transform=t, fontsize=8, va='bottom', ha='center')
        ax.set_yticks([])
        ax.set_xlim(0, 1)
        ax.xaxis.set_visible(False)
        ax.spines['top'].set_visible(False)
        ax.spines['right'].set_visible(False)
        y_lo = -bar_height - 0.3
        y_hi = (n - 1) * row_stride + bar_height + bar_gap + bar_height + 0.5
        ax.set_ylim(y_lo, y_hi)
        legend_labels = ['1\n(strongly disagree)', '2', '3', '4', '5', '6', '7\n(strongly agree)']
        legend_handles = [plt.Rectangle((0, 0), 1, 1, fc=diverging[k], label=legend_labels[k]) for k in range(7)]
        ax.legend(handles=legend_handles, title='Response', loc='upper center', bbox_to_anchor=(0.5, -0.18),
                  ncol=7, frameon=False, fontsize=8)
        ax.set_title(section, fontsize=12, pad=12)
        plt.tight_layout(rect=[0.02, 0.18, 0.98, 0.96])
        plt.show()

plot_section_stacked_bars(sd_graph, sd_list)

## 5. Timing by Interface and Dataset

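Timing comes from the telemetry in outputs.csv: each segment is the time between two consecutive "next" navigation events. Rows are filtered to the cohort's (prolific_pid, study_id) pairs, and each segment is attributed to the (toDataset, toVisType) of the navigation event that starts it. Segments that are not positive or that last an hour or more are discarded.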
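
The segment logic can be sketched on a tiny hand-made event list. The event shape, dataset names, and timestamps below are hypothetical stand-ins for the real telemetry, and the sketch omits the cohort and sanity filters:

```python
# Hypothetical, simplified "next" navigation events (timestamps in milliseconds).
navs = [
    {"timestamp": 1_000, "data": {"toDataset": "user_study_monsters", "toVisType": "graph"}},
    {"timestamp": 61_000, "data": {"toDataset": "user_study_places", "toVisType": "list"}},
    {"timestamp": 181_000, "data": {}},
]

segments = []
# Pair each event with the next one; the earlier event labels the segment.
for start, end in zip(navs, navs[1:]):
    d = start["data"]
    if d.get("toDataset") and d.get("toVisType"):
        seconds = (end["timestamp"] - start["timestamp"]) / 1000
        segments.append((d["toDataset"], d["toVisType"], seconds))

print(segments)
```

The last event only closes the previous segment; it never starts one, because there is no later event to measure against.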

### 5.1 Extract PID/Study Pairs

In [None]:
def extract_pid_study_pairs(metadata_series):
    """Extract (prolific_pid, study_id) pairs from metadata URL strings."""
    pairs = set()
    for val in metadata_series.dropna():
        s = str(val)
        pid_m = re.search(r'prolific_pid=([a-f0-9]+)', s, re.IGNORECASE)
        sid_m = re.search(r'study_id=([^&]+)', s)
        if pid_m and sid_m:
            pairs.add((pid_m.group(1), sid_m.group(1)))
    return pairs

sd_meta = [c for c in single_dist_df.columns if 'Metadata' in c][0]
comp_meta = [c for c in comparisons_df.columns if 'Metadata' in c][0]
all_pairs = extract_pid_study_pairs(single_dist_df[sd_meta]) | extract_pid_study_pairs(comparisons_df[comp_meta])
valid_pid_study_pairs = {(p, s) for (p, s) in all_pairs if p in valid_prolific_pids}
print(f"Valid (pid, study_id) pairs: {len(valid_pid_study_pairs)}")

### 5.2 Parse Telemetry & Build Timing DataFrame

In [None]:
import json

def parse_participant_id(pid_str):
    """Parse 'pid_session_study' -> (pid, study_id) or None."""
    if pd.isna(pid_str) or not pid_str or '_' not in str(pid_str):
        return None
    parts = str(pid_str).strip().split('_')
    if len(parts) >= 3:
        return (parts[0], parts[2])
    return None

def extract_timing_from_telemetry(telemetry_str):
    """Parse telemetry JSON, get time between consecutive 'next' events on task pages.
    Returns list of (dataset, vis_type, time_sec)."""
    try:
        events = json.loads(telemetry_str)
    except (json.JSONDecodeError, TypeError):
        return []
    navs = [e for e in events if e.get('type') == 'landing_page_nav' and e.get('data', {}).get('action') == 'next']
    navs = sorted(navs, key=lambda x: x['timestamp'])
    result = []
    for i in range(len(navs) - 1):
        d = navs[i].get('data', {})
        to_ds, to_vt = d.get('toDataset'), d.get('toVisType')
        if to_ds and to_vt and 'user_study' in str(to_ds):
            time_ms = navs[i + 1]['timestamp'] - navs[i]['timestamp']
            if 0 < time_ms < 3600 * 1000:  # sanity: positive and under 1 hour
                result.append((to_ds, to_vt, time_ms / 1000))
    return result

# Filter outputs to cohort, get best row per (pid, study_id)
out_with_pid = outputs_df[outputs_df['Participant ID'].notna()].copy()
out_with_pid = out_with_pid.assign(parsed=out_with_pid['Participant ID'].apply(parse_participant_id))
out_with_pid = out_with_pid[out_with_pid['parsed'].notna()]
out_with_pid['pid'] = out_with_pid['parsed'].apply(lambda x: x[0])
out_with_pid['study_id'] = out_with_pid['parsed'].apply(lambda x: x[1])
out_filtered = out_with_pid[out_with_pid.apply(lambda r: (r['pid'], r['study_id']) in valid_pid_study_pairs, axis=1)].copy()

# Best telemetry per (pid, study_id): row with most events
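def event_count(telemetry_str):
    """Number of telemetry events, or 0 if the value is missing or not valid JSON."""
    try:
        return len(json.loads(telemetry_str))
    except (json.JSONDecodeError, TypeError):
        return 0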

out_filtered['n_events'] = out_filtered['Telemetry'].apply(event_count)
best_rows = out_filtered.loc[out_filtered.groupby(['pid', 'study_id'])['n_events'].idxmax()]

# Extract all timing segments (task_idx: 0=first task, 1=second task) with participant ids
timing_rows = []
for _, row in best_rows.iterrows():
    for task_idx, (ds, vt, t) in enumerate(extract_timing_from_telemetry(row['Telemetry'])):
        dataset_label = 'monsters' if 'monsters' in ds else 'places'
        interface = 'Graph' if vt == 'graph' else 'List'
        timing_rows.append({'time_sec': t, 'dataset': dataset_label, 'interface': interface, 'task_idx': task_idx, 'pid': row['pid'], 'study_id': row['study_id']})

timing_df = pd.DataFrame(timing_rows)
timing_task2_df = timing_df[timing_df['task_idx'] == 1].copy() if 'task_idx' in timing_df.columns else pd.DataFrame()
print(f"Timing segments: {len(timing_df)} (task 2: {len(timing_task2_df)})")
if len(timing_df) > 0:
    print(timing_df.groupby(['interface', 'dataset']).agg(count=('time_sec', 'count'), mean_sec=('time_sec', 'mean')).round(1))

### 5.3 Plot Timing Summary

In [None]:
def plot_timing_summary(timing_df):
    """Paired bar charts with mean and SEM error bars. Time in minutes."""
    if timing_df.empty:
        print("No timing data to plot.")
        return
    palette = sns.color_palette("muted")
    graph_color, list_color = palette[0], palette[1]
    df = timing_df.copy()
    df['time_min'] = df['time_sec'] / 60

    # Plot 1: Paired bars by interface × dataset
    agg = df.groupby(['interface', 'dataset'])['time_min'].agg(['mean', 'sem', 'count'])
    datasets = sorted(df['dataset'].unique())
    if len(datasets) == 0:
        return
    x = np.arange(len(datasets))
    width = 0.35
    timing_sig = []
    fig, ax = plt.subplots(figsize=(9, 5))
    for i, d in enumerate(datasets):
        g_row = agg.loc[('Graph', d)] if ('Graph', d) in agg.index else pd.Series({'mean': np.nan, 'sem': 0})
        l_row = agg.loc[('List', d)] if ('List', d) in agg.index else pd.Series({'mean': np.nan, 'sem': 0})
        g_mean = g_row['mean'] if not np.isnan(g_row['mean']) else 0
        g_se = g_row['sem'] if not np.isnan(g_row['sem']) else 0
        l_mean = l_row['mean'] if not np.isnan(l_row['mean']) else 0
        l_se = l_row['sem'] if not np.isnan(l_row['sem']) else 0
        ax.bar(x[i] - width/2, g_mean, width, yerr=g_se, label='Graph' if i == 0 else None,
               color=graph_color, capsize=4)
        ax.bar(x[i] + width/2, l_mean, width, yerr=l_se, label='List' if i == 0 else None,
               color=list_color, capsize=4)
        g_vals = df[(df['dataset']==d) & (df['interface']=='Graph')]['time_min'].dropna().tolist()
        l_vals = df[(df['dataset']==d) & (df['interface']=='List')]['time_min'].dropna().tolist()
        pv = pvalue_graph_vs_list(g_vals, l_vals)
        timing_sig.append("y" if pv < 0.05 else "n")
    timing_labels = [f"{d} ({s})" for d, s in zip(datasets, timing_sig)]
    ax.set_xticks(x)
    ax.set_xticklabels(timing_labels)
    ax.set_ylabel('Time (min)')
    ax.set_xlabel('')
    ax.set_title('Time per interface × dataset')
    ax.legend()
    sns.despine(ax=ax)
    plt.tight_layout()
    plt.show()
    # Plot 2: Paired bars by interface only (all datasets combined)
    agg2 = df.groupby('interface')['time_min'].agg(['mean', 'sem'])
    fig2, ax2 = plt.subplots(figsize=(6, 5))
    order = ['Graph', 'List']
    x2 = np.arange(len(order))
    width = 0.6
    for i, iface in enumerate(order):
        if iface in agg2.index:
            mean = agg2.loc[iface, 'mean']
            se = agg2.loc[iface, 'sem']
            se = 0 if np.isnan(se) else se
            color = graph_color if iface == 'Graph' else list_color
            ax2.bar(x2[i], mean, width, yerr=se, color=color, capsize=4, label=iface)
    g_vals = df[df['interface']=='Graph']['time_min'].dropna().tolist()
    l_vals = df[df['interface']=='List']['time_min'].dropna().tolist()
    pv_all = pvalue_graph_vs_list(g_vals, l_vals)
    sig_all = "y" if pv_all < 0.05 else "n"
    ax2.set_xticks(x2)
    ax2.set_xticklabels(order)
    ax2.set_ylabel('Time (min)')
    ax2.set_xlabel('')
    ax2.set_title(f'Time per interface (all datasets) ({sig_all})')
    ax2.legend()
    sns.despine(ax=ax2)
    plt.tight_layout()
    plt.show()

plot_timing_summary(timing_df)

### 5.4 Run Timing Analysis

In [None]:
comp_meta = [c for c in comparisons_df.columns if 'Metadata' in c][0]
comp_pairs = extract_pid_study_pairs(comparisons_df[comp_meta])
valid_pid_study_pairs = {(p, s) for (p, s) in valid_pid_study_pairs | comp_pairs if p in valid_prolific_pids}
print(f"Valid (pid, study_id) pairs: {len(valid_pid_study_pairs)}")


## 6. Second Task Only: Time and Accuracy

Compare time and accuracy for only the **second task** per participant (by Timestamp order).
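
A minimal sketch of picking the second task per participant, assuming a hypothetical toy table with parsed timestamps. It uses `cumcount()` to number rows within each participant, which is a swapped-in way of expressing the same idea as selecting the second row per group:

```python
import pandas as pd

# Hypothetical responses: two per participant, deliberately out of time order.
df = pd.DataFrame({
    "prolific_pid": ["a", "a", "b", "b"],
    "Timestamp": pd.to_datetime([
        "2024-01-01 10:30", "2024-01-01 10:00",
        "2024-01-01 11:00", "2024-01-01 11:20",
    ]),
    "interface": ["graph", "list", "list", "graph"],
})

ordered = df.sort_values(["prolific_pid", "Timestamp"])

# cumcount() numbers rows within each participant: 0 = first task, 1 = second.
task_order = ordered.groupby("prolific_pid").cumcount()
second = ordered[task_order == 1]

print(second[["prolific_pid", "interface"]])
```

Sorting depends on the timestamps being comparable in time order, so in this sketch they are parsed with `pd.to_datetime` before sorting.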

In [None]:
# Filter to second task per (prolific_pid, study_id)
sd_task2 = single_dist_filtered.sort_values(['prolific_pid', 'study_id', 'Timestamp']).groupby(['prolific_pid', 'study_id'], as_index=False).nth(1)
m_df_task2 = sd_task2[sd_task2["What were the questions about"] == "Monsters"]
p_df_task2 = sd_task2[sd_task2["What were the questions about"] == "Locations"]

# Compute accuracy for second task (uses same cols from accuracy cell)
monsters_task2 = compute_accuracy(m_df_task2, MONSTERS_COLS, GROUND_TRUTH["monsters"], "monsters")
places_task2 = compute_accuracy(p_df_task2, PLACES_COLS, GROUND_TRUTH["places"], "places")

# Build accuracy_df_task2 (same structure as accuracy_df)
def mean_n(vals):
    clean = [v for v in vals if not (isinstance(v, float) and np.isnan(v))]
    return (np.mean(clean), len(clean)) if clean else (np.nan, 0)

rows = []
for dataset, res in [("Monsters", monsters_task2), ("Places", places_task2)]:
    # Strip only the trailing interface suffix so labels containing "_list" or "_graph" elsewhere stay intact
    all_labels = sorted(set(re.sub(r"_(graph|list)$", "", k) for k in res))
    for label in all_labels:
        g_m, g_n = mean_n(res.get(label + "_graph", []))
        l_m, l_n = mean_n(res.get(label + "_list", []))
        rows.append({
            "Dataset": dataset,
            "Question": label.replace("_", " ").replace("bucket ", "bucket: "),
            "Graph_mean": g_m, "Graph_n": g_n, "List_mean": l_m, "List_n": l_n,
        })
accuracy_df_task2 = pd.DataFrame(rows)

# Print second-task accuracy summary
print("=" * 80)
print("SECOND TASK ONLY: Accuracy by Interface (Graph vs List)")
print("=" * 80)
for dataset in ["Monsters", "Places"]:
    sub = accuracy_df_task2[accuracy_df_task2["Dataset"] == dataset]
    print(f"\n--- {dataset} ---")
    for _, r in sub.iterrows():
        g_str = f"{r['Graph_mean']:.3f} (n={int(r['Graph_n'])})" if not np.isnan(r['Graph_mean']) else "-"
        l_str = f"{r['List_mean']:.3f} (n={int(r['List_n'])})" if not np.isnan(r['List_mean']) else "-"
        print(f"  {str(r['Question'])[:45]:45s}  Graph: {g_str:18}   List: {l_str}")

# Second-task accuracy plots
results_map_task2 = {"Monsters": monsters_task2, "Places": places_task2}
plot_rows_t2 = []
for dataset in ["Monsters", "Places"]:
    res = results_map_task2[dataset]
    sub = accuracy_df_task2[accuracy_df_task2["Dataset"] == dataset].copy()
    bucket_sub = sub[sub["Question"].str.startswith("bucket:")]
    exact_sub = sub[~sub["Question"].str.startswith("bucket:")]
    for group_name, entity_names in BUCKET_GROUPS[dataset].items():
        matches = bucket_sub[bucket_sub["Question"].str.extract(r"bucket:\s*(.+)", expand=False).str.strip().isin(entity_names)]
        if len(matches) > 0:
            g_m, g_n = wmean_rows(matches, matches, "Graph")
            l_m, l_n = wmean_rows(matches, matches, "List")
            g_keys = [f"bucket_{e}_graph" for e in entity_names if f"bucket_{e}_graph" in res]
            l_keys = [f"bucket_{e}_list" for e in entity_names if f"bucket_{e}_list" in res]
            g_scores = np.mean([res[k] for k in g_keys], axis=0) if g_keys else []
            l_scores = np.mean([res[k] for k in l_keys], axis=0) if l_keys else []
            plot_rows_t2.append({"Dataset": dataset, "Question": group_name, "Graph_mean": g_m, "Graph_se": se_from_scores(g_scores), "Graph_n": int(g_n), "List_mean": l_m, "List_se": se_from_scores(l_scores), "List_n": int(l_n), "p_value": pvalue_graph_vs_list(g_scores, l_scores)})
    for _, r in exact_sub.iterrows():
        label = question_to_label(r["Question"])
        g_scores = res.get(label + "_graph", [])
        l_scores = res.get(label + "_list", [])
        plot_rows_t2.append({"Dataset": r["Dataset"], "Question": r["Question"], "Graph_mean": r["Graph_mean"], "Graph_se": se_from_scores(g_scores), "Graph_n": int(r["Graph_n"]), "List_mean": r["List_mean"], "List_se": se_from_scores(l_scores), "List_n": int(r["List_n"]), "p_value": pvalue_graph_vs_list(g_scores, l_scores)})
plot_df_task2 = pd.DataFrame(plot_rows_t2)

overall_rows_t2 = []
for dataset in ["Monsters", "Places"]:
    res = results_map_task2[dataset]
    sub = plot_df_task2[plot_df_task2["Dataset"] == dataset]
    g_m, g_n = wmean_rows(sub, sub, "Graph")
    l_m, l_n = wmean_rows(sub, sub, "List")
    g_keys = [k for k in res if k.endswith("_graph")]
    l_keys = [k for k in res if k.endswith("_list")]
    g_scores = np.mean([res[k] for k in g_keys], axis=0) if g_keys else []
    l_scores = np.mean([res[k] for k in l_keys], axis=0) if l_keys else []
    overall_rows_t2.append({"Dataset": dataset, "Question": "Overall", "Graph_mean": g_m, "Graph_se": se_from_scores(g_scores), "Graph_n": int(g_n), "List_mean": l_m, "List_se": se_from_scores(l_scores), "List_n": int(l_n), "p_value": pvalue_graph_vs_list(g_scores, l_scores)})
overall_df_task2 = pd.DataFrame(overall_rows_t2)

fig, axes = plt.subplots(1, 2, figsize=(12, max(6, len(plot_df_task2) * 0.35)))
palette = sns.color_palette("muted")
graph_color, list_color = palette[0], palette[1]
for ax_idx, dataset in enumerate(["Monsters", "Places"]):
    ax = axes[ax_idx]
    sub = plot_df_task2[plot_df_task2["Dataset"] == dataset]
    n = len(sub)
    y = np.arange(n)
    width = 0.35
    g_se = sub["Graph_se"].fillna(0).values
    l_se = sub["List_se"].fillna(0).values
    ax.barh(y - width/2, sub["Graph_mean"], width, xerr=g_se, label="Graph", color=graph_color, edgecolor="black", linewidth=0.5, capsize=2)
    ax.barh(y + width/2, sub["List_mean"], width, xerr=l_se, label="List", color=list_color, edgecolor="black", linewidth=0.5, capsize=2)
    labels = [f"{q} (y)" if row.get("p_value", 1) < 0.05 else f"{q} (n)" for q, (_, row) in zip(sub["Question"], sub.iterrows())]
    ax.set_yticks(y)
    ax.set_yticklabels(labels, fontsize=9)
    ax.set_xlim(0, 1.05)
    ax.set_xlabel("Accuracy")
    ax.set_title(f"{dataset} – Second Task: Accuracy (Graph vs List)")
    ax.legend(loc="lower right", fontsize=8)
    ax.axvline(0.5, color="gray", linestyle="--", alpha=0.5)
plt.suptitle("Second Task Only: Accuracy by Question", y=1.02)
plt.tight_layout()
plt.show()

fig, ax = plt.subplots(figsize=(8, 4))
x = np.arange(2)
width = 0.35
g_se = overall_df_task2["Graph_se"].fillna(0).values
l_se = overall_df_task2["List_se"].fillna(0).values
ax.bar(x - width/2, overall_df_task2["Graph_mean"], width, yerr=g_se, label="Graph", color=graph_color, edgecolor="black", capsize=4)
ax.bar(x + width/2, overall_df_task2["List_mean"], width, yerr=l_se, label="List", color=list_color, edgecolor="black", capsize=4)
overall_labels_t2 = [f"{d} (y)" if row.get("p_value", 1) < 0.05 else f"{d} (n)" for d, (_, row) in zip(overall_df_task2["Dataset"], overall_df_task2.iterrows())]
ax.set_xticks(x)
ax.set_xticklabels(overall_labels_t2)
ax.set_ylabel("Accuracy")
ax.set_xlabel("Dataset")
ax.set_title("Second Task Only: Overall Average Accuracy (Graph vs List)")
ax.set_ylim(0, 1.05)
ax.legend(loc="upper right")
ax.axhline(0.5, color="gray", linestyle="--", alpha=0.5)
for i, row in overall_df_task2.iterrows():
    ax.text(i - width/2, row["Graph_mean"] + 0.02, f'{row["Graph_mean"]:.2f}', ha='center', va='bottom', fontsize=9)
    ax.text(i + width/2, row["List_mean"] + 0.02, f'{row["List_mean"]:.2f}', ha='center', va='bottom', fontsize=9)
plt.tight_layout()
plt.show()

# Second-task timing
print("\n" + "=" * 80)
print("SECOND TASK ONLY: Time by Interface and Dataset")
print("=" * 80)
if len(timing_task2_df) > 0:
    print(timing_task2_df.groupby(['interface', 'dataset']).agg(count=('time_sec', 'count'), mean_sec=('time_sec', 'mean')).round(1))
    plot_timing_summary(timing_task2_df)
else:
    print("No second-task timing data.")

### 6.1 Learning Rates: Interface × Task Order (First vs Second)

One figure with two bar charts: mean time and overall accuracy for each condition (Graph/First, List/First, Graph/Second, List/Second).

In [None]:
# First task per participant
sd_first = single_dist_filtered.sort_values(['prolific_pid', 'study_id', 'Timestamp']).groupby(['prolific_pid', 'study_id'], as_index=False).nth(0)
m_df_first = sd_first[sd_first["What were the questions about"] == "Monsters"]
p_df_first = sd_first[sd_first["What were the questions about"] == "Locations"]

monsters_first = compute_accuracy(m_df_first, MONSTERS_COLS, GROUND_TRUTH["monsters"], "monsters")
places_first = compute_accuracy(p_df_first, PLACES_COLS, GROUND_TRUTH["places"], "places")

def mean_acc_by_interface_order(res_first, res_second, iface_suffix):
    """Per-participant mean accuracy for an interface across first and second tasks."""
    keys_first = [k for k in res_first if k.endswith(iface_suffix)]
    keys_second = [k for k in res_second if k.endswith(iface_suffix)]
    first_scores = np.mean([res_first[k] for k in keys_first], axis=0).tolist() if keys_first else []
    second_scores = np.mean([res_second[k] for k in keys_second], axis=0).tolist() if keys_second else []
    return first_scores, second_scores

g_monsters = mean_acc_by_interface_order(monsters_first, monsters_task2, "_graph")
g_places = mean_acc_by_interface_order(places_first, places_task2, "_graph")
l_monsters = mean_acc_by_interface_order(monsters_first, monsters_task2, "_list")
l_places = mean_acc_by_interface_order(places_first, places_task2, "_list")

def concat_mean(a, b):
    combined = np.concatenate([np.array(a), np.array(b)]) if (a or b) else np.array([])
    return np.nanmean(combined) if len(combined) > 0 else np.nan

graph_first_acc = concat_mean(g_monsters[0], g_places[0])
graph_second_acc = concat_mean(g_monsters[1], g_places[1])
list_first_acc = concat_mean(l_monsters[0], l_places[0])
list_second_acc = concat_mean(l_monsters[1], l_places[1])

timing_first_df = timing_df[timing_df['task_idx'] == 0].copy() if 'task_idx' in timing_df.columns else pd.DataFrame()
if len(timing_first_df) > 0:
    timing_first_df['time_min'] = timing_first_df['time_sec'] / 60
t2_df = timing_task2_df.copy() if len(timing_task2_df) > 0 else pd.DataFrame()
if len(t2_df) > 0:
    t2_df['time_min'] = t2_df['time_sec'] / 60

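def safe_mean(ser):
    """Mean of the non-null values, or NaN if there are none."""
    v = ser.dropna()
    return v.mean() if len(v) > 0 else np.nan


def safe_sem(ser):
    """Standard error of the mean of the non-null values; 0 when fewer than two remain."""
    v = ser.dropna()
    return v.sem() if len(v) > 1 else 0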

graph_first_time = safe_mean(timing_first_df[timing_first_df['interface']=='Graph']['time_min']) if len(timing_first_df) > 0 else np.nan
graph_second_time = safe_mean(t2_df[t2_df['interface']=='Graph']['time_min']) if len(t2_df) > 0 else np.nan
list_first_time = safe_mean(timing_first_df[timing_first_df['interface']=='List']['time_min']) if len(timing_first_df) > 0 else np.nan
list_second_time = safe_mean(t2_df[t2_df['interface']=='List']['time_min']) if len(t2_df) > 0 else np.nan

graph_first_se_t = safe_sem(timing_first_df[timing_first_df['interface']=='Graph']['time_min']) if len(timing_first_df) > 0 else 0
graph_second_se_t = safe_sem(t2_df[t2_df['interface']=='Graph']['time_min']) if len(t2_df) > 0 else 0
list_first_se_t = safe_sem(timing_first_df[timing_first_df['interface']=='List']['time_min']) if len(timing_first_df) > 0 else 0
list_second_se_t = safe_sem(t2_df[t2_df['interface']=='List']['time_min']) if len(t2_df) > 0 else 0

conditions = ['Graph / First', 'List / First', 'Graph / Second', 'List / Second']
acc_vals = [graph_first_acc, list_first_acc, graph_second_acc, list_second_acc]
time_vals = [graph_first_time, list_first_time, graph_second_time, list_second_time]
time_ses = [graph_first_se_t, list_first_se_t, graph_second_se_t, list_second_se_t]

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
palette = sns.color_palette("muted")
colors = [palette[0], palette[1], palette[0], palette[1]]
hatches = ['', '', '///', '///']
x = np.arange(4)
width = 0.6

ax1 = axes[0]
bars1 = ax1.bar(x, time_vals, width, color=colors, hatch=hatches, edgecolor='black')
ax1.errorbar(x, time_vals, yerr=time_ses, fmt='none', color='black', capsize=4)
ax1.set_xticks(x)
ax1.set_xticklabels(conditions, rotation=15, ha='right')
ax1.set_ylabel('Time (min)')
ax1.set_title('Time by Interface × Task Order')
sns.despine(ax=ax1)

ax2 = axes[1]
ax2.bar(x, acc_vals, width, color=colors, hatch=hatches, edgecolor='black')
ax2.set_xticks(x)
ax2.set_xticklabels(conditions, rotation=15, ha='right')
ax2.set_ylabel('Accuracy')
ax2.set_title('Total Accuracy by Interface × Task Order')
ax2.set_ylim(0, 1.05)
ax2.axhline(0.5, color='gray', linestyle='--', alpha=0.5)
sns.despine(ax=ax2)

for i, v in enumerate(acc_vals):
    if not np.isnan(v):
        ax2.text(i, v + 0.02, f'{v:.2f}', ha='center', va='bottom', fontsize=9)

from matplotlib.patches import Patch
legend_elements = [Patch(facecolor=palette[0], label='Graph'), Patch(facecolor=palette[1], label='List'),
                  Patch(facecolor='white', hatch='///', edgecolor='gray', label='Second task')]
fig.legend(handles=legend_elements, loc='upper center', ncol=3, bbox_to_anchor=(0.5, -0.02))
plt.tight_layout()
plt.subplots_adjust(bottom=0.18)
plt.suptitle('Learning Rates: Time and Accuracy by Interface and Task Order', y=1.02)
plt.show()

print('Time (min):', dict(zip(conditions, [f'{t:.1f}' if not np.isnan(t) else '-' for t in time_vals])))
print('Accuracy:', dict(zip(conditions, [f'{a:.3f}' if not np.isnan(a) else '-' for a in acc_vals])))

### 6.2 Individual Participants: Time by Task (Scatterplot)

Each participant is drawn as two dots (task 1 and task 2) connected by a line. Dot color encodes the interface (blue = Graph, orange = List); the connecting line takes the color of the task 1 interface.
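
The slope of each connecting line is the participant's change in time between tasks. A rough sketch of how that change could be tabulated, assuming long-format rows with a `participant` key; the values are hypothetical, and `task1`, `task2`, and `delta_min` are illustrative names not used elsewhere in this notebook:

```python
import pandas as pd

# Hypothetical timing rows; p3 has only one task and is dropped below.
toy = pd.DataFrame({
    "participant": ["p1", "p1", "p2", "p2", "p3"],
    "task_idx": [0, 1, 0, 1, 0],
    "time_min": [7.5, 5.0, 6.0, 6.5, 9.0],
})

# Keep only participants who have timing for both tasks.
n_tasks = toy.groupby("participant")["task_idx"].nunique()
complete = n_tasks[n_tasks == 2].index
paired = toy[toy["participant"].isin(complete)]

# One row per participant, one column per task, then the task 2 minus task 1 difference.
wide = paired.pivot(index="participant", columns="task_idx", values="time_min")
wide = wide.rename(columns={0: "task1", 1: "task2"})
wide["delta_min"] = wide["task2"] - wide["task1"]
print(wide)
```

A negative `delta_min` means the participant was faster on the second task.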

In [None]:
# Per-participant timing: label each row with a combined pid/study_id key so both tasks can be matched
timing_with_pid = timing_df.copy()
timing_with_pid['time_min'] = timing_with_pid['time_sec'] / 60
timing_with_pid['participant'] = timing_with_pid.apply(lambda r: f"{r['pid']}_{r['study_id']}", axis=1)

# Only participants with BOTH task 1 and task 2 timing
task_counts = timing_with_pid.groupby('participant')['task_idx'].nunique()
complete = task_counts[task_counts == 2].index.tolist()
t_plot = timing_with_pid[timing_with_pid['participant'].isin(complete)]

fig, ax = plt.subplots(figsize=(6, 6))
palette = sns.color_palette("muted")
graph_color, list_color = palette[0], palette[1]

for part in t_plot['participant'].unique():
    sub = t_plot[t_plot['participant'] == part].sort_values('task_idx')
    if len(sub) != 2:
        continue
    x_vals = sub['task_idx'].values + 1  # 1 and 2 for task 1, task 2
    y_vals = sub['time_min'].values
    interfaces = sub['interface'].values
    line_color = graph_color if interfaces[0] == 'Graph' else list_color
    ax.plot(x_vals, y_vals, color=line_color, linestyle='-', linewidth=0.8, alpha=0.6, zorder=0)
    for xi, yi, iface in zip(x_vals, y_vals, interfaces):
        color = graph_color if iface == 'Graph' else list_color
        ax.scatter(xi, yi, c=[color], marker='o', s=64, zorder=1, alpha=0.6, edgecolors='none')

ax.set_xticks([1, 2])
ax.set_xticklabels(['Task 1', 'Task 2'])
ax.set_xlabel('Task')
ax.set_ylabel('Time (min)')
ax.set_title('Time by Task per Participant (line connects same person)')
from matplotlib.lines import Line2D
legend_elements = [Line2D([0], [0], marker='o', color='w', markerfacecolor=graph_color, markersize=10, alpha=0.6, label='Graph'),
                  Line2D([0], [0], marker='o', color='w', markerfacecolor=list_color, markersize=10, alpha=0.6, label='List')]
ax.legend(handles=legend_elements)
sns.despine(ax=ax)
plt.tight_layout()
plt.show()

### 6.3 Individual Participants: Accuracy by Task (Scatterplot)

Same layout as the time scatterplot in 6.2: two dots per participant (task 1 and task 2) connected by a line; blue = Graph, orange = List.

In [None]:
def accuracy_per_row(res, df, task_idx):
    """Build (pid, study_id, task_idx, interface, accuracy) rows from compute_accuracy results.

    Scores are matched to the rows of `df` by position within each interface subset.
    """
    rows = []
    for iface in ['graph', 'list']:
        subset = df[df['interface'] == iface]
        if len(subset) == 0:
            continue
        keys = [k for k in res if k.endswith(f'_{iface}')]
        if not keys:
            continue
        all_scores = np.array([res[k] for k in keys])
        mean_acc = np.mean(all_scores, axis=0)
        iface_cap = 'Graph' if iface == 'graph' else 'List'
        for i, (_, row) in enumerate(subset.iterrows()):
            if i < len(mean_acc):
                rows.append({'pid': row['prolific_pid'], 'study_id': row['study_id'], 'task_idx': task_idx,
                           'interface': iface_cap, 'accuracy': mean_acc[i]})
    return pd.DataFrame(rows)

acc_first = pd.concat([accuracy_per_row(monsters_first, m_df_first, 0), accuracy_per_row(places_first, p_df_first, 0)], ignore_index=True)
acc_second = pd.concat([accuracy_per_row(monsters_task2, m_df_task2, 1), accuracy_per_row(places_task2, p_df_task2, 1)], ignore_index=True)
acc_all = pd.concat([acc_first, acc_second], ignore_index=True)
acc_all['participant'] = acc_all.apply(lambda r: f"{r['pid']}_{r['study_id']}", axis=1)

acc_task_counts = acc_all.groupby('participant')['task_idx'].nunique()
acc_complete = acc_task_counts[acc_task_counts == 2].index.tolist()
a_plot = acc_all[acc_all['participant'].isin(acc_complete)]

fig, ax = plt.subplots(figsize=(6, 6))
palette = sns.color_palette("muted")
graph_color, list_color = palette[0], palette[1]

for part in a_plot['participant'].unique():
    sub = a_plot[a_plot['participant'] == part].sort_values('task_idx')
    if len(sub) != 2:
        continue
    x_vals = sub['task_idx'].values + 1
    y_vals = sub['accuracy'].values
    interfaces = sub['interface'].values
    line_color = graph_color if interfaces[0] == 'Graph' else list_color
    ax.plot(x_vals, y_vals, color=line_color, linestyle='-', linewidth=0.8, alpha=0.6, zorder=0)
    for xi, yi, iface in zip(x_vals, y_vals, interfaces):
        color = graph_color if iface == 'Graph' else list_color
        ax.scatter(xi, yi, c=[color], marker='o', s=64, zorder=1, alpha=0.6, edgecolors='none')

ax.set_xticks([1, 2])
ax.set_xticklabels(['Task 1', 'Task 2'])
ax.set_xlabel('Task')
ax.set_ylabel('Accuracy')
ax.set_title('Accuracy by Task per Participant (line connects same person)')
ax.set_ylim(0, 1.05)
from matplotlib.lines import Line2D
legend_elements = [Line2D([0], [0], marker='o', color='w', markerfacecolor=graph_color, markersize=10, alpha=0.6, label='Graph'),
                  Line2D([0], [0], marker='o', color='w', markerfacecolor=list_color, markersize=10, alpha=0.6, label='List')]
ax.legend(handles=legend_elements)
sns.despine(ax=ax)
plt.tight_layout()
plt.show()

## 7. Summary

- **Cohort**: Participants with valid prolific_pid in Comparisons → all other data filtered to these PIDs.
- **Comparisons**: 5 histograms for 1–7 scale (1=graph, 7=list).
- **Single Distribution**: (1) Paired histograms by section for Graph vs List; (2) horizontal stacked bar charts (same data) with diverging palette.
- **Timing**: From outputs telemetry, time between consecutive "next" navigations on survey pages, by (interface × dataset) and by interface.
- **Second Task Only**: Accuracy and timing for chronologically second task per participant; compares Graph vs List on time and correctness.
- **Learning Rates**: Side-by-side bar charts of time and total accuracy by interface × task order (Graph/First, List/First, Graph/Second, List/Second).
- **Individual scatterplots**: Time and accuracy by task per participant; two dots connected by a line; blue=Graph, orange=List.
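
The timing rule in the summary (time between consecutive "next" navigations) can be sketched as a per-participant timestamp difference. The event log, its column names (`pid`, `event`, `ts`), and its values are hypothetical and are not the actual outputs telemetry schema:

```python
import pandas as pd

# Hypothetical navigation log; the real telemetry columns may differ.
events = pd.DataFrame({
    "pid": ["a", "a", "a", "b", "b"],
    "event": ["next", "next", "next", "next", "next"],
    "ts": pd.to_datetime([
        "2024-01-01 10:00:00", "2024-01-01 10:05:00", "2024-01-01 10:12:30",
        "2024-01-01 11:00:00", "2024-01-01 11:04:00",
    ]),
})

# Order each participant's "next" events chronologically.
nexts = events[events["event"] == "next"].sort_values(["pid", "ts"]).copy()

# Seconds since the same participant's previous "next"; the first event per participant is NaN.
nexts["time_sec"] = nexts.groupby("pid")["ts"].diff().dt.total_seconds()
print(nexts[["pid", "ts", "time_sec"]])
```

Grouping by participant before differencing keeps one participant's last event from being paired with the next participant's first event.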