# Research Comparison Statistics (Base + QLoRA, Llama + Qwen)

This notebook runs a single script that reads every run file matching:
- `results/baseline/runs/**/results_k*_seed*.json`

It includes any valid run matching this matrix:
- models: Llama, Qwen
- methods: Base, QLoRA
- k values used for hypothesis tests: `k=0` and `k=3`
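
As a rough illustration of how the glob pattern above selects files, the following sketch walks a runs directory and extracts `k` and `seed` from each filename. It is not taken from `generate_research_comparison`; the helper name `find_run_files` and the regular expression are assumptions inferred from the pattern `results_k*_seed*.json`.

```python
import re
from pathlib import Path

# Assumed filename convention, inferred from the glob pattern
# `results_k*_seed*.json`; the real script may parse names differently.
RUN_FILE_RE = re.compile(r"results_k(?P<k>\d+)_seed(?P<seed>\d+)\.json$")


def find_run_files(runs_root):
    """Return (k, seed, path) tuples for each matching run file under runs_root.

    Hypothetical helper for illustration only.
    """
    found = []
    # `**` matches zero or more directory levels below runs_root.
    for path in sorted(Path(runs_root).glob("**/results_k*_seed*.json")):
        match = RUN_FILE_RE.search(path.name)
        if match:  # skip names where k or seed is not numeric
            found.append((int(match["k"]), int(match["seed"]), path))
    return found
```

Calling `find_run_files("results/baseline/runs")` before Step 1 is a quick way to check that the expected `k=0` and `k=3` files are present.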

Documentation:
- [Shapiro-Wilk (SciPy)](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.shapiro.html)
- [Paired t-test (SciPy)](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html)
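
For readers unfamiliar with the two tests linked above, here is a minimal sketch of how they can be applied to one pair of matched score vectors. It assumes NumPy and SciPy are installed. The function name `paired_comparison` and its return keys are hypothetical, and running Shapiro-Wilk on the paired differences is a choice made for this example, not necessarily what the stats script does.

```python
import numpy as np
from scipy import stats


def paired_comparison(left, right):
    """Shapiro-Wilk on the paired differences, then a paired t-test.

    Hypothetical helper for illustration; `left` and `right` must hold
    scores for the same items in the same order.
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    if left.shape != right.shape:
        raise ValueError("paired samples must have the same length")

    diff = right - left
    # Shapiro-Wilk tests the null hypothesis that the data are normally distributed.
    _, shapiro_p = stats.shapiro(diff)
    # Passing (right, left) makes the t statistic share the sign of mean(right - left).
    t_stat, p_value = stats.ttest_rel(right, left)

    return {
        "mean_diff_right_minus_left": float(diff.mean()),
        "shapiro_p": float(shapiro_p),
        "t_statistic": float(t_stat),
        "p_value": float(p_value),
    }
```

A small `shapiro_p` suggests the differences deviate from normality, in which case the paired t-test result should be read with more caution.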


## Step 1: Run the Stats Script

No configuration is needed if your run JSON files are under `results/baseline/runs`.


In [None]:
from pathlib import Path
import sys
import json
import pandas as pd
from IPython.display import display

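# Resolve the project root: use the current directory if it contains
# `results/`, otherwise assume the notebook was started one level below it.
PROJECT_ROOT = Path.cwd()
if not (PROJECT_ROOT / "results").exists():
    PROJECT_ROOT = PROJECT_ROOT.parent

# Make the `scripts` package importable without adding duplicate path entries.
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))
from scripts.generate_research_comparison import generate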

RUNS_ROOT = PROJECT_ROOT / "results" / "baseline" / "runs"
OUT_DIR = PROJECT_ROOT / "results" / "analysis"
PER_ITEM_CSV = OUT_DIR / "per_item_metrics_primary_raw.csv"

summary = generate(runs_root=RUNS_ROOT, per_item_csv=PER_ITEM_CSV, out_dir=OUT_DIR)
print(json.dumps(summary, indent=2))


## Step 2: Load the Output Tables


In [None]:
manifest_df = pd.read_csv(OUT_DIR / "run_manifest.csv")
shapiro_df = pd.read_csv(OUT_DIR / "stats_mean_median_shapiro.csv")
ttest_df = pd.read_csv(OUT_DIR / "stats_paired_ttests.csv")

print("Run manifest:")
display(manifest_df)

print("\nMean / Median / Shapiro:")
display(shapiro_df.sort_values(["condition_id", "seed", "metric"]).reset_index(drop=True))

print("\nPaired t-tests:")
display(ttest_df.sort_values(["comparison", "metric"]).reset_index(drop=True))


## Step 3: Plain-Language Hypothesis Results (EX only)

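Decision rule at significance level 0.05:
- if `p_value < 0.05`: reject `H0`
- else: fail to reject `H0`

Each row in `stats_paired_ttests.csv` is one predefined hypothesis comparison. For a paired t-test, `H0` is that the mean paired difference between the two conditions is zero.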
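
The decision rule can be sketched as a small function applied to a column of p-values. The toy table below is made-up illustration data, not output from any run. The labels `fail_to_reject_H0` and `undetermined` are assumptions for this example; only `reject_H0` appears in the notebook code.

```python
import numpy as np
import pandas as pd

ALPHA = 0.05


def decide(p_value, alpha=ALPHA):
    """Map a p-value to a decision label (hypothetical helper)."""
    if pd.isna(p_value):
        # No p-value, e.g. too few matched pairs to run the test.
        return "undetermined"
    # Strict inequality: a p-value exactly equal to alpha does not reject H0.
    return "reject_H0" if p_value < alpha else "fail_to_reject_H0"


# Made-up p-values purely to exercise each branch of the rule.
toy = pd.DataFrame(
    {
        "comparison": ["A_vs_B", "C_vs_D", "E_vs_F"],
        "p_value": [0.003, 0.05, np.nan],
    }
)
toy["decision"] = toy["p_value"].map(decide)
print(toy)
```

Treating a missing p-value as its own category avoids reporting "not significant" for a comparison that could not be tested at all.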


In [None]:
ex_rows = ttest_df[ttest_df["metric"] == "ex"].copy()

if ex_rows.empty:
    print("No EX comparisons available yet. Add matching run pairs first.")
else:
    for _, row in ex_rows.sort_values("comparison").iterrows():
        delta = row["mean_diff_right_minus_left"]
        pval = row["p_value"]
        decision = row["decision_alpha_0_05"]
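        sig_text = "significant" if decision == "reject_H0" else "not significant"

        print(f"Comparison: {row['comparison']}")
        if pd.isna(delta):
            print("  EX delta: unavailable (insufficient matched pairs)")
        elif delta == 0:
            # A zero difference is neither an improvement nor a decrease.
            print("  EX unchanged (delta = 0.000)")
        else:
            # A positive delta means the right-hand condition scored higher.
            direction = "improved" if delta > 0 else "decreased"
            print(f"  EX {direction} by {abs(delta):.3f}")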
        print(f"  p-value: {pval:.4g} ({sig_text})")
        print(f"  Decision: {decision}")
        print()
