# 06 Research Comparison (Primary Contribution Focus)

This notebook produces the replication comparison layer for the dissertation: open-source local results are contrasted across prompting, QLoRA fine-tuning, and execution infrastructure, following the comparison structure in Ojuri et al. (2025).

Primary research comparisons:
- Prompting effect: `k=0` vs `k=3`
- Fine-tuning effect: Base vs QLoRA
- Error taxonomy to explain why metrics move

Infrastructure comparison (secondary):
- ReAct as execution support for validity/traceability, not the main semantic claim.

The notebook reads existing JSON run outputs and writes plot-ready artifacts to `results/analysis/`.


In [None]:
from pathlib import Path
import sys
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display

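# Resolve the project root: fall back to the parent directory when the
# notebook is launched from a subdirectory that has no results/ folder.
PROJECT_ROOT = Path.cwd()
if not (PROJECT_ROOT / 'results').exists():
    PROJECT_ROOT = PROJECT_ROOT.parent

# Make the project importable so the `scripts` package resolves,
# without adding a duplicate entry when the cell is rerun.
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))
from scripts.generate_research_comparison import generate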

OUT_DIR = PROJECT_ROOT / 'results' / 'analysis'
summary = generate(out_dir=OUT_DIR, project_root=PROJECT_ROOT)
summary

### Step 1 - Review loaded runs and headline tables
Use this to confirm which JSON artifacts were ingested before interpreting results.
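
The wide and long tables loaded in this step appear to be two views of the same numbers. A minimal sketch of that relationship follows. It assumes the wide table has one row per `run_label` and one column per metric. The run labels and all values are invented for illustration, and the actual layout is whatever `generate` writes.

```python
import pandas as pd

# Illustrative wide table: one row per run, one column per metric.
# Run labels and values are invented, not taken from real results.
wide_demo = pd.DataFrame({
    'run_label': ['base_k0', 'base_k3'],
    'va': [90.0, 95.0],
    'em': [10.0, 20.0],
    'ex': [30.0, 45.0],
    'ts': [25.0, 40.0],
})

# Melt to long format: one row per (run_label, metric) pair.
long_demo = wide_demo.melt(
    id_vars='run_label',
    value_vars=['va', 'em', 'ex', 'ts'],
    var_name='metric',
    value_name='rate_pct',
)

# Pivoting the long table back recovers the wide layout.
roundtrip = long_demo.pivot(index='run_label', columns='metric', values='rate_pct')
```

The long form is convenient for filtering and plotting by metric, while the wide form reads more naturally as a headline table.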


In [None]:
manifest = pd.read_csv(OUT_DIR / 'run_manifest.csv')
metrics_wide = pd.read_csv(OUT_DIR / 'overall_metrics_wide.csv')
metrics_long = pd.read_csv(OUT_DIR / 'overall_metrics_long.csv')

print('Run manifest:')
display(manifest)

print('Overall metrics (wide):')
display(metrics_wide.sort_values('run_label').reset_index(drop=True))

### Step 2 - Inspect overall VA/EM/EX/TS patterns
This chart gives the high-level performance profile per run.


In [None]:
plot_metrics = metrics_long[metrics_long['metric'].isin(['va', 'em', 'ex', 'ts'])].copy()
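pivot = plot_metrics.pivot(index='run_label', columns='metric', values='rate_pct')
# Keep the bars in VA/EM/EX/TS order instead of the alphabetical order pivot() produces.
pivot = pivot.reindex(columns=[m for m in ['va', 'em', 'ex', 'ts'] if m in pivot.columns])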

ax = pivot.plot(kind='bar', figsize=(10, 5), rot=20)
ax.set_ylabel('Rate (%)')
ax.set_xlabel('Run')
ax.set_title('Overall Metrics by Run')
ax.legend(title='Metric', loc='upper left', bbox_to_anchor=(1.01, 1.0))
ax.grid(axis='y', alpha=0.25)
plt.tight_layout()
plt.show()

### Step 3 - Check paired deltas on identical items
Each delta compares two runs on the same items, so use this table for controlled claims about few-shot and fine-tuning effects.
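
As a rough illustration of what a paired delta on identical items could look like, the sketch below restricts two runs to the items they share before differencing the mean score. The helper `paired_delta_pct`, the per-item columns `item_id` and `ex`, and the toy scores are all hypothetical. The actual computation lives in `scripts/generate_research_comparison.py` and may differ.

```python
import pandas as pd

def paired_delta_pct(run_a: pd.DataFrame, run_b: pd.DataFrame, metric: str) -> float:
    """Delta (run_b minus run_a) in percentage points over items present in both runs."""
    # The inner join keeps only items that were scored in both runs.
    both = run_a.merge(run_b, on='item_id', suffixes=('_a', '_b'))
    if both.empty:
        return float('nan')
    return 100.0 * (both[f'{metric}_b'].mean() - both[f'{metric}_a'].mean())

# Toy per-item scores (1 = pass, 0 = fail); values are invented.
k0 = pd.DataFrame({'item_id': [1, 2, 3, 4], 'ex': [0, 0, 1, 1]})
k3 = pd.DataFrame({'item_id': [2, 3, 4, 5], 'ex': [1, 1, 1, 0]})

# Only items 2, 3, and 4 appear in both runs, so only they contribute.
few_shot_gain = paired_delta_pct(k0, k3, 'ex')
```

Restricting to shared items keeps a difference in item coverage between runs from being mistaken for a prompting or fine-tuning effect.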


In [None]:
paired_path = OUT_DIR / 'paired_deltas.csv'
if paired_path.exists():
    paired = pd.read_csv(paired_path)
    display(paired)

    delta = paired[
        paired['comparison_id'].isin([
            'few_shot_gain_base',
            'few_shot_gain_qlora',
            'qlora_gain_k0',
            'qlora_gain_k3',
        ])
        & paired['metric'].isin(['va', 'em', 'ex'])
    ].copy()

    if not delta.empty:
        pivot_delta = delta.pivot(index='comparison_label', columns='metric', values='delta_pct')
        ax = pivot_delta.plot(kind='bar', figsize=(10, 4), rot=15)
        ax.axhline(0.0, color='black', linewidth=1)
        ax.set_ylabel('Delta (percentage points)')
        ax.set_xlabel('Comparison')
        ax.set_title('Controlled Delta Comparisons')
        ax.legend(title='Metric', loc='upper left', bbox_to_anchor=(1.01, 1.0))
        ax.grid(axis='y', alpha=0.25)
        plt.tight_layout()
        plt.show()
    else:
        print('paired_deltas.csv contains no rows for the controlled comparisons.')
else:
    print('No paired_deltas.csv found yet.')

### Step 4 - Diagnose failure composition
Use this to explain *why* metric changes occurred (join path, aggregation, value linking, etc.).
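
To make the `share_of_failures` column concrete, here is a minimal sketch of how such a share could be derived from per-type counts. It assumes the share is normalized within each run. The run labels, failure-type names, and counts are invented, and the generator script may compute the column differently.

```python
import pandas as pd

# Toy failure counts per run; labels and values are invented.
tax_demo = pd.DataFrame({
    'run_label': ['run_a', 'run_a', 'run_b', 'run_b'],
    'failure_type': ['join_path', 'aggregation', 'join_path', 'value_linking'],
    'count': [6, 2, 3, 1],
})

# Normalize within each run so that the shares for a run sum to 1.
tax_demo['share_of_failures'] = (
    tax_demo['count'] / tax_demo.groupby('run_label')['count'].transform('sum')
)

# Same reshaping as the plotting cell: failure types absent from a run become 0.
pivot_demo = tax_demo.pivot(
    index='run_label', columns='failure_type', values='share_of_failures'
).fillna(0.0)
```

Under this normalization a run with few failures can still show a large share for one type, so the shares are best read alongside the raw `count` column.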


In [None]:
taxonomy_path = OUT_DIR / 'failure_taxonomy.csv'
if taxonomy_path.exists():
    tax = pd.read_csv(taxonomy_path)
    display(tax.sort_values(['run_label', 'count'], ascending=[True, False]).reset_index(drop=True))

    pivot_tax = tax.pivot(index='run_label', columns='failure_type', values='share_of_failures').fillna(0.0)
    ax = pivot_tax.plot(kind='bar', stacked=True, figsize=(10, 5), rot=20)
    ax.set_ylabel('Share of failed examples')
    ax.set_xlabel('Run')
    ax.set_title('Failure Taxonomy by Run')
    ax.legend(title='Failure type', loc='upper left', bbox_to_anchor=(1.01, 1.0))
    ax.grid(axis='y', alpha=0.25)
    plt.tight_layout()
    plt.show()
else:
    print('No failure_taxonomy.csv found yet.')

## Dissertation Use Notes

- Use `overall_metrics_wide.csv` for headline VA/EM/EX/TS tables.
- Use `paired_deltas.csv` for controlled claims (few-shot gain, fine-tune gain).
- Use `failure_taxonomy.csv` to explain persistent semantic errors (join path, aggregation, value linking).
- If QLoRA files are missing, run `05_qlora_train_eval.ipynb` first, then rerun this notebook.
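
A quick way to see which of the artifacts named above are present before writing up results is sketched below. The helper `missing_artifacts` is hypothetical and not part of the project's scripts. The file list is taken from the CSVs this notebook reads.

```python
from pathlib import Path

# CSV artifacts read by this notebook from results/analysis/.
EXPECTED_ARTIFACTS = [
    'run_manifest.csv',
    'overall_metrics_wide.csv',
    'overall_metrics_long.csv',
    'paired_deltas.csv',
    'failure_taxonomy.csv',
]

def missing_artifacts(out_dir, expected=EXPECTED_ARTIFACTS):
    """Return the expected file names that are not present in out_dir."""
    out_dir = Path(out_dir)
    return [name for name in expected if not (out_dir / name).exists()]
```

Calling `missing_artifacts(OUT_DIR)` after the first cell returns an empty list when every file was written. A missing `paired_deltas.csv` may mean the paired runs, for example the QLoRA ones, have not been produced yet.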