{
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# PoC 2: EU Contract Clause Extraction - Results Analysis"],
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 1. Setup and Configuration\n",
                "\n",
                "This notebook analyzes evaluation results for several RAG configurations and a baseline model. It loads per-document results from `.jsonl` files, computes the average type-set precision, recall, F1, retrieval recall, and LLM judge score for each configuration, and presents summary tables and plots.",
            ],
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "import pandas as pd\n",
                "import json\n",
                "import os\n",
                "import numpy as np\n",
                "from typing import Any, Dict, List\n",
                "import matplotlib.pyplot as plt\n",
                "import seaborn as sns\n",
                "\n",
                "# Ensure plots are displayed inline in Jupyter\n",
                "%matplotlib inline\n",
                'sns.set_theme(style="whitegrid")\n',
                "\n",
                "# --- Configuration ---\n",
                "# Assuming the notebook is in `eucontract-analyzer/notebooks/`\n",
                "# RESULTS_DIR should point to the `eucontract-analyzer` directory where `rag_evaluation_results` etc. are located\n",
                'RESULTS_DIR = ".."\n',
                "\n",
                "CONFIGS = [\n",
                "    (\n",
                '        "Baseline (Claude 3.5, No Judge)",\n',
                '        r"baseline_evaluation_results/baseline_eval_summary_eu-clauses-gold-en-v1_claude-3-5_20250517_225906.jsonl",\n',
                "    ),\n",
                "    (\n",
                '        "RAG (GPT-4.1)", # Assuming K=5 or not specified, no judge\n',
                '        r"rag_evaluation_results/rag_eval_summary_eu-clauses-gold-en-v1_gpt-4.1_20250515_125739.jsonl",\n',
                "    ),\n",
                "    # --- GPT-4.1-mini RAG ---\n",
                "    (\n",
                '        "RAG (GPT-4.1-mini, K=3, No Judge)",\n',
                '        r"rag_evaluation_results/rag_eval_summary_eu-clauses-gold-en-v1_gpt-4.1-mini_20250516_015403.jsonl",\n',
                "    ),\n",
                "    (\n",
                '        "RAG (GPT-4.1-mini, K=5, No Judge)",\n',
                '        r"rag_evaluation_results/rag_eval_k5_gpt-4.1-mini_eu-clauses-gold-en-v1_20250520_214813.jsonl",\n',
                "    ),\n",
                "    (\n",
                '        "RAG (GPT-4.1-mini, K=5, Judge: Claude 3.5)",\n',
                '        r"rag_evaluation_results/ragevalK5__J-claude-3-5JUDGE_gpt-4.1-mini_text-embedding-3-small_eu-clauses-gold-en-v1_20250522_014404.jsonl",\n',
                "    ),\n",
                "    (\n",
                '        "RAG (GPT-4.1-mini, K=5, Judge: GPT-4.1-mini)",\n',
                '        r"rag_evaluation_results/ragevalK5__J-gpt-4.1-miniJUDGE_gpt-4.1-mini_text-embedding-3-small_eu-clauses-gold-en-v1_20250522_021620.jsonl",\n',
                "    ),\n",
                "    (\n",
                '        "RAG (GPT-4.1-mini, K=5, Judge: Gemma3)",\n',
                '        r"rag_evaluation_results/ragevalK5__J-gemma3-4b-it-qatJUDGE_gpt-4.1-mini_text-embedding-3-small_eu-clauses-gold-en-v1_20250522_023100.jsonl",\n',
                "    ),\n",
                "    # --- Gemma-3 RAG ---\n",
                "    (\n",
                '        "RAG (Gemma-3, K=3, No Judge)",\n',
                '        r"rag_evaluation_results/rag_eval_summary_gemma3-4b-it-qat_eu-clauses-gold-en-v1_20250520_030618.jsonl",\n',
                "    ),\n",
                "    (\n",
                '        "RAG (Gemma-3, K=5, No Judge)",\n',
                '        r"rag_evaluation_results/rag_eval_k5_gemma3-4b-it-qat_eu-clauses-gold-en-v1_20250520_164508.jsonl",\n',
                "    ),\n",
                "    (\n",
                '        "RAG (Gemma-3, K=5, Judge: Gemma3)",\n',
                '        r"rag_evaluation_results/ragevalK5__J-gemma3-4b-it-qatJUDGE_gemma3-4b-it-qat_text-embedding-3-small_eu-clauses-gold-en-v1_20250522_034110.jsonl",\n',
                "    ),\n",
                "    # --- Claude 3.7 RAG ---\n",
                "    (\n",
                '        "RAG (Claude 3.7, K=3, No Judge)",\n',
                '        r"rag_evaluation_results/rag_eval_summary_claude-3-7_eu-clauses-gold-en-v1_20250520_162138.jsonl",\n',
                "    ),\n",
                "    (\n",
                '        "RAG (Claude 3.7, K=5, No Judge)",\n',
                '        r"rag_evaluation_results/rag_eval_k5_claude-3-7_eu-clauses-gold-en-v1_20250520_163412.jsonl",\n',
                "    ),\n",
                "]",
            ],
        },
        {"cell_type": "markdown", "metadata": {}, "source": ["## 2. Helper Functions"]},
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "def load_jsonl(fn: str) -> pd.DataFrame:\n",
                "    path = os.path.join(RESULTS_DIR, fn)\n",
                '    print(f"Loading → {path}")\n',
                "    try:\n",
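                '        with open(path, "r", encoding="utf-8") as f:\n',
                "            # Skip blank lines so a stray empty line does not discard the whole file\n",
                "            return pd.DataFrame([json.loads(line) for line in f if line.strip()])\n",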
                "    except FileNotFoundError:\n",
                '        print(f"  ⚠️ FileNotFoundError: {path}. Returning empty DataFrame.")\n',
                "        return pd.DataFrame()\n",
                "    except Exception as e:\n",
                '        print(f"  ⚠️ {e.__class__.__name__} loading {path}. Returning empty DataFrame.")\n',
                "        return pd.DataFrame()\n",
                "\n",
                "def flatten(df: pd.DataFrame) -> pd.DataFrame:\n",
                '    if df.empty or "metrics" not in df.columns:\n',
                "        return df\n",
                "    # Ensure 'metrics' column exists and handle potential non-dict entries gracefully\n",
                '    m = df.pop("metrics").apply(lambda d: d if isinstance(d, dict) else {})\n',
                "    \n",
                "    # Define keys to extract from the metrics dictionary\n",
                "    metric_keys = [\n",
                '        "type_set_precision",\n',
                '        "type_set_recall",\n',
                '        "type_set_f1",\n',
                '        "retrieval_recall",\n',
                "    ]\n",
                "    for k in metric_keys:\n",
                "        # Use pd.NA for missing values, which is Pandas' preferred missing value indicator\n",
                "        df[k] = m.apply(lambda d: d.get(k, pd.NA))\n",
                "    return df\n",
                "\n",
                "def summarize(df: pd.DataFrame, label: str) -> Dict[str, Any]:\n",
                "    if df.empty:\n",
                "        return {\n",
                '            "Model": label, "F1": np.nan, "Prec": np.nan, "Rec": np.nan,\n',
                '            "RetRec": np.nan, "JudgeScore": np.nan, "Succ": "0/0",\n',
                "        }\n",
                "\n",
                "    raw_metric_cols = {\n",
                '        "F1": "type_set_f1",\n',
                '        "Prec": "type_set_precision",\n',
                '        "Rec": "type_set_recall",\n',
                '        "RetRec": "retrieval_recall",\n',
                '        "JudgeScore": "llm_judge_overall_quality_score",\n',
                "    }\n",
                "    processed_metrics = {}\n",
                "    for display_key, raw_key in raw_metric_cols.items():\n",
                "        if raw_key in df.columns:\n",
                "            numeric_series = pd.to_numeric(df[raw_key], errors='coerce')\n",
                "            processed_metrics[display_key] = numeric_series.mean()\n",
                "        else:\n",
                "            processed_metrics[display_key] = np.nan\n",
                "\n",
                '    succ = df["error"].isna().sum() if "error" in df.columns else len(df)\n',
                "    total = len(df)\n",
                "\n",
                "    return {\n",
                '        "Model": label,\n',
                '        "F1": processed_metrics.get("F1", np.nan),\n',
                '        "Prec": processed_metrics.get("Prec", np.nan),\n',
                '        "Rec": processed_metrics.get("Rec", np.nan),\n',
                '        "RetRec": processed_metrics.get("RetRec", np.nan),\n',
                '        "JudgeScore": processed_metrics.get("JudgeScore", np.nan),\n',
                '        "Succ": f"{succ}/{total}",\n',
                "    }\n",
                "\n",
                "metric_key_map = {\n",
                '    "F1": "F1", "Prec": "P", "Rec": "R", "RetRec": "RR", "JudgeScore": "JudgeS"\n',
                "}\n",
                "\n",
                "def extract_delta(summary_df: pd.DataFrame, base_model_name: str) -> Dict[str, Any]:\n",
                "    k3_row = pd.Series(dtype=object)\n",
                "    k5_row = pd.Series(dtype=object)\n",
                "\n",
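                "    # Match labels literally: names such as GPT-4.1-mini contain regex metacharacters\n",
                '    k3_candidates = summary_df[summary_df["Model"].str.contains(f"{base_model_name}, K=3", regex=False)]\n',
                '    k5_candidates = summary_df[summary_df["Model"].str.contains(f"{base_model_name}, K=5", regex=False)]\n',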
                "\n",
                "    if not k3_candidates.empty:\n",
                "        k3_row = k3_candidates.iloc[-1] # Pick last one, assumes judged version if multiple\n",
                "    if not k5_candidates.empty:\n",
                "        k5_row = k5_candidates.iloc[-1]\n",
                "\n",
                "    if k3_row.empty or k5_row.empty:\n",
                '        print(f"Warning: Missing K=3 or K=5 data for model base {base_model_name}. Skipping delta.")\n',
                "        return {\n",
                '            "Model": base_model_name, "F1@3": np.nan, "F1@5": np.nan, "ΔF1": np.nan,\n',
                '            "P@3": np.nan, "P@5": np.nan, "ΔP": np.nan, "R@3": np.nan, "R@5": np.nan, "ΔR": np.nan,\n',
                '            "RR@3": np.nan, "RR@5": np.nan, "ΔRR": np.nan, "JudgeS@3": np.nan, "JudgeS@5": np.nan,\n',
                '            "Succ@3": "N/A", "Succ@5": "N/A",\n',
                "        }\n",
                "\n",
                '    result_dict = {"Model": base_model_name}\n',
                '    metrics_to_compare = ["F1", "Prec", "Rec", "RetRec", "JudgeScore"]\n',
                "\n",
                "    for metric_key_base in metrics_to_compare:\n",
                "        short_metric_key = metric_key_map.get(metric_key_base, metric_key_base)\n",
                "        val_k3 = k3_row.get(metric_key_base, np.nan)\n",
                "        val_k5 = k5_row.get(metric_key_base, np.nan)\n",
                "        \n",
                "        val_k3 = float(val_k3) if pd.notna(val_k3) else np.nan\n",
                "        val_k5 = float(val_k5) if pd.notna(val_k5) else np.nan\n",
                "\n",
                '        result_dict[f"{short_metric_key}@3"] = val_k3\n',
                '        result_dict[f"{short_metric_key}@5"] = val_k5\n',
                "\n",
                '        if metric_key_base != "JudgeScore":\n',
                "            if pd.notna(val_k3) and pd.notna(val_k5):\n",
                '                result_dict[f"Δ{short_metric_key}"] = val_k5 - val_k3\n',
                "            else:\n",
                '                result_dict[f"Δ{short_metric_key}"] = np.nan\n',
                "    \n",
                '    result_dict["Succ@3"] = k3_row.get("Succ", "N/A")\n',
                '    result_dict["Succ@5"] = k5_row.get("Succ", "N/A")\n',
                "    return result_dict",
            ],
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 3. Load and Process Data\n",
                "\n",
                "Load all configured result files, flatten their metrics, and store them.",
            ],
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "frames: Dict[str, pd.DataFrame] = {}\n",
                "for label, fn in CONFIGS:\n",
                "    df = load_jsonl(fn)\n",
                "    df = flatten(df) # Flatten also handles pd.NA for RAG specific metrics\n",
                "    frames[label] = df\n",
                "    \n",
                "    # Optional: Display head of each loaded DataFrame for quick inspection\n",
                '    # print(f"\\n--- Head of: {label} ---")\n',
                "    # if not df.empty:\n",
                "    #     display(df[['celex_id', 'type_set_f1', 'llm_judge_overall_quality_score', 'error']].head() if 'llm_judge_overall_quality_score' in df.columns \n",
                "    #               else df[['celex_id', 'type_set_f1', 'error']].head() if 'type_set_f1' in df.columns else df.head())\n",
                "    # else:\n",
                '    #     print("  (empty)")',
            ],
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["## 4. Main Performance Comparison"],
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "summary_list = []\n",
                "for label, _ in CONFIGS:\n",
                "    summary_list.append(summarize(frames[label], label))\n",
                "summary_df = pd.DataFrame(summary_list)\n",
                "\n",
                "# Explicitly ensure numeric columns are float, coercing errors to NaN\n",
                'final_numeric_cols = ["F1", "Prec", "Rec", "RetRec", "JudgeScore"]\n',
                "for col in final_numeric_cols:\n",
                "    if col in summary_df.columns:\n",
                "        summary_df[col] = pd.to_numeric(summary_df[col], errors='coerce')\n",
                "    else:\n",
                "        summary_df[col] = np.nan\n",
                "\n",
                'print("--- Main Performance Comparison (including LLM Judge Scores) ---")\n',
                'display(summary_df.style.format("{:.3f}", subset=pd.IndexSlice[:, final_numeric_cols], na_rep="N/A"))',
            ],
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["### 4.1. Visualizing Main Performance Metrics"],
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# Plotting F1 Scores\n",
                "plt.figure(figsize=(12, 8))\n",
                'sns.barplot(data=summary_df.sort_values("F1", ascending=False), y="Model", x="F1", palette="viridis")\n',
                "plt.title('Model Comparison: Average F1 Score')\n",
                "plt.xlabel('Average F1 Score')\n",
                "plt.ylabel('Model Configuration')\n",
                "plt.tight_layout()\n",
                "plt.show()\n",
                "\n",
                "# Plotting Judge Scores (for models that have them)\n",
                'judge_score_df = summary_df.dropna(subset=["JudgeScore"])\n',
                "if not judge_score_df.empty:\n",
                "    plt.figure(figsize=(12, max(6, len(judge_score_df) * 0.5)))\n",
                '    sns.barplot(data=judge_score_df.sort_values("JudgeScore", ascending=False), y="Model", x="JudgeScore", palette="mako")\n',
                "    plt.title('Model Comparison: Average LLM Judge Score')\n",
                "    plt.xlabel('Average Judge Score (0.0-1.0)')\n",
                "    plt.ylabel('Model Configuration')\n",
                "    plt.tight_layout()\n",
                "    plt.show()\n",
                "else:\n",
                '    print("No models with Judge Scores to plot.")',
            ],
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["## 5. K=3 vs. K=5 Comparison"],
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                'bases = ["GPT-4.1-mini", "Gemma-3", "Claude 3.7"]\n',
                "cmp_rows = [extract_delta(summary_df, b) for b in bases]\n",
                "k_comparison_df = pd.DataFrame(cmp_rows)\n",
                "\n",
                "# Drop rows if all key metric values are NA (helps if a base model was not found for K3/K5)\n",
                "metric_value_cols = [col for col in k_comparison_df.columns if col not in ['Model', 'Succ@3', 'Succ@5']]\n",
                "k_comparison_df.dropna(subset=metric_value_cols, how='all', inplace=True)\n",
                "\n",
                'print("--- K=3 vs K=5 Quantitative Comparison (including LLM Judge Scores) ---")\n',
                "if not k_comparison_df.empty:\n",
                "    # Identify float columns for formatting, excluding 'Model', 'Succ@3', 'Succ@5'\n",
                "    float_cols_k_comp = [c for c in k_comparison_df.columns if c not in ['Model', 'Succ@3', 'Succ@5']]\n",
                '    display(k_comparison_df.style.format("{:.3f}", subset=pd.IndexSlice[:, float_cols_k_comp], na_rep="N/A"))\n',
                "else:\n",
                '    print("  (No data for K=3 vs K=5 comparison table)")',
            ],
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["### 5.1. Visualizing K=3 vs. K=5 Differences (ΔF1)"],
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "if not k_comparison_df.empty and 'ΔF1' in k_comparison_df.columns:\n",
                '    delta_f1_df = k_comparison_df.dropna(subset=["ΔF1"])\n',
                "    if not delta_f1_df.empty:\n",
                "        plt.figure(figsize=(10, 6))\n",
                '        sns.barplot(data=delta_f1_df.sort_values("ΔF1", ascending=False), y="Model", x="ΔF1", palette="coolwarm")\n',
                "        plt.title('Change in F1 Score (K=5 vs. K=3)')\n",
                "        plt.xlabel('ΔF1 (F1@5 - F1@3)')\n",
                "        plt.ylabel('Base Model')\n",
                "        plt.axvline(0, color='grey', linestyle='--') # Add a line at zero for reference\n",
                "        plt.tight_layout()\n",
                "        plt.show()\n",
                "    else:\n",
                '        print("No models with ΔF1 data to plot.")\n',
                "else:\n",
                "    print(\"K comparison DataFrame is empty or missing 'ΔF1' column.\")",
            ],
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 6. Per-Document Analysis (Qualitative Insights)\n",
                "\n",
                "The `frames` dictionary holds the detailed per-document DataFrames for each configuration. You can use this for deeper qualitative analysis.\n",
                "\n",
                "For example, to inspect documents where a specific model (e.g., 'RAG (GPT-4.1-mini, K=5, Judge: Claude 3.5)') had a low judge score or a specific error:",
            ],
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "model_to_inspect_label = 'RAG (GPT-4.1-mini, K=5, Judge: Claude 3.5)'\n",
                "if model_to_inspect_label in frames:\n",
                "    df_inspect = frames[model_to_inspect_label]\n",
                "    if not df_inspect.empty and 'llm_judge_overall_quality_score' in df_inspect.columns:\n",
                "        # Ensure the judge score column is numeric for comparison\n",
                "        df_inspect['llm_judge_overall_quality_score_numeric'] = pd.to_numeric(df_inspect['llm_judge_overall_quality_score'], errors='coerce')\n",
                "        \n",
                "        low_score_threshold = 0.5\n",
                "        low_score_docs = df_inspect[df_inspect['llm_judge_overall_quality_score_numeric'] < low_score_threshold]\n",
                "        \n",
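                '        print(f"\\n--- Documents with Judge Score < {low_score_threshold} for {model_to_inspect_label} ---")\n',
                "        if not low_score_docs.empty:\n",
                "            # Show only the columns that are present; e.g. 'llm_judge_rationale' may be absent in some result files\n",
                "            cols = ['celex_id', 'type_set_f1', 'llm_judge_overall_quality_score', 'llm_judge_rationale']\n",
                "            display(low_score_docs[[c for c in cols if c in low_score_docs.columns]].head())\n",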
                "        else:\n",
                '            print(f"No documents found with judge score below {low_score_threshold}.")\n',
                "            \n",
                "        # Example: Find documents with a specific error (if 'error' column exists and is not None)\n",
                "        # specific_error_docs = df_inspect[df_inspect['error'].str.contains('Timeout', na=False) if 'error' in df_inspect.columns else pd.Series(False, index=df_inspect.index)]\n",
                "        # if not specific_error_docs.empty:\n",
                "        #     print(f\"\\n--- Documents with 'Timeout' error for {model_to_inspect_label} ---\")\n",
                "        #     display(specific_error_docs[['celex_id', 'error']].head())\n",
                "            \n",
                "    else:\n",
                "        print(f\"DataFrame for {model_to_inspect_label} is empty or missing 'llm_judge_overall_quality_score' column.\")\n",
                "else:\n",
                "    print(f\"Model '{model_to_inspect_label}' not found in loaded frames.\")",
            ],
        },
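        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "The per-document frames can also be compared pairwise. The cell below is a minimal sketch, not part of the original evaluation pipeline: it joins two configurations on `celex_id` and lists the documents whose `type_set_f1` differs most. It assumes each frame holds one row per `celex_id`; the two labels chosen and the `f1_diff` column name are illustrative.",
            ],
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# Sketch only. Assumption: one row per `celex_id` in each frame.\n",
                "label_a = 'RAG (GPT-4.1-mini, K=3, No Judge)'\n",
                "label_b = 'RAG (GPT-4.1-mini, K=5, No Judge)'\n",
                "df_a = frames.get(label_a, pd.DataFrame())\n",
                "df_b = frames.get(label_b, pd.DataFrame())\n",
                "needed = {'celex_id', 'type_set_f1'}\n",
                "\n",
                "if needed.issubset(df_a.columns) and needed.issubset(df_b.columns):\n",
                "    # Inner join keeps only documents evaluated under both configurations\n",
                "    paired = df_a[['celex_id', 'type_set_f1']].merge(\n",
                "        df_b[['celex_id', 'type_set_f1']], on='celex_id', suffixes=('_a', '_b')\n",
                "    )\n",
                "    for col in ('type_set_f1_a', 'type_set_f1_b'):\n",
                "        paired[col] = pd.to_numeric(paired[col], errors='coerce')\n",
                "    # Positive values mean configuration B scored higher on that document\n",
                "    paired['f1_diff'] = paired['type_set_f1_b'] - paired['type_set_f1_a']\n",
                "    display(paired.sort_values('f1_diff').head())\n",
                "else:\n",
                "    print('One of the frames is empty or lacks the required columns.')",
            ],
        },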
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 7. Conclusion\n",
                "\n",
                "This notebook provides a framework for analyzing the performance of different clause extraction models. It supports three kinds of analysis:\n",
                "*   Overall performance trends (F1, Precision, Recall, Judge Scores).\n",
                "*   The impact of changing K in RAG systems.\n",
                "*   Identification of specific documents or error types for deeper investigation.\n",
                "\n",
                "Further analysis could involve more sophisticated visualizations, statistical significance testing, or direct comparison of extracted clauses against gold standards for specific documents.",
            ],
        },
    ],
    "metadata": {
        "kernelspec": {
            "display_name": "Python 3 (ipykernel)",
            "language": "python",
            "name": "python3",
        },
        "language_info": {
            "codemirror_mode": {"name": "ipython", "version": 3},
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.11.7",
        },
    },
    "nbformat": 4,
    "nbformat_minor": 5,
}