
# Error Analysis


#### **1. Retrieval Precision: Hits vs. Near Misses**

In a legal context, a retrieval "miss" often means only that the exact paragraph was not retrieved, while the correct article or section was still found.

* **Single-Turn Resilience:** **61.1%** of single-turn retrieval misses (33 of 54) were "Near Misses," where the system found the correct article/recital/annex but the wrong specific paragraph or subpoint.
* **Reranking Impact:** Reranking tightens precision. In single-turn queries it halves retrieval misses (54 to 27) and raises the near-miss share to **66.7%**. In multi-turn scenarios, **83.6%** of misses were "Near Misses," so the model usually had the correct article even when it missed the specific paragraph.
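
The hit / near-miss distinction above can be sketched as follows. This is a minimal illustration, not the logic inside `ErrorAnalyzer`: it assumes chunk identifiers of the hypothetical form `Article 5(2)(a)`, where everything from the first parenthesis onward names the paragraph or subpoint. The function names are likewise made up for this example.

```python
import re


def parent_unit(chunk_id: str) -> str:
    """Drop paragraph/subpoint suffixes: 'Article 5(2)(a)' -> 'Article 5'.

    Assumes (hypothetically) that sub-article structure is written in parentheses.
    """
    return re.sub(r"\(.*$", "", chunk_id).strip()


def classify_retrieval(gold_id: str, retrieved_ids: list[str]) -> str:
    """Label one query as 'exact_hit', 'near_miss', or 'miss'."""
    if gold_id in retrieved_ids:
        return "exact_hit"
    gold_parent = parent_unit(gold_id)
    # Near miss: right article/recital/annex, wrong paragraph or subpoint.
    if any(parent_unit(r) == gold_parent for r in retrieved_ids):
        return "near_miss"
    return "miss"


def near_miss_share(labels: list[str]) -> float:
    """Fraction of non-exact results that are near misses."""
    misses = [label for label in labels if label != "exact_hit"]
    if not misses:
        return 0.0
    return sum(label == "near_miss" for label in misses) / len(misses)
```

With 33 near misses among 54 total misses, `near_miss_share` returns 33/54, which rounds to the 61.1% reported for the single-turn run.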

#### **2. RAG Behavior Categories**

We categorize the system's output into four types to understand its failure modes:

* **Success (Retrieval + Good Answer):** The target state. This occurs in over **90%** of single-turn queries.
* **Lucky Guess (Retrieval Missed + Good Answer):** The model uses its internal knowledge and closely related context to answer correctly despite missing the specific text. This is common in multi-turn dialogues (**14%**).
* **System Failure (Retrieval Missed + Poor Answer):** The model lacks the specific legal knowledge to answer without the correct text.
* **Context Ignored (Retrieval + Poor Answer):** A rare event (**<5%**) where the model has the right text but fails to use it. Its rarity indicates high model "faithfulness" to the provided law.
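
The four categories form a 2x2 grid over "was the target text retrieved?" and "was the answer good?". The sketch below shows that mapping under stated assumptions: `good_threshold` and the 0-10 score scale are hypothetical choices for illustration, and `categorize` and `category_shares` are illustrative names, not the `ErrorAnalyzer` API.

```python
from collections import Counter


def categorize(retrieval_hit: bool, answer_score: float,
               good_threshold: float = 7.0) -> str:
    """Map one record onto the four behavior categories.

    good_threshold is an assumed cutoff on an assumed 0-10 answer score.
    """
    good_answer = answer_score >= good_threshold
    if retrieval_hit and good_answer:
        return "Success"
    if not retrieval_hit and good_answer:
        return "Lucky Guess"
    if not retrieval_hit:
        return "System Failure"
    return "Context Ignored"


def category_shares(labels: list[str]) -> dict[str, float]:
    """Percentage of records in each category, rounded to one decimal."""
    counts = Counter(labels)
    total = len(labels)
    return {name: round(100 * n / total, 1) for name, n in counts.items()}
```

Applied to counts of 675, 29, 25, and 18 out of 747 records, `category_shares` gives 90.4%, 3.9%, 3.3%, and 2.4%, matching the single-turn output.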



In [9]:
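import sys
from pathlib import Path
import glob
import os

# The notebook lives one level below the project root; resolve it once
# and add it to sys.path so that `src` is importable.
project_root = Path.cwd().parent
sys.path.append(str(project_root))

from src.analysis.analyze_errors import ErrorAnalyzer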


In [13]:
# Single-turn run without reranking.
print("Single Turn")
file_path = project_root / "results/rag_eval_full_single_queries_20260112_200932.json"

analyzer = ErrorAnalyzer(file_path)

analyzer.check_near_misses()         # exact hits vs. near misses among retrieval misses
analyzer.categorize_failures()       # the four RAG behavior categories
analyzer.analyze_multi_turn_decay()  # per-turn decay; no table appears for this single-turn run

# Same analyses on the reranked single-turn run.
print("\nSingle Turn Reranked")
file_path = project_root / "results/rag_eval_full_single_turn_reranked_20260113_020921.json"

analyzer = ErrorAnalyzer(file_path)

analyzer.check_near_misses()
analyzer.categorize_failures()
analyzer.analyze_multi_turn_decay()

Single Turn
Loaded 747 records for analysis.

RETRIEVAL: Exact Hits vs. Near Misses
----------------------------------------
Total Retrieval Misses: 54
Near Misses (Correct Doc, Wrong Chunk): 33
-> 61.1% of misses were actually close!

RAG BEHAVIOR CATEGORIES
----------------------------------------
Success (Retrieval + Good Answer): 675 (90.4%)
Lucky Guess (Rretrieval missed + Good Answer): 29 (3.9%)
System Failure (Retrieval missed + Poor Answer): 25 (3.3%)
Context Ignored (Retrieval + Poor Answer): 18 (2.4%)

Single Turn Reranked
Loaded 747 records for analysis.

RETRIEVAL: Exact Hits vs. Near Misses
----------------------------------------
Total Retrieval Misses: 27
Near Misses (Correct Doc, Wrong Chunk): 18
-> 66.7% of misses were actually close!

RAG BEHAVIOR CATEGORIES
----------------------------------------
Success (Retrieval + Good Answer): 701 (93.8%)
Context Ignored (Retrieval + Poor Answer): 19 (2.5%)
Lucky Guess (Rretrieval missed + Good Answer): 18 (2.4%)
System Failure 

In [15]:
# Multi-turn run without reranking.
print("Multi Turn")
file_path = project_root / "results/rag_eval_full_multi_turn_20260114_215828.json"

analyzer = ErrorAnalyzer(file_path)

analyzer.check_near_misses()         # exact hits vs. near misses among retrieval misses
analyzer.categorize_failures()       # the four RAG behavior categories
analyzer.analyze_multi_turn_decay()  # per-turn scores, hit rate, and baseline comparison

# Same analyses on the reranked multi-turn run.
print("\nMulti Turn Reranked")
file_path = project_root / "results/rag_eval_full_multi_turn_reranked_20260114_202740.json"

analyzer = ErrorAnalyzer(file_path)

analyzer.check_near_misses()
analyzer.categorize_failures()
analyzer.analyze_multi_turn_decay()

Multi Turn
Loaded 228 records for analysis.

RETRIEVAL: Exact Hits vs. Near Misses
----------------------------------------
Total Retrieval Misses: 55
Near Misses (Correct Doc, Wrong Chunk): 33
-> 60.0% of misses were actually close!

RAG BEHAVIOR CATEGORIES
----------------------------------------
Success (Retrieval + Good Answer): 165 (72.4%)
System Failure (Retrieval missed + Poor Answer): 30 (13.2%)
Lucky Guess (Rretrieval missed + Good Answer): 25 (11.0%)
Context Ignored (Retrieval + Poor Answer): 8 (3.5%)

Multi-Turn Performance Decay
----------------------------------------
            rag_score hit_rate baseline_score
turn_number                                  
1                8.13    84.5%           5.42
2                7.52    77.5%           7.23
3                6.64    65.2%           7.13
4                7.24    76.5%           7.00

Context Dependency Check:
Turn 1: RAG beat Baseline in 49/71 cases
Turn 2: RAG beat Baseline in 28/71 cases
Turn 3: RAG beat Baseline i