# 🛡️ Presidio Baseline Framework - Technical Diagnostics

**Comprehensive diagnostic analysis for PII deidentification performance**

This notebook diagnoses the performance of the Microsoft Presidio baseline framework: it surfaces the transcripts and PII categories where detection fails, and examines how detection confidence relates to correctness.

## 📋 Analysis Sections
1. **Performance Overview** - Overall metrics and framework status
2. **Worst Recall Cases** - Top 5 cases with the lowest recall (missed PII)
3. **Worst Precision Cases** - Top 5 cases with the lowest precision (over-detection)
4. **Category Analysis** - Missed PII by type with improvement insights
5. **Confidence Analysis** - Model confidence vs correctness patterns

---

## 🔧 Setup and Configuration

In [12]:
import pandas as pd
import numpy as np
import sys
import os
from pathlib import Path
from IPython.display import HTML, display
import warnings
warnings.filterwarnings('ignore')

# Add src to path for imports
project_root = Path().absolute().parent
sys.path.append(str(project_root / 'src'))

# Import evaluation functions (following the requirement to use only /evaluation functions)
from evaluation.metrics import PIIEvaluator
from evaluation.diagnostics import (
    get_transcript_cases_by_performance,
    create_diagnostic_html_table_configurable,
    analyze_missed_pii_categories,
    analyze_confidence_vs_correctness
)

# Import baseline framework for flexible integration
from baseline.presidio_framework import PurePresidioFramework


## 📊 Data Loading and Framework Execution

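**Flexible Integration**: Load the ground truth data, then run the framework and write its results to `RESULTS_PATH`. The cell below always re-runs the framework; it does not check for existing results.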

In [13]:
# Configuration
DATA_PATH = project_root / '.data' / 'synthetic_call_transcripts.csv'
RESULTS_PATH = project_root / 'Demo' / 'presidio_baseline_results.csv'
EVALUATION_MODE = 'business'  # 'business' or 'research' - affects matching criteria

# print(f"🔍 Looking for data at: {DATA_PATH}")
# print(f"🔍 Looking for results at: {RESULTS_PATH}")

# Load ground truth data
if DATA_PATH.exists():
    ground_truth_df = pd.read_csv(DATA_PATH)
    print(f"✅ Loaded ground truth data: {len(ground_truth_df)} transcripts")
    # print(f"📋 Columns: {list(ground_truth_df.columns)}")
else:
    print(f"❌ Ground truth data not found at {DATA_PATH}")
    raise FileNotFoundError(f"Please ensure {DATA_PATH} exists")

 
# Initialize and run Presidio framework
framework = PurePresidioFramework(enable_mlflow=True)

# Process dataset
results_df = framework.process_dataset(
    csv_path=str(DATA_PATH),
    output_path=str(RESULTS_PATH)
)

print(f"✅ Framework processing complete: {len(results_df)} transcripts processed")
# print(f"💾 Results saved to {RESULTS_PATH}")

print(f"\n📊 DATASET OVERVIEW:")
print(f"   Ground Truth Transcripts: {len(ground_truth_df)}")
print(f"   Processed Results:        {len(results_df)}")
print(f"   Evaluation Mode:          {EVALUATION_MODE.upper()}")


✅ Loaded ground truth data: 100 transcripts
✅ MLflow experiment tracking enabled
🚀 Starting Pure Presidio Framework processing...
📊 Loaded 100 call transcripts
Processing transcript 100/100...
✅ Processing complete! Final metrics:
   total_transcripts: 100
   total_pii_detected: 779
   avg_pii_per_transcript: 7.79
   total_processing_time_seconds: 1.3932
   avg_processing_time_per_transcript_seconds: 0.0139
   estimated_time_for_1m_transcripts: 3.87 hours
   pii_types_distribution: {'EMAIL_ADDRESS': 100, 'LOCATION': 51, 'DATE_TIME': 164, 'PERSON': 369, 'PHONE_NUMBER': 95}
✅ MLflow metrics logged successfully
✅ Framework processing complete: 100 transcripts processed

📊 DATASET OVERVIEW:
   Ground Truth Transcripts: 100
   Processed Results:        100
   Evaluation Mode:          BUSINESS
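
The `estimated_time_for_1m_transcripts` figure in the log above is consistent with a simple linear extrapolation from the measured run. The sketch below reproduces that arithmetic; the helper name `estimate_hours` is hypothetical, and the calculation assumes throughput stays constant at scale, which a 100-transcript sample cannot confirm.

```python
def estimate_hours(total_seconds: float, n_processed: int, n_target: int) -> float:
    """Linearly extrapolate wall-clock hours needed for n_target transcripts."""
    seconds_per_transcript = total_seconds / n_processed
    return seconds_per_transcript * n_target / 3600


# Values taken from the processing log above.
hours = estimate_hours(total_seconds=1.3932, n_processed=100, n_target=1_000_000)
print(f"{hours:.2f} hours")  # prints "3.87 hours"
```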


## 📈 1. Performance Overview

High-level performance metrics for the baseline Presidio framework.

In [14]:
# Initialize evaluator
evaluator = PIIEvaluator(matching_mode=EVALUATION_MODE)

# Calculate overall framework performance
print("🔄 Calculating comprehensive framework evaluation...")
evaluation_results = evaluator.evaluate_framework_results(results_df, ground_truth_df)

# Print detailed evaluation summary
evaluator.print_evaluation_summary(evaluation_results)

# Extract key metrics for dashboard
overall_metrics = evaluation_results['overall_metrics']
overall_recall = overall_metrics['overall_recall']
overall_precision = overall_metrics['overall_precision']
overall_f1 = overall_metrics['overall_f1_score']
recall_target_met = overall_metrics.get('recall_target_achievement', False)

print(f"\n🎯 KEY PERFORMANCE INDICATORS:")
print(f"   📊 Overall Recall:     {overall_recall:.1%} {'✅' if recall_target_met else '❌'}")
print(f"   📊 Overall Precision:  {overall_precision:.1%}")
print(f"   📊 Overall F1-Score:   {overall_f1:.1%}")
print(f"   🎯 100% Recall Target: {'ACHIEVED' if recall_target_met else 'NOT ACHIEVED'}")

# Quick diagnostic insight
if overall_recall < 0.95:
    print(f"\n⚠️  DIAGNOSTIC INSIGHT: Recall below 95% - focus on missed PII analysis below")
if overall_precision < 0.80:
    print(f"\n⚠️  DIAGNOSTIC INSIGHT: Precision below 80% - focus on over-detection analysis below")


🔧 PIIEvaluator initialized with 'business' matching mode
   ✅ Business Focus: Any PII detection over ground truth = SUCCESS
🔄 Calculating comprehensive framework evaluation...

🎯 PII DEIDENTIFICATION EVALUATION RESULTS

📊 OVERALL PERFORMANCE:
   Precision:           0.751
   Recall:              0.792 ❌
   F1-Score:            0.771
   PII Protection Rate: 0.820 🛡️

📈 DETAILED COUNTS:
   True Positives:  605.9
   False Positives: 201
   False Negatives: 159

🔍 ENTITY TYPE BREAKDOWN:
   member_first_name    | P: 1.000 | R: 0.839 | F1: 0.912
   member_full_name     | P: 1.000 | R: 0.990 | F1: 0.995
   member_email         | P: 1.000 | R: 1.000 | F1: 1.000
   member_mobile        | P: 1.000 | R: 0.940 | F1: 0.969
   member_address       | P: 1.000 | R: 0.324 | F1: 0.490
   member_number        | P: 1.000 | R: 0.500 | F1: 0.667
   consultant_first_name | P: 1.000 | R: 0.713 | F1: 0.832
   DATE_TIME            | P: 0.000 | R: 0.000 | F1: 0.000
   PERSON               | P: 0.000 | R: 0.000 |
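
The overall precision, recall, and F1 reported above follow from the detailed counts by the standard definitions. The sketch below recomputes them as a sanity check; `prf` is a hypothetical helper, not part of `PIIEvaluator`. The fractional true-positive count (605.9) is taken as given from the output.

```python
def prf(tp: float, fp: float, fn: float) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts taken from the "DETAILED COUNTS" output above.
precision, recall, f1 = prf(tp=605.9, fp=201, fn=159)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")  # prints "P=0.751 R=0.792 F1=0.771"
```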

## 🔍 2. Top 5 Cases with Lowest Recall (Missed PII)

Identify transcripts where the most PII was missed to understand failure patterns.

In [15]:
# Get worst recall cases
worst_recall_cases = get_transcript_cases_by_performance(
    results_df=results_df,
    ground_truth_df=ground_truth_df,
    metric='recall',
    n_cases=5,
    ascending=True,  # True = worst performers first
    matching_mode=EVALUATION_MODE
)

# Create diagnostic HTML table
worst_recall_html = create_diagnostic_html_table_configurable(
    transcript_data=worst_recall_cases,
    title="🔴 Top 5 Worst Recall Cases - Missed PII Analysis",
    description=f"""These transcripts had the lowest recall scores, meaning significant PII was missed.
    <strong>Red highlights</strong> show missed PII that should have been detected.
    Focus on patterns in missed PII to improve detection rules.""",
    matching_mode=EVALUATION_MODE
)

display(HTML(worst_recall_html))

# Summary insights for worst recall cases
print(f"\n💡 RECALL IMPROVEMENT INSIGHTS:")
recall_scores = [case['performance_metrics']['recall'] for case in worst_recall_cases]
avg_worst_recall = np.mean(recall_scores)
print(f"   📉 Average recall in worst cases: {avg_worst_recall:.1%}")
print(f"   🎯 These cases need the most attention for PII detection improvements")
print(f"   🔍 Look for patterns in the missed PII (red highlights) above")


🔍 ANALYZING TRANSCRIPT PERFORMANCE BY RECALL
🔧 PIIEvaluator initialized with 'business' matching mode
   ✅ Business Focus: Any PII detection over ground truth = SUCCESS

📊 WORST 5 PERFORMERS BY RECALL:
  1. Call 70: recall=25.3%, Recall=25.3%, Precision=58.1%, F1=35.2%
  2. Call 99: recall=33.8%, Recall=33.8%, Precision=57.5%, F1=42.6%
  3. Call 71: recall=36.5%, Recall=36.5%, Precision=59.3%, F1=45.2%
  4. Call 87: recall=45.8%, Recall=45.8%, Precision=64.7%, F1=53.7%
  5. Call 4: recall=46.3%, Recall=46.3%, Precision=64.9%, F1=54.1%

✅ Prepared 5 cases for analysis
🔧 PIIEvaluator initialized with 'business' matching mode
   ✅ Business Focus: Any PII detection over ground truth = SUCCESS


📊 Metrics & Performance,📋 Original Transcript,🛡️ Cleaned Transcript
📋 CALL ID: 70  🎯 Total PII Occurrences: 11  📈 PERFORMANCE (BUSINESS):  • Recall: 25.3%  • Precision: 58.1%  • 🛡️ PII Protection: 54.0%  🎯 STATUS:  🔴 Needs Improvement,"Agent: Hi, this is Ava from Bricks Health Insurance. Agent: May I have your mobile number? Customer: 042285 817 432. Agent: And your email address, please? Customer: ava.taylor@example.com. Agent: Could you confirm your full name, please? Customer: Ava Michael Taylor. Agent: Could I please have your Bricks membership number? Customer: 56014981. Agent: Finally, could you provide your residential address? Customer: 330 Victoria Road, Perth WA 6000. Agent: Thank you for verifying, Ava. How can I assist you today?","Agent: Hi, this is Ava from Bricks Health Insurance. Agent: May I have your mobile number? Customer: <PHONE_NUMBER>. Agent: And your email address, please? Customer: <EMAIL_ADDRESS>. Agent: Could you confirm your full name, please? Customer: Ava <PERSON>. Agent: Could I please have your <PERSON> membership number? Customer: 56014981. Agent: Finally, could you provide your residential address? Customer: 330 Victoria Road, Perth WA 6000. Agent: Thank you for verifying, Ava. How can I assist you <DATE_TIME>?"
📋 CALL ID: 99  🎯 Total PII Occurrences: 8  📈 PERFORMANCE (BUSINESS):  • Recall: 33.8%  • Precision: 57.5%  • 🛡️ PII Protection: 51.9%  🎯 STATUS:  🔴 Needs Improvement,"Agent: Hi, this is James from Bricks Health Insurance. Agent: May I have your mobile number? Customer: 042006 843 674. Agent: Could you confirm your full name, please? Customer: Ella Marie Taylor. Agent: Could I please have your Bricks membership number? Customer: 98345291. Agent: Finally, could you provide your residential address? Customer: 948 Harbour Road, Sydney NSW 2000. Agent: And your email address, please? Customer: ella.taylor@example.com. Agent: Thank you for verifying, Ella. How can I assist you today?","Agent: Hi, this is James from Bricks Health Insurance. Agent: May I have your mobile number? Customer: <PHONE_NUMBER>. Agent: Could you confirm your full name, please? Customer: Ella <PERSON>. Agent: Could I please have your <PERSON> membership number? Customer: 98345291. Agent: Finally, could you provide your residential address? Customer: 948 Harbour Road, Sydney NSW 2000. Agent: And your email address, please? Customer: <EMAIL_ADDRESS>. Agent: Thank you for verifying, Ella. How can I assist you <DATE_TIME>?"
📋 CALL ID: 71  🎯 Total PII Occurrences: 8  📈 PERFORMANCE (BUSINESS):  • Recall: 36.5%  • Precision: 59.3%  • 🛡️ PII Protection: 84.0%  🎯 STATUS:  🔴 Needs Improvement,"Agent: Hi, this is James from Bricks Health Insurance. Agent: Could you confirm your full name, please? Customer: Ella Patrick Wilson. Agent: And your email address, please? Customer: ella.wilson@example.com. Agent: Finally, could you provide your residential address? Customer: 327 Victoria Road, Darwin NT 0800. Agent: Could I please have your Bricks membership number? Customer: 96961359. Agent: May I have your mobile number? Customer: 044701 480 783. Agent: Thank you for verifying, Ella. How can I assist you today?","Agent: Hi, this is James from Bricks Health Insurance. Agent: Could you confirm your full name, please? Customer: Ella <PERSON>. Agent: And your email address, please? Customer: <EMAIL_ADDRESS>. Agent: Finally, could you provide your residential address? Customer: 327 Victoria Road, <PERSON> NT 0800. Agent: Could I please have your <PERSON> membership number? Customer: 96961359. Agent: May I have your mobile number? Customer: <PHONE_NUMBER>. Agent: Thank you for verifying, Ella. How can I assist you <DATE_TIME>?"
📋 CALL ID: 87  🎯 Total PII Occurrences: 8  📈 PERFORMANCE (BUSINESS):  • Recall: 45.8%  • Precision: 64.7%  • 🛡️ PII Protection: 55.6%  🎯 STATUS:  🔴 Needs Improvement,"Agent: Hi, this is James from Bricks Health Insurance. Agent: Finally, could you provide your residential address? Customer: 159 Pine Street, Newcastle NSW 2300. Agent: Could I please have your Bricks membership number? Customer: 48520151. Agent: And your email address, please? Customer: grace.white@example.com. Agent: May I have your mobile number? Customer: 042434 480 515. Agent: Could you confirm your full name, please? Customer: Grace Louise White. Agent: Thank you for verifying, Grace. How can I assist you today?","Agent: Hi, this is James from Bricks Health Insurance. Agent: Finally, could you provide your residential address? Customer: 159 Pine Street, Newcastle NSW 2300. Agent: Could I please have your <PERSON> membership number? Customer: 48520151. Agent: And your email address, please? Customer: <EMAIL_ADDRESS>. Agent: May I have your mobile number? Customer: <PHONE_NUMBER>. Agent: Could you confirm your full name, please? Customer: Grace <PERSON>. Agent: Thank you for verifying, <PERSON>. How can I assist you <DATE_TIME>?"
📋 CALL ID: 4  🎯 Total PII Occurrences: 8  📈 PERFORMANCE (BUSINESS):  • Recall: 46.3%  • Precision: 64.9%  • 🛡️ PII Protection: 56.4%  🎯 STATUS:  🔴 Needs Improvement,"Agent: Hi, this is Noah from Bricks Health Insurance. Agent: And your email address, please? Customer: ella.white@example.com. Agent: Could you confirm your full name, please? Customer: Ella Andrew White. Agent: May I have your mobile number? Customer: 044928 834 779. Agent: Finally, could you provide your residential address? Customer: 592 Pine Street, Sydney NSW 2000. Agent: Could I please have your Bricks membership number? Customer: 53376329. Agent: Thank you for verifying, Ella. How can I assist you today?","Agent: Hi, this is <PERSON> from Bricks Health Insurance. Agent: And your email address, please? Customer: <EMAIL_ADDRESS>. Agent: Could you confirm your full name, please? Customer: Ella <PERSON>. Agent: May I have your mobile number? Customer: <PHONE_NUMBER>. Agent: Finally, could you provide your residential address? Customer: 592 Pine Street, Sydney NSW 2000. Agent: Could I please have your <PERSON> membership number? Customer: 53376329. Agent: Thank you for verifying, Ella. How can I assist you <DATE_TIME>?"



💡 RECALL IMPROVEMENT INSIGHTS:
   📉 Average recall in worst cases: 37.5%
   🎯 These cases need the most attention for PII detection improvements
   🔍 Look for patterns in the missed PII (red highlights) above
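
The log line "Any PII detection over ground truth = SUCCESS" suggests that business mode credits a ground-truth item whenever some detection overlaps it. The evaluator's actual logic is not shown in this notebook, so the following is only one plausible reading, sketched with hypothetical names (`overlaps`, `overlap_recall`, `span_of`) and character-offset spans. The sample text is shortened from Call 70, where only "Michael Taylor" was replaced and the membership number was left intact.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True if the two half-open spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def overlap_recall(truth: List[Span], detected: List[Span]) -> float:
    """Fraction of ground-truth spans touched by at least one detection."""
    if not truth:
        return 1.0
    hits = sum(any(overlaps(t, d) for d in detected) for t in truth)
    return hits / len(truth)


def span_of(text: str, value: str) -> Span:
    """Locate the first occurrence of value in text as a span."""
    start = text.index(value)
    return (start, start + len(value))


text = "Customer: Ava Michael Taylor. Customer: 56014981."
truth = [span_of(text, "Ava Michael Taylor"), span_of(text, "56014981")]
detected = [span_of(text, "Michael Taylor")]  # partial name hit, number missed

recall = overlap_recall(truth, detected)
print(recall)  # prints 0.5
```

Under this reading the partially masked full name counts as found, while the untouched membership number drags recall down.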


## 🎯 3. Top 5 Cases with Lowest Precision (Over-Detection)

Identify transcripts with the most false positives to understand over-detection patterns.


In [16]:
# Get worst precision cases
worst_precision_cases = get_transcript_cases_by_performance(
    results_df=results_df,
    ground_truth_df=ground_truth_df,
    metric='precision',
    n_cases=5,
    ascending=True,  # True = worst performers first
    matching_mode=EVALUATION_MODE
)

# Create diagnostic HTML table
worst_precision_html = create_diagnostic_html_table_configurable(
    transcript_data=worst_precision_cases,
    title="🟡 Top 5 Worst Precision Cases - Over-Detection Analysis",
    description=f"""These transcripts had the lowest precision scores, meaning many false positives occurred.
    <strong>Green highlights</strong> show detected/anonymized text - some may be over-detections.
    Focus on reducing false positives while maintaining recall.""",
    matching_mode=EVALUATION_MODE
)

display(HTML(worst_precision_html))

# Summary insights for worst precision cases
print(f"\n💡 PRECISION IMPROVEMENT INSIGHTS:")
precision_scores = [case['performance_metrics']['precision'] for case in worst_precision_cases]
avg_worst_precision = np.mean(precision_scores)
print(f"   📉 Average precision in worst cases: {avg_worst_precision:.1%}")
print(f"   🎯 Focus on reducing false positives in these patterns")
print(f"   ⚖️  Balance: Maintain high recall while improving precision")


🔍 ANALYZING TRANSCRIPT PERFORMANCE BY PRECISION
🔧 PIIEvaluator initialized with 'business' matching mode
   ✅ Business Focus: Any PII detection over ground truth = SUCCESS

📊 WORST 5 PERFORMERS BY PRECISION:
  1. Call 99: precision=57.5%, Recall=33.8%, Precision=57.5%, F1=42.6%
  2. Call 70: precision=58.1%, Recall=25.3%, Precision=58.1%, F1=35.2%
  3. Call 71: precision=59.3%, Recall=36.5%, Precision=59.3%, F1=45.2%
  4. Call 1: precision=64.0%, Recall=66.8%, Precision=64.0%, F1=65.4%
  5. Call 87: precision=64.7%, Recall=45.8%, Precision=64.7%, F1=53.7%

✅ Prepared 5 cases for analysis
🔧 PIIEvaluator initialized with 'business' matching mode
   ✅ Business Focus: Any PII detection over ground truth = SUCCESS


📊 Metrics & Performance,📋 Original Transcript,🛡️ Cleaned Transcript
📋 CALL ID: 99  🎯 Total PII Occurrences: 8  📈 PERFORMANCE (BUSINESS):  • Recall: 33.8%  • Precision: 57.5%  • 🛡️ PII Protection: 51.9%  🎯 STATUS:  🔴 Needs Improvement,"Agent: Hi, this is James from Bricks Health Insurance. Agent: May I have your mobile number? Customer: 042006 843 674. Agent: Could you confirm your full name, please? Customer: Ella Marie Taylor. Agent: Could I please have your Bricks membership number? Customer: 98345291. Agent: Finally, could you provide your residential address? Customer: 948 Harbour Road, Sydney NSW 2000. Agent: And your email address, please? Customer: ella.taylor@example.com. Agent: Thank you for verifying, Ella. How can I assist you today?","Agent: Hi, this is James from Bricks Health Insurance. Agent: May I have your mobile number? Customer: <PHONE_NUMBER>. Agent: Could you confirm your full name, please? Customer: Ella <PERSON>. Agent: Could I please have your <PERSON> membership number? Customer: 98345291. Agent: Finally, could you provide your residential address? Customer: 948 Harbour Road, Sydney NSW 2000. Agent: And your email address, please? Customer: <EMAIL_ADDRESS>. Agent: Thank you for verifying, Ella. How can I assist you <DATE_TIME>?"
📋 CALL ID: 70  🎯 Total PII Occurrences: 11  📈 PERFORMANCE (BUSINESS):  • Recall: 25.3%  • Precision: 58.1%  • 🛡️ PII Protection: 54.0%  🎯 STATUS:  🔴 Needs Improvement,"Agent: Hi, this is Ava from Bricks Health Insurance. Agent: May I have your mobile number? Customer: 042285 817 432. Agent: And your email address, please? Customer: ava.taylor@example.com. Agent: Could you confirm your full name, please? Customer: Ava Michael Taylor. Agent: Could I please have your Bricks membership number? Customer: 56014981. Agent: Finally, could you provide your residential address? Customer: 330 Victoria Road, Perth WA 6000. Agent: Thank you for verifying, Ava. How can I assist you today?","Agent: Hi, this is Ava from Bricks Health Insurance. Agent: May I have your mobile number? Customer: <PHONE_NUMBER>. Agent: And your email address, please? Customer: <EMAIL_ADDRESS>. Agent: Could you confirm your full name, please? Customer: Ava <PERSON>. Agent: Could I please have your <PERSON> membership number? Customer: 56014981. Agent: Finally, could you provide your residential address? Customer: 330 Victoria Road, Perth WA 6000. Agent: Thank you for verifying, Ava. How can I assist you <DATE_TIME>?"
📋 CALL ID: 71  🎯 Total PII Occurrences: 8  📈 PERFORMANCE (BUSINESS):  • Recall: 36.5%  • Precision: 59.3%  • 🛡️ PII Protection: 84.0%  🎯 STATUS:  🔴 Needs Improvement,"Agent: Hi, this is James from Bricks Health Insurance. Agent: Could you confirm your full name, please? Customer: Ella Patrick Wilson. Agent: And your email address, please? Customer: ella.wilson@example.com. Agent: Finally, could you provide your residential address? Customer: 327 Victoria Road, Darwin NT 0800. Agent: Could I please have your Bricks membership number? Customer: 96961359. Agent: May I have your mobile number? Customer: 044701 480 783. Agent: Thank you for verifying, Ella. How can I assist you today?","Agent: Hi, this is James from Bricks Health Insurance. Agent: Could you confirm your full name, please? Customer: Ella <PERSON>. Agent: And your email address, please? Customer: <EMAIL_ADDRESS>. Agent: Finally, could you provide your residential address? Customer: 327 Victoria Road, <PERSON> NT 0800. Agent: Could I please have your <PERSON> membership number? Customer: 96961359. Agent: May I have your mobile number? Customer: <PHONE_NUMBER>. Agent: Thank you for verifying, Ella. How can I assist you <DATE_TIME>?"
📋 CALL ID: 1  🎯 Total PII Occurrences: 8  📈 PERFORMANCE (BUSINESS):  • Recall: 66.8%  • Precision: 64.0%  • 🛡️ PII Protection: 90.7%  🎯 STATUS:  🔴 Needs Improvement,"Agent: Hi, this is Ava from Bricks Health Insurance. Agent: And your email address, please? Customer: sophia.thomas@example.com. Agent: Finally, could you provide your residential address? Customer: 942 Elizabeth Street, Gold Coast QLD 4217. Agent: May I have your mobile number? Customer: 044587 140 333. Agent: Could I please have your Bricks membership number? Customer: 15504108. Agent: Could you confirm your full name, please? Customer: Sophia Anthony Thomas. Agent: Thank you for verifying, Sophia. How can I assist you today?","Agent: Hi, this is Ava from Bricks Health Insurance. Agent: And your email address, please? Customer: <EMAIL_ADDRESS>. Agent: Finally, could you provide your residential address? Customer: 942 Elizabeth Street, <LOCATION> <DATE_TIME>. Agent: May I have your mobile number? Customer: <PHONE_NUMBER>. Agent: Could I please have your <PERSON> membership number? Customer: 15504108. Agent: Could you confirm your full name, please? Customer: <PERSON>. Agent: Thank you for verifying, <PERSON>. How can I assist you <DATE_TIME>?"
📋 CALL ID: 87  🎯 Total PII Occurrences: 8  📈 PERFORMANCE (BUSINESS):  • Recall: 45.8%  • Precision: 64.7%  • 🛡️ PII Protection: 55.6%  🎯 STATUS:  🔴 Needs Improvement,"Agent: Hi, this is James from Bricks Health Insurance. Agent: Finally, could you provide your residential address? Customer: 159 Pine Street, Newcastle NSW 2300. Agent: Could I please have your Bricks membership number? Customer: 48520151. Agent: And your email address, please? Customer: grace.white@example.com. Agent: May I have your mobile number? Customer: 042434 480 515. Agent: Could you confirm your full name, please? Customer: Grace Louise White. Agent: Thank you for verifying, Grace. How can I assist you today?","Agent: Hi, this is James from Bricks Health Insurance. Agent: Finally, could you provide your residential address? Customer: 159 Pine Street, Newcastle NSW 2300. Agent: Could I please have your <PERSON> membership number? Customer: 48520151. Agent: And your email address, please? Customer: <EMAIL_ADDRESS>. Agent: May I have your mobile number? Customer: <PHONE_NUMBER>. Agent: Could you confirm your full name, please? Customer: Grace <PERSON>. Agent: Thank you for verifying, <PERSON>. How can I assist you <DATE_TIME>?"



💡 PRECISION IMPROVEMENT INSIGHTS:
   📉 Average precision in worst cases: 60.7%
   🎯 Focus on reducing false positives in these patterns
   ⚖️  Balance: Maintain high recall while improving precision
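
Two over-detections recur in the cleaned transcripts above: the company name "Bricks" is tagged as `<PERSON>`, and "today" is tagged as `<DATE_TIME>`. One way to reduce such false positives is to filter detections against an allow list of known non-PII terms. The sketch below is a plain-Python illustration; the detection dictionaries, their keys, and `drop_allow_listed` are hypothetical and do not reflect the framework's actual output format.

```python
# Terms known to be non-PII in this call-centre domain (illustrative only).
ALLOW_LIST = {"bricks", "today"}


def drop_allow_listed(detections, allow_list=ALLOW_LIST):
    """Keep only detections whose text is not an allow-listed term (case-insensitive)."""
    return [d for d in detections if d["text"].strip().lower() not in allow_list]


detections = [
    {"text": "Bricks", "entity_type": "PERSON"},
    {"text": "today", "entity_type": "DATE_TIME"},
    {"text": "Michael Taylor", "entity_type": "PERSON"},
]

kept = drop_allow_listed(detections)
print([d["text"] for d in kept])  # prints ['Michael Taylor']
```

An exact-match allow list is deliberately narrow: it removes the recurring false positives without suppressing detections that merely contain one of the listed words.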


## 📊 4. Category Analysis - Missed PII by Type

Detailed breakdown of missed PII by category to identify specific improvement areas.

In [17]:
# Analyze missed PII by categories
category_analysis = analyze_missed_pii_categories(
    results_df=results_df,
    ground_truth_df=ground_truth_df,
    matching_mode=EVALUATION_MODE
)

# Display detailed category insights
print(f"\n🔍 DETAILED CATEGORY ANALYSIS:")
print(f"=" * 60)

improvement_insights = category_analysis['improvement_insights']
missed_by_category = category_analysis['missed_by_category']
transcripts_with_misses = category_analysis['transcripts_with_misses']
transcripts_with_detections = category_analysis['transcripts_with_detections']

# Priority-based improvement recommendations
high_priority = [(cat, data) for cat, data in improvement_insights.items() if data['priority'] == 'HIGH']
medium_priority = [(cat, data) for cat, data in improvement_insights.items() if data['priority'] == 'MEDIUM']
low_priority = [(cat, data) for cat, data in improvement_insights.items() if data['priority'] == 'LOW']

if high_priority:
    print(f"\n🔴 HIGH PRIORITY IMPROVEMENTS:")
    for category, data in high_priority:
        print(f"   {category:20} | Miss Rate: {data['miss_rate']:.1%} | Total: {data['total_occurrences']}")
        
        # Show examples of missed vs detected for this category
        missed_examples = transcripts_with_misses.get(category, [])[:2]  # Top 2 examples
        detected_examples = transcripts_with_detections.get(category, [])[:2]  # Top 2 examples
        
        if missed_examples:
            print(f"     🔍 MISSED Examples:")
            for example in missed_examples:
                print(f"       Call {example['call_id']}: '{example['missed_value']}' in context: ...{example['context']}...")
        
        if detected_examples:
            print(f"     ✅ DETECTED Examples:")
            for example in detected_examples:
                print(f"       Call {example['call_id']}: '{example['detected_value']}' (overlap: {example['overlap_ratio']:.2f})")
        print()

if medium_priority:
    print(f"\n🟡 MEDIUM PRIORITY IMPROVEMENTS:")
    for category, data in medium_priority:
        print(f"   {category:20} | Miss Rate: {data['miss_rate']:.1%} | Total: {data['total_occurrences']}")

if low_priority:
    print(f"\n🟢 LOW PRIORITY (Performing Well):")
    for category, data in low_priority:
        print(f"   {category:20} | Miss Rate: {data['miss_rate']:.1%} | Total: {data['total_occurrences']}")

# Strategic recommendations
print(f"\n🎯 STRATEGIC RECOMMENDATIONS:")
if high_priority:
    print(f"   1. Focus development efforts on HIGH priority categories above")
    print(f"   2. Analyze the missed vs detected examples for pattern differences")
    print(f"   3. Consider custom recognizers for problematic categories")
else:
    print(f"   🎉 No high-priority issues found - framework performing well across categories!")

print(f"   4. Monitor medium priority categories for regression")
print(f"   5. Use context patterns from examples to improve detection rules")


🔍 ANALYZING MISSED PII BY CATEGORIES
🔧 PIIEvaluator initialized with 'business' matching mode
   ✅ Business Focus: Any PII detection over ground truth = SUCCESS

📊 MISSED PII SUMMARY:
  member_number        | Missed:  50/100 (50.0%) | Priority: HIGH
  member_address       | Missed:  38/100 (38.0%) | Priority: HIGH
  member_first_name    | Missed:  33/205 (16.1%) | Priority: MEDIUM
  consultant_first_name | Missed:  31/108 (28.7%) | Priority: MEDIUM
  member_mobile        | Missed:   6/100 (6.0%) | Priority: LOW
  member_full_name     | Missed:   1/100 (1.0%) | Priority: LOW

🔍 DETAILED CATEGORY ANALYSIS:

🔴 HIGH PRIORITY IMPROVEMENTS:
   member_number        | Miss Rate: 50.0% | Total: 100
     🔍 MISSED Examples:
       Call 1: '15504108' in context: ...ase have your Bricks membership number?
Customer: 15504108.
Agent: Could you confirm your full name, please?...
       Call 2: '28001889' in context: ...ase have your Bricks membership number?
Customer: 28001889.
Agent: May I have your 
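
The missed `member_number` examples above share a context: an 8-digit number given as the customer's reply to a question ending in "membership number?". A context-anchored regular expression is one candidate starting point for a custom recognizer. The sketch below uses only the standard `re` module; the pattern and `find_member_numbers` are hypothetical, assume the 8-digit format seen in the examples, and have not been validated against the full dataset.

```python
import re

# An 8-digit number in the customer's reply directly after the membership question.
MEMBER_NUMBER_RE = re.compile(
    r"membership number\?\s*Customer:\s*(\d{8})\b",
    re.IGNORECASE,
)


def find_member_numbers(text: str):
    """Return (start, end, value) for each membership number found in text."""
    return [(m.start(1), m.end(1), m.group(1)) for m in MEMBER_NUMBER_RE.finditer(text)]


sample = (
    "Agent: Could I please have your Bricks membership number?\n"
    "Customer: 15504108.\n"
    "Agent: May I have your mobile number?\n"
    "Customer: 044587 140 333."
)

hits = find_member_numbers(sample)
print([value for _, _, value in hits])  # prints ['15504108']
```

Anchoring on the question keeps the pattern from matching other digit runs, such as the mobile number, at the cost of missing membership numbers mentioned in any other phrasing.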

## 🧠 5. Confidence Analysis - Model Thinking Patterns

Compare confidence scores against correctness to see where the model is overconfident (confident but wrong) or underconfident (correct despite low confidence).


In [18]:
# Analyze confidence vs correctness patterns
confidence_analysis = analyze_confidence_vs_correctness(
    results_df=results_df,
    ground_truth_df=ground_truth_df,
    matching_mode=EVALUATION_MODE
)

# Extract analysis results
high_conf_correct = confidence_analysis['high_confidence_correct']
high_conf_wrong = confidence_analysis['high_confidence_wrong']
low_conf_correct = confidence_analysis['low_confidence_correct']
low_conf_wrong = confidence_analysis['low_confidence_wrong']
summary_stats = confidence_analysis['summary_stats']

print(f"\n🧠 CONFIDENCE PATTERN ANALYSIS:")
print(f"=" * 60)

# Model thinking insights
print(f"\n🎯 MODEL THINKING INSIGHTS:")
print(f"   📊 High Confidence Accuracy: {summary_stats['high_conf_accuracy']:.1%}")
print(f"   📊 Low Confidence Accuracy:  {summary_stats['low_conf_accuracy']:.1%}")
print(f"   📊 Average Confidence:       {summary_stats['avg_confidence']:.3f}")

# Confidence calibration analysis
if summary_stats['high_conf_accuracy'] > 0.9:
    print(f"   ✅ Well-calibrated: High confidence predictions are highly accurate")
elif summary_stats['high_conf_accuracy'] < 0.7:
    print(f"   ⚠️  Over-confident: High confidence predictions often wrong - needs calibration")
else:
    print(f"   🟡 Moderately calibrated: Some high confidence errors occur")

# Show problematic high-confidence wrong cases
if high_conf_wrong:
    print(f"\n🔴 HIGH CONFIDENCE BUT WRONG (Overconfident Errors):")
    print(f"   These cases show where the model is confident but incorrect - critical to fix:")
    for i, case in enumerate(high_conf_wrong[:3], 1):  # Show top 3
        print(f"   {i}. Call {case['call_id']}: '{case['detected_value']}' as {case['detected_type']} (conf: {case['model_confidence']:.3f})")
        print(f"      Context: ...{case['context']}...")
        print()

# Show uncertain but correct cases (hidden gems)
if low_conf_correct:
    print(f"\n🟢 LOW CONFIDENCE BUT CORRECT (Hidden Gems):")
    print(f"   These cases show where the model correctly detected PII despite low confidence:")
    for i, case in enumerate(low_conf_correct[:3], 1):  # Show top 3
        print(f"   {i}. Call {case['call_id']}: '{case['detected_value']}' as {case['detected_type']} (conf: {case['model_confidence']:.3f})")
        print(f"      Context: ...{case['context']}...")
        print()

# Confidence threshold analysis
print(f"\n📊 CONFIDENCE THRESHOLD ANALYSIS:")
confidence_thresholds = confidence_analysis['confidence_thresholds']
for threshold, data in confidence_thresholds.items():
    print(f"   Threshold {threshold}: {data['percentage_above']:.1f}% of detections above")

# Recommendations
print(f"\n🎯 CONFIDENCE-BASED RECOMMENDATIONS:")
if len(high_conf_wrong) > 0.2 * (len(high_conf_wrong) + len(high_conf_correct)):  # >20% error rate in high confidence
    print(f"   1. 🔴 CRITICAL: Review high-confidence wrong cases - model overconfidence issue")
    print(f"   2. Consider confidence threshold tuning or model recalibration")
    
if len(low_conf_correct) > 10:
    print(f"   3. 🟢 OPPORTUNITY: Many low-confidence correct cases - confidence scores too conservative")
    print(f"   4. Investigate why correct detections have low confidence")
    
print(f"   5. Use confidence patterns to implement adaptive detection strategies")
print(f"   6. Focus manual review on low-confidence detections if implementing human-in-loop")


🔍 ANALYZING CONFIDENCE vs CORRECTNESS
🔧 PIIEvaluator initialized with 'business' matching mode
   ✅ Business Focus: Any PII detection over ground truth = SUCCESS

📊 CONFIDENCE vs CORRECTNESS SUMMARY:
  High Confidence + Correct:    458 cases
  High Confidence + Wrong:      226 cases
  Low Confidence + Correct:       0 cases
  Low Confidence + Wrong:         0 cases

🎯 INSIGHTS:
  High Confidence Accuracy: 67.0%
  Low Confidence Accuracy:  0.0%
  Avg Confidence Score:     0.857

🧠 CONFIDENCE PATTERN ANALYSIS:

🎯 MODEL THINKING INSIGHTS:
   📊 High Confidence Accuracy: 67.0%
   📊 Low Confidence Accuracy:  0.0%
   📊 Average Confidence:       0.857
   ⚠️  Over-confident: High confidence predictions often wrong - needs calibration

🔴 HIGH CONFIDENCE BUT WRONG (Overconfident Errors):
   These cases show where the model is confident but incorrect - critical to fix:
   1. Call 1: '4217' as DATE_TIME (conf: 0.850)
      Context: ...s?
Customer: 942 Elizabeth Street, Gold Coast QLD 4217.
Agent: M
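
The summary above reports a low-confidence accuracy of 0.0% even though the low-confidence bucket contains no cases, so that figure reflects an empty bucket rather than poor performance. The sketch below recomputes per-bucket accuracy from the reported counts and returns NaN for an empty bucket; `bucket_accuracy` is a hypothetical helper, not the function used by `analyze_confidence_vs_correctness`.

```python
import math


def bucket_accuracy(n_correct: int, n_wrong: int) -> float:
    """Accuracy within a confidence bucket; NaN when the bucket is empty."""
    total = n_correct + n_wrong
    return n_correct / total if total else math.nan


# Counts taken from the "CONFIDENCE vs CORRECTNESS SUMMARY" output above.
high = bucket_accuracy(n_correct=458, n_wrong=226)
low = bucket_accuracy(n_correct=0, n_wrong=0)

print(f"High confidence accuracy: {high:.1%}")  # prints "High confidence accuracy: 67.0%"
print(f"Low confidence bucket empty: {math.isnan(low)}")  # prints "Low confidence bucket empty: True"
```

With every detection falling in the high-confidence bucket, the scores here cannot separate correct from incorrect detections, so threshold tuning alone is unlikely to help on this dataset.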