# 🛡️ Presidio Baseline Framework - Technical Diagnostics

**Comprehensive diagnostic analysis for PII deidentification performance**

This notebook analyzes how the Microsoft Presidio baseline framework performs on PII deidentification, showing where detection fails and which entity categories offer the most room for improvement.

## 📋 Analysis Sections
1. **Performance Overview** - Overall metrics and framework status
2. **Worst Recall Cases** - The 5 transcripts with the lowest recall (most missed PII)
3. **Category Analysis** - Missed PII by type with improvement insights

## 🔧 Setup and Configuration

In [1]:
import pandas as pd
import numpy as np
import sys
import os
from pathlib import Path
from IPython.display import HTML, display
import warnings
warnings.filterwarnings('ignore')

# Add src to path for imports
project_root = Path().absolute().parent
sys.path.append(str(project_root / 'src'))

# Import evaluation functions (following the requirement to use only /evaluation functions)
from evaluation.metrics import PIIEvaluator
from evaluation.diagnostics import (
    get_transcript_cases_by_performance,
    create_diagnostic_html_table_configurable,
    analyze_missed_pii_categories,
    analyze_confidence_vs_correctness
)

# Import baseline framework for flexible integration
from baseline.presidio_framework import PurePresidioFramework


## 📊 Data Loading and Framework Execution

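**Flexible Integration**: The cell below loads the ground truth data, then runs the framework and writes its results to `RESULTS_PATH`. It does not yet reuse an existing results file, so each execution reprocesses the dataset.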
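
The configuration cell that follows calls `process_dataset` unconditionally. A load-if-present variant could be sketched as below. The helper name `load_or_run` and its callable parameters are hypothetical and are not part of the project's `evaluation` or `baseline` packages.

```python
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_run(results_path: Path,
                load: Callable[[Path], T],
                run: Callable[[], T]) -> T:
    """Return cached results when the file exists, otherwise compute them.

    `load` and `run` are supplied by the caller, so this helper makes no
    assumption about the file format or the framework being executed.
    """
    if results_path.exists():
        return load(results_path)
    return run()
```

In this notebook, `load` could plausibly be `pd.read_csv` and `run` a lambda wrapping `framework.process_dataset(csv_path=..., output_path=...)`, assuming the saved CSV has the same columns as the returned DataFrame.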

In [2]:
# Configuration
DATA_PATH = project_root / '.data' / 'synthetic_call_transcripts.csv'
RESULTS_PATH = project_root / 'demo' / 'presidio_baseline_results.csv'
EVALUATION_MODE = 'business'  # 'business' or 'research' - affects matching criteria

# print(f"🔍 Looking for data at: {DATA_PATH}")
# print(f"🔍 Looking for results at: {RESULTS_PATH}")

# Load ground truth data
if DATA_PATH.exists():
    ground_truth_df = pd.read_csv(DATA_PATH)
    print(f"✅ Loaded ground truth data: {len(ground_truth_df)} transcripts")
    # print(f"📋 Columns: {list(ground_truth_df.columns)}")
else:
    print(f"❌ Ground truth data not found at {DATA_PATH}")
    raise FileNotFoundError(f"Please ensure {DATA_PATH} exists")

 
# Initialize and run Presidio framework
framework = PurePresidioFramework(enable_mlflow=True)

# Process dataset
results_df = framework.process_dataset(
    csv_path=str(DATA_PATH),
    output_path=str(RESULTS_PATH)
)

print(f"✅ Framework processing complete: {len(results_df)} transcripts processed")
# print(f"💾 Results saved to {RESULTS_PATH}")

print(f"\n📊 DATASET OVERVIEW:")
print(f"   Ground Truth Transcripts: {len(ground_truth_df)}")
print(f"   Processed Results:        {len(results_df)}")
print(f"   Evaluation Mode:          {EVALUATION_MODE.upper()}")


✅ Loaded ground truth data: 100 transcripts
✅ MLflow experiment tracking enabled
🚀 Starting Pure Presidio Framework processing...
📊 Loaded 100 call transcripts
Processing transcript 100/100...
✅ Processing complete! Final metrics:
   total_transcripts: 100
   total_pii_detected: 979
   avg_pii_per_transcript: 9.79
   total_processing_time_seconds: 1.4557
   avg_processing_time_per_transcript_seconds: 0.0146
   estimated_time_for_1m_transcripts: 4.04 hours
✅ MLflow metrics logged successfully
✅ Framework processing complete: 100 transcripts processed

📊 DATASET OVERVIEW:
   Ground Truth Transcripts: 100
   Processed Results:        100
   Evaluation Mode:          BUSINESS
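
The `estimated_time_for_1m_transcripts` figure in the log is consistent with a simple linear extrapolation of the measured run time. The sketch below reproduces that arithmetic. It assumes single-process throughput stays constant at scale, which the log does not confirm, and the function name is illustrative.

```python
def estimate_hours(total_seconds: float, n_processed: int, n_target: int) -> float:
    """Linearly extrapolate wall-clock hours needed for n_target items."""
    seconds_per_item = total_seconds / n_processed
    return seconds_per_item * n_target / 3600


# Values taken from the processing log above: 1.4557 s for 100 transcripts.
hours = estimate_hours(1.4557, 100, 1_000_000)
print(f"{hours:.2f} hours")  # 4.04 hours
```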


## 📈 1. Performance Overview

High-level performance metrics for the baseline Presidio framework.
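
The evaluator in this section is built with a matching mode, and its output describes `'business'` as counting any detection over a ground-truth span as a success. A minimal sketch of that idea follows. The `Span` type and both functions are hypothetical, and treating `'research'` as exact boundary matching is an assumption: the notebook only says the mode affects matching criteria.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset


def overlaps(a: Span, b: Span) -> bool:
    """True when the two half-open intervals share at least one character."""
    return a.start < b.end and b.start < a.end


def is_match(truth: Span, pred: Span, mode: str = "business") -> bool:
    if mode == "business":
        return overlaps(truth, pred)   # any overlap counts
    return truth == pred               # assumed strict mode: exact boundaries


text = "Customer: Ava Michael Taylor."


def find(sub: str) -> Span:
    start = text.index(sub)
    return Span(start, start + len(sub))


truth = find("Ava Michael Taylor")   # full name in the ground truth
pred = find("Michael Taylor")        # partial detection, as in Call 70
print(is_match(truth, pred, "business"))  # True
print(is_match(truth, pred, "research"))  # False
```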

In [3]:
# Initialize evaluator
evaluator = PIIEvaluator(matching_mode=EVALUATION_MODE)

# Calculate overall framework performance
print("🔄 Calculating comprehensive framework evaluation...")
evaluation_results = evaluator.evaluate_framework_results(results_df, ground_truth_df)

# Print detailed evaluation summary
evaluator.print_evaluation_summary(evaluation_results)

🔧 PIIEvaluator initialized with 'business' matching mode
   ✅ Business Focus: Any PII detection over ground truth = SUCCESS
🔄 Calculating comprehensive framework evaluation...

🎯 PII DEIDENTIFICATION EVALUATION RESULTS

📊 OVERALL PERFORMANCE:
   Precision:           0.786
   Recall:              0.912 ❌
   F1-Score:            0.844
   PII Protection Rate: 0.975 🛡️

📈 DETAILED COUNTS:
   True Positives:  737.7
   False Positives: 201
   False Negatives: 71

🔍 ENTITY TYPE BREAKDOWN:
   member_email           | P: 1.000 | R: 1.000 | F1: 1.000
   member_address         | P: 1.000 | R: 1.000 | F1: 1.000
   member_number          | P: 1.000 | R: 1.000 | F1: 1.000
   member_full_name       | P: 1.000 | R: 0.990 | F1: 0.995
   member_mobile          | P: 1.000 | R: 0.940 | F1: 0.969
   member_first_name      | P: 1.000 | R: 0.839 | F1: 0.912
   consultant_first_name  | P: 1.000 | R: 0.713 | F1: 0.832
   DATE_TIME              | P: 0.000 | R: 0.000 | F1: 0.000
   PERSON                 | P: 0.
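
The overall precision, recall, and F1 reported above can be recomputed from the detailed counts. The sketch below uses the standard definitions. The fractional true-positive count (737.7) suggests the evaluator awards partial credit for some matches, but that is an inference from the output rather than something the notebook states.

```python
def precision_recall_f1(tp: float, fp: float, fn: float) -> tuple[float, float, float]:
    """Standard definitions, guarded against empty denominators."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Counts from the "DETAILED COUNTS" output above.
p, r, f1 = precision_recall_f1(tp=737.7, fp=201, fn=71)
print(f"Precision: {p:.3f}  Recall: {r:.3f}  F1: {f1:.3f}")
# Precision: 0.786  Recall: 0.912  F1: 0.844
```

The zero-denominator guards explain how an entity type can report 0.000 for every metric when it has no true positives.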

## 🔍 2. Top 5 Cases with Lowest Recall (Missed PII)

Identify transcripts where the most PII was missed to understand failure patterns.
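
Selecting the worst performers amounts to sorting per-transcript metrics and taking the first few entries. The sketch below shows the idea on plain dictionaries. It is not the implementation of `get_transcript_cases_by_performance`; the `performance_metrics` key mirrors how the notebook reads results, while `call_id` at the top level and the function name `worst_cases` are assumptions. The recall values are the ones reported in this section's output.

```python
from statistics import mean


def worst_cases(cases, metric="recall", n_cases=5, ascending=True):
    """Rank cases by a metric; ascending=True puts the lowest scores first."""
    ranked = sorted(cases,
                    key=lambda c: c["performance_metrics"][metric],
                    reverse=not ascending)
    return ranked[:n_cases]


cases = [
    {"call_id": 95, "performance_metrics": {"recall": 0.625}},
    {"call_id": 71, "performance_metrics": {"recall": 0.592}},
    {"call_id": 70, "performance_metrics": {"recall": 0.434}},
    {"call_id": 94, "performance_metrics": {"recall": 0.592}},
    {"call_id": 99, "performance_metrics": {"recall": 0.588}},
]

top3 = worst_cases(cases, n_cases=3)
print([c["call_id"] for c in top3])  # [70, 99, 71]

avg = mean(c["performance_metrics"]["recall"] for c in cases)
print(f"{avg:.1%}")  # 56.6%
```

Because Python's `sorted` is stable, ties such as calls 71 and 94 keep their input order.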

In [4]:
# Get worst recall cases
worst_recall_cases = get_transcript_cases_by_performance(
    results_df=results_df,
    ground_truth_df=ground_truth_df,
    metric='recall',
    n_cases=5,
    ascending=True,  # True = worst performers first
    matching_mode=EVALUATION_MODE
)

# Create diagnostic HTML table
worst_recall_html = create_diagnostic_html_table_configurable(
    transcript_data=worst_recall_cases,
    title="🔴 Top 5 Worst Recall Cases - Missed PII Analysis",
    description=f"""These transcripts had the lowest recall scores, meaning significant PII was missed.
    <strong>Red highlights</strong> show missed PII that should have been detected.
    Focus on patterns in missed PII to improve detection rules.""",
    matching_mode=EVALUATION_MODE
)

display(HTML(worst_recall_html))

# Summary insights for worst recall cases
print(f"\n💡 RECALL IMPROVEMENT INSIGHTS:")
recall_scores = [case['performance_metrics']['recall'] for case in worst_recall_cases]
avg_worst_recall = np.mean(recall_scores)
print(f"   📉 Average recall in worst cases: {avg_worst_recall:.1%}")
print(f"   🎯 These cases need the most attention for PII detection improvements")
print(f"   🔍 Look for patterns in the missed PII (red highlights) above")


🔍 ANALYZING TRANSCRIPT PERFORMANCE BY RECALL
🔧 PIIEvaluator initialized with 'business' matching mode
   ✅ Business Focus: Any PII detection over ground truth = SUCCESS

📊 WORST 5 PERFORMERS BY RECALL:
  1. Call 70: recall=43.4%, Recall=43.4%, Precision=70.5%, F1=53.8%
  2. Call 99: recall=58.8%, Recall=58.8%, Precision=70.2%, F1=64.0%
  3. Call 71: recall=59.2%, Recall=59.2%, Precision=70.3%, F1=64.3%
  4. Call 94: recall=59.2%, Recall=59.2%, Precision=70.3%, F1=64.3%
  5. Call 95: recall=62.5%, Recall=62.5%, Precision=71.4%, F1=66.7%

✅ Prepared 5 cases for analysis
🔧 PIIEvaluator initialized with 'business' matching mode
   ✅ Business Focus: Any PII detection over ground truth = SUCCESS


📊 Metrics & Performance,📋 Original Transcript,🛡️ Cleaned Transcript
📋 CALL ID: 70  🎯 Total PII Occurrences: 11  📈 PERFORMANCE (BUSINESS):  • Recall: 43.4%  • Precision: 70.5%  • 🛡️ PII Protection: 94.0%  🎯 STATUS:  🔴 Needs Improvement,"Agent: Hi, this is Ava from Bricks Health Insurance. Agent: May I have your mobile number? Customer: 042285 817 432. Agent: And your email address, please? Customer: ava.taylor@example.com. Agent: Could you confirm your full name, please? Customer: Ava Michael Taylor. Agent: Could I please have your Bricks membership number? Customer: 56014981. Agent: Finally, could you provide your residential address? Customer: 330 Victoria Road, Perth WA 6000. Agent: Thank you for verifying, Ava. How can I assist you today?","Agent: Hi, this is Ava from Bricks Health Insurance. Agent: May I have your mobile number? Customer: <PHONE_NUMBER>. Agent: And your email address, please? Customer: <EMAIL_ADDRESS>. Agent: Could you confirm your full name, please? Customer: Ava <PERSON>. Agent: Could I please have your <PERSON> membership number? Customer: <MEMBER_NUMBER>. Agent: Finally, could you provide your residential address? Customer: <AU_ADDRESS>. Agent: Thank you for verifying, Ava. How can I assist you <DATE_TIME>?"
📋 CALL ID: 99  🎯 Total PII Occurrences: 8  📈 PERFORMANCE (BUSINESS):  • Recall: 58.8%  • Precision: 70.2%  • 🛡️ PII Protection: 91.3%  🎯 STATUS:  🔴 Needs Improvement,"Agent: Hi, this is James from Bricks Health Insurance. Agent: May I have your mobile number? Customer: 042006 843 674. Agent: Could you confirm your full name, please? Customer: Ella Marie Taylor. Agent: Could I please have your Bricks membership number? Customer: 98345291. Agent: Finally, could you provide your residential address? Customer: 948 Harbour Road, Sydney NSW 2000. Agent: And your email address, please? Customer: ella.taylor@example.com. Agent: Thank you for verifying, Ella. How can I assist you today?","Agent: Hi, this is James from Bricks Health Insurance. Agent: May I have your mobile number? Customer: <PHONE_NUMBER>. Agent: Could you confirm your full name, please? Customer: Ella <PERSON>. Agent: Could I please have your <PERSON> membership number? Customer: <MEMBER_NUMBER>. Agent: Finally, could you provide your residential address? Customer: <AU_ADDRESS>. Agent: And your email address, please? Customer: <EMAIL_ADDRESS>. Agent: Thank you for verifying, Ella. How can I assist you <DATE_TIME>?"
📋 CALL ID: 71  🎯 Total PII Occurrences: 8  📈 PERFORMANCE (BUSINESS):  • Recall: 59.2%  • Precision: 70.3%  • 🛡️ PII Protection: 91.5%  🎯 STATUS:  🔴 Needs Improvement,"Agent: Hi, this is James from Bricks Health Insurance. Agent: Could you confirm your full name, please? Customer: Ella Patrick Wilson. Agent: And your email address, please? Customer: ella.wilson@example.com. Agent: Finally, could you provide your residential address? Customer: 327 Victoria Road, Darwin NT 0800. Agent: Could I please have your Bricks membership number? Customer: 96961359. Agent: May I have your mobile number? Customer: 044701 480 783. Agent: Thank you for verifying, Ella. How can I assist you today?","Agent: Hi, this is James from Bricks Health Insurance. Agent: Could you confirm your full name, please? Customer: Ella <PERSON>. Agent: And your email address, please? Customer: <EMAIL_ADDRESS>. Agent: Finally, could you provide your residential address? Customer: <AU_ADDRESS>. Agent: Could I please have your <PERSON> membership number? Customer: <MEMBER_NUMBER>. Agent: May I have your mobile number? Customer: <PHONE_NUMBER>. Agent: Thank you for verifying, Ella. How can I assist you <DATE_TIME>?"
📋 CALL ID: 94  🎯 Total PII Occurrences: 8  📈 PERFORMANCE (BUSINESS):  • Recall: 59.2%  • Precision: 70.3%  • 🛡️ PII Protection: 93.3%  🎯 STATUS:  🔴 Needs Improvement,"Agent: Hi, this is Ava from Bricks Health Insurance. Agent: May I have your mobile number? Customer: 041648 996 374. Agent: And your email address, please? Customer: ella.wilson@example.com. Agent: Could I please have your Bricks membership number? Customer: 95924617. Agent: Finally, could you provide your residential address? Customer: 34 Church Street, Adelaide SA 5000. Agent: Could you confirm your full name, please? Customer: Ella Michael Wilson. Agent: Thank you for verifying, Ella. How can I assist you today?","Agent: Hi, this is Ava from Bricks Health Insurance. Agent: May I have your mobile number? Customer: <PHONE_NUMBER>. Agent: And your email address, please? Customer: <EMAIL_ADDRESS>. Agent: Could I please have your <PERSON> membership number? Customer: <MEMBER_NUMBER>. Agent: Finally, could you provide your residential address? Customer: <AU_ADDRESS>. Agent: Could you confirm your full name, please? Customer: Ella <PERSON>. Agent: Thank you for verifying, Ella. How can I assist you <DATE_TIME>?"
📋 CALL ID: 95  🎯 Total PII Occurrences: 8  📈 PERFORMANCE (BUSINESS):  • Recall: 62.5%  • Precision: 71.4%  • 🛡️ PII Protection: 77.8%  🎯 STATUS:  🔴 Needs Improvement,"Agent: Hi, this is James from Bricks Health Insurance. Agent: Could you confirm your full name, please? Customer: Olivia Louise Smith. Agent: Could I please have your Bricks membership number? Customer: 49977546. Agent: May I have your mobile number? Customer: 046703 170 674. Agent: Finally, could you provide your residential address? Customer: 498 Victoria Road, Perth WA 6000. Agent: And your email address, please? Customer: olivia.smith@example.com. Agent: Thank you for verifying, Olivia. How can I assist you today?","Agent: Hi, this is James from Bricks Health Insurance. Agent: Could you confirm your full name, please? Customer: Olivia Louise Smith. Agent: Could I please have your <PERSON> membership number? Customer: <MEMBER_NUMBER>. Agent: May I have your mobile number? Customer: <PHONE_NUMBER>. Agent: Finally, could you provide your residential address? Customer: <AU_ADDRESS>. Agent: And your email address, please? Customer: <EMAIL_ADDRESS>. Agent: Thank you for verifying, <LOCATION>. How can I assist you <DATE_TIME>?"



💡 RECALL IMPROVEMENT INSIGHTS:
   📉 Average recall in worst cases: 56.6%
   🎯 These cases need the most attention for PII detection improvements
   🔍 Look for patterns in the missed PII (red highlights) above
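
One pattern visible in the cleaned transcripts above is that first names survive in fixed phrases such as "this is Ava from" and "Thank you for verifying, Ava." A context-based rule for those phrases can be sketched with plain regular expressions. This is an illustration only: the `<FIRST_NAME>` placeholder and both function names are hypothetical, and in Presidio such logic would normally live in a custom recognizer rather than in standalone code like this.

```python
import re

# Phrases taken from the sample transcripts; the name is captured by context.
CONTEXT_PATTERNS = [
    re.compile(r"\bthis is (?P<name>[A-Z][a-z]+) from\b"),
    re.compile(r"\bThank you for verifying, (?P<name>[A-Z][a-z]+)\b"),
]


def find_first_names(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, name) for every context-anchored first name."""
    hits = []
    for pattern in CONTEXT_PATTERNS:
        for m in pattern.finditer(text):
            hits.append((m.start("name"), m.end("name"), m.group("name")))
    return sorted(hits)


def redact(text: str, placeholder: str = "<FIRST_NAME>") -> str:
    """Replace matches right-to-left so earlier offsets stay valid."""
    for start, end, _ in sorted(find_first_names(text), reverse=True):
        text = text[:start] + placeholder + text[end:]
    return text


sample = ("Agent: Hi, this is Ava from Bricks Health Insurance. "
          "Agent: Thank you for verifying, Ava. How can I assist you today?")
print(redact(sample))
```

Anchoring on the surrounding phrase rather than on a name list keeps the rule from firing on ordinary capitalized words such as "Bricks".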


## 📊 3. Category Analysis - Missed PII by Type

Detailed breakdown of missed PII by category to identify specific improvement areas.
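
Each category's miss rate is the number of missed occurrences divided by the total, and the priority label is a bucket over that rate. The sketch below reproduces the miss rates reported in this section's summary. The thresholds are placeholders chosen only to be consistent with the labels in this run; the real cut-offs inside `analyze_missed_pii_categories` are not shown in the notebook.

```python
def miss_rate(missed: int, total: int) -> float:
    return missed / total if total else 0.0


def priority(rate: float, high: float = 0.30, medium: float = 0.10) -> str:
    """Bucket a miss rate; both thresholds are assumed, not the project's."""
    if rate >= high:
        return "HIGH"
    if rate >= medium:
        return "MEDIUM"
    return "LOW"


# (missed, total) pairs from the "MISSED PII SUMMARY" output in this section.
counts = {
    "consultant_first_name": (31, 108),
    "member_first_name": (33, 205),
    "member_mobile": (6, 100),
    "member_full_name": (1, 100),
}

for category, (missed, total) in counts.items():
    rate = miss_rate(missed, total)
    print(f"{category:22} | Miss Rate: {rate:.1%} | Priority: {priority(rate)}")
```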

In [6]:
# Analyze missed PII by categories
category_analysis = analyze_missed_pii_categories(
    results_df=results_df,
    ground_truth_df=ground_truth_df,
    matching_mode=EVALUATION_MODE
)

# Display detailed category insights
print(f"\n🔍 DETAILED CATEGORY ANALYSIS:")
print(f"=" * 60)

improvement_insights = category_analysis['improvement_insights']
missed_by_category = category_analysis['missed_by_category']
transcripts_with_misses = category_analysis['transcripts_with_misses']
transcripts_with_detections = category_analysis['transcripts_with_detections']

# Priority-based improvement recommendations
high_priority = [(cat, data) for cat, data in improvement_insights.items() if data['priority'] == 'HIGH']
medium_priority = [(cat, data) for cat, data in improvement_insights.items() if data['priority'] == 'MEDIUM']
low_priority = [(cat, data) for cat, data in improvement_insights.items() if data['priority'] == 'LOW']

if high_priority:
    print(f"\n🔴 HIGH PRIORITY IMPROVEMENTS:")
    for category, data in high_priority:
        print(f"   {category:20} | Miss Rate: {data['miss_rate']:.1%} | Total: {data['total_occurrences']}")
        
        # Show examples of missed vs detected for this category
        missed_examples = transcripts_with_misses.get(category, [])[:2]  # Top 2 examples
        detected_examples = transcripts_with_detections.get(category, [])[:2]  # Top 2 examples
        
        if missed_examples:
            print(f"     🔍 MISSED Examples:")
            for example in missed_examples:
                print(f"       Call {example['call_id']}: '{example['missed_value']}' in context: ...{example['context']}...")
        
        if detected_examples:
            print(f"     ✅ DETECTED Examples:")
            for example in detected_examples:
                print(f"       Call {example['call_id']}: '{example['detected_value']}' (overlap: {example['overlap_ratio']:.2f})")
        print()

if medium_priority:
    print(f"\n🟡 MEDIUM PRIORITY IMPROVEMENTS:")
    for category, data in medium_priority:
        print(f"   {category:20} | Miss Rate: {data['miss_rate']:.1%} | Total: {data['total_occurrences']}")

if low_priority:
    print(f"\n🟢 LOW PRIORITY (Performing Well):")
    for category, data in low_priority:
        print(f"   {category:20} | Miss Rate: {data['miss_rate']:.1%} | Total: {data['total_occurrences']}")

# Strategic recommendations (numbered dynamically so the list stays contiguous
# even when there are no high-priority categories)
print(f"\n🎯 STRATEGIC RECOMMENDATIONS:")
recommendations = []
if high_priority:
    recommendations += [
        "Focus development efforts on HIGH priority categories above",
        "Analyze the missed vs detected examples for pattern differences",
        "Consider custom recognizers for problematic categories",
    ]
else:
    print(f"   🎉 No high-priority issues found - framework performing well across categories!")

recommendations += [
    "Monitor medium priority categories for regression",
    "Use context patterns from examples to improve detection rules",
]
for number, recommendation in enumerate(recommendations, start=1):
    print(f"   {number}. {recommendation}")


🔍 ANALYZING MISSED PII BY CATEGORIES
🔧 PIIEvaluator initialized with 'business' matching mode
   ✅ Business Focus: Any PII detection over ground truth = SUCCESS

📊 MISSED PII SUMMARY:
  consultant_first_name | Recall: 71.3% | Missed:  31/108 | Priority: MEDIUM
  member_first_name    | Recall: 83.9% | Missed:  33/205 | Priority: MEDIUM
  member_mobile        | Recall: 94.0% | Missed:   6/100 | Priority: LOW
  member_full_name     | Recall: 99.0% | Missed:   1/100 | Priority: LOW

🔍 DETAILED CATEGORY ANALYSIS:

🟡 MEDIUM PRIORITY IMPROVEMENTS:
   consultant_first_name | Miss Rate: 28.7% | Total: 108
   member_first_name    | Miss Rate: 16.1% | Total: 205

🟢 LOW PRIORITY (Performing Well):
   member_mobile        | Miss Rate: 6.0% | Total: 100
   member_full_name     | Miss Rate: 1.0% | Total: 100
   member_email         | Miss Rate: 0.0% | Total: 100
   member_address       | Miss Rate: 0.0% | Total: 100
   member_number        | Miss Rate: 0.0% | Total: 100

🎯 STRATEGIC RECOMMENDATIONS:
