# 🛡️ Presidio Baseline Framework - Technical Diagnostics

**Comprehensive diagnostic analysis for PII deidentification performance**

This notebook examines the performance of the Microsoft Presidio baseline framework in detail, showing where PII detection falls short and which patterns the errors follow.

## 📋 Analysis Sections
1. **Performance Overview** - Overall metrics and framework status
2. **Worst Recall Cases** - The five transcripts with the lowest recall (missed PII)
3. **Category Analysis** - Missed PII by type with improvement insights

## 🔧 Setup and Configuration

In [1]:
import pandas as pd
import numpy as np
import sys
import os
from pathlib import Path
from IPython.display import HTML, display
import warnings
warnings.filterwarnings('ignore')

# Add src to path for imports
project_root = Path().absolute().parent
sys.path.append(str(project_root / 'src'))

# Import evaluation functions (following the requirement to use only /evaluation functions)
from evaluation.metrics import PIIEvaluator
from evaluation.diagnostics import (
    get_transcript_cases_by_performance,
    create_diagnostic_html_table_configurable,
    analyze_missed_pii_categories
)

# Import baseline framework for flexible integration
from baseline.presidio_framework import PurePresidioFramework


## 📊 Data Loading and Framework Execution

**Flexible Integration**: Load existing results if available, otherwise run the framework.
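
A minimal sketch of the load-or-run pattern described above, assuming results are cached as a file on disk. The helper name `load_or_run` and its `loader` and `runner` parameters are hypothetical and not part of the project's `evaluation` or `baseline` modules. Note that the cell in this section calls `process_dataset` unconditionally.

```python
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_run(results_path: Path,
                loader: Callable[[Path], T],
                runner: Callable[[], T]) -> T:
    """Return cached results if the file exists, otherwise compute them.

    `loader` and `runner` are hypothetical callables supplied by the caller.
    """
    if results_path.exists():
        # Reuse results written by an earlier run
        return loader(results_path)
    # No cached results found: run the (potentially slow) computation
    return runner()
```

In this notebook the call might look like `load_or_run(RESULTS_PATH, pd.read_csv, lambda: framework.process_dataset(csv_path=str(DATA_PATH), output_path=str(RESULTS_PATH)))`. Cached results can go stale when the data or the recognizers change, so a guard like this is best paired with a way to force a re-run.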

In [2]:
# Configuration
DATA_PATH = project_root / '.data' / 'synthetic_call_transcripts.csv'
RESULTS_PATH = project_root / 'demo' / 'presidio_baseline_results.csv'
EVALUATION_MODE = 'business'  # 'business' or 'research' - affects matching criteria

# print(f"🔍 Looking for data at: {DATA_PATH}")
# print(f"🔍 Looking for results at: {RESULTS_PATH}")

# Load ground truth data
if DATA_PATH.exists():
    ground_truth_df = pd.read_csv(DATA_PATH)
    print(f"✅ Loaded ground truth data: {len(ground_truth_df)} transcripts")
    # print(f"📋 Columns: {list(ground_truth_df.columns)}")
else:
    print(f"❌ Ground truth data not found at {DATA_PATH}")
    raise FileNotFoundError(f"Please ensure {DATA_PATH} exists")

 
# Initialize and run Presidio framework
framework = PurePresidioFramework(enable_mlflow=True)

# Process dataset
results_df = framework.process_dataset(
    csv_path=str(DATA_PATH),
    output_path=str(RESULTS_PATH)
)

print(f"✅ Framework processing complete: {len(results_df)} transcripts processed")
# print(f"💾 Results saved to {RESULTS_PATH}")

print("\n📊 DATASET OVERVIEW:")
print(f"   Ground Truth Transcripts: {len(ground_truth_df)}")
print(f"   Processed Results:        {len(results_df)}")
print(f"   Evaluation Mode:          {EVALUATION_MODE.upper()}")


✅ Loaded ground truth data: 100 transcripts
✅ MLflow experiment tracking enabled
🚀 Starting Pure Presidio Framework processing...
📊 Loaded 100 call transcripts
Processing transcript 100/100...
✅ Processing complete! Final metrics:
  • total_transcripts: 100
  • total_pii_detected: 1417
  • avg_pii_per_transcript: 14.17
  • total_processing_time_seconds: 1.8071
  • avg_processing_time_per_transcript_seconds: 0.0181
  • estimated_time_for_1m_transcripts: 5.02 hours
✅ MLflow metrics logged successfully
✅ Framework processing complete: 100 transcripts processed

📊 DATASET OVERVIEW:
   Ground Truth Transcripts: 100
   Processed Results:        100
   Evaluation Mode:          BUSINESS
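
The one-million-transcript estimate in the log above is consistent with a linear extrapolation of the measured per-transcript time. A minimal sketch, assuming throughput stays constant at scale (no batching, parallelism, or warm-up effects); the function name is illustrative and not part of the framework:

```python
def estimate_hours(total_seconds: float, n_processed: int, n_target: int) -> float:
    """Linearly extrapolate processing time to `n_target` items, in hours."""
    seconds_per_item = total_seconds / n_processed
    return seconds_per_item * n_target / 3600


# Values taken from the run log above: 100 transcripts in 1.8071 s
hours = estimate_hours(1.8071, 100, 1_000_000)
print(f"{hours:.2f} hours")  # 5.02 hours
```

A sample of 100 transcripts is small, so the figure is best read as an order-of-magnitude guide rather than a capacity plan.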


## 📈 1. Performance Overview

High-level performance metrics for the baseline Presidio framework.
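
For reference, precision, recall, and F1 follow the standard definitions over true positives (TP), false positives (FP), and false negatives (FN). The sketch below recomputes them from the counts reported in this section's evaluation output; it is not the `PIIEvaluator` implementation. The fractional true-positive count suggests that partial matches receive partial credit, though that is an assumption about the evaluator.

```python
from typing import Tuple


def precision_recall_f1(tp: float, fp: float, fn: float) -> Tuple[float, float, float]:
    """Standard precision, recall, and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts from the "DETAILED COUNTS" block of this section's output
p, r, f1 = precision_recall_f1(tp=749.5, fp=204, fn=60)
print(f"Precision: {p:.3f}  Recall: {r:.3f}  F1: {f1:.3f}")
# Precision: 0.786  Recall: 0.926  F1: 0.850
```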

In [3]:
# Initialize evaluator
evaluator = PIIEvaluator(matching_mode=EVALUATION_MODE)

# Calculate overall framework performance
print("🔄 Calculating comprehensive framework evaluation...")
evaluation_results = evaluator.evaluate_framework_results(results_df, ground_truth_df)

# Print detailed evaluation summary
evaluator.print_evaluation_summary(evaluation_results)

🔧 PIIEvaluator initialized with 'business' matching mode
   ✅ Business Focus: Any PII detection over ground truth = SUCCESS
🔄 Calculating comprehensive framework evaluation...

🎯 PII DEIDENTIFICATION EVALUATION RESULTS

📊 OVERALL PERFORMANCE:
   Precision:           0.786
   Recall:              0.926 ❌
   F1-Score:            0.850
   PII Protection Rate: 0.978 🛡️

📈 DETAILED COUNTS:
   True Positives:  749.5
   False Positives: 204
   False Negatives: 60

🔍 ENTITY TYPE BREAKDOWN:
   member_full_name       | P: 1.000 | R: 1.000 | F1: 1.000
   member_email           | P: 1.000 | R: 1.000 | F1: 1.000
   member_mobile          | P: 1.000 | R: 1.000 | F1: 1.000
   member_address         | P: 1.000 | R: 1.000 | F1: 1.000
   member_number          | P: 1.000 | R: 1.000 | F1: 1.000
   member_first_name      | P: 1.000 | R: 0.859 | F1: 0.924
   consultant_first_name  | P: 1.000 | R: 0.713 | F1: 0.832
   GENERIC_NUMBER         | P: 0.000 | R: 0.000 | F1: 0.000
   PERSON                 | P: 0.000 | R: 0.000 | F1: 0.000

⚠️  ISSUES IDENTIFIED:
   Missed PII:       60
   Over-detections:  204
   Partial matches:  13

🎯 RECALL TARGET: ❌ NOT

## 🔍 2. Top 5 Cases with Lowest Recall (Missed PII)

Identify transcripts where the most PII was missed to understand failure patterns.
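
Conceptually, selecting the worst cases amounts to sorting per-transcript metrics by recall in ascending order and keeping the first `n`. The sketch below illustrates this in plain Python. The record layout, the toy values, and the `worst_by_recall` helper are assumptions for illustration and do not reflect the internals of `get_transcript_cases_by_performance`.

```python
from statistics import mean
from typing import Dict, List


def worst_by_recall(cases: List[Dict], n: int = 5) -> List[Dict]:
    """Return the n cases with the lowest recall, worst first."""
    return sorted(cases, key=lambda c: c["recall"])[:n]


# Hypothetical per-transcript records, not taken from the dataset
cases = [
    {"call_id": 1, "recall": 1.00},
    {"call_id": 2, "recall": 0.60},
    {"call_id": 3, "recall": 0.85},
    {"call_id": 4, "recall": 0.40},
]

worst = worst_by_recall(cases, n=2)
print([c["call_id"] for c in worst])               # [4, 2]
print(f"{mean(c['recall'] for c in worst):.1%}")   # 50.0%
```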

In [4]:
# Get worst recall cases
worst_recall_cases = get_transcript_cases_by_performance(
    results_df=results_df,
    ground_truth_df=ground_truth_df,
    metric='recall',
    n_cases=5,
    ascending=True,  # True = worst performers first
    matching_mode=EVALUATION_MODE
)

# Create diagnostic HTML table
worst_recall_html = create_diagnostic_html_table_configurable(
    transcript_data=worst_recall_cases,
    title="🔴 Top 5 Worst Recall Cases - Missed PII Analysis",
    description="""These transcripts had the lowest recall scores, meaning significant PII was missed.
    <strong>Red highlights</strong> show missed PII that should have been detected.
    Focus on patterns in missed PII to improve detection rules.""",
    matching_mode=EVALUATION_MODE
)

display(HTML(worst_recall_html))

# Summary insights for worst recall cases
print("\n💡 RECALL IMPROVEMENT INSIGHTS:")
recall_scores = [case['performance_metrics']['recall'] for case in worst_recall_cases]
avg_worst_recall = np.mean(recall_scores)
print(f"   📉 Average recall in worst cases: {avg_worst_recall:.1%}")
print("   🎯 These cases need the most attention for PII detection improvements")
print("   🔍 Look for patterns in the missed PII (red highlights) above")


🔍 ANALYZING TRANSCRIPT PERFORMANCE BY RECALL
🔧 PIIEvaluator initialized with 'business' matching mode
   ✅ Business Focus: Any PII detection over ground truth = SUCCESS

📊 WORST 5 PERFORMERS BY RECALL:
  1. Call 70: recall=43.4%, Recall=43.4%, Precision=70.5%, F1=53.8%
  2. Call 99: recall=58.8%, Recall=58.8%, Precision=70.2%, F1=64.0%
  3. Call 71: recall=59.2%, Recall=59.2%, Precision=70.3%, F1=64.3%
  4. Call 94: recall=59.2%, Recall=59.2%, Precision=70.3%, F1=64.3%
  5. Call 4: recall=71.3%, Recall=71.3%, Precision=74.0%, F1=72.7%

✅ Prepared 5 cases for analysis
🔧 PIIEvaluator initialized with 'business' matching mode
   ✅ Business Focus: Any PII detection over ground truth = SUCCESS

| 📊 Metrics & Performance | 📋 Original Transcript | 🛡️ Cleaned Transcript |
| --- | --- | --- |
| 📋 CALL ID: 70; 🎯 Total PII Occurrences: 11; 📈 PERFORMANCE (BUSINESS): • Recall: 43.4% • Precision: 70.5% • 🛡️ PII Protection: 90.0%; 🎯 STATUS: 🔴 Needs Improvement | Agent: Hi, this is Ava from Bricks Health Insurance. Agent: May I have your mobile number? Customer: 042285 817 432. Agent: And your email address, please? Customer: ava.taylor@example.com. Agent: Could you confirm your full name, please? Customer: Ava Michael Taylor. Agent: Could I please have your Bricks membership number? Customer: 56014981. Agent: Finally, could you provide your residential address? Customer: 330 Victoria Road, Perth WA 6000. Agent: Thank you for verifying, Ava. How can I assist you today? | Agent: Hi, this is Ava from Bricks Health Insurance. Agent: May I have your mobile number? Customer: `<AU_PHONE_NUMBER>`. Agent: And your email address, please? Customer: `<EMAIL_ADDRESS>`. Agent: Could you confirm your full name, please? Customer: Ava `<PERSON>`. Agent: Could I please have your `<PERSON>` membership number? Customer: `<GENERIC_NUMBER>`. Agent: Finally, could you provide your residential address? Customer: `<AU_ADDRESS>`. Agent: Thank you for verifying, Ava. How can I assist you today? |
| 📋 CALL ID: 99; 🎯 Total PII Occurrences: 8; 📈 PERFORMANCE (BUSINESS): • Recall: 58.8% • Precision: 70.2% • 🛡️ PII Protection: 86.5%; 🎯 STATUS: 🔴 Needs Improvement | Agent: Hi, this is James from Bricks Health Insurance. Agent: May I have your mobile number? Customer: 042006 843 674. Agent: Could you confirm your full name, please? Customer: Ella Marie Taylor. Agent: Could I please have your Bricks membership number? Customer: 98345291. Agent: Finally, could you provide your residential address? Customer: 948 Harbour Road, Sydney NSW 2000. Agent: And your email address, please? Customer: ella.taylor@example.com. Agent: Thank you for verifying, Ella. How can I assist you today? | Agent: Hi, this is James from Bricks Health Insurance. Agent: May I have your mobile number? Customer: `<AU_PHONE_NUMBER>`. Agent: Could you confirm your full name, please? Customer: Ella `<PERSON>`. Agent: Could I please have your `<PERSON>` membership number? Customer: `<GENERIC_NUMBER>`. Agent: Finally, could you provide your residential address? Customer: `<AU_ADDRESS>`. Agent: And your email address, please? Customer: `<EMAIL_ADDRESS>`. Agent: Thank you for verifying, Ella. How can I assist you today? |
| 📋 CALL ID: 71; 🎯 Total PII Occurrences: 8; 📈 PERFORMANCE (BUSINESS): • Recall: 59.2% • Precision: 70.3% • 🛡️ PII Protection: 86.8%; 🎯 STATUS: 🔴 Needs Improvement | Agent: Hi, this is James from Bricks Health Insurance. Agent: Could you confirm your full name, please? Customer: Ella Patrick Wilson. Agent: And your email address, please? Customer: ella.wilson@example.com. Agent: Finally, could you provide your residential address? Customer: 327 Victoria Road, Darwin NT 0800. Agent: Could I please have your Bricks membership number? Customer: 96961359. Agent: May I have your mobile number? Customer: 044701 480 783. Agent: Thank you for verifying, Ella. How can I assist you today? | Agent: Hi, this is James from Bricks Health Insurance. Agent: Could you confirm your full name, please? Customer: Ella `<PERSON>`. Agent: And your email address, please? Customer: `<EMAIL_ADDRESS>`. Agent: Finally, could you provide your residential address? Customer: `<AU_ADDRESS>`. Agent: Could I please have your `<PERSON>` membership number? Customer: `<GENERIC_NUMBER>`. Agent: May I have your mobile number? Customer: `<AU_PHONE_NUMBER>`. Agent: Thank you for verifying, Ella. How can I assist you today? |
| 📋 CALL ID: 94; 🎯 Total PII Occurrences: 8; 📈 PERFORMANCE (BUSINESS): • Recall: 59.2% • Precision: 70.3% • 🛡️ PII Protection: 88.6%; 🎯 STATUS: 🔴 Needs Improvement | Agent: Hi, this is Ava from Bricks Health Insurance. Agent: May I have your mobile number? Customer: 041648 996 374. Agent: And your email address, please? Customer: ella.wilson@example.com. Agent: Could I please have your Bricks membership number? Customer: 95924617. Agent: Finally, could you provide your residential address? Customer: 34 Church Street, Adelaide SA 5000. Agent: Could you confirm your full name, please? Customer: Ella Michael Wilson. Agent: Thank you for verifying, Ella. How can I assist you today? | Agent: Hi, this is Ava from Bricks Health Insurance. Agent: May I have your mobile number? Customer: `<AU_PHONE_NUMBER>`. Agent: And your email address, please? Customer: `<EMAIL_ADDRESS>`. Agent: Could I please have your `<PERSON>` membership number? Customer: `<GENERIC_NUMBER>`. Agent: Finally, could you provide your residential address? Customer: `<AU_ADDRESS>`. Agent: Could you confirm your full name, please? Customer: Ella `<PERSON>`. Agent: Thank you for verifying, Ella. How can I assist you today? |
| 📋 CALL ID: 4; 🎯 Total PII Occurrences: 8; 📈 PERFORMANCE (BUSINESS): • Recall: 71.3% • Precision: 74.0% • 🛡️ PII Protection: 91.1%; 🎯 STATUS: 🔴 Needs Improvement | Agent: Hi, this is Noah from Bricks Health Insurance. Agent: And your email address, please? Customer: ella.white@example.com. Agent: Could you confirm your full name, please? Customer: Ella Andrew White. Agent: May I have your mobile number? Customer: 044928 834 779. Agent: Finally, could you provide your residential address? Customer: 592 Pine Street, Sydney NSW 2000. Agent: Could I please have your Bricks membership number? Customer: 53376329. Agent: Thank you for verifying, Ella. How can I assist you today? | Agent: Hi, this is `<PERSON>` from Bricks Health Insurance. Agent: And your email address, please? Customer: `<EMAIL_ADDRESS>`. Agent: Could you confirm your full name, please? Customer: Ella `<PERSON>`. Agent: May I have your mobile number? Customer: `<AU_PHONE_NUMBER>`. Agent: Finally, could you provide your residential address? Customer: `<AU_ADDRESS>`. Agent: Could I please have your `<PERSON>` membership number? Customer: `<GENERIC_NUMBER>`. Agent: Thank you for verifying, Ella. How can I assist you today? |

💡 RECALL IMPROVEMENT INSIGHTS:
   📉 Average recall in worst cases: 58.4%
   🎯 These cases need the most attention for PII detection improvements
   🔍 Look for patterns in the missed PII (red highlights) above
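
In the five transcripts above, several first names that survive in the cleaned text sit in fixed phrases such as "this is Ava from" and "Thank you for verifying, Ava". A context-anchored regular expression is one way to catch them. The sketch below uses only Python's `re` module. The patterns, the `mask_first_names` helper, and the `<FIRST_NAME>` placeholder are illustrative assumptions derived from these synthetic transcripts, not an existing Presidio recognizer, and would need validation against the full dataset.

```python
import re

# Context patterns observed in the synthetic transcripts (assumed, not exhaustive).
# Fixed-width lookbehinds keep the surrounding phrase out of the match.
CONTEXT_PATTERNS = [
    re.compile(r"(?<=this is )[A-Z][a-z]+(?= from)"),
    re.compile(r"(?<=Thank you for verifying, )[A-Z][a-z]+"),
]


def mask_first_names(text: str, placeholder: str = "<FIRST_NAME>") -> str:
    """Replace first names found in the known context phrases."""
    for pattern in CONTEXT_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


sample = ("Agent: Hi, this is Ava from Bricks Health Insurance. "
          "Agent: Thank you for verifying, Ava. How can I assist you today?")
print(mask_first_names(sample))
```

These two patterns do not cover the partially masked full names ("Ava `<PERSON>`") or the spurious `<PERSON>` tag on "Bricks"; those would need separate handling. In Presidio, regex rules of this kind are typically packaged as a `PatternRecognizer`.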


## 📊 3. Category Analysis - Missed PII by Type

Detailed breakdown of missed PII by category to identify specific improvement areas.
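
The miss rate reported in this section is the share of ground-truth occurrences of a category that were not detected, which equals one minus recall. A minimal sketch using the missed and total counts from the summary printed by this section's cell. The priority thresholds applied by `analyze_missed_pii_categories` are not shown in this notebook, so none are assumed here.

```python
def miss_rate(missed: int, total: int) -> float:
    """Fraction of ground-truth occurrences that were not detected."""
    return missed / total if total else 0.0


# (missed, total) pairs from the "MISSED PII SUMMARY" output of this section
counts = {
    "consultant_first_name": (31, 108),
    "member_first_name": (29, 205),
}

for category, (missed, total) in counts.items():
    rate = miss_rate(missed, total)
    print(f"{category:22} | Miss Rate: {rate:.1%} | Recall: {1 - rate:.1%}")
# consultant_first_name  | Miss Rate: 28.7% | Recall: 71.3%
# member_first_name      | Miss Rate: 14.1% | Recall: 85.9%
```

The two categories together account for 31 + 29 = 60 misses, matching the overall false-negative count.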

In [5]:
# Analyze missed PII by categories
category_analysis = analyze_missed_pii_categories(
    results_df=results_df,
    ground_truth_df=ground_truth_df,
    matching_mode=EVALUATION_MODE
)

# Display detailed category insights
print("\n🔍 DETAILED CATEGORY ANALYSIS:")
print("=" * 60)

improvement_insights = category_analysis['improvement_insights']
missed_by_category = category_analysis['missed_by_category']
transcripts_with_misses = category_analysis['transcripts_with_misses']
transcripts_with_detections = category_analysis['transcripts_with_detections']

# Priority-based improvement recommendations
high_priority = [(cat, data) for cat, data in improvement_insights.items() if data['priority'] == 'HIGH']
medium_priority = [(cat, data) for cat, data in improvement_insights.items() if data['priority'] == 'MEDIUM']
low_priority = [(cat, data) for cat, data in improvement_insights.items() if data['priority'] == 'LOW']

if high_priority:
    print("\n🔴 HIGH PRIORITY IMPROVEMENTS:")
    for category, data in high_priority:
        print(f"   {category:20} | Miss Rate: {data['miss_rate']:.1%} | Total: {data['total_occurrences']}")

        # Show examples of missed vs detected for this category
        missed_examples = transcripts_with_misses.get(category, [])[:2]  # First 2 examples
        detected_examples = transcripts_with_detections.get(category, [])[:2]  # First 2 examples

        if missed_examples:
            print("     🔍 MISSED Examples:")
            for example in missed_examples:
                print(f"       Call {example['call_id']}: '{example['missed_value']}' in context: ...{example['context']}...")

        if detected_examples:
            print("     ✅ DETECTED Examples:")
            for example in detected_examples:
                # The value printed is the span overlap ratio, so label it as such
                print(f"       Call {example['call_id']}: '{example['detected_value']}' (overlap: {example['overlap_ratio']:.2f})")
        print()

if medium_priority:
    print("\n🟡 MEDIUM PRIORITY IMPROVEMENTS:")
    for category, data in medium_priority:
        print(f"   {category:20} | Miss Rate: {data['miss_rate']:.1%} | Total: {data['total_occurrences']}")

if low_priority:
    print("\n🟢 LOW PRIORITY (Performing Well):")
    for category, data in low_priority:
        print(f"   {category:20} | Miss Rate: {data['miss_rate']:.1%} | Total: {data['total_occurrences']}")

# Strategic recommendations
print("\n🎯 STRATEGIC RECOMMENDATIONS:")
if high_priority:
    print("   1. Focus development efforts on HIGH priority categories above")
    print("   2. Analyze the missed vs detected examples for pattern differences")
    print("   3. Consider custom recognizers for problematic categories")
else:
    print("   🎉 No high-priority issues found - framework performing well across categories!")

print("   4. Monitor medium priority categories for regression")
print("   5. Use context patterns from examples to improve detection rules")


🔍 ANALYZING MISSED PII BY CATEGORIES
🔧 PIIEvaluator initialized with 'business' matching mode
   ✅ Business Focus: Any PII detection over ground truth = SUCCESS

📊 MISSED PII SUMMARY:
  consultant_first_name | Recall: 71.3% | Missed:  31/108 | Priority: MEDIUM
  member_first_name    | Recall: 85.9% | Missed:  29/205 | Priority: MEDIUM

🔍 DETAILED CATEGORY ANALYSIS:

🟡 MEDIUM PRIORITY IMPROVEMENTS:
   consultant_first_name | Miss Rate: 28.7% | Total: 108
   member_first_name    | Miss Rate: 14.1% | Total: 205

🟢 LOW PRIORITY (Performing Well):
   member_full_name     | Miss Rate: 0.0% | Total: 100
   member_number        | Miss Rate: 0.0% | Total: 100
   member_address       | Miss Rate: 0.0% | Total: 100
   member_mobile        | Miss Rate: 0.0% | Total: 100
   member_email         | Miss Rate: 0.0% | Total: 100

🎯 STRATEGIC RECOMMENDATIONS:
   🎉 No high-priority issues found - framework performing well across categories!
   4. Monitor medium priority categori