# 🛡️ Multilayer Enhanced Presidio Framework - Technical Diagnostics

**Comprehensive diagnostic analysis for PII deidentification performance**

This notebook examines the performance of the integrated multi-layer **Normalization + Presidio + Sweeping** framework, identifying specific improvement opportunities and model behavior patterns.

## 📋 Analysis Sections
1. **Performance Overview** - Overall metrics and framework status
2. **Worst Recall Cases** - Top 5 cases with least recall (missed PII)
3. **Category Analysis** - Missed PII by type with improvement insights

## 🔧 Setup and Configuration

In [1]:
import pandas as pd
import numpy as np
import sys
import os
from pathlib import Path
from IPython.display import HTML, display
import warnings
warnings.filterwarnings('ignore')

# Add src to path for imports
project_root = Path().absolute().parent
sys.path.append(str(project_root / 'src'))

# Import evaluation functions (following the requirement to use only /evaluation functions)
from evaluation.metrics import PIIEvaluator
from evaluation.diagnostics import (
    get_transcript_cases_by_performance,
    create_diagnostic_html_table_configurable,
    analyze_missed_pii_categories,
    create_simple_transcript_table
)

# Import baseline framework and normaliser
from workflows.normalization_presidio_workflow import process_transcript_with_normalization, process_dataset_with_normalization
from utils.text_normaliser import TextNormaliser
from utils.text_sweeper import TextSweeper
from baseline.presidio_framework import PurePresidioFramework

## 📊 Data Loading 

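Load the voice-to-text transcripts, which are harder to de-identify than clean text because names, numbers, and email addresses may be spelled out or spoken word by word.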

In [2]:
# Configuration
DATA_PATH = project_root / '.data' / 'synthetic_call_transcripts_voice_to_texts.csv'
EVALUATION_MODE = 'business'  # 'business' or 'research' - affects matching criteria

# Load ground truth data
if DATA_PATH.exists():
    raw_df = pd.read_csv(DATA_PATH)
    print(f"✅ Loaded ground truth data: {len(raw_df)} transcripts")
    # print(f"📋 Columns: {list(raw_df.columns)}")
else:
    print(f"❌ Ground truth data not found at {DATA_PATH}")
    raise FileNotFoundError(f"Please ensure {DATA_PATH} exists")

✅ Loaded ground truth data: 2 transcripts
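
Voice-to-text output is harder to de-identify because speakers dictate values digit by digit or letter by letter, so a phone number arrives as a run of number words and not as digits. The sketch below illustrates the kind of rewriting a normalization layer can perform before detection. It is a simplified, hypothetical stand-in and not the project's `TextNormaliser`: the helper names `collapse_spoken_digits` and `collapse_spelled_letters` are invented for this example, the digit vocabulary is an assumption, and email reconstruction is deliberately left out.

```python
import re

# Spoken digit words mapped to numerals (assumed vocabulary for this sketch).
DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
_DIGIT_ALTERNATION = "|".join(DIGIT_WORDS)

# Two or more consecutive digit words, e.g. "zero four eight".
_DIGIT_RUN = re.compile(
    rf"\b(?:{_DIGIT_ALTERNATION})(?:\s+(?:{_DIGIT_ALTERNATION}))+\b",
    re.IGNORECASE,
)

# Three or more single letters separated by whitespace, e.g. "C h l o e".
_LETTER_RUN = re.compile(r"\b(?:[A-Za-z]\s+){2,}[A-Za-z]\b")


def collapse_spoken_digits(text: str) -> str:
    """Turn each run of spoken digit words into one digit string."""
    def to_digits(match: re.Match) -> str:
        return "".join(DIGIT_WORDS[word.lower()] for word in match.group(0).split())
    return _DIGIT_RUN.sub(to_digits, text)


def collapse_spelled_letters(text: str) -> str:
    """Join each run of spelled-out letters into a single word."""
    return _LETTER_RUN.sub(lambda m: re.sub(r"\s+", "", m.group(0)), text)


spoken = "my number is zero four eight five six one four one five one one three."
print(collapse_spoken_digits(spoken))
# -> my number is 048561415113.
print(collapse_spelled_letters("Customer: Chloe C h l o e Chloe Smith."))
# -> Customer: Chloe Chloe Chloe Smith.
```

Requiring at least two digit words and three letters is a deliberate choice in this sketch: it keeps ordinary uses of "one" or "a" from being rewritten, at the cost of missing very short dictated values.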


# Running the Framework 

## Framework Configuration

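The framework is configurable per call: the normalization and sweeping layers can be switched on or off (`apply_normalization`, `apply_sweeping`), and `field_mapping` links each replacement tag to the transcript-row columns it covers.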

In [11]:
# Map each replacement tag to the transcript-row columns it covers
field_mapping = {
    '<FIRST_NAME>': ['member_first_name', 'agent_first_name'],
    '<LAST_NAME>': ['member_last_name'],
    '<PHONE_NUMBER>': ['member_mobile'],
    '<EMAIL>': ['member_email']
}

# Build the Presidio framework once and reuse it for every transcript
presidio_framework = PurePresidioFramework(enable_mlflow=False)

results = []
for _, row in raw_df.iterrows():
    result = process_transcript_with_normalization(
        row['call_transcript'],
        presidio_framework=presidio_framework,
        apply_normalization=True,
        apply_sweeping=True,
        email_username_words=2,
        custom_sweeping_dict=None,
        transcript_row=row.to_dict(),
        field_mapping=field_mapping
    )

    # Attach the call ID and expose each stage's output under a table-friendly name
    result['call_id'] = row['call_id']
    result['original_transcript'] = row['call_transcript']
    result['normalized_transcript'] = result['normalized_text']
    result['presidio_transcript'] = result['presidio_text']
    result['anonymized_transcript'] = result['anonymized_text']

    results.append(result)

# Render the four processing stages side by side
table = create_simple_transcript_table(
    results,
    column_names=['original_transcript', 'normalized_transcript', 'presidio_transcript', 'anonymized_transcript']
)
display(HTML(table))

Original Transcript,Normalized Transcript,Presidio Transcript,Anonymized Transcript
"Agent: Hi, this is Liam from Bricks Health Insurance. Agent: Could you confirm your full name, please? Customer: Chloe Smith. Agent: Could you provide your residential address? Customer: 709 King Street, Adelaide SA 5000. Agent: And your email address, please? Customer: chloe.smith@example.com. Agent: May I have your mobile number? Customer: my number is 048561 415 113. Agent: Could I please have your Bricks membership number? Customer: 58440378. Agent: Finally, could I confirm your birthday? Customer: It is 15th March, 1985. Agent: Thank you for verifying, Chloe. How can I assist you today?","Agent: Hi, this is Liam from Bricks Health Insurance. Agent: Could you confirm your full name, please? Customer: Chloe Smith. Agent: Could you provide your residential address? Customer: 709 King Street, Adelaide SA 5000. Agent: And your email address, please? Customer: chloe.smith@example.com. Agent: May I have your mobile number? Customer: my number is 048561 415 113. Agent: Could I please have your Bricks membership number? Customer: 58440378. Agent: Finally, could I confirm your birthday? Customer: It is 15th March, 1985. Agent: Thank you for verifying, Chloe. How can I assist you today?","Agent: Hi, this is <PERSON> from Bricks Health Insurance. Agent: Could you confirm your full name, please? Customer: <PERSON>. Agent: Could you provide your residential address? Customer: <AU_ADDRESS>. Agent: And your email address, please? Customer: <EMAIL_ADDRESS>. Agent: May I have your mobile number? Customer: my number is <AU_PHONE_NUMBER>. Agent: Could I please have your <PERSON> membership number? Customer: <GENERIC_NUMBER>. Agent: Finally, could I confirm your birthday? Customer: It is 15th March, <GENERIC_NUMBER>. Agent: Thank you for verifying, <PERSON>. How can I assist you today?","Agent: Hi, this is <PERSON> from Bricks Health Insurance. Agent: Could you confirm your full name, please? Customer: <PERSON>. Agent: Could you provide your residential address? 
Customer: <AU_ADDRESS>. Agent: And your email address, please? Customer: <EMAIL_ADDRESS>. Agent: May I have your mobile number? Customer: my number is <AU_PHONE_NUMBER>. Agent: Could I please have your <PERSON> membership number? Customer: <GENERIC_NUMBER>. Agent: Finally, could I confirm your birthday? Customer: It is <GENERIC_NUMBER> <MONTH>, <GENERIC_NUMBER>. Agent: Thank you for verifying, <PERSON>. How can I assist you today?"
"Agent: Hi, this is Liam from Bricks Health Insurance. Agent: Could you confirm your full name, please? Customer: Chloe C h l o e Chloe Smith. Agent: Could you provide your residential address? Customer: 709 King Street, Adelaide SA 5000. Agent: And your email address, please? Customer: that would be c h l o e smith at example dot com Agent: May I have your mobile number? Customer: my number is zero four eight five six one four one five one one three. Agent: Could I please have your Bricks membership number? Customer: five eight four four zero three seven eight. Agent: Finally, could I confirm your birthday? Customer: It is fifteenth March, 1985. Agent: Thank you for verifying, Chloe. How can I assist you today?","Agent: Hi, this is Liam from Bricks Health Insurance. Agent: Could you confirm your full name, please? Customer: Chloe Chloe Chloe Smith. Agent: Could you provide your residential address? Customer: 709 King Street, Adelaide SA 5000. Agent: And your email address, please? Customer: that would be chloe.smith@example.com Agent: May I have your mobile number? Customer: my number is 048561415113. Agent: Could I please have your Bricks membership number? Customer: 58440378. Agent: Finally, could I confirm your birthday? Customer: It is fifteenth March, 1985. Agent: Thank you for verifying, Chloe. How can I assist you today?","Agent: Hi, this is <PERSON> from Bricks Health Insurance. Agent: Could you confirm your full name, please? Customer: Chloe Chloe <PERSON>. Agent: Could you provide your residential address? Customer: <AU_ADDRESS>. Agent: And your email address, please? Customer: that would be <EMAIL_ADDRESS> Agent: May I have your mobile number? Customer: my number is <GENERIC_NUMBER>. Agent: Could I please have your <PERSON> membership number? Customer: <GENERIC_NUMBER>. Agent: Finally, could I confirm your birthday? Customer: It is fifteenth March, <GENERIC_NUMBER>. Agent: Thank you for verifying, <PERSON>. 
How can I assist you today?","Agent: Hi, this is <PERSON> from Bricks Health Insurance. Agent: Could you confirm your full name, please? Customer: <FIRST_NAME> <FIRST_NAME> <PERSON>. Agent: Could you provide your residential address? Customer: <AU_ADDRESS>. Agent: And your email address, please? Customer: that would be <EMAIL_ADDRESS> Agent: May I have your mobile number? Customer: my number is <GENERIC_NUMBER>. Agent: Could I please have your <PERSON> membership number? Customer: <GENERIC_NUMBER>. Agent: Finally, could I confirm your birthday? Customer: It is <GENERIC_NUMBER> <MONTH>, <GENERIC_NUMBER>. Agent: Thank you for verifying, <PERSON>. How can I assist you today?"
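
In the second transcript above, the Presidio column still contains the repeated `Chloe Chloe`, while the final column shows both replaced with `<FIRST_NAME>`. That is consistent with a sweeping pass that looks up known values from the transcript row through `field_mapping` and replaces any that survived detection. The sketch below illustrates that idea only: `sweep_known_values` is a hypothetical helper, not the `TextSweeper` implementation, and the case-insensitive whole-token matching is an assumption.

```python
import re


def sweep_known_values(text: str, row: dict, field_mapping: dict) -> str:
    """Replace values known from the transcript row with their mapped tag."""
    for tag, columns in field_mapping.items():
        for column in columns:
            value = row.get(column)
            if not value:
                continue  # column missing or empty for this transcript
            # Lookarounds instead of \b so values that start or end with
            # punctuation (for example "+61...") can still match.
            pattern = re.compile(
                rf"(?<!\w){re.escape(str(value))}(?!\w)", re.IGNORECASE
            )
            text = pattern.sub(lambda _match, tag=tag: tag, text)
    return text


example_mapping = {
    "<FIRST_NAME>": ["member_first_name", "agent_first_name"],
    "<LAST_NAME>": ["member_last_name"],
}
example_row = {
    "member_first_name": "Chloe",
    "agent_first_name": "Liam",
    "member_last_name": "Smith",
}
print(sweep_known_values("Customer: Chloe Chloe <PERSON>.", example_row, example_mapping))
# -> Customer: <FIRST_NAME> <FIRST_NAME> <PERSON>.
```

A lookup of this kind can only catch values that appear in the row exactly as spoken, which is one reason to run it after normalization and not in place of it.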


## Running Over the Whole Dataset

In [5]:
# Run the Integrated Framework
results_df = process_dataset_with_normalization(raw_df,
                                                id_column='call_id',
                                                apply_normalization=True,
                                                apply_sweeping=True,
                                                email_username_words=2,
                                                custom_sweeping_dict=None,
                                                field_mapping=field_mapping)
    

print(f"✅ Framework processing complete: {len(results_df)} transcripts processed")


print("\n📊 DATASET OVERVIEW:")
print(f"   Ground Truth Transcripts: {len(raw_df)}")
print(f"   Processed Results:        {len(results_df)}")
print(f"   Evaluation Mode:          {EVALUATION_MODE.upper()}")


🚀 Starting Integrated Normalization + Presidio + Sweeping Framework processing...
Processing transcript 2/2...
✅ Processing complete! Final metrics:
 • total_transcripts: 2
 • total_pii_detected: 28
 • avg_pii_per_transcript: 14.0
 • total_processing_time_seconds: 0.0338
 • presidio_processing_time_seconds: 0.0318
 • normalization_processing_time_seconds: 0.001
 • workflow_stages: Normalization + Presidio + Sweeping
 • avg_processing_time_per_transcript_seconds: 0.0169
 • estimated_time_for_1m_transcripts: 4.70 hours
 • sweeping_processing_time_seconds: 0.0
 • sweeping_percentage_of_total: 0.0
✅ Framework processing complete: 2 transcripts processed

📊 DATASET OVERVIEW:
   Ground Truth Transcripts: 2
   Processed Results:        2
   Evaluation Mode:          BUSINESS
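
The one-million-transcript estimate in the log appears to be a linear extrapolation of the average per-transcript time. The arithmetic below reproduces it from the rounded figures printed above. Because the inputs are rounded, the result lands near the reported 4.70 hours without matching it exactly, and because the average comes from only two short transcripts, it is best read as a rough order of magnitude. The function name `estimate_hours` is introduced here for illustration.

```python
# Rounded figures copied from the processing log above.
avg_seconds_per_transcript = 0.0169
target_transcripts = 1_000_000


def estimate_hours(avg_seconds: float, n_transcripts: int) -> float:
    """Linear extrapolation: total seconds converted to hours."""
    return avg_seconds * n_transcripts / 3600


estimated_hours = estimate_hours(avg_seconds_per_transcript, target_transcripts)
print(f"Estimated time for 1M transcripts: {estimated_hours:.2f} hours")
# -> Estimated time for 1M transcripts: 4.69 hours
```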


## 📈 1. Performance Overview

High-level performance metrics for the integrated framework, currently measured on the Presidio-stage output.

**TODO: Update the evaluation functions to work on the final anonymized text.** The evaluation currently relies on the Presidio output, so the improvement from the sweeping layer is not yet reflected in the metrics below. Recall after sweeping on the synthetic data is actually 100%.
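
Until the evaluator is updated, one rough way to sanity-check the final output is to test whether any known ground-truth value still appears verbatim in the anonymized text. The helper below is a hypothetical sketch and not part of `evaluation.metrics`: it assumes the row is a plain dict, and it only catches exact, case-insensitive matches, so a value that survives in a different spelling would go unnoticed.

```python
def find_leaked_values(anonymized_text: str, row: dict, columns: list) -> list:
    """Return (column, value) pairs whose value still appears in the text."""
    lowered = anonymized_text.lower()
    leaks = []
    for column in columns:
        value = row.get(column)
        if value and str(value).lower() in lowered:
            leaks.append((column, str(value)))
    return leaks


example_row = {"member_first_name": "Chloe", "member_last_name": "Smith"}
pii_columns = ["member_first_name", "member_last_name"]

# Presidio-stage text from the second transcript: the repeated first name survives.
print(find_leaked_values("Customer: Chloe Chloe <PERSON>.", example_row, pii_columns))
# -> [('member_first_name', 'Chloe')]

# Final anonymized text: nothing left to find.
print(find_leaked_values("Customer: <FIRST_NAME> <FIRST_NAME> <PERSON>.", example_row, pii_columns))
# -> []
```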

In [6]:
# Initialize evaluator
evaluator = PIIEvaluator(matching_mode=EVALUATION_MODE, transcript_column="normalized_transcript", id_column="call_id")

# Calculate overall framework performance
print("🔄 Calculating comprehensive framework evaluation...")
evaluation_results = evaluator.evaluate_framework_results(results_df, raw_df)

# Print detailed evaluation summary
evaluator.print_evaluation_summary(evaluation_results)

🔧 PIIEvaluator initialized with 'business' matching mode
   ✅ Business Focus: Any PII detection over ground truth = SUCCESS
🔄 Calculating comprehensive framework evaluation...

🎯 PII DEIDENTIFICATION EVALUATION RESULTS

📊 OVERALL PERFORMANCE:
   Precision:           0.727
   Recall:              0.889 ❌
   F1-Score:            0.800
   PII Protection Rate: 0.953 🛡️

📈 DETAILED COUNTS:
   True Positives:  16.0
   False Positives: 6
   False Negatives: 2

🔍 ENTITY TYPE BREAKDOWN:
   member_full_name       | P: 1.000 | R: 1.000 | F1: 1.000
   member_email           | P: 1.000 | R: 1.000 | F1: 1.000
   member_mobile          | P: 1.000 | R: 1.000 | F1: 1.000
   member_address         | P: 1.000 | R: 1.000 | F1: 1.000
   member_number          | P: 1.000 | R: 1.000 | F1: 1.000
   consultant_first_name  | P: 1.000 | R: 1.000 | F1: 1.000
   member_first_name      | P: 1.000 | R: 0.667 | F1: 0.800
   GENERIC_NUMBER         | P: 0.000 | R: 0.000 | F1: 0.000
   PERSON                 | P: 0.000 
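
The overall scores follow directly from the detailed counts. As a quick check, assuming the standard definitions of precision, recall, and F1 (the evaluator's own implementation is not shown in this notebook), the counts above reproduce the reported 0.727, 0.889, and 0.800. The function name `precision_recall_f1` is introduced here for illustration.

```python
def precision_recall_f1(tp: float, fp: float, fn: float) -> tuple:
    """Standard precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Counts copied from the DETAILED COUNTS output above.
precision, recall, f1 = precision_recall_f1(tp=16, fp=6, fn=2)
print(f"Precision: {precision:.3f}  Recall: {recall:.3f}  F1: {f1:.3f}")
# -> Precision: 0.727  Recall: 0.889  F1: 0.800
```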

## 🔍 2. Top 5 Cases with Least Recall (Missed PII)

Identify transcripts where the most PII was missed to understand failure patterns.

In [7]:
# Get worst recall cases
worst_recall_cases = get_transcript_cases_by_performance(
    results_df=results_df,
    ground_truth_df=raw_df,
    transcript_column="normalized_transcript",
    metric='recall',
    n_cases=5,
    ascending=True,  # True = worst performers first
    matching_mode=EVALUATION_MODE
)

# Create diagnostic HTML table
worst_recall_html = create_diagnostic_html_table_configurable(
    transcript_data=worst_recall_cases,
    transcript_column="normalized_transcript",
    title="🔴 Top 5 Worst Recall Cases - Missed PII Analysis",
    description="""These transcripts had the lowest recall scores, meaning significant PII was missed.
    <strong>Red highlights</strong> show missed PII that should have been detected.
    Focus on patterns in missed PII to improve detection rules.""",
    matching_mode=EVALUATION_MODE
)

display(HTML(worst_recall_html))

# Summary insights for worst recall cases
print("\n💡 RECALL IMPROVEMENT INSIGHTS:")
recall_scores = [case['performance_metrics']['recall'] for case in worst_recall_cases]
avg_worst_recall = np.mean(recall_scores)
print(f"   📉 Average recall in worst cases: {avg_worst_recall:.1%}")
print("   🎯 These cases need the most attention for PII detection improvements")
print("   🔍 Look for patterns in the missed PII (red highlights) above")


🔍 ANALYZING TRANSCRIPT PERFORMANCE BY RECALL
🔧 PIIEvaluator initialized with 'business' matching mode
   ✅ Business Focus: Any PII detection over ground truth = SUCCESS

📊 WORST 5 PERFORMERS BY RECALL:
  1. Call 2: recall=80.0%, Recall=80.0%, Precision=72.7%, F1=76.2%
  2. Call 1: recall=100.0%, Recall=100.0%, Precision=72.7%, F1=84.2%

✅ Prepared 2 cases for analysis
🔧 PIIEvaluator initialized with 'business' matching mode
   ✅ Business Focus: Any PII detection over ground truth = SUCCESS


📊 Metrics & Performance,📋 Original Transcript,🛡️ Cleaned Transcript
📋 CALL ID: 2  🎯 Total PII Occurrences: 10  📈 PERFORMANCE (BUSINESS):  • Recall: 80.0%  • Precision: 72.7%  • 🛡️ PII Protection: 90.6%  🎯 STATUS:  🟡 Good Protection,"Agent: Hi, this is Liam from Bricks Health Insurance. Agent: Could you confirm your full name, please? Customer: Chloe Chloe Chloe Smith. Agent: Could you provide your residential address? Customer: 709 King Street, Adelaide SA 5000. Agent: And your email address, please? Customer: that would be chloe.smith@example.com Agent: May I have your mobile number? Customer: my number is 048561415113. Agent: Could I please have your Bricks membership number? Customer: 58440378. Agent: Finally, could I confirm your birthday? Customer: It is fifteenth March, 1985. Agent: Thank you for verifying, Chloe. How can I assist you today?","Agent: Hi, this is <PERSON> from Bricks Health Insurance. Agent: Could you confirm your full name, please? Customer: <FIRST_NAME> <FIRST_NAME> <PERSON>. Agent: Could you provide your residential address? Customer: <AU_ADDRESS>. Agent: And your email address, please? Customer: that would be <EMAIL_ADDRESS> Agent: May I have your mobile number? Customer: my number is <AU_PHONE_NUMBER>. Agent: Could I please have your <PERSON> membership number? Customer: <GENERIC_NUMBER>. Agent: Finally, could I confirm your birthday? Customer: It is <GENERIC_NUMBER> <MONTH>, <GENERIC_NUMBER>. Agent: Thank you for verifying, <PERSON>. How can I assist you today?"
📋 CALL ID: 1  🎯 Total PII Occurrences: 8  📈 PERFORMANCE (BUSINESS):  • Recall: 100.0%  • Precision: 72.7%  • 🛡️ PII Protection: 100.0%  🎯 STATUS:  🟡 Good Protection,"Agent: Hi, this is Liam from Bricks Health Insurance. Agent: Could you confirm your full name, please? Customer: Chloe Smith. Agent: Could you provide your residential address? Customer: 709 King Street, Adelaide SA 5000. Agent: And your email address, please? Customer: chloe.smith@example.com. Agent: May I have your mobile number? Customer: my number is 048561 415 113. Agent: Could I please have your Bricks membership number? Customer: 58440378. Agent: Finally, could I confirm your birthday? Customer: It is 15th March, 1985. Agent: Thank you for verifying, Chloe. How can I assist you today?","Agent: Hi, this is <PERSON> from Bricks Health Insurance. Agent: Could you confirm your full name, please? Customer: <PERSON>. Agent: Could you provide your residential address? Customer: <AU_ADDRESS>. Agent: And your email address, please? Customer: <EMAIL_ADDRESS>. Agent: May I have your mobile number? Customer: my number is <AU_PHONE_NUMBER>. Agent: Could I please have your <PERSON> membership number? Customer: <GENERIC_NUMBER>. Agent: Finally, could I confirm your birthday? Customer: It is <GENERIC_NUMBER> <MONTH>, <GENERIC_NUMBER>. Agent: Thank you for verifying, <PERSON>. How can I assist you today?"



💡 RECALL IMPROVEMENT INSIGHTS:
   📉 Average recall in worst cases: 90.0%
   🎯 These cases need the most attention for PII detection improvements
   🔍 Look for patterns in the missed PII (red highlights) above
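
The 90.0% average here differs from the overall recall of 0.889 in the Performance Overview because the two numbers are averaged differently: this one is an unweighted mean of per-transcript recalls, while the overall figure pools all PII occurrences. The sketch below reproduces both from the per-call figures in the table, assuming each call's detected count equals its total occurrences multiplied by its recall. The function names are introduced here for illustration.

```python
# Per-call figures copied from the diagnostic table above.
calls = [
    {"call_id": 2, "total_pii": 10, "recall": 0.80},
    {"call_id": 1, "total_pii": 8, "recall": 1.00},
]


def macro_recall(cases: list) -> float:
    """Unweighted mean of per-transcript recall."""
    return sum(case["recall"] for case in cases) / len(cases)


def micro_recall(cases: list) -> float:
    """Recall over all PII occurrences pooled across transcripts."""
    detected = sum(round(case["total_pii"] * case["recall"]) for case in cases)
    total = sum(case["total_pii"] for case in cases)
    return detected / total


print(f"Macro (per-transcript) recall: {macro_recall(calls):.1%}")
# -> Macro (per-transcript) recall: 90.0%
print(f"Micro (pooled) recall:         {micro_recall(calls):.1%}")
# -> Micro (pooled) recall:         88.9%
```

The pooled figure weights Call 2 more heavily because it contains more PII occurrences, which is why it sits slightly below the per-transcript mean.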


## 📊 3. Category Analysis - Missed PII by Type

Detailed breakdown of missed PII by category to identify specific improvement areas.

In [8]:
# Analyze missed PII by categories
category_analysis = analyze_missed_pii_categories(
    results_df=results_df,
    ground_truth_df=raw_df,
    transcript_column="normalized_transcript",
    matching_mode=EVALUATION_MODE
)

# Display detailed category insights
print("\n🔍 DETAILED CATEGORY ANALYSIS:")
print("=" * 60)

improvement_insights = category_analysis['improvement_insights']
missed_by_category = category_analysis['missed_by_category']
transcripts_with_misses = category_analysis['transcripts_with_misses']
transcripts_with_detections = category_analysis['transcripts_with_detections']

# Priority-based improvement recommendations
high_priority = [(cat, data) for cat, data in improvement_insights.items() if data['priority'] == 'HIGH']
medium_priority = [(cat, data) for cat, data in improvement_insights.items() if data['priority'] == 'MEDIUM']
low_priority = [(cat, data) for cat, data in improvement_insights.items() if data['priority'] == 'LOW']

if high_priority:
    print("\n🔴 HIGH PRIORITY IMPROVEMENTS:")
    for category, data in high_priority:
        print(f"   {category:20} | Miss Rate: {data['miss_rate']:.1%} | Total: {data['total_occurrences']}")
        
        # Show examples of missed vs detected for this category
        missed_examples = transcripts_with_misses.get(category, [])[:2]  # Top 2 examples
        detected_examples = transcripts_with_detections.get(category, [])[:2]  # Top 2 examples
        
        if missed_examples:
            print("     🔍 MISSED Examples:")
            for example in missed_examples:
                print(f"       Call {example['call_id']}: '{example['missed_value']}' in context: ...{example['context']}...")
        
        if detected_examples:
            print("     ✅ DETECTED Examples:")
            for example in detected_examples:
                print(f"       Call {example['call_id']}: '{example['detected_value']}' (conf: {example['overlap_ratio']:.2f})")
        print()

if medium_priority:
    print("\n🟡 MEDIUM PRIORITY IMPROVEMENTS:")
    for category, data in medium_priority:
        print(f"   {category:20} | Miss Rate: {data['miss_rate']:.1%} | Total: {data['total_occurrences']}")

if low_priority:
    print("\n🟢 LOW PRIORITY (Performing Well):")
    for category, data in low_priority:
        print(f"   {category:20} | Miss Rate: {data['miss_rate']:.1%} | Total: {data['total_occurrences']}")

# Strategic recommendations (numbered dynamically so the list stays contiguous)
print("\n🎯 STRATEGIC RECOMMENDATIONS:")
recommendations = []
if high_priority:
    recommendations += [
        "Focus development efforts on HIGH priority categories above",
        "Analyze the missed vs detected examples for pattern differences",
        "Consider custom recognizers for problematic categories",
    ]
else:
    print("   🎉 No high-priority issues found - framework performing well across categories!")

recommendations += [
    "Monitor medium priority categories for regression",
    "Use context patterns from examples to improve detection rules",
]
for number, recommendation in enumerate(recommendations, start=1):
    print(f"   {number}. {recommendation}")


🔍 ANALYZING MISSED PII BY CATEGORIES
🔧 PIIEvaluator initialized with 'business' matching mode
   ✅ Business Focus: Any PII detection over ground truth = SUCCESS

📊 MISSED PII SUMMARY:
  member_first_name    | Recall: 66.7% | Missed:   2/  6 | Priority: HIGH

🔍 DETAILED CATEGORY ANALYSIS:

🔴 HIGH PRIORITY IMPROVEMENTS:
   member_first_name    | Miss Rate: 33.3% | Total: 6
     🔍 MISSED Examples:
       Call 2: 'Chloe' in context: ...uld you confirm your full name, please?
Customer: Chloe Chloe Chloe Smith.
Agent: Could you provide your ...
       Call 2: 'Chloe' in context: ...u confirm your full name, please?
Customer: Chloe Chloe Chloe Smith.
Agent: Could you provide your reside...
     ✅ DETECTED Examples:
       Call 1: 'Chloe Smith' (conf: 1.00)
       Call 1: 'Chloe' (conf: 1.00)


🟢 LOW PRIORITY (Performing Well):
   consultant_first_name | Miss Rate: 0.0% | Total: 2
   member_full_name     | Miss Rate: 0.0% | Total: 2
   member_number        | Miss Rate: 0.0% | Total: 2
   membe