# 🛡️ Three-Stage Workflow Demo

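**This notebook demonstrates the three-stage PII de-identification workflow:**
- **(a) Raw transcript** - Voice-to-text input containing variations such as the spelled-out name 'C h l o e'
- **(b) Normalized transcript** - Output of `TextNormaliser`, which converts 'C h l o e' → 'chloe'
- **(c) Cleaned transcript** - Normalized text after the PII framework replaces detected entities with placeholders such as `<PERSON>`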
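
To make stage (b) concrete, here is a minimal sketch of collapsing a spelled-out name. It is not the project's `TextNormaliser`, whose rules are not shown in this notebook. The helper name `collapse_spelled_out` and the regular expression are assumptions chosen for illustration.

```python
import re

# Hypothetical helper, NOT the project's TextNormaliser: joins runs of three or
# more single letters separated by single spaces and lowercases the result.
_SPELLED_OUT = re.compile(r"\b(?:[A-Za-z] ){2,}[A-Za-z]\b")


def collapse_spelled_out(text: str) -> str:
    """Collapse sequences such as 'C h l o e' into 'chloe'."""
    return _SPELLED_OUT.sub(lambda m: m.group(0).replace(" ", "").lower(), text)


sample = "Customer: my name is C h l o e Smith."
print(collapse_spelled_out(sample))
```

Requiring at least three letters keeps ordinary one-letter words such as "a" or "I" from being merged in most sentences, although a pattern this simple can still produce false positives on real transcripts.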

## 📋 Demo Sections
1. **Performance Overview** - Framework metrics and evaluation
2. **Stage Comparison** - Visual comparison of all three stages

In [1]:
import pandas as pd
import numpy as np
import sys
import os
from pathlib import Path
from IPython.display import HTML, display
import warnings
warnings.filterwarnings('ignore')

# Add src to path for imports
project_root = Path().absolute().parent
sys.path.append(str(project_root / 'src'))

# Import existing functions
from evaluation.metrics import PIIEvaluator
from evaluation.diagnostics import (
    get_transcript_cases_by_performance,
    create_diagnostic_html_table_configurable    
)
from baseline.presidio_framework import PurePresidioFramework
from utils.text_normaliser import TextNormaliser

## 📊 Data Loading and Three-Stage Processing

In [2]:
# Configuration
DATA_PATH = project_root / '.data' / 'synthetic_call_transcripts_voice_to_texts.csv'
EVALUATION_MODE = 'business'  # Focus on PII protection rather than exact type matching
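N_SAMPLES = 3  # Process at most the first 3 transcripts for the demo

# Fail early with a clear message rather than a NameError on raw_df later
if not DATA_PATH.exists():
    raise FileNotFoundError(f"Transcript file not found: {DATA_PATH}")

raw_df = pd.read_csv(DATA_PATH)

# head() returns fewer than N_SAMPLES rows when the file is smaller
demo_df = raw_df.head(N_SAMPLES).copy()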


In [3]:
# Initialize components
normalizer = TextNormaliser()
framework = PurePresidioFramework(enable_mlflow=False)  # Disable MLflow for demo

# Process through three stages
print(f"🔄 Processing {len(demo_df)} transcripts through three-stage workflow...")

three_stage_results = []

for idx, row in demo_df.iterrows():
    call_id = row['call_id']
    
    # Stage A: Raw transcript
    stage_a = row['call_transcript']
    
    # Stage B: Normalized transcript
    stage_b = normalizer.normalize_text(stage_a)
    
    # Stage C: Process with PII framework (using normalized text)
    pii_result = framework.process_transcript(stage_b)
    stage_c = pii_result['anonymized_text']
    
    # Store results
    result = {
        'call_id': call_id,
        'raw_transcript': stage_a,
        'normalized_transcript': stage_b,
        'anonymized_transcript': stage_c,
        'detected_pii': pii_result['pii_detections'],
        'processing_metadata': {
            'normalization_applied': stage_a != stage_b,
            'original_length': len(stage_a),
            'normalized_length': len(stage_b),
            'cleaned_length': len(stage_c)
        },
        # Ground truth (canonical values)
        'member_first_name': row['member_first_name'],
        'member_full_name': row['member_full_name'],
        'member_email': row['member_email'],
        'member_mobile': row['member_mobile'],
        'member_address': row['member_address'],
        'member_number': str(row['member_number']),
        'consultant_first_name': row['consultant_first_name']
    }
    
    three_stage_results.append(result)
    print(f"   ✅ Processed call_id {call_id}")

print("\n🎉 Three-stage processing complete!")


🔄 Processing 2 transcripts through three-stage workflow...
   ✅ Processed call_id 1
   ✅ Processed call_id 2

🎉 Three-stage processing complete!


In [4]:
pd.DataFrame(three_stage_results)

| | call_id | raw_transcript | normalized_transcript | anonymized_transcript | detected_pii | processing_metadata | member_first_name | member_full_name | member_email | member_mobile | member_address | member_number | consultant_first_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Agent: Hi, this is Liam from Bricks Health Ins... | Agent: Hi, this is Liam from Bricks Health Ins... | Agent: Hi, this is <PERSON> from Bricks Health... | [{'entity_type': 'GENERIC_NUMBER', 'start': 18... | {'normalization_applied': False, 'original_len... | Chloe | Chloe Smith | chloe.smith@example.com | 048561 415 113 | 709 King Street, Adelaide SA 5000 | 58440378 | Liam |
| 1 | 2 | Agent: Hi, this is Liam from Bricks Health Ins... | Agent: Hi, this is Liam from Bricks Health Ins... | Agent: Hi, this is <PERSON> from Bricks Health... | [{'entity_type': 'GENERIC_NUMBER', 'start': 19... | {'normalization_applied': True, 'original_leng... | Chloe | Chloe Smith | chloe.smith@example.com | 048561 415 113 | 709 King Street, Adelaide SA 5000 | 58440378 | Liam |
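
The `detected_pii` column holds one dictionary per detection, and the `anonymized_transcript` column shows each detection replaced by a `<ENTITY_TYPE>` placeholder. The sketch below illustrates how such a replacement can be done from character offsets. It is not the `PurePresidioFramework` implementation: the function name `apply_placeholders` is hypothetical, the `end` key is an assumption (only `entity_type` and `start` are visible in the truncated output above), and spans are assumed not to overlap.

```python
from typing import Dict, List


def apply_placeholders(text: str, detections: List[Dict]) -> str:
    """Replace each detected span with a <ENTITY_TYPE> placeholder.

    Assumes non-overlapping spans and an 'end' key (exclusive offset).
    """
    # Work right to left so earlier offsets stay valid after each replacement.
    for det in sorted(detections, key=lambda d: d["start"], reverse=True):
        placeholder = f"<{det['entity_type']}>"
        text = text[: det["start"]] + placeholder + text[det["end"]:]
    return text


text = "Customer: Chloe Smith. Customer: 58440378."
name, number = "Chloe Smith", "58440378"
detections = [
    {"entity_type": "PERSON",
     "start": text.index(name), "end": text.index(name) + len(name)},
    {"entity_type": "GENERIC_NUMBER",
     "start": text.index(number), "end": text.index(number) + len(number)},
]
print(apply_placeholders(text, detections))
```

Processing spans from the end of the string backwards avoids having to recompute offsets after each substitution changes the text length.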


In [5]:
# Display the HTML table, using `normalized_transcript` as the original text for PII highlighting
html = create_diagnostic_html_table_configurable(
    transcript_data=three_stage_results,
    transcript_column='normalized_transcript',
    title="Normalized Transcript Performance",
    description="Demonstration of normalized and cleaned transcripts with PII detection results.",
    matching_mode=EVALUATION_MODE
)

# display and HTML are already imported in the first cell
display(HTML(html))

🔧 PIIEvaluator initialized with 'business' matching mode
   ✅ Business Focus: Any PII detection over ground truth = SUCCESS


| 📊 Metrics & Performance | 📋 Original Transcript | 🛡️ Cleaned Transcript |
|---|---|---|
| 📋 CALL ID: 1  🎯 Total PII Occurrences: 8  📈 PERFORMANCE (BUSINESS):  • Recall: 100.0%  • Precision: 72.7%  • 🛡️ PII Protection: 100.0%  🎯 STATUS:  🟡 Good Protection | Agent: Hi, this is Liam from Bricks Health Insurance. Agent: Could you confirm your full name, please? Customer: Chloe Smith. Agent: Could you provide your residential address? Customer: 709 King Street, Adelaide SA 5000. Agent: And your email address, please? Customer: chloe.smith@example.com. Agent: May I have your mobile number? Customer: my number is 048561 415 113. Agent: Could I please have your Bricks membership number? Customer: 58440378. Agent: Finally, could I confirm your birthday? Customer: It is 15th March, 1985. Agent: Thank you for verifying, Chloe. How can I assist you today? | Agent: Hi, this is <PERSON> from Bricks Health Insurance. Agent: Could you confirm your full name, please? Customer: <PERSON>. Agent: Could you provide your residential address? Customer: <AU_ADDRESS>. Agent: And your email address, please? Customer: <EMAIL_ADDRESS>. Agent: May I have your mobile number? Customer: my number is <AU_PHONE_NUMBER>. Agent: Could I please have your <PERSON> membership number? Customer: <GENERIC_NUMBER>. Agent: Finally, could I confirm your birthday? Customer: It is 15th March, <GENERIC_NUMBER>. Agent: Thank you for verifying, <PERSON>. How can I assist you today? |
| 📋 CALL ID: 2  🎯 Total PII Occurrences: 10  📈 PERFORMANCE (BUSINESS):  • Recall: 80.0%  • Precision: 72.7%  • 🛡️ PII Protection: 90.6%  🎯 STATUS:  🟡 Good Protection | Agent: Hi, this is Liam from Bricks Health Insurance. Agent: Could you confirm your full name, please? Customer: Chloe Chloe Chloe Smith. Agent: Could you provide your residential address? Customer: 709 King Street, Adelaide SA 5000. Agent: And your email address, please? Customer: that would be chloe.smith@example.com Agent: May I have your mobile number? Customer: my number is 048561415113. Agent: Could I please have your Bricks membership number? Customer: 58440378. Agent: Finally, could I confirm your birthday? Customer: It is fifteenth March, 1985. Agent: Thank you for verifying, Chloe. How can I assist you today? | Agent: Hi, this is <PERSON> from Bricks Health Insurance. Agent: Could you confirm your full name, please? Customer: Chloe Chloe <PERSON>. Agent: Could you provide your residential address? Customer: <AU_ADDRESS>. Agent: And your email address, please? Customer: that would be <EMAIL_ADDRESS> Agent: May I have your mobile number? Customer: my number is <GENERIC_NUMBER>. Agent: Could I please have your <PERSON> membership number? Customer: <GENERIC_NUMBER>. Agent: Finally, could I confirm your birthday? Customer: It is fifteenth March, <GENERIC_NUMBER>. Agent: Thank you for verifying, <PERSON>. How can I assist you today? |
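
The recall and precision figures above are reported under the 'business' matching mode, which the evaluator describes as counting any detection over ground-truth PII as a success, regardless of entity type. The following is a rough sketch of what such overlap-based scoring could look like. It is an assumption for illustration, not the `PIIEvaluator` code: the names `overlaps` and `business_metrics` are hypothetical, the spans are made-up offsets rather than values from the transcripts, and the separate "PII Protection" figure is not modelled.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True when the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def business_metrics(truth: List[Span], detected: List[Span]) -> Tuple[float, float]:
    """Return (recall, precision), ignoring entity types entirely."""
    # A ground-truth span is covered if any detection overlaps it.
    covered = sum(any(overlaps(t, d) for d in detected) for t in truth)
    # A detection is useful if it overlaps any ground-truth span.
    useful = sum(any(overlaps(d, t) for t in truth) for d in detected)
    recall = covered / len(truth) if truth else 1.0
    precision = useful / len(detected) if detected else 1.0
    return recall, precision


# Illustrative offsets only: two of four PII spans are hit, one detection is spurious.
truth = [(0, 5), (10, 20), (30, 38), (50, 55)]
detected = [(0, 5), (12, 18), (40, 44)]
recall, precision = business_metrics(truth, detected)
print(f"Recall: {recall:.1%}  Precision: {precision:.1%}")
```

Under this kind of scoring, a phone number labelled `<GENERIC_NUMBER>` instead of `<AU_PHONE_NUMBER>` still counts toward recall, while a detection that covers no ground-truth PII lowers precision only.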
