# 🔒 PII Detection and Privacy Protection in GenAI Applications

## Overview

**Personally Identifiable Information (PII)** refers to any data that can be used to identify a specific individual. Protecting PII is crucial for:

- **Legal Compliance**: GDPR, CCPA, HIPAA, and other regulations mandate PII protection
- **User Trust**: Users expect their personal data to be handled responsibly
- **Security**: Preventing data breaches and identity theft
- **AI Safety**: Preventing LLMs from leaking sensitive training data

### Types of PII

| Category | Examples |
|----------|----------|
| **Direct Identifiers** | Name, SSN, Email, Phone, Passport |
| **Quasi-Identifiers** | Date of Birth, ZIP Code, Gender (can identify when combined) |
| **Sensitive Data** | Medical records, Financial info, Biometrics |
| **Online Identifiers** | IP Address, Device ID, Cookies |
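
The quasi-identifier row is easiest to see with a small count. The sketch below uses made-up records (every value is hypothetical, as is the `unique_combinations` helper) and counts how many rows share each combination of date of birth, ZIP code, and gender. A combination that occurs only once singles out one person even though no direct identifier is present.

```python
from collections import Counter

# Hypothetical records with no direct identifiers (no name, SSN, or email)
records = [
    {"dob": "1990-04-12", "zip": "10001", "gender": "F"},
    {"dob": "1990-04-12", "zip": "10001", "gender": "M"},
    {"dob": "1985-11-30", "zip": "10001", "gender": "F"},
    {"dob": "1985-11-30", "zip": "10001", "gender": "F"},
]

def unique_combinations(rows, keys):
    """Return the key combinations that match exactly one row."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [combo for combo, n in counts.items() if n == 1]

# ZIP code alone never isolates anyone in this sample
zip_only = unique_combinations(records, ["zip"])
# Combining all three fields isolates two of the four people
combined = unique_combinations(records, ["dob", "zip", "gender"])

print("Unique by ZIP only:", zip_only)
print("Unique by DOB + ZIP + gender:", combined)
```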

### Privacy Protection Techniques Covered

1. **PII Detection & Anonymization**: Find and mask sensitive data using Presidio
2. **Synthetic Data Generation**: Create realistic fake data using Faker
3. **Output Filtering**: Block sensitive content in LLM responses
4. **Consent Management**: Track and enforce user data permissions

---

## Part 1: PII Detection & Anonymization with Presidio

### What is Presidio?

**Microsoft Presidio** is an open-source SDK for detecting and anonymizing PII in text and images. It uses:

- **Named Entity Recognition (NER)**: ML models to identify entities like names, locations
- **Pattern Matching**: Regex patterns for structured data (SSN, credit cards, phones)
- **Checksum Validation**: Verify formats (e.g., Luhn algorithm for credit cards)
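
To make the checksum idea concrete, here is a minimal pure-Python sketch of the Luhn algorithm. It is an illustration of the technique, not Presidio's own code, and the function name `luhn_valid` is made up for this example.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0

print(luhn_valid("4111-1111-1111-1111"))  # True
print(luhn_valid("4111-1111-1111-1112"))  # False
print(luhn_valid("4532-8790-1234-5678"))  # False
```

A regex alone would accept all three strings as 16-digit card numbers. The checksum rejects the last two, which is why a number that merely looks like a card may not be reported as one.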

### How Presidio Works

```
Input Text → AnalyzerEngine (Detect PII) → Results → AnonymizerEngine (Mask/Replace) → Safe Text
```
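
The two-stage flow can be sketched without Presidio to show why detection returns character spans and how the anonymizer consumes them. Everything here (`Finding`, `PATTERNS`, `detect`, `anonymize`) is a hypothetical stand-in built on regex only; the real engines do considerably more.

```python
import re
from typing import List, NamedTuple

class Finding(NamedTuple):
    entity_type: str
    start: int
    end: int

# Hypothetical patterns for illustration; real recognizers are more thorough
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def detect(text: str) -> List[Finding]:
    """Stage 1: return the type and character span of each match."""
    findings = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append(Finding(entity_type, match.start(), match.end()))
    return findings

def anonymize(text: str, findings: List[Finding]) -> str:
    """Stage 2: replace each span with a <TYPE> placeholder."""
    # Replace from the end of the text so earlier offsets stay valid
    for f in sorted(findings, key=lambda f: f.start, reverse=True):
        text = text[:f.start] + f"<{f.entity_type}>" + text[f.end:]
    return text

sample = "Email jane@example.com, SSN 123-45-6789."
findings = detect(sample)
safe_text = anonymize(sample, findings)
print(safe_text)  # Email <EMAIL_ADDRESS>, SSN <US_SSN>.
```

Keeping detection and replacement separate means the same findings can be logged, filtered by type, or anonymized in different ways without scanning the text again.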

### Supported Entity Types

| Entity | Description | Example |
|--------|-------------|---------|
| `PERSON` | Full names | "John Smith" |
| `PHONE_NUMBER` | Phone numbers (various formats) | "212-555-5555" |
| `EMAIL_ADDRESS` | Email addresses | "john@example.com" |
| `CREDIT_CARD` | Credit card numbers | "4111-1111-1111-1111" |
| `US_SSN` | US Social Security Numbers | "123-45-6789" |
| `LOCATION` | Addresses, cities, countries | "New York, NY" |
| `DATE_TIME` | Dates and times | "January 1, 2025" |

### Installation

```bash
pip install presidio-analyzer presidio-anonymizer
python -m spacy download en_core_web_lg  # Required NLP model
```

In [None]:
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

# Initialize the detection and anonymization engines
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

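# Sample text containing various PII types.
# Note: the account number below is made up and does not pass the Luhn checksum,
# so Presidio may not report it as CREDIT_CARD.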
text = """
Dear Support Team,
My name is Mitudru Dutta and I'm having issues with my account.
You can reach me at mitudru.dutta@email.com or call me at 212-555-5555.
My billing address is Tollygunge, Kolkata, West Bengal 700033.
Please reference my account number: 4532-8790-1234-5678.
"""

# Step 1: Analyze text to detect all PII entities
results = analyzer.analyze(
    text=text,
    language='en',
    entities=None  # None = detect all supported entities
)

# Display what was detected
print("=" * 50)
print("DETECTED PII ENTITIES:")
print("=" * 50)
for result in results:
    detected_text = text[result.start:result.end]
    print(f"  Type: {result.entity_type:20} | Text: '{detected_text}' | Confidence: {result.score:.2f}")

# Step 2: Anonymize all detected PII
anonymized_result = anonymizer.anonymize(
    text=text,
    analyzer_results=results
)

print("\n" + "=" * 50)
print("ANONYMIZED TEXT:")
print("=" * 50)
print(anonymized_result.text)

### Custom Anonymization Strategies

Presidio supports different anonymization operators:

| Operator | Description | Example |
|----------|-------------|---------|
| `replace` | Replace with placeholder | `<PHONE_NUMBER>` |
| `redact` | Remove entirely | (empty) |
| `hash` | Replace with hash | `a1b2c3d4...` |
| `mask` | Partial masking (separators are masked too) | `********5555` |
| `encrypt` | Encrypt with key | (encrypted blob) |
| `custom` | Your own function | (any transformation) |

In [None]:
# Example: Custom anonymization with different strategies per entity type
from presidio_anonymizer.entities import OperatorConfig

# Define custom operators for each PII type
operators = {
    "PERSON": OperatorConfig("replace", {"new_value": "[REDACTED_NAME]"}),
    "PHONE_NUMBER": OperatorConfig("mask", {"chars_to_mask": 8, "masking_char": "*", "from_end": False}),
    "EMAIL_ADDRESS": OperatorConfig("replace", {"new_value": "***@***.com"}),
    "CREDIT_CARD": OperatorConfig("mask", {"chars_to_mask": 12, "masking_char": "X", "from_end": False}),
    "LOCATION": OperatorConfig("replace", {"new_value": "[LOCATION_HIDDEN]"}),
}

# Re-analyze and anonymize with custom operators
results = analyzer.analyze(text=text, language='en')
custom_anonymized = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators=operators
)

print("CUSTOM ANONYMIZATION:")
print("=" * 50)
print(custom_anonymized.text)
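
The `mask` parameters are easier to read with a small stand-in. The function below is a pure-Python sketch of what `chars_to_mask`, `masking_char`, and `from_end` mean, assuming the operator counts every character, separators included. `mask_value` is a hypothetical name and this is not Presidio's implementation.

```python
def mask_value(value: str, chars_to_mask: int, masking_char: str = "*",
               from_end: bool = False) -> str:
    """Mask `chars_to_mask` characters from the start or end of `value`."""
    # Clamp the count so it never exceeds the length of the value
    n = max(0, min(chars_to_mask, len(value)))
    if n == 0:
        return value
    if from_end:
        return value[:-n] + masking_char * n
    return masking_char * n + value[n:]

print(mask_value("212-555-5555", 8))                 # ********5555
print(mask_value("212-555-5555", 4, from_end=True))  # 212-555-****
```

With `from_end=False`, as in the phone and credit card operators above, the leading characters are hidden and the trailing ones stay visible.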

---

## Part 2: Synthetic Data Generation with Faker

### Why Use Synthetic Data?

When developing and testing AI systems, using real user data poses risks:

| Risk | Solution with Synthetic Data |
|------|------------------------------|
| **Privacy Violations** | Fake data has no real individuals to identify |
| **Regulatory Issues** | Fully fabricated data describes no real person, so GDPR/CCPA personal-data rules do not apply to it |
| **Data Breach Impact** | Leaked test data has no real-world consequences |
| **Limited Test Data** | Generate unlimited realistic samples |

### What is Faker?

**Faker** is a Python library that generates realistic-looking fake data for:
- Names, addresses, emails, phone numbers
- Credit cards, SSNs, bank accounts
- Dates, times, lorem ipsum text
- Company names, job titles
- And much more (100+ data types)

### Installation

```bash
pip install faker
```

### Key Features

- **Localization**: Generate data in 50+ languages/locales
- **Seeding**: Reproducible fake data with seeds
- **Providers**: Extensible with custom data generators
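
Seeding rests on an ordinary property of pseudo-random generators: the same seed yields the same sequence. The sketch below shows that property with Python's standard `random` module. It is a stand-in for illustration and does not use Faker; the name lists and the `make_user` and `make_batch` helpers are hypothetical.

```python
import random

# Hypothetical sample values, standing in for a real data provider
FIRST_NAMES = ["Asha", "Ben", "Chen", "Dara"]
LAST_NAMES = ["Iyer", "Jones", "Kim", "Lopez"]

def make_user(rng: random.Random) -> dict:
    """Build one fake user from the given random generator."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    return {
        "name": f"{first} {last}",
        "email": f"{first.lower()}.{last.lower()}@example.com",
        "age": rng.randint(18, 80),
    }

def make_batch(seed: int, count: int) -> list:
    """Generate `count` users from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [make_user(rng) for _ in range(count)]

batch_a = make_batch(42, 3)
batch_b = make_batch(42, 3)
print(batch_a == batch_b)  # True: same seed, same records
```

Reproducible batches make test failures repeatable, since a failing record can be regenerated from the seed alone.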

In [None]:
from faker import Faker

# Initialize with a seed for reproducibility
fake = Faker()
Faker.seed(42)  # Same seed = same fake data every time

# Generate a single synthetic user profile
print("=" * 50)
print("SINGLE SYNTHETIC USER PROFILE:")
print("=" * 50)
synthetic_user = {
    "name": fake.name(),
    "email": fake.email(),
    "phone": fake.phone_number(),
    "address": fake.address(),
    "ssn": fake.ssn(),
    "credit_card": fake.credit_card_number(),
    "date_of_birth": fake.date_of_birth(minimum_age=18, maximum_age=80),
    "company": fake.company(),
    "job": fake.job(),
}

for key, value in synthetic_user.items():
    print(f"  {key:15}: {value}")

# Generate multiple records (useful for testing datasets)
print("\n" + "=" * 50)
print("BATCH SYNTHETIC DATA (5 users):")
print("=" * 50)
for i in range(5):
    print(f"  {i+1}. {fake.name():25} | {fake.email():30} | {fake.phone_number()}")

### Localized Fake Data

Faker can generate culturally appropriate data for different regions:

In [None]:
# Generate locale-specific fake data
locales = ['en_US', 'en_IN', 'de_DE', 'ja_JP', 'fr_FR']

print("LOCALIZED FAKE NAMES:")
print("=" * 50)
for locale in locales:
    local_fake = Faker(locale)
    print(f"  {locale}: {local_fake.name()} | {local_fake.address().split(chr(10))[0]}")

---

## Part 3: Output Filtering for LLM Responses

### Why Filter LLM Outputs?

Large Language Models can inadvertently generate or leak sensitive information:

| Risk | Example |
|------|---------|
| **Memorized Training Data** | LLM outputs verbatim PII from training |
| **Prompt Injection** | Malicious prompts trick LLM into revealing data |
| **Hallucinated PII** | LLM generates realistic-looking fake PII |
| **Sensitive Topics** | Responses about passwords, financial details |

### Defense Strategies

1. **Keyword Blocklists**: Block outputs containing sensitive terms
2. **Regex Pattern Matching**: Detect structured PII (SSN, credit cards)
3. **PII Detection (Presidio)**: ML-based detection before returning output
4. **Content Classification**: ML classifiers for sensitive content categories
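
One pitfall with the first strategy is plain substring matching: the word "classname" contains "ssn", so a harmless sentence gets blocked. A minimal sketch of whole-word matching with the standard `re` module follows; the keyword list and the `find_banned` helper are hypothetical.

```python
import re

BANNED_KEYWORDS = ["password", "ssn", "api key"]

# \b anchors each keyword at word boundaries; re.escape guards special characters
BANNED_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in BANNED_KEYWORDS) + r")\b",
    re.IGNORECASE,
)

def find_banned(text: str) -> list:
    """Return the banned keywords that appear as whole words in `text`."""
    return [m.group(0).lower() for m in BANNED_RE.finditer(text)]

print("ssn" in "Set the classname attribute.".lower())  # True (substring false positive)
print(find_banned("Set the classname attribute."))      # []
print(find_banned("Never share your SSN or API key."))  # ['ssn', 'api key']
```

The trade-off is that whole-word matching misses variants such as "passwords", so the list needs to include the forms you care about.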

### Implementation Approach

```
LLM Response → Content Filter → [Pass/Block Decision] → User
```

In [None]:
import re
from typing import Tuple

class OutputFilter:
    """
    Two-layer output filter for LLM responses.
    Combines keyword blocking and regex pattern checks.
    """
    
    def __init__(self):
        # Layer 1: Banned keywords (case-insensitive)
        self.banned_keywords = [
            'password', 'credit card', 'ssn', 'social security',
            'api key', 'secret key', 'private key', 'bank account'
        ]
        
        # Layer 2: Regex patterns for structured PII
        self.pii_patterns = {
            'credit_card': r'\b(?:\d{4}[-\s]?){3}\d{4}\b',
            'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
            'phone': r'\b(?:\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b',
            'email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
        }
    
    def check_keywords(self, text: str) -> Tuple[bool, str]:
        """Check for banned keywords."""
        text_lower = text.lower()
        for keyword in self.banned_keywords:
            if keyword in text_lower:
                return False, f"Blocked: Contains banned keyword '{keyword}'"
        return True, "Passed keyword check"
    
    def check_patterns(self, text: str) -> Tuple[bool, str]:
        """Check for PII patterns using regex."""
        for pii_type, pattern in self.pii_patterns.items():
            if re.search(pattern, text):
                return False, f"Blocked: Contains potential {pii_type.replace('_', ' ')}"
        return True, "Passed pattern check"
    
    def filter(self, output: str) -> Tuple[str, dict]:
        """
        Apply all filters and return the safe output or a block notice.
        Returns: (filtered_output, filter_report)
        """
        # Start as BLOCKED so every early return carries an explicit status
        report = {"original_length": len(output), "checks": [], "status": "BLOCKED"}
        
        # Check 1: Keywords
        passed, msg = self.check_keywords(output)
        report["checks"].append({"type": "keyword", "passed": passed, "message": msg})
        if not passed:
            return "[CONTENT BLOCKED: Contains sensitive keywords]", report
        
        # Check 2: PII Patterns
        passed, msg = self.check_patterns(output)
        report["checks"].append({"type": "pattern", "passed": passed, "message": msg})
        if not passed:
            return "[CONTENT BLOCKED: Contains potential PII]", report
        
        # Every check passed, so return the text unchanged
        report["status"] = "PASSED"
        return output, report


# Initialize the filter
output_filter = OutputFilter()

# Test cases
test_outputs = [
    "Here's a recipe for chocolate cake with flour and sugar.",
    "Your credit card number is 4532-8790-1234-5678.",
    "The password for the system is 'admin123'.",
    "Contact support at help@company.com or 555-123-4567.",
    "The meeting is scheduled for tomorrow at 3 PM.",
]

print("OUTPUT FILTERING RESULTS:")
print("=" * 60)
for i, output in enumerate(test_outputs, 1):
    filtered, report = output_filter.filter(output)
    status = "✅ PASSED" if report.get("status") == "PASSED" else "❌ BLOCKED"
    print(f"\n[Test {i}] {status}")
    print(f"  Input:  {output[:50]}{'...' if len(output) > 50 else ''}")
    print(f"  Output: {filtered[:50]}{'...' if len(filtered) > 50 else ''}")

---

## Part 4: Consent Management

### Why Consent Management Matters

Under regulations like **GDPR** and **CCPA**, organizations must:

| Requirement | Description |
|-------------|-------------|
| **Obtain Consent** | Get explicit permission before processing personal data when consent is the legal basis (GDPR allows other bases; CCPA relies mainly on opt-out) |
| **Record Keeping** | Maintain audit trail of when/how consent was given |
| **Right to Withdraw** | Allow users to revoke consent at any time |
| **Data Minimization** | Process only the data needed for the purposes the user agreed to |
| **Purpose Limitation** | Only use data for stated purposes |

### Consent Types

| Type | Use Case | Example |
|------|----------|---------|
| **Data Collection** | Collecting user information | Sign-up forms |
| **Data Processing** | Using data for AI training | Model fine-tuning |
| **Data Sharing** | Sharing with third parties | Analytics partners |
| **Marketing** | Sending promotional content | Email campaigns |

### Implementation Pattern

```python
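purpose = "ai_training"

# check_consent, process_data, log_consent_denial, and skip_processing are placeholders
if check_consent(user_id, purpose=purpose):
    process_data(user_data)
else:
    log_consent_denial(user_id, purpose)
    skip_processing()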
```
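
One way to keep this check from being forgotten is to attach it to the processing function itself. The sketch below does that with a decorator over a hypothetical in-memory store. `CONSENTS`, `requires_consent`, and `add_to_training_set` are illustrative names, and it assumes the user ID is always the first argument.

```python
from functools import wraps

# Hypothetical in-memory consent store: user_id -> set of consented purposes
CONSENTS = {"user_123": {"analytics"}, "user_456": {"analytics", "ai_training"}}

def check_consent(user_id: str, purpose: str) -> bool:
    """Return True only if the user has an entry for this purpose."""
    return purpose in CONSENTS.get(user_id, set())

def requires_consent(purpose: str):
    """Decorator: run the function only if the user consented to `purpose`."""
    def decorator(func):
        @wraps(func)
        def wrapper(user_id, *args, **kwargs):
            if not check_consent(user_id, purpose):
                print(f"Skipped {user_id}: no consent for {purpose}")
                return None
            return func(user_id, *args, **kwargs)
        return wrapper
    return decorator

@requires_consent("ai_training")
def add_to_training_set(user_id: str, text: str) -> str:
    return f"queued {len(text)} chars from {user_id}"

print(add_to_training_set("user_456", "hello"))  # queued 5 chars from user_456
print(add_to_training_set("user_123", "hello"))  # prints the skip message, then None
```

Unknown users fall through to the same denial path, so the default is to skip processing.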

In [None]:
from datetime import datetime
from typing import Optional, Dict, List

class ConsentManager:
    """
    Manages user consent for data processing to support GDPR/CCPA compliance.
    Tracks consent history, purposes, and provides audit capabilities.
    """
    
    def __init__(self):
        # In production, this would be a database
        self.consent_registry: Dict[str, Dict] = {}
        self.audit_log: List[Dict] = []
    
    def grant_consent(self, user_id: str, purposes: List[str], source: str = "web_form") -> bool:
        """Record user consent for specific purposes."""
        timestamp = datetime.now().isoformat()
        
        if user_id not in self.consent_registry:
            self.consent_registry[user_id] = {"purposes": {}, "created_at": timestamp}
        
        for purpose in purposes:
            self.consent_registry[user_id]["purposes"][purpose] = {
                "granted": True,
                "granted_at": timestamp,
                "source": source,
                "revoked_at": None
            }
        
        self._log_event(user_id, "CONSENT_GRANTED", purposes)
        return True
    
    def revoke_consent(self, user_id: str, purposes: List[str]) -> bool:
        """Revoke consent for specific purposes (Right to Withdraw)."""
        if user_id not in self.consent_registry:
            return False
        
        timestamp = datetime.now().isoformat()
        for purpose in purposes:
            if purpose in self.consent_registry[user_id]["purposes"]:
                self.consent_registry[user_id]["purposes"][purpose]["granted"] = False
                self.consent_registry[user_id]["purposes"][purpose]["revoked_at"] = timestamp
        
        self._log_event(user_id, "CONSENT_REVOKED", purposes)
        return True
    
    def check_consent(self, user_id: str, purpose: str) -> bool:
        """Check if user has active consent for a specific purpose."""
        if user_id not in self.consent_registry:
            return False
        
        purpose_data = self.consent_registry[user_id]["purposes"].get(purpose)
        if not purpose_data:
            return False
        
        return purpose_data["granted"]
    
    def get_user_consents(self, user_id: str) -> Optional[Dict]:
        """Get all consent records for a user (for transparency)."""
        return self.consent_registry.get(user_id)
    
    def _log_event(self, user_id: str, event_type: str, purposes: List[str]):
        """Maintain audit trail for compliance."""
        self.audit_log.append({
            "timestamp": datetime.now().isoformat(),
            "user_id": user_id,
            "event_type": event_type,
            "purposes": purposes
        })


# Demo: Consent Management in Action
consent_manager = ConsentManager()

print("CONSENT MANAGEMENT DEMO")
print("=" * 60)

# Step 1: User grants consent during signup
print("\n[1] User signs up and grants consent:")
consent_manager.grant_consent(
    user_id="user_123",
    purposes=["data_collection", "ai_training", "analytics"],
    source="signup_form"
)
print("    ✅ Consent granted for: data_collection, ai_training, analytics")

# Step 2: Check consent before processing
print("\n[2] Checking consent before AI training:")
if consent_manager.check_consent("user_123", "ai_training"):
    print("    ✅ User has consented to AI training. Proceeding...")
else:
    print("    ❌ No consent. Skipping user data.")

# Step 3: User revokes some consent
print("\n[3] User revokes AI training consent:")
consent_manager.revoke_consent("user_123", ["ai_training"])
print("    🔄 Consent revoked for: ai_training")

# Step 4: Re-check after revocation
print("\n[4] Re-checking consent after revocation:")
if consent_manager.check_consent("user_123", "ai_training"):
    print("    ✅ User has consented to AI training. Proceeding...")
else:
    print("    ❌ No consent for AI training. Skipping user data.")

# Step 5: View user's consent record (transparency)
print("\n[5] User's consent record (Right to Access):")
record = consent_manager.get_user_consents("user_123")
for purpose, data in record["purposes"].items():
    status = "✅ Active" if data["granted"] else "❌ Revoked"
    print(f"    {purpose:20} | {status} | Source: {data['source']}")

# Step 6: Audit log (for compliance)
print("\n[6] Audit Log (for regulatory compliance):")
for event in consent_manager.audit_log:
    print(f"    {event['timestamp'][:19]} | {event['event_type']:20} | {event['purposes']}")

---

## Summary & Best Practices

### Key Takeaways

| Technique | Tool | Use Case |
|-----------|------|----------|
| **PII Detection** | Presidio | Scan inputs/outputs for sensitive data |
| **Anonymization** | Presidio | Replace PII with placeholders/masks |
| **Synthetic Data** | Faker | Generate realistic test data safely |
| **Output Filtering** | Custom/Presidio | Block sensitive LLM responses |
| **Consent Management** | Custom/Tools | Track user permissions |

### Best Practices Checklist

- [ ] **Scan all inputs** before sending to LLMs
- [ ] **Filter all outputs** before showing to users
- [ ] **Use synthetic data** for development/testing
- [ ] **Implement consent checks** before data processing
- [ ] **Maintain audit logs** for compliance
- [ ] **Minimize data collection**: only collect what you need
- [ ] **Encrypt PII at rest and in transit**
- [ ] **Run regular privacy audits** of your AI systems
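
The first two checklist items can be combined into one wrapper around the model call. The sketch below is a minimal version using two regex patterns. `scrub`, `guarded_call`, and `echo_llm` are hypothetical names, and `echo_llm` stands in for a real model client.

```python
import re
from typing import Callable

EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def scrub(text: str) -> str:
    """Replace emails and SSNs with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return SSN_RE.sub("<SSN>", text)

def guarded_call(prompt: str, llm: Callable[[str], str]) -> str:
    """Scrub the prompt, call the model, then scrub the response."""
    return scrub(llm(scrub(prompt)))

def echo_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real model call
    return f"You said: {prompt} Contact admin@corp.example for help."

reply = guarded_call("My SSN is 123-45-6789.", echo_llm)
print(reply)  # You said: My SSN is <SSN>. Contact <EMAIL> for help.
```

In a real system the `scrub` step would more likely call Presidio than two hand-written patterns. The wrapper shape stays the same.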

### Regulatory Reference

| Regulation | Region | Key Requirements |
|------------|--------|------------------|
| **GDPR** | EU | Consent, Right to Erasure, Data Portability |
| **CCPA** | California | Right to Know, Delete, Opt-Out of Sale |
| **HIPAA** | US Healthcare | PHI Protection, Minimum Necessary |
| **PDPA** | Singapore | Consent, Purpose Limitation |

### Further Reading

- [Microsoft Presidio Documentation](https://microsoft.github.io/presidio/)
- [Faker Documentation](https://faker.readthedocs.io/)
- [GDPR Official Text](https://gdpr-info.eu/)
- [NIST Privacy Framework](https://www.nist.gov/privacy-framework)