### Introduction

In this notebook, we explore how to manage **Personally Identifiable Information (PII)** in text data using two different approaches:

1. **Presidio-based PII Redaction**: This part demonstrates how to use the **Presidio** library, an open-source tool for detecting and anonymizing PII, such as names, email addresses, and phone numbers. The notebook explains how to set up Presidio's Analyzer and Anonymizer, providing a solution for identifying and redacting sensitive information from text.

2. **Hugging Face's Named Entity Recognition (NER) with Custom Scrubbers**: In this section, we use Hugging Face's **NER model** (BERT-based) for detecting PII in text, and custom **scrubber functions** to redact various types of PII, such as phone numbers, credit card details, and email addresses. This part also integrates a **summarization** function that generates a summary of the scrubbed text, demonstrating how to preprocess and protect sensitive data while maintaining the utility of the text.

Together, these approaches provide two ways to handle PII in text data, ensuring privacy and data security while still allowing for text processing, summarization, and analysis.


In [None]:
# !pip uninstall presidio-analyzer presidio-anonymizer
# !pip install presidio-analyzer presidio-anonymizer

### Title: PII Detection and Scrubbing with Hugging Face's NER and Custom Scrubbers

This code demonstrates how to detect and scrub **Personally Identifiable Information (PII)** from text using **Hugging Face's NER** (Named Entity Recognition) models and custom scrubbing functions. Here's a breakdown of the steps:

1. **Import Necessary Libraries**
   - **Transformers**: For using Hugging Face's pre-trained NER model.
   - **Datasets**: For loading the PII Masking dataset.
   - **Torch**: Required for model execution.

2. **Load Hugging Face's NER Pipeline**
   - Initializes a **Hugging Face NER pipeline** with a pre-trained **BERT model** fine-tuned on the CoNLL-03 dataset for NER tasks.

3. **Define PII Scrubbers**
   - Defines functions to scrub various types of PII:
     - **Phone Numbers**: Scrubs phone numbers from the text.
     - **Credit Card Numbers**: Scrubs credit card numbers.
     - **Email Addresses**: Scrubs email addresses.
     - **Postal Codes**: Scrubs postal codes.
     - **SIN Numbers**: Scrubs Social Insurance Numbers (SIN).
   - Each function uses **regular expressions (regex)** to detect and replace PII with placeholder text like `[REDACTED PHONE NUMBER]`.

4. **Apply Scrubbers to Input Text**
   - Combines all scrubbers into a list and applies them to the input text via the `apply_scrubbers` function.

5. **PII Detection with Hugging Face**
   - The `detect_pii_with_huggingface` function uses the Hugging Face NER pipeline to detect PII in the text, specifically filtering entities such as **persons (PER)**, **locations (LOC)**, **organizations (ORG)**, and **miscellaneous entities (MISC)**.

6. **Hugging Face Request with Scrubbed Text**
   - The `send_huggingface_request` function first detects and scrubs PII from the input text, then sends a request to a Hugging Face model for text summarization (using **BART** in this case).
   - It returns the **summary** of the scrubbed text and the **detected PII**.

7. **Example Usage**
   - Example input text: `"Michael Smith (msmith@gmail.com, (+1) 111-111-1111) committed a mistake when he used PyTorch Trainer instead of HF Trainer."`
   - The code summarizes the input text after scrubbing the PII and detects PII entities like **name**, **email**, and **phone number**.

8. **Testing PII Scrubbing on Real Data (Optional)**
   - Loads the **AI4Privacy PII Masking dataset** containing text with PII.
   - Applies the scrubbing mechanism to an example text from the dataset and prints the detected PII and the summary of the scrubbed text.

### Key Steps:
- **PII Detection**: Detect PII using **Hugging Face's NER model**.
- **PII Scrubbing**: Scrub detected PII using custom scrubbers with **regex**.
- **Text Summarization**: Generate a summary of the scrubbed text using Hugging Face's **BART** model.
- **PII Masking on Real Data**: Optionally test PII scrubbing on a dataset with real-world examples.

This workflow helps with **privacy protection** by detecting and scrubbing PII from texts before processing, generating insights without compromising personal information.


In [6]:
# Import necessary libraries
import os
import re
from transformers import pipeline
from datasets import load_dataset
import torch

# Set your environment variables (API keys, if needed, for cloud models)
# For Hugging Face models, you generally don’t need an API key unless using specific models on Hugging Face Hub

# Step 1: Load Hugging Face's NER pipeline
# We will use a pretrained BERT model for Named Entity Recognition (NER)
ner_pipeline = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")

# Step 2: Define scrubbers for different types of PII
def scrub_phone_numbers(text: str) -> str:
    return re.sub(r"\(?\+?[0-9]*\)?[-.\s]?[0-9]+[-.\s]?[0-9]+[-.\s]?[0-9]+", "[REDACTED PHONE NUMBER]", text)

def scrub_credit_card_numbers(text: str) -> str:
    return re.sub(r"\b(?:\d[ -]*?){13,16}\b", "[REDACTED CREDIT CARD]", text)

def scrub_email_addresses(text: str) -> str:
    return re.sub(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", "[REDACTED EMAIL ADDRESS]", text)

def scrub_postal_codes(text: str) -> str:
    return re.sub(r"\b\d{5}(-\d{4})?\b", "[REDACTED POSTAL CODE]", text)

def scrub_sin_numbers(text: str) -> str:
    return re.sub(r"\b\d{3}-\d{3}-\d{3}\b", "[REDACTED SIN]", text)

# Combine all scrubbers into a list
ALL_SCRUBBERS = [
    scrub_phone_numbers,
    scrub_credit_card_numbers,
    scrub_email_addresses,
    scrub_postal_codes,
    scrub_sin_numbers,
]

# Step 3: Function to apply all scrubbers to input text
def apply_scrubbers(text: str) -> str:
    """
    Apply all predefined scrubbers to the input text.
    """
    for scrubber in ALL_SCRUBBERS:
        text = scrubber(text)
    return text

# Step 4: Function to detect PII using Hugging Face's NER pipeline
def detect_pii_with_huggingface(text: str):
    """
    This function uses Hugging Face's NER model to detect PII in the text.
    """
    # Detect named entities using Hugging Face's NER pipeline
    ner_results = ner_pipeline(text)
    
    # Filter out non-PII entities (e.g., labels other than 'PER' for persons, 'ORG' for organizations)
    pii_entities = [entity for entity in ner_results if entity['entity'] in ['PER', 'LOC', 'ORG', 'MISC']]
    
    return pii_entities

# Step 5: Set up a function to send a request to a Hugging Face model (e.g., summarization or generation)
def send_huggingface_request(text: str, model="facebook/bart-large-cnn", max_tokens=50):
    """
    Sends a request to a Hugging Face model after scrubbing PII.
    """
    # Step 5.1: Detect and scrub PII using Hugging Face's NER model
    detected_pii = detect_pii_with_huggingface(text)
    scrubbed_text = apply_scrubbers(text)
    
    # Step 5.2: Load the Hugging Face model (for text summarization or generation)
    summarizer = pipeline("summarization", model=model)
    
    # Step 5.3: Generate a summary of the scrubbed text
    summary = summarizer(scrubbed_text, max_length=max_tokens, min_length=25, do_sample=False)
    
    return summary, detected_pii

# Step 6: Example usage of the function with a sample input
example_text = "Michael Smith (msmith@gmail.com, (+1) 111-111-1111) committed a mistake when he used PyTorch Trainer instead of HF Trainer."

# Send the request and print the response
summary, detected_pii = send_huggingface_request(
    text=f"{example_text}\n\nSummarize the above text in 1-2 sentences."
)
print("Summary:", summary)
print("Detected PII:", detected_pii)

# Step 7: Testing PII Scrubbing on Real Data (Optional)
# You can load a dataset containing PII and apply the scrubbing mechanism to it

# Load the AI4Privacy PII Masking dataset
pii_ds = load_dataset("ai4privacy/pii-masking-200k")

# Example input from the dataset
example_text = pii_ds["train"][36]["source_text"]

# Send the request with the scrubbed dataset example
summary, detected_pii = send_huggingface_request(
    text=f"{example_text}\n\nSummarize the above text in 1-2 sentences."
)
print("Summary:", summary)
print("Detected PII:", detected_pii)


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Device set to use cuda:0
Your max_length is set to 50, but your input_length is only 49. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=24)


Summary: [{'summary_text': 'Michael Smith committed a mistake when he used PyTorch Trainer instead of HF Trainer. He used the wrong name for the training program.'}]
Detected PII: []


Device set to use cuda:0


Summary: [{'summary_text': "I need the latest update on assessment results. Please send the files to [REDACTED EMAIL ADDRESS]. For your extra time, we'll offer you Kip[REDACTED PHONE NUMBER]. But please provide your л"}]
Detected PII: []


### Title: PII Detection and Redaction using Presidio and Hugging Face NER Models

This code demonstrates how to use **Presidio** and **Hugging Face's NER models** for **PII detection and redaction**. Below is a breakdown of the steps performed:

1. **Import Necessary Libraries**
   - **Presidio**: For detecting and redacting PII (using `AnalyzerEngine` and `AnonymizerEngine`).
   - **Hugging Face**: For Named Entity Recognition (NER) using a pre-trained BERT model.

2. **Setup Presidio Analyzer**
   - Initializes **Presidio's AnalyzerEngine**, which is responsible for detecting various PII categories (e.g., names, emails, phone numbers).
   - Prints the available PII recognizers in Presidio.

3. **Setup Hugging Face NER Pipeline**
   - Loads a **Hugging Face NER pipeline** using a pre-trained **BERT model** fine-tuned for NER tasks. This model identifies entities like **persons**, **locations**, **organizations**, and **miscellaneous entities** in text.

4. **PII Detection and Redaction with Presidio**
   - The function `detect_and_redact_pii_with_presidio(text)` uses **Presidio** to detect and redact PII. 
   - It analyzes the text with the `AnalyzerEngine`, and then it anonymizes the detected PII using the `AnonymizerEngine`.
   - Returns the **anonymized text** and the **detected PII**.

5. **PII Detection with Hugging Face**
   - The function `detect_pii_with_huggingface(text)` uses **Hugging Face's NER pipeline** to detect PII entities in the text.
   - It filters and returns entities like **persons (PER)**, **locations (LOC)**, **organizations (ORG)**, and **miscellaneous (MISC)**.

6. **Comparison of Results**
   - The code compares the results from **Presidio** and **Hugging Face** for detecting PII in the same input text and prints out both results.

7. **Customizing Presidio for Specific PII Types**
   - Shows how **Presidio** can be customized by adding or modifying PII recognizers to detect specific keywords or types of information.
   - Demonstrates **custom redaction** using **custom recognizers** to anonymize custom data (e.g., redacting certain keywords).

### Example Input:
The text used for testing includes **John Doe's email** and **phone number**: 

```text
"John Doe's email is john.doe@example.com and his phone number is +1234567890."


In [4]:
# 1. Import necessary libraries

# Presidio is for PII detection and redaction
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine, OperatorConfig
from presidio_analyzer import RecognizerResult

# Hugging Face's transformers for NER (Named Entity Recognition)
from transformers import pipeline

# 2. Setup Presidio Analyzer

# Initialize Presidio's AnalyzerEngine, which will detect various PII categories
analyzer = AnalyzerEngine()

# List available PII recognizers in Presidio
available_recognizers = analyzer.get_recognizers()
print("Available recognizers in Presidio:", [r.name for r in available_recognizers])

# 3. Setup Hugging Face for NER

# Using Hugging Face pipeline for Named Entity Recognition (NER)
ner_pipeline = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")

# 4. Function to detect and redact PII using Presidio

def detect_and_redact_pii_with_presidio(text: str):
    """
    This function uses Presidio to detect and redact PII from the text.
    """
    # Analyze the text to detect PII
    results = analyzer.analyze(text=text, language='en')
    
    # Create an anonymizer engine
    anonymizer = AnonymizerEngine()
    
    # Redact detected PII
    anonymized_text = anonymizer.anonymize(text, results)
    
    # Return the anonymized text and detected entities
    return anonymized_text, results

# Example input text with PII
input_text = "John Doe's email is john.doe@example.com and his phone number is +1234567890."

# Detect and redact PII using Presidio
anonymized_text_presidio, detected_pii_presidio = detect_and_redact_pii_with_presidio(input_text)
print("Anonymized text using Presidio:", anonymized_text_presidio)

# 5. Function to detect PII using Hugging Face's NER pipeline

from transformers import pipeline

# Using Hugging Face's NER pipeline
ner_pipeline = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")

# Function to detect PII using Hugging Face's NER pipeline
def detect_pii_with_huggingface(text: str):
    """
    This function uses Hugging Face's NER model to detect PII in the text.
    """
    # Detect named entities using Hugging Face's NER pipeline
    ner_results = ner_pipeline(text)
    
    # Filter out non-PII entities (e.g., labels other than 'PER' for persons, 'ORG' for organizations)
    pii_entities = [entity for entity in ner_results if entity['entity'] in ['PER', 'LOC', 'ORG', 'MISC']]
    
    return pii_entities

# Example input text with PII
input_text = "John Doe's email is john.doe@example.com and his phone number is +1234567890."

# Detect PII using Hugging Face
detected_pii_huggingface = detect_pii_with_huggingface(input_text)
print("Detected PII using Hugging Face:", detected_pii_huggingface)


# Detect PII using Hugging Face
detected_pii_huggingface = detect_pii_with_huggingface(input_text)
print("Detected PII using Hugging Face:", detected_pii_huggingface)

# 6. Compare the results from both methods

print("Comparing Presidio and Hugging Face NER Results:")
print(f"Presidio detected PII: {detected_pii_presidio}")
print(f"Hugging Face detected PII: {detected_pii_huggingface}")

# 7. Customizing Presidio for specific PII types

# Example: Adding custom recognizers or modifying existing ones
custom_recognizers = analyzer.get_recognizers()
print("Customizable Recognizers:", custom_recognizers)

# Presidio allows customization of the recognizers, for example, to redact specific keywords
def custom_pii_redaction(text: str):
    """
    Custom PII redaction using custom recognizers in Presidio
    """
    custom_results = analyzer.analyze(text=text, language='en')
    custom_anonymizer = AnonymizerEngine()
    custom_anonymized_text = custom_anonymizer.anonymize(text, custom_results)
    return custom_anonymized_text

# Example input with custom text
custom_input_text = "Jane Doe was seen at 1234 Elm Street and her email is jane.doe@customdomain.com."
custom_anonymized_text = custom_pii_redaction(custom_input_text)
print("Custom Anonymized text using Presidio:", custom_anonymized_text)


[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_lg')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Available recognizers in Presidio: ['InVoterRecognizer', 'SpacyRecognizer', 'UsLicenseRecognizer', 'UsBankRecognizer', 'InPassportRecognizer', 'UsSsnRecognizer', 'DateRecognizer', 'CryptoRecognizer', 'AuTfnRecognizer', 'UsItinRecognizer', 'UsPassportRecognizer', 'NhsRecognizer', 'UkNinoRecognizer', 'AuMedicareRecognizer', 'InVehicleRegistrationRecognizer', 'AuAbnRecognizer', 'AuAcnRecognizer', 'InPanRecognizer', 'InAadhaarRecognizer', 'PhoneRecognizer', 'IbanRecognizer', 'EmailRecognizer', 'SgFinRecognizer', 'UrlRecognizer', 'CreditCardRecognizer', 'IpRecognizer', 'MedicalLicenseRecognizer']



Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


Anonymized text using Presidio: text: <PERSON> email is <EMAIL_ADDRESS> and his phone number is +<US_BANK_NUMBER>.
items:
[
    {'start': 59, 'end': 75, 'entity_type': 'US_BANK_NUMBER', 'text': '<US_BANK_NUMBER>', 'operator': 'replace'},
    {'start': 18, 'end': 33, 'entity_type': 'EMAIL_ADDRESS', 'text': '<EMAIL_ADDRESS>', 'operator': 'replace'},
    {'start': 0, 'end': 8, 'entity_type': 'PERSON', 'text': '<PERSON>', 'operator': 'replace'}
]



Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


Detected PII using Hugging Face: []
Detected PII using Hugging Face: []
Comparing Presidio and Hugging Face NER Results:
Presidio detected PII: [type: EMAIL_ADDRESS, start: 20, end: 40, score: 1.0, type: PERSON, start: 0, end: 10, score: 0.85, type: URL, start: 20, end: 27, score: 0.5, type: URL, start: 29, end: 40, score: 0.5, type: US_BANK_NUMBER, start: 66, end: 76, score: 0.05, type: US_DRIVER_LICENSE, start: 66, end: 76, score: 0.01]
Hugging Face detected PII: []
Customizable Recognizers: [<presidio_analyzer.predefined_recognizers.in_voter_recognizer.InVoterRecognizer object at 0x00000193D3898820>, <presidio_analyzer.predefined_recognizers.spacy_recognizer.SpacyRecognizer object at 0x00000193D3898040>, <presidio_analyzer.predefined_recognizers.us_driver_license_recognizer.UsLicenseRecognizer object at 0x00000193D38980A0>, <presidio_analyzer.predefined_recognizers.us_bank_recognizer.UsBankRecognizer object at 0x00000193D38980D0>, <presidio_analyzer.predefined_recognizers.in_passpor

### Conclusion

In this notebook, we demonstrated two techniques for identifying and redacting **PII** from text:

- **Presidio** provides a comprehensive, built-in solution for detecting and anonymizing various types of PII.
- **Hugging Face's NER** model, combined with custom scrubbers, enables us to detect and redact sensitive information, followed by further text processing such as summarization.

These methods can be adapted to suit different requirements for **privacy protection** in real-world applications. The workflow ensures that sensitive data is kept private and secure, while still allowing us to extract meaningful insights from the text. Whether you're working with personal data in a dataset or processing textual content with potential PII, these tools provide flexible and effective ways to manage privacy.