In [24]:
import warnings  # This module is used to control warning messages in Python
import logging   # This module is used to log messages for debugging and tracking issues

# Ignore all warnings raised from the 'presidio_analyzer' module
warnings.filterwarnings("ignore", module="presidio_analyzer")

# Set the logging level of 'presidio-analyzer' to ERROR, so only messages at ERROR level or above are logged
logging.getLogger("presidio-analyzer").setLevel(logging.ERROR)
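
As a quick check of what the ERROR threshold means in practice, the sketch below uses only the standard library. It reuses the logger name from the cell above and does not need Presidio to be installed.

```python
import logging

logger = logging.getLogger("presidio-analyzer")
logger.setLevel(logging.ERROR)

# Records below ERROR are dropped; ERROR and CRITICAL still pass through.
warning_enabled = logger.isEnabledFor(logging.WARNING)
error_enabled = logger.isEnabledFor(logging.ERROR)
critical_enabled = logger.isEnabledFor(logging.CRITICAL)

print(warning_enabled, error_enabled, critical_enabled)
```

With a default logging configuration this prints False True True, so warnings are silenced while genuine errors remain visible.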

In [25]:
# Install required dependencies
!pip install --upgrade --quiet langchain langchain-openai langchain-experimental presidio-analyzer presidio-anonymizer spacy Faker

  - !pip install: This command is used to install Python libraries.
  The ! is needed in Jupyter Notebook to run shell commands.

  - --upgrade: Ensures that the latest versions of the packages are installed.

  - --quiet: Suppresses unnecessary installation messages to keep the output clean.

Installed Libraries:

    - langchain – A framework to build applications using large language models (LLMs).

    - langchain-openai – Helps connect LangChain with OpenAI’s GPT models.

    - langchain-experimental – Includes experimental features and tools for LangChain.

    - presidio-analyzer – Detects sensitive information like names, addresses, or credit card numbers in text.

    - presidio-anonymizer – Hides or replaces sensitive data detected by presidio-analyzer.

    - spacy – A popular library for natural language processing (NLP), useful for working with text data.

    - Faker – Generates fake names, addresses, and other data for testing purposes.

In [26]:
# Importing the required module for anonymization
from langchain_experimental.data_anonymizer import PresidioAnonymizer

# Initialize the anonymizer
anonymizer = PresidioAnonymizer()

# Anonymize a sample text containing personal information
anonymizer.anonymize(
    "My name is Denny Gosh, call me at 324-455-5634 or email me at danny.sh@gmail.com"
)

'My name is Erica Butler, call me at +1-734-200-4375x199 or email me at summer59@example.com'


Importing the Anonymizer

    PresidioAnonymizer is a tool that helps in hiding sensitive personal information like names, phone numbers, and email addresses.
    
Initializing the Anonymizer

    We create an anonymizer object using PresidioAnonymizer(). This object will process and mask personal data from text.

Anonymizing a Sample Text

    The anonymizer.anonymize(...) function scans the given text and replaces any personal details (like name, phone number, and email) with realistic fake values generated by Faker, as the output above shows.

In [27]:
# Importing the required module for anonymization
from langchain_experimental.data_anonymizer import PresidioAnonymizer

# Initialize the anonymizer without specifying any particular data type
anonymizer = PresidioAnonymizer()

# Anonymizing all detected sensitive data types
anonymizer.anonymize(
    "My name is Denny Gosh, call me at 324-455-5634 and my repo number is 12434 and my email is real@Task.com"
)


'My name is Joseph Macias, call me at 682-504-0191x643 and my repo number is 1995-06-30 and my email is gparker@example.net'

Default Anonymization (No Specific Data Type Specified)

    Since we did not specify any analyzed_fields, the anonymizer detects and anonymizes every entity type it recognizes by default.
    This includes names, phone numbers, email addresses, and possibly other structured data like account numbers.
    Detection is not always what you expect: here the repo number 12434 was replaced with a date (1995-06-30), which suggests it was detected as a date rather than as an identifier.

What Happens to the Text?

    The anonymizer.anonymize(...) function scans the input text for personal or sensitive information and replaces each match with a fake value of the same kind.

In [28]:
# Importing the required module for anonymization
from langchain_experimental.data_anonymizer import PresidioAnonymizer

# Initialize the anonymizer to only anonymize PERSON names
anonymizer = PresidioAnonymizer(analyzed_fields=["PERSON"])

# Anonymizing only names (not phone numbers or email addresses)
anonymizer.anonymize(
    "My name is Denny Gosh, call me at 324-455-5634 or email me at danny.sh@gmail.com"
)

'My name is Lisa Thompson, call me at 324-455-5634 or email me at danny.sh@gmail.com'

Initializing with Specific Data Type (PERSON)

    Passing analyzed_fields=["PERSON"] to PresidioAnonymizer means that only names (recognized as PERSON entities) will be anonymized.

    Other sensitive information, like phone numbers and emails, will remain unchanged.

Processing the Text

    The function scans the given text and replaces only the detected person's name while leaving other details as they are.
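
To make the idea of limiting anonymization to selected fields concrete, here is a minimal sketch that does not use Presidio at all. The RECOGNIZERS table and the mask() helper are hypothetical stand-ins built from simple regular expressions. Names are left out because detecting them needs a language model rather than a regex.

```python
import re

# Hypothetical stand-ins for Presidio recognizers: one regex per entity type.
RECOGNIZERS = {
    "PHONE_NUMBER": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "EMAIL_ADDRESS": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def mask(text, analyzed_fields=None):
    """Replace matches with <ENTITY> tags, limited to analyzed_fields if given."""
    fields = analyzed_fields or list(RECOGNIZERS)
    for field in fields:
        text = RECOGNIZERS[field].sub(f"<{field}>", text)
    return text

sample = "My name is Denny Gosh, call me at 324-455-5634 or email me at danny.sh@gmail.com"

only_phone = mask(sample, analyzed_fields=["PHONE_NUMBER"])  # email stays as-is
everything = mask(sample)                                    # all known fields

print(only_phone)
print(everything)
```

The first call leaves the email address untouched, in the same way that analyzed_fields=["PERSON"] leaves the phone number and email untouched in the cell above.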

In [29]:
# Import Faker for generating fake data
from faker import Faker

# Initialize Faker with US locale (for generating fake data in English)
fake = Faker(locale="en_US")

# Function to generate a fake number, accepting a parameter (ignored)
def fake_number(_: str = None) -> str:
    return str(fake.random.randint(1000, 5000))  # Generates a random number between 1000 and 5000

# Import necessary modules for custom anonymization
from presidio_anonymizer.entities import OperatorConfig
from presidio_analyzer import PatternRecognizer, Pattern
from langchain_experimental.data_anonymizer import PresidioAnonymizer

# Initialize the Presidio anonymizer
anonymizer = PresidioAnonymizer()

# Define a custom pattern for recognizing repository numbers
repo_number_pattern = Pattern(
    name="repo_number_pattern",
    regex=r"(?<=\D)\d{5}(?=\D)",  # Regular expression to match a 5-digit repo number
    score=1  # Confidence score (1 means highly confident)
)

# Create a recognizer for repository numbers using the defined pattern
repo_recognizer = PatternRecognizer(
    supported_entity="REPO_NUMBER",  # Label it as "REPO_NUMBER"
    patterns=[repo_number_pattern]   # Use the pattern we defined
)

# Add the custom recognizer to the anonymizer
anonymizer.add_recognizer(repo_recognizer)

# Define a custom anonymization operator using the fake number generator
new_operator = {
    "REPO_NUMBER": OperatorConfig(
        "custom", {"lambda": fake_number}  # Ensuring it generates a fake number
    )
}

# Add the custom operator to the anonymizer
anonymizer.add_operators(new_operator)

# Anonymize a sample text containing a name, phone number, repo number, and email
anonymized_text = anonymizer.anonymize(
    "My name is Denny Gosh, call me at 324-455-5634 and my repo number is 12434 and my email is real@Task.com"
)

# Print the anonymized output
print(anonymized_text)


My name is Lisa Martin, call me at 001-507-766-3333x4423 and my repo number is 1015 and my email is qmitchell@example.com


Using Faker for Fake Data

    We use the Faker library to generate random fake data, like fake numbers.
    The function fake_number() generates a random number between 1000 and 5000.

Defining a Pattern for "Repository Numbers"

    We create a custom pattern using regular expressions (regex) to detect 5-digit repository numbers (e.g., 12434).
    We use this pattern inside PatternRecognizer, which tells the system to treat such numbers as "REPO_NUMBER".
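
The behaviour of this regex can be checked on its own with Python's re module. The sketch below reuses the exact pattern from the cell above. REPO_NUMBER_ALT is a hypothetical alternative, not part of the original notebook, shown only to illustrate how string boundaries are handled.

```python
import re

# Same regex as the Pattern defined above.
REPO_NUMBER = re.compile(r"(?<=\D)\d{5}(?=\D)")

# Matches a 5-digit run surrounded by non-digits; the phone number has no such run.
in_sentence = REPO_NUMBER.findall("call me at 324-455-5634 and my repo number is 12434 and")

# The lookarounds require an actual non-digit character on both sides,
# so a number at the very end of the string is missed.
at_end = REPO_NUMBER.findall("my repo number is 12434")

# Longer digit runs are not matched in part.
six_digits = REPO_NUMBER.findall("ticket 123456 is closed")

# Hypothetical alternative: negative lookarounds also accept string boundaries.
REPO_NUMBER_ALT = re.compile(r"(?<!\d)\d{5}(?!\d)")
at_end_alt = REPO_NUMBER_ALT.findall("my repo number is 12434")
six_digits_alt = REPO_NUMBER_ALT.findall("ticket 123456 is closed")

print(in_sentence, at_end, six_digits, at_end_alt, six_digits_alt)
```

If repo numbers can appear at the start or end of a message, the negative-lookaround form may be the safer choice.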

Adding a Custom Recognizer

    Normally, Presidio detects things like names and emails. But here, we add our own rule to detect repo numbers.

Anonymizing Repository Numbers with Fake Values

    Instead of just hiding the repo number, we replace it with a random fake number using fake_number().
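    This hides the original value while keeping a number in its place. Note that each call draws a new random number, so the same repo number is not guaranteed to map to the same fake number.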
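
The mechanics of a custom replacement function can be sketched without Presidio: a callable receives the detected text and returns whatever should take its place. In the sketch below, replace_repo_numbers() is a hypothetical helper that plays the role of the anonymizer, and the seeded random generator is an assumption made only so the sketch is repeatable.

```python
import random
import re

rng = random.Random(0)  # seeded only so the sketch is repeatable

def fake_number(_original: str) -> str:
    """Ignore the detected text and return a random number as a string."""
    return str(rng.randint(1000, 5000))

def replace_repo_numbers(text: str) -> str:
    # re.sub passes a match object; we hand its text to the operator,
    # mirroring how a custom operator is given the detected value.
    return re.sub(r"(?<=\D)\d{5}(?=\D)", lambda m: fake_number(m.group()), text)

result = replace_repo_numbers("repo 12434 and repo 12434 again")
print(result)
```

Because fake_number() ignores its argument, both occurrences of 12434 are replaced independently of each other.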

Anonymization in Action

    The anonymizer.anonymize(...) function scans the input text and replaces:
    Names → Fake name (e.g., Lisa Martin)
    Phone Numbers → Fake phone number (e.g., 001-507-766-3333x4423)
    Emails → Fake email address (e.g., qmitchell@example.com)
    Repo Numbers → Random fake number (e.g., 1015)

In [30]:
# Import the reversible anonymizer from LangChain Experimental
from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer

# Initialize the reversible anonymizer
anonymizer = PresidioReversibleAnonymizer(
    add_default_faker_operators=False  # Use <ENTITY_TYPE> placeholders instead of Faker-generated values
)

# Anonymize a sample text containing PII (name, phone number, repo number, and email)
anonymized_text = anonymizer.anonymize(
    "My name is Denny Gosh, call me at 324-455-5634 and my repo number is 12434 and my email is real@Task.com"
)

# Print the anonymized output
print(anonymized_text)

My name is <PERSON>, call me at <PHONE_NUMBER> and my repo number is <DATE_TIME> and my email is <EMAIL_ADDRESS>


Reversible Anonymization

    Unlike normal anonymization (where data is replaced permanently), reversible anonymization allows you to de-anonymize (restore original values if needed).
    This is useful when you need to process anonymized data but later retrieve the original values.

PresidioReversibleAnonymizer

    This tool replaces personal information (PII) with unique placeholders while allowing you to restore the data later.

What add_default_faker_operators=False Does

    Normally, Faker generates random replacements (e.g., fake names, numbers).
    Setting this to False replaces each detected entity with a placeholder named after its type (e.g., <PERSON>) instead of a random value.
    De-anonymization is possible because the reversible anonymizer records which replacement stands for which original value.

Processing the Text

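    The anonymizer.anonymize(...) function detects sensitive details and replaces them with placeholders.
    This is a new anonymizer instance without the custom REPO_NUMBER recognizer, so the repo number 12434 is tagged as <DATE_TIME> in the output above.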

In [31]:
anonymizer.deanonymizer_mapping

{'PERSON': {'<PERSON>': 'Denny Gosh'},
 'PHONE_NUMBER': {'<PHONE_NUMBER>': '324-455-5634'},
 'DATE_TIME': {'<DATE_TIME>': '12434'},
 'EMAIL_ADDRESS': {'<EMAIL_ADDRESS>': 'real@Task.com'}}
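
This mapping is all that is needed to restore the original text. The referenced guide does this with the anonymizer's own deanonymize method. The restore() helper below is a hypothetical stand-in that applies the same idea with plain string replacement, using the mapping and anonymized text copied from the outputs above.

```python
# Mapping copied from the output above.
deanonymizer_mapping = {
    "PERSON": {"<PERSON>": "Denny Gosh"},
    "PHONE_NUMBER": {"<PHONE_NUMBER>": "324-455-5634"},
    "DATE_TIME": {"<DATE_TIME>": "12434"},
    "EMAIL_ADDRESS": {"<EMAIL_ADDRESS>": "real@Task.com"},
}

def restore(text: str, mapping: dict) -> str:
    """Swap every placeholder back to the original value it stands for."""
    for replacements in mapping.values():
        for placeholder, original in replacements.items():
            text = text.replace(placeholder, original)
    return text

anonymized = (
    "My name is <PERSON>, call me at <PHONE_NUMBER> and my repo number is "
    "<DATE_TIME> and my email is <EMAIL_ADDRESS>"
)

restored = restore(anonymized, deanonymizer_mapping)
print(restored)
```

The restored string is identical to the original input sentence, which is what makes this form of anonymization reversible.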

In [32]:
anonymizer.supported_languages

['en']

# Supported Fields that can be masked.

'PERSON', 'EMAIL_ADDRESS', 'PHONE_NUMBER', 'IBAN_CODE', 'CREDIT_CARD', 'CRYPTO', 'IP_ADDRESS', 'LOCATION', 'DATE_TIME', 'NRP', 'MEDICAL_LICENSE', 'URL', 'US_BANK_NUMBER', 'US_DRIVER_LICENSE', 'US_ITIN', 'US_PASSPORT', 'US_SSN'

Reference: Data anonymization with Microsoft Presidio: http://python.langchain.com/v0.1/docs/guides/productionization/safety/presidio_data_anonymization/