In [1]:
!pip install presidio-analyzer
!pip install presidio-anonymizer
!pip install python-dotenv
!pip install langchain
!pip install langchain_experimental
!pip install Faker
!pip install pip-system-certs

Collecting python-dotenv
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1
Collecting pip-system-certs
  Downloading pip_system_certs-4.0-py2.py3-none-any.whl.metadata (1.6 kB)
Downloading pip_system_certs-4.0-py2.py3-none-any.whl (6.1 kB)
Installing collected packages: pip-system-certs
Successfully installed pip-system-certs-4.0


In [2]:
print("hello world")

hello world


In [1]:
document_content = """Date: October 19, 2021
Hello Customer Service,

Subject: Claim Regarding Multiple Lost Luggages

Hello Customer Service,

I am writing to report the loss of multiple pieces of luggage associated with my recent flights. Below, I'm providing you with the necessary details for each affected Passenger Name Record (PNR).

The flight information and corresponding PNRs are as follows:


1. PNR: LHKQK9 - E-ticket: 123-4567890123
2. PNR: RTKPP3 - E-ticket: 234-5678901234

My contact information remains the same: phone number 999-888-7777 and email johndoe@example.com.

Please treat this information with the utmost confidentiality and respect for my privacy. In case of any updates regarding my lost luggage, feel free to contact me via the provided phone number or email.

Your prompt assistance in resolving this matter is highly appreciated.

Thank you for your attention.

Sincerely,
John Doe

"""

In [2]:
from langchain.schema import Document

documents = [Document(page_content=document_content)]
print(document_content)

Date: October 19, 2021
Hello Customer Service,

Subject: Claim Regarding Multiple Lost Luggages

Hello Customer Service,

I am writing to report the loss of multiple pieces of luggage associated with my recent flights. Below, I'm providing you with the necessary details for each affected Passenger Name Record (PNR).

The flight information and corresponding PNRs are as follows:


1. PNR: LHKQK9 - E-ticket: 123-4567890123
2. PNR: RTKPP3 - E-ticket: 234-5678901234

My contact information remains the same: phone number 999-888-7777 and email johndoe@example.com.

Please treat this information with the utmost confidentiality and respect for my privacy. In case of any updates regarding my lost luggage, feel free to contact me via the provided phone number or email.

Your prompt assistance in resolving this matter is highly appreciated.

Thank you for your attention.

Sincerely,
John Doe




In [3]:
# Util function for coloring the PII markers
# NOTE: It will not be visible on documentation page, only in the notebook
import re

def print_colored_pii(string):
    colored_string = re.sub(
        r"(<[^>]*>)", lambda m: "\033[31m" + m.group(1) + "\033[0m", string
    )
    print(colored_string)

##### Let's try to anonymize the text with the default settings. For now we don't replace the data with synthetic values; we only
##### mark it with markers (e.g. `<PERSON>`), so we set `add_default_faker_operators=False`:

In [7]:
from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer

anonymizer = PresidioReversibleAnonymizer(
    add_default_faker_operators=False,
)

print_colored_pii(anonymizer.anonymize(document_content))

Date: <DATE_TIME>
Hello Customer Service,

Subject: Claim Regarding Multiple Lost Luggages

Hello Customer Service,

I am writing to report the loss of multiple pieces of luggage associated with my recent flights. Below, I'm providing you with the necessary details for each affected Passenger Name Record (PNR).

The flight information and corresponding PNRs are as follows:


1. PNR: LHKQK9 - E-ticket: 123-<US_BANK_NUMBER>
2. PNR: RTKPP3 - E-ticket: 234-<US_BANK_NUMBER_2>

My contact information remains the same: phone number <PHONE_NUMBER> and email <EMAIL_ADDRESS>.

Please treat this information with the utmost confidentiality and respect for my privacy. In case of any updates regarding my lost luggage, feel free to contact me via the provided phone number or email.

Your prompt assistance in resolving this matter is highly appreciated.

Thank you for your attention.

Sincerely,
<PERSON>




### Let's also look at the mapping between original and anonymized values:

In [8]:
import pprint

pprint.pprint(anonymizer.deanonymizer_mapping)

{'DATE_TIME': {'<DATE_TIME>': 'October 19, 2021'},
 'EMAIL_ADDRESS': {'<EMAIL_ADDRESS>': 'johndoe@example.com'},
 'PERSON': {'<PERSON>': 'John Doe'},
 'PHONE_NUMBER': {'<PHONE_NUMBER>': '999-888-7777'},
 'US_BANK_NUMBER': {'<US_BANK_NUMBER>': '4567890123',
                    '<US_BANK_NUMBER_2>': '5678901234'}}
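
The mapping above is what makes the anonymization reversible: each marker points back to the original value. `PresidioReversibleAnonymizer` exposes a `deanonymize` method for this; the sketch below only illustrates the idea with the standard library. The helper name `restore` and the shortened sample string are hypothetical, not part of the library.

```python
def restore(text, mapping):
    """Replace every marker in `text` with its original value (illustrative helper)."""
    # Flatten {entity_type: {marker: original}} into a single {marker: original} dict.
    flat = {
        marker: original
        for per_type in mapping.values()
        for marker, original in per_type.items()
    }
    # Replace longer markers first so one marker can never clobber part of another.
    for marker in sorted(flat, key=len, reverse=True):
        text = text.replace(marker, flat[marker])
    return text


# A subset of the mapping printed above.
mapping = {
    "PERSON": {"<PERSON>": "John Doe"},
    "PHONE_NUMBER": {"<PHONE_NUMBER>": "999-888-7777"},
    "US_BANK_NUMBER": {
        "<US_BANK_NUMBER>": "4567890123",
        "<US_BANK_NUMBER_2>": "5678901234",
    },
}

anonymized = "E-ticket: 123-<US_BANK_NUMBER>, phone <PHONE_NUMBER>, signed <PERSON>"
restored = restore(anonymized, mapping)
print(restored)
```

Plain string replacement is enough here because every marker is unique within the mapping.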


### In general, the anonymizer works pretty well, but I can observe two things to improve here:

1. PNR. The PNR has a unique pattern that is not covered by the anonymizer's default recognizers, so the value *LHKQK9* is not anonymized.
2. E-TICKET. The e-ticket number also has a unique pattern that is not covered by the default recognizers. The value *123-4567890123* is only partially detected, as `123-<US_BANK_NUMBER>`.


The solution is simple: we need to add new recognizers to the anonymizer. You can read more about it in the
[Presidio documentation](https://microsoft.github.io/presidio/analyzer/adding_recognizers/) and in the
[Presidio analyzer customization sample](https://microsoft.github.io/presidio/samples/python/customizing_presidio_analyzer/).

Let's add new recognizers:

In [9]:
from presidio_analyzer import Pattern, PatternRecognizer

# PNR: six alphanumeric characters delimited by word boundaries
pnr_pattern = Pattern(
    name="pnr_pattern",
    regex=r"\b[A-Z0-9]{6}\b",
    score=1,
)

# E-ticket: three digits, an optional hyphen, then ten digits
ticket_pattern = Pattern(
    name="e-ticket_pattern",
    regex=r"[0-9]{3}(-)?[0-9]{10}",
    score=1,
)

# Define each recognizer with one or more patterns
ticket_recognizer = PatternRecognizer(
    supported_entity="E-TICKET", patterns=[ticket_pattern]
)
pnr_recognizer = PatternRecognizer(
    supported_entity="PNR",
    patterns=[pnr_pattern],
    context=["PNR", "PNRs", "PNRcodes"],
)
anonymizer.add_recognizer(ticket_recognizer)
anonymizer.add_recognizer(pnr_recognizer)


Note that our anonymizer instance remembers previously detected and anonymized values, including those that were detected incorrectly (e.g., the e-ticket numbers tagged as `US_BANK_NUMBER`). So it's worth removing these values, or resetting the entire mapping, now that our recognizers have been updated:

In [10]:
anonymizer.reset_deanonymizer_mapping()

In [11]:
print_colored_pii(anonymizer.anonymize(document_content))

Date: <DATE_TIME>
Hello Customer Service,

Subject: Claim Regarding Multiple Lost Luggages

Hello Customer Service,

I am writing to <PNR> the loss of multiple <PNR_2> of luggage associated with my <PNR_3> <PNR_5>s. Below, I'm providing you with the necessary details for each affected Passenger Name <PNR_4> (PNR).

The <PNR_5> information and corresponding PNRs are as follows:


1. PNR: <PNR_6> - E-<PNR_7>: <E-TICKET>
2. PNR: <PNR_8> - E-<PNR_7>: <E-TICKET_2>

My contact information remains the same: phone <PNR_9> <PHONE_NUMBER> and email <EMAIL_ADDRESS>.

<PNR_10> treat this information with the <PNR_11> confidentiality and respect for my privacy. In case of any updates regarding my lost luggage, feel free to contact me via the provided phone <PNR_9> or email.

Your <PNR_12> assistance in resolving this [output truncated]

#### As you can see, the anonymizer now replaces the PNR and E-TICKET values with `<PNR>` and `<E-TICKET>` markers, and the deanonymizer mapping (printed below) has been updated accordingly. The PNR recognizer over-matches, though: ordinary six-letter words such as "report" and "flight" are also tagged as `<PNR>`, which suggests the pattern is applied case-insensitively, so it should be tightened before real use.
#### Once the PII values are detected, we can proceed to the next step: replacing the original values with synthetic ones. To do this, we set `add_default_faker_operators=True` (or just remove this parameter, because it is `True` by default):

In [12]:
pprint.pprint(anonymizer.deanonymizer_mapping)

{'DATE_TIME': {'<DATE_TIME>': 'October 19, 2021'},
 'E-TICKET': {'<E-TICKET>': '123-4567890123', '<E-TICKET_2>': '234-5678901234'},
 'EMAIL_ADDRESS': {'<EMAIL_ADDRESS>': 'johndoe@example.com'},
 'PERSON': {'<PERSON>': 'John Doe'},
 'PHONE_NUMBER': {'<PHONE_NUMBER>': '999-888-7777'},
 'PNR': {'<PNR>': 'report',
         '<PNR_10>': 'Please',
         '<PNR_11>': 'utmost',
         '<PNR_12>': 'prompt',
         '<PNR_13>': 'matter',
         '<PNR_14>': 'highly',
         '<PNR_2>': 'pieces',
         '<PNR_3>': 'recent',
         '<PNR_4>': 'Record',
         '<PNR_5>': 'flight',
         '<PNR_6>': 'LHKQK9',
         '<PNR_7>': 'ticket',
         '<PNR_8>': 'RTKPP3',
         '<PNR_9>': 'number'}}
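
The mapping above lists ordinary words ("report", "pieces", "flight", ...) under `PNR`. One plausible explanation, stated here as an assumption rather than something verified against the Presidio source, is that the recognizer applies its patterns with `re.IGNORECASE`, so `[A-Z0-9]{6}` also accepts lowercase six-letter words. The sketch below reproduces that effect with only the standard library `re` module; the sample sentence is made up for illustration.

```python
import re

PNR_REGEX = r"\b[A-Z0-9]{6}\b"  # same regex as pnr_pattern

# Hypothetical sample sentence, loosely based on the letter above.
sample = "I am writing to report a lost bag. PNR: LHKQK9 - E-ticket: 123-4567890123"

# Case-sensitive matching accepts only upper-case letters and digits.
strict_hits = re.findall(PNR_REGEX, sample)

# Case-insensitive matching also accepts any six-letter word.
loose_hits = re.findall(PNR_REGEX, sample, flags=re.IGNORECASE)

print(strict_hits)  # ['LHKQK9']
print(loose_hits)   # ['report', 'LHKQK9', 'ticket']
```

If this is indeed the cause, two options are worth checking against your installed Presidio version: passing case-sensitive regex flags to the recognizer (to my knowledge `PatternRecognizer` accepts a `global_regex_flags` argument), or making the pattern itself stricter, for example by requiring the "PNR:" prefix.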


In [13]:
anonymizer = PresidioReversibleAnonymizer(
    add_default_faker_operators=True,
    # Faker seed is used here to make sure the same fake data is generated for the test purposes
    # In production, it is recommended to remove the faker_seed parameter (it will default to None)
    faker_seed=42,
)

anonymizer.add_recognizer(ticket_recognizer)
anonymizer.add_recognizer(pnr_recognizer)

print_colored_pii(anonymizer.anonymize(document_content))

Date: 2023-09-08
Hello Customer Service,

Subject: Claim Regarding Multiple Lost Luggages

Hello Customer Service,

I am writing to <PNR> the loss of multiple <PNR_2> of luggage associated with my <PNR_3> <PNR_5>s. Below, I'm providing you with the necessary details for each affected Passenger Name <PNR_4> (PNR).

The <PNR_5> information and corresponding PNRs are as follows:


1. PNR: <PNR_6> - E-<PNR_7>: <E-TICKET>
2. PNR: <PNR_8> - E-<PNR_7>: <E-TICKET_2>

My contact information remains the same: phone <PNR_9> 223-951-1615x594 and email jesseguzman@example.net.

<PNR_10> treat this information with the <PNR_11> confidentiality and respect for my privacy. In case of any updates regarding my lost luggage, feel free to contact me via the provided phone <PNR_9> or email.

Your <PNR_12> assistance in resolving this <PNR_13> is [output truncated]

As you can see, almost all values have been replaced with synthetic ones. The only exceptions are the PNR and E-TICKET entities, which are not supported by the default faker operators. We can add new operators to the anonymizer that generate random data for them.

In [14]:
from faker import Faker

fake = Faker()


def fake_pnr(_=None):
    return fake.bothify(text="?#?###").upper()


fake_pnr()

'E5Y886'

In [15]:
def fake_e_ticket(_=None):
    # Note: this template yields 9 digits after the hyphen, while the sample e-tickets have 10
    return fake.bothify(text="###-#########").upper()
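
A synthetic value is most convincing when it has the same shape as the real one. The sketch below is a standard-library alternative (not the notebook's Faker-based operator) that generates an e-ticket with three digits, a hyphen, and ten digits, and checks it against the same regex the recognizer uses. The name `fake_e_ticket_10` and its `rng` parameter are hypothetical.

```python
import random
import re
import string

# Same regex as the E-TICKET recognizer pattern.
TICKET_REGEX = re.compile(r"[0-9]{3}(-)?[0-9]{10}")


def fake_e_ticket_10(_original=None, rng=random):
    """Return a synthetic e-ticket: 3 digits, a hyphen, then 10 digits."""
    prefix = "".join(rng.choice(string.digits) for i in range(3))
    serial = "".join(rng.choice(string.digits) for i in range(10))
    return f"{prefix}-{serial}"


ticket = fake_e_ticket_10()
print(ticket, bool(TICKET_REGEX.fullmatch(ticket)))

# A value with only 9 digits after the hyphen does not match the recognizer pattern.
print(bool(TICKET_REGEX.fullmatch("064-665484954")))
```

Like `fake_e_ticket`, the function takes one ignored positional argument, so it should be usable as the `lambda` of a `custom` operator in the same way.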


### Let's add newly created operators to the anonymizer:

In [21]:
from presidio_anonymizer.entities import OperatorConfig

new_operators = {
    "PNR": OperatorConfig("custom", {"lambda": fake_pnr}),
    "E-TICKET": OperatorConfig("custom", {"lambda": fake_e_ticket}),
}

anonymizer.add_operators(new_operators)

In [22]:
anonymizer.reset_deanonymizer_mapping()


In [26]:
# let's anonymise once again 
anonymizer.reset_deanonymizer_mapping()
print_colored_pii(anonymizer.anonymize(document_content))

Date: 1982-01-04
Hello Customer Service,

Subject: Claim Regarding Multiple Lost Luggages

Hello Customer Service,

I am writing to Z7S384 the loss of multiple Z3Z307 of luggage associated with my C6Q820 V9Z145s. Below, I'm providing you with the necessary details for each affected Passenger Name K0R406 (PNR).

The V9Z145 information and corresponding PNRs are as follows:


1. PNR: J6Q760 - E-T4Z918: 064-665484954
2. PNR: P4D120 - E-T4Z918: 650-391443869

My contact information remains the same: phone H0Z465 973.460.2606x4746 and email jamesherrera@example.org.

A8Z140 treat this information with the N3I129 confidentiality and respect for my privacy. In case of any updates regarding my lost luggage, feel free to contact me via the provided phone H0Z465 or email.

Your E8F859 assistance in resolving this Y3O142 is N4F894 appreciated.

Thank you for your attention.

Sincerely,
John Daniel




In [27]:
pprint.pprint(anonymizer.deanonymizer_mapping)

{'DATE_TIME': {'1982-01-04': 'October 19, 2021'},
 'E-TICKET': {'064-665484954': '123-4567890123',
              '650-391443869': '234-5678901234'},
 'EMAIL_ADDRESS': {'jamesherrera@example.org': 'johndoe@example.com'},
 'PERSON': {'John Daniel': 'John Doe'},
 'PHONE_NUMBER': {'973.460.2606x4746': '999-888-7777'},
 'PNR': {'A8Z140': 'Please',
         'C6Q820': 'recent',
         'E8F859': 'prompt',
         'H0Z465': 'number',
         'J6Q760': 'LHKQK9',
         'K0R406': 'Record',
         'N3I129': 'utmost',
         'N4F894': 'highly',
         'P4D120': 'RTKPP3',
         'T4Z918': 'ticket',
         'V9Z145': 'flight',
         'Y3O142': 'matter',
         'Z3Z307': 'pieces',
         'Z7S384': 'report'}}


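Voilà! All detected values are now replaced with synthetic ones, and the deanonymizer mapping has been updated accordingly. Note that the mapping also contains ordinary six-letter words such as "report" and "flight", so the PNR pattern still needs tightening before real use.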