## PII data pseudonymization demo

In this demo we call Presidio (through its Python interface) and then replace the detected entities with fake ones, using the same techniques as the `PresidioDataGenerator` object.

The `PresidioPseudonymization` class is a wrapper on top of `PresidioDataGenerator`. It accepts a Presidio analyzer response and creates fake sentences based on the original ones.
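
The underlying idea, swapping each detected span for a fake value of the same entity type, can be sketched with the standard library alone. This is a minimal illustration and not the library's implementation: the `Span` tuple, the `FAKE_VALUES` pools, and the `replace_spans` helper are hypothetical names introduced here, and the real generator draws from a much richer source of fake data.

```python
import random
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for one analyzer result."""
    entity_type: str
    start: int  # offset of the first character of the entity
    end: int    # offset one past the last character of the entity


# Hypothetical pools of fake values, keyed by entity type.
FAKE_VALUES = {
    "PERSON": ["Tammy Ryan", "Paul Brown"],
    "URL": ["https://lopez.com/", "https://guerrero.com/"],
}


def replace_spans(text, spans, rng):
    """Return a copy of text with every known span replaced by a fake value."""
    # Go from the last span to the first so earlier offsets stay valid
    # even when a replacement is longer or shorter than the original.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        pool = FAKE_VALUES.get(span.entity_type)
        if pool is None:
            continue  # no fake values for this type: keep the original text
        text = text[:span.start] + rng.choice(pool) + text[span.end:]
    return text


text = "Hi my name is Doug Funny and this is my website: https://www.dougf.io"
spans = [Span("URL", 49, 69), Span("PERSON", 14, 24)]
fake = replace_spans(text, spans, random.Random(0))
print(fake)
```

Replacing from right to left is the design choice that matters here: a replacement changes the length of the string, so processing spans in reading order would shift every offset that follows.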


In [1]:
# install presidio via pip if not yet installed

#!pip install presidio-analyzer
#!pip install presidio-anonymizer
#!pip install presidio-evaluator

# install trained model for pipeline

#!python -m spacy download en_core_web_sm

In [2]:
from presidio_analyzer import AnalyzerEngine
from presidio_evaluator.data_generator import PresidioPseudonymization

import pandas as pd

In [3]:
# Instantiate Presidio Analyzer

analyzer = AnalyzerEngine()

In [4]:
pseudonymizer = PresidioPseudonymization()

In [5]:
original_text = "Hi my name is Doug Funny and this is my website: https://www.dougf.io"

presidio_response = analyzer.analyze(original_text, language="en")
presidio_response

[type: URL, start: 49, end: 69, score: 0.95,
 type: PERSON, start: 14, end: 24, score: 0.85]
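
Each result carries character offsets into the original string, with `end` exclusive, so slicing the text recovers the detected value. The check below is a small sketch that copies the offsets from the output above into plain tuples instead of using the analyzer's result objects; the `detected` list is an illustrative stand-in.

```python
original_text = "Hi my name is Doug Funny and this is my website: https://www.dougf.io"

# (entity type, start, end) copied from the analyzer output shown above.
detected = [("URL", 49, 69), ("PERSON", 14, 24)]

# Slice the original text with each pair of offsets to see what was detected.
for entity_type, start, end in detected:
    print(entity_type, repr(original_text[start:end]))
```

With the real response, the same slice can be taken from each result's `start` and `end` attributes.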

In [6]:
# Simple pseudonymization

pseudonymizer.pseudonymize(
    original_text=original_text, presidio_response=presidio_response, count=5
)

['Hi my name is Tammy Ryan and this is my website: https://www.cardenas.info/',
 'Hi my name is Jessica Smith and this is my website: http://jones-hunt.info/',
 'Hi my name is Michele Marsh and this is my website: https://guerrero.com/',
 'Hi my name is Kathleen Miller and this is my website: https://lopez.com/',
 'Hi my name is Paul Brown and this is my website: http://www.banks-evans.info/']

In [7]:
# When Presidio fails to detect an entity, the original value stays unchanged in the fake samples!

text = "Our son R2D2 used to work in Germany"

response = analyzer.analyze(text=text, language="en")
print(f"Presidio's response: {response}")


fake_samples = pseudonymizer.pseudonymize(
    original_text=text, presidio_response=response, count=5
)
print("-------------\nFake examples:\n")
print(*fake_samples, sep="\n")

Presidio's response: [type: LOCATION, start: 29, end: 36, score: 0.85]
-------------
Fake examples:

Our son R2D2 used to work in Nigeria
Our son R2D2 used to work in Guam
Our son R2D2 used to work in Reunion
Our son R2D2 used to work in Vanuatu
Our son R2D2 used to work in Malaysia
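
Because undetected values pass through untouched, it can be worth checking generated samples against values you already know to be sensitive. The sketch below assumes such a list exists; `find_leaks` is a hypothetical helper written for this demo, not part of Presidio, and the samples are copied from the output above.

```python
def find_leaks(known_pii, samples):
    """Map each known PII value to the samples in which it still appears."""
    leaks = {}
    for value in known_pii:
        hits = [sample for sample in samples if value in sample]
        if hits:
            leaks[value] = hits
    return leaks


# Fake samples copied from the output above.
samples = [
    "Our son R2D2 used to work in Nigeria",
    "Our son R2D2 used to work in Guam",
    "Our son R2D2 used to work in Reunion",
]

# Values from the original sentence that should not survive pseudonymization.
leaks = find_leaks(["R2D2", "Germany"], samples)
for value, hits in leaks.items():
    print(f"{value!r} leaked into {len(hits)} sample(s)")
```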
