# Introduction

On one of my projects I ran into the problem of handling Personally Identifiable Information (PII). To share our data with third parties, we decided to add an anonymization step to the preprocessing. In this article I describe an example of data anonymization using two awesome libraries: presidio and faker.

Agenda:
1. presidio analyzer for finding sensitive data
2. presidio anonymizer
3. faker for generating diverse synthetic entities 
4. final pipeline for text anonymization

# Analyzer

Presidio supports both spaCy and Stanza as its internal NLP engine (https://microsoft.github.io/presidio/analyzer/customizing_nlp_models/).

I prefer spaCy, so I will download a model for it.

In [2]:
%%capture
!python -m spacy download en_core_web_md

Then we need to select this model explicitly, because by default the analyzer uses the spaCy model en_core_web_lg for English.

In [3]:
# Create configuration containing engine name and models
configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "en_core_web_md"}],
}

In [4]:
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider

# Create NLP engine based on configuration
provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine = provider.create_engine()

# the languages are needed to load country-specific recognizers 
# for finding phones, passport numbers, etc.
analyzer = AnalyzerEngine(nlp_engine=nlp_engine,
                          supported_languages=["en"])

Let us analyze one example. The text is not preprocessed, so it is a bit harder for the algorithm to detect entities.

In [5]:
example_text = "Hi. My name is Oleg. I was born in Saint-Petersburg, Russia in 1997. Some random phone number: 51-855-831-2384. Yesterday I ate soup. Send something there helpline@lgbt.foundation. IBAN example AT483200000012345864"

print(example_text)

Hi. My name is Oleg. I was born in Saint-Petersburg, Russia in 1997. Some random phone number: 51-855-831-2384. Yesterday I ate soup. Send something there helpline@lgbt.foundation. IBAN example AT483200000012345864


And here is an example of the analyzer's output:

In [6]:
# language is a required parameter; if you don't know the language
# of a particular text, run a language detector first
results = analyzer.analyze(text=example_text, language='en')

for res in results:
    print(res)

type: PHONE_NUMBER, start: 98, end: 110, score: 1.0
type: EMAIL_ADDRESS, start: 155, end: 179, score: 1.0
type: DOMAIN_NAME, start: 164, end: 179, score: 1.0
type: IBAN_CODE, start: 194, end: 214, score: 1.0
type: PERSON, start: 15, end: 19, score: 0.85
type: LOCATION, start: 35, end: 51, score: 0.85
type: LOCATION, start: 53, end: 59, score: 0.85
type: DATE_TIME, start: 63, end: 67, score: 0.85
type: DATE_TIME, start: 112, end: 121, score: 0.85


Each result contains the entity type, its start and end indices, and a confidence score. If you want to look only for a limited set of entities, pass a list of their names to the entities parameter of the analyze function.
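To make the indices concrete, here is a small sketch that slices the example text with a few of the spans printed above. The tuples are copied by hand from that output, and nothing here calls Presidio, so treat it as an illustration of how to read the results rather than as part of the pipeline.

```python
# The same example text as above, split across lines for readability.
example_text = (
    "Hi. My name is Oleg. I was born in Saint-Petersburg, Russia in 1997. "
    "Some random phone number: 51-855-831-2384. Yesterday I ate soup. "
    "Send something there helpline@lgbt.foundation. "
    "IBAN example AT483200000012345864"
)

# (entity_type, start, end) copied by hand from the analyzer output above.
spans = [
    ("PERSON", 15, 19),
    ("PHONE_NUMBER", 98, 110),
    ("EMAIL_ADDRESS", 155, 179),
    ("DOMAIN_NAME", 164, 179),
]

# start is inclusive and end is exclusive, exactly like a Python slice.
covered = {etype: example_text[start:end] for etype, start, end in spans}

for etype, fragment in covered.items():
    print(f"{etype}: {fragment!r}")
```

Note that the PHONE_NUMBER span starts at index 98, so the leading "51-" is not part of the detected entity, and that DOMAIN_NAME lies entirely inside EMAIL_ADDRESS.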

And here is the same text with entities marked and coloured with a random non-black colour

In [7]:
from termcolor import colored
import random

def rand_col(text):
    return colored(text, color=random.choice(['grey', 'red', 'green', 'yellow', 'blue', 'magenta']))    

In [8]:
from copy import deepcopy
results_sorted = deepcopy(results)
results_sorted.sort(key=lambda x: x.start)

current_idx = 0
for current_entity in results_sorted:
    entity_text = example_text[current_entity.start: current_entity.end]
    if current_idx <= current_entity.start:
        # print the plain text between the previous entity and this one
        print(example_text[current_idx: current_entity.start], end='')
        print(rand_col(entity_text + f' [{current_entity.entity_type}]'), end='')
    else:
        # this entity starts inside the previous one
        print(rand_col(entity_text + f' [{current_entity.entity_type} collision]'), end='')

    # never move backwards if a nested entity ends before the outer one
    current_idx = max(current_idx, current_entity.end)
    
print(example_text[current_idx: ])

Hi. My name is Oleg [PERSON]. I was born in Saint-Petersburg [LOCATION], Russia [LOCATION] in 1997 [DATE_TIME]. Some random phone number: 51-855-831-2384 [PHONE_NUMBER]. Yesterday [DATE_TIME] I ate soup. Send something there helpline@lgbt.foundation [EMAIL_ADDRESS]lgbt.foundation [DOMAIN_NAME collision]. IBAN example AT483200000012345864 [IBAN_CODE]


From the documentation: "As the input text could potentially have overlapping PII entities, there are different anonymization scenarios:
No overlap (single PII) - single PII over text entity, uses a given or default anonymizer to anonymize and replace the PII text entity.
Full overlap of PIIs - When one text have several PIIs, the PII with the higher score will be taken. Between PIIs with identical scores, the selection will be arbitrary.
One PII is contained in another - anonymizer will use the PII with larger text.
Partial intersection - both will be returned concatenated."
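The rules quoted above can be sketched in a few lines of plain Python. This is my own simplified reading of the quoted text, not Presidio's actual implementation; the Span tuple and the resolve_overlaps function are hypothetical names introduced only for this illustration.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    entity_type: str
    start: int
    end: int
    score: float


def resolve_overlaps(spans: List[Span]) -> List[Span]:
    """Drop spans that coincide with or are nested in an already kept span.

    Partially intersecting spans are both kept, as in the quoted rules.
    """
    kept: List[Span] = []
    # Longer spans first, then higher scores, so that the "winner"
    # of each conflict is always seen before the "loser".
    for span in sorted(spans, key=lambda s: (-(s.end - s.start), -s.score)):
        nested = any(k.start <= span.start and span.end <= k.end for k in kept)
        if not nested:
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)


spans = [
    Span("EMAIL_ADDRESS", 155, 179, 1.0),
    Span("DOMAIN_NAME", 164, 179, 1.0),  # nested in EMAIL_ADDRESS
    Span("PERSON", 15, 19, 0.85),
]

resolved = resolve_overlaps(spans)
print([s.entity_type for s in resolved])
```

With the spans from our example, DOMAIN_NAME is dropped because it is fully contained in EMAIL_ADDRESS, which matches the anonymized output shown in the next section.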

Moreover, custom recognisers (https://microsoft.github.io/presidio/analyzer/adding_recognizers/) can be added for your particular patterns. They may be used, for example, to find URLs, uncommon symbol sequences, or specific phrases.
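To give a feeling for what a pattern-based recognizer does, here is a minimal sketch that uses only the standard re module. It is not Presidio code: the URL_PATTERN regex, the find_urls function, and the 0.6 score are all assumptions made for this illustration. It only mimics the shape of the analyzer output (entity type, start, end, score); the linked documentation describes how to register real recognizers.

```python
import re

# Hypothetical pattern: http(s) URLs up to the next whitespace, quote or bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def find_urls(text, score=0.6):
    """Return (entity_type, start, end, score) tuples for every URL-like match."""
    return [("URL", m.start(), m.end(), score) for m in URL_PATTERN.finditer(text)]


sample = "See https://example.org/docs for details"
matches = find_urls(sample)

for entity_type, start, end, score in matches:
    print(entity_type, sample[start:end], score)
```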

The decision for each entity can also be explained; see https://microsoft.github.io/presidio/analyzer/decision_process/ for details.

# Anonymizer

The anonymizer is the second pillar of the presidio library.

It also has an engine:

In [9]:
from presidio_anonymizer import AnonymizerEngine

anonymizer = AnonymizerEngine()

In [10]:
anonymized_text = anonymizer.anonymize(text=example_text, analyzer_results=results).text

print(anonymized_text)

Hi. My name is <PERSON>. I was born in <LOCATION>, <LOCATION> in <DATE_TIME>. Some random phone number: 51-<PHONE_NUMBER>. <DATE_TIME> I ate soup. Send something there <EMAIL_ADDRESS>. IBAN example <IBAN_CODE>


We can see that by default the entities are replaced with their entity type. Not bad. But can we make it more flexible? Of course! Presidio has operators for this: https://microsoft.github.io/presidio/anonymizer/

In [11]:
from presidio_anonymizer.entities.engine import OperatorConfig

operators={"PERSON": OperatorConfig(operator_name="replace", 
                                    params={"new_value": "REPLACED_NAME"}),
           "LOCATION": OperatorConfig(operator_name="mask", 
                                      params={'chars_to_mask': 10, 
                                              'masking_char': '*',
                                              'from_end': True}),
           "DEFAULT": OperatorConfig(operator_name="redact")}

anonymized_text = anonymizer.anonymize(text=example_text, 
                                       analyzer_results=results,
                                       operators=operators).text

print(anonymized_text)

Hi. My name is REPLACED_NAME. I was born in Saint-**********, ****** in . Some random phone number: 51-.  I ate soup. Send something there . IBAN example 


We have masked locations, replaced persons with a pre-defined value, and removed (the 'redact' operator) all other entities found. In addition, you may use the hash, encrypt, and custom operators. The last one is the most valuable, from my perspective. With a custom operator we can, for example, apply custom (surprisingly) logic to the original entity, select randomly from a set of pre-defined values, or even generate a new anonymized value from scratch!

In [12]:
operators={"PERSON": OperatorConfig(operator_name="custom", 
                                    params={"lambda": lambda x: random.choice(['Neo', 'Paul'])}),
           "DEFAULT": OperatorConfig(operator_name="custom", params={"lambda": lambda x: x[::-1]})}

anonymized_text = anonymizer.anonymize(text=example_text, 
                                       analyzer_results=results,
                                       operators=operators).text

print(anonymized_text)

Hi. My name is Paul. I was born in grubsreteP-tniaS, aissuR in 7991. Some random phone number: 51-4832-138-558. yadretseY I ate soup. Send something there noitadnuof.tbgl@enilpleh. IBAN example 468543210000002384TA


If you run the cell several times, you may notice that PERSON entities are sometimes replaced with Neo and sometimes with Paul. All other entities are reversed.
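If you would rather have the same original value always map to the same replacement (so that "Oleg" stays one person throughout a dataset), the random choice can be swapped for a deterministic one. Below is a minimal sketch of such a function, assuming a small hand-written pool; PSEUDONYMS and stable_pseudonym are hypothetical names, and a function like this could be passed as the "lambda" parameter of a custom operator in the same way as the lambdas above.

```python
import hashlib

# Hypothetical replacement pool; in practice it should be much larger.
PSEUDONYMS = ["Neo", "Paul", "Trinity", "Chani"]


def stable_pseudonym(original: str) -> str:
    """Pick a replacement that depends only on the original value."""
    digest = hashlib.sha256(original.encode("utf-8")).digest()
    # Use the first four bytes of the hash as an index into the pool.
    index = int.from_bytes(digest[:4], "big") % len(PSEUDONYMS)
    return PSEUDONYMS[index]


first = stable_pseudonym("Oleg")
second = stable_pseudonym("Oleg")
print(first == second)
```

Keep in mind that with a small pool different people will often share a pseudonym, and that an unsalted hash is only a convenience for consistency, not a protection in itself.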

# Faker

The last but not least component of the pipeline is Faker (https://github.com/joke2k/faker), a library for generating fake data. Here is a basic example from the library:

In [13]:
from faker import Faker
fake = Faker()

print('random name:', fake.name())
print('random address:', fake.address())
print('random phone number:', fake.phone_number())

random name: Jose Bailey
random address: 546 Brown Stravenue Apt. 463
West Stephenburgh, KY 12451
random phone number: +1-092-602-8214x7410


Under the hood, faker relies on large collections of locale-specific names, surnames, prefixes, etc. Its simple interface and variety make it a better choice than a hand-written list of replacement values. Interestingly, we may limit the locales from which we generate our entities:

In [14]:
fake = Faker(locale=['ja_JP'])  # ja_JP is Faker's locale code for Japanese

for i in range(5):
    print(fake.name())

藤井 知実
坂本 陽一
岡本 くみ子
橋本 幹
木村 明美


In [15]:
fake = Faker(locale=['en_US', 'en_GB', 'en_CA', 'fr_FR'])

for i in range(10):
    print(fake.name())

Leah Cook
Renee Rollins
Carla Wright
Stewart Knowles
Thierry-Bernard Valette
Owen Hall
Jasmine Austin
Mohamed Ward
Gail Clark-Osborne
Heather Cameron


To use faker during the anonymisation step, we need to create operators with lambda functions

In [16]:
fake_operators = {
    "PERSON": OperatorConfig("custom", {"lambda": lambda x: fake.name()}),
    "PHONE_NUMBER": OperatorConfig("custom", {"lambda": lambda x: fake.phone_number()}),
    "EMAIL_ADDRESS": OperatorConfig("custom", {"lambda": lambda x: fake.email()}),
    "LOCATION": OperatorConfig("replace", {"new_value": "USA"}),
    "DEFAULT": OperatorConfig(operator_name="mask", 
                              params={'chars_to_mask': 10, 
                                      'masking_char': '*',
                                      'from_end': False}),
}

In [17]:
anonymized_text = anonymizer.anonymize(text=example_text,
                                       analyzer_results=results,
                                       operators=fake_operators
                                       ).text
print(anonymized_text)

Hi. My name is Joshua Smith. I was born in USA, USA in ****. Some random phone number: 51-(571) 314-9216. ********* I ate soup. Send something there williammartinez@brown.org. IBAN example **********0012345864


And that's it! The tool works quite well out of the box and may be fine-tuned with custom recognisers (https://microsoft.github.io/presidio/analyzer/adding_recognizers/) and by analysing the decision process (https://microsoft.github.io/presidio/analyzer/decision_process/).

# Final pipeline and example

In [18]:
import pandas as pd

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider

from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities.engine import OperatorConfig

from faker import Faker


# Create configuration containing engine name and models
configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "en_core_web_md"}],
}

# Create NLP engine based on configuration
provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine = provider.create_engine()

fake = Faker(locale=['en_US', 'en_GB', 'en_CA', 'fr_FR'])
fake_operators = {
    "PERSON": OperatorConfig("custom", {"lambda": lambda x: fake.name()}),
    "PHONE_NUMBER": OperatorConfig("custom", {"lambda": lambda x: fake.phone_number()}),
    "EMAIL_ADDRESS": OperatorConfig("custom", {"lambda": lambda x: fake.email()}),
    "LOCATION": OperatorConfig("replace", {"new_value": "USA"}),
    "DEFAULT": OperatorConfig(operator_name="replace"),
}

# Set up the engines
analyzer = AnalyzerEngine(nlp_engine=nlp_engine,
                          supported_languages=["en"])
anonymizer = AnonymizerEngine()

# broadcast the engines to the cluster nodes
# uncomment this if spark is used
# broadcasted_analyzer = sc.broadcast(analyzer)
# broadcasted_anonymizer = sc.broadcast(anonymizer)


def anonymize_text(text: str) -> str:
    # uncomment this if spark is used
#     analyzer = broadcasted_analyzer.value
#     anonymizer = broadcasted_anonymizer.value

    # Call analyzer to get results
    results = analyzer.analyze(text=text,
                               language='en')

    # Analyzer results are passed to the AnonymizerEngine for anonymization
    anonymized_text = anonymizer.anonymize(text=text,
                                           analyzer_results=results,
                                           operators=fake_operators
                                           )

    return anonymized_text.text


def anonymize_series(s: pd.Series) -> pd.Series:
    return s.apply(anonymize_text)

Obviously, we cannot use the data from the project here. Therefore, I looked for a public dataset full of personal information. The Hillary Clinton emails work quite well for this purpose.

https://www.kaggle.com/kaggle/hillary-clinton-emails

In [19]:
df = pd.read_csv('Emails.csv')
df.head()

Unnamed: 0,Id,DocNumber,MetadataSubject,MetadataTo,MetadataFrom,SenderPersonId,MetadataDateSent,MetadataDateReleased,MetadataPdfLink,MetadataCaseNumber,...,ExtractedTo,ExtractedFrom,ExtractedCc,ExtractedDateSent,ExtractedCaseNumber,ExtractedDocNumber,ExtractedDateReleased,ExtractedReleaseInPartOrFull,ExtractedBodyText,RawText
0,1,C05739545,WOW,H,"Sullivan, Jacob J",87.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739545...,F-2015-04841,...,,"Sullivan, Jacob J <Sullivan11@state.gov>",,"Wednesday, September 12, 2012 10:16 AM",F-2015-04841,C05739545,05/13/2015,RELEASE IN FULL,,UNCLASSIFIED\nU.S. Department of State\nCase N...
1,2,C05739546,H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...,H,,,2011-03-03T05:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH1/DOC_0C05739546...,F-2015-04841,...,,,,,F-2015-04841,C05739546,05/13/2015,RELEASE IN PART,"B6\nThursday, March 3, 2011 9:45 PM\nH: Latest...",UNCLASSIFIED\nU.S. Department of State\nCase N...
2,3,C05739547,CHRIS STEVENS,;H,"Mills, Cheryl D",32.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739547...,F-2015-04841,...,B6,"Mills, Cheryl D <MillsCD@state.gov>","Abedin, Huma","Wednesday, September 12, 2012 11:52 AM",F-2015-04841,C05739547,05/14/2015,RELEASE IN PART,Thx,UNCLASSIFIED\nU.S. Department of State\nCase N...
3,4,C05739550,CAIRO CONDEMNATION - FINAL,H,"Mills, Cheryl D",32.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739550...,F-2015-04841,...,,"Mills, Cheryl D <MillsCD@state.gov>","Mitchell, Andrew B","Wednesday, September 12,2012 12:44 PM",F-2015-04841,C05739550,05/13/2015,RELEASE IN PART,,UNCLASSIFIED\nU.S. Department of State\nCase N...
4,5,C05739554,H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...,"Abedin, Huma",H,80.0,2011-03-11T05:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH1/DOC_0C05739554...,F-2015-04841,...,,,,,F-2015-04841,C05739554,05/13/2015,RELEASE IN PART,"H <hrod17@clintonemail.com>\nFriday, March 11,...",B6\nUNCLASSIFIED\nU.S. Department of State\nCa...


In [20]:
# keep rows with a body; copy() avoids SettingWithCopyWarning when a column is added below
df = df[df['ExtractedBodyText'].notna()].copy()

In [21]:
%%time
df['anonymized_text'] = anonymize_series(df['ExtractedBodyText'])

Wall time: 3min 23s


In [22]:
pd.set_option('max_colwidth', 300)
df[['ExtractedBodyText', 'anonymized_text']]

Unnamed: 0,ExtractedBodyText,anonymized_text
1,"B6\nThursday, March 3, 2011 9:45 PM\nH: Latest How Syria is aiding Qaddafi and more... Sid\nhrc memo syria aiding libya 030311.docx; hrc memo syria aiding libya 030311.docx\nMarch 3, 2011\nFor: Hillary",<US_DRIVER_LICENSE>\n<DATE_TIME> <DATE_TIME>\nH: Latest How USA is aiding USA and more... Sid\nhrc memo syria aiding USA <US_DRIVER_LICENSE>.docx; <NRP> memo syria aiding USA <DATE_TIME>\nFor: Anastasie Barbier-Barbe
2,Thx,Thx
4,"H <hrod17@clintonemail.com>\nFriday, March 11, 2011 1:36 PM\nHuma Abedin\nFw: H: Latest: How Syria is aiding Qaddafi and more... Sid\nhrc memo syria aiding libya 030311.docx\nPis print.",H <kthompson@gmail.com>\n<DATE_TIME> PM\nMeredith Cortez\nFw: H: Latest: How USA is aiding USA and more... Sid\nhrc memo syria aiding USA <US_DRIVER_LICENSE>.docx\nPis print.
5,"Pis print.\n-•-...-^\nH < hrod17@clintonernailcom>\nWednesday, September 12, 2012 2:11 PM\n°Russorv@state.gov'\nFw: Meet The Right-Wing Extremist Behind Anti-fvluslim Film That Sparked Deadly Riots\nFrom [meat)\nSent: Wednesday, September 12, 2012 01:00 PM\nTo: 11\nSubject: Meet The Right Wing E...",Pis print.\n-•-...-^\nH < hrod17@clintonernailcom>\n<DATE_TIME> PM\n°robert76@barton.com'\nFw: Meet The Right-Wing Extremist Behind Anti-fvluslim Film That Sparked Deadly Riots\nFrom [meat)\nSent: <DATE_TIME> <DATE_TIME>\nTo: 11\nSubject: Meet The Right Wing Extremist Behind Anti-Muslim Film Tha...
7,"H <hrod17@clintonemail.corn>\nFriday, March 11, 2011 1:36 PM\nHuma Abedin\nFw: H: Latest: How Syria is aiding Qaddafi and more... Sid\nhrc memo Syria aiding libya 030311.docx\nPis print.",H <hrod17@clintonemail.corn>\n<DATE_TIME> PM\nKimberly Gonzalez\nFw: H: Latest: How USA is aiding USA and more... Benjamin Singh memo USA aiding USA <US_DRIVER_LICENSE>.docx\nPis print.
...,...,...
7938,"Hi. Sorry I haven't had a chance to see you, but I did want you to hear directly from me that it was a great result in\nCancun. Way beyond any expectations. Many challenges ahead, but a very good day for us and a great day for Mexico.\nHave a very happy holiday if I don't see you before. Best, Todd","Hi. Sorry I haven't had a chance to see you, but I did want you to hear directly from me that it was a great result in\nCancun. Way beyond any expectations. Many challenges ahead, but a very good day for us and a great day for USA.\nHave a very happy holiday if I don't see you before. Best, Andr..."
7939,"B6\nI assume you saw this by now -- if not, it's worth a read.\nForwarded message","<US_DRIVER_LICENSE>\nI assume you saw this by now -- if not, it's worth a read.\nForwarded message"
7941,"Big change of plans in the Senate. Senator Reid just announced that he was no longer going to move forward with the\nomnibus appropriations bill. Instead, he filed cloture motions on the repeal of Don't Ask, Don't Tell and the DREAM\nAct.\nThose petitions will ripen on Saturday. So it looks like...","Big change of plans in the Senate. Senator Daniel Mendoza just announced that he was no longer going to move forward with the\nomnibus appropriations bill. Instead, he filed cloture motions on the repeal of Don't Ask, Don't Tell and the DREAM\nAct.\nThose petitions will ripen on <DATE_TIME>. So ..."
7943,"PVerveer B6\nFriday, December 17, 2010 12:12 AM\nFrom B6\nPlease\nlet me know if I can be of any help to your department and will happy to do and please thank\nMrs. Hillary Clinton on behalf of me and\n. supporting Afghan women.\n•Thank you,\nB6\nB6\nB6\nB6\nB6\nB6\nB6\nB6","PVerveer <US_DRIVER_LICENSE>\n<DATE_TIME>From <US_DRIVER_LICENSE>\nPlease\nlet me know if I can be of any help to your department and will happy to do and please thank\nMrs. Connor Robinson-Elliott on behalf of me and\n. supporting <NRP> women.\nDavid Hunt you,\n<US_DRIVER_LICENSE>\n<US_DRIVER_L..."


# Conclusion

The suggested pipeline can be easily adjusted to your needs with heavier NER models (en_core_web_trf, for example), custom rules, different confidence score thresholds, etc. Moreover, you can extend the pipeline to multiple languages (https://microsoft.github.io/presidio/analyzer/languages/).
From a production point of view, the code can also be run with Spark by adding a few lines of code (see the commented-out lines in the final pipeline). Or you can use Presidio as an HTTP service (https://microsoft.github.io/presidio/samples/docker/).
Finally, Presidio also works with images (https://microsoft.github.io/presidio/image-redactor/).