## NERPII Tutorial


First, let's import the dependencies and the NERPII library itself.

The imports already reveal NERPII's code structure: it is split into two modules, **named_entity_recognizer** and **faker_generator**, which expose the NamedEntityRecognizer and FakerGenerator classes.
**named_entity_recognizer** calls Presidio and BERT to identify PII, and **faker_generator** uses Faker to generate new synthetic data.

In [1]:
import sys
# Machine-specific workaround: adjust or remove this path for your own environment
sys.path.append('/opt/homebrew/lib/python3.11/site-packages')
from nerpii.named_entity_recognizer import NamedEntityRecognizer
from nerpii.faker_generator import FakerGenerator
import pandas as pd

## Without Custom Providers

With the imports in place, we start by generating a synthetic dataset using Faker.
For this example, we generate random first and last names, postal codes, phone numbers, SSNs in US format, and emails, so the dataset consists only of PII.

In [9]:
import csv
from faker import Faker

fake = Faker('pt_PT', use_weighting=True) # Multiple locales can also be passed, e.g. ['en_US', 'pt_PT']
Faker.seed(123) # Optional

output_file = 'random_data.csv'

fields = [
    "First Name", "Last Name", 
    "Postal Code", "Phone Number", 
    "Social Security Number", "Email"
]

num_records = 100

with open(output_file, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file) 
    writer.writerow(fields)  
    
    for _ in range(num_records):
        writer.writerow([
            fake.first_name(),
            fake.last_name(),
            fake.postcode(),
            fake.phone_number(),
            fake.ssn(),  
            fake.email()
        ])

print(f"CSV file '{output_file}' with {num_records} random records created successfully.")

# Verify if the names are unique
# assert len(set(pd.read_csv(output_file)['First Name'])) == num_records

CSV file 'random_data.csv' with 100 random records created successfully.


After that, we use NamedEntityRecognizer to try to identify the dataset's entities.

In [10]:
df = pd.read_csv("random_data.csv")       
recognizer = NamedEntityRecognizer(df)

recognizer.assign_entities_with_presidio()
print(recognizer.dict_global_entities)

recognizer.assign_entities_manually()
print(recognizer.dict_global_entities)

recognizer.assign_organization_entity_with_model()
print(recognizer.dict_global_entities)

{'First Name': {'entity': 'PERSON', 'confidence_score': 0.9367088607594937}, 'Last Name': {'entity': 'PERSON', 'confidence_score': 0.9761904761904762}, 'Postal Code': {'entity': 'DATE_TIME', 'confidence_score': 0.8095238095238095}, 'Phone Number': {'entity': 'DATE_TIME', 'confidence_score': 0.25806451612903225}, 'Social Security Number': {'entity': 'US_SSN', 'confidence_score': 1.0}, 'Email': {'entity': 'EMAIL_ADDRESS', 'confidence_score': 1.0}}
{'First Name': {'entity': 'PERSON', 'confidence_score': 0.9367088607594937}, 'Last Name': {'entity': 'PERSON', 'confidence_score': 0.9761904761904762}, 'Postal Code': {'entity': 'ZIPCODE', 'confidence_score': 1.0}, 'Phone Number': {'entity': 'DATE_TIME', 'confidence_score': 0.25806451612903225}, 'Social Security Number': {'entity': 'US_SSN', 'confidence_score': 1.0}, 'Email': {'entity': 'EMAIL_ADDRESS', 'confidence_score': 1.0}}
{'First Name': {'entity': 'PERSON', 'confidence_score': 0.9367088607594937}, 'Last Name': {'entity': 'PERSON', 'confi

In [4]:
df2 = pd.read_csv("random_data.csv")       
recognizer2 = NamedEntityRecognizer(df2)

recognizer2.assign_entities_manually()
print(recognizer2.dict_global_entities)

recognizer2.assign_organization_entity_with_model()
print(recognizer2.dict_global_entities)

recognizer2.assign_entities_with_presidio()
print(recognizer2.dict_global_entities)

{'First Name': None, 'Last Name': None, 'Postal Code': {'entity': 'ZIPCODE', 'confidence_score': 1.0}, 'Phone Number': None, 'Social Security Number': None, 'Email': None}
{'First Name': None, 'Last Name': {'entity': 'ORGANIZATION', 'confidence_score': 0.1781609195402299}, 'Postal Code': {'entity': 'ZIPCODE', 'confidence_score': 1.0}, 'Phone Number': None, 'Social Security Number': None, 'Email': {'entity': 'ORGANIZATION', 'confidence_score': 0.2222222222222222}}
{'First Name': {'entity': 'PERSON', 'confidence_score': 0.9367088607594937}, 'Last Name': {'entity': 'PERSON', 'confidence_score': 0.9761904761904762}, 'Postal Code': {'entity': 'DATE_TIME', 'confidence_score': 0.8095238095238095}, 'Phone Number': {'entity': 'DATE_TIME', 'confidence_score': 0.25806451612903225}, 'Social Security Number': {'entity': 'US_SSN', 'confidence_score': 0.9595959595959596}, 'Email': {'entity': 'EMAIL_ADDRESS', 'confidence_score': 1.0}}


In [5]:
df3 = pd.read_csv("random_data_headless.csv")       
recognizer3 = NamedEntityRecognizer(df3)

recognizer3.assign_entities_manually()
print(recognizer3.dict_global_entities)

recognizer3.assign_entities_with_presidio()
print(recognizer3.dict_global_entities)


{'A': None, 'B': None, 'C': None, 'D': None, 'E': None, 'F': None}
{'A': {'entity': 'PERSON', 'confidence_score': 0.9367088607594937}, 'B': {'entity': 'PERSON', 'confidence_score': 0.9761904761904762}, 'C': {'entity': 'DATE_TIME', 'confidence_score': 0.8095238095238095}, 'D': {'entity': 'DATE_TIME', 'confidence_score': 0.25806451612903225}, 'E': {'entity': 'US_SSN', 'confidence_score': 0.9393939393939394}, 'F': {'entity': 'EMAIL_ADDRESS', 'confidence_score': 1.0}}


As we can see, "dict_global_entities" is a dictionary that maps each column to the entity that NERPII (more specifically, Presidio and the BERT model) considers most likely for that column, together with a confidence score. For example, the Email column was identified as **EMAIL_ADDRESS** with 100% confidence, probably because the email format is standard worldwide.
For the Phone Number column, however, the outputs above do not show **PHONE_NUMBER** at all: Presidio assigned **DATE_TIME** with only about 26% confidence, probably because phone numbers are formatted differently around the world. The outputs also show that the order of the calls matters: running assign_entities_manually() after Presidio replaces the Postal Code entity with **ZIPCODE**, whereas running Presidio last overwrites it with **DATE_TIME**.
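Low scores like the one for Phone Number are easy to miss in a long printed dictionary. The sketch below flags columns that have no entity or a weak one. It runs on a plain dictionary shaped like "dict_global_entities", with values copied from the first Presidio pass printed above. The helper name low_confidence_columns and the 0.5 threshold are assumptions made for illustration; they are not part of NERPII.

```python
# Dictionary shaped like recognizer.dict_global_entities; values copied
# from the first Presidio pass printed earlier in this notebook.
dict_global_entities = {
    "First Name": {"entity": "PERSON", "confidence_score": 0.9367088607594937},
    "Last Name": {"entity": "PERSON", "confidence_score": 0.9761904761904762},
    "Postal Code": {"entity": "DATE_TIME", "confidence_score": 0.8095238095238095},
    "Phone Number": {"entity": "DATE_TIME", "confidence_score": 0.25806451612903225},
    "Social Security Number": {"entity": "US_SSN", "confidence_score": 1.0},
    "Email": {"entity": "EMAIL_ADDRESS", "confidence_score": 1.0},
}


def low_confidence_columns(entities, threshold=0.5):
    """Hypothetical helper: map each weak column to its entity (or None).

    A column is flagged when no entity was assigned or when its
    confidence score is below `threshold` (0.5 is an arbitrary choice).
    """
    flagged = {}
    for column, info in entities.items():
        if info is None:
            flagged[column] = None
        elif info["confidence_score"] < threshold:
            flagged[column] = info["entity"]
    return flagged


flagged = low_confidence_columns(dict_global_entities)
print(flagged)
```

Columns flagged this way are good candidates for a manual check before the dictionary is handed to FakerGenerator.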

In [11]:
df

Unnamed: 0,First Name,Last Name,Postal Code,Phone Number,Social Security Number,Email
0,Cláudio,Fonseca,6410-688,+351925022585,206-19-6105,cbarros@example.org
1,Marta,Campos,0469-747,(351) 934 587 398,068-68-9786,erica02@example.net
2,Diogo,Moreira,2558-617,+351295039150,368-80-4738,jcastro@example.net
3,Bruna,Lourenço,0027-304,+351934660191,116-08-6186,henriquesmicael@example.com
4,Bernardo,Pacheco,6502-612,+351963108684,465-28-5812,dmatias@example.net
...,...,...,...,...,...,...
95,Miguel,Garcia,7003-813,(351) 932564280,485-20-5647,ema12@example.net
96,Nelson,Soares,2154-522,(351) 936 651 339,542-31-8122,qbaptista@example.com
97,Nicole,Carvalho,1964-097,(351) 938 734 449,741-27-3975,rubenlourenco@example.com
98,Rafaela,Simões,2846-174,(351) 924512404,399-58-1283,naiaraaraujo@example.org


In [12]:
faker_generator = FakerGenerator(df, recognizer.dict_global_entities)
faker_generator.get_faker_generation()

Column First Name synthesized with Faker.
Column Last Name synthesized with Faker.
Column Email synthesized with Faker.
Column Postal Code synthesized with Faker.
Column Social Security Number synthesized with Faker.
Column Phone Number not synthesized with Faker.


In [26]:
df.head()

Unnamed: 0,First Name,Last Name,Postal Code,Phone Number,Social Security Number,Email
0,Elizabeth,Patterson,69767,+351925022585,283-60-3108,elizabeth.patterson@yahoo.com
1,Cheyenne,Davis,47647,(351) 934 587 398,632-37-5501,cheyenne.davis@hotmail.com
2,Christy,Hogan,91669,+351295039150,539-73-9327,christy.hogan@hotmail.com
3,Andrew,Reyes,65396,+351934660191,530-39-4012,andrew.reyes@hotmail.com
4,Crystal,Wright,22491,+351963108684,460-26-3865,crystal.wright@gmail.com


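As we can see, the result is a dataset with the same columns in which the detected PII has been replaced with synthetic values. The exception is Phone Number, which Faker did not synthesize and which therefore keeps its original values.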
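To check which columns were actually replaced, the rows can be compared before and after synthesis. The sketch below does this on plain lists of dictionaries, using two rows and three columns copied from the tables above. The helper name changed_columns is hypothetical and not part of NERPII.

```python
def changed_columns(before_rows, after_rows):
    """Return the column names whose values differ in at least one row."""
    changed = []
    for column in before_rows[0]:
        if any(b[column] != a[column] for b, a in zip(before_rows, after_rows)):
            changed.append(column)
    return changed


# Two rows copied from the tables above (before and after synthesis),
# restricted to three columns to keep the example short.
before = [
    {"First Name": "Cláudio", "Phone Number": "+351925022585", "Email": "cbarros@example.org"},
    {"First Name": "Marta", "Phone Number": "(351) 934 587 398", "Email": "erica02@example.net"},
]
after = [
    {"First Name": "Elizabeth", "Phone Number": "+351925022585", "Email": "elizabeth.patterson@yahoo.com"},
    {"First Name": "Cheyenne", "Phone Number": "(351) 934 587 398", "Email": "cheyenne.davis@hotmail.com"},
]

changed = changed_columns(before, after)
untouched = [column for column in before[0] if column not in changed]
print("changed:", changed)
print("untouched:", untouched)
```

With a real DataFrame, a copy would need to be taken before calling get_faker_generation(), since the outputs in this notebook suggest the DataFrame is modified in place.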

## With Custom Providers

In [8]:
import csv
from faker import Faker
from costum_providers import course_provider, AgeProvider, my_word_list

fake = Faker('pt_PT', use_weighting=True) # Multiple locales can also be passed, e.g. ['en_US', 'pt_PT']
Faker.seed(123) # Optional

fake.add_provider(course_provider)
fake.add_provider(AgeProvider)

output_file = 'random_data_costum.csv'

fields = [
    "First Name", "Last Name", 
    "Postal Code", "Phone Number", 
    "Social Security Number", "Email",
    "Course", "Age", "Lorem Ipsum Costum"
]

num_records = 100

with open(output_file, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file) 
    writer.writerow(fields)  
    
    for _ in range(num_records):
        writer.writerow([
            fake.first_name(),
            fake.last_name(),
            fake.postcode(),
            fake.phone_number(),
            fake.ssn(),  
            fake.email(),
            fake.course(),
            fake.age(),
            fake.sentence(ext_word_list=my_word_list)
        ])

print(f"CSV file '{output_file}' with {num_records} random records created successfully.")

CSV file 'random_data_costum.csv' with 100 random records created successfully.


In [2]:
df4 = pd.read_csv("random_data_costum.csv")
recognizer4 = NamedEntityRecognizer(df4)

recognizer4.assign_entities_with_presidio()
print(recognizer4.dict_global_entities)

recognizer4.assign_entities_manually()
print(recognizer4.dict_global_entities)

recognizer4.assign_organization_entity_with_model()
print(recognizer4.dict_global_entities)

{'First Name': {'entity': 'PERSON', 'confidence_score': 0.8552631578947368}, 'Last Name': {'entity': 'PERSON', 'confidence_score': 0.9879518072289156}, 'Postal Code': None, 'Phone Number': {'entity': 'US_PASSPORT', 'confidence_score': 0.18181818181818182}, 'Social Security Number': {'entity': 'US_SSN', 'confidence_score': 0.98}, 'Email': {'entity': 'EMAIL_ADDRESS', 'confidence_score': 1.0}, 'Course': None, 'Age': None, 'Lorem Ipsum Costum': {'entity': 'LOCATION', 'confidence_score': 0.21951219512195122}}
{'First Name': {'entity': 'PERSON', 'confidence_score': 0.8552631578947368}, 'Last Name': {'entity': 'PERSON', 'confidence_score': 0.9879518072289156}, 'Postal Code': {'entity': 'ZIPCODE', 'confidence_score': 1.0}, 'Phone Number': {'entity': 'US_PASSPORT', 'confidence_score': 0.18181818181818182}, 'Social Security Number': {'entity': 'US_SSN', 'confidence_score': 0.98}, 'Email': {'entity': 'EMAIL_ADDRESS', 'confidence_score': 1.0}, 'Course': None, 'Age': None, 'Lorem Ipsum Costum': {'e

In [None]:
df4

Unnamed: 0,First Name,Last Name,Postal Code,Phone Number,Social Security Number,Email,Course,Age,Lorem Ipsum Costum
0,Cláudio,Fonseca,6410-688,+351925022585,206-19-6105,cbarros@example.org,LEIC,26,Lollipop Gummies sesame.
1,Catarina,Amorim,5873-989,+351910686897,580-21-7425,ines61@example.net,LCC,42,Bar danish Jelly bar wafer Ice oat.
2,Alícia,Nogueira,4738-722,276700273,454-66-0191,patricia11@example.com,MIA,68,Wafer Jelly beans danish wafer oat danish.
3,Íris,Jesus,5026-125,(351) 911086842,652-85-8129,dmatias@example.net,LEIC,47,Wafer Ice pie.
4,Petra,Simões,4602-858,938258185,951-48-3354,gandrade@example.org,LEIC,50,Pie Ice danish Lollipop Lollipop.
...,...,...,...,...,...,...,...,...,...
95,Vasco,Coelho,6387-164,(351) 262 852 068,113-77-0439,tome58@example.com,LEIC,30,Lollipop Lollipop cheesecake danish oat Lollipop.
96,Madalena,Pinto,4841-552,911 235 088,141-41-6530,goncalvesfrederico@example.org,LEIC,49,Ice sesame oat Lollipop beans Lollipop Ice.
97,Ivo,Azevedo,1575-580,(351) 936 346 903,432-12-0645,julianacunha@example.com,LCC,53,Danish danish Jelly wafer.
98,Teresa,Faria,3582-306,(351) 917 487 176,319-71-1145,mauro90@example.com,MIA,28,Cheesecake beans Ice pie.


In [5]:
faker_generator2 = FakerGenerator(df4, recognizer4.dict_global_entities)
faker_generator2.get_faker_generation()

Column First Name synthesized with Faker.
Column Last Name synthesized with Faker.
Column Email synthesized with Faker.
Column Postal Code synthesized with Faker.
Column Social Security Number synthesized with Faker.
Column Phone Number not synthesized with Faker.
Column Lorem Ipsum Costum not synthesized with Faker.
Column Course not synthesized with Faker.


In [7]:
df4.head()

Unnamed: 0,First Name,Last Name,Postal Code,Phone Number,Social Security Number,Email,Course,Age,Lorem Ipsum Costum
0,Raymond,Wang,26548,+351925022585,310-11-7645,raymond.wang@yahoo.com,LEIC,26,Lollipop Gummies sesame.
1,Raymond,Lewis,87992,+351910686897,222-89-8955,raymond.lewis@hotmail.com,LCC,42,Bar danish Jelly bar wafer Ice oat.
2,Harold,Cook,80955,276700273,301-79-3505,harold.cook@gmail.com,MIA,68,Wafer Jelly beans danish wafer oat danish.
3,Laurie,Howard,90777,(351) 911086842,137-22-4547,laurie.howard@yahoo.com,LEIC,47,Wafer Ice pie.
4,Victoria,Morgan,48053,938258185,416-01-8330,victoria.morgan@hotmail.com,LEIC,50,Pie Ice danish Lollipop Lollipop.


In [3]:
if recognizer4.dict_global_entities['Course']:
    recognizer4.dict_global_entities['Course']['entity'] = 'COURSE'
    recognizer4.dict_global_entities['Course']['confidence_score'] = 1.0
else:
    recognizer4.dict_global_entities['Course'] = {'entity': 'COURSE', 'confidence_score': 1.0}

if recognizer4.dict_global_entities['Age']:
    recognizer4.dict_global_entities['Age']['entity'] = 'AGE'
    recognizer4.dict_global_entities['Age']['confidence_score'] = 1.0
else:
    recognizer4.dict_global_entities['Age'] = {'entity': 'AGE', 'confidence_score': 1.0}

recognizer4.dict_global_entities

{'First Name': {'entity': 'PERSON', 'confidence_score': 0.8552631578947368},
 'Last Name': {'entity': 'PERSON', 'confidence_score': 0.9879518072289156},
 'Postal Code': {'entity': 'ZIPCODE', 'confidence_score': 1.0},
 'Phone Number': {'entity': 'US_PASSPORT',
  'confidence_score': 0.18181818181818182},
 'Social Security Number': {'entity': 'US_SSN', 'confidence_score': 0.98},
 'Email': {'entity': 'EMAIL_ADDRESS', 'confidence_score': 1.0},
 'Course': {'entity': 'COURSE', 'confidence_score': 1.0},
 'Age': {'entity': 'AGE', 'confidence_score': 1.0},
 'Lorem Ipsum Costum': {'entity': 'LOCATION',
  'confidence_score': 0.21951219512195122}}
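The if/else blocks in the cell above follow the same pattern for every column: set the entity and a confidence of 1.0, whether or not an entry already exists. A minimal sketch of that pattern as a reusable function is shown below. The name set_entity is hypothetical and not part of NERPII, and the sketch works on a plain dictionary shaped like "dict_global_entities". Unlike the original cell, it replaces the whole entry instead of updating it in place, which gives the same result as long as entries only hold the 'entity' and 'confidence_score' keys.

```python
def set_entity(entities, column, entity, confidence_score=1.0):
    """Hypothetical helper: assign an entity to a column.

    Works whether the column currently maps to None or to an existing
    {'entity': ..., 'confidence_score': ...} entry.
    """
    entities[column] = {"entity": entity, "confidence_score": confidence_score}
    return entities


# Small dictionary shaped like recognizer4.dict_global_entities.
entities = {
    "Course": None,
    "Age": None,
    "Email": {"entity": "EMAIL_ADDRESS", "confidence_score": 1.0},
}

# Manual overrides for the columns the recognizer could not label.
for column, entity in {"Course": "COURSE", "Age": "AGE"}.items():
    set_entity(entities, column, entity)

print(entities)
```

The same loop would cover the Age, Source, and Industry overrides used later for the real dataset.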

In [4]:
from ExtendedFakerGenerator import ExtendedFakerGenerator

In [8]:
extended_faker_generator = ExtendedFakerGenerator(df4, recognizer4.dict_global_entities)
extended_faker_generator.get_faker_generation()

Column First Name synthesized with Faker.
Column Last Name synthesized with Faker.
Column Email synthesized with Faker.
Column Postal Code synthesized with Faker.
Column Social Security Number synthesized with Faker.
Column Phone Number not synthesized with Faker.
Column Lorem Ipsum Costum not synthesized with Faker.
Column Course not synthesized with Faker.
Column Age not synthesized with Faker.


In [9]:
df4.head()

Unnamed: 0,First Name,Last Name,Postal Code,Phone Number,Social Security Number,Email,Course,Age,Lorem Ipsum Costum
0,John,Johnson,41647,+351925022585,427-58-8350,john.johnson@gmail.com,LIACD,46,Lollipop Gummies sesame.
1,Ashley,Nelson,83626,+351910686897,085-44-5872,ashley.nelson@hotmail.com,LIACD,74,Bar danish Jelly bar wafer Ice oat.
2,Maria,Poole,79939,276700273,263-07-7493,maria.poole@yahoo.com,LCC,75,Wafer Jelly beans danish wafer oat danish.
3,Timothy,Sheppard,92369,(351) 911086842,197-23-0416,timothy.sheppard@hotmail.com,MIA,43,Wafer Ice pie.
4,Brittany,Gonzalez,95191,938258185,562-95-5277,brittany.gonzalez@gmail.com,MIA,45,Pie Ice danish Lollipop Lollipop.


## Applying to a Real Dataset

Now, let's apply the same process to a real dataset. We will use a dataset of the world's richest people (TopRichestInWorld.csv) and try to identify the PII in it.

In [2]:
df_rich_people = pd.read_csv("TopRichestInWorld.csv")

df_rich_people.head()

Unnamed: 0,First Name,NetWorth,Age,Country/Territory,Source,Industry
0,Elon Musk,"$219,000,000,000",50,United States,"Tesla, SpaceX",Automotive
1,Jeff Bezos,"$171,000,000,000",58,United States,Amazon,Technology
2,Bernard Arnault & family,"$158,000,000,000",73,France,LVMH,Fashion & Retail
3,Bill Gates,"$129,000,000,000",66,United States,Microsoft,Technology
4,Warren Buffett,"$118,000,000,000",91,United States,Berkshire Hathaway,Finance & Investments


In [3]:
recognizer_rich_people = NamedEntityRecognizer(df_rich_people)

recognizer_rich_people.assign_entities_with_presidio()
print(recognizer_rich_people.dict_global_entities)

recognizer_rich_people.assign_entities_manually()
print(recognizer_rich_people.dict_global_entities)

recognizer_rich_people.assign_organization_entity_with_model()
print(recognizer_rich_people.dict_global_entities)

{'First Name': {'entity': 'PERSON', 'confidence_score': 0.9883720930232558}, 'NetWorth': None, 'Age': None, 'Country/Territory': {'entity': 'LOCATION', 'confidence_score': 1.0}, 'Source': None, 'Industry': None}
{'First Name': {'entity': 'PERSON', 'confidence_score': 0.9883720930232558}, 'NetWorth': None, 'Age': None, 'Country/Territory': {'entity': 'LOCATION', 'confidence_score': 1.0}, 'Source': None, 'Industry': None}
{'First Name': {'entity': 'PERSON', 'confidence_score': 0.9883720930232558}, 'NetWorth': None, 'Age': None, 'Country/Territory': {'entity': 'LOCATION', 'confidence_score': 1.0}, 'Source': {'entity': 'ORGANIZATION', 'confidence_score': 0.39436619718309857}, 'Industry': {'entity': 'ORGANIZATION', 'confidence_score': 0.23039215686274508}}


In [4]:
faker_generator_rich_people = FakerGenerator(df_rich_people, recognizer_rich_people.dict_global_entities)
faker_generator_rich_people.get_faker_generation()

Column First Name synthesized with Faker.
Column Country/Territory synthesized with Faker.
Column Industry not synthesized with Faker.
Column Source not synthesized with Faker.


In [4]:
df_rich_people.head()

Unnamed: 0,First Name,NetWorth,Age,Country/Territory,Source,Industry
0,Elon Musk,"$219,000,000,000",50,United States,"Tesla, SpaceX",Automotive
1,Jeff Bezos,"$171,000,000,000",58,United States,Amazon,Technology
2,Bernard Arnault & family,"$158,000,000,000",73,France,LVMH,Fashion & Retail
3,Bill Gates,"$129,000,000,000",66,United States,Microsoft,Technology
4,Warren Buffett,"$118,000,000,000",91,United States,Berkshire Hathaway,Finance & Investments


In [5]:
if recognizer_rich_people.dict_global_entities['Age']:
    recognizer_rich_people.dict_global_entities['Age']['entity'] = 'AGE'
    recognizer_rich_people.dict_global_entities['Age']['confidence_score'] = 1.0
else:
    recognizer_rich_people.dict_global_entities['Age'] = {'entity': 'AGE', 'confidence_score': 1.0}

if recognizer_rich_people.dict_global_entities['Source']:
    recognizer_rich_people.dict_global_entities['Source']['entity'] = 'ORGANIZATION'
    recognizer_rich_people.dict_global_entities['Source']['confidence_score'] = 1.0
else:
    recognizer_rich_people.dict_global_entities['Source'] = {'entity': 'ORGANIZATION', 'confidence_score': 1.0}

if recognizer_rich_people.dict_global_entities['Industry']:
    recognizer_rich_people.dict_global_entities['Industry']['entity'] = 'INDUSTRY'
    recognizer_rich_people.dict_global_entities['Industry']['confidence_score'] = 1.0
else:
    recognizer_rich_people.dict_global_entities['Industry'] = {'entity': 'INDUSTRY', 'confidence_score': 1.0}

recognizer_rich_people.dict_global_entities

{'First Name': {'entity': 'PERSON', 'confidence_score': 0.9883720930232558},
 'NetWorth': None,
 'Age': {'entity': 'AGE', 'confidence_score': 1.0},
 'Country/Territory': {'entity': 'LOCATION', 'confidence_score': 1.0},
 'Source': {'entity': 'ORGANIZATION', 'confidence_score': 1.0},
 'Industry': {'entity': 'INDUSTRY', 'confidence_score': 1.0}}

In [10]:
from ExtendedFakerGenerator import ExtendedFakerGenerator

extended_faker_generator = ExtendedFakerGenerator(df_rich_people, recognizer_rich_people.dict_global_entities)
extended_faker_generator.get_faker_generation()

Column First Name synthesized with Faker.
Column Country/Territory synthesized with Faker.
Column Age not synthesized with Faker.
Column Source not synthesized with Faker.
Column Industry not synthesized with Faker.


In [11]:
df_rich_people.head()

Unnamed: 0,First Name,NetWorth,Age,Country/Territory,Source,Industry
0,Chad,"$219,000,000,000",51,Congo,Procter & Gamble,Construction
1,David,"$171,000,000,000",55,Qatar,Apple,Technology
2,Michael,"$158,000,000,000",21,Nigeria,Visa,Transportation
3,Timothy,"$129,000,000,000",40,Ethiopia,Google (Alphabet),Technology
4,William,"$118,000,000,000",26,Honduras,Roche,Energy
