## NERPII Tutorial


First, let's import the dependencies and the NERPII library itself.

The imports already reveal NERPII's code structure. It is split into two modules, **named_entity_recognizer** and **faker_generator**, which expose the classes **NamedEntityRecognizer** and **FakerGenerator**.
**NamedEntityRecognizer** calls Presidio and a BERT model to identify PII, and **FakerGenerator** uses Faker to generate new synthetic data.

In [30]:
import sys

# Machine-specific path to the site-packages directory where NERPII is installed;
# adjust or remove it for your environment.
sys.path.append('/opt/homebrew/lib/python3.11/site-packages')

import pandas as pd
from nerpii.named_entity_recognizer import NamedEntityRecognizer
from nerpii.faker_generator import FakerGenerator

With the imports in place, we generate a synthetic dataset using Faker.
For this example, we generate random first and last names, postal codes, phone numbers, SSNs in US format, and email addresses, so the dataset consists only of PII.

In [None]:
import csv
from faker import Faker

fake = Faker('en_US')  # The locale can be changed, for example to pt_PT.

output_file = 'random_data.csv'

fields = [
    "First Name", "Last Name", 
    "Postal Code", "Phone Number", 
    "Social Security Number", "Email"
]

num_records = 100

with open(output_file, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file) 
    writer.writerow(fields)  
    
    for _ in range(num_records):
        writer.writerow([
            fake.first_name(),
            fake.last_name(),
            fake.postcode(),
            fake.phone_number(),
            fake.ssn(),  
            fake.email()
        ])

print(f"CSV file '{output_file}' with {num_records} random records created successfully.")

CSV file 'random_data.csv' with 100 random records created successfully.


After that, we use NamedEntityRecognizer to try to identify the entity type of each column in the dataset.

In [None]:
df = pd.read_csv("random_data.csv")       
recognizer = NamedEntityRecognizer(df)


recognizer.assign_entities_with_presidio()
recognizer.assign_entities_manually()
recognizer.assign_organization_entity_with_model()

recognizer.dict_global_entities

{'First Name': {'entity': 'PERSON', 'confidence_score': 0.9574468085106383},
 'Last Name': {'entity': 'PERSON', 'confidence_score': 0.9878048780487805},
 'Postal Code': {'entity': 'ZIPCODE', 'confidence_score': 1.0},
 'Phone Number': {'entity': 'PHONE_NUMBER',
  'confidence_score': 0.7526881720430108},
 'Social Security Number': {'entity': 'US_SSN', 'confidence_score': 1.0},
 'Email': {'entity': 'EMAIL_ADDRESS', 'confidence_score': 1.0}}

As we can see, **dict_global_entities** is a dictionary keyed by column name. Each value holds the entity that NERPII (more specifically, Presidio and the BERT model) believes best matches the format of that column, together with a confidence score. For example, the "Email" column was identified as **EMAIL_ADDRESS** with 100% confidence, probably because email addresses follow the same format worldwide.
However, the "Phone Number" column was identified as **PHONE_NUMBER** with only 75% confidence, probably because phone numbers are written in many different formats.
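
Since each entry carries a confidence score, one way to use the dictionary is to flag columns that may deserve a manual check before synthesis. The snippet below is only a sketch: the dictionary is copied from the output above, while `REVIEW_THRESHOLD` and `columns_needing_review` are hypothetical names introduced for illustration and are not part of NERPII.

```python
# Dictionary copied from the recognizer output shown above.
dict_global_entities = {
    "First Name": {"entity": "PERSON", "confidence_score": 0.9574468085106383},
    "Last Name": {"entity": "PERSON", "confidence_score": 0.9878048780487805},
    "Postal Code": {"entity": "ZIPCODE", "confidence_score": 1.0},
    "Phone Number": {"entity": "PHONE_NUMBER", "confidence_score": 0.7526881720430108},
    "Social Security Number": {"entity": "US_SSN", "confidence_score": 1.0},
    "Email": {"entity": "EMAIL_ADDRESS", "confidence_score": 1.0},
}

# Hypothetical cut-off chosen for this example; it is not a NERPII default.
REVIEW_THRESHOLD = 0.9


def columns_needing_review(entities, threshold):
    """Return (column, entity, score) tuples whose score is below the threshold."""
    return [
        (column, info["entity"], info["confidence_score"])
        for column, info in entities.items()
        if info["confidence_score"] < threshold
    ]


low_confidence = columns_needing_review(dict_global_entities, REVIEW_THRESHOLD)
for column, entity, score in low_confidence:
    print(f"{column}: {entity} ({score:.0%})")
# prints: Phone Number: PHONE_NUMBER (75%)
```

With the scores from this run, only "Phone Number" falls below the assumed threshold.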

In [33]:
df

Unnamed: 0,First Name,Last Name,Postal Code,Phone Number,Social Security Number,Email
0,James,Bowen,72667,8625866053,357-27-7451,devans@example.com
1,Doris,Chambers,78502,431.881.8554x4344,593-74-5232,qhernandez@example.org
2,Jeffrey,Coleman,48087,(843)881-8470x770,438-61-8490,wellsamanda@example.net
3,Lauren,George,38251,(692)721-4505x344,763-98-6341,ryanburke@example.com
4,Wesley,Moran,4167,6694355908,580-64-9639,fordjoshua@example.com
...,...,...,...,...,...,...
95,Erica,Levy,70255,+1-942-799-8503,616-99-3581,dstout@example.net
96,Karen,Jimenez,71721,218-915-8446x64177,631-48-9497,adriennepena@example.org
97,Lisa,Walker,80926,(234)796-9337x34346,792-91-4606,bgonzalez@example.org
98,Karen,Mckenzie,1690,001-789-861-1861x91970,855-63-0273,ejones@example.org
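
Two postal codes in the table above (4167 and 1690) have only four digits. A likely explanation is that `pd.read_csv` inferred an integer type for the column and dropped the leading zeros. The sketch below reproduces that behavior on a two-row sample; the original values `04167` and `01690` are an assumption, since the CSV file itself is not shown.

```python
import io

import pandas as pd

# Two-row sample in the same layout as random_data.csv.
# The leading-zero values are assumed for illustration.
csv_text = "First Name,Postal Code\nWesley,04167\nKaren,01690\n"

# Default type inference turns the column into integers.
inferred = pd.read_csv(io.StringIO(csv_text))

# Forcing the column to str keeps the values exactly as written.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"Postal Code": str})

print(inferred["Postal Code"].tolist())  # [4167, 1690]
print(as_text["Postal Code"].tolist())   # ['04167', '01690']
```

If the postal codes need to keep their five-digit form, passing `dtype` for that column when loading the file is one option.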


In [34]:
faker_generator = FakerGenerator(df, recognizer.dict_global_entities)
faker_generator.get_faker_generation()

Column Phone Number synthesized with Faker.
Column First Name synthesized with Faker.
Column Last Name synthesized with Faker.
Column Email synthesized with Faker.
Column Postal Code synthesized with Faker.
Column Social Security Number synthesized with Faker.


In [35]:
df

Unnamed: 0,First Name,Last Name,Postal Code,Phone Number,Social Security Number,Email
0,Donald,Franklin,68763,339-730-4667x58034,869-38-2228,donald.franklin@gmail.com
1,Carlos,May,48758,+1-562-959-9719x2377,712-71-4691,carlos.may@hotmail.com
2,Morgan,Gutierrez,59750,694.982.8897x03520,127-87-0443,morgan.gutierrez@yahoo.com
3,Jason,Elliott,57469,7662076246,617-08-5511,jason.elliott@yahoo.com
4,Michael,Robinson,57198,001-693-616-8569x1195,592-87-3578,michael.robinson@yahoo.com
...,...,...,...,...,...,...
95,Tanya,Johnson,26264,(833)591-2972,474-14-1926,tanya.johnson@yahoo.com
96,Megan,Stone,78910,743.927.8310x0477,677-48-4430,megan.stone@gmail.com
97,George,Obrien,89381,+1-818-995-9373x12768,617-76-4260,george.obrien@gmail.com
98,Jill,Baker,13616,(592)776-0017,070-18-0148,jill.baker@hotmail.com


As we can see, the result is a dataset with the same columns, but every PII value has been replaced with synthetic data.
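
The claim can also be checked programmatically rather than by eye. The following is a minimal sketch that uses the first two rows of each table shown above as plain lists; `unchanged_cells` is a hypothetical helper, not a NERPII function. In the notebook, the displayed output suggests `df` is modified in place, so the same check on the full data would presumably need a copy taken before calling `get_faker_generation()`.

```python
columns = [
    "First Name", "Last Name", "Postal Code",
    "Phone Number", "Social Security Number", "Email",
]

# First two rows of the original and synthesized tables shown above.
original_rows = [
    ["James", "Bowen", "72667", "8625866053", "357-27-7451", "devans@example.com"],
    ["Doris", "Chambers", "78502", "431.881.8554x4344", "593-74-5232", "qhernandez@example.org"],
]
synthetic_rows = [
    ["Donald", "Franklin", "68763", "339-730-4667x58034", "869-38-2228", "donald.franklin@gmail.com"],
    ["Carlos", "May", "48758", "+1-562-959-9719x2377", "712-71-4691", "carlos.may@hotmail.com"],
]


def unchanged_cells(before, after, names):
    """Count, per column, how many cells kept their original value."""
    counts = {name: 0 for name in names}
    for old_row, new_row in zip(before, after):
        for name, old, new in zip(names, old_row, new_row):
            if old == new:
                counts[name] += 1
    return counts


leaks = unchanged_cells(original_rows, synthetic_rows, columns)
print(leaks)  # every count is 0 for these sample rows
```

A non-zero count would point to cells that survived synthesis unchanged, which can also happen by coincidence for common first names.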