## Evaluate a custom Presidio Analyzer using the Presidio Evaluator framework

This notebook demonstrates how to evaluate a Presidio instance using the presidio-evaluator framework. It builds upon [example 4](4_Evaluate_Presidio_Analyzer.ipynb), with changes to the `PresidioAnalyzer` instance to improve detection accuracy. For more information on customizing the Presidio Analyzer, see the [Presidio Analyzer documentation](https://microsoft.github.io/presidio/analyzer/) or this [tutorial](https://microsoft.github.io/presidio/tutorial/).

Steps:
1. Load dataset from file
2. Simple dataset statistics
3. Define the AnalyzerEngine object (and its parameters)
4. Align the dataset's entities to Presidio's entities
5. Set up the Evaluator object
6. Run experiment
7. Evaluate results
8. Error analysis

In [21]:
# install presidio evaluator via pip if not yet installed

#!pip install presidio-evaluator
#!pip install "presidio-analyzer[transformers]"

In [22]:
from pathlib import Path
from pprint import pprint
from collections import Counter
from typing import Dict, List
import json

from presidio_evaluator import InputSample
from presidio_evaluator.evaluation import Evaluator, ModelError
from presidio_evaluator.models import PresidioAnalyzerWrapper
from presidio_evaluator.experiment_tracking import get_experiment_tracker

import pandas as pd

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

%reload_ext autoreload
%autoreload 2
%matplotlib inline

## 1. Load dataset from file

In [23]:
dataset_name = "test_data.json"
dataset = InputSample.read_dataset_json(Path(Path.cwd().parent, "data modification", dataset_name))

print(len(dataset))



tokenizing input: 100%|██████████| 8701/8701 [03:47<00:00, 38.32it/s]


8701


In [24]:
def get_entity_counts(dataset: List[InputSample]) -> Counter:
    """Return a Counter with the number of tokens per entity tag."""
    entity_counter = Counter()
    for sample in dataset:
        for tag in sample.tags:
            entity_counter[tag] += 1
    return entity_counter


In [25]:
entity_counts = get_entity_counts(dataset)
print("Count per entity:")
pprint(entity_counts.most_common(), compact=True)

token_counts = [len(sample.tokens) for sample in dataset]
print(
    "\nMin and max number of tokens in dataset: "
    f"Min: {min(token_counts)}, Max: {max(token_counts)}"
)

text_lengths = [len(sample.full_text) for sample in dataset]
print(
    "Min and max sentence length in dataset: "
    f"Min: {min(text_lengths)}, Max: {max(text_lengths)}"
)

print("\nExample InputSample:")
print(dataset[1])

Count per entity:
[('O', 213255), ('USERAGENT', 6902), ('IPV6', 3927), ('FIRSTNAME', 2674),
 ('PHONEIMEI', 2497), ('IP', 2219), ('DATE', 2134), ('JOBTITLE', 1719),
 ('COMPANYNAME', 1628), ('PHONENUMBER', 1589), ('ACCOUNTNAME', 1449),
 ('DOB', 1161), ('STREET', 1157), ('SECONDARYADDRESS', 1141),
 ('NEARBYGPSCOORDINATE', 1044), ('LASTNAME', 1031), ('USERNAME', 1020),
 ('SSN', 988), ('ZIPCODE', 979), ('AGE', 971), ('GENDER', 966), ('MAC', 930),
 ('STATE', 880), ('TIME', 851), ('CITY', 850), ('COUNTY', 832), ('EMAIL', 798),
 ('PREFIX', 683), ('CURRENCY', 680), ('ACCOUNTNUMBER', 567),
 ('MIDDLENAME', 565), ('PASSWORD', 561), ('JOBTYPE', 561), ('JOBAREA', 557),
 ('IPV4', 557), ('BUILDINGNUMBER', 552), ('URL', 550), ('SEX', 545),
 ('CURRENCYSYMBOL', 500), ('CREDITCARDNUMBER', 493), ('BITCOINADDRESS', 479),
 ('AMOUNT', 466), ('MASKEDNUMBER', 410), ('IBAN', 395), ('HEIGHT', 363),
 ('EYECOLOR', 358), ('CURRENCYNAME', 328), ('CREDITCARDISSUER', 313),
 ('ETHEREUMADDRESS', 297), ('ORDINALDIRECTION'

In [26]:
print("A few example sentences containing each entity:\n")
for entity in entity_counts.keys():
    samples = [sample for sample in dataset if entity in set(sample.tags)]
    if len(samples) > 1 and entity != "O":
        print(f"Entity: <{entity}> two example sentences:\n"
              f"\n1) {samples[0].full_text}"
              f"\n2) {samples[1].full_text}"
              f"\n------------------------------------\n")

A few example sentences containing each entity:

Entity: <ZIPCODE> two example sentences:

1) 89200-3325 schools are next in line for education reform pilot program. Mobility team, prepare accordingly!
2) Students, please be informed that new 79281-1741-compliant changes have been made to our Male school uniform policy. We expect your absolute adherence to ensure a disciplined environment for Metrics studies.
------------------------------------

Entity: <JOBAREA> two example sentences:

1) 89200-3325 schools are next in line for education reform pilot program. Mobility team, prepare accordingly!
2) Hi Miss,
I have been reading about animal-assisted therapy and would love to know more from your perspective in Operations. Could you share the details of the programs at Adams County office? You can forward the info to Jude82.
------------------------------------

Entity: <FIRSTNAME> two example sentences:

1) Jessyca, you should compare our performance to the industry averages. This incl

In [27]:
from presidio_analyzer import AnalyzerEngine
# Loading the vanilla Analyzer Engine, with the default NER model.
analyzer_engine = AnalyzerEngine(default_score_threshold=0.4)

print("Supported entities for English:")
pprint(analyzer_engine.get_supported_entities("en"), compact=True)

print("\nLoaded recognizers for English:")
pprint([rec.name for rec in analyzer_engine.registry.get_recognizers("en", all_fields=True)], compact=True)

print("\nLoaded NER models:")
pprint(analyzer_engine.nlp_engine.models)

Supported entities for English:
['IN_VEHICLE_REGISTRATION', 'IN_VOTER', 'EMAIL_ADDRESS', 'IN_PASSPORT',
 'US_PASSPORT', 'US_ITIN', 'DATE_TIME', 'CRYPTO', 'CREDIT_CARD', 'IN_AADHAAR',
 'AU_TFN', 'AU_ABN', 'US_DRIVER_LICENSE', 'NRP', 'IBAN_CODE', 'PERSON',
 'IN_PAN', 'US_BANK_NUMBER', 'URL', 'AU_MEDICARE', 'MEDICAL_LICENSE', 'US_SSN',
 'PHONE_NUMBER', 'LOCATION', 'IP_ADDRESS', 'AU_ACN', 'SG_NRIC_FIN', 'UK_NHS',
 'ORGANIZATION']

Loaded recognizers for English:
['CreditCardRecognizer', 'UsBankRecognizer', 'UsLicenseRecognizer',
 'UsItinRecognizer', 'UsPassportRecognizer', 'UsSsnRecognizer', 'NhsRecognizer',
 'SgFinRecognizer', 'AuAbnRecognizer', 'AuAcnRecognizer', 'AuTfnRecognizer',
 'AuMedicareRecognizer', 'InPanRecognizer', 'InAadhaarRecognizer',
 'InVehicleRegistrationRecognizer', 'InPassportRecognizer', 'CryptoRecognizer',
 'DateRecognizer', 'EmailRecognizer', 'IbanRecognizer', 'IpRecognizer',
 'MedicalLicenseRecognizer', 'PhoneRecognizer', 'UrlRecognizer',
 'InVoterRecognizer', 'Sp

In [28]:

# Mapping from the dataset's entity types (keys) to the entity names used for evaluation (values).
presidio_entities_map1 = dict(
    # Person names
    FIRSTNAME="PERSON",
    LASTNAME="PERSON",
    MIDDLENAME="PERSON",
    PERSON="PERSON",
    # Dates and times
    DATE="DATE_TIME",
    TIME="DATE_TIME",
    DOB="DATE_TIME",
    DATE_TIME="DATE_TIME",
    # Email addresses
    EMAIL="EMAIL_ADDRESS",
    EMAIL_ADDRESS="EMAIL_ADDRESS",
    # Titles and prefixes
    PREFIX="TITLE",
    TITLE="TITLE",
    # URLs
    URL="URL",
    # Locations and address parts
    STREET="LOCATION",
    STATE="LOCATION",
    CITY="LOCATION",
    COUNTY="LOCATION",
    SECONDARYADDRESS="LOCATION",
    BUILDINGNUMBER="LOCATION",
    ORDINALDIRECTION="LOCATION",
    LOCATION="LOCATION",
    # Phone numbers
    PHONEIMEI="PHONE_NUMBER",
    PHONENUMBER="PHONE_NUMBER",
    PHONE_NUMBER="PHONE_NUMBER",
    # IP addresses
    IPV4="IP_ADDRESS",
    IPV6="IP_ADDRESS",
    IP="IP_ADDRESS",
    IP_ADDRESS="IP_ADDRESS",
    # Credit cards
    CREDITCARDNUMBER="CREDIT_CARD",
    MASKEDNUMBER="CREDIT_CARD",
    CREDIT_CARD="CREDIT_CARD",
    # ZIP codes
    ZIPCODE="ZIP_CODE",
    ZIP_CODE="ZIP_CODE",
    # Organizations
    COMPANYNAME="ORGANIZATION",
    ORGANIZATION="ORGANIZATION",
    # IBANs
    IBAN="IBAN_CODE",
    IBAN_CODE="IBAN_CODE",
    # US social security numbers
    SSN="US_SSN",
    US_SSN="US_SSN",
    # Age
    AGE="AGE",
    # Entity types excluded from this evaluation are mapped to "O" (non-entity)
    AMOUNT="O",
    USERNAME="O",
    JOBTITLE="O",
    JOBAREA="O",
    ACCOUNTNAME="O",
    ACCOUNTNUMBER="O",
    JOBTYPE="O",
    CURRENCYSYMBOL="O",
    PASSWORD="O",
    SEX="O",
    GENDER="O",
    BITCOINADDRESS="O",
    USERAGENT="O",
    CURRENCY="O",
    ETHEREUMADDRESS="O",
    NEARBYGPSCOORDINATE="O",
    CREDITCARDISSUER="O",
    MAC="O",
    VEHICLEVRM="O",
    EYECOLOR="O",
    CREDITCARDCVV="O",
    HEIGHT="O",
    LITECOINADDRESS="O",
    VEHICLEVIN="O",
    CURRENCYCODE="O",
    CURRENCYNAME="O",
    BIC="O",
    PIN="O",
    O="O",
)
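
To make the effect of this mapping concrete, here is a minimal sketch of how such a dictionary could translate a sample's token tags. The `align_tags` helper and the toy inputs are hypothetical, and the sketch only approximates what `Evaluator.align_entity_types` does in the alignment step. In particular, falling back to `"O"` for unmapped tags is an assumption of this sketch, not a documented behavior of the library.

```python
from typing import Dict, List


def align_tags(tags: List[str], mapping: Dict[str, str], allow_missing: bool = True) -> List[str]:
    """Translate dataset tags into target entity names (hypothetical helper)."""
    aligned = []
    for tag in tags:
        if tag in mapping:
            aligned.append(mapping[tag])
        elif allow_missing:
            # Assumption: tags without a mapping are treated as non-entities.
            aligned.append("O")
        else:
            raise ValueError(f"No mapping found for tag {tag!r}")
    return aligned


# Toy subset of the mapping, plus one tag ("UNKNOWN") that has no mapping
toy_mapping = {"FIRSTNAME": "PERSON", "ZIPCODE": "ZIP_CODE", "JOBAREA": "O", "O": "O"}
toy_tags = ["FIRSTNAME", "O", "ZIPCODE", "JOBAREA", "UNKNOWN"]

aligned_tags = align_tags(toy_tags, toy_mapping)
print(aligned_tags)
```

Several fine-grained dataset types collapse into one target entity, so per-entity scores after alignment describe the merged categories rather than the original ones.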







In [29]:
# Use the custom mapping defined above instead of the wrapper's default:
# entities_mapping = PresidioAnalyzerWrapper.presidio_entities_map
entities_mapping = presidio_entities_map1
print("Using this mapping between the dataset and Presidio's entities:")
pprint(entities_mapping, compact=True)


dataset = Evaluator.align_entity_types(
    dataset, 
    entities_mapping=entities_mapping, 
    allow_missing_mappings=True
)
new_entity_counts = get_entity_counts(dataset)
print("\nCount per entity after alignment:")
pprint(new_entity_counts.most_common(), compact=True)

dataset_entities = list(new_entity_counts.keys())


Using this mapping between the dataset and Presidio's entities:
{'ACCOUNTNAME': 'O',
 'ACCOUNTNUMBER': 'O',
 'AGE': 'AGE',
 'AMOUNT': 'O',
 'BIC': 'O',
 'BITCOINADDRESS': 'O',
 'BUILDINGNUMBER': 'LOCATION',
 'CITY': 'LOCATION',
 'COMPANYNAME': 'ORGANIZATION',
 'COUNTY': 'LOCATION',
 'CREDITCARDCVV': 'O',
 'CREDITCARDISSUER': 'O',
 'CREDITCARDNUMBER': 'CREDIT_CARD',
 'CREDIT_CARD': 'CREDIT_CARD',
 'CURRENCY': 'O',
 'CURRENCYCODE': 'O',
 'CURRENCYNAME': 'O',
 'CURRENCYSYMBOL': 'O',
 'DATE': 'DATE_TIME',
 'DATE_TIME': 'DATE_TIME',
 'DOB': 'DATE_TIME',
 'EMAIL': 'EMAIL_ADDRESS',
 'EMAIL_ADDRESS': 'EMAIL_ADDRESS',
 'ETHEREUMADDRESS': 'O',
 'EYECOLOR': 'O',
 'FIRSTNAME': 'PERSON',
 'GENDER': 'O',
 'HEIGHT': 'O',
 'IBAN': 'IBAN_CODE',
 'IBAN_CODE': 'IBAN_CODE',
 'IP': 'IP_ADDRESS',
 'IPV4': 'IP_ADDRESS',
 'IPV6': 'IP_ADDRESS',
 'IP_ADDRESS': 'IP_ADDRESS',
 'JOBAREA': 'O',
 'JOBTITLE': 'O',
 'JOBTYPE': 'O',
 'LASTNAME': 'PERSON',
 'LITECOINADDRESS': 'O',
 'LOCATION': 'LOCATION',
 'MAC': 'O',
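
Note that some mapping targets, such as `ZIP_CODE`, `TITLE` and `AGE`, do not appear in the list of supported entities printed for the vanilla analyzer, so that analyzer cannot predict them. The sketch below shows the kind of pattern a custom recognizer for `ZIP_CODE` might rely on. The regular expression is an assumption (US-style ZIP codes only), and the sample texts are shortened from examples shown in this notebook. Wrapping the pattern in a Presidio `PatternRecognizer` and registering it with the analyzer is not shown here.

```python
import re

# Assumption: US-style ZIP codes, five digits with an optional four-digit suffix.
# This pattern also matches any standalone five-digit number, so it is a starting point only.
ZIP_CODE_PATTERN = re.compile(r"\b\d{5}(?:-\d{4})?\b")

texts = [
    "Patient Demarcus Rau with a secondary address of Apt. 489 and Post Code 85119-2571.",
    "89200-3325 schools are next in line for education reform pilot program.",
]

# Collect all ZIP-like matches per text
zip_matches = [ZIP_CODE_PATTERN.findall(text) for text in texts]
print(zip_matches)
```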
 

In [30]:
# Set up the experiment tracker to log the experiment for reproducibility
experiment = get_experiment_tracker()
 
# Create a wrapper for Presidio to be used within the presidio-evaluator framework
model = PresidioAnalyzerWrapper(analyzer_engine, 
                                entity_mapping=entities_mapping)

# Create the evaluator object
evaluator = Evaluator(model=model)


# Track model and dataset params
params = {"dataset_name": dataset_name, "model_name": model.name}
params.update(model.to_log())
experiment.log_parameters(params)
experiment.log_dataset_hash(dataset)
experiment.log_parameter("entity_mappings", json.dumps(entities_mapping))

--------
Entities supported by this Presidio Analyzer instance:
IN_VEHICLE_REGISTRATION, IN_VOTER, EMAIL_ADDRESS, IN_PASSPORT, US_PASSPORT, US_ITIN, DATE_TIME, CRYPTO, CREDIT_CARD, IN_AADHAAR, AU_TFN, AU_ABN, US_DRIVER_LICENSE, NRP, IBAN_CODE, PERSON, IN_PAN, US_BANK_NUMBER, URL, AU_MEDICARE, MEDICAL_LICENSE, US_SSN, PHONE_NUMBER, LOCATION, IP_ADDRESS, AU_ACN, SG_NRIC_FIN, UK_NHS, ORGANIZATION


In [31]:
## Run experiment

evaluation_results = evaluator.evaluate_all(dataset)
results = evaluator.calculate_score(evaluation_results)

# Track experiment results
experiment.log_metrics(results.to_log())
entities, confmatrix = results.to_confusion_matrix()
experiment.log_confusion_matrix(matrix=confmatrix, 
                                labels=entities)

# Plot output
plotter = evaluator.Plotter(model=model,
                            results=results,
                            output_folder=".",
                            model_name=model.name,
                            beta=2)


# end experiment
experiment.end()

Mapping entity values using this dictionary: {'FIRSTNAME': 'PERSON', 'LASTNAME': 'PERSON', 'MIDDLENAME': 'PERSON', 'PERSON': 'PERSON', 'DATE': 'DATE_TIME', 'TIME': 'DATE_TIME', 'DOB': 'DATE_TIME', 'DATE_TIME': 'DATE_TIME', 'EMAIL': 'EMAIL_ADDRESS', 'EMAIL_ADDRESS': 'EMAIL_ADDRESS', 'PREFIX': 'TITLE', 'TITLE': 'TITLE', 'URL': 'URL', 'STREET': 'LOCATION', 'STATE': 'LOCATION', 'CITY': 'LOCATION', 'COUNTY': 'LOCATION', 'SECONDARYADDRESS': 'LOCATION', 'BUILDINGNUMBER': 'LOCATION', 'ORDINALDIRECTION': 'LOCATION', 'LOCATION': 'LOCATION', 'PHONEIMEI': 'PHONE_NUMBER', 'PHONENUMBER': 'PHONE_NUMBER', 'PHONE_NUMBER': 'PHONE_NUMBER', 'IPV4': 'IP_ADDRESS', 'IPV6': 'IP_ADDRESS', 'IP': 'IP_ADDRESS', 'IP_ADDRESS': 'IP_ADDRESS', 'CREDITCARDNUMBER': 'CREDIT_CARD', 'MASKEDNUMBER': 'CREDIT_CARD', 'CREDIT_CARD': 'CREDIT_CARD', 'ZIPCODE': 'ZIP_CODE', 'ZIP_CODE': 'ZIP_CODE', 'COMPANYNAME': 'ORGANIZATION', 'ORGANIZATION': 'ORGANIZATION', 'IBAN': 'IBAN_CODE', 'IBAN_CODE': 'IBAN_CODE', 'SSN': 'US_SSN', 'US_SSN':
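
The `beta` value of 2 passed to the plotter presumably refers to the F-beta score, which weights recall more heavily than precision when beta is greater than 1. The following sketch shows the standard formula; the `f_beta` function and the precision and recall values are hypothetical and are not taken from this experiment.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Standard F-beta score: recall counts beta times as much as precision."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Hypothetical values: the same two numbers, swapped between precision and recall
high_recall = f_beta(precision=0.5, recall=1.0)
high_precision = f_beta(precision=1.0, recall=0.5)
print(round(high_recall, 4), round(high_precision, 4))
```

With beta set to 2, the high-recall case scores noticeably higher than the high-precision case, which suits PII detection, where a missed entity is usually costlier than a false alarm.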

In [32]:
plotter.plot_scores()

In [33]:
plotter.plot_confusion_matrix(entities=entities, confmatrix=confmatrix)
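
A token-level confusion matrix boils down to counting (annotation, prediction) pairs. The sketch below illustrates this with hypothetical pairs. It is not how `results.to_confusion_matrix()` is implemented, only an illustration of what each cell in the plot represents.

```python
from collections import Counter

# Hypothetical (annotation, prediction) pairs, one per token
token_pairs = [
    ("O", "O"),
    ("O", "PERSON"),            # false positive for PERSON
    ("PERSON", "PERSON"),       # correct prediction
    ("CREDIT_CARD", "O"),       # false negative for CREDIT_CARD
    ("CREDIT_CARD", "O"),
]

# Each distinct pair corresponds to one cell of the confusion matrix
confusion_counts = Counter(token_pairs)
for (annotation, prediction), count in sorted(confusion_counts.items()):
    print(f"annotated={annotation:<12} predicted={prediction:<12} count={count}")
```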

In [34]:
plotter.plot_most_common_tokens()

In [35]:
errors = results.model_errors

In [41]:
fps_df = ModelError.get_fps_dataframe(errors, entity=["PERSON"])
fps_df[["full_text", "token", "annotation", "prediction"]]

| | full_text | token | annotation | prediction |
|---|---|---|---|---|
| 0 | "To guarantee the fairness of our assessments we monitor data points such as FJD and 257k utilized for transactions. Do note that revealing your a4O0ARjeKE9G or 0544 to third parties breaks our security policy." | FJD | O | PERSON |
| 1 | A report from our Planner revealed missed payments of property tax for 35663729. Please, don't ignore this and make a payment of 211,884.23 KM. | 211884.23 | O | PERSON |
| 2 | A report from our Planner revealed missed payments of property tax for 35663729. Please, don't ignore this and make a payment of 211,884.23 KM. | KM | O | PERSON |
| 3 | We’ve had breakthrough in diagnostic technology as per Consultant Nico Rath. The details can be accessed from https://vivid-processing.name. Please verify the system with the Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/3.1; .NET CLR 3.7.26850.4). | Mozilla/5.0 | O | PERSON |
| 4 | Dear team, a Male to female trans woman artist, username: Eleanore.Schimmel52 from 61.211.197.137 signed up for our digital arts workshop happening next month. Please make all necessary arrangements. | Eleanore | O | PERSON |
| 5 | Patient Demarcus Rau with a secondary address of Apt. 489 and Post Code 85119-2571. Their connection details: User Agent: Mozilla/5.0 (Windows NT 5.0; rv:14.0) Gecko/20100101 Firefox/14.0.8 and phone IMEI: 64-818844-223828-4 | Mozilla/5.0 | O | PERSON |
| 6 | I spotted a potential patent infringement while auditing the web, and it's under this IP address: 21.43.7.22. I trust to hear your thoughts on how to react, Ms.. | Ms | O | PERSON |
| 7 | Dear Yasmine_Toy, due technical issue, your 10 AM telemedicine appointment has been rescheduled. | Yasmine_Toy | O | PERSON |
| 8 | We're carrying out system updates. Kindly update your qpZRCkD7NKtt at the earliest, Amya. | qpZRCkD7NKtt | O | PERSON |
| 9 | Dear Destany, what is the ratio of special education teachers to students in Buckinghamshire? | Dear | O | PERSON |
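
Scanning such rows by eye gets tedious, so it can help to count which tokens are most often mistaken for `PERSON`. The sketch below does this with the tokens copied from the rows shown above. On the full DataFrame, something like `fps_df["token"].value_counts()` would likely serve the same purpose.

```python
from collections import Counter

# Tokens copied from the false-positive rows shown above
fp_tokens = [
    "FJD", "211884.23", "KM", "Mozilla/5.0", "Eleanore",
    "Mozilla/5.0", "Ms", "Yasmine_Toy", "qpZRCkD7NKtt", "Dear",
]

# Count how often each token was wrongly predicted as PERSON
fp_token_counts = Counter(fp_tokens)
most_common_token, occurrences = fp_token_counts.most_common(1)[0]
print(f"Most frequent false-positive token: {most_common_token} ({occurrences} times)")
```

Recurring tokens such as user-agent fragments are good candidates for a deny list or for a context-aware recognizer.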


In [40]:
fns_df = ModelError.get_fns_dataframe(errors, entity=["CREDIT_CARD"])
fns_df[["full_text", "token", "annotation", "prediction"]]

| | full_text | token | annotation | prediction |
|---|---|---|---|---|
| 0 | I need to order a new EKG machine. The invoice should be sent to our building number, 841 and make sure they charge it to the company card (6036994285533224). | 6036994285533224 | CREDIT_CARD | O |
| 1 | For more details about the scholarship provided by our university for Integration, please contact us at Suite 636. Use reference number 4805690419738918. | 4805690419738918 | CREDIT_CARD | O |
| 2 | Due to a new security update, 356 74 8854 and 6341731761145531 can no longer be used to authenticate your account. Please use your EMR number for future appointments. | 6341731761145531 | CREDIT_CARD | O |
| 3 | Hello, your transaction with Harris - Dicki associated with 6421787706555142 was successful. Welcome to our extended family. | 6421787706555142 | CREDIT_CARD | O |
| 4 | Hello, Lambert! A single step today might lead to significant career shifts tomorrow. Gaining insight through a session with us might just be that step. Our counselor usually asks for the 3908656268910792 at the end of the registration process. | 3908656268910792 | CREDIT_CARD | O |
| 5 | Due to privacy regulations, we need to mask some phone numbers (9258975595688208) and SSN (75632877023). Let the IT department know if you need any assistance in handling private data. | 9258975595688208 | CREDIT_CARD | O |
| 6 | We have launched our new ultrasound diagnostic machine. An invoice of NZD873.85k will be issued to the following credit card: 2403093735349307. | 2403093735349307 | CREDIT_CARD | O |
| 7 | Tiffany, Goldner - Runte needs confirmation regarding your attendance for a meeting on 5/82. Call number 7927178637037986 for details. | 7927178637037986 | CREDIT_CARD | O |
| 8 | Record your daily health progress on our portal via https://winged-method.com, login with your unique 75611599693 and 4117940219039298. | 4117940219039298 | CREDIT_CARD | O |
| 9 | Marilyne our educational leadership program aligns with your vision. Clarify doubts at Zoom meeting with meeting id 7197894361626822. | 7197894361626822 | CREDIT_CARD | O |


In [None]:
# Sanity check: run the model on a single sentence containing a 16-digit number
sent = "Number is 7075373064434325"
model.predict(InputSample(full_text=sent))

['O', 'O', 'O', 'O']
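
One plausible explanation for the all-`O` prediction, not verified against this run, is that the number is not a valid credit card number. Presidio's built-in credit card recognizer is understood to validate candidates with a Luhn checksum; treat that as an assumption here. The `luhn_valid` helper below is a hypothetical, standalone implementation of the checksum, applied to the number from the cell above and to a commonly used test card number.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


notebook_number_ok = luhn_valid("7075373064434325")  # number from the cell above
test_number_ok = luhn_valid("4111111111111111")      # widely used test card number
print(notebook_number_ok, test_number_ok)  # False True
```

If the synthetic card numbers in the dataset fail this check, the missed `CREDIT_CARD` tokens may reflect how the data was generated rather than a weakness in the recognizer.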