Presidio (from Latin praesidium ‘protection, garrison’) helps ensure that sensitive data is properly managed and governed. It provides fast identification and anonymization modules for private entities in text and images, such as credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers, financial data and more.


In [4]:
pip install presidio_analyzer

Note: you may need to restart the kernel to use updated packages.


#### Simple flow
A simple call to Presidio Analyzer:

In [6]:
from presidio_analyzer import AnalyzerEngine

text = "His name is Mr. Jones and his phone number is 212-555-5555"

analyzer = AnalyzerEngine()
analyzer_results = analyzer.analyze(text=text, language="en")

print(analyzer_results)

[type: PERSON, start: 16, end: 21, score: 0.85, type: PHONE_NUMBER, start: 46, end: 58, score: 0.75]


### Example 1: Deny-list based PII recognition

In this example, I will pass a short list of tokens that should be marked as PII if detected. First, let's define the tokens we want to treat as PII. In this case it is a list of titles:


In [7]:
titles_list = [
    "Sir",
    "Ma'am",
    "Madam",
    "Mr.",
    "Mrs.",
    "Ms.",
    "Miss",
    "Dr.",
    "Professor",
]

Second, let's create a **PatternRecognizer** which would scan for those titles, by passing a deny_list:

In [8]:
from presidio_analyzer import PatternRecognizer

titles_recognizer = PatternRecognizer(supported_entity="TITLE", deny_list=titles_list)

In [9]:
## Calling our recognizer directly
text1 = "I suspect Professor Plum, in the Dining Room, with the candlestick"
result = titles_recognizer.analyze(text1, entities=["TITLE"])
print(f"Result:\n {result}")

Result:
 [type: TITLE, start: 10, end: 19, score: 1.0]


Next, we add this recognizer to the list of recognizers used by the Presidio **AnalyzerEngine**:

In [10]:
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(titles_recognizer)

**LEARNING POINT** - At initialization, Presidio loads all available recognizers, including the NlpEngine used to detect entities and to extract tokens, lemmas and other linguistic features.

In [12]:
results = analyzer.analyze(text=text1, language="en")
print("Results : ")
print(results)

Results : 
[type: TITLE, start: 10, end: 19, score: 1.0, type: PERSON, start: 20, end: 24, score: 0.85]


In [13]:
print("Identified these PII entities:")
for result in results:
    print(f"- {text1[result.start:result.end]} as {result.entity_type}")

Identified these PII entities:
- Professor as TITLE
- Plum as PERSON


As expected, both the name "Plum" and the title were identified as PII.
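To make the deny-list mechanism more concrete, the following is a rough standard-library sketch of how a list of tokens can be turned into a single regular expression that only matches whole tokens. The helper names `deny_list_to_regex` and `find_deny_list_matches` are hypothetical, and the exact pattern Presidio builds internally may differ; this only illustrates the idea.

```python
import re

titles_list = ["Sir", "Ma'am", "Madam", "Mr.", "Mrs.", "Ms.", "Miss", "Dr.", "Professor"]


def deny_list_to_regex(deny_list):
    """Build one alternation that matches any deny-list entry as a whole token.

    Assumption: this approximates the deny-list idea, not Presidio's exact code.
    """
    escaped = [re.escape(item) for item in deny_list]  # "Mr." must match a literal dot
    # An entry must start at the beginning of the text or after a non-word character,
    # and must end before a non-word character or at the end of the text.
    return r"(?:^|(?<=\W))(" + "|".join(escaped) + r")(?:(?=\W)|$)"


def find_deny_list_matches(text, deny_list):
    """Return (matched_text, start, end) for every deny-list hit in `text`."""
    pattern = re.compile(deny_list_to_regex(deny_list))
    return [(m.group(1), m.start(1), m.end(1)) for m in pattern.finditer(text)]


text1 = "I suspect Professor Plum, in the Dining Room, with the candlestick"
matches = find_deny_list_matches(text1, titles_list)
print(matches)
```

The span found for "Professor" (10 to 19) is the same one the `titles_recognizer` reported above. A plain regex like this has no notion of entity type or score, which is what the **PatternRecognizer** adds.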

### Example 2: Regular-expressions based PII recognition

Another simple recognizer we can add is based on regular expressions.

In [16]:
from presidio_analyzer import Pattern, PatternRecognizer

# Define the regex pattern in a Presidio `Pattern` object:
numbers_pattern = Pattern(name="numbers_pattern", regex=r"\d+", score=0.5)  # raw string avoids an invalid-escape warning

# Define the recognizer with one or more patterns
number_recognizer = PatternRecognizer(
    supported_entity="NUMBER", patterns=[numbers_pattern]
)

text2 = "I live in 510 Broad st."

numbers_result = number_recognizer.analyze(text=text2, entities=["NUMBER"])

print("Result:")
print(numbers_result)

Result:
[type: NUMBER, start: 10, end: 13, score: 0.5]


**LEARNING POINT**: A recognizer can hold several regex patterns, each with its own name and score, so that results can be distinguished accordingly.
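As an illustration of using several patterns with different scores, here is a small standard-library sketch. The `(name, regex, score)` triples mirror the fields of a Presidio `Pattern`, but the pattern names, the scores and the `scan` helper are assumptions made for this example; in Presidio the equivalent would be to pass several `Pattern` objects in the `patterns` list of a `PatternRecognizer`.

```python
import re

# Hypothetical (name, regex, score) triples, mirroring the fields of a Pattern.
patterns = [
    ("short_number", r"\b\d{1,3}\b", 0.3),  # 1-3 digits: weak evidence
    ("five_digits", r"\b\d{5}\b", 0.6),     # exactly 5 digits: somewhat stronger
]


def scan(text, patterns):
    """Run every pattern over the text and tag each hit with its pattern name and score."""
    results = []
    for name, regex, score in patterns:
        for m in re.finditer(regex, text):
            results.append(
                {"pattern": name, "start": m.start(), "end": m.end(), "score": score}
            )
    # Sort by position so the hits read in text order.
    return sorted(results, key=lambda r: r["start"])


hits = scan("I live in 510 Broad st., zip 90210", patterns)
for hit in hits:
    print(hit)
```

Each hit carries the name and score of the pattern that produced it, so downstream code can treat a five-digit match differently from a short number.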

### Example 3: Rule based logic recognizer

This example detects numbers in either numerical or alphabetic (e.g. "forty five") form:


In [19]:
from typing import List
from presidio_analyzer import EntityRecognizer, RecognizerResult
from presidio_analyzer.nlp_engine import NlpArtifacts


class NumbersRecognizer(EntityRecognizer):

    expected_confidence_level = 0.7  # expected confidence level for this recognizer

    def load(self) -> None:
        """No loading is required."""
        pass

    def analyze(
        self, text: str, entities: List[str], nlp_artifacts: NlpArtifacts
    ) -> List[RecognizerResult]:
        """
        Analyzes text to find tokens which represent numbers (either 123 or One Two Three).
        """
        results = []

        # iterate over the spaCy tokens, and call `token.like_num`
        for token in nlp_artifacts.tokens:
            if token.like_num:
                result = RecognizerResult(
                    entity_type="NUMBER",
                    start=token.idx,
                    end=token.idx + len(token),
                    score=self.expected_confidence_level,
                )
                results.append(result)
        return results

# Instantiate the new NumbersRecognizer:
new_numbers_recognizer = NumbersRecognizer(supported_entities=["NUMBER"])

Since this recognizer requires the NlpArtifacts, we would have to call it as part of the AnalyzerEngine flow:

In [20]:
from presidio_analyzer import AnalyzerEngine

text3 = "Roberto lives in Five 10 Broad st."
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(new_numbers_recognizer)

numbers_results2 = analyzer.analyze(text=text3, language="en")
print("Results:")
print("\n".join([str(res) for res in numbers_results2]))

Results:
type: PERSON, start: 0, end: 7, score: 0.85
type: NUMBER, start: 17, end: 21, score: 0.7
type: NUMBER, start: 22, end: 24, score: 0.7


### Example 4: Supporting new models and languages

In [21]:
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider

# import spacy
# spacy.cli.download("es_core_news_md")

# Create configuration containing engine name and models
configuration = {
    "nlp_engine_name": "spacy",
    "models": [
        {"lang_code": "es", "model_name": "es_core_news_md"},
        {"lang_code": "en", "model_name": "en_core_web_lg"},
    ],
}

# Create NLP engine based on configuration
provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine_with_spanish = provider.create_engine()

# Pass the created NLP engine and supported_languages to the AnalyzerEngine
analyzer = AnalyzerEngine(
    nlp_engine=nlp_engine_with_spanish, supported_languages=["en", "es"]
)

# Analyze in different languages
results_spanish = analyzer.analyze(text="Mi nombre es Morris", language="es")
print("Results from Spanish request:")
print(results_spanish)

results_english = analyzer.analyze(text="My name is Morris", language="en")
print("Results from English request:")
print(results_english)

Results from Spanish request:
[type: PERSON, start: 13, end: 19, score: 0.85]
Results from English request:
[type: PERSON, start: 11, end: 17, score: 0.85]


### Example 5: Leveraging context words

In this example we build a US_ZIP_CODE recognizer from a deliberately weak regex pattern:

In [22]:
from presidio_analyzer import (
    Pattern,
    PatternRecognizer,
    RecognizerRegistry,
    AnalyzerEngine,
)

# Define the regex pattern
regex = r"(\b\d{5}(?:\-\d{4})?\b)"  # very weak regex pattern
zipcode_pattern = Pattern(name="zip code (weak)", regex=regex, score=0.01)

# Define the recognizer with the defined pattern
zipcode_recognizer = PatternRecognizer(
    supported_entity="US_ZIP_CODE", patterns=[zipcode_pattern]
)

registry = RecognizerRegistry()
registry.add_recognizer(zipcode_recognizer)
analyzer = AnalyzerEngine(registry=registry)

# Test
results = analyzer.analyze(text="My zip code is 90210", language="en")

print(f"Result:\n {results}")

Result:
 [type: US_ZIP_CODE, start: 15, end: 20, score: 0.01]


This works, but the pattern would match any five-digit string, which is why we set the score to 0.01. Let's use context words to increase the score:

In [23]:
from presidio_analyzer import PatternRecognizer, AnalyzerEngine, RecognizerRegistry


zipcode_recognizer_w_context = PatternRecognizer(
    supported_entity="US_ZIP_CODE",
    patterns=[zipcode_pattern],
    context=["zip", "zipcode"],
)

registry = RecognizerRegistry()
registry.add_recognizer(zipcode_recognizer_w_context)
analyzer = AnalyzerEngine(registry=registry)

# Test
results = analyzer.analyze(text="My zip code is 90210", language="en")
print("Result:")
print(results)

Result:
[type: US_ZIP_CODE, start: 15, end: 20, score: 0.4]


**LEARNING POINT** -
1. If no context-aware enhancer is passed, **AnalyzerEngine** creates a **LemmaContextAwareEnhancer** by default. It raises the score of each matched result when the recognizer holds context words and the lemmas of those words are found in the surroundings of the matched entity.

2. The default context similarity factor of **LemmaContextAwareEnhancer** is 0.35 and the default minimum score with context similarity is 0.4. Here 0.01 + 0.35 = 0.36 is below that minimum, so the score is raised to 0.4.

3. These defaults can be changed by passing other values to the context_similarity_factor and min_score_with_context_similarity parameters of the LemmaContextAwareEnhancer object.

In [24]:
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.context_aware_enhancers import LemmaContextAwareEnhancer

context_aware_enhancer = LemmaContextAwareEnhancer(
    context_similarity_factor=0.45, min_score_with_context_similarity=0.4
)

registry = RecognizerRegistry()
registry.add_recognizer(zipcode_recognizer_w_context)
analyzer = AnalyzerEngine(
    registry=registry, context_aware_enhancer=context_aware_enhancer
)

# Test
results = analyzer.analyze(text="My zip code is 90210", language="en")
print("Result:")
print(results)

Result:
[type: US_ZIP_CODE, start: 15, end: 20, score: 0.46]
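The scores 0.4 and 0.46 seen in the last two outputs are consistent with simple arithmetic: add the similarity factor, raise the result to the minimum score if needed, and cap it at 1.0. The function below is a sketch of that arithmetic under this assumption; `enhance_score` is a hypothetical helper, not part of the Presidio API, and it ignores how context words are actually located in the text.

```python
def enhance_score(
    base_score,
    context_similarity_factor=0.35,
    min_score_with_context_similarity=0.4,
    max_score=1.0,
):
    """Sketch of the score boost applied when a context word is found (assumed logic)."""
    boosted = base_score + context_similarity_factor
    # Never report less than the configured minimum once context supports the match.
    boosted = max(boosted, min_score_with_context_similarity)
    # A confidence score cannot exceed the maximum.
    return min(boosted, max_score)


default_boost = enhance_score(0.01)  # defaults: factor 0.35, minimum 0.4
custom_boost = enhance_score(0.01, context_similarity_factor=0.45)
print(round(default_boost, 2), round(custom_boost, 2))
```

With the defaults, 0.01 + 0.35 falls short of the 0.4 minimum, so the minimum wins; with a factor of 0.45 the sum already exceeds the minimum and is used as is.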


**LEARNING POINT**:
- In addition to surrounding words, **additional context words can be passed at the request level**. This is useful when context comes from metadata such as column names or specific user input. In the following example, notice how the "zip" context word does not appear in the text but still raises the confidence score from 0.01 to 0.4.

In [36]:
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry, PatternRecognizer
import pprint
# Define the recognizer with the defined pattern and context words
zipcode_recognizer = PatternRecognizer(
    supported_entity="US_ZIP_CODE",
    patterns=[zipcode_pattern],
    context=["zip", "zipcode"],
)
registry = RecognizerRegistry()
registry.add_recognizer(zipcode_recognizer)
analyzer = AnalyzerEngine(registry=registry)

# Test with an example record having a column name which could be injected as context
record = {"column_name": "zip", "text": "My code is 90210"}

result = analyzer.analyze(
    text=record["text"],
    language="en",
    context=[record["column_name"]],
    return_decision_process=True,
)

print("Result")
print(result)


Result
[type: US_ZIP_CODE, start: 11, end: 16, score: 0.4]


### Example 6: Tracing the decision process

Presidio-analyzer's decision process exposes information on why a specific PII entity was detected. Such information could contain:

- Which recognizer detected the entity
- Which regex pattern was used
- Interpretability mechanisms in ML models
- Which context words improved the score
- Confidence scores before and after each step

And more.

In [32]:
from presidio_analyzer import AnalyzerEngine
import pprint

analyzer = AnalyzerEngine()

results = analyzer.analyze(
    text="My zip code is 90210", language="en", return_decision_process=True
)

#decision_process = results[0].analysis_explanation

pp = pprint.PrettyPrinter()
print("Decision process output:\n")
pp.pprint(results)
#pp.pprint(results.__dict__)

Decision process output:

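[]

The result is empty: this analyzer uses the default registry, which does not contain the US_ZIP_CODE recognizer defined in Example 5, so nothing is detected and there is no decision process to show. To inspect one, create the analyzer with a registry that includes that recognizer and print `results[0].analysis_explanation`.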


### Example 7: Creating no-code pattern recognizers

In [38]:
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry

yaml_file = "/Users/ashleshkhajbage/Documents/GitHub/PII-Data-Detection/example_recognizers.yaml"
registry = RecognizerRegistry()
registry.add_recognizers_from_yaml(yaml_file)

analyzer = AnalyzerEngine(registry=registry)
analyzer.analyze(text="Mr. and Mrs. Smith", language="en")

[type: TITLE, start: 0, end: 3, score: 1.0,
 type: TITLE, start: 8, end: 12, score: 1.0]
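The contents of example_recognizers.yaml are not shown in this notebook. A file along the following lines would be consistent with the TITLE results above; the recognizer name and the deny-list entries are assumptions, and the key names follow the layout I understand Presidio's YAML recognizers to use. The sketch writes the file to a temporary directory so it can be tried without touching the original path.

```python
import tempfile
from pathlib import Path

# Assumed contents: the original example_recognizers.yaml is not shown in the notebook.
YAML_TEXT = """\
recognizers:
  - name: "Titles recognizer"      # hypothetical name
    supported_language: "en"
    supported_entity: "TITLE"
    deny_list:
      - Mr.
      - Mrs.
      - Miss
"""

# Write the definition to a temporary location.
yaml_path = Path(tempfile.mkdtemp()) / "example_recognizers.yaml"
yaml_path.write_text(YAML_TEXT, encoding="utf-8")
print(yaml_path)

# With presidio_analyzer installed, the file could then be loaded as in the cell above:
# registry.add_recognizers_from_yaml(str(yaml_path))
```

Keeping recognizers in YAML means new deny lists or patterns can be added without changing Python code.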

This example adds the new recognizers to the predefined recognizers in Presidio:

In [39]:
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry

yaml_file = "/Users/ashleshkhajbage/Documents/GitHub/PII-Data-Detection/example_recognizers.yaml"
registry = RecognizerRegistry()
registry.load_predefined_recognizers()  # Loads all the predefined recognizers (Credit card, phone number etc.)

registry.add_recognizers_from_yaml(yaml_file)

analyzer = AnalyzerEngine()
analyzer.analyze(text="Mr. Plum wrote a book", language="en")

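[type: PERSON, start: 4, end: 8, score: 0.85]

Note that no TITLE entity appears for "Mr." in this output: the analyzer was created with AnalyzerEngine(), so the registry built above was never used. Passing registry=registry to AnalyzerEngine, as in the previous cell, applies the YAML recognizers together with the predefined ones.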