<a href="https://colab.research.google.com/github/LLMsLab/chat-gpt-api-lab/blob/exploration%2Fchatgpt-api-understanding/tutorial_02_presidio.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A Basic Tutorial on Presidio

Presidio is an open-source data protection and privacy library developed by Microsoft. It is designed to recognize and anonymize Personally Identifiable Information (PII) in text. Presidio detects sensitive information using Named Entity Recognition (NER) together with pattern-based recognizers such as regular expressions and checksums. It can help you protect privacy and comply with data protection regulations.

## Setup

To wrap long lines in the notebook's output, register the following function so that it runs before every cell.

Reference: [Line Wrapping in Collaboratory Google results](https://stackoverflow.com/questions/58890109/line-wrapping-in-collaboratory-google-results)

In [None]:
from IPython.display import HTML, display

def set_css():
    """
    Wraps the lines in the notebook's output.
    """
    display(HTML('''
    <style>
      pre {
        white-space: pre-wrap;
      }
    </style>
    '''))

get_ipython().events.register('pre_run_cell', set_css)

In [None]:
# Requirements
!pip install -qU python-dotenv openai gradio presidio-analyzer presidio-anonymizer
!python -m spacy download en_core_web_lg -qU

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [None]:
# Imports
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

## Understanding the Python Code

The code below uses the Presidio library to analyze and anonymize a given text. Here's a breakdown of what it does:

1. **Imports the necessary classes**: `AnalyzerEngine` and `AnonymizerEngine` are imported from the `presidio_analyzer` and `presidio_anonymizer` packages.

2. **Defines the `analyze_and_anonymize` function**: This function takes a `text` argument and returns the anonymized version of the text.

3. **Sets up the Analyzer engine**: The `AnalyzerEngine` is initialized. This engine is capable of recognizing various types of PII in text data.

4. **Specifies the entities to analyze**: A list of entities that should be detected in the text is defined. These entities include various types of personal and sensitive information, such as credit card numbers, email addresses, and social security numbers.

5. **Analyzes the text**: The `analyze` method of the `AnalyzerEngine` is called with the text, the entity list, and the language (`"en"`). It returns a list of results, each giving the entity type, the start and end offsets of the match, and a confidence score.

6. **Sets up the Anonymizer engine**: The `AnonymizerEngine` is initialized. This engine is capable of anonymizing recognized entities in the text.

7. **Anonymizes the text**: The `anonymize` method of the `AnonymizerEngine` is called with the text and the analyzer results as arguments. It returns a result object whose `text` attribute holds the anonymized string.

8. **Returns the anonymized text**: The `analyze_and_anonymize` function returns the `text` attribute of that result.

The function is then tested on a sample text that contains several types of PII.


In [None]:
def analyze_and_anonymize(text: str) -> str:
    """
    Analyze and anonymize PII data from a given text string.

    Parameters
    ----------
    text : str
        The text to be analyzed and anonymized.

    Returns
    -------
    str
        The anonymized text.

    Example
    -------
    >>> text = (
    ...     "Mr. Smith's phone number is 212-555-5555, his SSN is "
    ...     "432-56-5654, and his credit card number is 344078656339539"
    ... )
    >>> print(analyze_and_anonymize(text))
    """
    # Set up the analyzer engine; this loads the NLP engine
    # (a spaCy model by default) and the built-in PII recognizers
    analyzer = AnalyzerEngine()

    # Define the entities to analyze
    entities = [
        "CREDIT_CARD",
        "CRYPTO",
        "DATE_TIME",
        "EMAIL_ADDRESS",
        "IBAN_CODE",
        "IP_ADDRESS",
        "NRP",
        "LOCATION",
        "PERSON",
        "PHONE_NUMBER",
        "MEDICAL_LICENSE",
        "URL",
        "US_BANK_NUMBER",
        "US_DRIVER_LICENSE",
        "US_ITIN",
        "US_PASSPORT",
        "US_SSN",
    ]

    # Call analyzer to get results
    results = analyzer.analyze(text=text, entities=entities, language="en")

    # Analyzer results are passed to the AnonymizerEngine for anonymization
    anonymizer = AnonymizerEngine()

    # Anonymize the text
    anonymized_text = anonymizer.anonymize(text=text, analyzer_results=results)

    return anonymized_text.text

In [None]:
text = """Mr. Smith's phone number is 212-555-5555, his SSN is 432-56-5654, 
and his credit card number is 344078656339539"""
print(analyze_and_anonymize(text))



Mr. <PERSON>'s phone number is <PHONE_NUMBER>, his SSN is <US_SSN>, 
and his credit card number is <CREDIT_CARD>


## Output Explanation

The output of the script is the anonymized version of the input text. Each recognized entity in the text is replaced with a placeholder. For example, a recognized person's name is replaced with `<PERSON>`, a recognized phone number is replaced with `<PHONE_NUMBER>`, and so on.

This obscures the sensitive information while preserving the general structure and non-sensitive content of the text. It is useful when you want to use or share text data without compromising the privacy of individuals.
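The placeholder substitution described above can be pictured with a short pure-Python sketch. This is not Presidio's implementation; the `replace_spans` helper and the `(start, end, entity_type)` tuple format are hypothetical names introduced only for illustration, and the sketch assumes the spans do not overlap.

```python
from typing import List, Tuple

# Hypothetical span format for illustration: (start, end, entity_type)
Span = Tuple[int, int, str]


def replace_spans(text: str, spans: List[Span]) -> str:
    """Replace each [start, end) span with an <ENTITY_TYPE> placeholder."""
    # Process spans from the end of the string backwards so that
    # replacing one span does not shift the offsets of the others.
    for start, end, entity_type in sorted(spans, reverse=True):
        text = text[:start] + f"<{entity_type}>" + text[end:]
    return text


sample = "Mr. Smith's SSN is 432-56-5654"
spans = [(4, 9, "PERSON"), (19, 30, "US_SSN")]
print(replace_spans(sample, spans))  # Mr. <PERSON>'s SSN is <US_SSN>
```

Working from the last span to the first is the key design choice: each placeholder usually differs in length from the text it replaces, so a left-to-right pass would invalidate the remaining offsets.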

If warnings appear before the anonymized text when you run the cell, they are informational messages from Presidio indicating that it is using its default configuration and the `en_core_web_lg` spaCy model for English language processing.
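To see what was detected rather than only the anonymized text, it can help to list each detection next to the text it matched. The sketch below assumes only that each result exposes `entity_type`, `start`, `end`, and `score` attributes, as Presidio's analyzer results do. The `summarize_results` helper and the `FakeResult` stand-in are hypothetical, and the scores shown are made-up values, not real Presidio output.

```python
from collections import namedtuple


def summarize_results(text, results):
    """Return (entity_type, matched_text, score) tuples in text order."""
    ordered = sorted(results, key=lambda r: r.start)
    return [(r.entity_type, text[r.start:r.end], r.score) for r in ordered]


# Stand-in for an analyzer result so the sketch runs without loading a model.
# The scores are illustrative only.
FakeResult = namedtuple("FakeResult", ["entity_type", "start", "end", "score"])

sample = "Call 212-555-5555 or write to jane@example.com"
fake_results = [
    FakeResult("EMAIL_ADDRESS", 30, 46, 1.0),
    FakeResult("PHONE_NUMBER", 5, 17, 0.75),
]

for row in summarize_results(sample, fake_results):
    print(row)
```

With the engines from this notebook, the same helper could be called as `summarize_results(text, AnalyzerEngine().analyze(text=text, language="en"))`.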