<a href="https://colab.research.google.com/github/Gltknzk/Sensitive-Data-Detection/blob/master/Sensitive_Data_Detection_Anonymization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Advanced Sensitive Data Detection and Anonymization Using Hugging Face Transformers

Presidio helps ensure that sensitive data is properly managed and governed. Its analyzer is a Python-based service for detecting sensitive data in text, and the project provides fast identification and anonymization modules for private entities such as credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers, and financial data.

By default, Presidio uses spaCy for sensitive data detection and extraction. In this notebook we plug a Hugging Face Transformers model into Presidio as an additional recognizer for detection and anonymization. Out of the box, Presidio already supports 24 PII entities, including CREDIT_CARD, IBAN_CODE, EMAIL_ADDRESS, US_BANK_NUMBER, and US_ITIN. We extend these 24 entities with a transformers model that covers LOCATION, PERSON, and ORGANIZATION, but any entity extracted by the transformers model can be used.

Loading important libraries

In [1]:
pip install presidio_analyzer

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting presidio_analyzer
  Downloading presidio_analyzer-2.2.29-py3-none-any.whl (66 kB)
Collecting phonenumbers>=8.12
  Downloading phonenumbers-8.12.52-py2.py3-none-any.whl (2.6 MB)
Collecting tldextract
  Downloading tldextract-3.3.1-py3-none-any.whl (93 kB)
Collecting requests-file>=1.4
  Downloading requests_file-1.5.1-py2.py3-none-any.whl (3.7 kB)
Installing collected packages: requests-file, tldextract, phonenumbers, presidio-analyzer
Successfully installed phonenumbers-8.12.52 presidio-analyzer-2.2.29 requests-file-1.5.1 tldextract-3.3.1


In [2]:
pip install presidio_anonymizer

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting presidio_anonymizer
  Downloading presidio_anonymizer-2.2.29-py3-none-any.whl (25 kB)
Collecting pycryptodome>=3.10.1
  Downloading pycryptodome-3.15.0-cp35-abi3-manylinux2010_x86_64.whl (2.3 MB)
Installing collected packages: pycryptodome, presidio-anonymizer
Successfully installed presidio-anonymizer-2.2.29 pycryptodome-3.15.0


In [None]:
#python -m spacy download en_core_web_lg

In [3]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstal

In [4]:
import pandas as pd

Creating Model

In [5]:
from typing import List

from presidio_analyzer import AnalyzerEngine, EntityRecognizer, RecognizerResult
from presidio_analyzer.nlp_engine import NlpArtifacts
from presidio_anonymizer import AnonymizerEngine
from transformers import pipeline

# workaround: download the spaCy model that Presidio's default NLP engine loads
import os
os.system("spacy download en_core_web_lg")

# list of entities: https://microsoft.github.io/presidio/supported_entities/#list-of-supported-entities
DEFAULT_ANOYNM_ENTITIES = [
    "CREDIT_CARD",
    "CRYPTO",
    "DATE_TIME",
    "EMAIL_ADDRESS",
    "IBAN_CODE",
    "IP_ADDRESS",
    "NRP",
    "LOCATION",
    "PERSON",
    "PHONE_NUMBER",
    "MEDICAL_LICENSE",
    "URL",
    "ORGANIZATION",
    "US_SSN"
]

# init anonymize engine
engine = AnonymizerEngine()

class HFTransformersRecognizer(EntityRecognizer):
    def __init__(
        self,
        model_id_or_path=None,
        aggregation_strategy="simple",
        supported_language="en",
        ignore_labels=["O", "MISC"],
    ):
        # initialize the transformers pipeline for the given model id or path
        self.pipeline = pipeline(
            "token-classification", model=model_id_or_path, aggregation_strategy=aggregation_strategy, ignore_labels=ignore_labels
        )
        # map labels to presidio labels
        self.label2presidio = {
            "PER": "PERSON",
            "LOC": "LOCATION",
            "ORG": "ORGANIZATION",
        }

        # passes entities from model into parent class
        super().__init__(supported_entities=list(self.label2presidio.values()), supported_language=supported_language)

    def load(self) -> None:
        """No loading is required."""
        pass

    def analyze(
        self, text: str, entities: List[str] = None, nlp_artifacts: NlpArtifacts = None
    ) -> List[RecognizerResult]:
        """
        Extracts entities using Transformers pipeline
        """
        results = []

        # keep max sequence length in mind
        predicted_entities = self.pipeline(text)
        for e in predicted_entities:
            # skip labels that have no Presidio equivalent in the mapping
            converted_entity = self.label2presidio.get(e["entity_group"])
            if converted_entity is None:
                continue
            # test `entities is None` first so `in` is never applied to None
            if entities is None or converted_entity in entities:
                results.append(
                    RecognizerResult(
                        entity_type=converted_entity, start=e["start"], end=e["end"], score=e["score"]
                    )
                )
        return results


def model_fn(model_dir):
    transformers_recognizer = HFTransformersRecognizer(model_dir)
    # Set up the engine, loads the NLP module (spaCy model by default) and other PII recognizers
    analyzer = AnalyzerEngine()
    analyzer.registry.add_recognizer(transformers_recognizer)
    return analyzer


def predict_fn(data, analyzer):
    sentences = data.pop("inputs", data)
    if "parameters" in data:
        anonymization_entities = data["parameters"].get("entities", DEFAULT_ANOYNM_ENTITIES)
        anonymize_text = data["parameters"].get("anonymize", False)
    else:
        anonymization_entities = DEFAULT_ANOYNM_ENTITIES
        anonymize_text = False

    # identify entities
    results = analyzer.analyze(text=sentences, entities=anonymization_entities, language="en")
    # anonymize text
    if anonymize_text:
        result = engine.anonymize(text=sentences, analyzer_results=results)
        return {"anonymized": result.text}

    return {"found": [entity.to_dict() for entity in results]}
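
The label mapping and entity filtering done in `HFTransformersRecognizer.analyze` can be sketched without loading a model. The example below is only an illustration under stated assumptions, not Presidio's or Transformers' API: `to_presidio_spans` is a hypothetical helper, and `fake_predictions` is hand-written sample data that mimics the shape of the dictionaries a token-classification pipeline returns when an aggregation strategy is set.

```python
from typing import Dict, List, Optional

# Same label mapping as in the recognizer above.
LABEL2PRESIDIO = {"PER": "PERSON", "LOC": "LOCATION", "ORG": "ORGANIZATION"}


def to_presidio_spans(
    predictions: List[Dict], entities: Optional[List[str]] = None
) -> List[Dict]:
    """Map model labels to Presidio entity names and keep only requested ones."""
    spans = []
    for pred in predictions:
        entity_type = LABEL2PRESIDIO.get(pred["entity_group"])
        if entity_type is None:
            continue  # label without a Presidio equivalent
        # entities=None means "no filter": keep every mapped entity
        if entities is None or entity_type in entities:
            spans.append(
                {
                    "entity_type": entity_type,
                    "start": pred["start"],
                    "end": pred["end"],
                    "score": pred["score"],
                }
            )
    return spans


# Hand-written sample data (an assumption, not real model output).
fake_predictions = [
    {"entity_group": "PER", "start": 19, "end": 23, "score": 0.99},
    {"entity_group": "LOC", "start": 38, "end": 46, "score": 0.98},
    {"entity_group": "ORG", "start": 59, "end": 72, "score": 0.97},
]

all_spans = to_presidio_spans(fake_predictions)
only_persons = to_presidio_spans(fake_predictions, entities=["PERSON"])
print([s["entity_type"] for s in all_spans])
print([s["entity_type"] for s in only_persons])
```

Checking `entities is None` before the membership test matters: with the operands in the other order, a call without an entity filter would raise a `TypeError`.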

Making Predictions

In [99]:
sentence="""
Hello, my name is Zack and I live in Istanbul.
I work for DataTera Tech. 
You can call me at (212) 555-1234.
My credit card number is 4095-2609-9393-4932 and my crypto wallet id is 16Yeky6GMjeNkAiNcBY7ZhrLoMSgg1BoyZ.
This is a valid International Bank Account Number: IL150120690000003111111.
My social security number is 078-05-1126.  My driver license number is 1234567A."""

In [100]:
data = {
  "inputs": sentence,
}

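Find all default entities in the text. Passing `AnalyzerEngine()` directly uses only Presidio's default recognizers; to include the transformers recognizer, pass the analyzer returned by `model_fn` instead.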

In [101]:
predict_fn(data,AnalyzerEngine())



{'found': [{'analysis_explanation': None,
   'end': 154,
   'entity_type': 'CREDIT_CARD',
   'recognition_metadata': {'recognizer_name': 'CreditCardRecognizer'},
   'score': 1.0,
   'start': 135},
  {'analysis_explanation': None,
   'end': 216,
   'entity_type': 'CRYPTO',
   'recognition_metadata': {'recognizer_name': 'CryptoRecognizer'},
   'score': 1.0,
   'start': 182},
  {'analysis_explanation': None,
   'end': 292,
   'entity_type': 'IBAN_CODE',
   'recognition_metadata': {'recognizer_name': 'IbanRecognizer'},
   'score': 1.0,
   'start': 269},
  {'analysis_explanation': None,
   'end': 23,
   'entity_type': 'PERSON',
   'recognition_metadata': {'recognizer_name': 'SpacyRecognizer'},
   'score': 0.85,
   'start': 19},
  {'analysis_explanation': None,
   'end': 46,
   'entity_type': 'LOCATION',
   'recognition_metadata': {'recognizer_name': 'SpacyRecognizer'},
   'score': 0.85,
   'start': 38},
  {'analysis_explanation': None,
   'end': 334,
   'entity_type': 'US_SSN',
   'recognit

Find only PERSON and LOCATION entities

In [102]:
data = {
  "inputs": sentence,
  "parameters": {
    "entities":["PERSON","LOCATION"]
  }
}

In [10]:
predict_fn(data,AnalyzerEngine())

{'found': [{'analysis_explanation': None,
   'end': 23,
   'entity_type': 'PERSON',
   'recognition_metadata': {'recognizer_name': 'SpacyRecognizer'},
   'score': 0.85,
   'start': 19},
  {'analysis_explanation': None,
   'end': 46,
   'entity_type': 'LOCATION',
   'recognition_metadata': {'recognizer_name': 'SpacyRecognizer'},
   'score': 0.85,
   'start': 38}]}
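
The `start` and `end` values in each result are character offsets into the input string, with `end` exclusive, so slicing the text recovers the matched substring. A minimal sketch, using only the first line of the sample text (including its leading newline) and offsets copied by hand from the output above:

```python
# First line of the sample text, including the leading newline.
text = "\nHello, my name is Zack and I live in Istanbul."

# Offsets copied from the output above, trimmed to the fields needed here.
found = [
    {"entity_type": "PERSON", "start": 19, "end": 23},
    {"entity_type": "LOCATION", "start": 38, "end": 46},
]

# Slice the text with each span to see what was matched.
matches = {r["entity_type"]: text[r["start"]:r["end"]] for r in found}
print(matches)  # {'PERSON': 'Zack', 'LOCATION': 'Istanbul'}
```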

Anonymize all entities in the text

In [103]:
data = {
  "inputs": sentence,
  "parameters": {
    "anonymize": True,
  }
}

In [104]:
print(predict_fn(data,AnalyzerEngine())["anonymized"])




Hello, my name is <PERSON> and I live in <LOCATION>.
I work for DataTera Tech. 
You can call me at <PHONE_NUMBER>.
My credit card number is <CREDIT_CARD> and my crypto wallet id is <CRYPTO>.
This is a valid International Bank Account Number: <IBAN_CODE>.
My social security number is <US_SSN>.  My driver license number is 1234567A.
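
In the output above, each detected span has been replaced with an `<ENTITY_TYPE>` placeholder. The sketch below illustrates that idea in plain Python. `replace_spans` is a hypothetical helper, not part of `presidio_anonymizer`, and it assumes the spans do not overlap.

```python
from typing import Dict, List


def replace_spans(text: str, spans: List[Dict]) -> str:
    """Replace each [start, end) span with an <ENTITY_TYPE> placeholder."""
    # Work from the last span to the first so earlier offsets stay valid.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        placeholder = f"<{span['entity_type']}>"
        text = text[: span["start"]] + placeholder + text[span["end"]:]
    return text


# First line of the sample text, with the offsets reported by the analyzer.
text = "\nHello, my name is Zack and I live in Istanbul."
spans = [
    {"entity_type": "PERSON", "start": 19, "end": 23},
    {"entity_type": "LOCATION", "start": 38, "end": 46},
]
anonymized = replace_spans(text, spans)
print(anonymized)
```

Replacing from the end of the text backwards is a design choice: a placeholder usually differs in length from the text it replaces, so a front-to-back pass would shift every later offset.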


Anonymize only PERSON and LOCATION in the text

In [105]:
data = {
  "inputs": sentence,
  "parameters": {
    "anonymize": True,
    "entities":["PERSON","LOCATION"]
  }
}

In [106]:
print(predict_fn(data,AnalyzerEngine())["anonymized"])




Hello, my name is <PERSON> and I live in <LOCATION>.
I work for DataTera Tech. 
You can call me at (212) 555-1234.
My credit card number is 4095-2609-9393-4932 and my crypto wallet id is 16Yeky6GMjeNkAiNcBY7ZhrLoMSgg1BoyZ.
This is a valid International Bank Account Number: IL150120690000003111111.
My social security number is 078-05-1126.  My driver license number is 1234567A.
