# Experimenting with Microsoft Presidio and GovTech Cloak

## Presidio

##### Used to identify and anonymize/mask PII
Presidio features two main modules for anonymizing PII in text:

1. Presidio analyzer: identifies PII in text
2. Presidio anonymizer: de-identifies detected PII entities using different operators

In most cases, we would run the Presidio analyzer to detect where PII entities exist, and then the Presidio anonymizer to remove them using specific operators (such as redact, replace, hash or encrypt).

In [5]:
import pprint

In [None]:
# install presidio
!pip install presidio_analyzer
!pip install presidio_anonymizer
!pip install presidio-image-redactor
!python -m spacy download en_core_web_lg

### 1. Presidio Analyzer

In [6]:
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

#### 1.1 Entities

##### Some Supported Entities
+ Global Entities:
    + Credit Card Number: CREDIT_CARD
    + Crypto Wallet Number: CRYPTO
    + Dates or Periods or Times smaller than a day: DATE_TIME
    + Email Address: EMAIL_ADDRESS
    + International Bank Account Number (IBAN): IBAN_CODE
    + IP address (either IPv4 or IPv6): IP_ADDRESS
    + Nationality, Religious or Political group: NRP
    + Location Name: LOCATION
    + Person Name: PERSON
    + Telephone Number: PHONE_NUMBER
    + Common Medical License Number: MEDICAL_LICENSE
    + URL: URL

+ Singapore Entities:
  + NRIC Number: SG_NRIC_FIN
  + UEN: SG_UEN

+ USA Entities:
    + Bank Account Number: US_BANK_NUMBER
    + Driver License: US_DRIVER_LICENSE
    + Individual Taxpayer Identification Number (ITIN): US_ITIN
    + Passport Number: US_PASSPORT
    + Social Security Number (SSN): US_SSN

##### 1.1.1 Phone Number

+ Phone numbers are mainly identified by their country code; American numbers are the exception and are detected without one
+ Singapore numbers cannot be identified without the country code
+ The confidence score also decreases (from 0.75 to 0.4 in the tests below) if the sentence does not explicitly state that the number is a phone number

In [8]:
# American number
phone_number_test1 = "His name is Mr. Jones and his phone number is 212-555-5555"
analyzer_results = analyzer.analyze(text=phone_number_test1, entities=["PHONE_NUMBER"], language="en") # Passing only "PHONE_NUMBER" in entities makes the analyzer ignore other entities such as "PERSON".
print("Phone Number test 1:")
pprint.pp(analyzer_results) #confidence scores provided
print()

# Singapore number with country code
phone_number_test2 = "His name is Mr. Jones and his phone number is +65 8453 2456"
analyzer_results = analyzer.analyze(text=phone_number_test2, entities=["PHONE_NUMBER"], language="en")
print("Phone Number test 2:")
pprint.pp(analyzer_results)
print()

# Singapore number without country code **Cannot Detect**
phone_number_test3 = "His name is Mr. Jones and his phone number is 8453 2456"
analyzer_results = analyzer.analyze(text=phone_number_test3, entities=["PHONE_NUMBER"], language="en")
print("Phone Number test 3:")
pprint.pp(analyzer_results)
print()

# Indirectly stating the phone number (Confidence score decreases to 0.4)
phone_number_test4 = "You can call Mr. Jones at +65 8453 2456"
analyzer_results = analyzer.analyze(text=phone_number_test4, entities=["PHONE_NUMBER"], language="en")
print("Phone Number test 4:")
pprint.pp(analyzer_results)
print()

Phone Number test 1:
[type: PHONE_NUMBER, start: 46, end: 58, score: 0.75]

Phone Number test 2:
[type: PHONE_NUMBER, start: 46, end: 59, score: 0.75]

Phone Number test 3:
[]

Phone Number test 4:
[type: PHONE_NUMBER, start: 26, end: 39, score: 0.4]



##### 1.1.2 Name

+ Can identify lowercase and non-English names
+ Might not be able to identify ambiguous lowercase names (e.g. smith, sky, june)

In [6]:
# Normal name (Jones)
name_test1 = "How are you doing today Jones?"
print("Name test 1:")
pprint.pp(analyzer.analyze(text=name_test1, entities=["PERSON"], language="en"))
print()

# Lowercase name (jones)
name_test2 = "How are you doing today jones?"
print("Name test 2:")
pprint.pp(analyzer.analyze(text=name_test2, entities=["PERSON"], language="en"))
print()

# Chinese Name + lowercase (kok meng)
name_test3 = "How are you doing today kok meng?"
print("Name test 3:")
pprint.pp(analyzer.analyze(text=name_test3, entities=["PERSON"], language="en"))
print()

# Ambiguous name + lowercase (june) **Cannot Detect**
name_test4 = "We are on the way to visit june at her new home!"
print("Name test 4:")
pprint.pp(analyzer.analyze(text=name_test4, entities=["PERSON"], language="en"))
print()


Name test 1:
[type: PERSON, start: 24, end: 29, score: 0.85]

Name test 2:
[type: PERSON, start: 24, end: 29, score: 0.85]

Name test 3:
[type: PERSON, start: 24, end: 32, score: 0.85]

Name test 4:
[]



##### 1.1.3 Presidio Analyzer has strict rules for certain types of information:

+ IP Addresses: An IPv4 address is only considered valid if all numbers are between 0 and 255. For example, `256.255.1.1` is invalid because `256` is out of range.

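+ Credit Card Numbers: Only numbers that match a known card-provider pattern and pass checksum validation are detected. For example, `5555 5555 5555 4444` is a valid Mastercard test number, whereas changing a single digit (`5355 5555 5555 4444`) breaks the Luhn checksum. Numbers separated by `.` instead of spaces or hyphens are also not detected in the tests below.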
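
The checks described above can be approximated with the standard library. This is only a sketch, not Presidio's own code: the helper names `is_valid_ipv4` and `passes_luhn` are made up for illustration, and it assumes that the credit card recognizer applies a Luhn checksum on top of its pattern match, which would explain why changing one digit of a valid number stops it from being detected.

```python
import ipaddress


def is_valid_ipv4(candidate: str) -> bool:
    """Return True only if the string is a dotted quad with octets in 0-255."""
    try:
        ipaddress.IPv4Address(candidate)
        return True
    except ipaddress.AddressValueError:
        return False


def passes_luhn(card_number: str) -> bool:
    """Luhn checksum: double every second digit from the right, then sum."""
    digits = [int(ch) for ch in card_number if ch.isdigit()]
    total = 0
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return bool(digits) and total % 10 == 0


print(is_valid_ipv4("255.255.255.1"))      # True
print(is_valid_ipv4("256.255.255.1"))      # False, 256 is out of range
print(passes_luhn("5555 5555 5555 4444"))  # True
print(passes_luhn("5355 5555 5555 4444"))  # False, checksum no longer matches
```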

In [7]:
# Normal IP Address
ip_test1 = "You can call me at ip address 255.255.255.1"
print("IP Address test 1:")
pprint.pp(analyzer.analyze(text=ip_test1, entities=["IP_ADDRESS"], language="en"))
print()

# Wrong IP Address format (number greater than 255) **Cannot Detect**
ip_test2 = "You can call me at ip address 256.255.255.1"
print("IP Address test 2:")
pprint.pp(analyzer.analyze(text=ip_test2, entities=["IP_ADDRESS"], language="en"))
print()

# Normal Credit Card Number
credit_test1 = "When paying for the groceries, use the card 5555 5555 5555 4444 please"
print("Credit Card test 1:")
pprint.pp(analyzer.analyze(text=credit_test1, entities=["CREDIT_CARD"], language="en"))
print()

# Invalid credit card number (one digit changed, so the Luhn checksum fails) **Cannot Detect**
credit_test2 = "When paying for the groceries, use the card 5355 5555 5555 4444 please"
print("Credit Card test 2:")
pprint.pp(analyzer.analyze(text=credit_test2, entities=["CREDIT_CARD"], language="en"))
print()

# Wrong Credit Card format ("." between the numbers) **Cannot Detect**
credit_test3 = "When paying for the groceries, use the card 5555.5555.5555.4444 please"
print("Credit Card test 3:")
pprint.pp(analyzer.analyze(text=credit_test3, entities=["CREDIT_CARD"], language="en"))
print()

IP Address test 1:
[type: IP_ADDRESS, start: 30, end: 43, score: 0.95]

IP Address test 2:
[]

Credit Card test 1:
[type: CREDIT_CARD, start: 44, end: 63, score: 1.0]

Credit Card test 2:
[]

Credit Card test 3:
[]



##### 1.1.4 Differentiating between entities

In [8]:
# IP Address (Might confuse with phone number)
diff_test2 = "Is your IP Address 255.255.255.1?"
analyzer_results = analyzer.analyze(text=diff_test2, entities=["PHONE_NUMBER", "IP_ADDRESS"], language="en")
print("Differentiation test 2:")
pprint.pp(analyzer_results)
print()

Differentiation test 2:
[type: IP_ADDRESS, start: 19, end: 32, score: 0.95,
 type: PHONE_NUMBER, start: 19, end: 32, score: 0.4]
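
The output above contains two candidates for the same span, with the IP address scoring much higher than the phone number. If only one label per span is wanted downstream, a minimal sketch of keeping the best-scoring candidate could look like this. The `Detection` class and `keep_best_per_span` function are hypothetical stand-ins for illustration and are not part of Presidio.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Detection:
    entity_type: str
    start: int
    end: int
    score: float


def keep_best_per_span(detections: List[Detection]) -> List[Detection]:
    """Among overlapping detections, keep only the highest-scoring one."""
    kept: List[Detection] = []
    # Visit candidates from highest to lowest score so winners are placed first
    for candidate in sorted(detections, key=lambda d: d.score, reverse=True):
        overlaps = any(
            candidate.start < k.end and k.start < candidate.end for k in kept
        )
        if not overlaps:
            kept.append(candidate)
    return sorted(kept, key=lambda d: d.start)


detections = [
    Detection("IP_ADDRESS", 19, 32, 0.95),
    Detection("PHONE_NUMBER", 19, 32, 0.4),
]
print(keep_best_per_span(detections))
```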



#### 1.2 Extending the analyzer for additional PII entities
https://microsoft.github.io/presidio/analyzer/adding_recognizers/#extending-the-analyzer-for-additional-pii-entities



In [None]:
# Initially, requesting a TITLE entity raises an error:


import traceback

try:
    print(analyzer.analyze(text="Mr. Schmidt", entities=["TITLE"], language="en"))
except ValueError as e:
    traceback.print_exc()


Since Presidio Analyzer does not have a TITLE entity, we can add it ourselves with a `PatternRecognizer` and a deny list.

In [10]:
from presidio_analyzer import PatternRecognizer
titles_recognizer = PatternRecognizer(supported_entity="TITLE",
                                      deny_list=["Mr.","Mrs.","Miss"])

print(titles_recognizer.analyze(text="Mr. Schmidt", entities=["TITLE"]))

[type: TITLE, start: 0, end: 3, score: 1.0]


In [11]:
# Add this new recognizer to the list of recognizers used by the Presidio AnalyzerEngine
analyzer.registry.add_recognizer(titles_recognizer)

print(analyzer.analyze(text="Mr. Schmidt", entities=["TITLE"], language="en"))

[type: TITLE, start: 0, end: 3, score: 1.0]


### 2. Presidio Anonymizer

In [12]:
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig

engine = AnonymizerEngine()

#### 2.1 Using Presidio Anonymizer

In [None]:
# Analyzer output
analyzer_results = [
    RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
    RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
]

# Invoke the anonymize function with the text,
# analyzer results (potentially coming from presidio-analyzer) and
# Operators to get the anonymization output:
result = engine.anonymize(
    text="My name is Bond, James Bond", analyzer_results=analyzer_results
)

print("De-identified text")
print(result.text)

De-identified text
My name is <PERSON>, <PERSON>


#### 2.2 Modify Anonymizer Masking

In [14]:
import json

# Define anonymization operators
operators = {
    "DEFAULT": OperatorConfig("replace", {"new_value": "<ANONYMIZED>"}), # Default for everything
    "PHONE_NUMBER": OperatorConfig(
        "mask",
        {
            "type": "mask",
            "masking_char": "*",
            "chars_to_mask": 9,
            "from_end": True,
        },
    ),
    "CREDIT_CARD": OperatorConfig(
        "mask",
        {
            "type": "mask",
            "masking_char": "X",
            "chars_to_mask": 17,
            "from_end": True,
        },
    ),
}

In [15]:
text_to_anonymize = "My name is Ngoh Wei Jie and my phone number is +65 9777 1234, and my credit card number is 5555-5555-5555-4444"

analyzer_results = analyzer.analyze(text=text_to_anonymize, entities=["PHONE_NUMBER", "PERSON", "CREDIT_CARD"], language="en")

print("Analyzer Output:")
pprint.pp(analyzer_results) #confidence scores provided
print()

anonymized_results = engine.anonymize(
    text=text_to_anonymize, analyzer_results=analyzer_results, operators=operators
)

print("Text:")
print(anonymized_results.text)
print()

print("Detailed Result:")
pprint.pp(json.loads(anonymized_results.to_json()))

Analyzer Output:
[type: CREDIT_CARD, start: 91, end: 110, score: 1.0,
 type: PERSON, start: 11, end: 23, score: 0.85,
 type: PHONE_NUMBER, start: 47, end: 60, score: 0.75]

Text:
My name is <ANONYMIZED> and my phone number is +65 *********, and my credit card number is 55XXXXXXXXXXXXXXXXX

Detailed Result:
{'text': 'My name is <ANONYMIZED> and my phone number is +65 *********, and my '
         'credit card number is 55XXXXXXXXXXXXXXXXX',
 'items': [{'start': 91,
            'end': 110,
            'entity_type': 'CREDIT_CARD',
            'text': '55XXXXXXXXXXXXXXXXX',
            'operator': 'mask'},
           {'start': 47,
            'end': 60,
            'entity_type': 'PHONE_NUMBER',
            'text': '+65 *********',
            'operator': 'mask'},
           {'start': 11,
            'end': 23,
            'entity_type': 'PERSON',
            'text': '<ANONYMIZED>',
            'operator': 'replace'}]}


##### 2.2.1 Drawbacks of modifying the anonymizer using `OperatorConfig`

- `chars_to_mask` is a required parameter that fixes how many characters are masked, so the same kind of value can end up masked differently depending on the format it was entered in

In [16]:
operator = {
    "DEFAULT": OperatorConfig("replace", {"new_value": "<ANONYMIZED>"}), # Default for everything
    "PHONE_NUMBER": OperatorConfig(
        "mask",
        {
            "type": "mask",
            "masking_char": "*",
            "chars_to_mask": 8, # The number of characters to mask is fixed at 8
            "from_end": True,
        },
    ),
}

format1 = "My phone number is 5551234567"
format2 = "My phone number is (555) 123 4567"

analyzer_results1 = analyzer.analyze(text=format1, language="en")
analyzer_results2 = analyzer.analyze(text=format2, language="en")

anonymized_results1 = engine.anonymize(
    text=format1, analyzer_results=analyzer_results1, operators=operator
)
anonymized_results2 = engine.anonymize(
    text=format2, analyzer_results=analyzer_results2, operators=operator
)

print("Format 1:", anonymized_results1.text)
print("Format 2:", anonymized_results2.text)

Format 1: My phone number is 55********
Format 2: My phone number is (555) ********
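
One way around the fixed character count is to mask by digit rather than by position, so that separators do not shift what gets hidden. The function below is a plain-Python sketch with a made-up name (`mask_digits`); it is not a Presidio operator, although a function like this could probably be plugged in through Presidio's `custom` operator, which is an assumption not tested here.

```python
def mask_digits(value: str, visible: int = 2, masking_char: str = "*") -> str:
    """Mask every digit except the first `visible` ones, keeping separators."""
    seen = 0
    masked = []
    for ch in value:
        if ch.isdigit():
            seen += 1
            masked.append(ch if seen <= visible else masking_char)
        else:
            masked.append(ch)  # brackets, spaces and hyphens are left untouched
    return "".join(masked)


print(mask_digits("5551234567"))      # 55********
print(mask_digits("(555) 123 4567"))  # (55*) *** ****
```

Both formats now reveal the same two digits, regardless of how the number was typed.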


### 3. Encryption and Decryption

#### 3.1 Encryption

In [17]:
from presidio_anonymizer import AnonymizerEngine, DeanonymizeEngine
from presidio_anonymizer.entities import (
    RecognizerResult,
    OperatorResult,
    OperatorConfig,
)

In [18]:
crypto_key = "WmZq4t7w!z%C&F)J"

text = "I want a refund, I am james bond and email is jamesbond@htx.gov.sg, i have a sister Johnathan"

analyzer_results=AnalyzerEngine().analyze(text=text, language="en")

# Invoke the anonymize function with the text,
# analyzer results (potentially coming from presidio-analyzer)
# and an 'encrypt' operator to get an encrypted anonymization output:
anonymize_result = engine.anonymize(
    text=text,
    analyzer_results=analyzer_results,
    operators={"DEFAULT": OperatorConfig("encrypt", {"key": crypto_key})},
)

pprint.pp(json.loads(anonymize_result.to_json()))

{'text': 'I want a refund, I am w7HcKalEKU5YXSGtdTojZ6Or5mMCZjs3Gs7c4N4pfiI= '
         'and email is '
         'W6lg2JYdc8ecCtr5wi6Zubk4TFVDFViGpxixCO8XL4cZnwSEPAW8fHHYZQrcVJP9, i '
         'have a sister HciWYCvwzQyEHX2fpLnliwoE6axOkdLkNgeulEfc2Is=',
 'items': [{'start': 162,
            'end': 206,
            'entity_type': 'PERSON',
            'text': 'HciWYCvwzQyEHX2fpLnliwoE6axOkdLkNgeulEfc2Is=',
            'operator': 'encrypt'},
           {'start': 80,
            'end': 144,
            'entity_type': 'EMAIL_ADDRESS',
            'text': 'W6lg2JYdc8ecCtr5wi6Zubk4TFVDFViGpxixCO8XL4cZnwSEPAW8fHHYZQrcVJP9',
            'operator': 'encrypt'},
           {'start': 22,
            'end': 66,
            'entity_type': 'PERSON',
            'text': 'w7HcKalEKU5YXSGtdTojZ6Or5mMCZjs3Gs7c4N4pfiI=',
            'operator': 'encrypt'}]}
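
The `encrypt` operator is AES-based, and to my understanding it expects a key of 128, 192 or 256 bits, which is why `crypto_key` above is exactly 16 characters long. A small check like the one below (the name `is_valid_aes_key` is hypothetical) can catch a wrongly sized key before calling the engine.

```python
def is_valid_aes_key(key: str) -> bool:
    """AES keys must be 16, 24 or 32 bytes (128, 192 or 256 bits)."""
    return len(key.encode("utf-8")) in (16, 24, 32)


print(is_valid_aes_key("WmZq4t7w!z%C&F)J"))  # True, 16 bytes
print(is_valid_aes_key("too-short"))         # False, 9 bytes
```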


#### 3.2 Decryption

In [19]:
# Fetch the anonymized text from the result.
anonymized_text = anonymize_result.text

# Fetch the anonymized entities from the result.
anonymized_entities = anonymize_result.items

# Initialize the deanonymization engine:
engine = DeanonymizeEngine()

# Invoke the deanonymize function with the text, anonymizer results
# and a 'decrypt' operator to get the original text as output.
deanonymized_result = engine.deanonymize(
    text=anonymized_text,
    entities=anonymized_entities,
    operators={"DEFAULT": OperatorConfig("decrypt", {"key": crypto_key})},
)

pprint.pp(json.loads(deanonymized_result.to_json()))

{'text': 'I want a refund, I am james bond and email is jamesbond@htx.gov.sg, '
         'i have a sister Johnathan',
 'items': [{'start': 84,
            'end': 93,
            'entity_type': 'PERSON',
            'text': 'Johnathan',
            'operator': 'decrypt'},
           {'start': 46,
            'end': 66,
            'entity_type': 'EMAIL_ADDRESS',
            'text': 'jamesbond@htx.gov.sg',
            'operator': 'decrypt'},
           {'start': 22,
            'end': 32,
            'entity_type': 'PERSON',
            'text': 'james bond',
            'operator': 'decrypt'}]}


### 4. Pseudo Anonymizer
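Reversible: PII is replaced with placeholders before the text is sent to an LLM, and the placeholders can be mapped back to the original values in the LLM's output.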

#### 4.1 Pseudo Anonymize

In [20]:
# Use Presidio Analyzer to identify all the PII entities

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine, OperatorConfig
from presidio_anonymizer.operators import Operator, OperatorType

from typing import Dict
import pprint

#text = "Please i want a refund,my name is james bond and email is jamesbond@htx.gov.sg, i live at htx avenue 51322. I have a brother, Johnathan Ong"
text = "Peter gave his book to Heidi which later gave it to Nicole. Peter lives in London and Nicole lives in Tashkent."

analyzer = AnalyzerEngine()
analyzer_results = analyzer.analyze(text=text, language="en")

In [21]:
print("original text:")
pprint.pp(text)

print("analyzer results:")
pprint.pp(analyzer_results)

original text:
('Peter gave his book to Heidi which later gave it to Nicole. Peter lives in '
 'London and Nicole lives in Tashkent.')
analyzer results:
[type: PERSON, start: 0, end: 5, score: 0.85,
 type: PERSON, start: 23, end: 28, score: 0.85,
 type: PERSON, start: 52, end: 58, score: 0.85,
 type: PERSON, start: 60, end: 65, score: 0.85,
 type: LOCATION, start: 75, end: 81, score: 0.85,
 type: PERSON, start: 86, end: 92, score: 0.85,
 type: LOCATION, start: 102, end: 110, score: 0.85]


In [22]:
# Make the Anonymizer Class

class InstanceCounterAnonymizer(Operator):
    """
    Anonymizer which replaces the entity value
    with an instance counter per entity.
    """

    REPLACING_FORMAT = "<{entity_type}_{index}>"

    def operate(self, text: str, params: Dict = None) -> str:
        """Anonymize the input text."""

        entity_type: str = params["entity_type"]

        # entity_mapping is a dict of dicts containing mappings per entity type
        entity_mapping: Dict[str, Dict[str, str]] = params["entity_mapping"]

        entity_mapping_for_type = entity_mapping.get(entity_type)
        if not entity_mapping_for_type:
            new_text = self.REPLACING_FORMAT.format(
                entity_type=entity_type, index=0
            )
            entity_mapping[entity_type] = {}

        else:
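            # NOTE: the mapping is stored as placeholder -> original text, so this
            # lookup by original text never matches; repeated values such as "Peter"
            # therefore receive a new placeholder each time (see the output below).
            if text in entity_mapping_for_type:
                return entity_mapping_for_type[text]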

            previous_index = self._get_last_index(entity_mapping_for_type)
            new_text = self.REPLACING_FORMAT.format(
                entity_type=entity_type, index=previous_index + 1
            )

        entity_mapping[entity_type][new_text] = text
        return new_text

    @staticmethod
    def _get_last_index(entity_mapping_for_type: Dict) -> int:
        """Get the last index for a given entity type."""

        def get_index(value: str) -> int:
            return int(value.split("_")[-1][:-1])

        indices = [get_index(v) for v in entity_mapping_for_type.keys()]
        return max(indices)

    def validate(self, params: Dict = None) -> None:
        """Validate operator parameters."""

        if "entity_mapping" not in params:
            raise ValueError("An input Dict called `entity_mapping` is required.")
        if "entity_type" not in params:
            raise ValueError("An entity_type param is required.")

    def operator_name(self) -> str:
        return "entity_counter"

    def operator_type(self) -> OperatorType:
        return OperatorType.Anonymize


In [23]:
# Create Anonymizer engine and add the custom anonymizer

anonymizer_engine = AnonymizerEngine()
anonymizer_engine.add_anonymizer(InstanceCounterAnonymizer)

# Create a mapping between entity types and counters
entity_mapping = dict()

# Anonymize the text

anonymized_result = anonymizer_engine.anonymize(
    text,
    analyzer_results,
    {
        "DEFAULT": OperatorConfig(
            "entity_counter", {"entity_mapping": entity_mapping}
        )
    },
)

In [24]:
print(anonymized_result.text)

<PERSON_4> gave his book to <PERSON_3> which later gave it to <PERSON_2>. <PERSON_1> lives in <LOCATION_1> and <PERSON_0> lives in <LOCATION_0>.


In [25]:
pprint.pp(json.loads(anonymized_result.to_json()))

{'text': '<PERSON_4> gave his book to <PERSON_3> which later gave it to '
         '<PERSON_2>. <PERSON_1> lives in <LOCATION_1> and <PERSON_0> lives in '
         '<LOCATION_0>.',
 'items': [{'start': 131,
            'end': 143,
            'entity_type': 'LOCATION',
            'text': '<LOCATION_0>',
            'operator': 'entity_counter'},
           {'start': 111,
            'end': 121,
            'entity_type': 'PERSON',
            'text': '<PERSON_0>',
            'operator': 'entity_counter'},
           {'start': 94,
            'end': 106,
            'entity_type': 'LOCATION',
            'text': '<LOCATION_1>',
            'operator': 'entity_counter'},
           {'start': 74,
            'end': 84,
            'entity_type': 'PERSON',
            'text': '<PERSON_1>',
            'operator': 'entity_counter'},
           {'start': 62,
            'end': 72,
            'entity_type': 'PERSON',
            'text': '<PERSON_2>',
            'operator': 'entity_counter

In [26]:
entity_mapping

{'LOCATION': {'<LOCATION_0>': 'Tashkent', '<LOCATION_1>': 'London'},
 'PERSON': {'<PERSON_0>': 'Nicole',
  '<PERSON_1>': 'Peter',
  '<PERSON_2>': 'Nicole',
  '<PERSON_3>': 'Heidi',
  '<PERSON_4>': 'Peter'}}
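
In the mapping above, `Peter` and `Nicole` each appear under two different placeholders, so an LLM reading the anonymized text cannot tell that they are the same person. The sketch below shows one way to reuse a placeholder for repeated values while still producing a placeholder-to-original mapping. It is plain Python, independent of Presidio: the `pseudonymize` function is hypothetical, the spans are copied from the analyzer output shown earlier, and it assumes the spans do not overlap.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end, entity_type)


def pseudonymize(text: str, spans: List[Span]) -> Tuple[str, Dict[str, str]]:
    """Replace each span with <TYPE_n>, reusing n for repeated values."""
    placeholder_for: Dict[Tuple[str, str], str] = {}  # (type, value) -> placeholder
    counters: Dict[str, int] = {}
    pieces: List[str] = []
    cursor = 0
    for start, end, entity_type in sorted(spans):
        value = text[start:end]
        key = (entity_type, value)
        if key not in placeholder_for:
            index = counters.get(entity_type, 0)
            counters[entity_type] = index + 1
            placeholder_for[key] = f"<{entity_type}_{index}>"
        pieces.append(text[cursor:start])
        pieces.append(placeholder_for[key])
        cursor = end
    pieces.append(text[cursor:])
    # Invert to placeholder -> original so the text can be restored later
    reverse = {ph: value for (_, value), ph in placeholder_for.items()}
    return "".join(pieces), reverse


text = ("Peter gave his book to Heidi which later gave it to Nicole. "
        "Peter lives in London and Nicole lives in Tashkent.")
spans = [(0, 5, "PERSON"), (23, 28, "PERSON"), (52, 58, "PERSON"),
         (60, 65, "PERSON"), (75, 81, "LOCATION"), (86, 92, "PERSON"),
         (102, 110, "LOCATION")]
anonymized, mapping = pseudonymize(text, spans)
print(anonymized)
print(mapping)
```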

#### 4.2 Pseudo Deanonymize

In [27]:
# Map the placeholders in the LLM output back to the original values

# String that is output by the LLM
llm_str = "<PERSON_4> gave his book to <PERSON_3> which later gave it to <PERSON_2>. <PERSON_1> lives in <LOCATION_1> and <PERSON_0> lives in <LOCATION_0>."

In [28]:
# Make manual map back function
def map_back(llm_output: str, entity_mapping):
    for entity in entity_mapping:
        for name in entity_mapping[entity]:
            llm_output = llm_output.replace(name, entity_mapping[entity][name])
    return llm_output

In [29]:
map_back(llm_str, entity_mapping)

'Peter gave his book to Heidi which later gave it to Nicole. Peter lives in London and Nicole lives in Tashkent.'
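
A possible variation on `map_back` is to do the substitution in a single regular-expression pass. This leaves placeholders the LLM may have invented untouched and avoids re-scanning values that were already restored. The `restore` function and the placeholder pattern below are assumptions for illustration, written to match the `<ENTITY_index>` format used above.

```python
import re
from typing import Dict

# Matches placeholders such as <PERSON_0> or <SG_NRIC_FIN_12>
PLACEHOLDER = re.compile(r"<[A-Z_]+_\d+>")


def restore(llm_output: str, entity_mapping: Dict[str, Dict[str, str]]) -> str:
    """Swap placeholders back in one pass; unknown placeholders stay as-is."""
    flat = {
        placeholder: original
        for per_type in entity_mapping.values()
        for placeholder, original in per_type.items()
    }
    return PLACEHOLDER.sub(lambda m: flat.get(m.group(0), m.group(0)), llm_output)


sample_mapping = {
    "LOCATION": {"<LOCATION_0>": "Tashkent"},
    "PERSON": {"<PERSON_0>": "Nicole", "<PERSON_1>": "Peter"},
}
print(restore("<PERSON_1> wrote to <PERSON_0> in <LOCATION_0> about <PERSON_9>.",
              sample_mapping))
# Peter wrote to Nicole in Tashkent about <PERSON_9>.
```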

### 5. Image Redactor

In [30]:
from PIL import Image
from presidio_image_redactor import ImageRedactorEngine

#### Ensure Tesseract OCR is installed

+ On Ubuntu/Debian

`sudo apt update && sudo apt install tesseract-ocr -y`

+ On macOS (Using Homebrew)

`brew install tesseract`

+ On Windows

Download the installer from: [Tesseract OCR Releases](https://github.com/UB-Mannheim/tesseract/wiki)

Add the installation path to your system's PATH variable

In [31]:
# Get the image to redact using PIL lib (pillow)
image = Image.open("./sensitive_image.png")

# Initialize the engine
engine = ImageRedactorEngine()

# Redact the detected PII regions (pass fill=(255, 192, 203) to draw pink boxes instead of the default fill)
redacted_image = engine.redact(image)

# save the redacted image
redacted_image.save("redacted_image.png")
# uncomment to open the image for viewing
# redacted_image.show()

## Cloak API

### 1. Text Anonymisation

- Analyse
- Anonymise
- Transform

Entity types: https://guide.cloak.gov.sg/free-text-anonymisation/free-text-anonymisation/entity-types

In [9]:
# import needed libraries
import requests
from cloak_help import generate_signature, extract_url_info

In [None]:
# define private and public key
public_key = "ENTER_YOUR_PUBLIC_KEY"
private_key = "ENTER_YOUR_PRIVATE_KEY"

#### 1.1 Analyse

##### `payload` parameters
| Parameter | Required/Optional | Default | Description |
|----------|----------|----------|----------|
| text  | required  | NIL  | text to be analyzed  |
| language  | required  | NIL  | language of the text  |
| score_threshold  | optional  | 0.3  | confidence score has to be above threshold score in order to be considered detected  |
| entities  | optional  | all entities  | entities to detect  |
| allow_list  | optional  | none  | terms that should not be detected or redacted as PII, even if they match an entity type  |
| analyze_parameters  | optional  | none  | allows fine-tuning of entity detection behavior  |

In [11]:
# Analyse Endpoint
http_method = "POST"

url = 'https://ext-api.cloak.gov.sg/prod/L4/analyze'

payload = {
    "text": "Dear Sir/Madam, I am writing to appeal to the Singapore Ministry of Manpower on behalf of my wife, Kim Harin (S8273756Y), whose work pass is due for renewal. Her date of birth is 11 November 1911. Our Singapore address is Block 555 Tampines North Drive 12 #11-11 Singapore 510555. My mobile number is 9384 5432 and Harin’s number is +65 88534123 or you can email us at kimfamily@gmail.com.",
    "language": "en",
    "score_threshold": 0.3,
    "entities": [
        "PERSON",
        "SG_NRIC_FIN",
        "SG_BANK_ACCOUNT_NUMBER",
        "SG_ADDRESS",
        "PHONE_NUMBER"
    ],
    "allow_list": ["giro"],
    "analyze_parameters": {
        "nric": {
            "checksum": False # analyzer will still detect NRIC entities even if they fail checksum validation, since it is set to False
        }
    }
}

service = "fta"

signed_headers = {
    "Content-Type": "application/json"
}

In [12]:
path, query_params = extract_url_info(url)
signature = generate_signature(http_method, path, query_params, signed_headers, payload, private_key, service)
authorization = f'CLOAK-AUTH Credential={public_key},SignedHeaders=content-type,Signature={signature}'
headers = {'Content-Type':'application/json', 'Accept':'application/json', 'Authorization': f'{authorization}', 'x-cloak-service': f'{service}'}
response = requests.post(url, headers=headers, json=payload, verify=True)  # keep TLS certificate verification enabled
response.json()

[{'analysis_explanation': None,
  'end': 119,
  'entity_type': 'SG_NRIC_FIN',
  'recognition_metadata': {'recognizer_identifier': 'SgFinRecognizer_140706163607920',
   'recognizer_name': 'SgFinRecognizer'},
  'score': 1.0,
  'start': 110},
 {'analysis_explanation': None,
  'end': 279,
  'entity_type': 'SG_ADDRESS',
  'recognition_metadata': {'recognizer_identifier': 'SgAddressRecognizer_140706163607680',
   'recognizer_name': 'SgAddressRecognizer'},
  'score': 0.95,
  'start': 222},
 {'analysis_explanation': None,
  'end': 310,
  'entity_type': 'PHONE_NUMBER',
  'recognition_metadata': {'recognizer_identifier': 'PhoneRecognizer_140706163609312',
   'recognizer_name': 'PhoneRecognizer'},
  'score': 0.95,
  'start': 301},
 {'analysis_explanation': None,
  'end': 345,
  'entity_type': 'PHONE_NUMBER',
  'recognition_metadata': {'recognizer_identifier': 'PhoneRecognizer_140706163609312',
   'recognizer_name': 'PhoneRecognizer'},
  'score': 0.95,
  'start': 333},
 {'analysis_explanation': No
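
Each result only carries `start` and `end` offsets, so the matched text has to be sliced out of the original string. The helper below is a small sketch (the name `matched_text` is made up). The sample sentence is a shortened version of the payload text with hand-computed offsets, not an actual API response.

```python
from typing import Any, Dict, List


def matched_text(text: str, results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Attach the matched substring to each result, ordered by position."""
    return [
        {
            "entity_type": r["entity_type"],
            "match": text[r["start"]:r["end"]],
            "score": r["score"],
        }
        for r in sorted(results, key=lambda r: r["start"])
    ]


sample_text = "I am writing on behalf of my wife, Kim Harin (S8273756Y)."
sample_results = [
    {"entity_type": "SG_NRIC_FIN", "start": 46, "end": 55, "score": 1.0},
    {"entity_type": "PERSON", "start": 35, "end": 44, "score": 0.85},
]
for row in matched_text(sample_text, sample_results):
    print(row)
```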

#### 1.2 Anonymise

##### `payload` parameters
| Parameter | Required/Optional | Default | Description |
|----------|----------|----------|----------|
| text  | required  | NIL  | text to be anonymised  |
| anonymizers  | optional  | default anonymisation for all entities: `<PERSON>`, `<PHONE_NUMBER>`, etc.  | modify how entities will be anonymized  |
| analyzer_results  | required  | NIL  | PII entities detected from Analyzer  |
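
Since `analyzer_results` is simply the list returned by the Analyse endpoint, the Anonymise payload can be assembled from that response instead of being copied by hand. A minimal sketch, assuming the parameter names in the table above; the helper name `build_anonymize_payload` is hypothetical.

```python
from typing import Any, Dict, List, Optional


def build_anonymize_payload(
    text: str,
    analyzer_results: List[Dict[str, Any]],
    anonymizers: Optional[Dict[str, Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Assemble the Anonymise request body from an Analyse response."""
    payload: Dict[str, Any] = {"text": text, "analyzer_results": analyzer_results}
    if anonymizers:  # optional: omit to fall back to the default replacement
        payload["anonymizers"] = anonymizers
    return payload


example = build_anonymize_payload(
    text="My NRIC is S8273756Y",
    analyzer_results=[
        {"entity_type": "SG_NRIC_FIN", "start": 11, "end": 20, "score": 1.0}
    ],
    anonymizers={"SG_NRIC_FIN": {"type": "replace", "new_value": "<MASKED NRIC>"}},
)
print(example)
```

In the notebook, the `analyzer_results` argument would be the `response.json()` value from the Analyse call.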

In [13]:
# Anonymise Endpoint
http_method = "POST"

url = 'https://ext-api.cloak.gov.sg/prod/L4/anonymize'

payload = {
    "text": "Dear Sir/Madam, I am writing to appeal to the Singapore Ministry of Manpower on behalf of my wife, Kim Harin (S8273756Y), whose work pass is due for renewal. Her date of birth is 11 November 1911. Our Singapore address is Block 555 Tampines North Drive 12 #11-11 Singapore 510555. My mobile number is 9384 5432 and Harin’s number is +65 88534123 or you can email us at kimfamily@gmail.com.",
    "anonymizers": {
        "SG_NRIC_FIN": {
            "type": "replace",
            "new_value": "<MASKED NRIC>"
        },
    },
    "analyzer_results": [
        {
            "analysis_explanation": None,
            "end": 119,
            "entity_type": "SG_NRIC_FIN",
            "recognition_metadata": {
                "recognizer_identifier": "SgFinRecognizer_140322543810544",
                "recognizer_name": "SgFinRecognizer"
            },
            "score": 1.0,
            "start": 110
        },
        {
            "analysis_explanation": None,
            "end": 279,
            "entity_type": "SG_ADDRESS",
            "recognition_metadata": {
                "recognizer_identifier": "SgAddressRecognizer_140322543810736",
                "recognizer_name": "SgAddressRecognizer"
            },
            "score": 0.95,
            "start": 222
        },
        {
            "analysis_explanation": None,
            "end": 310,
            "entity_type": "PHONE_NUMBER",
            "recognition_metadata": {
                "recognizer_identifier": "PhoneRecognizer_140322543811840",
                "recognizer_name": "PhoneRecognizer"
            },
            "score": 0.95,
            "start": 301
        },
        {
            "analysis_explanation": None,
            "end": 345,
            "entity_type": "PHONE_NUMBER",
            "recognition_metadata": {
                "recognizer_identifier": "PhoneRecognizer_140322543811840",
                "recognizer_name": "PhoneRecognizer"
            },
            "score": 0.95,
            "start": 333
        },
        {
            "analysis_explanation": None,
            "end": 108,
            "entity_type": "PERSON",
            "recognition_metadata": {
                "recognizer_identifier": "SpacyRecognizer_140322543811792",
                "recognizer_name": "SpacyRecognizer"
            },
            "score": 0.85,
            "start": 99
        },
        {
            "analysis_explanation": None,
            "end": 320,
            "entity_type": "PERSON",
            "recognition_metadata": {
                "recognizer_identifier": "SpacyRecognizer_140322543811792",
                "recognizer_name": "SpacyRecognizer"
            },
            "score": 0.85,
            "start": 315
        },
        {
            "analysis_explanation": None,
            "end": 345,
            "entity_type": "SG_BANK_ACCOUNT_NUMBER",
            "recognition_metadata": {
                "recognizer_identifier": "SgBankAccountRecognizer_140322543810688",
                "recognizer_name": "SgBankAccountRecognizer"
            },
            "score": 0.45999999999999996,
            "start": 337
        }
    ]
}

service = "fta"

signed_headers = {
    "Content-Type": "application/json"
}

In [14]:
path, query_params = extract_url_info(url)
signature = generate_signature(http_method, path, query_params, signed_headers, payload, private_key, service)
authorization = f'CLOAK-AUTH Credential={public_key},SignedHeaders=content-type,Signature={signature}'
headers = {'Content-Type':'application/json', 'Accept':'application/json', 'Authorization': f'{authorization}', 'x-cloak-service': f'{service}'}
response = requests.post(url, headers=headers, json=payload, verify=True)  # keep TLS certificate verification enabled
response.json()

{'text': 'Dear Sir/Madam, I am writing to appeal to the Singapore Ministry of Manpower on behalf of my wife, <PERSON> (<MASKED NRIC>), whose work pass is due for renewal. Her date of birth is 11 November 1911. Our Singapore address is <SG_ADDRESS>. My mobile number is <PHONE_NUMBER> and <PERSON>’s number is <PHONE_NUMBER> or you can email us at kimfamily@gmail.com.',
 'items': [{'start': 299,
   'end': 313,
   'entity_type': 'PHONE_NUMBER',
   'original_text': '+65 88534123',
   'text': '<PHONE_NUMBER>',
   'operator': 'replace'},
  {'start': 278,
   'end': 286,
   'entity_type': 'PERSON',
   'original_text': 'Harin',
   'text': '<PERSON>',
   'operator': 'replace'},
  {'start': 259,
   'end': 273,
   'entity_type': 'PHONE_NUMBER',
   'original_text': '9384 5432',
   'text': '<PHONE_NUMBER>',
   'operator': 'replace'},
  {'start': 225,
   'end': 237,
   'entity_type': 'SG_ADDRESS',
   'original_text': 'Block 555 Tampines North Drive 12 #11-11 Singapore 510555',
   'text': '<SG_ADDRESS>

#### 1.3 Transform

Combines the Analyse and Anonymise steps into a single request.

In [15]:
# create the endpoint request
http_method = "POST"

url = 'https://ext-api.cloak.gov.sg/prod/L4/transform'

payload = {
    "text": "Dear Sir/Madam, I am writing to appeal to the Singapore Ministry of Manpower on behalf of my wife, Kim Harin (S8273756Y), whose work pass is due for renewal. Her date of birth is 11 November 1911. Our Singapore address is Block 555 Tampines North Drive 12 #11-11 Singapore 510555. My mobile number is 9384 5432 and Harin’s number is +65 88534123 or you can email us at kimfamily@gmail.com.",
    "language": "en",
    "entities": [
        "PERSON",
        "SG_NRIC_FIN",
        "SG_BANK_ACCOUNT_NUMBER",
        "SG_ADDRESS",
        "PHONE_NUMBER",
        "EMAIL_ADDRESS"
    ],
    "score_threshold": 0.3,
    "anonymizers": {
        "SG_NRIC_FIN": {
            "type": "replace",
            "new_value": "<SG_NRIC_FIN>"
        },
        "PHONE_NUMBER": {
            "type": "mask",
            "masking_char": "*",
            "chars_to_mask": 4,
            "from_end": False
        },

        "PERSON": {
            "type": "hash",
            "hash_type": "sha256"
        },

        "EMAIL_ADDRESS": {
            "type": "encrypt",
            "key": "12345678901234567890123456789012"
        },

        "SG_ADDRESS": {
            "type": "replace",
            "new_value": "<SG_ADDRESS>"
        },

        "SG_BANK_ACCOUNT_NUMBER": {
            "type": "replace",
            "new_value": "<SG_BANK_ACCOUNT_NUMBER>"
        }
    }
}

service = "fta"

signed_headers = {
    "Content-Type": "application/json"
}

In [16]:
# make the transformation request
path, query_params = extract_url_info(url)
signature = generate_signature(http_method, path, query_params, signed_headers, payload, private_key, service)
authorization = f'CLOAK-AUTH Credential={public_key},SignedHeaders=content-type,Signature={signature}'
headers = {'Content-Type':'application/json', 'Accept':'application/json', 'Authorization': f'{authorization}', 'x-cloak-service': f'{service}'}
response = requests.post(url, headers=headers, json=payload, verify=True)
response.json()

{'text': 'Dear Sir/Madam, I am writing to appeal to the Singapore Ministry of Manpower on behalf of my wife, 27f091ca5b236160910cba5e6b6c2ab87316bcfbc6c48a17317cb137158c7f92 (S8273756Y), whose work pass is due for renewal. Her date of birth is 11 November 1911. Our Singapore address is <SG_ADDRESS>. My mobile number is **** 5432 and 636b17bb7cbab1118e6124a690df7d7e7b13cf991f86532abcaba372a420d53d’s number is ****88534123 or you can email us at YPZlhtlk+kV9MuNYFVggOOlbto8BiiLChmPJEuZ8fDTj8P4tjqrlej3xI0MJDTS7.',
 'items': [{'start': 438,
   'end': 502,
   'entity_type': 'EMAIL_ADDRESS',
   'original_text': 'kimfamily@gmail.com',
   'text': 'YPZlhtlk+kV9MuNYFVggOOlbto8BiiLChmPJEuZ8fDTj8P4tjqrlej3xI0MJDTS7',
   'operator': 'encrypt'},
  {'start': 402,
   'end': 414,
   'entity_type': 'PHONE_NUMBER',
   'original_text': '+65 88534123',
   'text': '****88534123',
   'operator': 'mask'},
  {'start': 325,
   'end': 389,
   'entity_type': 'PERSON',
   'original_text': 'Harin',
   'text': '636b1
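
The `hash` operator output above can be reasoned about locally. The sketch below uses only Python's `hashlib` to show two properties of a SHA-256 hex digest that are visible in the response: it is always 64 characters long (matching `end - start` for the `PERSON` items), and it is deterministic, so different spellings of the same person ("Kim Harin" vs "Harin") hash to unrelated values. Whether Cloak salts or normalises the input before hashing is not known here, so the digests produced by this sketch are not expected to match the ones returned by the API.

```python
import hashlib


def sha256_hex(value: str) -> str:
    """Return the SHA-256 hex digest of a UTF-8 encoded string."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


full_name = sha256_hex("Kim Harin")
short_name = sha256_hex("Harin")

print(len(full_name))                        # 64
print(full_name == sha256_hex("Kim Harin"))  # True: same input, same digest
print(full_name == short_name)               # False: name variants are not linked
```

If records need to stay linkable after hashing, names would have to be normalised to one form before they are sent for transformation.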

### 2. Tabular Data Anonymisation

Tabular Data Anonymisation allows users to tag and transform data through **transformation/anonymisation techniques** and to address re-identification risk through **k-anonymity checks**.
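
As a rough illustration of what a k-anonymity check measures (a local sketch, not Cloak's implementation or API): a dataset is k-anonymous with respect to a set of quasi-identifier columns if every combination of values in those columns is shared by at least k rows. The column names and records below are made up for the example.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows sharing the same
    quasi-identifier values (0 for an empty dataset)."""
    groups = Counter(tuple(row[col] for col in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


# Hypothetical, already-generalised records
records = [
    {"age_band": "30-39", "postal_sector": "51", "diagnosis": "A"},
    {"age_band": "30-39", "postal_sector": "51", "diagnosis": "B"},
    {"age_band": "40-49", "postal_sector": "52", "diagnosis": "A"},
    {"age_band": "40-49", "postal_sector": "52", "diagnosis": "C"},
    {"age_band": "40-49", "postal_sector": "52", "diagnosis": "B"},
]

k = k_anonymity(records, ["age_band", "postal_sector"])
print(k)  # 2
```

A low k means some individuals sit in very small groups and are easier to re-identify; further generalisation or masking of the quasi-identifiers raises k.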

#### 2.1 Create New Task

A task defines the rules used to transform/anonymise tabular data. Each task is identified by its `task_id`.

In [43]:
# Create new task (POST Example)

url = 'https://ext-api.cloak.gov.sg/prod/L4/tabular/tasks'

http_method = "POST"

signed_headers = {
    "Content-Type": "application/json"
}

path, query_params = extract_url_info(url)

service = "tda"

payload = {
    "task_type": "PAYLOAD",
    "access_mode": "CUSTOM",
    "rules": [
        {
            "column_name": "Address",
            "info_type": "OTHERS",
            "sensitivity_type": "NON_SENSITIVE",
            "data_type": "TEXT",
            "transformation": "MASK",
            "parameters": {
                "mask_type": "mask_suffix",
                "num_of_char": 10,
                "error_resolution_type": "keyword"
            }
        },
        {
            "column_name": "user_nric",
            "info_type": "OTHERS",
            "sensitivity_type": "SENSITIVE",
            "data_type": "TEXT",
            "transformation": "MASK",
            "parameters": {
                "mask_type": "mask_suffix",
                "num_of_char": 8,
                "error_resolution_type": "keyword"
            }
        }
    ]
}

signature = generate_signature(http_method, path, query_params, signed_headers, payload, private_key, service)
authorization = f'CLOAK-AUTH Credential={public_key},SignedHeaders=content-type,Signature={signature}'
headers = {'Content-Type':'application/json', 'Accept':'application/json', 'Authorization': f'{authorization}', 'x-cloak-service': f'{service}'}
response = requests.post(url, headers=headers, json=payload, verify=True)  # keep TLS certificate verification on

response.json()

{'data': {'task_id': 'TLwbIbI01dkz'}}

In [44]:
task_id = response.json()['data']['task_id']

task_id

'TLwbIbI01dkz'

#### 2.2 Get All Tasks

In [None]:
# Get All tasks (GET Example)

url = 'https://ext-api.cloak.gov.sg/prod/L4/tabular/tasks'
http_method = "GET"
signed_headers = {
    "Content-Type": "application/json"
}

payload = {}

path, query_params = extract_url_info(url)

service = "tda"

signature = generate_signature(http_method, path, query_params, signed_headers, payload, private_key, service)
authorization = f'CLOAK-AUTH Credential={public_key},SignedHeaders=content-type,Signature={signature}'
headers = {'Content-Type':'application/json', 'Accept':'application/json', 'Authorization': f'{authorization}', 'x-cloak-service': f'{service}'}
response = requests.get(url, headers=headers, verify=True)  # keep TLS certificate verification on

response.json()
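
Every request in this notebook rebuilds the same `CLOAK-AUTH` headers by hand. A small helper, sketched below, could remove that repetition. `build_cloak_headers` is a hypothetical name; it only formats values that the cells above already compute (`public_key`, `signature`, `service`) and does not perform the signing itself.

```python
def build_cloak_headers(public_key: str, signature: str, service: str) -> dict:
    """Assemble the request headers used by the cells in this notebook.

    The signature must already have been produced by generate_signature().
    """
    authorization = (
        f"CLOAK-AUTH Credential={public_key},"
        f"SignedHeaders=content-type,Signature={signature}"
    )
    return {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "Authorization": authorization,
        "x-cloak-service": service,
    }


# Placeholder values for illustration only
example_headers = build_cloak_headers("my-public-key", "abc123", "tda")
print(example_headers["Authorization"])
```

With such a helper, each cell would shrink to computing the signature and calling `requests.post(url, headers=build_cloak_headers(...), json=payload)`.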

#### 2.3 Create Transform Using Task

In [46]:
# Create Transform (POST Example)

http_method = "POST"

url = f'https://ext-api.cloak.gov.sg/prod/L4/tabular/tasks/{task_id}/transform'

signed_headers = {
    "Content-Type": "application/json"
}

path, query_params = extract_url_info(url)

service = "tda"

payload = {
    "data": [
        {
            "user_nric": "S1234567D",
            "Address": "30 Balabala road"
        },
        {
            "user_nric": "G1234567B",
            "Address": "44 Telok Blangah drive"
        }
    ]
}

signature = generate_signature(http_method, path, query_params, signed_headers, payload, private_key, service)
authorization = f'CLOAK-AUTH Credential={public_key},SignedHeaders=content-type,Signature={signature}'
headers = {'Content-Type':'application/json', 'Accept':'application/json', 'Authorization': f'{authorization}', 'x-cloak-service': f'{service}'}
response = requests.post(url, headers=headers, json=payload, verify=True)  # keep TLS certificate verification on

response.json()

{'data': [{'user_nric': 'S********', 'Address': '30 Bal**********'},
  {'user_nric': 'G********', 'Address': '44 Telok Bla**********'}]}
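
Judging from the response above, `mask_suffix` with `num_of_char` set to n replaces the last n characters of each value with `*`. The function below is a local approximation of that observed behaviour, useful for sanity-checking expected results. It is not Cloak's code, and how the service treats values shorter than n (or what `error_resolution_type` does) is not covered; the clamping used here is an assumption.

```python
def mask_suffix(value: str, num_of_char: int, mask_char: str = "*") -> str:
    """Replace the last `num_of_char` characters of `value` with `mask_char`."""
    # Clamp to the string length: an assumption for values shorter than num_of_char
    n = min(max(num_of_char, 0), len(value))
    return value[: len(value) - n] + mask_char * n


print(mask_suffix("S1234567D", 8))          # S********
print(mask_suffix("30 Balabala road", 10))  # 30 Bal**********
```

Note that masking 8 of the 9 NRIC characters leaves only the prefix letter, while a 10-character suffix mask leaves most of a long address readable.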

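Using Tabular Data Anonymisation on **CSV**: the rows of a CSV file are converted to the JSON payload format, sent to the same transform endpoint, and the masked result is written back to CSV.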

In [47]:
import csv

# Functions to convert CSV data into the JSON payload format and vice versa

def csv_to_json(filename):
    data = []
    with open(filename, mode='r', encoding='utf-8') as file:
        reader = csv.reader(file)
        next(reader, None)  # Skip the header row (no error if the file is empty)

        for row in reader:
            if len(row) >= 2:  # Ensure at least two columns exist
                data.append({
                    "user_nric": row[0],
                    "Address": row[1]
                })
    return {"data": data}


def json_to_csv(json_data, output_filename):
    # Open the output CSV file
    with open(output_filename, mode='w', newline='', encoding='utf-8') as outfile:
        fieldnames = ['user_nric', 'Address']
        writer = csv.DictWriter(outfile, fieldnames=fieldnames)

        # Write the header row
        writer.writeheader()

        # Write the rows from the JSON data
        for item in json_data:
            writer.writerow({
                'user_nric': item['user_nric'],
                'Address': item['Address']
            })
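
The helpers above hard-code the two column names. A possible generalisation, sketched below with `csv.DictReader` and `csv.DictWriter`, takes the columns from the header row instead, so the same code would work for a task with different rules. The function names `csv_to_payload` and `rows_to_csv` are hypothetical, and the sketch assumes the CSV header names match the `column_name` values in the task rules.

```python
import csv


def csv_to_payload(filename):
    """Read a CSV file into the {"data": [...]} shape used by the transform call."""
    with open(filename, mode="r", newline="", encoding="utf-8") as file:
        return {"data": list(csv.DictReader(file))}


def rows_to_csv(rows, output_filename):
    """Write a list of dicts back to CSV, using the first row's keys as the header."""
    if not rows:
        raise ValueError("no rows to write")
    with open(output_filename, mode="w", newline="", encoding="utf-8") as outfile:
        writer = csv.DictWriter(outfile, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

Usage would mirror the cells that follow: `payload = csv_to_payload("data.csv")` before the request and `rows_to_csv(response.json()["data"], "masked.csv")` after it.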

In [48]:
read_filename = "data.csv"
payload = csv_to_json(read_filename)

In [49]:
payload

{'data': [{'user_nric': 'S1234567D', 'Address': '30 Balabala Road'},
  {'user_nric': 'G1234567B', 'Address': '44 Telok Blangah Drive'},
  {'user_nric': 'T0456789A', 'Address': '20 Ghim Moh Rd'},
  {'user_nric': 'S8765432J', 'Address': '15 Holland Avenue'},
  {'user_nric': 'F2345678N', 'Address': '99 Punggol Field'},
  {'user_nric': 'T9876543K', 'Address': '22 Yishun Street 81'},
  {'user_nric': 'G7654321M', 'Address': '50 Bedok South Ave 3'},
  {'user_nric': 'S3456789P', 'Address': '10 Serangoon North Ave 1'},
  {'user_nric': 'T1122334Z', 'Address': '33 Pasir Ris Drive 8'},
  {'user_nric': 'F5566778L', 'Address': '12 Clementi Ave 4'},
  {'user_nric': 'G3344556R', 'Address': '88 Tampines Central 7'},
  {'user_nric': 'S7788990X', 'Address': '77 Bukit Timah Road'},
  {'user_nric': 'T6677889Y', 'Address': '25 Jurong West Street 42'},
  {'user_nric': 'F9900112K', 'Address': '18 Toa Payoh Lorong 5'},
  {'user_nric': 'G8822334M', 'Address': '60 Choa Chu Kang Loop'},
  {'user_nric': 'S3344556T

In [50]:
signature = generate_signature(http_method, path, query_params, signed_headers, payload, private_key, service)
authorization = f'CLOAK-AUTH Credential={public_key},SignedHeaders=content-type,Signature={signature}'
headers = {'Content-Type':'application/json', 'Accept':'application/json', 'Authorization': f'{authorization}', 'x-cloak-service': f'{service}'}
response = requests.post(url, headers=headers, json=payload, verify=True)  # keep TLS certificate verification on

response.json()

{'data': [{'user_nric': 'S********', 'Address': '30 Bal**********'},
  {'user_nric': 'G********', 'Address': '44 Telok Bla**********'},
  {'user_nric': 'T********', 'Address': '20 G**********'},
  {'user_nric': 'S********', 'Address': '15 Holl**********'},
  {'user_nric': 'F********', 'Address': '99 Pun**********'},
  {'user_nric': 'T********', 'Address': '22 Yishun**********'},
  {'user_nric': 'G********', 'Address': '50 Bedok S**********'},
  {'user_nric': 'S********', 'Address': '10 Serangoon N**********'},
  {'user_nric': 'T********', 'Address': '33 Pasir R**********'},
  {'user_nric': 'F********', 'Address': '12 Clem**********'},
  {'user_nric': 'G********', 'Address': '88 Tampines**********'},
  {'user_nric': 'S********', 'Address': '77 Bukit **********'},
  {'user_nric': 'T********', 'Address': '25 Jurong West**********'},
  {'user_nric': 'F********', 'Address': '18 Toa Payo**********'},
  {'user_nric': 'G********', 'Address': '60 Choa Chu**********'},
  {'user_nric': 'S********

In [51]:
json_data = response.json()["data"]

write_filename = "masked.csv"

json_to_csv(json_data, write_filename)

## Conclusion

GovTech Cloak excels at detecting data specific to the Singapore context, such as Singaporean phone numbers and addresses, but it lacks the flexibility of Microsoft Presidio, whose analyzers and anonymizers can be customised and extended to meet specific use cases.

Both Cloak and Presidio perform suboptimally when detecting certain Personally Identifiable Information (PII) entities, such as Singapore passport numbers. Presidio’s open-source nature gives it a distinct advantage here: users can tune and extend it to improve accuracy and tailor it to their own requirements, which makes it the more adaptable solution for diverse data protection needs.

| Capability | Presidio | Cloak |
|------------|----------|-------|
| Analyze | Limited for local context | Limited, still lacks support for some Singapore-specific entities |
| Anonymize | Yes | Yes |
| Transform (analyse and anonymise in one call) | No | Yes |
| Modifying/extending analyzers and anonymizers | Yes | Limited, only able to modify anonymisers |
| Encryption/decryption | Yes | Yes |
| Image redaction | Yes | No |