# Data anonymization using Faker (Titanic example)

Reference: https://www.kaggle.com/code/carlmcbrideellis/data-anonymization-using-faker-titanic-example/notebook

# Method 1: Faker

Load the data:

In [13]:
import pandas as pd 
# read in the Titanic training data csv file
train_data = pd.read_csv('./input/titanic/train.csv')
# take a look
train_data.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


Pseudonymization: replace the real name of each Titanic passenger with a pseudonym

The Name column consists of sensitive personal data, specifically the actual names of the passengers in this instance. 

To protect privacy, we should replace this column with fabricated names generated by Faker. Faker will generate either male or female names based on the data in the Sex column.

In [14]:
from faker import Faker
# Create a Faker generator; each provider method (name_male, name_female,
# address, ...) returns a freshly generated fake value.
fake = Faker()

def fake_name_for(row):
    """Return a fake name whose gender matches the row's Sex value."""
    if row['Sex'] == 'female':
        return fake.name_female()
    return fake.name_male()

# Apply row by row (axis=1) and store the result in a new column.
train_data['Name faked'] = train_data.apply(fake_name_for, axis=1)

# take a quick look
train_data.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Name faked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Robert Howard
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Stephanie Arnold
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Elizabeth Wilson
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Christine Vasquez
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Mr. Wayne Dixon
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,Brandon Burnett
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,David Wright
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,Kyle Morris
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,Kathy Smith
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,Sandra Vaughn


Faker can generate other kinds of data as well, for example `fake.address()`.
Reference: https://github.com/joke2k/faker

## Concerns: 
- Faker can automatically generate only a limited set of data types. Potential solution: Method 3.
- Generated values are not guaranteed to be unique. Faker's `unique` proxy enforces uniqueness within a single generator instance:
```
from faker import Faker
fake = Faker()
names = [fake.unique.first_name() for i in range(500)]
assert len(set(names)) == len(names)
```
- For an email field, the generated domains are "free email" domains such as Yahoo and Gmail. (Would that skew any usage analysis that depends on the email domain?) Potential solution: draw domains from a distribution that resembles the original data.
Reference: https://medium.com/district-data-labs/a-practical-guide-to-anonymizing-datasets-with-python-faker-ecf15114c9be
- The Faker package does not appear to provide a de-anonymization method. Reversing the mapping would require storing the original and fake columns together manually, which raises its own security issues.
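
The "domain distribution" idea from the email concern can be sketched with the standard library alone. This is a minimal illustration, not Faker's API: the domain weights and the helper name `fake_email` are hypothetical, and in practice the weights would be estimated from the real email column.

```python
import random
from collections import Counter

# Hypothetical domain weights; in practice, estimate them from the real
# data so the fake emails keep a similar domain mix.
DOMAIN_WEIGHTS = {
    "gmail.com": 0.5,
    "yahoo.com": 0.2,
    "example.org": 0.3,
}

def fake_email(local_part, rng):
    """Attach a domain drawn from DOMAIN_WEIGHTS to the given local part."""
    domains = list(DOMAIN_WEIGHTS)
    weights = list(DOMAIN_WEIGHTS.values())
    domain = rng.choices(domains, weights=weights, k=1)[0]
    return f"{local_part}@{domain}"

rng = random.Random(42)  # seeded only to make the sketch repeatable
emails = [fake_email(f"user{i}", rng) for i in range(1000)]

# The observed domain mix should roughly follow the weights above.
domain_counts = Counter(email.split("@")[1] for email in emails)
print(domain_counts.most_common())
```

The local part (`user0`, `user1`, ...) is a placeholder here; it could just as well come from a Faker-generated name.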

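Finally, write the DataFrame with the new `Name faked` column to a CSV file. Note that, as the output below shows, it still contains the original `Name` column; drop that column before sharing the file, otherwise the data is not actually pseudonymized.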

In [15]:
train_data.to_csv('./output/pseudonymized_train.csv', index=False)
# take a look
train_data.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Name faked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Robert Howard
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Stephanie Arnold
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Elizabeth Wilson
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Christine Vasquez
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Mr. Wayne Dixon
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,Brandon Burnett
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,David Wright
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,Kyle Morris
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,Kathy Smith
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,Sandra Vaughn


# Method 2: Hashing

An alternative, more practical approach is to generate a hash for each passenger name and store the hash-to-name mapping in a dictionary. The hash itself cannot be reversed, so the authorized owner uses this dictionary to look up the original names.

In [16]:
import hashlib

train_data = pd.read_csv('./input/titanic/train.csv')

# create the hash
train_data['Name hash'] = train_data['Name'].apply(lambda x: hashlib.sha256(x.encode()).hexdigest()[:8])

# save to a dictionary
name_lookup = dict(zip(train_data['Name hash'],train_data['Name']))

# now delete the "Name" column
train_data = train_data.drop(["Name"], axis=1)

# take a look
train_data.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Name hash
0,1,0,3,male,22.0,1,0,A/5 21171,7.25,,S,6c969dc7
1,2,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C,f7ad2d69
2,3,1,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S,7eb1fa77
3,4,1,1,female,35.0,1,0,113803,53.1,C123,S,c8a06e74
4,5,0,3,male,35.0,0,0,373450,8.05,,S,885e53ea
5,6,0,3,male,,0,0,330877,8.4583,,Q,fab68f7e
6,7,0,1,male,54.0,0,0,17463,51.8625,E46,S,558f7a85
7,8,0,3,male,2.0,3,1,349909,21.075,,S,cbd46ee4
8,9,1,3,female,27.0,0,2,347742,11.1333,,S,3f3b1d43
9,10,1,2,female,14.0,1,0,237736,30.0708,,C,c54fde5d


The owner of the dictionary can look up the original name for any value in the hash column, for example:

In [17]:
print(name_lookup["6c969dc7"])

Braund, Mr. Owen Harris
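
Because the hash is truncated to 8 hex characters, two different names could in principle map to the same key, and `dict(zip(...))` would then silently keep only one of them. Below is a minimal sketch of a collision check, using a few names from the table above; the helper names `short_hash` and `build_lookup` are hypothetical.

```python
import hashlib

def short_hash(name, length=8):
    """SHA-256 hex digest of the name, truncated to `length` characters."""
    return hashlib.sha256(name.encode()).hexdigest()[:length]

def build_lookup(names, length=8):
    """Map truncated hash -> name, raising if two distinct names collide."""
    lookup = {}
    for name in names:
        key = short_hash(name, length)
        if key in lookup and lookup[key] != name:
            raise ValueError(
                f"hash collision on {key!r}: {lookup[key]!r} vs {name!r}"
            )
        lookup[key] = name
    return lookup

names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Allen, Mr. William Henry",
]
sample_lookup = build_lookup(names)
print(len(sample_lookup))
```

If a collision is ever reported, increasing `length` is the simplest remedy.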


## Drawbacks & Concerns:

- The dictionary could become large and take up significant storage space on the server, whether it is saved on disk or in a cache.
- Storing the dictionary on the server (disk, cache, or database) is itself a security concern, because it maps every hash back to a real name.

## Further investigation on the encoding methods 
- Use a label encoder to anonymize the data: https://medium.com/codex/data-anonymization-with-python-8976db6ded36
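
Label encoding replaces each distinct value with an integer code. The sketch below illustrates the idea with the standard library only; it is not scikit-learn's `LabelEncoder` API, and the function name `label_encode` is hypothetical. The ticket values are taken from the table above.

```python
def label_encode(values):
    """Return (codes, classes): one integer code per value, plus the sorted
    list of distinct values so that classes[code] recovers the original."""
    classes = sorted(set(values))
    index = {value: code for code, value in enumerate(classes)}
    codes = [index[value] for value in values]
    return codes, classes

tickets = ["PC 17599", "113803", "373450", "113803"]
codes, classes = label_encode(tickets)
print(codes)  # [2, 0, 1, 0]

# Decoding needs the classes list, which therefore must be protected
# just like the hash dictionary above.
decoded = [classes[code] for code in codes]
```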

# Method 3: Microsoft Presidio

Reference: 
- https://microsoft.github.io/presidio/
- Analyze Text for PII Entities & Anonymize Text with Identified PII Entities https://microsoft.github.io/presidio/samples/python/presidio_notebook/
- Installation: https://microsoft.github.io/presidio/samples/python/customizing_presidio_analyzer/
- Adding recognizers https://microsoft.github.io/presidio/analyzer/adding_recognizers/ 

In [4]:
# example 1: Simple anonymization
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult

# Analyzer output
analyzer_results = [
    RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
    RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
]

# Initialize the engine:
engine = AnonymizerEngine()

# Invoke the anonymize function with the text,
# analyzer results (potentially coming from presidio-analyzer) and
# Operators to get the anonymization output:
result = engine.anonymize(
    text="My name is Bond, James Bond", analyzer_results=analyzer_results
)

print("De-identified text")
print(result.text)

De-identified text
My name is <PERSON>, <PERSON>
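
The `start` and `end` values in each `RecognizerResult` are character offsets into the input text, with `end` exclusive, as in Python slicing. A quick check in plain Python (no Presidio needed) shows which substrings the two spans above cover:

```python
text = "My name is Bond, James Bond"

# (start, end) pairs copied from the analyzer results above.
spans = [(11, 15), (17, 27)]

covered = [text[start:end] for start, end in spans]
print(covered)  # ['Bond', 'James Bond']
```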


## Pros
1. Anonymization rules can be defined for each field in a JSON file: https://microsoft.github.io/presidio/tutorial/10_simple_anonymization/
2. Anonymization rules can be customized per entity type, e.g. mask the last 12 characters of a PHONE_NUMBER entity by replacing them with `*`: https://microsoft.github.io/presidio/tutorial/10_simple_anonymization/
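
To make the mask rule concrete, here is a minimal plain-Python sketch of what masking from the end does to a string. It only mimics the behavior described above and is not Presidio's implementation; the function name `mask_from_end` and the sample phone number are made up for illustration.

```python
def mask_from_end(value, chars_to_mask, masking_char="*"):
    """Replace the last `chars_to_mask` characters of `value` with `masking_char`."""
    if chars_to_mask <= 0:
        return value
    # Never mask more characters than the value actually has.
    n = min(chars_to_mask, len(value))
    return value[: len(value) - n] + masking_char * n

masked = mask_from_end("+1 (555) 010-0199", 12)
print(masked)  # +1 (5************
```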

In [6]:
# example 2: Simple anonymization with customized rules in json

from pprint import pprint
import json

from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig, RecognizerResult


# Analyzer output
analyzer_results = [
    RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
    RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
]

text_to_anonymize = "My name is Bond, James Bond"

anonymizer = AnonymizerEngine()

# Define anonymization operators
operators = {
    "DEFAULT": OperatorConfig("replace", {"new_value": "<ANONYMIZED>"}),
    "PHONE_NUMBER": OperatorConfig(
        "mask",
        {
            "type": "mask",
            "masking_char": "*",
            "chars_to_mask": 12,
            "from_end": True,
        },
    ),
    "TITLE": OperatorConfig("redact", {}),
}

anonymized_results = anonymizer.anonymize(
    text=text_to_anonymize, analyzer_results=analyzer_results, operators=operators
)

print(f"text: {anonymized_results.text}")
print("detailed result:")

pprint(json.loads(anonymized_results.to_json()))

text: My name is <ANONYMIZED>, <ANONYMIZED>
detailed result:
{'items': [{'end': 37,
            'entity_type': 'PERSON',
            'operator': 'replace',
            'start': 25,
            'text': '<ANONYMIZED>'},
           {'end': 23,
            'entity_type': 'PERSON',
            'operator': 'replace',
            'start': 11,
            'text': '<ANONYMIZED>'}],
 'text': 'My name is <ANONYMIZED>, <ANONYMIZED>'}
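
The `start`/`end` offsets in the detailed result refer to positions in the anonymized text, not the original. The sketch below reproduces them in plain Python by rebuilding the text span by span. It is an illustration of the offset arithmetic only, not Presidio's actual implementation, and `replace_spans` is a hypothetical helper.

```python
def replace_spans(text, spans, new_value):
    """Replace each (start, end) span of `text` with `new_value`.

    Returns the new text and the (start, end) of each replacement
    measured in the new text. Spans must not overlap.
    """
    pieces = []
    new_spans = []
    cursor = 0   # position in the original text
    out_len = 0  # length of the output built so far
    for start, end in sorted(spans):
        pieces.append(text[cursor:start])  # untouched text before the span
        out_len += start - cursor
        new_spans.append((out_len, out_len + len(new_value)))
        pieces.append(new_value)
        out_len += len(new_value)
        cursor = end
    pieces.append(text[cursor:])  # untouched tail
    return "".join(pieces), new_spans

new_text, new_spans = replace_spans(
    "My name is Bond, James Bond", [(11, 15), (17, 27)], "<ANONYMIZED>"
)
print(new_text)   # My name is <ANONYMIZED>, <ANONYMIZED>
print(new_spans)  # [(11, 23), (25, 37)]
```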


## Cons:
- De-anonymization with Presidio needs further investigation: https://github.com/search?q=repo%3Amicrosoft%2Fpresidio%20deanonymize&type=code

# Method 4: Private AI
- https://docs.private-ai.com/entities/
- https://github.com/privateai/deid-examples

# Method 5: Anonymize a database
- https://medium.com/codex/data-anonymization-with-python-8976db6ded36

# Research resources: 
- Other methods:
  - https://github.com/topics/data-anonymization
  - https://github.com/topics/pii-anonymization