# Data anonymization using Faker (Titanic example)

Reference: https://www.kaggle.com/code/carlmcbrideellis/data-anonymization-using-faker-titanic-example/notebook

Load the data:

In [2]:
import pandas as pd 
# read in the Titanic training data csv file
train_data = pd.read_csv('./input/titanic/train.csv')
# take a look
train_data.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


The Name column contains sensitive personal data: the real names of the passengers.

To protect privacy, we replace this column with fabricated names generated by Faker, choosing a female or male name according to the value in the Sex column.

In [3]:
from faker import Faker
fake = Faker()

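def fake_name_for(row):
    # pick a generated name that matches the passenger's recorded sex
    if row['Sex'] == 'female':
        return fake.name_female()
    return fake.name_male()

# overwrite the real names with the generated ones
train_data['Name'] = train_data.apply(fake_name_for, axis=1)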

# take a quick look
train_data.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,Sean Wood,male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,Michelle Gordon,female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,Teresa Davis,female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,Connie Compton,female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,Timothy Hutchinson,male,35.0,0,0,373450,8.05,,S
5,6,0,3,Richard Estrada,male,,0,0,330877,8.4583,,Q
6,7,0,1,Jonathan Mann,male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,Richard Stevenson,male,2.0,3,1,349909,21.075,,S
8,9,1,3,Debbie Moran,female,27.0,0,2,347742,11.1333,,S
9,10,1,2,Madeline Hanson,female,14.0,1,0,237736,30.0708,,C


Finally, we write this pseudonymized DataFrame out to a CSV file:

In [7]:
train_data.to_csv('./output/pseudonymized_train.csv', index=False)


An alternative approach, which keeps the process reversible, is to replace each passenger name with a hash and store a dictionary that maps every hash back to the original name. A hash cannot be inverted directly, so it is this dictionary that lets the authorized owner reverse the pseudonymization and retrieve the original names.

In [9]:
import hashlib

train_data = pd.read_csv('./input/titanic/train.csv')

# hash each name, keeping the first 8 hex characters (32 bits) of its SHA-256 digest
train_data['Name hash'] = train_data['Name'].apply(lambda x: hashlib.sha256(x.encode()).hexdigest()[:8])

# save a hash -> original name lookup in a dictionary
name_lookup = dict(zip(train_data['Name hash'], train_data['Name']))

# now delete the original "Name" column
train_data = train_data.drop(columns=["Name"])

# take a look
train_data.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Name hash
0,1,0,3,male,22.0,1,0,A/5 21171,7.25,,S,6c969dc7
1,2,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C,f7ad2d69
2,3,1,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S,7eb1fa77
3,4,1,1,female,35.0,1,0,113803,53.1,C123,S,c8a06e74
4,5,0,3,male,35.0,0,0,373450,8.05,,S,885e53ea
5,6,0,3,male,,0,0,330877,8.4583,,Q,fab68f7e
6,7,0,1,male,54.0,0,0,17463,51.8625,E46,S,558f7a85
7,8,0,3,male,2.0,3,1,349909,21.075,,S,cbd46ee4
8,9,1,3,female,27.0,0,2,347742,11.1333,,S,3f3b1d43
9,10,1,2,female,14.0,1,0,237736,30.0708,,C,c54fde5d
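
Because the hash above is truncated to 8 hex characters, two different names could in principle end up with the same value, and `dict(zip(...))` would then silently keep only one of them. The following is a minimal sketch of a check that could be run before building the lookup; the helper names `short_hash` and `find_collisions` are hypothetical and not part of the original notebook.

```python
import hashlib
from collections import defaultdict


def short_hash(name, length=8):
    """Return the first `length` hex characters of the SHA-256 digest of `name`."""
    return hashlib.sha256(name.encode()).hexdigest()[:length]


def find_collisions(names, length=8):
    """Map each truncated hash to the distinct names sharing it, keeping only clashes."""
    buckets = defaultdict(set)
    for name in names:
        buckets[short_hash(name, length)].add(name)
    return {h: sorted(group) for h, group in buckets.items() if len(group) > 1}


# Three names taken from the table at the top of this notebook.
names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Allen, Mr. William Henry",
]

collisions = find_collisions(names)
if collisions:
    print("Colliding hashes:", collisions)
else:
    print("No collisions at 8 hex characters for these names")
```

If the check reports a clash, lengthening the prefix (or keeping the full digest) is one simple way to resolve it.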


The owner of the dictionary can decode any value in the hash column by looking it up, for example:

In [10]:
print(name_lookup["6c969dc7"])

Braund, Mr. Owen Harris
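
Note that the hash used above is an unkeyed SHA-256 of the name, so anyone who holds a list of candidate names can recompute the hashes and match them without the dictionary. A keyed hash (HMAC) is a commonly used alternative. The sketch below swaps in that technique and is not what the notebook does; the function `keyed_hash`, the 16-character prefix, and the variable `secret_key` are assumptions made for illustration. Reversing the pseudonyms would still rely on a lookup dictionary as before.

```python
import hashlib
import hmac
import secrets


def keyed_hash(name, key, length=16):
    """Return the first `length` hex characters of HMAC-SHA256(key, name)."""
    digest = hmac.new(key, name.encode(), hashlib.sha256).hexdigest()
    return digest[:length]


# Hypothetical secret; in practice it would be generated once and stored
# separately from the published data.
secret_key = secrets.token_bytes(32)

pseudonym = keyed_hash("Braund, Mr. Owen Harris", secret_key)
print(pseudonym)  # depends on the key, so no fixed value is shown here
```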


## Drawbacks and concerns

- The dictionary holds one entry per distinct name, so it can grow large and take up a significant amount of storage when saved to the server's disk or cache.
- Storing the dictionary on the server's disk, cache, or database is a security concern, because anyone who obtains it can recover the original names.
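
Both points can be examined with a small experiment. The sketch below writes a lookup to a JSON file created with owner-only permissions, reports its size on disk, and reads it back. It is only an illustration under stated assumptions: the helpers `save_lookup` and `load_lookup` and the file name `name_lookup.json` are hypothetical, the permission bits are honored on POSIX systems but largely ignored on Windows, and file permissions alone are not a complete protection.

```python
import json
import os
import tempfile


def save_lookup(lookup, path):
    """Write the hash-to-name mapping as JSON, readable by the file owner only."""
    # os.open lets the permission bits be set at the moment the file is created.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(lookup, handle)


def load_lookup(path):
    """Read the hash-to-name mapping back from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# One entry copied from the lookup shown earlier in this notebook.
name_lookup = {"6c969dc7": "Braund, Mr. Owen Harris"}

# Hypothetical location; a temporary directory keeps the example self-contained.
path = os.path.join(tempfile.mkdtemp(), "name_lookup.json")
save_lookup(name_lookup, path)

print(os.path.getsize(path), "bytes on disk")
print(load_lookup(path)["6c969dc7"])
```

Running the same measurement on the full dictionary gives a concrete figure for the storage cost before deciding where to keep it.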