## The Titanic without Jack or Rose: data pseudonymization using Faker
In this simple notebook we shall replace the real names of each of the Titanic passengers with a [pseudonym](https://en.wikipedia.org/wiki/Pseudonym). Such a process is known as pseudonymization which, in European Union law, is defined as:

> "*...the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person.*" [(GDPR) EU 2016/679 Article 4 (5)](https://eur-lex.europa.eu/eli/reg/2016/679)

These new names are generated randomly, and each time one runs the script the names will change. To do this we shall use the [Faker](https://pypi.org/project/Faker/) package. Let us load in the data and take a look at it:

In [1]:
!pip install Faker

Collecting Faker
  Downloading Faker-4.1.1-py3-none-any.whl (1.0 MB)
Installing collected packages: Faker
Successfully installed Faker-4.1.1


In [2]:
import pandas as pd 
# read in the Titanic training data csv file
train_data = pd.read_csv('../input/titanic/train.csv')
# take a look
train_data.head(10)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | | 0 | 0 | 330877 | 8.4583 | | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | | C |


We can see that the **Name** column contains [personally identifiable information](https://en.wikipedia.org/wiki/Personal_data), in this case the real names of the passengers. We shall replace that column with fake names using **Faker**, choosing a male or a female name based on the **Sex** column.

In [3]:
from faker import Faker
fake = Faker()

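def pseudonymize_name(row):
    """Return a random fake name matching the passenger's recorded sex."""
    if row['Sex'] == 'female':
        return fake.name_female()
    return fake.name_male()

# axis=1 applies the function row by row, overwriting each real name
train_data['Name'] = train_data.apply(pseudonymize_name, axis=1)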

# take a quick look
train_data.head(10)

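| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Derek Rodriguez | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Elizabeth Harris | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Emily Hensley | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Tracy Robinson | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Jerome Anderson | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |
| 5 | 6 | 0 | 3 | Joshua Rodriguez | male | | 0 | 0 | 330877 | 8.4583 | | Q |
| 6 | 7 | 0 | 1 | Rodney Ferguson | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Arthur Cooke | male | 2.0 | 3 | 1 | 349909 | 21.075 | | S |
| 8 | 9 | 1 | 3 | Brenda Dixon | female | 27.0 | 0 | 2 | 347742 | 11.1333 | | S |
| 9 | 10 | 1 | 2 | Amy Rhodes | female | 14.0 | 1 | 0 | 237736 | 30.0708 | | C |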


Finally, we write this new pseudonymized DataFrame out to a CSV file:

In [4]:
train_data.to_csv('pseudonymized_train.csv', index=False)

Note that the new file is by no means rigorously anonymized: we have kept, for example, the **Ticket** column as it was, which could be used to re-identify passengers. We also have **Cabin** information for some of the entries. Indeed, truly anonymizing data to legal standards is far from trivial.
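
To make the risk concrete, here is a small sketch of how an unchanged **Ticket** column could link pseudonyms back to real names. The `external` frame is a hypothetical second source (for example, a ticket ledger) invented for illustration; its rows reuse ticket numbers and names from the tables above. Dropping the columns, shown at the end, is only one blunt mitigation and not a complete anonymization method.

```python
import pandas as pd

# Three pseudonymized rows, as they appear in the output above
pseudonymized = pd.DataFrame({
    "Name": ["Derek Rodriguez", "Emily Hensley", "Tracy Robinson"],
    "Ticket": ["A/5 21171", "STON/O2. 3101282", "113803"],
    "Cabin": [None, None, "C123"],
})

# Hypothetical external source that pairs ticket numbers with real names
external = pd.DataFrame({
    "Ticket": ["A/5 21171", "STON/O2. 3101282", "113803"],
    "RealName": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    ],
})

# A plain join on Ticket pairs every pseudonym with a real name again
linked = pseudonymized.merge(external, on="Ticket", how="left")
print(linked[["Name", "RealName"]])

# One blunt mitigation: drop the quasi-identifying columns before release
reduced = pseudonymized.drop(columns=["Ticket", "Cabin"])
print(list(reduced.columns))
```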

It is also worth stating that this notebook is just an exercise: the passenger list of the RMS Titanic, available under Creative Commons, is a historical document, and is used here for *archiving purposes in the public interest, scientific or historical research purposes or statistical purposes* (GDPR Article 89).

## Links:
* [Faker (GitHub)](https://github.com/joke2k/faker)
* [Faker documentation](https://faker.readthedocs.io/en/stable/)

## Related reading:
* [Data anonymization](https://en.wikipedia.org/wiki/Data_anonymization)
* [Pseudonymization](https://en.wikipedia.org/wiki/Pseudonymization)
* [Data re-identification](https://en.wikipedia.org/wiki/Data_re-identification)
* [General Data Protection Regulation (GDPR)](https://eur-lex.europa.eu/eli/reg/2016/679) (EU) 2016/679 