### 01. Pseudonymization

In this notebook, we'll explore pseudonymization methods such as hashing, masking and homomorphic pseudonymization.

For more reading on the topic, please see: 

- [Medium (Alex Ewerlöf): Anonymization vs. Pseudonymization](https://medium.com/@alexewerlof/gdpr-pseudonymization-techniques-62f7b3b46a56)
- [KIProtect: GDPR for Data Science](https://kiprotect.com/blog/gdpr_for_data_science.html)
- [IAPP: Anonymization and Pseudonymization Compared in relation to GDPR compliance](https://iapp.org/media/pdf/resource_center/PA_WP2-Anonymous-pseudonymous-comparison.pdf)


To test out more [KIProtect](https://kiprotect.com) API calls, please [sign up](https://kiprotect.com/signup.html) on our site.

In [None]:
import base64
import json  # needed for json.dumps in the API request further down
from hashlib import blake2b

import pandas as pd
import requests

from faker import Faker

#### Precheck: What is our data? 
- What information is contained in our data?
- What privacy concerns are there?
- How should we proceed?

In [None]:
df = pd.read_csv('../data/iot_example.csv')

In [None]:
df.head()

#### Section One: Hashing

- Applying the blake2b hash
- Allowing for de-pseudonymization
- Creating a reusable method for hashing

In [None]:
username = df.iloc[0,1]

In [None]:
username

In [None]:
hasher = blake2b()
hasher.update(username)
hasher.hexdigest()

Oops. What went wrong? How can we fix it?

In [None]:
# %load ../solutions/proper_encoding.py



Great! Now we have a hash. Michael is safe! (or [is he?](https://nakedsecurity.sophos.com/2014/06/24/new-york-city-makes-a-hash-of-taxi-driver-data-disclosure/))
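The linked taxi story is a reminder that a plain hash of a guessable identifier can be brute-forced by hashing every candidate value. One common mitigation is a keyed hash, which `blake2b` supports through its `key` and `digest_size` parameters. The following is a minimal sketch, not the notebook's solution: `pseudonymize` is a hypothetical helper name and the key shown is a placeholder.

```python
from hashlib import blake2b


def pseudonymize(value, key, digest_size=16):
    """Return a keyed blake2b hex digest for a string value.

    `key` must be bytes (at most 64 bytes for blake2b).
    """
    hasher = blake2b(key=key, digest_size=digest_size)
    # Hash functions operate on bytes, so encode the string first.
    hasher.update(value.encode("utf-8"))
    return hasher.hexdigest()


secret = b"supa_secret"  # placeholder key, assumed for illustration only
token = pseudonymize("michaelsmith", secret)
other = pseudonymize("michaelsmith", b"another_key")
```

The same value and key always produce the same token, so joins across tables still work, while someone without the key cannot simply hash a list of candidate usernames to find a match.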

But... what if we later need to determine that michaelsmith is a2a858011c091715...?

In [None]:
hasher.

Okay, let's try something that we can reverse...

In [None]:
# From https://stackoverflow.com/questions/2490334/simple-way-to-encode-a-string-according-to-a-password

def encode(key, clear):
    enc = []
    for i in range(len(clear)):
        key_c = key[i % len(key)]
        #print(key_c)
        enc_c = (ord(clear[i]) + ord(key_c)) % 256
        #print(enc_c)
        enc.append(enc_c)
    return base64.urlsafe_b64encode(bytes(enc))

def decode(key, enc):
    dec = []
    enc = base64.urlsafe_b64decode(enc)
    for i in range(len(enc)):
        key_c = key[i % len(key)]
        dec_c = chr((256 + enc[i] - ord(key_c)) % 256)
        dec.append(dec_c)
    return "".join(dec)

In [None]:
encode('supa_secret', username)

In [None]:
decode('supa_secret', b'4N7TycDY0dbfzujb')

#### Challenge

- Can you come up with another string which will properly decode the secret which is *not* the same as the original key?
- Hint: Take a look at the encode method and use the print statements for a clue.

In [None]:
# %load ../solutions/lockpick.py


Welp. Maybe that is not so great...
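To see one reason why, note that the scheme just adds key characters to clear-text characters modulo 256. Anyone who knows a single username and its encoded form can subtract one from the other and read off the key stream. The sketch below restates `encode` and `decode` compactly so it runs on its own; `recover_key_stream` is a hypothetical helper, and the known-plaintext scenario is an assumption made for illustration.

```python
import base64


def encode(key, clear):
    # Same arithmetic as the notebook's encode: add key and clear chars mod 256.
    enc = [(ord(c) + ord(key[i % len(key)])) % 256 for i, c in enumerate(clear)]
    return base64.urlsafe_b64encode(bytes(enc))


def decode(key, enc):
    raw = base64.urlsafe_b64decode(enc)
    return "".join(chr((b - ord(key[i % len(key)])) % 256) for i, b in enumerate(raw))


def recover_key_stream(clear, enc):
    """Subtract the known clear text from the cipher bytes to expose the key stream."""
    raw = base64.urlsafe_b64decode(enc)
    return "".join(chr((b - ord(c)) % 256) for b, c in zip(raw, clear))


token = encode("supa_secret", "michaelsmith")
stream = recover_key_stream("michaelsmith", token)
roundtrip = decode(stream, token)
```

Here `stream` is not identical to the original key (it is the key repeated out to the length of the username), yet it decodes the token just as well.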

#### Section Two: Data Masking and Tokenization

- What should we mask?
- How?
- What do we do if we need realistic values?

In [None]:
df.sample(2)

In [None]:
super_masked = df.applymap(lambda x: 'NOPE')

In [None]:
super_masked.head()

😜

Okay, no more jokes. But masking usually is just that: replacing your sensitive data with some sort of representation.

Instead, we could also tokenize the data. This means replacing it with random, fictitious data. How do we tokenize this?

In [None]:
fakes = Faker()

In [None]:
fakes.name()

In [None]:
fakes.

In [None]:
fakes.user_name()

#### Challenge

Make a new column `pseudonym` which masks the data using the faker `user_name` method.

In [None]:
# %load ../solutions/masked_pseudonym.py



Whaaaa!?!? Pretty cool, eh? 

(In case you want to read up on [how it works](https://github.com/joke2k/faker/blob/06d323f6cff95103d4ccda03f5d4ab2c45334e46/faker/providers/internet/__init__.py#L162))

But... we can't reverse it. It is tuned per locale (usually using probabilities based on names in the locale). That said, it works fabulously for test data!
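If the same user should always receive the same token, and you want to keep the option of reversing it, one approach is to store the mapping as you generate it. This is a minimal sketch: `build_token_map` is a hypothetical helper, and `secrets.token_hex` stands in for the notebook's `fakes.user_name` so the block runs without Faker installed.

```python
import secrets


def build_token_map(values, make_token):
    """Assign one unique token per distinct value.

    `make_token` is any zero-argument callable returning a string.
    It must be able to produce enough distinct tokens, otherwise the
    collision loop below will not terminate.
    """
    mapping = {}
    used = set()
    for value in values:
        if value in mapping:
            continue  # already tokenized, reuse the earlier token
        token = make_token()
        while token in used:  # redraw if two values collide on one token
            token = make_token()
        mapping[value] = token
        used.add(token)
    return mapping


usernames = ["michaelsmith", "anna", "michaelsmith"]
mapping = build_token_map(usernames, lambda: "user_" + secrets.token_hex(4))
tokens = [mapping[name] for name in usernames]

# Whoever holds the mapping can reverse the tokens, so protect it accordingly.
reverse = {token: name for name, token in mapping.items()}
```

With the notebook's objects this might look like `mapping = build_token_map(df['username'], fakes.user_name)` followed by `df['pseudonym'] = df['username'].map(mapping)`.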

#### Section Three: Homomorphic Pseudonymization

In [None]:
## This key is shared in trust (please do not abuse)!! 
## We will leave it active for a few days. 
## However, you can sign up for your own key on our site 
## https://kiprotect.com

SHARED_KEY = '42a2d3fc1cc449e2a27ddd457e056012'

##### Finding Nulls

We need to create valid JSON, but pandas and NumPy don't make this easy if you have null values, because they represent them as NaN, which is not a valid JSON value. How can we test for nulls?

In [None]:
df.isnull().any()

In [None]:
df.note.isnull().sum()

In [None]:
# Fraction of rows with a null note (count() would exclude the nulls from the denominator)
df.note.isnull().mean()

In [None]:
df.note.dtype

In [None]:
df.note.value_counts()

In [None]:
jsonable_df = df.fillna('')
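The reason for filling the nulls can be shown with the standard library alone. By default `json.dumps` writes a bare `NaN`, which strict JSON parsers reject, and with `allow_nan=False` it raises a `ValueError` instead. A small sketch, where `blank_out_nans` is a hypothetical helper that mirrors what `fillna('')` does for a single record:

```python
import json
import math


def blank_out_nans(record):
    """Replace float NaN values in a dict with empty strings."""
    return {
        key: ("" if isinstance(value, float) and math.isnan(value) else value)
        for key, value in record.items()
    }


record = {"username": "michaelsmith", "note": float("nan")}

# Default behaviour: emits the non-standard token NaN.
loose = json.dumps(record)

# Strict behaviour: refuses to serialize NaN at all.
try:
    json.dumps(record, allow_nan=False)
    strict_failed = False
except ValueError:
    strict_failed = True

# After cleaning, strict serialization succeeds.
clean = json.dumps(blank_out_nans(record), allow_nan=False)
```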

##### Creating a list of JSON items

In [None]:
items = jsonable_df.iloc[:10].T.to_dict()

In [None]:
items

In [None]:
item_list = list(items.values())

In [None]:
actions = [
    {
        "name": "pseudonymize-username",
        "transform-value" : {
            "key": "username",
            "pseudonymize" : {
                "method": "merengue",
                "key": "supa_secret", 
            }
        }
    }
]

In [None]:
data = requests.post(
    'https://api.kiprotect.com/v1/transform', 
    data = json.dumps(
        {"actions": actions, "items": item_list}, 
        allow_nan=False),
    headers = {'Authorization': 'Bearer {}'.format(
        SHARED_KEY)}
)

In [None]:
data.json()

In [None]:
depseudonymize_actions = [
    {
        "name": "encode-username",
        "transform-value": {
            "key": "username",
            "decode": {
                "format": "base64"
            }
        }
    },
    {
        "name": "depseudonymize-username",
        "transform-value": {
            "key": "username",
            "depseudonymize": {
                "method": "merengue",
                "key": "supa_secret"
            }
        }
    },
    {
        "name": "encode-username",
        "transform-value": {
            "key": "username",
            "encode": {
                "format": "utf-8"
            }
        }
    }
]

In [None]:
depseudonymized_data = requests.post(
    'https://api.kiprotect.com/v1/transform', 
    json = {'actions': depseudonymize_actions, 
            'items': data.json()['items']},
    headers = {'Authorization': 'Bearer {}'.format(
        SHARED_KEY)}
)

In [None]:
depseudonymized_data.json()

#### Challenge

Create a function which takes a dataframe and uses the pseudonymization API to pseudonymize a selected subset of columns.

It returns a new dataframe with pseudonymized data.

NOTE: Test using just 1k rows so you don't abuse the server please :)
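As a possible starting point, the action list can be generated per column rather than written by hand. The sketch below only builds the request payload and makes no network call. `build_pseudonymize_actions` is a hypothetical helper that copies the dictionary layout of the `actions` cell above; whether the API accepts one such action per column in a single request is an assumption you should verify.

```python
def build_pseudonymize_actions(columns, key, method="merengue"):
    """Build one pseudonymize action per column name.

    The dictionary layout mirrors the hand-written `actions` list
    used earlier in this notebook.
    """
    return [
        {
            "name": "pseudonymize-{}".format(column),
            "transform-value": {
                "key": column,
                "pseudonymize": {
                    "method": method,
                    "key": key,
                },
            },
        }
        for column in columns
    ]


actions = build_pseudonymize_actions(["username", "note"], "supa_secret")
```

The resulting list could then be sent together with the records of the selected rows, in the same way as the earlier `requests.post` cell.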

In [None]:
# %load ../solutions/pseudonymize_columns.py


In [None]:
ps_df = pseudonymize_columns(jsonable_df[:1000], 
                             ['username', 'note'])
ps_df.head()