### 01. Pseudonymization

In this notebook, we'll explore pseudonymization methods such as hashing, masking and homomorphic pseudonymization.

For more reading on the topic, please see: 

- [Medium (Alex Ewerlöf): Anonymization vs. Pseudonymization](https://medium.com/@alexewerlof/gdpr-pseudonymization-techniques-62f7b3b46a56)
- [KIProtect: GDPR for Data Science](https://kiprotect.com/blog/gdpr_for_data_science.html)
- [IAPP: Anonymization and Pseudonymization Compared in relation to GDPR compliance](https://iapp.org/media/pdf/resource_center/PA_WP2-Anonymous-pseudonymous-comparison.pdf)


To test out more [KIProtect](https://kiprotect.com) API calls, please [sign up](https://kiprotect.com/signup.html) on our site.

In [1]:
import base64
from hashlib import blake2b

import pandas as pd
import requests

from faker import Faker

#### Precheck: What is our data? 
- What information is contained in our data?
- What privacy concerns are there?
- How should we proceed?

In [2]:
df = pd.read_csv('../data/iot_example.csv')

In [3]:
df.head()

Unnamed: 0,timestamp,username,temperature,heartrate,build,latest,note
0,2017-01-01T12:00:23,michaelsmith,12,67,4e6a7805-8faa-2768-6ef6-eb3198b483ac,0,interval
1,2017-01-01T12:01:09,kharrison,6,78,7256b7b0-e502-f576-62ec-ed73533c9c84,0,wake
2,2017-01-01T12:01:34,smithadam,5,89,9226c94b-bb4b-a6c8-8e02-cb42b53e9c90,0,
3,2017-01-01T12:02:09,eddierodriguez,28,76,2599ac79-e5e0-5117-b8e1-57e5ced036f7,0,update
4,2017-01-01T12:02:36,kenneth94,29,62,122f1c6a-403c-2221-6ed1-b5caa08f11e0,0,user


#### Section One: Hashing

- Applying the blake2b hash
- Allowing for de-pseudonymization
- Creating a reusable method for hashing

In [4]:
username = df.iloc[0,1]

In [5]:
username = username.encode("utf-8")
username

b'michaelsmith'

In [6]:
hasher = blake2b()
hasher.update(username)
hasher.hexdigest()

'a2a858011c0917154cdf8edce30d399e37df5f13217fa6d2959e453dd5245eb73a0787f0784d0c1969df51a48dc5a6664a59b724e33962be6ed4a9f0424ecb43'

Oops. What went wrong? How can we fix it? (Hint: `blake2b` hashes bytes, so a plain `str` has to be encoded first.)

In [18]:
username = username.decode("utf-8")

In [19]:
# %load ../solutions/proper_encoding.py



Great! Now we have a hash. Michael is safe! (or [is he?](https://nakedsecurity.sophos.com/2014/06/24/new-york-city-makes-a-hash-of-taxi-driver-data-disclosure/))
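The linked taxi story shows why a bare hash is weak: when the set of possible inputs is small, anyone can hash every candidate and match the results. One mitigation is a keyed hash, sketched below with the `key` and `digest_size` parameters of `hashlib.blake2b`. The function name `keyed_hash` and the example key are placeholders for illustration, not part of the notebook's solution files.

```python
from hashlib import blake2b


def keyed_hash(value: str, key: bytes, digest_size: int = 16) -> str:
    """Return a keyed blake2b hex digest of a string value."""
    # blake2b works on bytes, so the string is encoded before hashing.
    hasher = blake2b(key=key, digest_size=digest_size)
    hasher.update(value.encode("utf-8"))
    return hasher.hexdigest()


# Hypothetical key for illustration; real keys should not live in a notebook.
example_key = b"supa_secret"
pseudonym = keyed_hash("michaelsmith", example_key)
```

The same input and key always give the same pseudonym, so this also doubles as the reusable hashing method mentioned at the top of this section.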

But what if we later need to determine that `michaelsmith` is `a2a858011c091715...`?

In [20]:
hasher

<_blake2.blake2b at 0x116072750>
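The hash object itself cannot be run backwards, so de-pseudonymizing a hash needs extra state. One common approach, shown here as an assumption rather than the notebook's own solution, is to record a lookup table from digest to original value while hashing. The names `hash_with_lookup` and `lookup` are hypothetical.

```python
from hashlib import blake2b

# digest -> original value. This table is as sensitive as the raw data,
# so it would need to be stored and access-controlled accordingly.
lookup = {}


def hash_with_lookup(value: str) -> str:
    """Hash a value and remember which digest belongs to it."""
    digest = blake2b(value.encode("utf-8")).hexdigest()
    lookup[digest] = value
    return digest


digest = hash_with_lookup("michaelsmith")
original = lookup[digest]  # look the original value up again
```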

Okay, let's try something that we can reverse...

In [31]:
# From https://stackoverflow.com/questions/2490334/simple-way-to-encode-a-string-according-to-a-password

def encode(key, clear):
    enc = []
    for i in range(len(clear)):
        key_c = key[i % len(key)]
        #print(key_c)
        enc_c = (ord(clear[i]) + ord(key_c)) % 256
        #print(enc_c)
        enc.append(enc_c)
    return base64.urlsafe_b64encode(bytes(enc))

def decode(key, enc):
    dec = []
    enc = base64.urlsafe_b64decode(enc)
    for i in range(len(enc)):
        key_c = key[i % len(key)]
        dec_c = chr((256 + enc[i] - ord(key_c)) % 256)
        dec.append(dec_c)
    return "".join(dec)

In [32]:
encode('supa_secret', username)

b'4N7TycDY0dbfzujb'

In [33]:
decode('supa_secret', b'4N7TycDY0dbfzujb')

'michaelsmith'

#### Challenge

- Can you come up with another string which will properly decode the secret which is *not* the same as the original key?
- Hint: Take a look at the encode method and use the print statements for a clue.

In [34]:
# %load ../solutions/lockpick.py


Welp. Maybe that is not so great...
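Spoiler for the challenge: because the key is cycled with `i % len(key)`, any key that lines up with the original at every position the message actually uses will decode it. The sketch below restates `encode` and `decode` compactly so it runs on its own; `other_key` is just one possible answer, not necessarily the one in the solution file.

```python
import base64


def encode(key, clear):
    # Same scheme as above: add the cycled key character to each character.
    enc = [(ord(c) + ord(key[i % len(key)])) % 256 for i, c in enumerate(clear)]
    return base64.urlsafe_b64encode(bytes(enc))


def decode(key, enc):
    raw = base64.urlsafe_b64decode(enc)
    return "".join(
        chr((256 + b - ord(key[i % len(key)])) % 256) for i, b in enumerate(raw)
    )


secret = encode("supa_secret", "michaelsmith")

# 'michaelsmith' has 12 characters and the key has 11, so position 11 wraps
# around to key[0] == 's'. A 12-character key ending in 's' therefore
# supplies the same character at every position of this message.
other_key = "supa_secrets"
recovered = decode(other_key, secret)
```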

#### Section Two: Data Masking and Tokenization

- What should we mask?
- How?
- What do we do if we need realistic values?

In [35]:
df.sample(2)

Unnamed: 0,timestamp,username,temperature,heartrate,build,latest,note
19904,2017-01-09T10:45:39,shawalyssa,27,78,a80ab45d-4cc6-53fa-72d3-d453493d80c2,0,update
51731,2017-01-22T04:06:55,wwright,9,81,e666f6f9-43a9-307a-36ba-59f1c283106a,1,user


In [36]:
super_masked = df.applymap(lambda x: 'NOPE')

In [37]:
super_masked.head()

Unnamed: 0,timestamp,username,temperature,heartrate,build,latest,note
0,NOPE,NOPE,NOPE,NOPE,NOPE,NOPE,NOPE
1,NOPE,NOPE,NOPE,NOPE,NOPE,NOPE,NOPE
2,NOPE,NOPE,NOPE,NOPE,NOPE,NOPE,NOPE
3,NOPE,NOPE,NOPE,NOPE,NOPE,NOPE,NOPE
4,NOPE,NOPE,NOPE,NOPE,NOPE,NOPE,NOPE


😜

Okay, no more jokes. But masking usually is just that: replacing your sensitive data with some sort of representation.
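A slightly less drastic mask keeps part of the value readable. This is a minimal sketch; `mask_value` and its parameters are made up for illustration.

```python
def mask_value(value: str, visible: int = 1, mask_char: str = "*") -> str:
    """Keep the first `visible` characters and mask the rest."""
    hidden = max(len(value) - visible, 0)
    return value[:visible] + mask_char * hidden


masked = mask_value("michaelsmith")  # first letter stays, the other 11 are masked
```

In the notebook, something like `df.username.map(mask_value)` would apply it to the whole column. Note that the mask still leaks the first character and the length of the username.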

Instead, we could also tokenize the data. This means replacing it with random, fictitious data. How do we tokenize this?

In [38]:
fakes = Faker()

In [39]:
fakes.name()

'Kayla Myers'

In [41]:
fakes

<faker.generator.Generator at 0x110879ac8>

In [42]:
fakes.user_name()

'johnstout'

#### Challenge

Make a new column `pseudonym` which masks the data using the faker `user_name` method.

In [43]:
# %load ../solutions/masked_pseudonym.py



Whaaaa!?!? Pretty cool, eh? 

(In case you want to read up on [how it works](https://github.com/joke2k/faker/blob/06d323f6cff95103d4ccda03f5d4ab2c45334e46/faker/providers/internet/__init__.py#L162))

But... we can't reverse it. It is tuned per locale (usually using probabilities based on names in that locale). That said, it works fabulously for test data!
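If the same user should always get the same fake name, or tokens need to be mapped back later, one option is to remember each assignment. This is a sketch under assumptions: `TokenVault` is a hypothetical helper, and its default generator uses `secrets.token_hex` as a stand-in so the block runs without Faker. In the notebook you could pass `fakes.user_name` as `generator` instead.

```python
import secrets


class TokenVault:
    """Assign each value one stable token and allow reverse lookups."""

    def __init__(self, generator=None):
        # Stand-in generator; swap in fakes.user_name for realistic names.
        self._generator = generator or (lambda: "user_" + secrets.token_hex(4))
        self._forward = {}  # original value -> token
        self._reverse = {}  # token -> original value

    def tokenize(self, value):
        if value not in self._forward:
            token = self._generator()
            # Retry on collisions so two values never share a token.
            # (Assumes the generator can produce enough distinct values.)
            while token in self._reverse:
                token = self._generator()
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def detokenize(self, token):
        return self._reverse[token]


vault = TokenVault()
token = vault.tokenize("michaelsmith")
```

The vault holds the link between real and fake values, so it has to be protected like the original data.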

#### Section Three: Homomorphic Pseudonymization

In [44]:
## This key is shared in trust (please do not abuse)!! 
## We will leave it active for a few days. 
## However, you can sign up for your own key on our site 
## https://kiprotect.com

SHARED_KEY = '42a2d3fc1cc449e2a27ddd457e056012'

##### Finding Nulls

We need to create valid JSON, but pandas and NumPy don't make this easy if you have null values. How can we test for nulls?

In [45]:
df.isnull().any()

timestamp      False
username       False
temperature    False
heartrate      False
build          False
latest         False
note            True
dtype: bool

In [46]:
df.note.isnull().sum()

20899

In [47]:
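# Note: count() skips nulls, so this is the ratio of null to non-null notes,
# not the share of all rows. Divide by len(df) for the overall fraction.
df.note.isnull().sum() / df.note.count()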

0.16652855025578098

In [48]:
df.note.dtype

dtype('O')

In [49]:
df.note.value_counts()

wake        21245
user        21032
interval    20935
sleep       20925
update      20734
test        20627
Name: note, dtype: int64

In [50]:
jsonable_df = df.fillna('')
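Why the nulls matter: pandas represents missing values as `NaN`, and Python's `json` module writes that out as a bare `NaN` token, which is not valid JSON. Here is a small standard-library sketch of the problem and the fix; the `record` dict is a made-up stand-in for a single row.

```python
import json
import math

record = {"username": "smithadam", "note": float("nan")}

# By default json.dumps emits NaN, which strict JSON parsers reject.
lenient = json.dumps(record)

# With allow_nan=False the same call raises instead of emitting invalid JSON.
try:
    json.dumps(record, allow_nan=False)
    strict_error = None
except ValueError as exc:
    strict_error = exc

# Replacing NaN with an empty string (what fillna('') does for the frame)
# makes the record serializable in strict mode.
cleaned = {
    k: ("" if isinstance(v, float) and math.isnan(v) else v)
    for k, v in record.items()
}
strict = json.dumps(cleaned, allow_nan=False)
```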

##### Creating a list of JSON items

In [51]:
items = jsonable_df.iloc[:10].T.to_dict()

In [52]:
items

{0: {'timestamp': '2017-01-01T12:00:23',
  'username': 'michaelsmith',
  'temperature': 12,
  'heartrate': 67,
  'build': '4e6a7805-8faa-2768-6ef6-eb3198b483ac',
  'latest': 0,
  'note': 'interval'},
 1: {'timestamp': '2017-01-01T12:01:09',
  'username': 'kharrison',
  'temperature': 6,
  'heartrate': 78,
  'build': '7256b7b0-e502-f576-62ec-ed73533c9c84',
  'latest': 0,
  'note': 'wake'},
 2: {'timestamp': '2017-01-01T12:01:34',
  'username': 'smithadam',
  'temperature': 5,
  'heartrate': 89,
  'build': '9226c94b-bb4b-a6c8-8e02-cb42b53e9c90',
  'latest': 0,
  'note': ''},
 3: {'timestamp': '2017-01-01T12:02:09',
  'username': 'eddierodriguez',
  'temperature': 28,
  'heartrate': 76,
  'build': '2599ac79-e5e0-5117-b8e1-57e5ced036f7',
  'latest': 0,
  'note': 'update'},
 4: {'timestamp': '2017-01-01T12:02:36',
  'username': 'kenneth94',
  'temperature': 29,
  'heartrate': 62,
  'build': '122f1c6a-403c-2221-6ed1-b5caa08f11e0',
  'latest': 0,
  'note': 'user'},
 5: {'timestamp': '2017-01-

In [53]:
item_list = list(items.values())

In [54]:
actions = [
    {
        "name": "pseudonymize-username",
        "transform-value" : {
            "key": "username",
            "pseudonymize" : {
                "method": "merengue",
                "key": "supa_secret", 
            }
        }
    }
]

In [56]:
import json
data = requests.post(
    'https://api.kiprotect.com/v1/transform', 
    data = json.dumps(
        {"actions": actions, "items": item_list}, 
        allow_nan=False),
    headers = {'Authorization': 'Bearer {}'.format(
        SHARED_KEY)}
)

In [57]:
data.json()

{'items': [{'_kip': '64656661756c74',
   'build': '4e6a7805-8faa-2768-6ef6-eb3198b483ac',
   'heartrate': 67,
   'latest': 0,
   'note': 'interval',
   'temperature': 12,
   'timestamp': '2017-01-01T12:00:23',
   'username': '588vluH4QrwB+7Kn'},
  {'_kip': '64656661756c74',
   'build': '7256b7b0-e502-f576-62ec-ed73533c9c84',
   'heartrate': 78,
   'latest': 0,
   'note': 'wake',
   'temperature': 6,
   'timestamp': '2017-01-01T12:01:09',
   'username': 'UeaqfY0ffyHi'},
  {'_kip': '64656661756c74',
   'build': '9226c94b-bb4b-a6c8-8e02-cb42b53e9c90',
   'heartrate': 89,
   'latest': 0,
   'note': '',
   'temperature': 5,
   'timestamp': '2017-01-01T12:01:34',
   'username': '2aVk3sbIkBe6'},
  {'_kip': '64656661756c74',
   'build': '2599ac79-e5e0-5117-b8e1-57e5ced036f7',
   'heartrate': 76,
   'latest': 0,
   'note': 'update',
   'temperature': 28,
   'timestamp': '2017-01-01T12:02:09',
   'username': 'jwDBYBiKqj9aRT/RiDo='},
  {'_kip': '64656661756c74',
   'build': '122f1c6a-403c-2221-6e

In [58]:
depseudonymize_actions = [
    {
        "name": "encode-username",
        "transform-value": {
            "key": "username",
            "decode": {
                "format": "base64"
            }
        }
    },
    {
        "name": "depseudonymize-username",
        "transform-value": {
            "key": "username",
            "depseudonymize": {
                "method": "merengue",
                "key": "supa_secret"
            }
        }
    },
    {
        "name": "encode-username",
        "transform-value": {
            "key": "username",
            "encode": {
                "format": "utf-8"
            }
        }
    }
]

In [59]:
depseudonymized_data = requests.post(
    'https://api.kiprotect.com/v1/transform', 
    json = {'actions': depseudonymize_actions, 
            'items': data.json()['items']},
    headers = {'Authorization': 'Bearer {}'.format(
        SHARED_KEY)}
)

In [60]:
depseudonymized_data.json()

{'items': [{'_kip': '64656661756c74',
   'build': '4e6a7805-8faa-2768-6ef6-eb3198b483ac',
   'heartrate': 67,
   'latest': 0,
   'note': 'interval',
   'temperature': 12,
   'timestamp': '2017-01-01T12:00:23',
   'username': 'michaelsmith'},
  {'_kip': '64656661756c74',
   'build': '7256b7b0-e502-f576-62ec-ed73533c9c84',
   'heartrate': 78,
   'latest': 0,
   'note': 'wake',
   'temperature': 6,
   'timestamp': '2017-01-01T12:01:09',
   'username': 'kharrison'},
  {'_kip': '64656661756c74',
   'build': '9226c94b-bb4b-a6c8-8e02-cb42b53e9c90',
   'heartrate': 89,
   'latest': 0,
   'note': '',
   'temperature': 5,
   'timestamp': '2017-01-01T12:01:34',
   'username': 'smithadam'},
  {'_kip': '64656661756c74',
   'build': '2599ac79-e5e0-5117-b8e1-57e5ced036f7',
   'heartrate': 76,
   'latest': 0,
   'note': 'update',
   'temperature': 28,
   'timestamp': '2017-01-01T12:02:09',
   'username': 'eddierodriguez'},
  {'_kip': '64656661756c74',
   'build': '122f1c6a-403c-2221-6ed1-b5caa08f11e0'

#### Challenge

Create a function which takes a dataframe and uses the pseudonymization API to pseudonymize a selected subset of columns.

It returns a new dataframe with the pseudonymized data.

NOTE: Test using just 1k rows so you don't abuse the server please :)
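One piece of this challenge is generating an action per column. The sketch below only builds the action list, following the shape of the `actions` example above; it assumes the API accepts several actions in one request, as the three-step depseudonymize call suggests. `build_pseudonymize_actions` is a hypothetical helper name, and the sketch makes no network call.

```python
def build_pseudonymize_actions(columns, key, method="merengue"):
    """Build one pseudonymize action per column name."""
    return [
        {
            "name": "pseudonymize-{}".format(column),
            "transform-value": {
                "key": column,
                "pseudonymize": {"method": method, "key": key},
            },
        }
        for column in columns
    ]


column_actions = build_pseudonymize_actions(["username", "note"], "supa_secret")
```

The remaining steps would mirror the cells above: convert the rows to a list of dicts, post them with the actions, and build a new dataframe from the returned `items`.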

In [None]:
# %load ../solutions/pseudonymize_columns.py


In [None]:
ps_df = pseudonymize_columns(jsonable_df[:1000], 
                             ['username', 'note'])
ps_df.head()