### 01. Pseudonymization

In this notebook, we'll explore pseudonymization methods such as hashing, masking and format-preserving encryption.

For more reading on the topic, please see: 

- [Medium (Alex Ewerlöf): Anonymization vs. Pseudonymization](https://medium.com/@alexewerlof/gdpr-pseudonymization-techniques-62f7b3b46a56)
- [KIProtect: GDPR for Data Science](https://kiprotect.com/blog/gdpr_for_data_science.html)
- [IAPP: Anonymization and Pseudonymization Compared in relation to GDPR compliance](https://iapp.org/media/pdf/resource_center/PA_WP2-Anonymous-pseudonymous-comparison.pdf)

In [2]:
import base64
from hashlib import blake2b

import pandas as pd
import json
import requests

from faker import Faker
from ff3 import FF3Cipher

#### Precheck: What is our data? 
- What information is contained in our data?
- What privacy concerns are there?
- How should we proceed?

In [3]:
df = pd.read_csv('data/iot_example.csv')

In [4]:
df.head()

             timestamp        username  temperature  heartrate  \
0  2017-01-01T12:00:23    michaelsmith           12         67   
1  2017-01-01T12:01:09       kharrison            6         78   
2  2017-01-01T12:01:34       smithadam            5         89   
3  2017-01-01T12:02:09  eddierodriguez           28         76   
4  2017-01-01T12:02:36       kenneth94           29         62   

                                  build  latest      note  
0  4e6a7805-8faa-2768-6ef6-eb3198b483ac       0  interval  
1  7256b7b0-e502-f576-62ec-ed73533c9c84       0      wake  
2  9226c94b-bb4b-a6c8-8e02-cb42b53e9c90       0       NaN  
3  2599ac79-e5e0-5117-b8e1-57e5ced036f7       0    update  
4  122f1c6a-403c-2221-6ed1-b5caa08f11e0       0      user  

#### Section One: Hashing

- Applying the blake2b hash
- Allowing for de-pseudonymization
- Creating a reusable method for hashing

In [5]:
username = df.iloc[0,1]

In [6]:
username

'michaelsmith'

In [8]:
hasher = blake2b()
hasher.update(username.encode('utf-8'))
hasher.hexdigest()

'a2a858011c0917154cdf8edce30d399e37df5f13217fa6d2959e453dd5245eb73a0787f0784d0c1969df51a48dc5a6664a59b724e33962be6ed4a9f0424ecb43'

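Oops. Without the `.encode('utf-8')` call, `hasher.update(username)` raises a `TypeError`, because `hashlib` only accepts bytes. What went wrong? How can we fix it?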

In [None]:
# %load solutions/proper_encoding.py


Great! Now we have a hash. Michael is safe! (or [is he?](https://nakedsecurity.sophos.com/2014/06/24/new-york-city-makes-a-hash-of-taxi-driver-data-disclosure/))
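To make the "or is he?" concrete, here is a minimal sketch of a dictionary attack on an unkeyed hash, plus a keyed variant using blake2b's `key` parameter. The helper names, the candidate list and the key value are made up for illustration.

```python
from hashlib import blake2b

def plain_hash(username: str) -> str:
    """Unkeyed blake2b digest of a username."""
    return blake2b(username.encode('utf-8')).hexdigest()

def keyed_hash(username: str, key: bytes) -> str:
    """Keyed blake2b digest; blake2b accepts keys of up to 64 bytes."""
    return blake2b(username.encode('utf-8'), key=key).hexdigest()

# Hypothetical list of usernames an attacker might guess.
candidates = ['kharrison', 'michaelsmith', 'smithadam']

leaked = plain_hash('michaelsmith')
# Dictionary attack: hash every guess and compare it with the leaked digest.
recovered = [name for name in candidates if plain_hash(name) == leaked]
print(recovered)  # ['michaelsmith']

# With a secret key, the attacker's unkeyed guesses no longer match.
secret_key = b'replace-with-a-real-secret'  # placeholder, not a real key
protected = keyed_hash('michaelsmith', secret_key)
print(any(plain_hash(name) == protected for name in candidates))  # False
```

The keyed digest is only as safe as the key: anyone who obtains it can rerun the same dictionary attack.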

But... what if we need to later determine that michaelsmith is a2a858011c091715....

In [None]:
# In a live notebook, type `hasher.` and press Tab: no method turns a digest back into its input.
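Since the hash itself cannot be reversed, one common workaround is to keep a lookup table from digest to original value. The sketch below assumes a plain in-memory dict; `pseudonymize` and `lookup_table` are hypothetical names, and in practice the table would need to be protected as carefully as the raw data.

```python
from hashlib import blake2b

def pseudonymize(username: str, lookup: dict) -> str:
    """Hash a username and remember which username produced the digest."""
    digest = blake2b(username.encode('utf-8')).hexdigest()
    lookup[digest] = username
    return digest

lookup_table = {}  # digest -> original username
digest = pseudonymize('michaelsmith', lookup_table)

# De-pseudonymize by looking the digest up, not by inverting the hash.
print(lookup_table[digest])  # michaelsmith
```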

Okay, let's try something that we can reverse...

In [9]:
# From https://stackoverflow.com/questions/2490334/simple-way-to-encode-a-string-according-to-a-password

def encode(key, clear):
    """Shift each character of `clear` by the matching key character, then base64-encode."""
    enc = []
    for i in range(len(clear)):
        key_c = key[i % len(key)]  # repeat the key when it is shorter than the text
        #print(key_c)
        enc_c = (ord(clear[i]) + ord(key_c)) % 256
        #print(enc_c)
        enc.append(enc_c)
    return base64.urlsafe_b64encode(bytes(enc))

def decode(key, enc):
    """Reverse `encode`: base64-decode, then subtract the matching key characters."""
    dec = []
    enc = base64.urlsafe_b64decode(enc)
    for i in range(len(enc)):
        key_c = key[i % len(key)]
        dec_c = chr((256 + enc[i] - ord(key_c)) % 256)
        dec.append(dec_c)
    return "".join(dec)

In [10]:
encode('supa_secret', username)

b'4N7TycDY0dbfzujb'

In [11]:
decode('supa_secret', b'4N7TycDY0dbfzujb')

'michaelsmith'

#### Challenge

- Can you come up with another string which will properly decode the secret which is *not* the same as the original key?
- Hint: Take a look at the encode method and use the print statements for a clue.

In [13]:
decode('supa_secrets_for_yoooooou', b'4N7TycDY0dbfzujb')

'michaelsmith'

In [None]:
# %load solutions/lockpick.py


Welp. Maybe that is not so great...
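It gets worse: because each output byte is just the plaintext character plus a key character, anyone who knows one plaintext and its ciphertext can subtract them to get the key stream. A minimal sketch, where `recover_key` is a hypothetical helper and the ciphertext is the one produced in the cell above:

```python
import base64

def recover_key(ciphertext: bytes, known_plaintext: str) -> str:
    """Subtract the known plaintext from the ciphertext to expose the key stream."""
    raw = base64.urlsafe_b64decode(ciphertext)
    return ''.join(chr((c - ord(p)) % 256) for c, p in zip(raw, known_plaintext))

# The 12-character plaintext exposes the first 12 characters of the repeated key.
print(recover_key(b'4N7TycDY0dbfzujb', 'michaelsmith'))  # supa_secrets
```

The trailing `s` is the key wrapping around to its first character, which is also why a longer key that starts with `supa_secrets` decodes the message.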

#### Section Two: Data Masking and Tokenization

- What should we mask?
- How?
- What do we do if we need realistic values?

In [14]:
df.sample(2)

                  timestamp       username  temperature  heartrate  \
145466  2017-02-28T15:08:20  roywashington            8         74   
94557   2017-02-08T07:24:00         cjones           27         83   

                                       build  latest   note  
145466  f5e6f75e-2f97-8377-a240-622fb6eb5f90       0  sleep  
94557   199c2ff1-5e02-d552-509a-f0ec1b5be036       1   user  

In [15]:
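# Note: DataFrame.applymap is deprecated since pandas 2.1; newer versions offer DataFrame.map instead.
super_masked = df.applymap(lambda x: 'NOPE')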


In [16]:
super_masked.head()

  timestamp username temperature heartrate build latest  note
0      NOPE     NOPE        NOPE      NOPE  NOPE   NOPE  NOPE
1      NOPE     NOPE        NOPE      NOPE  NOPE   NOPE  NOPE
2      NOPE     NOPE        NOPE      NOPE  NOPE   NOPE  NOPE
3      NOPE     NOPE        NOPE      NOPE  NOPE   NOPE  NOPE
4      NOPE     NOPE        NOPE      NOPE  NOPE   NOPE  NOPE

😜

Okay, no more jokes. But masking usually is just that: replacing your sensitive data with some sort of placeholder representation.
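A slightly less drastic mask keeps part of the value readable. Here is a minimal sketch; `mask` and its parameters are made up for illustration, not part of any library.

```python
def mask(value: str, keep: int = 1, mask_char: str = '*') -> str:
    """Keep the first `keep` characters and replace the rest with `mask_char`."""
    return value[:keep] + mask_char * max(len(value) - keep, 0)

print(mask('michaelsmith'))        # m***********
print(mask('kharrison', keep=2))   # kh*******
```

Note that the masked value still leaks the length and the first characters, which may be enough to re-identify rare usernames.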

But instead, we could also tokenize the data. This means replacing it with random, fictitious data. How do we tokenize this?

In [17]:
fakes = Faker()

In [18]:
fakes.name()

'Mr. Ronnie Gonzalez'

In [19]:
# In a live notebook, type `fakes.` and press Tab to browse the available providers.

In [20]:
fakes.user_name()

'tinabell'

#### Challenge

Make a new column `pseudonym` which masks the data using the faker `user_name` method.

In [21]:
df['pseudonym'] = df['username'].map(lambda x: fakes.user_name())
df['pseudonym'].head()

0    sararodriguez
1         angela65
2    gregorygreene
3      oharrington
4       amandaruiz
Name: pseudonym, dtype: object

In [None]:
# %load solutions/masked_pseudonym.py



Whaaaa!?!? Pretty cool, eh? 

(In case you want to read up on [how it works](https://github.com/joke2k/faker/blob/06d323f6cff95103d4ccda03f5d4ab2c45334e46/faker/providers/internet/__init__.py#L162))

But... we can't reverse it. It is tuned per locale (usually using probabilities based on names in that locale). That said, it works fabulously for test data!
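Also note that the `map` call above draws a fresh fake name for every row, so the same username gets a different pseudonym each time it appears. If you need consistent and reversible tokens, one option is a small token vault. This is a sketch under assumptions: `TokenVault` is a hypothetical helper, and it uses random hex tokens from the standard library as a stand-in for Faker output.

```python
import secrets

class TokenVault:
    """Hypothetical helper: maps each value to one random token and back."""

    def __init__(self):
        self._to_token = {}   # original value -> token
        self._to_value = {}   # token -> original value

    def tokenize(self, value: str) -> str:
        """Return the existing token for `value`, or create a new one."""
        if value not in self._to_token:
            token = secrets.token_hex(8)
            while token in self._to_value:  # retry on the unlikely collision
                token = secrets.token_hex(8)
            self._to_token[value] = token
            self._to_value[token] = value
        return self._to_token[value]

    def detokenize(self, token: str) -> str:
        """Look up the original value for a token."""
        return self._to_value[token]

vault = TokenVault()
first = vault.tokenize('michaelsmith')
second = vault.tokenize('michaelsmith')
print(first == second)           # True: same input, same token
print(vault.detokenize(first))   # michaelsmith
```

With a pandas column this could be applied as `df['username'].map(vault.tokenize)`. The vault itself then becomes the sensitive artifact to protect.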

#### Section Three: Format-Preserving Encryption

In [34]:
key = "2DE79D232DF5585D68CE47882AE256D6"
tweak = "CBD09280979564"

c6 = FF3Cipher.withCustomAlphabet(key, tweak, "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_")

plaintext = "Maxx"
ciphertext = c6.encrypt(plaintext)

ciphertext

'RV1k'

In [35]:
decrypted = c6.decrypt(ciphertext)
decrypted

'Maxx'

In [24]:
df['username'] = df['username'].map(c6.encrypt)

ValueError: message length 3 is not within min 4 and max 32 bounds

Oh no! What does this message mean and how can we fix it?
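As a hint, the bounds in the message depend on the alphabet size (the radix). The sketch below assumes the library follows the NIST SP 800-38G style limits for FF3: the domain must hold at least 1,000,000 values, and the maximum length is twice the largest power of the radix that fits in 96 bits. `ff3_length_bounds` is a hypothetical helper, not part of the `ff3` package.

```python
def ff3_length_bounds(radix: int, domain_min: int = 1_000_000) -> tuple:
    """Return (min_len, max_len) for an alphabet of `radix` symbols."""
    # Smallest length whose domain radix**length reaches domain_min.
    min_len = 1
    while radix ** min_len < domain_min:
        min_len += 1
    # Largest exponent with radix**exponent <= 2**96, doubled.
    half = 0
    while radix ** (half + 1) <= 2 ** 96:
        half += 1
    return min_len, 2 * half

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_"
print(len(alphabet))                      # 63
print(ff3_length_bounds(len(alphabet)))   # (4, 32)
print(ff3_length_bounds(10))              # (6, 56)
```

Under these assumptions, a 3-character username only gives 63³ = 250,047 possible values, which is below the minimum domain size.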

In [None]:
# %load solutions/pad_text.py


In [30]:
def add_padding_and_encrypt(cipher, username):
    if len(username) < 4:
        username += "X" * (4-len(username))
    return cipher.encrypt(username)

In [31]:
df['username'] = df['username'].map(lambda x: add_padding_and_encrypt(c6, x))

In [32]:
df['username']

0           ApnEF6fyR7tq
1              V5ldensfP
2              HhiBW6QRh
3         bVeuUbJeN7j535
4              CqS57OoWG
               ...      
146392        AzNtqL04LA
146393         hGxuy3Ar5
146394            _KlNg_
146395        r3hL863cdr
146396           kMLWk8b
Name: username, Length: 146397, dtype: object

#### Questions

- What would happen if someone found our key?
- What happens if a username ends in X?
- What properties do we need in our data in order to maintain encryption-level security?
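To explore the padding question, here is a small sketch that leaves the cipher out entirely. It shows that padding with `X` makes different usernames collide, and one possible workaround: storing the original length next to the value. `pad` and `unpad` are hypothetical helpers, and keeping the length does reveal a little extra information.

```python
def pad(username: str, min_len: int = 4, pad_char: str = 'X'):
    """Pad to `min_len` and return the original length alongside the result."""
    return username.ljust(min_len, pad_char), len(username)

def unpad(padded: str, original_len: int) -> str:
    """Drop the padding again using the stored original length."""
    return padded[:original_len]

# Three different usernames collapse to the same padded string.
for name in ['bo', 'boX', 'boXX']:
    print(name, '->', pad(name)[0])   # every line ends in boXX

# The stored length tells them apart again.
padded, original_len = pad('boX')
print(unpad(padded, original_len))    # boX
```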

#### Additional Challenge

How would we build our own format-preserving encryption?

In [None]:
num_cipher = FF3Cipher.withCustomAlphabet(key, tweak, "0123456789")

In [None]:
example = "2017-01-01T12:00:23"

In [None]:
enc_date = num_cipher.encrypt(example.replace("T","").replace(":","").replace("-",""))

In [None]:
enc_ts = f"{enc_date[:4]}-{enc_date[4:6]}-{enc_date[6:8]}T{enc_date[8:10]}:{enc_date[10:12]}:{enc_date[12:14]}"

In [None]:
enc_ts
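The strip-and-reformat steps above can be generalized so they work for any layout, not just this timestamp. In this sketch `split_digits` and `restore_format` are hypothetical helpers, and reversing the digits is only a stand-in for `num_cipher.encrypt`, so the block runs without a key.

```python
def split_digits(text: str):
    """Return the digits of `text` and the (index, char) of every non-digit."""
    digits = ''.join(ch for ch in text if ch.isdigit())
    separators = [(i, ch) for i, ch in enumerate(text) if not ch.isdigit()]
    return digits, separators

def restore_format(digits: str, separators) -> str:
    """Put the separators back at their original positions."""
    chars = list(digits)
    for index, ch in separators:  # ascending order keeps the positions valid
        chars.insert(index, ch)
    return ''.join(chars)

example = "2017-01-01T12:00:23"
digits, separators = split_digits(example)

# Stand-in for num_cipher.encrypt: same length, same alphabet, different digits.
fake_encrypted = digits[::-1]
print(restore_format(fake_encrypted, separators))  # 3200-21-10T10:71:02
```

The result keeps the shape of a timestamp but is not a valid date (month 21, minute 71). Real cipher output can look like that too, which matters if downstream code parses the column.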

#### Homework Challenge

Create a function that applies format-preserving encryption to another column in the data.

Return a new dataframe containing just the pseudonymized data.