# Anonymize

See documentation for more details: https://inforcehub.readthedocs.io/en/latest/modules/anon.html

In [1]:
# Ignore this cell - this is just used to make this example notebook work
import sys
sys.path.append("../") 

## Load demo product data that we will anonymize

In [2]:
import pandas as pd

FILENAME = 'data/product_data.csv'
df = pd.read_csv(FILENAME)
df

Unnamed: 0,Contract number,Product,Status,First name,Last name,Age,DOB,Start date,Distributor code,Sum Assured,In-force premium
0,33463634,Mousetrap,Active,John,McGee,45,2-Nov-1954,1-Jan-2013,3456,100000,150.3
1,223422342,Mousetrap,Active,Jane,Morrison,67,4-Jan-1967,1-Jan-2013,355,200000,204.0
2,4464646,Mousetrap Pro,Active,Bruce,Lane,62,1-Jan-1943,1-Jan-2013,5757,500000,950.0
3,4234335,Mousetrap Pro,Paid-up,Betty,Brown,45,12-Dec-1986,1-Jan-2013,355,550000,0.0
4,33535363,Mousetrap Pro,Lapsed,Arnold,Brown,53,4-Apr-1972,1-Jan-2013,355,400000,0.0


## Import the module and initialize

In [3]:
from inforcehub import Anonymize

Normally you create a new `Anonymize` object with no parameters.
This is more secure, because the salt (passphrase) is then generated at random.

In [4]:
anon = Anonymize()

# Check the salt (passphrase) initialized at random
anon.salt

b'$2b$12$d1TUfhDWi4/a9TpXLHS5z.'

Alternatively, pass a passphrase of your own as the `salt` parameter if you need to reproduce exactly the same encryption in the future.

In [5]:
anon_reproduce = Anonymize(salt='A_LONG_PASSWORD_THING')
anon_reproduce.salt

b'A_LONG_PASSWORD_THING'
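
The effect of a fixed versus a random salt can be illustrated with a minimal sketch that uses only the Python standard library. It does not reproduce how `inforcehub` computes its values: `pseudonymize` is a hypothetical helper, and the keyed BLAKE2b hash is an assumption chosen purely for illustration.

```python
import hashlib
import secrets


def pseudonymize(value, salt: bytes) -> str:
    """Hypothetical helper (not part of inforcehub): keyed hash of str(value)."""
    digest = hashlib.blake2b(str(value).encode("utf-8"), key=salt, digest_size=16)
    return digest.hexdigest()  # 16 bytes -> 32 hexadecimal characters


fixed_salt = b"A_LONG_PASSWORD_THING"   # chosen by hand, so it can be reused later
random_salt = secrets.token_bytes(16)   # fresh random salt, different on every run

first_run = pseudonymize("Brown", fixed_salt)
second_run = pseudonymize("Brown", fixed_salt)
other_salt = pseudonymize("Brown", random_salt)

print(first_run == second_run)  # same salt: the value maps to the same token
print(first_run == other_salt)  # different salt: the token changes
```

The trade-off is the one described above: a reusable salt makes results reproducible across runs, while a random salt that is never stored makes it harder to recompute the mapping later.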

## Anonymize the dataset

First list the columns that need to be encrypted.

In [6]:
to_transform = ['Contract number', 'First name', 'Last name', 'DOB']

Then use the `transform()` method to anonymize the dataframe.

**Note** that `transform()` encrypts the dataframe you pass to it in place. It does not 
take a copy, so that memory use stays low. If you want to keep the original, take a copy 
first with `original_df = df.copy(deep=True)`.
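
The copy-before-transform pattern from the note can be sketched as follows. `mask_in_place` is a hypothetical stand-in for any function that overwrites columns of the dataframe it receives; it is not the `inforcehub` implementation, and the two-row dataframe is made up for the example.

```python
import pandas as pd


def mask_in_place(frame: pd.DataFrame, columns: list) -> None:
    """Hypothetical stand-in for an in-place transform: overwrite the listed columns."""
    for column in columns:
        frame[column] = "***"


df = pd.DataFrame({"First name": ["John", "Jane"], "Age": [45, 67]})

# Take a deep copy first so the readable values survive the in-place change
original_df = df.copy(deep=True)

mask_in_place(df, ["First name"])

print(df["First name"].tolist())           # masked values
print(original_df["First name"].tolist())  # untouched originals
```

Keeping `original_df` roughly doubles the memory held for the data, which is the cost the in-place design avoids by default.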

Setting `verbose=True` prints the status as the encryption runs.

In [7]:
anon.transform(df, to_transform, verbose=True)
df

Will convert columns: Contract number, First name, Last name, DOB
Encrypting 5 rows per column ...

Finished encrypting column Contract number
Finished encrypting column First name
Finished encrypting column Last name
Finished encrypting column DOB


Unnamed: 0,Contract number,Product,Status,First name,Last name,Age,DOB,Start date,Distributor code,Sum Assured,In-force premium
0,d772b2ab4b1aa26375bc98c4192eb090,Mousetrap,Active,98f93e36a7684e30de9d239c10a4fad8,4cc2da2225a1e4a77bcc880dfb7e6ed8,45,20855ef826f425a348248ce5e8527423,1-Jan-2013,3456,100000,150.3
1,2aac32e4fcc94db4fbeeb065cad7771c,Mousetrap,Active,48e03092929b3ec415a43effda854534,2c5b703c38a565a5deeeabb583e256e0,67,4762743b3050e52651df242edc1b7015,1-Jan-2013,355,200000,204.0
2,d87ee3b9f8be8d9ae75f22a4d2777b68,Mousetrap Pro,Active,dc98217af113adadb8fc8dcd6b9bca3a,b7e484bf36156ccd220376db4ce03cb2,62,8847c1ff7566c075cf864fa89f16e0e8,1-Jan-2013,5757,500000,950.0
3,30416905489a59b2d4144d7967f65386,Mousetrap Pro,Paid-up,8f198f0d95a9741fc802bd8f9fae2210,58b5cb9dbcec378955d2218617d6c3d3,45,ba7c8011d73db8a7a64854d67130fc51,1-Jan-2013,355,550000,0.0
4,b62470b629ef90d03eb9f71b17795da3,Mousetrap Pro,Lapsed,a819ca6223420bb840e84e4692fe9a4b,58b5cb9dbcec378955d2218617d6c3d3,53,e10f14b1baba262cb8b13d8392444614,1-Jan-2013,355,400000,0.0


## Pseudo-anonymize the dataset

The process is the same, except that you also collect the **lookup keys** returned by the `transform()` method.

In [8]:
# Reset our dataframe from our file
df = pd.read_csv(FILENAME)

In [9]:
# Collect the lookup keys
lookup_df = anon.transform(df, to_transform, verbose=True)

Will convert columns: Contract number, First name, Last name, DOB
Encrypting 5 rows per column ...

Finished encrypting column Contract number
Finished encrypting column First name
Finished encrypting column Last name
Finished encrypting column DOB


The lookup keys are a dataframe of column pairs: each original column holds the unencrypted values, and the matching column with a trailing underscore holds the encrypted values.

In [10]:
lookup_df

Unnamed: 0,Contract number,Contract number_,First name,First name_,Last name,Last name_,DOB,DOB_
0,33463634,d772b2ab4b1aa26375bc98c4192eb090,John,98f93e36a7684e30de9d239c10a4fad8,McGee,4cc2da2225a1e4a77bcc880dfb7e6ed8,2-Nov-1954,20855ef826f425a348248ce5e8527423
1,223422342,2aac32e4fcc94db4fbeeb065cad7771c,Jane,48e03092929b3ec415a43effda854534,Morrison,2c5b703c38a565a5deeeabb583e256e0,4-Jan-1967,4762743b3050e52651df242edc1b7015
2,4464646,d87ee3b9f8be8d9ae75f22a4d2777b68,Bruce,dc98217af113adadb8fc8dcd6b9bca3a,Lane,b7e484bf36156ccd220376db4ce03cb2,1-Jan-1943,8847c1ff7566c075cf864fa89f16e0e8
3,4234335,30416905489a59b2d4144d7967f65386,Betty,8f198f0d95a9741fc802bd8f9fae2210,Brown,58b5cb9dbcec378955d2218617d6c3d3,12-Dec-1986,ba7c8011d73db8a7a64854d67130fc51
4,33535363,b62470b629ef90d03eb9f71b17795da3,Arnold,a819ca6223420bb840e84e4692fe9a4b,Brown,58b5cb9dbcec378955d2218617d6c3d3,4-Apr-1972,e10f14b1baba262cb8b13d8392444614
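
One way such a lookup table might be used is to map encrypted values back to the originals. The sketch below is an assumption about usage rather than a documented `inforcehub` feature: it rebuilds a two-row excerpt of the `First name` pair from the output above by hand, and the `anonymized` dataframe and the `First name (restored)` column are hypothetical names.

```python
import pandas as pd

# Two-row excerpt of the lookup keys shown above (original value, encrypted value)
lookup_df = pd.DataFrame({
    "First name": ["John", "Jane"],
    "First name_": [
        "98f93e36a7684e30de9d239c10a4fad8",
        "48e03092929b3ec415a43effda854534",
    ],
})

# Hypothetical anonymized data that needs to be traced back to individuals
anonymized = pd.DataFrame({
    "First name": [
        "48e03092929b3ec415a43effda854534",
        "98f93e36a7684e30de9d239c10a4fad8",
    ],
    "Age": [67, 45],
})

# Build an encrypted -> original mapping and apply it to the encrypted column
decode = dict(zip(lookup_df["First name_"], lookup_df["First name"]))
anonymized["First name (restored)"] = anonymized["First name"].map(decode)

print(anonymized)
```

Because the lookup keys allow this reversal, they need to be stored separately from the anonymized data and protected as carefully as the original file.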
