# Imports

As with (basically) every Python script, we start by importing libraries. Please especially note the hdbcli and hana_ml packages, which we use for interacting with the HANA DB and for pushing down the ML workloads, respectively. 
hdbcli and hana_ml are both available on PyPI, so you can simply get them using "pip install hdbcli" or "pip install hana_ml".

In [1]:
import pandas as pd
import requests
import io
import base64
import matplotlib.pyplot as plt

from hdbcli import dbapi
from hana_ml import dataframe
from hana_ml.algorithms.pal.trees import HybridGradientBoostingClassifier
from hana_ml.algorithms.pal.metrics import binary_classification_debriefing, auc, confusion_matrix

# 1. HANA-ML Connection

Add credentials for your HANA instance here. 
Please note that some of the features used in this sample notebook are only available with HANA Cloud, so you might want to get trial access for that if you run into errors.

In [2]:
hana_url = 'xxx'
port = 'xxx'
user = 'xxx'
password = 'xxx'

print("URL: " + hana_url)
print("User: " + user)

URL: ld4500.wdf.sap.corp
User: jupyter


In [3]:
conn = dataframe.ConnectionContext(hana_url, int(port), user, password, encrypt="true", sslValidateCertificate="false")
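
If you prefer not to keep credentials in the notebook itself, one option is to read them from environment variables. The following is only a sketch: the variable names (HANA_URL, HANA_PORT, HANA_USER, HANA_PASSWORD), the helper function, and the fallback port 443 are assumptions made for illustration and are not part of the original notebook.

```python
import os


def load_hana_credentials(env=None):
    """Read connection settings from environment variables.

    All variable names and the default port are hypothetical examples.
    """
    env = os.environ if env is None else env
    return {
        "address": env.get("HANA_URL", "xxx"),
        "port": int(env.get("HANA_PORT", "443")),  # assumed default port
        "user": env.get("HANA_USER", "xxx"),
        "password": env.get("HANA_PASSWORD", "xxx"),
    }


creds = load_hana_credentials()
# The values could then be passed on exactly as in the cell above, e.g.:
# conn = dataframe.ConnectionContext(creds["address"], creds["port"],
#                                    creds["user"], creds["password"],
#                                    encrypt="true", sslValidateCertificate="false")
```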

# 2. Load data from source & send to HANA

To enable this notebook to run standalone, we download the data set in the sections below and load it into the HANA database. 
We apply some changes to the column names to make sure they can easily be handled with SQL syntax (e.g. upper case names are easier to call). 

Apart from that, the data is unchanged and no further data preparation is applied, as we want to focus on the anonymization part.

## Load adult data set from source

In [4]:
trainset_url="https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
testset_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test"
trainset=requests.get(trainset_url).content
testset=requests.get(testset_url).content

In [5]:
train_data = pd.read_csv(io.StringIO(trainset.decode('utf-8')), header=None,)
test_data = pd.read_csv(io.StringIO(testset.decode('utf-8')), skiprows=1, header=None,)

In [6]:
train_data.columns=[x.upper().replace('-','_').replace('SEX','GENDER') for x in ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "salary"]]
test_data.columns=[x.upper().replace('-','_').replace('SEX','GENDER') for x in ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "salary"]]

In [7]:
test_data['SALARY'] = test_data['SALARY'].replace({r'\.':''}, regex=True)

In [8]:
train_data = train_data.replace({' ':''}, regex=True)
test_data = test_data.replace({' ':''}, regex=True)

In [9]:
train_data

Unnamed: 0,AGE,WORKCLASS,FNLWGT,EDUCATION,EDUCATION_NUM,MARITAL_STATUS,OCCUPATION,RELATIONSHIP,RACE,GENDER,CAPITAL_GAIN,CAPITAL_LOSS,HOURS_PER_WEEK,NATIVE_COUNTRY,SALARY
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


In [10]:
test_data

Unnamed: 0,AGE,WORKCLASS,FNLWGT,EDUCATION,EDUCATION_NUM,MARITAL_STATUS,OCCUPATION,RELATIONSHIP,RACE,GENDER,CAPITAL_GAIN,CAPITAL_LOSS,HOURS_PER_WEEK,NATIVE_COUNTRY,SALARY
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16276,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K
16277,64,?,321403,HS-grad,9,Widowed,?,Other-relative,Black,Male,0,0,40,United-States,<=50K
16278,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K
16279,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K


## Transform to HANA dataframe and load into HANA DB (using HANA-ML package)

Before running this cell, make sure to **change the schema name 'ANON_DEMO'** to a schema that exists in your system, or create it.

In [11]:
df_pal_train = dataframe.create_dataframe_from_pandas(conn, train_data, 'ADULT_DATA_TRAIN', force=True, replace=False).add_id('ID')
df_pal_train.save(('ANON_DEMO','ADULT_DATA_TRAIN'), force=True)
df_pal_test = dataframe.create_dataframe_from_pandas(conn, test_data, 'ADULT_DATA_TEST', force=True, replace=False).add_id('ID')
df_pal_test.save(('ANON_DEMO','ADULT_DATA_TEST'), force=True)

<hana_ml.dataframe.DataFrame at 0x7fc9d80236d0>

# 3. Create k-Anonymity View

## Set-up dbapi connection (raw python driver)

In [12]:
hana_connection = dbapi.connect(address=hana_url, port=port, user=user, password=password, encrypt='true', sslValidateCertificate="false")
cursor = hana_connection.cursor()

## Execute statements to create anonymized views for training and test data set

### Hierarchy Functions

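First, we’ll specify functions to group the numeric values into buckets. We use two separate functions because the value ranges of age, working hours and years of education differ. Note that the view definitions below reference these functions via "schema":"JUPYTER", so adjust that value if the functions are created in a different schema.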

In [13]:
cursor.execute("""
CREATE OR REPLACE FUNCTION HIERARCHYFUNCTION_NUM_2(value nvarchar(255), level integer)
RETURNS outValue nvarchar(255) AS
BEGIN
  DECLARE double_value double;
  DECLARE rangefrom integer;
  DECLARE defineValue integer default 5;
  DECLARE interval integer;
  IF (level > 0) THEN 
    interval := defineValue * power(2, level-1);
    rangefrom = floor(value/interval)*interval;
    outValue := '[' || rangefrom||'-'||rangefrom+interval-1|| ']'; 
  ELSE
    outValue := value;
  END IF;
END;
""")

True

In [14]:
cursor.execute("""
CREATE OR REPLACE FUNCTION HIERARCHYFUNCTION_SMALL_NUM(value varchar(255), level integer)
RETURNS outValue varchar(255)
AS
BEGIN
    DECLARE double_value double;
    DECLARE rangefrom integer;
    DECLARE interval integer;
    IF (level > 0) THEN 
        interval := power(2, level);
        rangefrom = floor(value/interval)*interval;
        outValue := '[' || rangefrom||'-'||rangefrom+power(2, level)-1|| ']'; 
    ELSE
        outValue := value;
    END IF;
END;
""")

True
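
To make the bucket arithmetic easier to follow, here is a small Python sketch that mirrors the two SQL functions above. The Python function names and the base_width parameter are made up for illustration; only the formulas are taken from the SQL.

```python
def bucket_num_2(value, level, base_width=5):
    """Mirror of HIERARCHYFUNCTION_NUM_2: bucket width is 5 * 2^(level-1)."""
    if level <= 0:
        return str(value)  # level 0 keeps the original value
    interval = base_width * 2 ** (level - 1)
    start = int(float(value)) // interval * interval
    return f"[{start}-{start + interval - 1}]"


def bucket_small_num(value, level):
    """Mirror of HIERARCHYFUNCTION_SMALL_NUM: bucket width is 2^level."""
    if level <= 0:
        return str(value)
    interval = 2 ** level
    start = int(float(value)) // interval * interval
    return f"[{start}-{start + interval - 1}]"


print(bucket_num_2(39, 1))      # [35-39]
print(bucket_num_2(39, 3))      # [20-39]
print(bucket_small_num(13, 2))  # [12-15]
```

The wider base width suits age and working hours, while the power-of-two buckets fit the much smaller range of EDUCATION_NUM.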

A sample view is created first to demonstrate potential problems with unparameterized anonymization.

Productive views are then created on both the training and the test data set with the following parameters:
- k: 10
- recoding: multi_dimensional_strict
- loss: 
    - on training: 2.5 % (for less information loss in the remaining data)
    - on test: 0 % (a non-zero loss would bias the comparison, as outliers would most likely be dropped during anonymization since they are harder to match)
- min/max hierarchy level: mostly set to force information loss (min: 1) while avoiding complete suppression, so that the columns remain usable
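
As a reminder of what the parameter k means, the sketch below checks whether a handful of rows is k-anonymous: every combination of quasi-identifier values has to occur at least k times. This is a conceptual check in plain Python, not the algorithm HANA runs, and the three example rows are invented for illustration.

```python
from collections import Counter


def smallest_group_size(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[col] for col in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    return smallest_group_size(rows, quasi_identifiers) >= k


# Invented example rows (not taken from the data set)
rows = [
    {"AGE": "[35-39]", "GENDER": "Male", "SALARY": "<=50K"},
    {"AGE": "[35-39]", "GENDER": "Male", "SALARY": ">50K"},
    {"AGE": "[50-54]", "GENDER": "Female", "SALARY": "<=50K"},
]

# The third row is alone in its group, so k=2 is not satisfied
print(is_k_anonymous(rows, ["AGE", "GENDER"], k=2))  # False
```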


### Sample views

For the very first try, we simply define the core view, with no special parameters.

In [15]:
try:
    cursor.execute("DROP VIEW ADULT_VIEW_SAMPLE_1")
except dbapi.Error:
    # The view does not exist yet on a first run, so there is nothing to drop
    pass

cursor.execute("""
CREATE VIEW ADULT_VIEW_SAMPLE_1 (ID, AGE, EDUCATION, MARITAL_STATUS, GENDER, HOURS_PER_WEEK, WORKCLASS, EDUCATION_NUM, OCCUPATION ,CAPITAL_GAIN, CAPITAL_LOSS, SALARY) AS
SELECT ID, TO_VARCHAR(AGE) AS "AGE", EDUCATION, MARITAL_STATUS, GENDER, TO_VARCHAR(HOURS_PER_WEEK) AS "HOURS_PER_WEEK", WORKCLASS, TO_VARCHAR(EDUCATION_NUM) AS EDUCATION_NUM, OCCUPATION, CAPITAL_GAIN, CAPITAL_LOSS, SALARY FROM ANON_DEMO.ADULT_DATA_TRAIN
WITH ANONYMIZATION (
ALGORITHM 'K-ANONYMITY' PARAMETERS '{"data_change_strategy": "restricted", "k":10}'
COLUMN ID PARAMETERS '{"is_sequence":true}'
COLUMN AGE PARAMETERS '{"is_quasi_identifier":true, "hierarchy":{"schema":"JUPYTER","function":"HIERARCHYFUNCTION_NUM_2","levels":5}}'
COLUMN EDUCATION PARAMETERS '{"is_quasi_identifier":true,"hierarchy":{"embedded":[["Preschool","Primary School","Primary education"],["1st-4th","Primary School","Primary education"],["5th-6th","Primary School","Primary education"],["7th-8th","High School","Secondary education"],["9th","High School","Secondary education"],["10th","High School","Secondary education"],["11th","High School","Secondary education"],["12th","High School","Secondary education"],["HS-grad","High School","Secondary education"],["Assoc-voc","Professional Education","Higher education"],["Assoc-acdm","Professional Education","Higher education"],["Prof-school","Professional Education","Higher education"],["Some-college","Undergraduate","Higher education"],["Bachelors","Undergraduate","Higher education"],["Masters","Graduate","Higher education"],["Doctorate","Graduate","Higher education"]]}}'
COLUMN MARITAL_STATUS PARAMETERS '{"is_quasi_identifier":true,"hierarchy":{"embedded":[["Married-civ-spouse","spouse present"],["Divorced","spouse not present"],["Never-married","spouse not present"],["Separated","spouse not present"],["Widowed","spouse not present"],["Married-spouse-absent","spouse not present"],["Married-AF-spouse","spouse present"]]}}'
COLUMN HOURS_PER_WEEK PARAMETERS '{"is_quasi_identifier":true, "hierarchy":{"schema":"JUPYTER","function":"HIERARCHYFUNCTION_NUM_2","levels":5}}'
COLUMN GENDER PARAMETERS '{"is_quasi_identifier":true,"hierarchy":{"embedded":[["Male"],["Female"]]}}'
COLUMN WORKCLASS PARAMETERS '{"is_quasi_identifier":true, "hierarchy":{"embedded": [["Private", "Private"],["Federal-gov","Government"],["Local-gov","Government"],["State-gov","Government"],["Self-emp-inc","Self-Employed"],["Self-emp-not-inc","Self-Employed"],["Never-worked","Other"],["Without-pay","Other"],["?","Other/Unknown"],["Unknown","Other/Unknown"]]}}'
COLUMN EDUCATION_NUM PARAMETERS '{"is_quasi_identifier":true, "hierarchy":{"schema":"JUPYTER","function":"HIERARCHYFUNCTION_SMALL_NUM","levels":5}}'
COLUMN OCCUPATION PARAMETERS '{"is_quasi_identifier":true, "hierarchy":{"embedded": [["Adm-clerical", "White-Collar"],["Craft-repair", "Blue-Collar"],["Exec-managerial", "White-Collar"],["Farming-fishing", "Blue-Collar"],["Handlers-cleaners", "Blue-Collar"],["Machine-op-inspct", "Blue-Collar"],["Other-service", "Service"],["Priv-house-serv", "Service"],["Prof-specialty", "Professional"],["Protective-serv", "Service"],["Tech-support", "Service"],["Transport-moving", "Blue-Collar"],["Armed-Forces", "Other/Unknown"],["?", "Other/Unknown"],["Sales", "Other/Unknown"]]}}'
)""")

cursor.execute("REFRESH VIEW ADULT_VIEW_SAMPLE_1 ANONYMIZATION")
cursor.execute("COMMIT")
dataframe.DataFrame(conn,'SELECT * FROM ADULT_VIEW_SAMPLE_1 LIMIT 10').collect()

Unnamed: 0,ID,AGE,EDUCATION,MARITAL_STATUS,GENDER,HOURS_PER_WEEK,WORKCLASS,EDUCATION_NUM,OCCUPATION,CAPITAL_GAIN,CAPITAL_LOSS,SALARY
0,1,*,Bachelors,*,Male,*,*,13,*,2174,0,<=50K
1,2,*,Bachelors,*,Male,*,*,13,*,0,0,<=50K
2,3,*,HS-grad,*,Male,*,*,9,*,0,0,<=50K
3,4,*,11th,*,Male,*,*,7,*,0,0,<=50K
4,5,*,Bachelors,*,Female,*,*,13,*,0,0,<=50K
5,6,*,Masters,*,Female,*,*,14,*,0,0,<=50K
6,7,*,9th,*,Female,*,*,5,*,0,0,<=50K
7,8,*,HS-grad,*,Male,*,*,9,*,0,0,>50K
8,9,*,Masters,*,Female,*,*,14,*,14084,0,>50K
9,10,*,Bachelors,*,Male,*,*,13,*,5178,0,>50K
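
The embedded hierarchies in the view definition can be read as lookup tables: each inner list starts with the original value, followed by its increasingly general replacements. The sketch below illustrates that reading in plain Python. It assumes that any level beyond the listed entries means full suppression to "*", which is consistent with the "*" values in the output above; the function name and the shortened hierarchy are illustrative only.

```python
# Shortened copy of the WORKCLASS hierarchy from the view definition
WORKCLASS_HIERARCHY = [
    ["Private", "Private"],
    ["Federal-gov", "Government"],
    ["Local-gov", "Government"],
    ["Self-emp-inc", "Self-Employed"],
]


def generalize(value, level, hierarchy):
    """Return the value at the requested hierarchy level.

    Assumption: levels beyond the listed entries are suppressed to "*".
    Values missing from the hierarchy raise a KeyError in this sketch.
    """
    lookup = {row[0]: row for row in hierarchy}
    row = lookup[value]
    return row[level] if level < len(row) else "*"


print(generalize("Federal-gov", 0, WORKCLASS_HIERARCHY))  # Federal-gov
print(generalize("Federal-gov", 1, WORKCLASS_HIERARCHY))  # Government
print(generalize("Federal-gov", 2, WORKCLASS_HIERARCHY))  # *
```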




Let’s give it a second shot with the “multi_dimensional_strict” recoding scheme, which has recently been introduced with HANA Cloud and allows each group to have an individual level of anonymization for each column.

In [16]:
try:
    cursor.execute("DROP VIEW ADULT_VIEW_SAMPLE_2")
except dbapi.Error:
    # The view does not exist yet on a first run, so there is nothing to drop
    pass

cursor.execute("""
CREATE VIEW ADULT_VIEW_SAMPLE_2 (ID, AGE, EDUCATION, MARITAL_STATUS, GENDER, HOURS_PER_WEEK, WORKCLASS, EDUCATION_NUM, OCCUPATION ,CAPITAL_GAIN, CAPITAL_LOSS, SALARY) AS
SELECT ID, TO_VARCHAR(AGE) AS "AGE", EDUCATION, MARITAL_STATUS, GENDER, TO_VARCHAR(HOURS_PER_WEEK) AS "HOURS_PER_WEEK", WORKCLASS, TO_VARCHAR(EDUCATION_NUM) AS EDUCATION_NUM, OCCUPATION, CAPITAL_GAIN, CAPITAL_LOSS, SALARY FROM ANON_DEMO.ADULT_DATA_TRAIN
WITH ANONYMIZATION (
ALGORITHM 'K-ANONYMITY' PARAMETERS '{"data_change_strategy": "restricted", "k":10, "recoding":"multi_dimensional_strict"}'
COLUMN ID PARAMETERS '{"is_sequence":true}'
COLUMN AGE PARAMETERS '{"is_quasi_identifier":true, "hierarchy":{"schema":"JUPYTER","function":"HIERARCHYFUNCTION_NUM_2","levels":5}}'
COLUMN EDUCATION PARAMETERS '{"is_quasi_identifier":true,"hierarchy":{"embedded":[["Preschool","Primary School","Primary education"],["1st-4th","Primary School","Primary education"],["5th-6th","Primary School","Primary education"],["7th-8th","High School","Secondary education"],["9th","High School","Secondary education"],["10th","High School","Secondary education"],["11th","High School","Secondary education"],["12th","High School","Secondary education"],["HS-grad","High School","Secondary education"],["Assoc-voc","Professional Education","Higher education"],["Assoc-acdm","Professional Education","Higher education"],["Prof-school","Professional Education","Higher education"],["Some-college","Undergraduate","Higher education"],["Bachelors","Undergraduate","Higher education"],["Masters","Graduate","Higher education"],["Doctorate","Graduate","Higher education"]]}}'
COLUMN MARITAL_STATUS PARAMETERS '{"is_quasi_identifier":true,"hierarchy":{"embedded":[["Married-civ-spouse","spouse present"],["Divorced","spouse not present"],["Never-married","spouse not present"],["Separated","spouse not present"],["Widowed","spouse not present"],["Married-spouse-absent","spouse not present"],["Married-AF-spouse","spouse present"]]}}'
COLUMN HOURS_PER_WEEK PARAMETERS '{"is_quasi_identifier":true, "hierarchy":{"schema":"JUPYTER","function":"HIERARCHYFUNCTION_NUM_2","levels":5}}'
COLUMN GENDER PARAMETERS '{"is_quasi_identifier":true,"hierarchy":{"embedded":[["Male"],["Female"]]}}'
COLUMN WORKCLASS PARAMETERS '{"is_quasi_identifier":true, "hierarchy":{"embedded": [["Private", "Private"],["Federal-gov","Government"],["Local-gov","Government"],["State-gov","Government"],["Self-emp-inc","Self-Employed"],["Self-emp-not-inc","Self-Employed"],["Never-worked","Other"],["Without-pay","Other"],["?","Other/Unknown"],["Unknown","Other/Unknown"]]}}'
COLUMN EDUCATION_NUM PARAMETERS '{"is_quasi_identifier":true, "hierarchy":{"schema":"JUPYTER","function":"HIERARCHYFUNCTION_SMALL_NUM","levels":5}}'
COLUMN OCCUPATION PARAMETERS '{"is_quasi_identifier":true, "hierarchy":{"embedded": [["Adm-clerical", "White-Collar"],["Craft-repair", "Blue-Collar"],["Exec-managerial", "White-Collar"],["Farming-fishing", "Blue-Collar"],["Handlers-cleaners", "Blue-Collar"],["Machine-op-inspct", "Blue-Collar"],["Other-service", "Service"],["Priv-house-serv", "Service"],["Prof-specialty", "Professional"],["Protective-serv", "Service"],["Tech-support", "Service"],["Transport-moving", "Blue-Collar"],["Armed-Forces", "Other/Unknown"],["?", "Other/Unknown"],["Sales", "Other/Unknown"]]}}'
)""")

cursor.execute("REFRESH VIEW ADULT_VIEW_SAMPLE_2 ANONYMIZATION")
cursor.execute("COMMIT")
dataframe.DataFrame(conn,'SELECT * FROM ADULT_VIEW_SAMPLE_2 LIMIT 10').collect()

Unnamed: 0,ID,AGE,EDUCATION,MARITAL_STATUS,GENDER,HOURS_PER_WEEK,WORKCLASS,EDUCATION_NUM,OCCUPATION,CAPITAL_GAIN,CAPITAL_LOSS,SALARY
0,1,[35-39],Higher education,spouse not present,Male,[40-59],Government,[12-15],White-Collar,2174,0,<=50K
1,2,[40-59],Higher education,Married-civ-spouse,Male,[0-19],*,[12-15],*,0,0,<=50K
2,3,[35-39],HS-grad,spouse not present,Male,[40-59],*,9,Handlers-cleaners,0,0,<=50K
3,4,[50-54],11th,Married-civ-spouse,Male,[40-59],*,7,*,0,0,<=50K
4,5,[20-29],Bachelors,spouse present,Female,[40-59],Private,13,*,0,0,<=50K
5,6,[20-39],Masters,Married-civ-spouse,Female,[40-59],Private,14,*,0,0,<=50K
6,7,[40-59],9th,spouse not present,Female,*,*,5,Service,0,0,<=50K
7,8,[50-59],High School,Married-civ-spouse,Male,[40-59],Self-emp-not-inc,[8-9],Exec-managerial,0,0,>50K
8,9,[30-39],Masters,spouse not present,Female,[40-59],Private,14,Prof-specialty,14084,0,>50K
9,10,[40-44],Bachelors,Married-civ-spouse,Male,40,Private,13,Exec-managerial,5178,0,>50K
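
To compare the two sample views with a number rather than by eye, one rough indicator is the share of fully suppressed ("*") cells. The sketch below computes it for the first two rows of each output, restricted to four columns. The function is a hypothetical helper and not HANA's own loss metric.

```python
def suppression_rate(rows, columns):
    """Share of cells in the given columns that are fully suppressed ("*")."""
    cells = [row[col] for row in rows for col in columns]
    return sum(cell == "*" for cell in cells) / len(cells) if cells else 0.0


columns = ["AGE", "HOURS_PER_WEEK", "WORKCLASS", "OCCUPATION"]

# First two rows of ADULT_VIEW_SAMPLE_1, copied from the output further up
sample_1 = [
    {"AGE": "*", "HOURS_PER_WEEK": "*", "WORKCLASS": "*", "OCCUPATION": "*"},
    {"AGE": "*", "HOURS_PER_WEEK": "*", "WORKCLASS": "*", "OCCUPATION": "*"},
]
# First two rows of ADULT_VIEW_SAMPLE_2, copied from the output above
sample_2 = [
    {"AGE": "[35-39]", "HOURS_PER_WEEK": "[40-59]", "WORKCLASS": "Government", "OCCUPATION": "White-Collar"},
    {"AGE": "[40-59]", "HOURS_PER_WEEK": "[0-19]", "WORKCLASS": "*", "OCCUPATION": "*"},
]

print(suppression_rate(sample_1, columns))  # 1.0
print(suppression_rate(sample_2, columns))  # 0.25
```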


### Productive views

For the sake of this demo, we unfortunately need to make the information loss a little worse again, as we want to see the algorithm perform under this suboptimal condition. To do so, we define min and max values for the hierarchy level of some of the columns. That forces the algorithm to aggregate some columns even though it is not technically necessary.
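
The effect of these bounds can be pictured as clamping whatever level the algorithm would otherwise pick for a column. The sketch below shows only that clamping step; how HANA searches for suitable levels is not modelled, and the function name is made up for illustration.

```python
def clamp_level(proposed_level, min_level=0, max_level=None):
    """Restrict a proposed hierarchy level to the range [min_level, max_level]."""
    level = max(proposed_level, min_level)
    if max_level is not None:
        level = min(level, max_level)
    return level


# With "min":1 and "max":3, as used for AGE in the training view below,
# level 0 (the raw value) and levels above 3 are never used.
print(clamp_level(0, min_level=1, max_level=3))  # 1
print(clamp_level(5, min_level=1, max_level=3))  # 3
```

For AGE, levels 1 to 3 correspond to bucket widths of 5, 10 and 20 years according to HIERARCHYFUNCTION_NUM_2.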

#### View for Training

In [17]:
try:
    cursor.execute("DROP VIEW ADULT_VIEW_TRAIN")
except dbapi.Error:
    # The view does not exist yet on a first run, so there is nothing to drop
    pass

cursor.execute("""
CREATE VIEW ADULT_VIEW_TRAIN (ID, AGE, EDUCATION, MARITAL_STATUS, GENDER, HOURS_PER_WEEK, WORKCLASS, EDUCATION_NUM, OCCUPATION ,CAPITAL_GAIN, CAPITAL_LOSS, SALARY) AS
SELECT ID, TO_VARCHAR(AGE) AS "AGE", EDUCATION, MARITAL_STATUS, GENDER, TO_VARCHAR(HOURS_PER_WEEK) AS "HOURS_PER_WEEK", WORKCLASS, TO_VARCHAR(EDUCATION_NUM) AS EDUCATION_NUM, OCCUPATION, CAPITAL_GAIN, CAPITAL_LOSS, SALARY FROM ANON_DEMO.ADULT_DATA_TRAIN
WITH ANONYMIZATION (
ALGORITHM 'K-ANONYMITY' PARAMETERS '{"data_change_strategy": "restricted", "k":10,"recoding":"multi_dimensional_strict", "loss":0.025}'
COLUMN ID PARAMETERS '{"is_sequence":true}'
COLUMN AGE PARAMETERS '{"is_quasi_identifier":true, "min":1, "max": 3 ,"hierarchy":{"schema":"JUPYTER","function":"HIERARCHYFUNCTION_NUM_2","levels":5}}'
COLUMN EDUCATION PARAMETERS '{"is_quasi_identifier":true,"hierarchy":{"embedded":[["Preschool","Primary School","Primary education"],["1st-4th","Primary School","Primary education"],["5th-6th","Primary School","Primary education"],["7th-8th","High School","Secondary education"],["9th","High School","Secondary education"],["10th","High School","Secondary education"],["11th","High School","Secondary education"],["12th","High School","Secondary education"],["HS-grad","High School","Secondary education"],["Assoc-voc","Professional Education","Higher education"],["Assoc-acdm","Professional Education","Higher education"],["Prof-school","Professional Education","Higher education"],["Some-college","Undergraduate","Higher education"],["Bachelors","Undergraduate","Higher education"],["Masters","Graduate","Higher education"],["Doctorate","Graduate","Higher education"]]}}'
COLUMN MARITAL_STATUS PARAMETERS '{"is_quasi_identifier":true,"hierarchy":{"embedded":[["Married-civ-spouse","spouse present"],["Divorced","spouse not present"],["Never-married","spouse not present"],["Separated","spouse not present"],["Widowed","spouse not present"],["Married-spouse-absent","spouse not present"],["Married-AF-spouse","spouse present"]]}}'
COLUMN HOURS_PER_WEEK PARAMETERS '{"is_quasi_identifier":true, "min":1, "max": 4 ,"hierarchy":{"schema":"JUPYTER","function":"HIERARCHYFUNCTION_NUM_2","levels":5}}'
COLUMN GENDER PARAMETERS '{"is_quasi_identifier":true,"hierarchy":{"embedded":[["Male"],["Female"]]}}'
COLUMN WORKCLASS PARAMETERS '{"is_quasi_identifier":true, "min": 1, "hierarchy":{"embedded": [["Private", "Private"],["Federal-gov","Government"],["Local-gov","Government"],["State-gov","Government"],["Self-emp-inc","Self-Employed"],["Self-emp-not-inc","Self-Employed"],["Never-worked","Other"],["Without-pay","Other"],["?","Other/Unknown"],["Unknown","Other/Unknown"]]}}'
COLUMN EDUCATION_NUM PARAMETERS '{"is_quasi_identifier":true, "min": 1, "max":4 ,"hierarchy":{"schema":"JUPYTER","function":"HIERARCHYFUNCTION_SMALL_NUM","levels":5}}'
COLUMN OCCUPATION PARAMETERS '{"is_quasi_identifier":true, "min": 1, "hierarchy":{"embedded": [["Adm-clerical", "White-Collar"],["Craft-repair", "Blue-Collar"],["Exec-managerial", "White-Collar"],["Farming-fishing", "Blue-Collar"],["Handlers-cleaners", "Blue-Collar"],["Machine-op-inspct", "Blue-Collar"],["Other-service", "Service"],["Priv-house-serv", "Service"],["Prof-specialty", "Professional"],["Protective-serv", "Service"],["Tech-support", "Service"],["Transport-moving", "Blue-Collar"],["Armed-Forces", "Other/Unknown"],["?", "Other/Unknown"],["Sales", "Other/Unknown"]]}}'
)""")

cursor.execute("REFRESH VIEW ADULT_VIEW_TRAIN ANONYMIZATION")
cursor.execute("COMMIT")
dataframe.DataFrame(conn,'SELECT * FROM ADULT_VIEW_TRAIN LIMIT 10').collect()

Unnamed: 0,ID,AGE,EDUCATION,MARITAL_STATUS,GENDER,HOURS_PER_WEEK,WORKCLASS,EDUCATION_NUM,OCCUPATION,CAPITAL_GAIN,CAPITAL_LOSS,SALARY
0,1,[35-39],Higher education,Never-married,Male,[40-49],Government,[8-15],White-Collar,2174,0,<=50K
1,2,[50-59],Higher education,*,Male,[0-19],*,[8-15],*,0,0,<=50K
2,3,[35-39],High School,spouse not present,Male,[40-44],Private,[8-9],Blue-Collar,0,0,<=50K
3,4,[50-54],11th,Married-civ-spouse,Male,*,*,[6-7],*,0,0,<=50K
4,5,[25-29],Higher education,Married-civ-spouse,Female,[40-59],*,[12-13],*,0,0,<=50K
5,6,[35-39],Higher education,spouse present,Female,*,Private,[12-15],White-Collar,0,0,<=50K
6,7,[45-49],High School,spouse not present,Female,*,*,[4-5],Service,0,0,<=50K
7,8,[50-54],High School,Married-civ-spouse,Male,[40-59],Self-Employed,[8-9],White-Collar,0,0,>50K
8,9,[30-34],Masters,spouse not present,Female,*,Private,[14-15],Professional,14084,0,>50K
9,10,[40-44],Bachelors,Married-civ-spouse,Male,[40-44],Private,[12-13],White-Collar,5178,0,>50K


#### View for Testing

In [18]:
try:
    cursor.execute("DROP VIEW ADULT_VIEW_TEST")
except dbapi.Error:
    # The view does not exist yet on a first run, so there is nothing to drop
    pass

cursor.execute("""
CREATE VIEW ADULT_VIEW_TEST (ID, AGE, EDUCATION, MARITAL_STATUS, GENDER, HOURS_PER_WEEK, WORKCLASS, EDUCATION_NUM, OCCUPATION ,CAPITAL_GAIN, CAPITAL_LOSS, SALARY) AS
SELECT ID, TO_VARCHAR(AGE) AS "AGE", EDUCATION, MARITAL_STATUS, GENDER, TO_VARCHAR(HOURS_PER_WEEK) AS "HOURS_PER_WEEK", WORKCLASS, TO_VARCHAR(EDUCATION_NUM) AS EDUCATION_NUM, OCCUPATION, CAPITAL_GAIN, CAPITAL_LOSS, SALARY FROM ANON_DEMO.ADULT_DATA_TEST
WITH ANONYMIZATION (
ALGORITHM 'K-ANONYMITY' PARAMETERS '{"data_change_strategy": "restricted", "k":10,"recoding":"multi_dimensional_strict"}'
COLUMN ID PARAMETERS '{"is_sequence":true}'
COLUMN AGE PARAMETERS '{"is_quasi_identifier":true, "min":1, "max": 3 ,"hierarchy":{"schema":"JUPYTER","function":"HIERARCHYFUNCTION_NUM_2","levels":5}}'
COLUMN EDUCATION PARAMETERS '{"is_quasi_identifier":true,"hierarchy":{"embedded":[["Preschool","Primary School","Primary education"],["1st-4th","Primary School","Primary education"],["5th-6th","Primary School","Primary education"],["7th-8th","High School","Secondary education"],["9th","High School","Secondary education"],["10th","High School","Secondary education"],["11th","High School","Secondary education"],["12th","High School","Secondary education"],["HS-grad","High School","Secondary education"],["Assoc-voc","Professional Education","Higher education"],["Assoc-acdm","Professional Education","Higher education"],["Prof-school","Professional Education","Higher education"],["Some-college","Undergraduate","Higher education"],["Bachelors","Undergraduate","Higher education"],["Masters","Graduate","Higher education"],["Doctorate","Graduate","Higher education"]]}}'
COLUMN MARITAL_STATUS PARAMETERS '{"is_quasi_identifier":true,"hierarchy":{"embedded":[["Married-civ-spouse","spouse present"],["Divorced","spouse not present"],["Never-married","spouse not present"],["Separated","spouse not present"],["Widowed","spouse not present"],["Married-spouse-absent","spouse not present"],["Married-AF-spouse","spouse present"]]}}'
COLUMN HOURS_PER_WEEK PARAMETERS '{"is_quasi_identifier":true, "min":1, "max": 4 ,"hierarchy":{"schema":"JUPYTER","function":"HIERARCHYFUNCTION_NUM_2","levels":5}}'
COLUMN GENDER PARAMETERS '{"is_quasi_identifier":true,"hierarchy":{"embedded":[["Male"],["Female"]]}}'
COLUMN WORKCLASS PARAMETERS '{"is_quasi_identifier":true, "min": 1, "hierarchy":{"embedded": [["Private", "Private"],["Federal-gov","Government"],["Local-gov","Government"],["State-gov","Government"],["Self-emp-inc","Self-Employed"],["Self-emp-not-inc","Self-Employed"],["Never-worked","Other"],["Without-pay","Other"],["?","Other/Unknown"],["Unknown","Other/Unknown"]]}}'
COLUMN EDUCATION_NUM PARAMETERS '{"is_quasi_identifier":true, "min": 1, "max":4 ,"hierarchy":{"schema":"JUPYTER","function":"HIERARCHYFUNCTION_SMALL_NUM","levels":5}}'
COLUMN OCCUPATION PARAMETERS '{"is_quasi_identifier":true, "min": 1, "hierarchy":{"embedded": [["Adm-clerical", "White-Collar"],["Craft-repair", "Blue-Collar"],["Exec-managerial", "White-Collar"],["Farming-fishing", "Blue-Collar"],["Handlers-cleaners", "Blue-Collar"],["Machine-op-inspct", "Blue-Collar"],["Other-service", "Service"],["Priv-house-serv", "Service"],["Prof-specialty", "Professional"],["Protective-serv", "Service"],["Tech-support", "Service"],["Transport-moving", "Blue-Collar"],["Armed-Forces", "Other/Unknown"],["?", "Other/Unknown"],["Sales", "Other/Unknown"]]}}'
)""")

cursor.execute("REFRESH VIEW ADULT_VIEW_TEST ANONYMIZATION")
cursor.execute("COMMIT")
dataframe.DataFrame(conn,'SELECT * FROM ADULT_VIEW_TEST LIMIT 10').collect()

Unnamed: 0,ID,AGE,EDUCATION,MARITAL_STATUS,GENDER,HOURS_PER_WEEK,WORKCLASS,EDUCATION_NUM,OCCUPATION,CAPITAL_GAIN,CAPITAL_LOSS,SALARY
0,1,[25-29],High School,spouse not present,Male,*,*,[6-7],*,0,0,<=50K
1,2,[35-39],High School,Married-civ-spouse,Male,*,Private,[8-9],Blue-Collar,0,0,<=50K
2,3,[25-29],Higher education,Married-civ-spouse,Male,*,*,*,Service,0,0,>50K
3,4,[40-44],Some-college,Married-civ-spouse,Male,[40-49],Private,[10-11],Blue-Collar,7688,0,>50K
4,5,[15-19],*,*,Female,*,*,[8-15],*,0,0,<=50K
5,6,[30-34],High School,spouse not present,Male,*,*,[6-7],*,0,0,<=50K
6,7,[25-29],HS-grad,spouse not present,Male,*,*,[8-9],*,0,0,<=50K
7,8,[60-69],Higher education,Married-civ-spouse,Male,*,Self-Employed,[12-15],Professional,3103,0,>50K
8,9,[20-24],Higher education,spouse not present,Female,*,*,*,Service,0,0,<=50K
9,10,[55-59],7th-8th,Married-civ-spouse,Male,*,*,[4-5],*,0,0,<=50K


### Alternative Implementation for productive scenarios

Using the multi-dimensional recoding scheme allowed me to get nicer looking data, but since every group in the data set may have its own level of anonymization, chances are that these levels will differ between training and test data. The test data would then show characteristics that my model has not seen during training. 

To get around this issue, there are two approaches you might think of:
1. Use the multi-dimensional scheme for the training data set and apply the found anonymization rules to the test set by using the same anonymization view.
2. Anonymize training data with the global_strict scheme, build a “dictionary” from this and apply it to the test data. 
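
The idea behind the second approach can be sketched in plain Python as follows. It relies on the assumption that a global scheme maps each raw value of a column to exactly one anonymized value, so a lookup table learned from the training data can be reused for the test data. The function names and the "*" fallback for unseen values are illustrative choices, not taken from the notebook.

```python
def build_dictionary(raw_values, anonymized_values):
    """Learn a raw -> anonymized mapping for one column from training data."""
    mapping = {}
    for raw, anon in zip(raw_values, anonymized_values):
        # Under a global scheme each raw value should have a single replacement
        if mapping.setdefault(raw, anon) != anon:
            raise ValueError(f"Inconsistent mapping for {raw!r}")
    return mapping


def apply_dictionary(values, mapping, fallback="*"):
    """Translate test values; values unseen during training get the fallback."""
    return [mapping.get(value, fallback) for value in values]


train_raw = ["Bachelors", "Masters", "Bachelors", "HS-grad"]
train_anon = ["Higher education", "Higher education", "Higher education", "Secondary education"]

education_map = build_dictionary(train_raw, train_anon)
print(apply_dictionary(["Masters", "HS-grad", "Doctorate"], education_map))
# ['Higher education', 'Secondary education', '*']
```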


#### 1. Multi-dimensional recoding scheme with test data processed by training view

In [19]:
try:
    cursor.execute("""DROP TABLE ANON_DEMO.ADULT_DATA_TRAIN_MDS;""")
except dbapi.Error:
    # The table does not exist yet on a first run, so there is nothing to drop
    pass

cursor.execute("""CREATE TABLE
"ANON_DEMO"."ADULT_DATA_TRAIN_MDS"
AS (
SELECT 
"ID","AGE","EDUCATION","MARITAL_STATUS","GENDER","HOURS_PER_WEEK","WORKCLASS","EDUCATION_NUM","OCCUPATION","CAPITAL_GAIN","CAPITAL_LOSS","SALARY" FROM "ANON_DEMO"."ADULT_DATA_TRAIN"
);
""")

True

In [20]:
try:
    cursor.execute("DROP VIEW ADULT_VIEW_TRAIN_MDS")
except dbapi.Error:
    # The view does not exist yet on a first run, so there is nothing to drop
    pass

cursor.execute("""
CREATE VIEW ADULT_VIEW_TRAIN_MDS (ID, AGE, EDUCATION, MARITAL_STATUS, GENDER, HOURS_PER_WEEK, WORKCLASS, EDUCATION_NUM, OCCUPATION ,CAPITAL_GAIN, CAPITAL_LOSS, SALARY) AS
SELECT ID, TO_VARCHAR(AGE) AS "AGE", EDUCATION, MARITAL_STATUS, GENDER, TO_VARCHAR(HOURS_PER_WEEK) AS "HOURS_PER_WEEK", WORKCLASS, TO_VARCHAR(EDUCATION_NUM) AS EDUCATION_NUM, OCCUPATION, CAPITAL_GAIN, CAPITAL_LOSS, SALARY FROM ANON_DEMO.ADULT_DATA_TRAIN_MDS
WITH ANONYMIZATION (
ALGORITHM 'K-ANONYMITY' PARAMETERS '{"data_change_strategy": "restricted", "k":10,"recoding":"multi_dimensional_strict"}'
COLUMN ID PARAMETERS '{"is_sequence":true}'
COLUMN AGE PARAMETERS '{"is_quasi_identifier":true, "min":1, "max": 4 ,"hierarchy":{"schema":"JUPYTER","function":"HIERARCHYFUNCTION_NUM_2","levels":5}}'
COLUMN EDUCATION PARAMETERS '{"is_quasi_identifier":true,"hierarchy":{"embedded":[["Preschool","Primary School","Primary education"],["1st-4th","Primary School","Primary education"],["5th-6th","Primary School","Primary education"],["7th-8th","High School","Secondary education"],["9th","High School","Secondary education"],["10th","High School","Secondary education"],["11th","High School","Secondary education"],["12th","High School","Secondary education"],["HS-grad","High School","Secondary education"],["Assoc-voc","Professional Education","Higher education"],["Assoc-acdm","Professional Education","Higher education"],["Prof-school","Professional Education","Higher education"],["Some-college","Undergraduate","Higher education"],["Bachelors","Undergraduate","Higher education"],["Masters","Graduate","Higher education"],["Doctorate","Graduate","Higher education"]]}}'
COLUMN MARITAL_STATUS PARAMETERS '{"is_quasi_identifier":true,"hierarchy":{"embedded":[["Married-civ-spouse","spouse present"],["Divorced","spouse not present"],["Never-married","spouse not present"],["Separated","spouse not present"],["Widowed","spouse not present"],["Married-spouse-absent","spouse not present"],["Married-AF-spouse","spouse present"]]}}'
COLUMN HOURS_PER_WEEK PARAMETERS '{"is_quasi_identifier":true, "min":1, "max": 5 ,"hierarchy":{"schema":"JUPYTER","function":"HIERARCHYFUNCTION_NUM_2","levels":5}}'
COLUMN GENDER PARAMETERS '{"is_quasi_identifier":true,"hierarchy":{"embedded":[["Male"],["Female"]]}}'
COLUMN WORKCLASS PARAMETERS '{"is_quasi_identifier":true, "min": 1, "hierarchy":{"embedded": [["Private", "Private"],["Federal-gov","Government"],["Local-gov","Government"],["State-gov","Government"],["Self-emp-inc","Self-Employed"],["Self-emp-not-inc","Self-Employed"],["Never-worked","Other"],["Without-pay","Other"],["?","Other/Unknown"],["Unknown","Other/Unknown"]]}}'
COLUMN EDUCATION_NUM PARAMETERS '{"is_quasi_identifier":true, "min": 1, "max":4 ,"hierarchy":{"schema":"JUPYTER","function":"HIERARCHYFUNCTION_SMALL_NUM","levels":5}}'
COLUMN OCCUPATION PARAMETERS '{"is_quasi_identifier":true, "min": 1, "hierarchy":{"embedded": [["Adm-clerical", "White-Collar"],["Craft-repair", "Blue-Collar"],["Exec-managerial", "White-Collar"],["Farming-fishing", "Blue-Collar"],["Handlers-cleaners", "Blue-Collar"],["Machine-op-inspct", "Blue-Collar"],["Other-service", "Service"],["Priv-house-serv", "Service"],["Prof-specialty", "Professional"],["Protective-serv", "Service"],["Tech-support", "Service"],["Transport-moving", "Blue-Collar"],["Armed-Forces", "Other/Unknown"],["?", "Other/Unknown"],["Sales", "Other/Unknown"]]}}'
)""")

cursor.execute("REFRESH VIEW ADULT_VIEW_TRAIN_MDS ANONYMIZATION")
cursor.execute("COMMIT")
dataframe.DataFrame(conn,'SELECT * FROM ADULT_VIEW_TRAIN_MDS LIMIT 10').collect()

Unnamed: 0,ID,AGE,EDUCATION,MARITAL_STATUS,GENDER,HOURS_PER_WEEK,WORKCLASS,EDUCATION_NUM,OCCUPATION,CAPITAL_GAIN,CAPITAL_LOSS,SALARY
0,1,[35-39],Higher education,spouse not present,Male,[40-59],Government,[12-15],White-Collar,2174,0,<=50K
1,2,[40-59],Higher education,Married-civ-spouse,Male,[0-19],*,[12-15],*,0,0,<=50K
2,3,[35-39],HS-grad,spouse not present,Male,[40-44],*,[8-9],Blue-Collar,0,0,<=50K
3,4,[50-54],11th,Married-civ-spouse,Male,[40-59],*,[6-7],*,0,0,<=50K
4,5,[20-29],Bachelors,spouse present,Female,[40-59],Private,[12-13],*,0,0,<=50K
5,6,[20-39],Masters,Married-civ-spouse,Female,[40-59],Private,[14-15],*,0,0,<=50K
6,7,[40-59],9th,spouse not present,Female,*,*,[4-5],Service,0,0,<=50K
7,8,[50-54],High School,Married-civ-spouse,Male,[40-59],Self-Employed,[8-9],White-Collar,0,0,>50K
8,9,[30-39],Masters,spouse not present,Female,[40-59],Private,[14-15],Professional,14084,0,>50K
9,10,[40-44],Bachelors,Married-civ-spouse,Male,[40-44],Private,[12-13],White-Collar,5178,0,>50K
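
The embedded hierarchies in the view definition list, for each original value, its generalization at successive levels. The lookup can be sketched in plain Python using the MARITAL_STATUS hierarchy from the view above. The helper name `generalize` is hypothetical, and treating any level above the last listed one as `*` is an assumption based on the `*` values in the output. It does not describe how HANA performs the lookup internally.

```python
# MARITAL_STATUS hierarchy copied from the view definition: [original value, level 1].
MARITAL_STATUS_HIERARCHY = [
    ["Married-civ-spouse", "spouse present"],
    ["Divorced", "spouse not present"],
    ["Never-married", "spouse not present"],
    ["Separated", "spouse not present"],
    ["Widowed", "spouse not present"],
    ["Married-spouse-absent", "spouse not present"],
    ["Married-AF-spouse", "spouse present"],
]


def generalize(value, level, hierarchy):
    """Return `value` generalized to `level` (0 = original value)."""
    for row in hierarchy:
        if row[0] == value:
            # Assumption: levels beyond the listed ones collapse to '*'.
            return row[level] if level < len(row) else "*"
    raise KeyError(value)


for level in range(3):
    print(level, generalize("Married-civ-spouse", level, MARITAL_STATUS_HIERARCHY))
```

In the output above, some rows still show `Married-civ-spouse` while others show `spouse present`, so different rows sit at different levels of the same hierarchy.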


In [21]:
cursor.execute("""INSERT INTO ANON_DEMO.ADULT_DATA_TRAIN_MDS (
SELECT
(ID + 32561) AS "ID",
"AGE","EDUCATION","MARITAL_STATUS","GENDER","HOURS_PER_WEEK","WORKCLASS","EDUCATION_NUM","OCCUPATION","CAPITAL_GAIN","CAPITAL_LOSS","SALARY" 
FROM ANON_DEMO.ADULT_DATA_TEST)
;""")
cursor.execute("""REFRESH VIEW ADULT_VIEW_TRAIN_MDS ANONYMIZATION;""")
dataframe.DataFrame(conn,'SELECT * FROM ADULT_VIEW_TRAIN_MDS LIMIT 10').collect()

Unnamed: 0,ID,AGE,EDUCATION,MARITAL_STATUS,GENDER,HOURS_PER_WEEK,WORKCLASS,EDUCATION_NUM,OCCUPATION,CAPITAL_GAIN,CAPITAL_LOSS,SALARY
0,1,[35-39],Higher education,spouse not present,Male,[40-59],Government,[12-15],White-Collar,2174,0,<=50K
1,2,[40-59],Higher education,Married-civ-spouse,Male,[0-19],*,[12-15],*,0,0,<=50K
2,3,[35-39],HS-grad,spouse not present,Male,[40-44],*,[8-9],Blue-Collar,0,0,<=50K
3,4,[50-54],11th,Married-civ-spouse,Male,[40-59],*,[6-7],*,0,0,<=50K
4,5,[20-29],Bachelors,spouse present,Female,[40-59],Private,[12-13],*,0,0,<=50K
5,6,[20-39],Masters,Married-civ-spouse,Female,[40-59],Private,[14-15],*,0,0,<=50K
6,7,[40-59],9th,spouse not present,Female,*,*,[4-5],Service,0,0,<=50K
7,8,[50-54],High School,Married-civ-spouse,Male,[40-59],Self-Employed,[8-9],White-Collar,0,0,>50K
8,9,[30-39],Masters,spouse not present,Female,[40-59],Private,[14-15],Professional,14084,0,>50K
9,10,[40-44],Bachelors,Married-civ-spouse,Male,[40-44],Private,[12-13],White-Collar,5178,0,>50K


In [22]:
cursor.execute("""SELECT MAX(ID) FROM ANON_DEMO.ADULT_DATA_TRAIN_MDS;""")
cursor.fetchall()

[(48842,)]
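
The query confirms that the test rows were appended behind the training rows: the test IDs were shifted by 32561, and the maximum ID is now 48842. The arithmetic of this offset, and of undoing it again when the combined view is split later, can be sketched as follows. The constant and function names are illustrative and not part of the notebook.

```python
TRAIN_ROWS = 32561  # offset used in the INSERT statement above


def shift_test_id(test_id):
    """ID a test row receives when appended behind the training rows."""
    return test_id + TRAIN_ROWS


def split_combined_id(combined_id):
    """Map a combined ID back to ('train', id) or ('test', original test id)."""
    if combined_id <= TRAIN_ROWS:
        return ("train", combined_id)
    return ("test", combined_id - TRAIN_ROWS)


print(48842 - TRAIN_ROWS)        # 16281 test rows
print(split_combined_id(48842))  # ('test', 16281)
```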

#### 2. Global recoding scheme with dictionary
##### 2.1 Build Training view

In [23]:
try:
    cursor.execute("DROP VIEW ADULT_VIEW_TRAIN_MANUAL_GLOBAL")
except dbapi.Error:
    pass  # ignore the error if the view does not exist yet

cursor.execute("""
CREATE VIEW ADULT_VIEW_TRAIN_MANUAL_GLOBAL (ID, AGE, EDUCATION, MARITAL_STATUS, GENDER, HOURS_PER_WEEK, WORKCLASS, EDUCATION_NUM, OCCUPATION ,CAPITAL_GAIN, CAPITAL_LOSS, SALARY) AS
SELECT ID, TO_VARCHAR(AGE) AS "AGE", EDUCATION, MARITAL_STATUS, GENDER, TO_VARCHAR(HOURS_PER_WEEK) AS "HOURS_PER_WEEK", WORKCLASS, TO_VARCHAR(EDUCATION_NUM) AS EDUCATION_NUM, OCCUPATION, CAPITAL_GAIN, CAPITAL_LOSS, SALARY FROM ANON_DEMO.ADULT_DATA_TRAIN
WITH ANONYMIZATION (
ALGORITHM 'K-ANONYMITY' PARAMETERS '{"data_change_strategy": "restricted", "k":10, "loss":0.05}'
COLUMN ID PARAMETERS '{"is_sequence":true}'
COLUMN AGE PARAMETERS '{"is_quasi_identifier":true, "max": 4, "hierarchy":{"schema":"JUPYTER","function":"HIERARCHYFUNCTION_SMALL_NUM","levels":8}}'
COLUMN EDUCATION PARAMETERS '{"is_quasi_identifier":true, "max": 1, "hierarchy":{"embedded":[["Preschool","Primary School","Primary education"],["1st-4th","Primary School","Primary education"],["5th-6th","Primary School","Primary education"],["7th-8th","High School","Secondary education"],["9th","High School","Secondary education"],["10th","High School","Secondary education"],["11th","High School","Secondary education"],["12th","High School","Secondary education"],["HS-grad","High School","Secondary education"],["Assoc-voc","Professional Education","Higher education"],["Assoc-acdm","Professional Education","Higher education"],["Prof-school","Professional Education","Higher education"],["Some-college","Undergraduate","Higher education"],["Bachelors","Undergraduate","Higher education"],["Masters","Graduate","Higher education"],["Doctorate","Graduate","Higher education"]]}}'
COLUMN MARITAL_STATUS PARAMETERS '{"is_quasi_identifier":true, "max": 1,  "hierarchy":{"embedded":[["Married-civ-spouse","spouse present"],["Divorced","spouse not present"],["Never-married","spouse not present"],["Separated","spouse not present"],["Widowed","spouse not present"],["Married-spouse-absent","spouse not present"],["Married-AF-spouse","spouse present"]]}}'
COLUMN HOURS_PER_WEEK PARAMETERS '{"is_quasi_identifier":true, "max": 6, "hierarchy":{"schema":"JUPYTER","function":"HIERARCHYFUNCTION_NUM_2","levels":10}}'
COLUMN GENDER PARAMETERS '{"is_quasi_identifier":true, "hierarchy":{"embedded":[["Male"],["Female"]]}}'
COLUMN WORKCLASS PARAMETERS '{"is_quasi_identifier":true, "max": 1, "hierarchy":{"embedded": [["Private", "Private"],["Federal-gov","Government"],["Local-gov","Government"],["State-gov","Government"],["Self-emp-inc","Self-Employed"],["Self-emp-not-inc","Self-Employed"],["Never-worked","Other"],["Without-pay","Other"],["?","Other/Unknown"],["Unknown","Other/Unknown"]]}}'
COLUMN EDUCATION_NUM PARAMETERS '{"is_quasi_identifier":true, "hierarchy":{"schema":"JUPYTER","function":"HIERARCHYFUNCTION_SMALL_NUM","levels":6}}'
COLUMN OCCUPATION PARAMETERS '{"is_quasi_identifier":true, "max": 1, "hierarchy":{"embedded": [["Adm-clerical", "White-Collar"],["Craft-repair", "Blue-Collar"],["Exec-managerial", "White-Collar"],["Farming-fishing", "Blue-Collar"],["Handlers-cleaners", "Blue-Collar"],["Machine-op-inspct", "Blue-Collar"],["Other-service", "Service"],["Priv-house-serv", "Service"],["Prof-specialty", "Professional"],["Protective-serv", "Service"],["Tech-support", "Service"],["Transport-moving", "Blue-Collar"],["Armed-Forces", "Other/Unknown"],["?", "Other/Unknown"],["Sales", "Other/Unknown"]]}}'
)""")

cursor.execute("REFRESH VIEW ADULT_VIEW_TRAIN_MANUAL_GLOBAL ANONYMIZATION")
cursor.execute("COMMIT")
dataframe.DataFrame(conn,'SELECT * FROM ADULT_VIEW_TRAIN_MANUAL_GLOBAL LIMIT 10').collect()

Unnamed: 0,ID,AGE,EDUCATION,MARITAL_STATUS,GENDER,HOURS_PER_WEEK,WORKCLASS,EDUCATION_NUM,OCCUPATION,CAPITAL_GAIN,CAPITAL_LOSS,SALARY
0,1,[32-47],Undergraduate,spouse not present,*,[0-79],Government,[8-15],White-Collar,2174,0,<=50K
1,2,[48-63],Undergraduate,spouse present,*,[0-79],Self-Employed,[8-15],White-Collar,0,0,<=50K
2,3,[32-47],High School,spouse not present,*,[0-79],Private,[8-15],Blue-Collar,0,0,<=50K
3,4,[48-63],High School,spouse present,*,[0-79],Private,[0-7],Blue-Collar,0,0,<=50K
4,5,[16-31],Undergraduate,spouse present,*,[0-79],Private,[8-15],Professional,0,0,<=50K
5,6,[32-47],Graduate,spouse present,*,[0-79],Private,[8-15],White-Collar,0,0,<=50K
6,7,[48-63],High School,spouse not present,*,[0-79],Private,[0-7],Service,0,0,<=50K
7,8,[48-63],High School,spouse present,*,[0-79],Self-Employed,[8-15],White-Collar,0,0,>50K
8,9,[16-31],Graduate,spouse not present,*,[0-79],Private,[8-15],Professional,14084,0,>50K
9,10,[32-47],Undergraduate,spouse present,*,[0-79],Private,[8-15],White-Collar,5178,0,>50K
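
The view above requests `"k":10`, meaning every combination of quasi-identifier values should be shared by at least ten rows. A minimal way to check that property on rows fetched from such a view is sketched below. The function names and the toy rows are hypothetical, and the check says nothing about how HANA enforces k internally.

```python
from collections import Counter


def smallest_group(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing all quasi-identifier values."""
    groups = Counter(tuple(row[col] for col in quasi_identifiers) for row in rows)
    return min(groups.values())


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    return smallest_group(rows, quasi_identifiers) >= k


# Toy rows (hypothetical), shaped like the view output.
rows = [
    {"AGE": "[32-47]", "EDUCATION": "Undergraduate", "SALARY": "<=50K"},
    {"AGE": "[32-47]", "EDUCATION": "Undergraduate", "SALARY": ">50K"},
    {"AGE": "[48-63]", "EDUCATION": "High School", "SALARY": "<=50K"},
]

print(smallest_group(rows, ["AGE", "EDUCATION"]))
print(is_k_anonymous(rows, ["AGE", "EDUCATION"], k=2))
```

Only the quasi-identifier columns enter the grouping; columns such as SALARY or CAPITAL_GAIN are left out on purpose.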


##### 2.2 Create dictionary tables (very quick & dirty approach, obviously)

In [25]:
# Drop dictionary tables left over from a previous run; ignore errors if they do not exist yet.
for col in ["AGE", "EDUCATION", "MARITAL_STATUS", "GENDER", "HOURS_PER_WEEK",
            "WORKCLASS", "EDUCATION_NUM", "OCCUPATION", "SALARY"]:
    try:
        cursor.execute("DROP TABLE DISTINCT_{};".format(col))
    except dbapi.Error:
        pass

cursor.execute("""CREATE TABLE DISTINCT_AGE AS (SELECT DISTINCT T.AGE AS ORIGINAL, V.AGE AS PRIVATE FROM ANON_DEMO.ADULT_DATA_TRAIN AS T, JUPYTER.ADULT_VIEW_TRAIN_MANUAL_GLOBAL AS V WHERE T.ID = V.ID);""")
cursor.execute("""CREATE TABLE DISTINCT_EDUCATION AS (SELECT DISTINCT T.EDUCATION AS ORIGINAL, V.EDUCATION AS PRIVATE FROM ANON_DEMO.ADULT_DATA_TRAIN AS T, JUPYTER.ADULT_VIEW_TRAIN_MANUAL_GLOBAL AS V WHERE T.ID = V.ID);""")
cursor.execute("""CREATE TABLE DISTINCT_MARITAL_STATUS AS (SELECT DISTINCT T.MARITAL_STATUS AS ORIGINAL, V.MARITAL_STATUS AS PRIVATE FROM ANON_DEMO.ADULT_DATA_TRAIN AS T, JUPYTER.ADULT_VIEW_TRAIN_MANUAL_GLOBAL AS V WHERE T.ID = V.ID);""")
cursor.execute("""CREATE TABLE DISTINCT_GENDER AS (SELECT DISTINCT T.GENDER AS ORIGINAL, V.GENDER AS PRIVATE FROM ANON_DEMO.ADULT_DATA_TRAIN AS T, JUPYTER.ADULT_VIEW_TRAIN_MANUAL_GLOBAL AS V WHERE T.ID = V.ID);""")
cursor.execute("""CREATE TABLE DISTINCT_HOURS_PER_WEEK AS (SELECT DISTINCT T.HOURS_PER_WEEK AS ORIGINAL, V.HOURS_PER_WEEK AS PRIVATE FROM ANON_DEMO.ADULT_DATA_TRAIN AS T, JUPYTER.ADULT_VIEW_TRAIN_MANUAL_GLOBAL AS V WHERE T.ID = V.ID);""")
cursor.execute("""CREATE TABLE DISTINCT_WORKCLASS AS (SELECT DISTINCT T.WORKCLASS AS ORIGINAL, V.WORKCLASS AS PRIVATE FROM ANON_DEMO.ADULT_DATA_TRAIN AS T, JUPYTER.ADULT_VIEW_TRAIN_MANUAL_GLOBAL AS V WHERE T.ID = V.ID);""")
cursor.execute("""CREATE TABLE DISTINCT_EDUCATION_NUM AS (SELECT DISTINCT T.EDUCATION_NUM AS ORIGINAL, V.EDUCATION_NUM AS PRIVATE FROM ANON_DEMO.ADULT_DATA_TRAIN AS T, JUPYTER.ADULT_VIEW_TRAIN_MANUAL_GLOBAL AS V WHERE T.ID = V.ID);""")
cursor.execute("""CREATE TABLE DISTINCT_OCCUPATION AS (SELECT DISTINCT T.OCCUPATION AS ORIGINAL, V.OCCUPATION AS PRIVATE FROM ANON_DEMO.ADULT_DATA_TRAIN AS T, JUPYTER.ADULT_VIEW_TRAIN_MANUAL_GLOBAL AS V WHERE T.ID = V.ID);""")
cursor.execute("""CREATE TABLE DISTINCT_SALARY  AS (SELECT DISTINCT T.SALARY  AS ORIGINAL, V.SALARY  AS PRIVATE FROM ANON_DEMO.ADULT_DATA_TRAIN AS T, JUPYTER.ADULT_VIEW_TRAIN_MANUAL_GLOBAL AS V WHERE T.ID = V.ID);""")

True
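
Each `DISTINCT_*` table acts as a lookup from an original value to its anonymized counterpart. This only works as a dictionary if every original value maps to exactly one anonymized value, which is what the global recoding scheme is meant to provide. The sketch below builds such a lookup from (original, anonymized) pairs and raises an error if the mapping is ambiguous. The AGE pairs are taken from the training rows with ID 1 to 5 as shown in this notebook; the function name is hypothetical.

```python
def build_dictionary(pairs):
    """Build an original -> anonymized lookup; fail if a value maps to two targets."""
    mapping = {}
    for original, private in pairs:
        if mapping.setdefault(original, private) != private:
            raise ValueError("ambiguous mapping for {!r}".format(original))
    return mapping


# (original AGE, anonymized AGE) for the training rows with ID 1 to 5.
age_pairs = [(39, "[32-47]"), (50, "[48-63]"), (38, "[32-47]"),
             (53, "[48-63]"), (28, "[16-31]")]

age_dictionary = build_dictionary(age_pairs)
print(age_dictionary[38])
```

With the multi-dimensional scheme the same original value can appear at several generalization levels, so a lookup built this way would be ambiguous there.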

##### 2.3 Build test data from dictionary

In [26]:
try:
    cursor.execute("DROP VIEW ADULT_DATA_TEST_VIEW_MANUAL_GLOBAL")
except dbapi.Error:
    pass  # ignore the error if the view does not exist yet

cursor.execute("""
CREATE VIEW ADULT_DATA_TEST_VIEW_MANUAL_GLOBAL AS (
SELECT
    ORIGINAL.ID,
    AGE.PRIVATE AS AGE,
    EDUCATION.PRIVATE AS EDUCATION,
    MARITAL_STATUS.PRIVATE AS MARITAL_STATUS,
    GENDER.PRIVATE AS GENDER,
    HOURS_PER_WEEK.PRIVATE AS HOURS_PER_WEEK,
    WORKCLASS.PRIVATE AS WORKCLASS,
    EDUCATION_NUM.PRIVATE AS EDUCATION_NUM,
    OCCUPATION.PRIVATE AS OCCUPATION,
    SALARY.PRIVATE AS SALARY,
    ORIGINAL.CAPITAL_GAIN AS CAPITAL_GAIN,
    ORIGINAL.CAPITAL_LOSS  AS CAPITAL_LOSS
FROM 
    DISTINCT_AGE AS AGE,
    DISTINCT_EDUCATION AS EDUCATION,
    DISTINCT_MARITAL_STATUS AS MARITAL_STATUS,
    DISTINCT_GENDER AS GENDER,
    DISTINCT_HOURS_PER_WEEK AS HOURS_PER_WEEK,
    DISTINCT_WORKCLASS AS WORKCLASS,
    DISTINCT_EDUCATION_NUM AS EDUCATION_NUM,
    DISTINCT_OCCUPATION AS OCCUPATION,
    DISTINCT_SALARY AS SALARY,
    ANON_DEMO.ADULT_DATA_TEST AS ORIGINAL
WHERE 
    AGE.ORIGINAL = ORIGINAL.AGE 
    AND EDUCATION.ORIGINAL = ORIGINAL.EDUCATION 
    AND MARITAL_STATUS.ORIGINAL = ORIGINAL.MARITAL_STATUS 
    AND GENDER.ORIGINAL = ORIGINAL.GENDER 
    AND HOURS_PER_WEEK.ORIGINAL = ORIGINAL.HOURS_PER_WEEK 
    AND WORKCLASS.ORIGINAL = ORIGINAL.WORKCLASS 
    AND EDUCATION_NUM.ORIGINAL = ORIGINAL.EDUCATION_NUM 
    AND OCCUPATION.ORIGINAL = ORIGINAL.OCCUPATION 
    AND SALARY.ORIGINAL = ORIGINAL.SALARY  ); 
""")

dataframe.DataFrame(conn,'SELECT * FROM ADULT_DATA_TEST_VIEW_MANUAL_GLOBAL LIMIT 10').collect()

Unnamed: 0,ID,AGE,EDUCATION,MARITAL_STATUS,GENDER,HOURS_PER_WEEK,WORKCLASS,EDUCATION_NUM,OCCUPATION,SALARY,CAPITAL_GAIN,CAPITAL_LOSS
0,1,[16-31],High School,spouse not present,*,[0-79],Private,[0-7],Blue-Collar,<=50K,0,0
1,2,[32-47],High School,spouse present,*,[0-79],Private,[8-15],Blue-Collar,<=50K,0,0
2,3,[16-31],Professional Education,spouse present,*,[0-79],Government,[8-15],Service,>50K,0,0
3,4,[32-47],Undergraduate,spouse present,*,[0-79],Private,[8-15],Blue-Collar,>50K,7688,0
4,5,[16-31],Undergraduate,spouse not present,*,[0-79],Other/Unknown,[8-15],Other/Unknown,<=50K,0,0
5,6,[32-47],High School,spouse not present,*,[0-79],Private,[0-7],Service,<=50K,0,0
6,7,[16-31],High School,spouse not present,*,[0-79],Other/Unknown,[8-15],Other/Unknown,<=50K,0,0
7,8,[48-63],Professional Education,spouse present,*,[0-79],Self-Employed,[8-15],Professional,>50K,3103,0
8,9,[16-31],Undergraduate,spouse not present,*,[0-79],Private,[8-15],Service,<=50K,0,0
9,10,[48-63],High School,spouse present,*,[0-79],Private,[0-7],Blue-Collar,<=50K,0,0
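
Because the test view joins every column against its dictionary table with an equality condition, a test row is only returned if all of its values were already seen in the training data. This is presumably why the classification report for this model further down counts 12366 + 3824 = 16190 test rows instead of 16281. The sketch below mimics that inner-join behaviour in plain Python; the dictionary and rows are toy data, and `recode_rows` is a hypothetical helper.

```python
def recode_rows(rows, dictionaries):
    """Recode rows via per-column lookups; drop rows containing unseen values."""
    recoded = []
    for row in rows:
        new_row = dict(row)
        try:
            for column, lookup in dictionaries.items():
                new_row[column] = lookup[row[column]]
        except KeyError:
            continue  # same effect as the inner join: the row silently disappears
        recoded.append(new_row)
    return recoded


# Toy dictionary and test rows (hypothetical); AGE 90 was never seen in training.
dictionaries = {"AGE": {39: "[32-47]", 28: "[16-31]"}}
test_rows = [{"ID": 1, "AGE": 28}, {"ID": 2, "AGE": 39}, {"ID": 3, "AGE": 90}]

kept = recode_rows(test_rows, dictionaries)
print(len(test_rows) - len(kept), "row(s) dropped")
```

If dropped rows matter for the evaluation, comparing the row counts of the test table and the test view makes the loss visible.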


# 4. Load data from DB

## Create HANA dataframe from anonymized view

In [27]:
df_pal_train_anon = dataframe.DataFrame(conn,'SELECT * FROM ADULT_VIEW_TRAIN')
df_pal_test_anon = dataframe.DataFrame(conn,'SELECT * FROM ADULT_VIEW_TEST')

In [28]:
df_pal_train_anon_mds = dataframe.DataFrame(conn,'SELECT "ID","AGE","EDUCATION","MARITAL_STATUS","GENDER","HOURS_PER_WEEK","WORKCLASS","EDUCATION_NUM","OCCUPATION","CAPITAL_GAIN","CAPITAL_LOSS","SALARY" FROM ADULT_VIEW_TRAIN_MDS WHERE ID <= 32561')
df_pal_test_anon_mds = dataframe.DataFrame(conn,'SELECT (ID - 32561) AS "ID","AGE","EDUCATION","MARITAL_STATUS","GENDER","HOURS_PER_WEEK","WORKCLASS","EDUCATION_NUM","OCCUPATION","CAPITAL_GAIN","CAPITAL_LOSS","SALARY" FROM ADULT_VIEW_TRAIN_MDS WHERE ID > 32561')

In [29]:
df_pal_train_anon_gm = dataframe.DataFrame(conn,'SELECT * FROM ADULT_VIEW_TRAIN_MANUAL_GLOBAL')
df_pal_test_anon_gm = dataframe.DataFrame(conn,'SELECT * FROM ADULT_DATA_TEST_VIEW_MANUAL_GLOBAL')

In [30]:
df_pal_test_anon_mds.collect().head()

Unnamed: 0,ID,AGE,EDUCATION,MARITAL_STATUS,GENDER,HOURS_PER_WEEK,WORKCLASS,EDUCATION_NUM,OCCUPATION,CAPITAL_GAIN,CAPITAL_LOSS,SALARY
0,1,[25-29],11th,spouse not present,Male,*,*,[6-7],*,0,0,<=50K
1,2,[35-39],HS-grad,spouse present,Male,[50-59],Private,[8-9],Blue-Collar,0,0,<=50K
2,3,[20-29],Higher education,Married-civ-spouse,Male,[40-59],*,[12-13],Service,0,0,>50K
3,4,[40-44],Some-college,Married-civ-spouse,Male,[40-44],Private,[10-11],Blue-Collar,7688,0,>50K
4,5,[15-19],Higher education,*,Female,*,*,[8-15],*,0,0,<=50K


## Create HANA dataframe from original data

We select the same columns that the anonymization views expose, so that all models are compared on identical features.

In [31]:
df_pal_train = conn.table('ADULT_DATA_TRAIN', schema = 'ANON_DEMO').select('ID', 'AGE', 'EDUCATION', 'MARITAL_STATUS', 'GENDER', 'HOURS_PER_WEEK', 'WORKCLASS', 'EDUCATION_NUM', 'OCCUPATION' ,'CAPITAL_GAIN', 'CAPITAL_LOSS', 'SALARY')
df_pal_test = conn.table('ADULT_DATA_TEST', schema = 'ANON_DEMO').select('ID', 'AGE', 'EDUCATION', 'MARITAL_STATUS', 'GENDER', 'HOURS_PER_WEEK', 'WORKCLASS', 'EDUCATION_NUM', 'OCCUPATION' ,'CAPITAL_GAIN', 'CAPITAL_LOSS', 'SALARY')

In [32]:
df_pal_train.collect().head()

Unnamed: 0,ID,AGE,EDUCATION,MARITAL_STATUS,GENDER,HOURS_PER_WEEK,WORKCLASS,EDUCATION_NUM,OCCUPATION,CAPITAL_GAIN,CAPITAL_LOSS,SALARY
0,1,39,Bachelors,Never-married,Male,40,State-gov,13,Adm-clerical,2174,0,<=50K
1,2,50,Bachelors,Married-civ-spouse,Male,13,Self-emp-not-inc,13,Exec-managerial,0,0,<=50K
2,3,38,HS-grad,Divorced,Male,40,Private,9,Handlers-cleaners,0,0,<=50K
3,4,53,11th,Married-civ-spouse,Male,40,Private,7,Handlers-cleaners,0,0,<=50K
4,5,28,Bachelors,Married-civ-spouse,Female,40,Private,13,Prof-specialty,0,0,<=50K


# 5. ML Model training and evaluation
## Instantiate models

To allow for a full comparison, I work with four HGBC instances, all with the same parameters, to create models for:
1. Original data
2. Anonymized data with global recoding scheme, with dictionary approach for test data
3. Anonymized data with multi-dimensional recoding scheme, separate views for training and testing
4. Anonymized data with multi-dimensional recoding scheme, same view for training and testing

In [33]:
hgbc = HybridGradientBoostingClassifier(random_state= 4, 
                                        fold_num=5,
                                        evaluation_metric = 'error_rate', 
                                        ref_metric=['auc'],
                                        resampling_method  = 'stratified_cv',
                                        param_search_strategy = 'grid', 
                                        param_range=[('learning_rate',[0.1, 0.2, 1.0]),
                                                    ('n_estimators', [4, 3, 10]),
                                                    ('split_threshold', [0.1, 0.2, 2.0]),
                                                    ('max_depth',[2, 2, 20])]
                                       )
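
The `param_range` entries are read here as `[start, step, end]` triples. That reading is an assumption, but it is consistent with the selected parameters reported further down (0.7, 1.3, 14 and 10 all fall on such grids). Under that assumption, the grid the search has to cover can be enumerated as follows. The helper `expand_range` is hypothetical and only mirrors the assumed convention; it is not part of hana_ml.

```python
import math
from itertools import product


def expand_range(start, step, end):
    """Values start, start + step, ... that do not exceed end (assumed convention)."""
    count = math.floor((end - start) / step + 1e-9) + 1
    return [round(start + i * step, 10) for i in range(count)]


param_range = [
    ("learning_rate", [0.1, 0.2, 1.0]),
    ("n_estimators", [4, 3, 10]),
    ("split_threshold", [0.1, 0.2, 2.0]),
    ("max_depth", [2, 2, 20]),
]

grid = {name: expand_range(*triple) for name, triple in param_range}
for name, values in grid.items():
    print(name, values)

# Number of parameter combinations under the assumptions above.
n_candidates = len(list(product(*grid.values())))
print(n_candidates)
```

Each combination is additionally evaluated with 5-fold cross-validation (`fold_num=5`), so the grid size has a direct impact on training time.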

In [34]:
hgbc_anon = HybridGradientBoostingClassifier(random_state= 4, fold_num=5, evaluation_metric = 'error_rate', ref_metric=['auc'],resampling_method  = 'stratified_cv',param_search_strategy = 'grid', param_range=[('learning_rate',[0.1, 0.2, 1.0]), ('n_estimators', [4, 3, 10]), ('split_threshold', [0.1, 0.2, 2.0]), ('max_depth',[2, 2, 20])]  )

In [35]:
hgbc_anon_mds = HybridGradientBoostingClassifier(random_state= 4, fold_num=5,evaluation_metric = 'error_rate', ref_metric=['auc'],resampling_method  = 'stratified_cv',param_search_strategy = 'grid', param_range=[('learning_rate',[0.1, 0.2, 1.0]), ('n_estimators', [4, 3, 10]), ('split_threshold', [0.1, 0.2, 2.0]), ('max_depth',[2, 2, 20])]  )

In [36]:
hgbc_anon_gm = HybridGradientBoostingClassifier(random_state= 4, fold_num=5,evaluation_metric = 'error_rate', ref_metric=['auc'],resampling_method  = 'stratified_cv',param_search_strategy = 'grid', param_range=[('learning_rate',[0.1, 0.2, 1.0]), ('n_estimators', [4, 3, 10]), ('split_threshold', [0.1, 0.2, 2.0]), ('max_depth',[2, 2, 20])]  )

## Train and predict on anonymized data

In [37]:
features = df_pal_train_anon.columns
features.remove('SALARY')
features.remove('ID')
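
The same `features` list is reused for all three anonymized data sets below, which presumes that the views expose the same column names. A non-mutating way to derive the list is sketched here in plain Python. The `columns` list is written out by hand from the `ADULT_VIEW_TRAIN_MANUAL_GLOBAL` definition rather than read from a HANA dataframe, and `EXCLUDED` is an illustrative name.

```python
columns = ["ID", "AGE", "EDUCATION", "MARITAL_STATUS", "GENDER", "HOURS_PER_WEEK",
           "WORKCLASS", "EDUCATION_NUM", "OCCUPATION", "CAPITAL_GAIN", "CAPITAL_LOSS",
           "SALARY"]

EXCLUDED = {"ID", "SALARY"}  # key column and label are not features

# Build a new list instead of removing items from the original one.
features = [c for c in columns if c not in EXCLUDED]
print(features)
```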

In [38]:
hgbc_anon.fit(df_pal_train_anon,features= features, label = 'SALARY')
hgbc_anon_gm.fit(df_pal_train_anon_gm,features= features, label = 'SALARY')
hgbc_anon_mds.fit(df_pal_train_anon_mds,features= features, label = 'SALARY')

In [39]:
hgbc_anon.selected_param_.collect()

Unnamed: 0,PARAM_NAME,INT_VALUE,DOUBLE_VALUE,STRING_VALUE
0,GAMMA,,1.3,
1,ETA,,0.7,
2,MAX_DEPTH,14.0,,
3,ITER_NUM,10.0,,


In [40]:
hgbc_anon.feature_importances_.sort('IMPORTANCE', 'desc').collect().head()

Unnamed: 0,VARIABLE_NAME,IMPORTANCE
0,MARITAL_STATUS,0.390743
1,CAPITAL_GAIN,0.241871
2,CAPITAL_LOSS,0.094708
3,EDUCATION,0.079007
4,AGE,0.06614


In [41]:
hgbc_anon.confusion_matrix_.collect()

Unnamed: 0,ACTUAL_CLASS,PREDICTED_CLASS,COUNT
0,<=50K,<=50K,22948
1,<=50K,>50K,1302
2,>50K,<=50K,2681
3,>50K,>50K,4816


In [42]:
prediction_anon = hgbc_anon.predict(df_pal_test_anon,features= features, key='ID')
prediction_anon_gm = hgbc_anon_gm.predict(df_pal_test_anon_gm,features= features, key='ID')
prediction_anon_mds = hgbc_anon_mds.predict(df_pal_test_anon_mds,features= features, key='ID')


In [43]:
prediction_anon.collect().head()

Unnamed: 0,ID,SCORE,CONFIDENCE
0,35,<=50K,0.990675
1,40,<=50K,0.996192
2,43,<=50K,0.696018
3,57,<=50K,0.761908
4,60,<=50K,0.748539


In [44]:
df_pal_results_anon = prediction_anon.alias('L').join(df_pal_test_anon.alias('R'), 'L.ID = R.ID', select=[
    ('L.ID','ID'),
    ('L.SCORE','SCORE'),
    ('L.CONFIDENCE','CONFIDENCE'),
    ('R.SALARY','SALARY')
])

In [45]:
df_pal_results_anon_gm = prediction_anon_gm.alias('L').join(df_pal_test_anon_gm.alias('R'), 'L.ID = R.ID', select=[
    ('L.ID','ID'),
    ('L.SCORE','SCORE'),
    ('L.CONFIDENCE','CONFIDENCE'),
    ('R.SALARY','SALARY')
])

In [46]:
df_pal_results_anon_mds = prediction_anon_mds.alias('L').join(df_pal_test_anon_mds.alias('R'), 'L.ID = R.ID', select=[
    ('L.ID','ID'),
    ('L.SCORE','SCORE'),
    ('L.CONFIDENCE','CONFIDENCE'),
    ('R.SALARY','SALARY')
])

In [47]:
df_pal_results_anon_mds.collect()

Unnamed: 0,ID,SCORE,CONFIDENCE,SALARY
0,43,<=50K,0.816560,<=50K
1,462,<=50K,0.816560,<=50K
2,805,<=50K,0.816560,<=50K
3,1293,<=50K,0.816560,<=50K
4,1511,<=50K,0.816560,<=50K
...,...,...,...,...
16276,16111,<=50K,0.995398,<=50K
16277,16223,<=50K,0.999202,<=50K
16278,16234,<=50K,0.988107,<=50K
16279,16242,<=50K,0.951752,<=50K


In [48]:
print(df_pal_results_anon.filter('"SALARY" = "SCORE"').count() / df_pal_results_anon.count())
print(df_pal_results_anon_gm.filter('"SALARY" = "SCORE"').count() / df_pal_results_anon_gm.count())
print(df_pal_results_anon_mds.filter('"SALARY" = "SCORE"').count() / df_pal_results_anon_mds.count())

0.8583010871568085
0.8647313156269302
0.8654873779251888


## Train on original data

In [49]:
features = df_pal_train.columns
features.remove('SALARY')
features.remove('ID')

In [50]:
hgbc.fit(df_pal_train,features= features, label = 'SALARY')

In [51]:
hgbc.feature_importances_.sort('IMPORTANCE', 'desc').collect().head()

Unnamed: 0,VARIABLE_NAME,IMPORTANCE
0,MARITAL_STATUS,0.365188
1,CAPITAL_GAIN,0.201274
2,EDUCATION_NUM,0.167747
3,AGE,0.077628
4,CAPITAL_LOSS,0.062191


In [52]:
hgbc.confusion_matrix_.collect()

Unnamed: 0,ACTUAL_CLASS,PREDICTED_CLASS,COUNT
0,<=50K,<=50K,23477
1,<=50K,>50K,1243
2,>50K,<=50K,2547
3,>50K,>50K,5294


In [53]:
prediction = hgbc.predict(df_pal_test,features= features, key='ID')

In [54]:
prediction.collect().head()

Unnamed: 0,ID,SCORE,CONFIDENCE
0,35,<=50K,0.977781
1,40,<=50K,0.998806
2,43,<=50K,0.864999
3,57,<=50K,0.737743
4,60,<=50K,0.89789


In [55]:
df_pal_results = prediction.alias('L').join(df_pal_test.alias('R'), 'L.ID = R.ID', select=[
    ('L.ID','ID'),
    ('L.SCORE','SCORE'),
    ('L.CONFIDENCE','CONFIDENCE'),
    ('R.SALARY','SALARY')
])

In [56]:
df_pal_results.filter('"SALARY" = "SCORE"').count() / df_pal_results.count()

0.8702782384374425

## Performance Comparison
### Original data model

In [57]:
cm, cr = confusion_matrix(data=df_pal_results, key='ID', label_true='SALARY', label_pred='SCORE')
cm.collect()

Unnamed: 0,SALARY,SCORE,COUNT
0,>50K,>50K,2473
1,>50K,<=50K,1373
2,<=50K,>50K,739
3,<=50K,<=50K,11696


In [58]:
cr.collect()

Unnamed: 0,CLASS,RECALL,PRECISION,F_MEASURE,SUPPORT
0,<=50K,0.940571,0.894942,0.917189,12435
1,>50K,0.643006,0.769925,0.700765,3846


### Multi-dimensional recoding model

In [59]:
cm_anon, cr_anon = confusion_matrix(data=df_pal_results_anon, key='ID', label_true='SALARY', label_pred='SCORE')
cm_anon.collect()

Unnamed: 0,SALARY,SCORE,COUNT
0,>50K,>50K,2179
1,>50K,<=50K,1667
2,<=50K,>50K,640
3,<=50K,<=50K,11795


In [60]:
cr_anon.collect()

Unnamed: 0,CLASS,RECALL,PRECISION,F_MEASURE,SUPPORT
0,<=50K,0.948532,0.87617,0.910916,12435
1,>50K,0.566563,0.772969,0.653863,3846


### Global recoding model with dictionary

In [61]:
cm_anon_gm, cr_anon_gm = confusion_matrix(data=df_pal_results_anon_gm, key='ID', label_true='SALARY', label_pred='SCORE')
cm_anon_gm.collect()

Unnamed: 0,SALARY,SCORE,COUNT
0,>50K,>50K,2355
1,>50K,<=50K,1469
2,<=50K,>50K,721
3,<=50K,<=50K,11645


In [62]:
cr_anon_gm.collect()

Unnamed: 0,CLASS,RECALL,PRECISION,F_MEASURE,SUPPORT
0,<=50K,0.941695,0.887982,0.91405,12366
1,>50K,0.615847,0.765605,0.682609,3824


### Multi-dimensional recoding model with test data processed by training view

In [63]:
cm_anon_mds, cr_anon_mds = confusion_matrix(data=df_pal_results_anon_mds, key='ID', label_true='SALARY', label_pred='SCORE')
cm_anon_mds.collect()

Unnamed: 0,SALARY,SCORE,COUNT
0,>50K,>50K,2369
1,>50K,<=50K,1477
2,<=50K,>50K,713
3,<=50K,<=50K,11722


In [64]:
cr_anon_mds.collect()

Unnamed: 0,CLASS,RECALL,PRECISION,F_MEASURE,SUPPORT
0,<=50K,0.942662,0.888098,0.914567,12435
1,>50K,0.615965,0.768657,0.683891,3846
