# Overview

Welcome to the CN-Protect Demo Notebook. In this demo, we will run through a full example of using CN-Protect to anonymize data. After that, we will train logistic regression models to compare how well a model performs on the anonymized data versus the original, unmodified data. Full documentation is available on the [CryptoNumerics doc site](https://docs.cryptonumerics.com/cn-protect-ds/?page=docs.cryptonumerics.com/cn-protect-ds-html/protect.html).



# Import required libraries

CN-Protect has several modules for privacy. Here we will show an example using k-anonymity.

In [1]:
from cn.protect import Protect
from cn.protect.privacy import KAnonymity
from cn.protect.hierarchy import DataHierarchy, OrderHierarchy, IntervalHierarchy
from cn.protect.quality import Loss
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

# To begin, read a dataset into a pandas DataFrame

After running the next cell, the sample dataset can be seen. It contains data on several adults: some demographic information and data about their work.

In [2]:
# Read in and display the sample data.
orig = pd.read_csv('adult.csv', sep=";")
orig

Unnamed: 0,email,sex,age,race,marital-status,education,native-country,workclass,occupation,salary-class
0,martha81@ramirez-suarez.com,Male,39,White,Never-married,Bachelors,United-States,State-gov,Adm-clerical,<=50K
1,janice09@craig.biz,Male,50,White,Married-civ-spouse,Bachelors,United-States,Self-emp-not-inc,Exec-managerial,<=50K
2,samuelspencer@hotmail.com,Male,38,White,Divorced,HS-grad,United-States,Private,Handlers-cleaners,<=50K
3,enielsen@fields-miller.com,Male,53,Black,Married-civ-spouse,11th,United-States,Private,Handlers-cleaners,<=50K
4,paul30@yahoo.com,Female,28,Black,Married-civ-spouse,Bachelors,Cuba,Private,Prof-specialty,<=50K
5,chelseysmith@yahoo.com,Female,37,White,Married-civ-spouse,Masters,United-States,Private,Exec-managerial,<=50K
6,perkinsjames@hotmail.com,Female,49,Black,Married-spouse-absent,9th,Jamaica,Private,Other-service,<=50K
7,wrightethan@williams.info,Male,52,White,Married-civ-spouse,HS-grad,United-States,Self-emp-not-inc,Exec-managerial,>50K
8,cainamy@hotmail.com,Female,31,White,Never-married,Masters,United-States,Private,Prof-specialty,>50K
9,brittanymarshall@gmail.com,Male,42,White,Married-civ-spouse,Bachelors,United-States,Private,Exec-managerial,>50K


# Choose a methodology to protect your data. 

To begin, we must create a Protect object whose parameters guide the transformation. The first thing we must choose is the **privacy model.** A privacy model is essentially the criterion that must be met for the data to be considered anonymized. In this notebook, we will choose the privacy model k-anonymity. A data set is considered **k-anonymous** if each individual cannot be distinguished from at least k-1 other individuals. In other words, every data point in the dataset belongs to a group of at least k data points, all of which are indistinguishable to an attacker.
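
As a minimal sketch of the definition above (plain Python, independent of CN-Protect; the helper names `min_class_size` and `is_k_anonymous` and the toy records are made up for illustration), a table is k-anonymous when the smallest group of rows sharing the same quasi-identifier values has at least k members:

```python
from collections import Counter

def min_class_size(rows, quasi_cols):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    classes = Counter(tuple(row[c] for c in quasi_cols) for row in rows)
    return min(classes.values())

def is_k_anonymous(rows, quasi_cols, k):
    """True if every row is indistinguishable from at least k-1 other rows."""
    return min_class_size(rows, quasi_cols) >= k

# Toy records (illustrative only, not taken from adult.csv)
rows = [
    {"sex": "Male", "age": "[30-39]", "salary-class": "<=50K"},
    {"sex": "Male", "age": "[30-39]", "salary-class": ">50K"},
    {"sex": "Female", "age": "[30-39]", "salary-class": "<=50K"},
    {"sex": "Female", "age": "[30-39]", "salary-class": "<=50K"},
]

print(min_class_size(rows, ["sex", "age"]))      # 2
print(is_k_anonymous(rows, ["sex", "age"], 2))   # True
print(is_k_anonymous(rows, ["sex", "age"], 3))   # False
```

Only the quasi-identifier columns take part in the grouping; the salary-class values may differ within a group.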

We first create a Protect object which takes as parameters the dataset as a pandas dataframe and a PrivacyModel along with other optional parameters that can be specified later. Here we choose k=5 (although feel free to change this to experiment).

In [3]:
prot=Protect(orig, KAnonymity(5))

# Choosing optimization goal

Next we must choose a **quality model**. If the privacy model defines the constraints, then the quality model is what we are trying to optimize subject to those constraints. The default is Information Loss, or simply loss. In other words, we would like to obtain a dataset that is considered private but at the same time minimizes the amount of information we lose. Although it is already set by default, we show how to set the quality model below.

In [4]:
prot.quality_model = Loss()

# Setting suppression limit


There are two ways in which privacy protection is applied:

* Generalization, where entire columns are replaced by more general versions of the data. For example, zipcodes may be generalized by redacting characters, so 12345 and 12346 can both be replaced by 123**. Similarly, the ages 19, 25, and 26 can be replaced by [10-19], [20-29], and [20-29].

* Suppression, where entire rows are removed from the dataset (replaced by a null character \*). This can be useful when there are unique data points that are difficult to generalize. For example, if there is only one person in the [60-80] age range and we don't want them to be identifiable, rather than creating a much larger age range (say [40-80]), we can simply remove this person from the dataset and retain much more informative, yet private, information.

Suppression, like many parameters, is a trade-off. We can set a suppression limit, which gives the algorithm a maximum proportion of rows that it can suppress. By setting it to .1, no more than 10% (and likely far fewer) of the rows will be redacted.
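
The two mechanisms can be illustrated with a small stand-alone sketch. This is not how CN-Protect implements them: the function names and the rule "suppress any group smaller than k, unless that exceeds the limit" are assumptions chosen only to mirror the examples in the text.

```python
from collections import Counter

def generalize_zip(zipcode, keep=3):
    """Redact all but the first `keep` characters, e.g. 12345 -> 123**."""
    return zipcode[:keep] + "*" * (len(zipcode) - keep)

def generalize_age(age, width=10):
    """Replace an age with its bucket, e.g. 25 -> [20-29]."""
    low = (age // width) * width
    return f"[{low}-{low + width - 1}]"

def suppress_small_groups(values, k, limit):
    """Replace values whose group has fewer than k members with '*',
    but only if the suppressed share stays within `limit` (a proportion of rows)."""
    counts = Counter(values)
    too_small = [v for v in values if counts[v] < k]
    if len(too_small) > limit * len(values):
        raise ValueError("suppression limit exceeded; generalize further instead")
    return [v if counts[v] >= k else "*" for v in values]

print(generalize_zip("12345"), generalize_zip("12346"))   # 123** 123**
print([generalize_age(a) for a in (19, 25, 26)])          # ['[10-19]', '[20-29]', '[20-29]']

# Nine people in their twenties and one outlier aged 65
ages = [generalize_age(a) for a in (21, 22, 25, 26, 27, 28, 23, 24, 29, 65)]
print(suppress_small_groups(ages, k=2, limit=0.1))
```

In the last call the single person in [60-69] is suppressed, which uses exactly the 10% budget; with a lower limit the sketch would refuse, and wider age buckets would be needed instead.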


In [5]:
prot.suppression = .1

# Assigning Identifying Types

Now we must tell Protect about our data. Each attribute can have one of four identifying types, which are described [here](https://docs.cryptonumerics.com/cn-protect-ds-html/overview.html#types-of-privacy-threats).

For our dataset, email is an identifying attribute and should be completely removed. Salary-class is likely not identifying and can be set as insensitive. In general, the field that you would like to predict (as we will do with salary-class later in this notebook) should remain insensitive. All of the other attributes are considered quasi-identifying attributes: on their own they could not be used to identify a person, but in combination they potentially could. K-anonymity guarantees that for every record's tuple of quasi-identifier values, there are at least k-1 other records in the data set with the same tuple.

We set the identifying types in the itypes attribute of the Protect object. They can be set in multiple ways as shown.

In [6]:
for col in orig:
    if col not in ('email', 'salary-class'):
        prot.itypes[col] = 'quasi'
prot.itypes.email = 'identifying'
prot.itypes['salary-class'] = 'insensitive'

# Display the identifying types
prot.itypes

email             IDENTIFYING
sex                     QUASI
age                     QUASI
race                    QUASI
marital-status          QUASI
education               QUASI
native-country          QUASI
workclass               QUASI
occupation              QUASI
salary-class      INSENSITIVE
dtype: object

# Datatypes
We must also ensure that the data types are set correctly. They are detected automatically, but can be changed if detected incorrectly.

In [7]:
prot.dtypes

email              STRING
sex                STRING
age               INTEGER
race               STRING
marital-status     STRING
education          STRING
native-country     STRING
workclass          STRING
occupation         STRING
salary-class       STRING
dtype: object

# Generalization

Recall that the two ways privacy protection can be applied are suppression and generalization. To do generalization well, we must tell Protect how the data can be generalized. By default, its only options are to either leave a column as it is or completely remove it from the dataset.

Hierarchies are used so the algorithm knows in what ways it can generalize the columns of the data. For example, a hierarchy is what tells Protect that it is allowed to replace the age 11 with the interval [10-14], [10-19], or [0-19], depending on how much generalization needs to occur.

Hierarchies can be created automatically in several ways, or they can be specified manually in a DataFrame. First, we will automatically create one for age using OrderHierarchy. OrderHierarchies are some of the easiest to generate and are often good enough for many purposes. The first parameter, 'interval', says that the format of the output should be [min-max]. The remaining arguments say (multiplicatively) how large the intervals should be: 5, 2, 2 means we give options for intervals of size 5, 10, and 20. However, OrderHierarchies always begin at the minimum element of the dataset, meaning the lowest interval will always be something like [17-26]. For more control, one can use IntervalHierarchies, which we will demonstrate later.
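
To make the multiplicative sizes concrete, here is a rough stand-alone sketch. It assumes fixed-width numeric intervals anchored at the smallest value; the real OrderHierarchy may group differently (for example by position in the sorted list of distinct values), and `order_intervals` is a hypothetical helper, not part of CN-Protect.

```python
def order_intervals(values, *factors):
    """For each level, map every value to an interval anchored at min(values).

    Interval sizes grow multiplicatively: factors (5, 2, 2) give sizes 5, 10, 20.
    """
    start = min(values)
    levels = []
    size = 1
    for factor in factors:
        size *= factor
        level = {}
        for v in values:
            low = start + ((v - start) // size) * size
            level[v] = f"[{low}-{low + size - 1}]"
        levels.append(level)
    return levels

ages = [17, 21, 22, 39, 50]
for level in order_intervals(ages, 5, 2, 2):
    print([level[a] for a in ages])
```

Because the intervals start at the minimum age (17 in this toy list) rather than at a round number, the size-10 level begins with [17-26] and the size-20 level with [17-36].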

In [8]:
prot.hierarchies.age = OrderHierarchy('interval', 5, 2, 2)

# Data Hierarchies

For the rest of the hierarchies, we will specify them using DataFrames loaded from CSV files. Shown below is the hierarchy for education. Each column specifies a degree of generalization: the first column holds all the possible original data values, and each successive column shows a further degree of generalization. The last degree is the complete removal of the column. Note that we use square brackets to make it clear which degree of generalization was used; in practice, these are not necessary.

In [9]:
pd.read_csv('adult_hierarchy_education.csv', sep=';', header=None)

Unnamed: 0,0,1,2,3
0,Bachelors,[Undergraduate],[[Higher education]],*
1,Some-college,[Undergraduate],[[Higher education]],*
2,11th,[High School],[[Secondary education]],*
3,HS-grad,[High School],[[Secondary education]],*
4,Prof-school,[Professional Education],[[Higher education]],*
5,Assoc-acdm,[Professional Education],[[Higher education]],*
6,Assoc-voc,[Professional Education],[[Higher education]],*
7,9th,[High School],[[Secondary education]],*
8,7th-8th,[High School],[[Secondary education]],*
9,12th,[High School],[[Secondary education]],*
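
One way to read the table above is as a lookup: pick a row by its original value, then pick a column by the desired generalization level. The sketch below does this with the standard library only; `load_hierarchy` and `generalize` are hypothetical helpers, and the embedded rows are copied from the output above.

```python
import csv
import io

# A few rows in the same layout as adult_hierarchy_education.csv
# (semicolon-separated, no header row)
HIERARCHY_CSV = """\
Bachelors;[Undergraduate];[[Higher education]];*
Some-college;[Undergraduate];[[Higher education]];*
11th;[High School];[[Secondary education]];*
HS-grad;[High School];[[Secondary education]];*
"""

def load_hierarchy(text):
    """Map each original value to its list of generalizations (index = level)."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    return {row[0]: row for row in reader if row}

def generalize(value, level, hierarchy):
    """Return `value` generalized to the given level (0 = unchanged)."""
    return hierarchy[value][level]

edu = load_hierarchy(HIERARCHY_CSV)
print(generalize("Bachelors", 0, edu))  # Bachelors
print(generalize("Bachelors", 1, edu))  # [Undergraduate]
print(generalize("11th", 2, edu))       # [[Secondary education]]
print(generalize("HS-grad", 3, edu))    # *
```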


# Adding Data Hierarchies

We add data hierarchies to some of the columns and display them. For quasi-identifiers without hierarchies, the default behavior of either leaving or removing the column is maintained.

In [10]:
for col in orig:
    if col in ('marital-status', 'education', 'native-country', 'workclass', 'occupation'):
        prot.hierarchies[col] = DataHierarchy(pd.read_csv(f'adult_hierarchy_{col}.csv', sep=';', header=None))
        print(f'{col}:')
        print(prot.hierarchies[col].df)

marital-status:
                       0                     1  2
0     Married-civ-spouse      [spouse present]  *
1               Divorced  [spouse not present]  *
2          Never-married  [spouse not present]  *
3              Separated  [spouse not present]  *
4                Widowed  [spouse not present]  *
5  Married-spouse-absent  [spouse not present]  *
6      Married-AF-spouse      [spouse present]  *
education:
               0                         1                        2  3
0      Bachelors           [Undergraduate]     [[Higher education]]  *
1   Some-college           [Undergraduate]     [[Higher education]]  *
2           11th             [High School]  [[Secondary education]]  *
3        HS-grad             [High School]  [[Secondary education]]  *
4    Prof-school  [Professional Education]     [[Higher education]]  *
5     Assoc-acdm  [Professional Education]     [[Higher education]]  *
6      Assoc-voc  [Professional Education]     [[Higher education]]  *
7    

In [11]:
prot.hierarchies

email                                                          None
sex                                                            None
age               <OrderHierarchy: {'hierarchyBuilderType': 'ORD...
race                                                           None
marital-status                <DataHierarchy: {'fromStream': True}>
education                     <DataHierarchy: {'fromStream': True}>
native-country                <DataHierarchy: {'fromStream': True}>
workclass                     <DataHierarchy: {'fromStream': True}>
occupation                    <DataHierarchy: {'fromStream': True}>
salary-class                                                   None
dtype: object

# Apply the privacy model

Now that all the parameters are set, we call protect() on the Protect object to apply the transformation. It returns an anonymized DataFrame meeting the privacy model criteria (as long as one exists). Note that the shape is the same as the original: suppressed cells are simply replaced with a *, and generalized cells are replaced in place by their more general value.

In [12]:
priv = prot.protect()
priv

Unnamed: 0,email,sex,age,race,marital-status,education,native-country,workclass,occupation,salary-class
0,*,Male,"[37, 56]",White,Never-married,[[Higher education]],[North America],State-gov,[Other],<=50K
1,*,Male,"[37, 56]",White,Married-civ-spouse,[[Higher education]],[North America],Self-emp-not-inc,[Nontechnical],<=50K
2,*,Male,"[37, 56]",White,Divorced,[[Secondary education]],[North America],Private,[Nontechnical],<=50K
3,*,Male,"[37, 56]",Black,Married-civ-spouse,[[Secondary education]],[North America],Private,[Nontechnical],<=50K
4,*,*,*,*,*,*,*,*,*,<=50K
5,*,Female,"[37, 56]",White,Married-civ-spouse,[[Higher education]],[North America],Private,[Nontechnical],<=50K
6,*,Female,"[37, 56]",Black,Married-spouse-absent,[[Secondary education]],[North America],Private,[Other],<=50K
7,*,Male,"[37, 56]",White,Married-civ-spouse,[[Secondary education]],[North America],Self-emp-not-inc,[Nontechnical],>50K
8,*,Female,"[17, 36]",White,Never-married,[[Higher education]],[North America],Private,[Technical],>50K
9,*,Male,"[37, 56]",White,Married-civ-spouse,[[Higher education]],[North America],Private,[Nontechnical],>50K


# Transformation Statistics

After transformation, we can check statistics about the transformation applied. An important statistic is information loss, the metric that we tried to optimize for. Min class size tells us the smallest group of indistinguishable people. It will always be at least k in order to be private. 

In [13]:
prot.stats

ambiguity                         0.902608
aecs                              0.998533
discernibility                    0.893780
granularity                       0.752259
attributeLevelSquaredError        0.609446
precision                         0.629958
recordLevelSquaredError           0.403072
informationLoss                   0.224757
averageClassSize                 45.908676
averageRisk                       0.024096
lowestRisk                        0.001018
highestRisk                       0.200000
sampleUniqueness                  0.000000
populationUniquenessZayatz        0.000000
numClasses                      657.000000
minClassSize                      5.000000
maxClassSize                   2937.000000
numOutliers                    2937.000000
numTuples                     27225.000000
dtype: float64
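
The class-size and risk figures are related. As a rough sketch, assume the re-identification risk of a record is taken as 1 divided by the size of its group of indistinguishable records (a common convention; whether CN-Protect computes its risk statistics exactly this way is an assumption). Under that reading, a minimum class size of 5 gives a highest risk of 1/5 = 0.2, which agrees with the highestRisk value above. The helper names and toy rows are illustrative.

```python
from collections import Counter

def class_sizes(rows, quasi_cols):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[c] for c in quasi_cols) for row in rows)

def risk_summary(sizes):
    """Per-class re-identification risk taken as 1 / class size (assumed convention)."""
    risks = [1 / n for n in sizes.values()]
    return {
        "minClassSize": min(sizes.values()),
        "maxClassSize": max(sizes.values()),
        "highestRisk": max(risks),
        "lowestRisk": min(risks),
    }

# Toy anonymized rows: one class of 5 records and one class of 10
rows = (
    [{"sex": "Male", "age": "[37, 56]"}] * 5
    + [{"sex": "Female", "age": "[37, 56]"}] * 10
)
summary = risk_summary(class_sizes(rows, ["sex", "age"]))
print(summary)
```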

# Iteration

If we would like to change how privacy was applied, we can modify the Protect object and reapply the protection. Here, we will switch age from an OrderHierarchy to an IntervalHierarchy and modify k.

In [14]:
# create intervals (0, 5), (5, 10),...
# (use `low` rather than `min` to avoid shadowing the built-in)
intervals = [(low, low + 5) for low in range(0, 100, 5)]
hierarchy = IntervalHierarchy(intervals)
# Add levels, first level is just (0, 5),...
hierarchy.add_fanout(*[1 for _ in range(20)])
# Second level is (0, 10), (10, 20),...
hierarchy.add_fanout(*[2 for _ in range(10)])
# Third level merges groups of four; the output below shows (0, 40), (40, 80),...
hierarchy.add_fanout(*[4 for _ in range(5)])
prot.hierarchies.age = hierarchy
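
How the fanout levels widen the intervals can be sketched without CN-Protect. The version below assumes that each level merges consecutive intervals of the previous level; that reading is an assumption about `add_fanout`, though it is consistent with the [0, 40] and [40, 80] intervals in the output further down. `merge_level` is a hypothetical helper.

```python
def merge_level(intervals, fanout):
    """Merge each run of `fanout` consecutive intervals into one wider interval."""
    merged = []
    for i in range(0, len(intervals), fanout):
        group = intervals[i:i + fanout]
        merged.append((group[0][0], group[-1][1]))
    return merged

base = [(low, low + 5) for low in range(0, 100, 5)]  # (0, 5), (5, 10), ...
level1 = merge_level(base, 1)      # unchanged: width 5
level2 = merge_level(level1, 2)    # width 10
level3 = merge_level(level2, 4)    # width 40 (the last group is narrower)

print(level2[:3])   # [(0, 10), (10, 20), (20, 30)]
print(level3)       # [(0, 40), (40, 80), (80, 100)]
```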

In [15]:
# Set k to 6
prot.privacy_model.k = 6

In [16]:
priv = prot.protect()
priv

Unnamed: 0,email,sex,age,race,marital-status,education,native-country,workclass,occupation,salary-class
0,*,Male,"[0, 40]",White,Never-married,[[Higher education]],[North America],State-gov,[Other],<=50K
1,*,Male,"[40, 80]",White,Married-civ-spouse,[[Higher education]],[North America],Self-emp-not-inc,[Nontechnical],<=50K
2,*,Male,"[0, 40]",White,Divorced,[[Secondary education]],[North America],Private,[Nontechnical],<=50K
3,*,Male,"[40, 80]",Black,Married-civ-spouse,[[Secondary education]],[North America],Private,[Nontechnical],<=50K
4,*,*,*,*,*,*,*,*,*,<=50K
5,*,Female,"[0, 40]",White,Married-civ-spouse,[[Higher education]],[North America],Private,[Nontechnical],<=50K
6,*,*,*,*,*,*,*,*,*,<=50K
7,*,Male,"[40, 80]",White,Married-civ-spouse,[[Secondary education]],[North America],Self-emp-not-inc,[Nontechnical],>50K
8,*,Female,"[0, 40]",White,Never-married,[[Higher education]],[North America],Private,[Technical],>50K
9,*,Male,"[40, 80]",White,Married-civ-spouse,[[Higher education]],[North America],Private,[Nontechnical],>50K


In [17]:
prot.info_loss

0.2353675169639633

# Remove redacted rows

In the anonymized data, some particularly identifying rows and columns may have been fully redacted (every cell replaced with *). We can drop these from the dataset entirely. Note that the number of rows and columns may change as a result.

In [18]:
# Ignore insensitive columns as they were not changed
insensitive_cols = [col for col in orig if prot.itypes[col] == 'INSENSITIVE']
is_not_redacted = priv.drop(insensitive_cols, axis=1).apply(lambda x: x != '*')
# Remove fully redacted rows, only keep those with at least one non-redacted element
priv = priv[is_not_redacted.any(axis=1)]
# Remove fully redacted columns, only keep those with at least one non-redacted element
for col, is_not_red in dict(is_not_redacted.any(axis=0)).items():
    if not is_not_red:
        priv = priv.drop(col, axis=1)
priv

Unnamed: 0,sex,age,race,marital-status,education,native-country,workclass,occupation,salary-class
0,Male,"[0, 40]",White,Never-married,[[Higher education]],[North America],State-gov,[Other],<=50K
1,Male,"[40, 80]",White,Married-civ-spouse,[[Higher education]],[North America],Self-emp-not-inc,[Nontechnical],<=50K
2,Male,"[0, 40]",White,Divorced,[[Secondary education]],[North America],Private,[Nontechnical],<=50K
3,Male,"[40, 80]",Black,Married-civ-spouse,[[Secondary education]],[North America],Private,[Nontechnical],<=50K
5,Female,"[0, 40]",White,Married-civ-spouse,[[Higher education]],[North America],Private,[Nontechnical],<=50K
7,Male,"[40, 80]",White,Married-civ-spouse,[[Secondary education]],[North America],Self-emp-not-inc,[Nontechnical],>50K
8,Female,"[0, 40]",White,Never-married,[[Higher education]],[North America],Private,[Technical],>50K
9,Male,"[40, 80]",White,Married-civ-spouse,[[Higher education]],[North America],Private,[Nontechnical],>50K
10,Male,"[0, 40]",Black,Married-civ-spouse,[[Higher education]],[North America],Private,[Nontechnical],>50K
11,Male,"[0, 40]",Asian-Pac-Islander,Married-civ-spouse,[[Higher education]],[Asia],State-gov,[Technical],>50K


# Use Logistic Regression to predict salary class

We will now run two logistic regression tests to see how well we can use this data to predict salary class. We will then rerun the same tests on the original data, and compare the results to see how much classification power we have lost by anonymizing the data.

In [19]:
mdl_data = pd.get_dummies(priv, columns=[col for col in priv if col != 'salary-class'])
mdl_data['salary-class'] = priv['salary-class'].astype('category').cat.codes
mdl_data

Unnamed: 0,salary-class,sex_Female,sex_Male,"age_[0, 40]","age_[40, 80]",race_Amer-Indian-Eskimo,race_Asian-Pac-Islander,race_Black,race_Other,race_White,...,native-country_[South America],workclass_Federal-gov,workclass_Local-gov,workclass_Private,workclass_Self-emp-inc,workclass_Self-emp-not-inc,workclass_State-gov,occupation_[Nontechnical],occupation_[Other],occupation_[Technical]
0,0,0,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,1,0,1,0
1,0,0,1,0,1,0,0,0,0,1,...,0,0,0,0,0,1,0,1,0,0
2,0,0,1,1,0,0,0,0,0,1,...,0,0,0,1,0,0,0,1,0,0
3,0,0,1,0,1,0,0,1,0,0,...,0,0,0,1,0,0,0,1,0,0
5,0,1,0,1,0,0,0,0,0,1,...,0,0,0,1,0,0,0,1,0,0
7,1,0,1,0,1,0,0,0,0,1,...,0,0,0,0,0,1,0,1,0,0
8,1,1,0,1,0,0,0,0,0,1,...,0,0,0,1,0,0,0,0,0,1
9,1,0,1,0,1,0,0,0,0,1,...,0,0,0,1,0,0,0,1,0,0
10,1,0,1,1,0,0,0,1,0,0,...,0,0,0,1,0,0,0,1,0,0
11,1,0,1,1,0,0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,1
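
For readers unfamiliar with one-hot encoding, the following stand-alone sketch mimics what the `pd.get_dummies` call does to the generalized columns: each distinct value, including an interval such as [0, 40], becomes its own 0/1 indicator column. The `one_hot` helper and the two toy rows are illustrative only.

```python
def one_hot(rows, columns):
    """Expand each categorical column into 0/1 indicator columns named '<col>_<value>'."""
    categories = {c: sorted({row[c] for row in rows}) for c in columns}
    encoded = []
    for row in rows:
        # Keep the columns that are not being encoded (here, the target)
        out = {k: v for k, v in row.items() if k not in columns}
        for c in columns:
            for value in categories[c]:
                out[f"{c}_{value}"] = int(row[c] == value)
        encoded.append(out)
    return encoded

rows = [
    {"sex": "Male", "age": "[0, 40]", "salary-class": "<=50K"},
    {"sex": "Female", "age": "[40, 80]", "salary-class": ">50K"},
]
encoded = one_hot(rows, ["sex", "age"])
for row in encoded:
    print(row)
```

Coarser generalization therefore means fewer indicator columns, which is one way the anonymized data carries less information into the model.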


In [20]:
X_train, X_test, y_train, y_test = train_test_split(mdl_data.drop('salary-class', axis=1), mdl_data['salary-class'])

print(f'{X_train.shape} {X_test.shape}')

(20517, 32) (6839, 32)


In [21]:
mdl = LogisticRegression(solver='lbfgs', max_iter=20000).fit(X_train, y_train)
# mdl = LogisticRegression(random_state=0, penalty='l1',solver='liblinear',
#                               max_iter=20000).fit(X_train, y_train)

accuracy_score(y_test, mdl.predict(X_test))

0.8046498026027197

In [22]:
# Note: scikit-learn 1.1+ renames loss='log' to loss='log_loss' (the old name was removed in 1.3)
mdl2 = SGDClassifier(loss='log', tol=1e-3, max_iter=1000).fit(X_train, y_train)

accuracy_score(y_test, mdl2.predict(X_test))

0.8072817663401082

# Repeat the analysis for the original data

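Notice the difference in accuracy scores between the anonymized and the original data: in the run shown here, roughly 0.80 to 0.81 on the anonymized data versus roughly 0.83 to 0.84 on the original. The train/test split is random, so exact values will vary between runs.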

In [24]:
# drop() returns a new DataFrame, so orig itself is left untouched
orig_copy = orig.drop('email', axis=1)
orig_copy['salary-class'] = orig_copy['salary-class'].astype('category').cat.codes
# One-hot encode every feature column of the original data (all but the target)
orig_mdl_data = pd.get_dummies(orig_copy,
                   columns=[col for col in orig_copy if col != 'salary-class'])

X_orig_train, X_orig_test, y_orig_train, y_orig_test = train_test_split(orig_mdl_data.drop('salary-class', axis=1), orig_mdl_data['salary-class'])

In [25]:
mdl_orig = LogisticRegression(random_state=0, penalty='l1',solver='liblinear',
                              max_iter=20000).fit(X_orig_train, y_orig_train)


accuracy_score(y_orig_test, mdl_orig.predict(X_orig_test))

0.840604694337621

In [26]:
# Note: scikit-learn 1.1+ renames loss='log' to loss='log_loss' (the old name was removed in 1.3)
mdl_orig2 = SGDClassifier(loss='log', tol=1e-3, max_iter=1000).fit(X_orig_train, y_orig_train)

accuracy_score(y_orig_test, mdl_orig2.predict(X_orig_test))

0.8342394907837157
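
To put a number on the comparison, the short sketch below computes the drop in accuracy from the scores printed in this notebook. The figures are copied from the cell outputs above; the `accuracy_drop` helper is illustrative. Because the splits are random and the two logistic regression fits use different solver settings, the result is a rough indication rather than a controlled measurement.

```python
# Accuracy scores copied from the cell outputs in this notebook
scores = {
    "LogisticRegression": {"anonymized": 0.8046498026027197, "original": 0.840604694337621},
    "SGDClassifier": {"anonymized": 0.8072817663401082, "original": 0.8342394907837157},
}

def accuracy_drop(original, anonymized):
    """Absolute and relative loss in accuracy after anonymization."""
    absolute = original - anonymized
    return absolute, absolute / original

for name, s in scores.items():
    absolute, relative = accuracy_drop(s["original"], s["anonymized"])
    print(f"{name}: -{absolute:.3f} absolute ({relative:.1%} relative)")
```

In this run, anonymization costs roughly three to four percentage points of accuracy.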

# Conclusion

Congratulations on finishing the CryptoNumerics Protect notebook. Again, for full documentation visit the [docs site](https://docs.cryptonumerics.com/cn-protect-ds/?page=docs.cryptonumerics.com/cn-protect-ds-html/protect.html), and please contact us at <support@cryptonumerics.com> with any questions.