# Usage
## Pickle files
Prerequisites: [Python 3.8](https://www.python.org/downloads/release/python-380/), [pickle](https://docs.python.org/3/library/pickle.html) (part of the Python standard library), [numpy](https://pypi.org/project/numpy/).  
To load a `.p` file, use: `data = pickle.load(open(<DATA_PATH>, 'rb'))`.  
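The one-liner above leaves the file handle open until it is garbage-collected. A minimal sketch of an equivalent loader that closes the file explicitly follows; the helper name `load_pickle` is hypothetical and not part of the dataset's tooling.

```python
import os
import pickle
import tempfile


def load_pickle(path):
    """Load a pickled object and close the file handle afterwards."""
    # `load_pickle` is a hypothetical helper name used for illustration.
    with open(path, "rb") as f:
        return pickle.load(f)


# Round-trip a small object through a temporary .p file as a demo.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.p")
    with open(demo_path, "wb") as f:
        pickle.dump([[0.5, 1.0], [0.25, 0.0]], f)
    demo_data = load_pickle(demo_path)

print(demo_data)
```

Because unpickling can execute arbitrary code, load `.p` files only from sources you trust.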
Sample usage is shown below: 

Download the 1-year data:

In [1]:
!gdown --id 1oMaHqFJutp0RmAG8pHrmF1k_2H2vAegE
!gdown --id 1aDxPdt8iB4zUc7joWuieSQpRIhXcKPI2
!gdown --id 13UvEGw-I0Ylu5o-9uDLHCU-A3KAL7MiJ

Downloading...
From: https://drive.google.com/uc?id=1oMaHqFJutp0RmAG8pHrmF1k_2H2vAegE
To: /content/census_labels.p
100% 13.5M/13.5M [00:00<00:00, 91.7MB/s]
Downloading...
From: https://drive.google.com/uc?id=1aDxPdt8iB4zUc7joWuieSQpRIhXcKPI2
To: /content/census_feature_desc.p
100% 2.52k/2.52k [00:00<00:00, 2.74MB/s]
Downloading...
From: https://drive.google.com/uc?id=13UvEGw-I0Ylu5o-9uDLHCU-A3KAL7MiJ
To: /content/census_features.p
100% 175M/175M [00:01<00:00, 132MB/s]


Alternatively, download the 5-year data:

In [None]:
!gdown --id 1jGw9TnsdC8nXxiCCK46mqNbIRkWOZpHU
!gdown --id 1M7ms22gfdE1W1GecrWIaghWCgKOMY8lI
!gdown --id 1L-X-nplPyISi85W8YuI-m6eB2pdpMeoG

In [2]:
import pickle
import numpy as np

In [3]:
X = pickle.load(open('./census_features.p', 'rb'))
X = np.array(X, dtype=np.float32)
print(f"Dimension of X: {X.shape}")
print(X[:3])

Dimension of X: (1685316, 13)
[[0.20833333 0.         0.7826087  1.         0.         0.
  0.         0.         0.         0.         0.3030303  0.
  0.13725491]
 [0.3125     0.42857143 0.65217394 1.         0.         0.
  0.         0.         0.         0.         0.4040404  0.
  0.13725491]
 [0.45833334 0.42857143 0.65217394 0.5        0.16666667 0.
  1.         1.         0.         1.         0.42424244 0.
  0.13725491]]


Note that each row of the feature matrix has 13 values, which correspond, in order, to the columns listed under `Features`.

Next, read the labels data:

In [4]:
y = pickle.load(open('./census_labels.p', 'rb'))
y = np.array(y, dtype=np.int32)
print(f"Length of y: {len(y)}")
print(y[:20])

Length of y: 1685316
[0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0]


Read the other helpful information:

In [6]:
attribute_idx, attribute_dict, max_attr_vals = pickle.load(open('./census_feature_desc.p', 'rb'))

`attribute_idx` maps feature names to their column indices in the feature matrix.

In [7]:
print(attribute_idx)

{'DREM': 6, 'DPHY': 7, 'DEAR': 8, 'DEYE': 9, 'SEX': 5, 'COW': 1, 'MAR': 3, 'RAC1P': 4, 'WAOB': 11, 'SCHL': 2, 'ST': 12}
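
As a small illustration of how this mapping can be used, the sketch below selects columns by feature name rather than by hard-coded index. The toy matrix `X_demo` simply reuses the three rows printed earlier, and only a subset of `attribute_idx` is copied in; both are stand-ins for the real loaded objects.

```python
import numpy as np

# Subset of the `attribute_idx` mapping printed above.
attribute_idx = {"SEX": 5, "MAR": 3, "ST": 12}

# The first three rows of X from the earlier output (normalized values).
X_demo = np.array([
    [0.20833333, 0.0, 0.7826087, 1.0, 0.0, 0.0,
     0.0, 0.0, 0.0, 0.0, 0.3030303, 0.0, 0.13725491],
    [0.3125, 0.42857143, 0.65217394, 1.0, 0.0, 0.0,
     0.0, 0.0, 0.0, 0.0, 0.4040404, 0.0, 0.13725491],
    [0.45833334, 0.42857143, 0.65217394, 0.5, 0.16666667, 0.0,
     1.0, 1.0, 0.0, 1.0, 0.42424244, 0.0, 0.13725491],
], dtype=np.float32)

# Look up a single column by feature name.
mar_column = X_demo[:, attribute_idx["MAR"]]
print(mar_column)

# Select several named columns at once.
cols = [attribute_idx[name] for name in ("SEX", "MAR")]
subset = X_demo[:, cols]
print(subset.shape)
```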


`attribute_dict` contains the data dictionary for each categorical feature. 

In [8]:
print(attribute_dict)

{6: {0: 'No Cognitive Difficulty', 1: 'Cognitive Difficulty'}, 7: {0: 'No Ambulatory Difficulty', 1: 'Ambulatory Difficulty'}, 8: {0: 'No Hearing Difficulty', 1: 'Hearing Difficulty'}, 9: {0: 'No Vision Difficulty', 1: 'Vision Difficulty'}, 5: {0: 'Male', 1: 'Female'}, 1: {0: 'Private For-Profit', 1: 'Private Non-Profit', 2: 'Local Govt', 3: 'State Govt', 4: 'Federal Govt', 5: 'Self-Employed Other', 6: 'Self-Employed Own', 7: 'Unpaid Job', 8: 'Unemployed'}, 3: {0: 'Married', 1: 'Widowed', 2: 'Divorced', 3: 'Separated', 4: 'Never married'}, 4: {0: 'White', 1: 'Black', 2: 'Native American', 3: 'Asian', 4: 'Pacific Islander', 5: 'Some Other Race', 6: 'Two or More Races'}, 11: {0: 'US state', 1: 'PR and US Island Areas', 2: 'Latin America', 3: 'Asia', 4: 'Europe', 5: 'Africa', 6: 'Northern America', 7: 'Oceania and at Sea'}, 12: {0: 'Alabama/AL', 1: 'Alaska/AK', 2: 'Arizona/AZ', 3: 'Arkansas/AR', 4: 'California/CA', 5: 'Colorado/CO', 6: 'Connecticut/CT', 7: 'Delaware/DE', 8: 'District of C

`max_attr_vals` contains the maximum value for each column before normalization.

In [9]:
print(max_attr_vals)

[96  7 23  4  6  1  1  1  1  1 99  7 51]


You can reverse the normalization to recover values that correspond to the data dictionary:

In [10]:
reverse_normalization = np.around(np.array(X)*max_attr_vals)
print(reverse_normalization)

[[20.  0. 18. ... 30.  0.  7.]
 [30.  3. 15. ... 40.  0.  7.]
 [44.  3. 15. ... 42.  0.  7.]
 ...
 [46.  0. 20. ... 40.  0.  6.]
 [63.  1. 21. ... 45.  0.  6.]
 [61.  0. 17. ... 45.  0.  6.]]
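
Combining `attribute_idx`, `attribute_dict`, and `max_attr_vals`, a normalized row can be turned back into human-readable labels. The sketch below is one possible way to do this; `decode_row` is a hypothetical helper, and the dictionaries are abbreviated copies of the outputs above rather than the full loaded objects.

```python
import numpy as np

# Values copied from the outputs above; abbreviated to three features.
attribute_idx = {"SEX": 5, "MAR": 3, "COW": 1}
attribute_dict = {
    5: {0: "Male", 1: "Female"},
    3: {0: "Married", 1: "Widowed", 2: "Divorced", 3: "Separated",
        4: "Never married"},
    1: {0: "Private For-Profit", 1: "Private Non-Profit", 2: "Local Govt",
        3: "State Govt", 4: "Federal Govt", 5: "Self-Employed Other",
        6: "Self-Employed Own", 7: "Unpaid Job", 8: "Unemployed"},
}
max_attr_vals = np.array([96, 7, 23, 4, 6, 1, 1, 1, 1, 1, 99, 7, 51])

# Third row of X from the earlier output (normalized values).
row = np.array([0.45833334, 0.42857143, 0.65217394, 0.5, 0.16666667, 0.0,
                1.0, 1.0, 0.0, 1.0, 0.42424244, 0.0, 0.13725491],
               dtype=np.float32)


def decode_row(row, attribute_idx, attribute_dict, max_attr_vals):
    """Map the listed categorical entries of one normalized row to labels."""
    # Hypothetical helper, not part of the dataset's tooling.
    raw = np.around(row * max_attr_vals).astype(int)
    return {name: attribute_dict[col][int(raw[col])]
            for name, col in attribute_idx.items()}


decoded = decode_row(row, attribute_idx, attribute_dict, max_attr_vals)
print(decoded)
# {'SEX': 'Male', 'MAR': 'Divorced', 'COW': 'State Govt'}
```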


## CSV files
Prerequisites: [Python 3.8](https://www.python.org/downloads/release/python-380/), [numpy](https://pypi.org/project/numpy/).  
To load a `.csv` file, use: `data = numpy.genfromtxt(<DATA_PATH>,delimiter=',')`.  
Sample usage is shown below: 

Download the 1-year data:

In [11]:
!gdown --id 1FS25Lwn-0qgV2sPzvkmL_HdUYZ_HJRgy
!gdown --id 1d2dYbwK9CjRgh89ISCdcYLdUWG0lfDtc

Downloading...
From: https://drive.google.com/uc?id=1FS25Lwn-0qgV2sPzvkmL_HdUYZ_HJRgy
To: /content/census_labels.csv
100% 42.1M/42.1M [00:00<00:00, 152MB/s] 
Downloading...
From: https://drive.google.com/uc?id=1d2dYbwK9CjRgh89ISCdcYLdUWG0lfDtc
To: /content/census_features.csv
100% 548M/548M [00:04<00:00, 130MB/s]


Alternatively, download the 5-year data:

In [None]:
!gdown --id 1n7O0x2uRdWhWJY4GPhWBxS1osIk5npPQ
!gdown --id 1s45dppmjCv56hM6aFX4CTDPRz7I3tIED

In [12]:
import numpy as np

Read the features data:

In [13]:
X = np.genfromtxt('./census_features.csv',delimiter=',')
print(f"Dimension of X: {X.shape}")
print(X[:3])

Dimension of X: (1685316, 13)
[[0.20833333 0.         0.7826087  1.         0.         0.
  0.         0.         0.         0.         0.3030303  0.
  0.1372549 ]
 [0.3125     0.42857143 0.65217391 1.         0.         0.
  0.         0.         0.         0.         0.4040404  0.
  0.1372549 ]
 [0.45833333 0.42857143 0.65217391 0.5        0.16666667 0.
  1.         1.         0.         1.         0.42424242 0.
  0.1372549 ]]


Note that each row of the feature matrix has 13 values, which correspond, in order, to the columns listed under `Features`.

Read the labels data:

In [14]:
y = np.genfromtxt('./census_labels.csv',delimiter=',')
print(f"Length of y: {len(y)}")
print(y[:20])

Length of y: 1685316
[0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
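
Unlike the pickle example, `np.genfromtxt` returns floating-point values by default, which is why the labels print as `0.` and `1.`. A minimal sketch of casting them to integers follows; the in-memory CSV is a stand-in for `census_labels.csv` and assumes one label per line.

```python
import io

import numpy as np

# In-memory stand-in for census_labels.csv (assumed: one label per line).
csv_text = "0\n1\n0\n0\n1\n"

y_float = np.genfromtxt(io.StringIO(csv_text), delimiter=",")
y_int = y_float.astype(np.int32)  # same dtype as in the pickle example

print(y_float.dtype, y_int.dtype)
print(y_int)
```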


# Benchmarks
We benchmarked the 1-year census data by training various classifiers and measuring their train and test accuracy, as demonstrated below. The data was split randomly into a training set (2/3) and a test set (1/3).
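
To make the random 2/3 to 1/3 split concrete, here is a NumPy-only sketch on toy data. It is an illustration, not the benchmark code (which uses scikit-learn's `train_test_split`); the helper name `random_split` and the fixed seed are assumptions added for reproducibility.

```python
import numpy as np


def random_split(x, y, test_fraction=1 / 3, seed=0):
    """Shuffle row indices and split them into train and test parts."""
    # Hypothetical helper; the benchmark itself uses train_test_split.
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    n_test = int(round(len(x) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]


# Toy data: 9 rows with 2 features each; row i has label i.
x_demo = np.arange(18, dtype=np.float32).reshape(9, 2)
y_demo = np.arange(9, dtype=np.int32)

X_tr, X_te, y_tr, y_te = random_split(x_demo, y_demo)
print(X_tr.shape, X_te.shape)
```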

## Data Overview  
Percentage of records with > \$50,000 total person's income: 58.1%   
Percentage of records with <= \$50,000 total person's income: 41.9%   

## Membership Inference Benchmarks  
We tested four membership inference attacks (Yeom, Shokri, Merlin, and Morgan) against NN models trained on the 1-year census data (with a test-to-train ratio of 0.5) under various differential privacy settings. Each setting was run 5 times, and the benchmark results below are averages over those runs.
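
The attack tables below report PPV and advantage without defining them. The sketch below assumes the commonly used definitions, PPV = TP / (TP + FP) and advantage = TPR - FPR (the membership advantage of Yeom et al.); the exact formulas behind the reported numbers are not stated here, so treat this as an illustration only.

```python
import numpy as np


def ppv_and_advantage(is_member, predicted_member):
    """Return (PPV, advantage) for binary membership predictions."""
    is_member = np.asarray(is_member, dtype=bool)
    predicted_member = np.asarray(predicted_member, dtype=bool)

    tp = np.sum(predicted_member & is_member)    # members flagged as members
    fp = np.sum(predicted_member & ~is_member)   # non-members flagged as members
    fn = np.sum(~predicted_member & is_member)   # members that were missed
    tn = np.sum(~predicted_member & ~is_member)  # non-members correctly rejected

    ppv = tp / (tp + fp)  # precision of the attack
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return float(ppv), float(tpr - fpr)


# Toy example: 4 training members followed by 4 non-members.
truth = [1, 1, 1, 1, 0, 0, 0, 0]
guess = [1, 1, 1, 0, 1, 0, 0, 0]
result = ppv_and_advantage(truth, guess)
print(result)  # (0.75, 0.5)
```

If the attacks are evaluated on a pool with the same 2:1 mix of members to non-members (an assumption; the setup is not spelled out here), an attack that flags every record as a member would reach a PPV of about 0.667 with zero advantage, which is a useful baseline when reading the tables.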

### Training and Test Accuracy overview 
For NN models without differential privacy. 

|           | No privacy |
|-----------|------------|
| Train acc | 0.83162    |
| Test acc  | 0.74172    |

For NN models with Gaussian differential privacy (GDP) ('eps' stands for epsilon, the privacy budget).  

|           | eps=0.1 | eps=1.0 | eps=10.0 | eps=100.0 |
|-----------|---------|---------|----------|-----------|
| Train acc | 0.68624 | 0.74272 | 0.76174  | 0.77200   |
| Test acc  | 0.68576 | 0.73864 | 0.75652  | 0.75576   |

For NN models with Rényi differential privacy (RDP) ('eps' stands for epsilon, the privacy budget).  

|           | eps=0.1 | eps=1.0 | eps=10.0 | eps=100.0 |
|-----------|---------|---------|----------|-----------|
| Train acc | 0.66876 | 0.74158 | 0.75350  | 0.77472   |
| Test acc  | 0.66596 | 0.73672 | 0.74584  | 0.75744   |

### Attack results for NN with no privacy  

|           | Yeom           | Shokri         | Merlin         | Morgan         |
|-----------|----------------|----------------|----------------|----------------|
| PPV       | 0.6952+-0.0039 | 0.4084+-0.3335 | 0.6895+-0.0036 | 0.7284+-0.0106 |
| Advantage | 0.0894+-0.0060 | 0.0152+-0.0159 | 0.0488+-0.0063 | 0.0113+-0.0014 |

### Attack results for NN with GDP    
For epsilon=0.1:  

|           | Yeom            | Shokri         | Merlin         | Morgan         |
|-----------|-----------------|----------------|----------------|----------------|
| PPV       | 0.6625+-0.0045  | 0.5353+-0.2677 | 0.6725+-0.0109 | 0.6960+-0.0275 |
| Advantage | -0.0029+-0.0039 | 0.0022+-0.0050 | 0.0003+-0.0076 | 0.0010+-0.0006 |

For epsilon=1.0:  

|           | Yeom            | Shokri         | Merlin          | Morgan          |
|-----------|-----------------|----------------|-----------------|-----------------|
| PPV       | 0.6683+-0.0011  | 0.5374+-0.2687 | 0.6689+-0.0331  | 0.6216+-0.0685  |
| Advantage | 0.0057+-0.0040  | 0.0065+-0.0035 | -0.0043+-0.0056 | -0.0003+-0.0020 |

For epsilon=10.0:  

|           | Yeom            | Shokri         | Merlin          | Morgan          |
|-----------|-----------------|----------------|-----------------|-----------------|
| PPV       | 0.6686+-0.0010  | 0.4037+-0.3296 | 0.6595+-0.0106  | 0.6581+-0.0159  |
| Advantage | 0.0054+-0.0023  | 0.0061+-0.0052 | -0.0024+-0.0028 | -0.0009+-0.0024 |

For epsilon=100.0:  

|           | Yeom            | Shokri         | Merlin         | Morgan         |
|-----------|-----------------|----------------|----------------|----------------|
| PPV       | 0.6708+-0.0013  | 0.6717+-0.0025 | 0.6646+-0.0086 | 0.6660+-0.0281 |
| Advantage | 0.0139+-0.0013  | 0.0090+-0.0026 | 0.0029+-0.0069 | 0.0027+-0.0032 |


### Attack results for NN with RDP  
For epsilon=0.1:  

|           | Yeom             | Shokri         | Merlin          | Morgan         |
|-----------|------------------|----------------|-----------------|----------------|
| PPV       | 0.6672+-0.0031   | 0.6722+-0.0033 | 0.6832+-0.0368  | 0.7322+-0.1341 |
| Advantage | -0.0007+-0.0035  | 0.0083+-0.0021 | -0.0015+-0.0066 | 0.0006+-0.0009 |

For epsilon=1.0:  

|           | Yeom            | Shokri         | Merlin         | Morgan         |
|-----------|-----------------|----------------|----------------|----------------|
| PPV       | 0.6687+-0.0021  | 0.5368+-0.2684 | 0.6620+-0.0158 | 0.6434+-0.0605 |
| Advantage | 0.0015+-0.0017  | 0.0060+-0.0030 | 0.0014+-0.0041 | 0.0005+-0.0023 |

For epsilon=10.0:  

|           | Yeom            | Shokri         | Merlin          | Morgan         |
|-----------|-----------------|----------------|-----------------|----------------|
| PPV       | 0.6706+-0.0019  | 0.4027+-0.3288 | 0.6658+-0.0015  | 0.6836+-0.0242 |
| Advantage | 0.0103+-0.0028  | 0.0046+-0.0056 | -0.0029+-0.0045 | 0.0018+-0.0021 |

For epsilon=100.0:  

|           | Yeom            | Shokri         | Merlin          | Morgan         |
|-----------|-----------------|----------------|-----------------|----------------|
| PPV       | 0.6723+-0.0004  | 0.6705+-0.0027 | 0.6661+-0.0025  | 0.6799+-0.0126 |
| Advantage | 0.0177+-0.0010  | 0.0095+-0.0031 | -0.0013+-0.0055 | 0.0029+-0.0029 |







## Classification Benchmarks

First, download the 1-year data:

In [15]:
!gdown --id 1oMaHqFJutp0RmAG8pHrmF1k_2H2vAegE
!gdown --id 1aDxPdt8iB4zUc7joWuieSQpRIhXcKPI2
!gdown --id 13UvEGw-I0Ylu5o-9uDLHCU-A3KAL7MiJ

Downloading...
From: https://drive.google.com/uc?id=1oMaHqFJutp0RmAG8pHrmF1k_2H2vAegE
To: /content/census_labels.p
100% 13.5M/13.5M [00:00<00:00, 80.8MB/s]
Downloading...
From: https://drive.google.com/uc?id=1aDxPdt8iB4zUc7joWuieSQpRIhXcKPI2
To: /content/census_feature_desc.p
100% 2.52k/2.52k [00:00<00:00, 3.70MB/s]
Downloading...
From: https://drive.google.com/uc?id=13UvEGw-I0Ylu5o-9uDLHCU-A3KAL7MiJ
To: /content/census_features.p
100% 175M/175M [00:01<00:00, 117MB/s]


Alternatively, download the 5-year data:

In [None]:
!gdown --id 1jGw9TnsdC8nXxiCCK46mqNbIRkWOZpHU
!gdown --id 1M7ms22gfdE1W1GecrWIaghWCgKOMY8lI
!gdown --id 1L-X-nplPyISi85W8YuI-m6eB2pdpMeoG

Next, split the data into test and train data.

In [16]:
import pickle
import numpy as np
from sklearn.model_selection import train_test_split
x = pickle.load(open('./census_features.p', 'rb'))
y = pickle.load(open('./census_labels.p', 'rb'))
x = np.array(x, dtype=np.float32)
y = np.array(y, dtype=np.int32)

print(x.shape, len(y))

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33)

(1685316, 13) 1685316


Next, explore various classifier benchmarks:

Multinomial Naive Bayes

In [None]:
# https://scikit-learn.org/stable/modules/naive_bayes.html
from sklearn.naive_bayes import MultinomialNB
nbClassifier = MultinomialNB()
nbClassifier.fit(X_train,y_train)

y_pred = nbClassifier.predict(X_train)
train_accuracy = (y_train == y_pred).sum() / X_train.shape[0]
print(f"Train accuracy: {train_accuracy}")

y_pred = nbClassifier.predict(X_test)
test_accuracy = (y_test == y_pred).sum() / X_test.shape[0]
print(f"Test accuracy: {test_accuracy}")


Train accuracy: 0.672298281644513
Test accuracy: 0.6727351188068075
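
Every classifier cell in this section repeats the same accuracy computation, `(y == y_pred).sum() / n`, which is equivalent to the mean of the element-wise comparison. A minimal sketch of factoring it into a helper follows; `accuracy`, `report`, and the stand-in `ThresholdModel` are hypothetical names used only for this illustration.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


def report(model, X_train, y_train, X_test, y_test):
    """Print and return train/test accuracy for a fitted model with .predict."""
    train_acc = accuracy(y_train, model.predict(X_train))
    test_acc = accuracy(y_test, model.predict(X_test))
    print(f"Train accuracy: {train_acc}")
    print(f"Test accuracy: {test_acc}")
    return train_acc, test_acc


class ThresholdModel:
    """Stand-in classifier: predicts 1 when the first feature exceeds 0.5."""

    def predict(self, X):
        return (np.asarray(X)[:, 0] > 0.5).astype(np.int32)


# Toy data in place of the census arrays.
X_tr_demo = np.array([[0.9], [0.1], [0.7], [0.2]])
y_tr_demo = np.array([1, 0, 1, 1])
X_te_demo = np.array([[0.8], [0.3]])
y_te_demo = np.array([1, 0])

train_acc, test_acc = report(ThresholdModel(), X_tr_demo, y_tr_demo,
                             X_te_demo, y_te_demo)
```

Any fitted scikit-learn classifier from the cells in this section could be passed in place of the stand-in model, since `report` only relies on `.predict`.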


Gaussian Naive Bayes

In [None]:
# https://scikit-learn.org/stable/modules/naive_bayes.html
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB().fit(X_train, y_train)

y_pred = gnb.predict(X_train)
train_accuracy = (y_train == y_pred).sum() / X_train.shape[0]
print(f"Train accuracy: {train_accuracy}")

y_pred = gnb.predict(X_test)
test_accuracy = (y_test == y_pred).sum() / X_test.shape[0]
print(f"Test accuracy: {test_accuracy}")

Train accuracy: 0.6788969863464998
Test accuracy: 0.6791362120272226


Logistic Regression

In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
from sklearn.linear_model import LogisticRegression

logistic = LogisticRegression().fit(X_train, y_train)

y_pred = logistic.predict(X_train)
train_accuracy = (y_train == y_pred).sum() / X_train.shape[0]
print(f"Train accuracy: {train_accuracy}")

y_pred = logistic.predict(X_test)
test_accuracy = (y_test == y_pred).sum() / X_test.shape[0]
print(f"Test accuracy: {test_accuracy}")

Train accuracy: 0.7539899093220541
Test accuracy: 0.7535399304150822


K-Nearest Neighbors (k=1) (did not finish within 30 minutes)

In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=1)
neigh.fit(X_train,y_train)

y_pred = neigh.predict(X_train)
train_accuracy = (y_train == y_pred).sum() / X_train.shape[0]
print(f"Train accuracy: {train_accuracy}")

y_pred = neigh.predict(X_test)
test_accuracy = (y_test == y_pred).sum() / X_test.shape[0]
print(f"Test accuracy: {test_accuracy}")

K-Nearest Neighbors (k=3) (did not finish within 30 minutes)

In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train,y_train)

y_pred = neigh.predict(X_train)
train_accuracy = (y_train == y_pred).sum() / X_train.shape[0]
print(f"Train accuracy: {train_accuracy}")

y_pred = neigh.predict(X_test)
test_accuracy = (y_test == y_pred).sum() / X_test.shape[0]
print(f"Test accuracy: {test_accuracy}")

Support Vector Machine (did not finish within 30 minutes)

In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
svc = make_pipeline(StandardScaler(), SVC(gamma='auto'))
svc.fit(X_train, y_train)

y_pred = svc.predict(X_train)
train_accuracy = (y_train == y_pred).sum() / X_train.shape[0]
print(f"Train accuracy: {train_accuracy}")

y_pred = svc.predict(X_test)
test_accuracy = (y_test == y_pred).sum() / X_test.shape[0]
print(f"Test accuracy: {test_accuracy}")

Decision Tree

In [None]:
# https://scikit-learn.org/stable/modules/tree.html
from sklearn import tree
decisionTree = tree.DecisionTreeClassifier()
decisionTree = decisionTree.fit(X_train, y_train)

y_pred = decisionTree.predict(X_train)
train_accuracy = (y_train == y_pred).sum() / X_train.shape[0]
print(f"Train accuracy: {train_accuracy}")

y_pred = decisionTree.predict(X_test)
test_accuracy = (y_test == y_pred).sum() / X_test.shape[0]
print(f"Test accuracy: {test_accuracy}")

Train accuracy: 0.9406178569752232
Test accuracy: 0.7165376558693171


Random Forest

In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(max_depth=2, random_state=0)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_train)
train_accuracy = (y_train == y_pred).sum() / X_train.shape[0]
print(f"Train accuracy: {train_accuracy}")

y_pred = rf.predict(X_test)
test_accuracy = (y_test == y_pred).sum() / X_test.shape[0]
print(f"Test accuracy: {test_accuracy}")

Train accuracy: 0.7156649937431421
Test accuracy: 0.717889796909135


Multi-layer Perceptron

In [None]:
# https://scikit-learn.org/stable/modules/neural_networks_supervised.html
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
mlp.fit(X_train, y_train)

y_pred = mlp.predict(X_train)
train_accuracy = (y_train == y_pred).sum() / X_train.shape[0]
print(f"Train accuracy: {train_accuracy}")

y_pred = mlp.predict(X_test)
test_accuracy = (y_test == y_pred).sum() / X_test.shape[0]
print(f"Test accuracy: {test_accuracy}")


Train accuracy: 0.7533983196373236
Test accuracy: 0.7540811464429881


Stochastic Gradient Descent Classifier

In [None]:
# https://scikit-learn.org/stable/modules/sgd.html
from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier(loss="hinge", penalty="l2", max_iter=100)
sgd.fit(X_train, y_train)

y_pred = sgd.predict(X_train)
train_accuracy = (y_train == y_pred).sum() / X_train.shape[0]
print(f"Train accuracy: {train_accuracy}")

y_pred = sgd.predict(X_test)
test_accuracy = (y_test == y_pred).sum() / X_test.shape[0]
print(f"Test accuracy: {test_accuracy}")


Train accuracy: 0.7535462170585062
Test accuracy: 0.7541782416772302


Ridge Classifier

In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifier.html#sklearn.linear_model.RidgeClassifier

from sklearn.linear_model import RidgeClassifier
rc = RidgeClassifier().fit(X_train, y_train)

y_pred = rc.predict(X_train)
train_accuracy = (y_train == y_pred).sum() / X_train.shape[0]
print(f"Train accuracy: {train_accuracy}")

y_pred = rc.predict(X_test)
test_accuracy = (y_test == y_pred).sum() / X_test.shape[0]
print(f"Test accuracy: {test_accuracy}")

Train accuracy: 0.7508938052235243
Test accuracy: 0.751477555717381


Gaussian Process Classifier (crashed due to limited RAM)

In [None]:

from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

kernel = 1.0 * RBF(1.0)
gpc = GaussianProcessClassifier(kernel=kernel,random_state=0).fit(X_train, y_train)

y_pred = gpc.predict(X_train)
train_accuracy = (y_train == y_pred).sum() / X_train.shape[0]
print(f"Train accuracy: {train_accuracy}")

y_pred = gpc.predict(X_test)
test_accuracy = (y_test == y_pred).sum() / X_test.shape[0]
print(f"Test accuracy: {test_accuracy}")