# Naive Bayes with differential privacy

We start by importing the required libraries and modules and collecting the data that we need from the [Adult dataset](https://archive.ics.uci.edu/ml/datasets/adult).

In [38]:
import diffprivlib as dpl
import numpy as np
from sklearn.naive_bayes import GaussianNB

In [83]:
X_train = np.loadtxt("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                        usecols=(0, 4, 10, 11, 12), delimiter=", ")

y_train = np.loadtxt("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                        usecols=14, dtype=str, delimiter=", ")
y_train = (y_train == ">50K").astype(int)

Let's also collect the test data from Adult to test our models once they're trained.

In [84]:
X_test = np.loadtxt("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
                        usecols=(0, 4, 10, 11, 12), delimiter=", ", skiprows=1)

y_test = np.loadtxt("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
                        usecols=14, dtype=str, delimiter=", ", skiprows=1)
y_test = (y_test == ">50K").astype(int)

## Naive Bayes with no privacy

To begin, let's first train a regular (non-private) naive Bayes classifier, and test its accuracy.

In [85]:
nonprivate_clf = GaussianNB()
nonprivate_clf.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [88]:
print("Non-private test accuracy: %.2f%%" % 
     ((nonprivate_clf.predict(X_test) == y_test).sum() / y_test.shape[0] * 100))

Non-private test accuracy: 88.86%


## Differentially private naive Bayes classification

Using the `models.GaussianNB` module of the Differential Privacy Library, we can train a naive Bayes classifier while satisfying differential privacy.

If we don't specify any parameters, the model defaults to `epsilon = 1` and selects the model's feature bounds from the data. This throws a warning, as it leaks additional privacy. To ensure no additional privacy loss, we should specify the bounds as an argument, and choose the bounds indepedently of the data (i.e. using domain knowledge).

In [91]:
dp_clf = dpl.models.GaussianNB()

If you re-evaluate this cell, the test accuracy will change. This is due to the randomness introduced by differential privacy. Nevertheless, the accuracy should be in the range of 87–93%.

In [142]:
dp_clf.fit(X_train, y_train)

print("Differentially private test accuracy (epsilon=%.2f): %.2f%%" % 
      (dp_clf.dp_epsilon, (dp_clf.predict(X_test) == y_test).sum() / y_test.shape[0] * 100))

Differentially private test accuracy (epsilon=1.00): 90.96%


## Changing `epsilon`

On this occasion, we're going to specify the `bounds` parameter as a list of tuples, indicating the ranges in which we expect each feature to lie.

In [115]:
bounds = [(17, 100), (1, 16), (0, 100000), (0, 4500), (1, 100)]

We will also specify a value for `epsilon`. High `epsilon` (i.e. greater than 1) gives better and more consistent accuracy, but less privacy. Small `epsilon` (i.e. less than 1) gives better privacy but worse and less consistent accuracy.

Setting epsilon to `float("inf")` will give the same accuracy as the non-private model trained previously.

In [140]:
dp_clf2 = dpl.models.GaussianNB(epsilon=float("inf"), bounds=bounds)

dp_clf2.fit(X_train, y_train)

GaussianNB(bounds=[(17, 100), (1, 16), (0, 100000), (0, 4500), (1, 100)],
      epsilon=None, priors=None, var_smoothing=1e-09)

In [141]:
print("Differentially private test accuracy (epsilon=%.2f): %.2f%%" % 
     (dp_clf2.dp_epsilon, (dp_clf2.predict(X_test) == y_test).sum() / y_test.shape[0] * 100))

Differentially private test accuracy (epsilon=inf): 88.86%
