# Implementation of Naive Bayes in PyDP

*Source*: [Differentially Private Naïve Bayes Classification](https://www.researchgate.net/profile/Anirban-Basu/publication/262254729_Differentially_Private_Naive_Bayes_Classification/links/55dfa68208ae2fac4718fdfd/Differentially-Private-Naive-Bayes-Classification.pdf)

In [1]:
# Cite the paper here
# Mention any part of the paper related to the implementation
# Some analysis of the data that is being used
from sklearn import datasets
from sklearn.model_selection import train_test_split

dataset = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.2)

To use the Naive Bayes algorithm with PyDP, import the `GaussianNB` class from the `pydp.ml.naive_bayes` module:

`from pydp.ml.naive_bayes import GaussianNB`

The implementation inherits from scikit-learn's Naive Bayes class. Some attributes and methods have been modified to support privacy guarantees.

The following parameters can be adjusted to suit the use case:

- `epsilon`: Privacy parameter for the model. (float, default: 1.0)
  
- `bounds`: Bounds of the data, provided as a tuple of the form (min, max). `min` and `max` can either be scalars, covering the min/max of the entire data, or vectors with one entry per feature. If not provided, the bounds are computed on the data when `.fit()` is first called, resulting in a `PrivacyLeakWarning`. (tuple, optional)

- `priors`: Prior probabilities of the classes. If specified, the priors are not adjusted according to the data. (array-like, shape (n_classes,))

- `var_smoothing`: Portion of the largest variance of all features that is added to the variances for calculation stability. (float, default: 1e-9)

- `probability`: Probability for a geometric distribution from which a sample will be drawn as noise for categorical features. (float, default: 1e-2)
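
To illustrate how the `bounds` argument can be read, here is a minimal sketch in plain Python that clips each feature to a `(min, max)` pair, where each bound may be a scalar or a per-feature sequence. The helper names `expand_bound` and `clip_rows` are hypothetical and are not part of PyDP; the library's own handling of bounds may differ.

```python
from numbers import Real


def expand_bound(bound, n_features):
    """Return one bound per feature from a scalar or a per-feature sequence."""
    if isinstance(bound, Real):
        return [float(bound)] * n_features
    values = [float(b) for b in bound]
    if len(values) != n_features:
        raise ValueError("expected one bound per feature")
    return values


def clip_rows(rows, bounds):
    """Clip every value of every row into its feature's [lower, upper] range."""
    n_features = len(rows[0])
    lower = expand_bound(bounds[0], n_features)
    upper = expand_bound(bounds[1], n_features)
    return [
        [min(max(x, lo), hi) for x, lo, hi in zip(row, lower, upper)]
        for row in rows
    ]


# Two iris-like samples: sepal length, sepal width, petal length, petal width.
samples = [[7.9, 3.8, 6.4, 2.0], [4.3, 3.0, 1.1, 0.1]]

# Per-feature bounds (hypothetical values chosen for this sketch).
per_feature = clip_rows(samples, ([4.3, 2.0, 1.0, 0.1], [7.5, 4.0, 6.0, 2.0]))

# Scalar bounds covering the whole data set.
scalar = clip_rows(samples, (0.0, 8.0))
```

With the per-feature bounds, values that fall outside a feature's range (such as a sepal length of 7.9 against an upper bound of 7.5) are pulled back to the nearest bound, so bounds that are too tight distort the data before any noise is added.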

Source code:
- [PyDP's Naive Bayes](https://github.com/OpenMined/PyDP/blob/feature/machine-learning/src/pydp/ml/naive_bayes.py)
- [Geometric Mechanism in PyDP Naive Bayes' implementation](https://github.com/OpenMined/PyDP/blob/feature/machine-learning/src/pydp/ml/mechanisms/geometric.py)
- [Laplace Mechanism in PyDP Naive Bayes' implementation](https://github.com/OpenMined/PyDP/blob/feature/machine-learning/src/pydp/ml/mechanisms/laplace.py)
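
As a rough illustration of the idea behind the Laplace mechanism linked above, the sketch below adds Laplace noise to the mean of a bounded feature. It assumes that one record is replaced between neighbouring data sets, so the sensitivity of the mean is `(upper - lower) / n`, and it uses only the standard library. The function names are hypothetical and this is not PyDP's implementation.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng):
    """Return a noisy mean of `values` after clipping them to [lower, upper]."""
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    # Assumed sensitivity: replacing one record moves the mean by at most this.
    sensitivity = (upper - lower) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
sepal_lengths = [5.1, 4.9, 6.3, 5.8, 7.1, 6.5]
noisy = private_mean(sepal_lengths, lower=4.3, upper=7.5, epsilon=1.0, rng=rng)
```

A smaller `epsilon` gives a larger noise scale, and wider bounds increase the sensitivity, which is one reason tight but valid bounds matter for accuracy. Python's `random` module is used here for readability only; it is not a cryptographically secure source of randomness.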

In [2]:
import numpy as np

epsilon = 1.0 # Privacy budget

lower = np.array([4.3, 2. , 1. , 0.1]) # lower bound of each feature's values
upper = np.array([7.5, 4. , 6. , 2.]) # upper bound of each feature's values

priors = np.array([1/3, 1/3, 1/3]) # prior probability of each class (must sum to 1); not passed to the classifier below

probability = 0.002 # probability for geometric distribution

var_smoothing = 1e-4 # variance smoothing

from pydp.ml.naive_bayes import GaussianNB

clf = GaussianNB(epsilon=epsilon, bounds=(lower, upper), probability=probability, var_smoothing=var_smoothing)
clf.fit(X_train, y_train)

GaussianNB(accountant=BudgetAccountant(spent_budget=[(1.0, 0)]),
           bounds=(array([4.3, 2. , 1. , 0.1]), array([7.5, 4. , 6. , 2. ])),
           probability=0.002, var_smoothing=0.0001)

In [3]:
y_pred = clf.predict(X_test)
y_pred

array([2, 0, 0, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 0, 0, 2, 2, 1, 0,
       2, 2, 0, 2, 0, 2, 2, 2])

In [4]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score 

cm = confusion_matrix(y_test, y_pred)
print("Accuracy : ", accuracy_score(y_test, y_pred))
cm

Accuracy :  0.8


array([[ 9,  0,  0],
       [ 0,  1,  6],
       [ 0,  0, 14]])
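
The accuracy reported above can be recovered from the confusion matrix itself. The sketch below is plain Python with hypothetical helper names; it treats rows as true labels and columns as predicted labels, which is the convention used by scikit-learn's `confusion_matrix`.

```python
def accuracy_from_confusion(cm):
    """Accuracy is the diagonal (correct predictions) over all predictions."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def recall_per_class(cm):
    """Recall for class i is cm[i][i] over the number of true samples of class i."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]


cm_private = [[9, 0, 0], [0, 1, 6], [0, 0, 14]]
accuracy = accuracy_from_confusion(cm_private)  # 24 correct out of 30
recalls = recall_per_class(cm_private)          # 9/9, 1/7, 14/14
```

In this run, all of the errors are class 1 samples predicted as class 2, so the per-class recall for class 1 is much lower than the overall accuracy suggests.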

In [5]:
# Result from sklearn's version of GaussianNB

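# Import under an alias so PyDP's GaussianNB imported earlier is not shadowed
from sklearn.naive_bayes import GaussianNB as SklearnGaussianNB

classifier = SklearnGaussianNB()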
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print("Accuracy : ", accuracy_score(y_test, y_pred))
cm

Accuracy :  0.9666666666666667


array([[ 9,  0,  0],
       [ 0,  7,  0],
       [ 0,  1, 13]])