# 18. Naive Bayes

Naive Bayes classifiers are popular for four reasons:

1. An intuitive approach

2. The ability to work with small data

3. Low computation costs for training and prediction

4. Often solid results in a variety of settings

Naive Bayes needs an assumed distribution for the likelihood of each feature given the class. The most common choices are the normal (Gaussian), multinomial, and Bernoulli distributions.

- `GaussianNB`: continuous features (the target can still be discrete labels such as 0, 1, 2)
- `MultinomialNB`: discrete or count data (e.g., movie ratings ranging from 1 to 5)
- `BernoulliNB`: binary features
- `CalibratedClassifierCV`: creates well-calibrated predicted probabilities using k-fold cross-validation

## Training a Classifier for Continuous Features

Problem : You have only continuous features and you want to train a naive Bayes classifier.

In [1]:
# Load libraries
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

In [23]:
# NumPy is needed to inspect the unique class labels
import numpy as np

np.unique(target)

array([0, 1, 2])

In [2]:
# Create Gaussian Naive Bayes object
classifer = GaussianNB()

# Train model
model = classifer.fit(features, target)

Gaussian naive Bayes is best used when all our features are continuous, because it assumes that each feature is normally distributed within each class.
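As a rough illustration of what the fitted model stores, the sketch below refits `GaussianNB` on Iris and checks that its `theta_` attribute matches the per-class sample means of each feature. The variable name `class_means` is introduced here only for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

# Load data and fit the model as in the solution
iris = datasets.load_iris()
features, target = iris.data, iris.target
model = GaussianNB().fit(features, target)

# theta_ holds one mean per class and feature: shape (n_classes, n_features)
print(model.theta_.shape)

# Compute the per-class feature means by hand (illustrative helper array)
class_means = np.array([features[target == k].mean(axis=0)
                        for k in model.classes_])

# The learned means should agree with the sample means
print(np.allclose(model.theta_, class_means))
```

Together with a per-class variance for each feature, these means define the Gaussian likelihoods the classifier multiplies with the class priors.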

In [3]:
# Create new observation
new_observation = [[ 4,  4,  4,  0.4]]

# Predict class
model.predict(new_observation)

array([1])

In [4]:
# Create Gaussian Naive Bayes object with prior probabilities of each class
clf = GaussianNB(priors=[0.25, 0.25, 0.5])

# Train the model that uses the specified priors
model = clf.fit(features, target)

If we leave the `priors` parameter at its default of `None`, the priors are estimated from the class frequencies in the training data.
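A minimal sketch of the difference, assuming the Iris data used above: the fitted `class_prior_` attribute shows which priors the model ended up with. The names `default_model` and `custom_model` are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
features, target = iris.data, iris.target

# No priors given: estimated from class frequencies (Iris has 50 of each class)
default_model = GaussianNB().fit(features, target)
print(default_model.class_prior_)

# Priors given explicitly: used as provided
custom_model = GaussianNB(priors=[0.25, 0.25, 0.5]).fit(features, target)
print(custom_model.class_prior_)
```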

Note that the raw predicted probabilities from Gaussian naive Bayes (returned by `predict_proba`) are not calibrated.

## Training a Classifier for Discrete and Count Features

Problem : Given discrete or count data, you need to train a naive Bayes classifier.

In [5]:
# Load libraries
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# Create text
text_data = np.array(['I love Brazil. Brazil!',
                      'Brazil is best',
                      'Germany beats both'])

# Create bag of words
count = CountVectorizer()
bag_of_words = count.fit_transform(text_data)

In [10]:
bag_of_words.data

array([2, 1, 1, 1, 1, 1, 1, 1], dtype=int64)

In [6]:
# Create feature matrix
features = bag_of_words.toarray()

# Create target vector
target = np.array([0,0,1])

In [7]:
features

array([[0, 0, 0, 2, 0, 0, 1],
       [0, 1, 0, 1, 0, 1, 0],
       [1, 0, 1, 0, 1, 0, 0]], dtype=int64)

In [8]:
target

array([0, 0, 1])

In [9]:
# Create multinomial naive Bayes object with prior probabilities of each class
classifer = MultinomialNB(class_prior=[0.25, 0.5])

# Train model
model = classifer.fit(features, target)

In [11]:
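# Show feature names (get_feature_names() was removed in newer scikit-learn
# releases; get_feature_names_out() returns an array, so wrap it in list())
list(count.get_feature_names_out())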

['beats', 'best', 'both', 'brazil', 'germany', 'is', 'love']

In [12]:
import pandas as pd

# View the bag of words as a labeled table
pd.DataFrame(features, columns=count.get_feature_names_out())

   beats  best  both  brazil  germany  is  love
0      0     0     0       2        0   0     1
1      0     1     0       1        0   1     0
2      1     0     1       0        1   0     0


One of the most common uses of multinomial naive Bayes is text classification using bag-of-words or tf-idf approaches.

Multinomial naive Bayes assumes that the features are multinomially distributed. In practice, this means that the classifier is commonly used when we have discrete data (e.g., movie ratings ranging from 1 to 5).
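The tf-idf variant can be sketched as follows, reusing the three example sentences. This is only an illustration: `TfidfVectorizer` replaces `CountVectorizer`, and the names `tfidf`, `features_tfidf`, and `model_tfidf` are introduced here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Same toy corpus and target as in the solution
text_data = np.array(['I love Brazil. Brazil!',
                      'Brazil is best',
                      'Germany beats both'])
target = np.array([0, 0, 1])

# Create tf-idf weighted features instead of raw counts
tfidf = TfidfVectorizer()
features_tfidf = tfidf.fit_transform(text_data)

# MultinomialNB accepts the sparse, non-negative tf-idf matrix directly
model_tfidf = MultinomialNB().fit(features_tfidf, target)

# New text must go through the same fitted vectorizer before prediction
prediction = model_tfidf.predict(tfidf.transform(['Brazil is best']))
print(prediction)
```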

In [13]:
# Create new observation
new_observation = [[0, 0, 0, 1, 0, 1, 0]]

# Predict new observation's class
model.predict(new_observation)

array([0])

MultinomialNB contains an additive smoothing hyperparameter, alpha, that should be tuned. The default value is 1.0, with 0.0 meaning no smoothing takes place.
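One way to tune `alpha` is a cross-validated grid search. The sketch below is an assumption-laden example rather than part of the original solution: the three-sentence corpus is too small for cross-validation, so it uses the digits dataset (non-negative pixel intensities), and the candidate values in `alphas` are arbitrary.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Digits features are non-negative integers, which MultinomialNB accepts
digits = load_digits()
features, target = digits.data, digits.target

# Candidate smoothing values (illustrative choices)
alphas = [0.01, 0.1, 1.0, 10.0]

# Evaluate each alpha with 5-fold cross-validation
search = GridSearchCV(MultinomialNB(), param_grid={"alpha": alphas}, cv=5)
search.fit(features, target)

# Best smoothing value and its mean cross-validated accuracy
print(search.best_params_, search.best_score_)
```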

## Training a Naive Bayes Classifier for Binary Features

Problem : You have binary feature data and need to train a naive Bayes classifier.

In [14]:
# Load libraries
from sklearn.naive_bayes import BernoulliNB

# Create three binary features
features = np.random.randint(2, size=(100, 3))

# Create a binary target vector
target = np.random.randint(2, size=(100, 1)).ravel()

In [16]:
features[:10]

array([[0, 0, 0],
       [0, 0, 1],
       [0, 0, 1],
       [1, 0, 0],
       [0, 0, 0],
       [0, 1, 1],
       [1, 1, 0],
       [1, 1, 0],
       [1, 0, 1],
       [0, 1, 0]])

In [17]:
target[:10]

array([1, 0, 1, 0, 1, 0, 0, 0, 0, 1])

In [18]:
# Create Bernoulli Naive Bayes object with prior probabilities of each class
classifer = BernoulliNB(class_prior=[0.25, 0.5])

# Train model
model = classifer.fit(features, target)

The Bernoulli naive Bayes classifier assumes that all our features are binary, such that they take only two values (e.g., 0 and 1).

Like `MultinomialNB`, `BernoulliNB` has an additive smoothing hyperparameter, `alpha`, that we will want to tune using model selection techniques.

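In [None]:
# Use a uniform prior: with class_prior=None, fit_prior=False gives
# every class the same prior probability
model_uniform_prior = BernoulliNB(class_prior=None, fit_prior=False)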
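A small check of how `fit_prior` affects the learned priors, assuming random binary data like the solution's but with a fixed seed for reproducibility. A uniform prior requires `fit_prior=False`; with `fit_prior=True` the priors follow the class frequencies. The names `rng`, `uniform`, and `fitted` are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Random binary features and target (seeded, unlike the solution)
rng = np.random.RandomState(0)
features = rng.randint(2, size=(100, 3))
target = rng.randint(2, size=100)

# fit_prior=False with class_prior=None: every class gets the same prior
model_uniform_prior = BernoulliNB(class_prior=None, fit_prior=False)
model_uniform_prior.fit(features, target)
uniform = np.exp(model_uniform_prior.class_log_prior_)
print(uniform)

# fit_prior=True (the default): priors are learned from class frequencies
model_fitted_prior = BernoulliNB(fit_prior=True).fit(features, target)
fitted = np.exp(model_fitted_prior.class_log_prior_)
print(fitted)
```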

## Calibrating Predicted Probabilities

Problem : You want to calibrate the predicted probabilities from naive Bayes classifiers so they are interpretable.

In [19]:
# Load libraries
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
from sklearn.calibration import CalibratedClassifierCV

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

In [25]:
features[:10]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1]])

In [24]:
features.shape

(150, 4)

In [26]:
np.unique(target)

array([0, 1, 2])

In [20]:
# Create Gaussian Naive Bayes object
classifer = GaussianNB()

# Create calibrated cross-validation with sigmoid calibration
classifer_sigmoid = CalibratedClassifierCV(classifer, cv=2, method='sigmoid')

# Calibrate probabilities
classifer_sigmoid.fit(features, target)

CalibratedClassifierCV(base_estimator=GaussianNB(priors=None, var_smoothing=1e-09),
            cv=2, method='sigmoid')

In [21]:
# Create new observation
new_observation = [[ 2.6,  2.6,  2.6,  0.4]]

# View calibrated probabilities
classifer_sigmoid.predict_proba(new_observation)

array([[0.31859969, 0.63663466, 0.04476565]])

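When predicting on a new observation, `predict_proba` returns the average of the predicted probabilities from the k calibrated classifiers, one per cross-validation fold.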
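The averaging can be checked by hand. This sketch assumes the default ensemble behavior of `CalibratedClassifierCV`, where the fitted object keeps one calibrated classifier per fold in its `calibrated_classifiers_` attribute. The names `calibrated`, `per_fold`, and `manual_average` are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
from sklearn.calibration import CalibratedClassifierCV

iris = datasets.load_iris()
features, target = iris.data, iris.target

# Same setup as the solution: 2 folds, sigmoid calibration
calibrated = CalibratedClassifierCV(GaussianNB(), cv=2, method='sigmoid')
calibrated.fit(features, target)

new_observation = np.array([[2.6, 2.6, 2.6, 0.4]])

# Probabilities from each fold's calibrated classifier
# (assumes one entry per fold in calibrated_classifiers_)
per_fold = [c.predict_proba(new_observation)
            for c in calibrated.calibrated_classifiers_]

# Averaging them should reproduce the ensemble's predict_proba
manual_average = np.mean(per_fold, axis=0)
print(np.allclose(manual_average, calibrated.predict_proba(new_observation)))
```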

In [32]:
# Mean accuracy of the calibrated classifier on the training data
# (score is a method and must be called with features and target)
classifer_sigmoid.score(features, target)

In [33]:
# Train a Gaussian naive Bayes then predict class probabilities
classifer.fit(features, target).predict_proba(new_observation)

array([[2.31548432e-04, 9.99768128e-01, 3.23532277e-07]])

In [34]:
# View calibrated probabilities
classifer_sigmoid.predict_proba(new_observation)

array([[0.31859969, 0.63663466, 0.04476565]])

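`CalibratedClassifierCV` offers two calibration methods through its `method` parameter: Platt's sigmoid model and isotonic regression. Because isotonic regression is nonparametric, it tends to overfit when sample sizes are very small (e.g., 100 observations). Our solution used the Iris dataset, which has only 150 observations, so we chose Platt's sigmoid model.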
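For comparison, both methods can be run side by side. This is only a sketch on the same Iris data; with so few observations the isotonic output should be read with caution. The `probabilities` dictionary is an illustrative name.

```python
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
from sklearn.calibration import CalibratedClassifierCV

iris = datasets.load_iris()
features, target = iris.data, iris.target
new_observation = [[2.6, 2.6, 2.6, 0.4]]

# Fit one calibrated classifier per method and collect the probabilities
probabilities = {}
for method in ('sigmoid', 'isotonic'):
    calibrated = CalibratedClassifierCV(GaussianNB(), cv=2, method=method)
    calibrated.fit(features, target)
    probabilities[method] = calibrated.predict_proba(new_observation)
    print(method, probabilities[method])
```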