## Chapter 18
---
# Naive Bayes

## 18.0 Introduction
Bayes' theorem is the premier method for understanding the probability of some event, $P(A|B)$, given some new information, $P(B|A)$, and a prior belief in the probability of the event, $P(A)$:
$$
P(A | B) = \frac{P(B|A)P(A)}{P(B)}
$$
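
As a quick numeric illustration of the formula, here is a minimal sketch. The prior, the likelihood, and the value of $P(B|\text{not } A)$ are made-up numbers chosen only for illustration; they do not come from any dataset in this chapter.

```python
# Hypothetical numbers chosen only to illustrate Bayes' theorem
p_a = 0.01              # prior, P(A)
p_b_given_a = 0.95      # likelihood, P(B|A)
p_b_given_not_a = 0.05  # P(B|not A), needed to compute the marginal

# Marginal P(B) via the law of total probability
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior P(A|B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))
```

Even with a strong likelihood, the small prior keeps the posterior modest, which is the kind of updating the theorem formalizes.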

The Bayesian method's popularity has skyrocketed in the last decade, increasingly rivaling traditional frequentist applications in academia, government, and business. In machine learning, one application of Bayes' theorem to classification comes in the form of the naive Bayes classifier. Naive Bayes classifiers combine a number of desirable qualities in practical machine learning into a single classifier:

1. An intuitive approach
2. The ability to work with small data
3. Low computation costs for training and prediction
4. Often solid results in a variety of settings

Specifically, a naive Bayes classifier is based on:
$$
P(y | x_1, ..., x_j) = \frac{P(x_1, ..., x_j | y)P(y)}{P(x_1,...,x_j)}
$$
where:
* $P(y | x_1, ..., x_j)$ is called the *posterior* and is the probability that an observation is class $y$ given the observation's values for the $j$ features, $x_1, ..., x_j$
* $P(x_1, ..., x_j | y)$ is called the *likelihood* and is the likelihood of an observation's values for features $x_1, ..., x_j$ given its class, $y$
* $P(y)$ is called the *prior* and is our belief for the probability of class $y$ before looking at the data
* $P(x_1, ..., x_j)$ is called the *marginal probability*

In naive Bayes, we compare an observation's posterior values for each possible class. Specifically, because the marginal probability is constant across these comparisons, we compare the numerators of the posterior for each class. For each observation, the class with the greatest posterior numerator becomes the predicted class, $\hat y$.
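
The comparison of posterior numerators can be sketched as follows. The priors and likelihoods are hypothetical values for a single observation and three classes, picked only to show the mechanics.

```python
import numpy as np

# Hypothetical values for one observation and three classes
priors = np.array([0.5, 0.3, 0.2])          # P(y) for each class
likelihoods = np.array([0.02, 0.10, 0.05])  # P(x_1, ..., x_j | y) for each class

# Posterior numerators: likelihood times prior
numerators = likelihoods * priors

# Dividing by the marginal rescales every class by the same constant,
# so it cannot change which class is largest
posteriors = numerators / numerators.sum()

# The predicted class is the one with the greatest numerator
predicted_class = int(np.argmax(numerators))
print(predicted_class, posteriors)
```

Because the normalization step never changes the ranking, a classifier only needs the numerators to make a prediction.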

There are two important things to note about naive Bayes classifiers.

1. For each feature in the data, we have to assume the statistical distribution of the likelihood, $P(x_j | y)$.
   - The common distributions are the normal (Gaussian), multinomial, and Bernoulli distributions.
   - The distribution chosen is often determined by the nature of the features (continuous, binary, etc.).

2. Naive Bayes gets its name because we assume that each feature, and its resulting likelihood, is independent of the others given the class. This "naive" assumption is frequently wrong, yet in practice does little to prevent building high-quality classifiers.
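
Written out in the notation used above, the independence assumption lets the joint likelihood factor into a product of per-feature likelihoods. This is the standard formulation, shown here as a sketch rather than a derivation:
$$
P(x_1, ..., x_j | y) = \prod_{i=1}^{j} P(x_i | y)
$$
so the predicted class is the one that maximizes the posterior numerator:
$$
\hat y = \underset{y}{\operatorname{argmax}} \; P(y) \prod_{i=1}^{j} P(x_i | y)
$$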

## 18.1 Training a Classifier for Continuous Features

In [3]:
# Load libraries
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create Gaussian naive Bayes object
classifier = GaussianNB()

# Train model
model = classifier.fit(features, target)

# Create a new observation and predict its class
new_observation = [[4, 4, 4, 0.4]]
model.predict(new_observation)

array([1])

### Discussion
The most common type of naive Bayes classifier is Gaussian naive Bayes. In Gaussian naive Bayes, we assume that the likelihood of the feature values, $x$, given that an observation is of class $y$, follows a normal distribution:
$$
p(x_j | y) = \frac{1}{\sqrt{2\pi \sigma_y^2}} e^{-\frac{(x_j - \mu_y)^2}{2\sigma_y^2}}
$$
where $\sigma_y^2$ and $\mu_y$ are the variance and mean values of feature $x_j$ for class $y$. Because of the assumption of the normal distribution, Gaussian naive Bayes is best used in cases when all our features are continuous.
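
To make the formula concrete, the following sketch recomputes the Gaussian naive Bayes decision rule by hand on the iris data and compares it with `GaussianNB`. It works in log space for numerical stability. The helper name `log_posterior_numerators` is made up for this example, and the small variance adjustment is an assumption meant to mirror scikit-learn's default `var_smoothing` behavior.

```python
import numpy as np
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
features, target = iris.data, iris.target
classes = np.unique(target)

# Per-class mean and variance of every feature (rows: classes, columns: features)
means = np.array([features[target == c].mean(axis=0) for c in classes])
variances = np.array([features[target == c].var(axis=0) for c in classes])

# Assumption: mimic scikit-learn's default variance smoothing (var_smoothing=1e-9)
variances += 1e-9 * features.var(axis=0).max()

# Class priors estimated from class frequencies
priors = np.array([(target == c).mean() for c in classes])

def log_posterior_numerators(x):
    """Return log P(y) + sum_j log p(x_j | y) for every class."""
    log_likelihood = -0.5 * np.sum(
        np.log(2 * np.pi * variances) + (x - means) ** 2 / variances, axis=1
    )
    return np.log(priors) + log_likelihood

# Predict the class with the greatest log posterior numerator
manual_predictions = np.array(
    [classes[np.argmax(log_posterior_numerators(x))] for x in features]
)

# Compare with scikit-learn's implementation
model = GaussianNB().fit(features, target)
print((manual_predictions == model.predict(features)).mean())
```

The printed value is the fraction of observations on which the manual rule and the library agree.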

One of the interesting aspects of naive Bayes classifiers is that they allow us to assign a prior belief over the respective target classes. We can do this using `GaussianNB`'s `priors` parameter, which takes a list of the probabilities assigned to each class of the target vector:

In [4]:
# Create Gaussian naive Bayes object with prior probabilities of each class
classifier = GaussianNB(priors=[0.25, 0.25, 0.5])

# Train model
model = classifier.fit(features, target)
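
As a small check of what the `priors` parameter does, this sketch fits one model with and one without explicit priors and inspects the fitted `class_prior_` attribute. The variable names `default_model` and `prior_model` are made up for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
features, target = iris.data, iris.target

# Without priors, class probabilities are estimated from the training data
default_model = GaussianNB().fit(features, target)

# With priors, the supplied values are used instead (they must sum to 1)
prior_model = GaussianNB(priors=[0.25, 0.25, 0.5]).fit(features, target)

print(default_model.class_prior_)
print(prior_model.class_prior_)
```

Since iris has 50 observations per class, the data-derived priors are equal across classes, while the second model keeps the values we supplied.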

### See Also
* How the Naive Bayes Classifier Works (http://dataaspirant.com/2017/02/06/naive-bayes-classifier-machine-learning/)

## 18.2 Training a Classifier for Discrete and Count Features


In [6]:
# Load libraries
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# Create text
text_data = np.array(['I love Brazil. Brazil!', 'Brazil is best', 'Germany beats both'])

# Create bag of words
count = CountVectorizer()
bag_of_words = count.fit_transform(text_data)

# Create feature matrix
features = bag_of_words.toarray()

# Create target vector
target = np.array([0, 0, 1])

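# Create multinomial naive Bayes object with prior probabilities of each class
# (priors are probabilities, so they should sum to 1)
classifier = MultinomialNB(class_prior=[0.25, 0.75])
model = classifier.fit(features, target)

# Create a new observation and predict its class
new_observation = [[0, 0, 0, 1, 0, 1, 0]]
model.predict(new_observation)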

array([0])

### Discussion

Multinomial naive Bayes works similarly to Gaussian naive Bayes, but the features are assumed to be multinomially distributed. In practice, this means that this classifier is commonly used when we have discrete data, such as counts. One of the most common uses is text classification using bag-of-words or tf-idf approaches.
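
Rather than writing a count vector by hand, a new document can be passed through the fitted vectorizer so that its columns line up with the training vocabulary. The sketch below reuses the toy text from the solution; the new document `'Brazil is great'` is a hypothetical example, and default class priors are used to keep it short.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

text_data = np.array(['I love Brazil. Brazil!', 'Brazil is best', 'Germany beats both'])
target = np.array([0, 0, 1])

# Learn the vocabulary and build the count matrix
count = CountVectorizer()
features = count.fit_transform(text_data)

# Train model (sparse input is accepted directly)
model = MultinomialNB().fit(features, target)

# Hypothetical new document; transform() reuses the fitted vocabulary,
# and words that were never seen during fitting are ignored
new_text = ['Brazil is great']
new_features = count.transform(new_text)

# Word-to-column mapping learned during fitting, ordered by column index
print(sorted(count.vocabulary_.items(), key=lambda item: item[1]))
print(new_features.toarray())
print(model.predict(new_features))
```

Printing the vocabulary makes it clear which column each word occupies, which is what a hand-built observation vector has to match.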

## 18.3 Training a Naive Bayes Classifier for Binary Features

In [7]:
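# Load libraries
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Create three binary features and a binary target vector
features = np.random.randint(2, size=(100, 3))
target = np.random.randint(2, size=(100, 1)).ravel()

# Create Bernoulli naive Bayes object with prior probabilities of each class
# (priors are probabilities, so they should sum to 1)
classifier = BernoulliNB(class_prior=[0.25, 0.75])

# Train model
model = classifier.fit(features, target)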

### Discussion

The Bernoulli naive Bayes classifier assumes that all our features are binary, such that they take only two values (e.g., a nominal categorical feature that has been one-hot encoded). Like its multinomial cousin, Bernoulli naive Bayes is often used in text classification, when our feature matrix is simply the presence or absence of a word in a document.
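
For the text classification case, one way to build a presence/absence feature matrix is `CountVectorizer` with `binary=True`. The sketch below borrows the toy text from recipe 18.2 and uses default class priors; it is an illustration under those assumptions, not the only way to binarize features.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

text_data = np.array(['I love Brazil. Brazil!', 'Brazil is best', 'Germany beats both'])
target = np.array([0, 0, 1])

# binary=True records only whether a word occurs, not how many times
vectorizer = CountVectorizer(binary=True)
features = vectorizer.fit_transform(text_data)

# Train model on the presence/absence matrix
model = BernoulliNB().fit(features, target)

# "Brazil" appears twice in the first document, yet the largest entry is 1
print(features.toarray().max())
print(model.predict(vectorizer.transform(['Brazil is best'])))
```

Unlike the multinomial model, the Bernoulli model also treats the absence of a word as evidence when scoring each class.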